Dynamic Feedback: An Effective Technique for Adaptive Computing. Pedro Diniz and Martin Rinard, Department of Computer Science, University of California, Santa Barbara. http://www.cs.ucsb.edu/~{pedro,martin}
Basic Issue: Efficient Implementation of Atomic Operations in Object-Based Languages Approach: Reduce Lock Overhead by Coarsening Lock Granularity Problem: Coarsening Lock Granularity May Reduce Available Concurrency
Solution: Dynamic Feedback • Multiple Lock Coarsening Policies • Dynamic Feedback • Generate Multiple Versions of Code • Measure Dynamic Overhead of Each Policy • Dynamically Select Best Version • Context • Parallelizing Compiler • Irregular Object-Based Programs • Pointer-Based Data Structures • Commutativity Analysis
Talk Outline • Lock Coarsening • Dynamic Feedback • Experimental Results • Related Work • Conclusions
Model of Computation • Parallel Programs • Serial Phases • Parallel Phases • Atomic Operations on Shared Objects • Mutual Exclusion Locks • Acquire Constructs • Release Constructs [Figure: execution alternates Serial Phase, Parallel Phase, Serial Phase; each atomic operation brackets a Mutual Exclusion Region with L.acquire() and L.release()]
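A minimal C++ sketch of one atomic operation on a shared object, to make the acquire/release model concrete. The class name `Body`, its field, and the choice of one lock per object are illustrative assumptions, not details taken from the slides.

```cpp
#include <mutex>

// Hypothetical shared object; the name and field are illustrative only.
class Body {
public:
    // Atomic operation: the update runs inside a mutual exclusion region.
    void addForce(double f) {
        lock_.lock();    // L.acquire()
        force_ += f;     // mutual exclusion region
        lock_.unlock();  // L.release()
    }

    // Reads are also performed under the lock.
    double force() {
        std::lock_guard<std::mutex> guard(lock_);
        return force_;
    }

private:
    std::mutex lock_;    // assumption: one mutual exclusion lock per object
    double force_ = 0.0;
};
```

Every call to `addForce` pays for one acquire and one release, which is the overhead the following slides set out to reduce.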
Problem: Lock Overhead [Figure: consecutive operations, each bracketed by its own L.acquire() / L.release() pair]
Solution: Lock Coarsening [Figure: Original: repeated L.acquire() / L.release() pairs on the same lock. After Lock Coarsening: a single L.acquire() / L.release() pair around the whole sequence.] Reference: Diniz and Rinard, “Synchronization Transformations for Parallel Computing”, POPL97
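The transformation can be sketched in C++ as two versions of the same loop. `CountingLock` and `Accumulator` are hypothetical helpers introduced here only to make the number of executed acquires visible; they are not part of the compiler described in the talk.

```cpp
#include <mutex>
#include <vector>

// Hypothetical lock wrapper that counts executed acquire constructs.
struct CountingLock {
    std::mutex m;
    long acquires = 0;
    void acquire() { m.lock(); ++acquires; }  // counter updated while holding the lock
    void release() { m.unlock(); }
};

// Hypothetical shared object protected by lock L.
struct Accumulator {
    CountingLock L;
    double sum = 0.0;
};

// Original: one acquire/release pair per atomic operation.
void addAllOriginal(Accumulator& a, const std::vector<double>& xs) {
    for (double x : xs) {
        a.L.acquire();
        a.sum += x;
        a.L.release();
    }
}

// After lock coarsening: one pair around the whole sequence.
void addAllCoarsened(Accumulator& a, const std::vector<double>& xs) {
    a.L.acquire();
    for (double x : xs) {
        a.sum += x;
    }
    a.L.release();
}
```

Both versions compute the same sum; the coarsened version executes one acquire instead of one per element, but it also holds the lock for the entire loop.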
Lock Coarsening Trade-Off • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Exclusion • Multiple Processors Attempt to Acquire Same Lock • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
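False Exclusion [Figure: Original vs. After Lock Coarsening. In the original, each L.acquire() / L.release() pair encloses a single operation. After coarsening, one pair spans the whole sequence, and a second processor's L.acquire() waits while the holder executes code that was originally outside any mutual exclusion region (False Exclusion).]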
Lock Coarsening Policy Goal: Limit Potential Severity of False Exclusion Mechanism: Multiple Lock Coarsening Policies • Original: Never Coarsen Granularity • Bounded: Coarsen Granularity Only Within Cycle-Free Subgraphs of ICFG • Aggressive: Always Coarsen Granularity
Choosing Best Policy • Best Lock Coarsening Policy May Depend On • Topology of Data Structures • Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable at Compile Time • Complications • Different Phases May Have Different Best Policy • In Same Phase, Best Policy May Change Over Time
Solution: Dynamic Feedback • Generated Code Executes • Sampling Phases: Measure Performance of Different Policies • Production Phases: Use Best Policy From Sampling Phase • Periodically Resample to Discover Best Policy Changes [Figure: Code Version and Overhead vs. Time. A Sampling Phase runs Original, Bounded, and Aggressive in turn; the Production Phase runs Aggressive; the next Sampling Phase starts again with Original.]
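The alternation of sampling and production phases can be sketched as a small control loop. This is an illustration under stated assumptions, not the generated code from the talk: the `Version` callable interface (run for an interval, return the measured overhead) and the function names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical interface: a code version runs for `seconds` and returns the
// overhead it measured over that interval, as a fraction in [0, 1].
using Version = std::function<double(double seconds)>;

// Sampling phase: run every version for one sampling interval and return
// the index of the version with the lowest measured overhead.
std::size_t samplePhase(const std::vector<Version>& versions,
                        double samplingInterval) {
    std::size_t best = 0;
    double bestOverhead = versions.at(0)(samplingInterval);
    for (std::size_t i = 1; i < versions.size(); ++i) {
        double overhead = versions[i](samplingInterval);
        if (overhead < bestOverhead) {
            bestOverhead = overhead;
            best = i;
        }
    }
    return best;
}

// Alternate sampling and production phases for a fixed number of rounds.
// Returns the version selected in each round.
std::vector<std::size_t> dynamicFeedback(const std::vector<Version>& versions,
                                         double samplingInterval,
                                         double productionInterval,
                                         int rounds) {
    std::vector<std::size_t> chosen;
    for (int r = 0; r < rounds; ++r) {
        std::size_t best = samplePhase(versions, samplingInterval);
        versions[best](productionInterval);  // production phase
        chosen.push_back(best);
    }
    return chosen;
}
```

Because each round starts with a fresh sampling phase, a version whose overhead rises over time is replaced at the next resampling point rather than kept indefinitely.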
Guaranteed Performance Bounds • Assumptions: • Overhead Changes Bounded by Exponential Decay Functions • Worst Case Scenario: • No Useful Work During Sampling Phase • Sampled Overheads Are Same For All Versions • Overhead of Selected Version Increases at Maximum Rate • Overhead of Other Versions Decreases at Maximum Rate [Figure: Overhead vs. Time, with sampling intervals (S, S, S) followed by a production interval (P); the overhead axis is labeled V0]
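Guaranteed Performance Bound
Definition 1. Policy $p_i$ is at most $\epsilon$ worse than policy $p_j$ over a time interval $T$ if
$$\mathit{Work}_j^{T} - \mathit{Work}_i^{T} \le \epsilon T \quad \text{where} \quad \mathit{Work}_i^{T} = \int_0^{T} (1 - o_i(t))\,dt$$
Definition 2. Dynamic feedback is at most $\epsilon$ worse than the optimal if
$$\mathit{Work}_{opt}^{P+SN} - \mathit{Work}_{0}^{P+SN} \le \epsilon\,(P + SN)$$
where the work terms are taken over an interval of length $P + SN$ [defining integral not recoverable from the source; its integrand is $(1 - o_1(t))$].
Result 1. To guarantee this bound:
$$(1 - \epsilon)\,P + \frac{1}{\lambda}\,e^{-\lambda P} \le (\epsilon - 1)\,SN + \frac{1}{\lambda}$$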
Guaranteed Performance Bounds [Figure: Constraint Values vs. Production Interval P, plotting $(1 - \epsilon)\,P + \frac{1}{\lambda}\,e^{-\lambda P}$ against $(\epsilon - 1)\,SN + \frac{1}{\lambda}$, with the Feasible Region marked] • Production Interval Too Short: Unable to Amortize Sampling Overhead • Production Interval Too Long: May Execute Suboptimal Policy for Long Time • Basic Constraint: Decay Rate ($\lambda$) Must be Small Enough
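The feasible region can be explored numerically. The sketch below assumes the constraint reads (1 - ε)P + (1/λ)e^(-λP) ≤ (ε - 1)SN + 1/λ, with ε the allowed performance loss, λ the decay rate, S the sampling interval, N the number of sampled versions, and P the production interval. The Greek symbols are missing from the extracted slide, so this reading and the parameter names are assumptions.

```cpp
#include <cmath>

// Returns true if production interval P satisfies the assumed constraint
//   (1 - eps) * P + (1 / lambda) * exp(-lambda * P)
//       <= (eps - 1) * S * N + 1 / lambda
// eps:    allowed fraction of work lost relative to the optimal
// lambda: decay rate bounding how fast overheads can change
// S:      sampling interval, N: number of sampled versions
bool productionIntervalFeasible(double eps, double lambda,
                                double S, int N, double P) {
    double lhs = (1.0 - eps) * P + (1.0 / lambda) * std::exp(-lambda * P);
    double rhs = (eps - 1.0) * S * N + 1.0 / lambda;
    return lhs <= rhs;
}
```

With eps = 0.1, lambda = 0.01, S = 0.01 and N = 3, a production interval of 10 satisfies the constraint, while 0.01 (too short to amortize sampling) and 100 (too long at this decay rate) do not.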
Dynamic Feedback: Implementation • Code Generation • Measuring Policy Overhead • Interval Selection • Interval Expiration • Policy Switch
Code Generation • Statically Generate Different Code Versions for Each Policy • Alternative: Dynamic Code Generation • Advantages of Static Code Generation: • Simplicity of Implementation • Fast Policy Switching • Potential Drawback of Static Code Generation • Code Size (In Practice Not a Problem)
Measuring Policy Overhead • Sources of Overhead • Locking Overhead • Waiting Overhead • Compute Locking Overhead • Count Number of Executed Acquire/Release Constructs • Estimate Waiting Overhead • Count Number of Spins on Locks Waiting to be Released
$$\text{Sampled Overhead} = \frac{(\text{Number of Acquire/Release} \times \text{Acquire/Release Execution Time}) + (\text{Number of Spins} \times \text{Spin Time})}{\text{Sampling Time}}$$
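The overhead computation from the two counters can be written as a short function. The struct and parameter names are hypothetical; the per-construct execution time and the spin time are assumed to be constants measured separately, in the same time unit as the sampling time.

```cpp
// Hypothetical counters collected during one sampling interval.
struct OverheadCounters {
    long acquireReleasePairs;  // executed acquire/release constructs
    long spins;                // spins on locks waiting to be released
};

// Sampled overhead = (locking time + waiting time) / sampling time.
double sampledOverhead(const OverheadCounters& c,
                       double acquireReleaseTime,  // time per acquire/release
                       double spinTime,            // time per spin
                       double samplingTime) {
    double lockingTime = c.acquireReleasePairs * acquireReleaseTime;
    double waitingTime = c.spins * spinTime;
    return (lockingTime + waitingTime) / samplingTime;
}
```

The result is the fraction of the sampling interval spent on synchronization, so lower values indicate a better policy for the current phase.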
Interval Selection and Expiration • Fixed Interval Values • Sampling Interval: 10 milliseconds • Production Interval: 10 seconds • Good Results for Wide Range of Interval Values • Polling Code for Expiration Detection • Location: Back Edges of Parallel Loop • Advantage: Low Overhead • Disadvantage: Potential Interaction with Iteration Size [Figure: parallel loop iterations containing Atomic Operations, with Polling Points marked]
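Polling on the back edge of a loop can be sketched as follows. The function name and signature are hypothetical, and `std::chrono::steady_clock` stands in for whatever timer the generated code actually reads.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

using Clock = std::chrono::steady_clock;

// Runs iterations [next, end) until the current interval expires.
// The timer is polled once per iteration, on the loop back edge, so an
// expiration is only noticed at an iteration boundary.
// Returns the index of the first iteration that was not executed.
std::size_t runUntilExpired(std::size_t next, std::size_t end,
                            Clock::time_point deadline,
                            const std::function<void(std::size_t)>& body) {
    while (next < end) {
        body(next);  // loop body; may contain atomic operations
        ++next;
        if (Clock::now() >= deadline) {
            break;   // polling point: interval expired
        }
    }
    return next;
}
```

Because the check happens only between iterations, a long-running iteration delays detection of the expiration, which is one way to read the interaction with iteration size noted above.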
Policy Switch • Synchronous • Processors Poll Timer to Detect Interval Expiration • Barrier At End of Each Interval • Advantages: • Consistent Transitions • Clean Overhead Measurements • Disadvantages: • Need to Synchronize All Processors • Potential Idle Time At Barrier
Experimental Results • Parallelizing Compiler Based on Commutativity Analysis [PLDI’96] • Set of Complete Scientific Applications • Barnes-Hut N-Body Solver (1500 lines of C++) • Liquid Water Simulation Code (1850 lines of C++) • Seismic Modeling String Code (2050 lines of C++) • Different Lock Coarsening Policies • Dynamic Feedback • Performance on Stanford DASH Multiprocessor
Code Sizes [Figure: bar charts of text segment size (Kbytes, axis 0 to 60) for the Serial, Original, and Dynamic versions of Barnes-Hut, Water, and String]
Lock Overhead: Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks [Figure: bar charts of percentage lock overhead (axis 0 to 60). Barnes-Hut (16K Particles): Original, Bounded, Aggressive. Water (512 Molecules): Original, Bounded, Aggressive. String (Big Well Model): Original, Aggressive.]
Contention Overhead: Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Figure: contention percentage (0 to 100) vs. number of processors (0 to 16) for the Aggressive, Bounded, and Original policies; panels: Barnes-Hut (16K Particles), Water (512 Molecules), String (Big Well Model)]
Performance Results: Barnes-Hut [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Barnes-Hut on DASH (16K Particles); curves: Ideal, Aggressive, Dynamic Feedback, Bounded, Original]
Performance Results: Water [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Water on DASH (512 Molecules); curves: Ideal, Bounded, Dynamic Feedback, Original, Aggressive]
Performance Results: String [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for String on DASH (Big Well Model); curves: Ideal, Original, Dynamic Feedback, Aggressive]
Summary • Code Size Is Not An Issue • Lock Coarsening Has Significant Performance Impact • Best Lock Coarsening Policy Varies With Application • Dynamic Feedback Delivers Code With Performance Comparable to The Best Static Lock Coarsening Policy
Related Work • Adaptive Execution Techniques (Saavedra Park:PACT96) • Dynamic Dispatch Optimizations (Hölzle Ungar:PLDI94) • Dynamic Code Generation (Engler:PLDI96) • Profiling (Brewer:PPoPP95) • Synchronization Optimizations (Plevyak et al:POPL95)
Conclusions • Dynamic Feedback • Generated Code Adapts to Different Execution Environments • Integration with Parallelizing Compiler • Irregular Object-Based Programs • Pointer-Based Linked Data Structures • Commutativity Analysis • Evaluation with Three Complete Applications • Performance Comparable to Best Hand-Tuned Optimization
Performance Results: Barnes-Hut [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Barnes-Hut (16K Particles); curves: Ideal, Aggressive, Bounded, Original]
Performance Results: Water [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Water (512 Molecules); curves: Ideal, Bounded, Original, Aggressive]
Performance Results: String [Figure: speedup (0 to 16) vs. number of processors (0 to 16) for String (Big Well Model); curves: Ideal, Original, Aggressive]
Policy Switch [Figure: timeline showing Policy 1 and Policy 2, with a transition each time the timer expires]
Motivation Challenges: • Match Best Implementation to Environment • Heterogeneous and Mobile Systems Goal: • Develop Mechanisms to Support Code that Adapts to Environment Characteristics Technique: • Dynamic Feedback
Overhead for Barnes-Hut [Figure: sampled overhead (0 to 0.5) vs. execution time (0 to 25 seconds) for the Original, Bounded, and Aggressive policies; Barnes-Hut on DASH (8 Processors), FORCES Loop, Data Set: 16K Particles]
Overhead for Water [Figure: sampled overhead (0 to 0.5) vs. execution time (0 to 60 seconds) for the Original and Bounded policies; Water on DASH (8 Processors), INTERF Loop, Data Set: 512 Molecules]
Overhead for Water [Figure: sampled overhead (0 to 1) vs. execution time (0 to 60 seconds) for the Aggressive and Original policies; Water on DASH (8 Processors), POTENG Loop, Data Set: 512 Molecules]
Overhead for String [Figure: sampled overhead (0 to 1) vs. execution time (0 to 500 seconds) for the Aggressive and Original policies; String on DASH (8 Processors), PROJFWD Loop, Data Set: Big Well]
Dynamic Feedback [Figure: Code Version and Overhead vs. Time across a Sampling Phase, a Production Phase, and the next Sampling Phase; code versions shown: Aggressive, Bounded, Original, Aggressive]