
Beyond Auto-Parallelization: Compilers for Many-Core Systems



  1. Beyond Auto-Parallelization: Compilers for Many-Core Systems
     Marcelo Cintra, University of Edinburgh
     http://www.homepages.inf.ed.ac.uk/mc
     Moore for Less Keynote, September 2008

  2. Compilers for Parallel Computers (Today)
     • Auto-parallelizing compilers
       • "Holy grail": convert sequential programs into parallel programs with little or no user intervention
       • Only partial success, despite decades of work
       • No performance debugging tools
     • Compilers for explicitly parallel languages/annotations (e.g., OpenMP, Java Threads; see the sketch below)
       • Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
       • Secondary goal: perform simple optimizations specific to parallel execution
       • Simple correctness and performance debugging tools
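
To make the explicit model concrete, here is a minimal OpenMP example in C (ours, not from the slides): the programmer asserts that the iterations are independent, and the compiler's job is reduced to mapping them onto threads.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], b[N];
        for (int i = 0; i < N; i++)
            b[i] = (double)i;

        /* The annotation asserts the iterations are independent; the
           compiler only maps them onto hardware/OS threads
           (compile with -fopenmp under GCC). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }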

  3. Compilers for Parallel Computers (Future)
     • Data flow/dependence analysis tools – unsafe/speculative
       • Probabilistic approaches
       • Profile-based approaches
     • A multithreading-specific optimization toolbox
       • Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))
     • Auto-parallelizing compilers – with speculation
       • Thread-level speculation (TLS)
       • Helper threads
     → A holistic parallelizing tool chain

  4. Why Be Speculative?
     • The performance of programs is ultimately limited by their control and data flows
     • Most compiler optimizations exploit knowledge of control and data flows
     • Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
       • True for both sequential and parallel optimizations
     → Future compiler optimizations must rely on incomplete knowledge: speculative execution

  5. Compilers for Parallel Computers (Future)
     [Diagram of the envisioned tool chain, connecting: a dependence/flow analysis tool (unsafe), a parallelizing compiler taking sequential code to P-way parallel code, and auto-TLS compilers targeting TLS, TM, and <P-way parallel execution]

  6. Outline
     • Context and Motivation
       • History and status quo of auto-parallelizing compilers
       • Data dependence analysis for array-based programs
       • Data dependence analysis for irregular programs
     • Auto-parallelizing compilers for TLS
       • TLS execution model (speculative parallelization)
       • Static compiler cost model (PACT'04, TACO'07)

  7. Data Dependence Analysis for Arrays
     • Based on the mathematical evaluation of array index expressions within loop nests
     • Progressively more capable analyses (e.g., the GCD test, the Banerjee test), but still restricted to affine loop index expressions (see the sketch below)
     • Coupled with a mathematical framework for representing loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
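
As an illustration, a minimal C sketch of the classic GCD test mentioned above: for a write through the affine subscript a*i + b and a read through c*i + d of the same array inside a loop, a cross-iteration dependence can exist only if gcd(a, c) divides (d - b). (The function names are ours.)

    #include <stdbool.h>
    #include <stdlib.h>

    /* Greatest common divisor by Euclid's algorithm. */
    static int gcd(int x, int y) {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return abs(x);
    }

    /* GCD dependence test for a loop containing
           A[a*i + b] = ...;   ... = A[c*i + d];
       The equation a*i1 + b = c*i2 + d has an integer solution only
       if gcd(a, c) divides (d - b). The test is conservative: true
       means "a dependence cannot be ruled out". */
    static bool gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0)                 /* both subscripts are constant */
            return b == d;
        return (d - b) % g == 0;
    }

Note that the GCD test ignores the loop bounds; more capable tests such as Banerjee's tighten the answer by taking the bounds into account.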

  8. Data Dependence Analysis for Arrays
     • What's wrong with traditional data dependence analysis?
       • Not all index expressions are affine, or even statically defined (e.g., subscripted subscripts; see the example below)
       • Not all loops are well structured (e.g., conditional exits, complex control flow)
       • Not all procedures are analyzable (e.g., unavailable code, aliasing, global data accesses)
       • Not all applications make intense use of arrays and loop nests (e.g., they are built on trees, hash tables, linked lists, etc.)
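
A concrete (hypothetical) example of the subscripted-subscripts problem: the dependences of the loop below are decided by the runtime contents of the index arrays K and L, which no static affine test can see.

    /* Subscripted subscripts: whether iteration i depends on an
       earlier iteration j hinges on whether K[j] == L[i] or
       K[j] == K[i] at runtime, invisible to static analysis. */
    void update(double *A, const int *K, const int *L, int n) {
        for (int i = 0; i < n; i++)
            A[K[i]] = A[L[i]] + 1.0;
    }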

  9. Data Dependence Analysis for Irregular Programs
     • Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)
     → There is no comprehensive data dependence analysis framework for irregular applications
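
For example (a hypothetical sketch), parallelizing the list traversal below requires proving that the list is acyclic and that no two nodes alias, which is a job for pointer/shape analysis rather than for array subscript tests:

    /* Pointer-chasing loop: the iterations are independent only if
       the list is acyclic and its nodes do not alias. */
    struct node { double val; struct node *next; };

    void scale(struct node *head, double f) {
        for (struct node *p = head; p != NULL; p = p->next)
            p->val *= f;
    }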

  10. Outline
     • Context and Motivation
       • History and status quo of auto-parallelizing compilers
       • Data dependence analysis for array-based programs
       • Data dependence analysis for irregular programs
     • Auto-parallelizing compilers for TLS
       • TLS execution model (speculative parallelization)
       • Static compiler cost model (PACT'04, TACO'07)

  11. Thread-Level Speculation (TLS)
     • Assume no dependences and execute threads in parallel
     • While speculating, buffer speculative data separately
     • Track data accesses and monitor for cross-thread (RAW) violations
     • Squash offending threads and restart them
     • All of this can be done in hardware, software, or a combination of both

     Example:

         for (i = 0; i < 100; i++) {
             … = A[L[i]] + …
             A[K[i]] = …
         }

         Iteration J:    … = A[4] + …;   A[5] = …
         Iteration J+1:  … = A[2] + …;   A[2] = …
         Iteration J+2:  … = A[5] + …;   A[6] = …
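
A minimal software sketch (ours; real TLS systems track cache lines or versioned memory, in hardware or software) of the violation check: each speculative thread records its read and write sets, and when an older thread's writes intersect a younger thread's reads, the younger thread has consumed stale data and must be squashed.

    #include <stdbool.h>
    #include <stdio.h>

    #define N 1024      /* size of the speculatively accessed array */
    #define THREADS 4   /* speculative threads in flight */

    /* Per-thread read/write sets, one flag per array element. */
    static bool rd[THREADS][N], wr[THREADS][N];

    static void track_read (int t, int i) { rd[t][i] = true; }
    static void track_write(int t, int i) { wr[t][i] = true; }

    /* RAW violation check: a younger thread that read a location an
       older thread wrote must be squashed and restarted. */
    static bool must_squash(int older, int younger) {
        for (int i = 0; i < N; i++)
            if (wr[older][i] && rd[younger][i])
                return true;
        return false;
    }

    int main(void) {
        /* Replaying the slide's example with threads 0..2 running
           iterations J, J+1, J+2. */
        track_read(0, 4); track_write(0, 5);   /* iteration J   */
        track_read(1, 2); track_write(1, 2);   /* iteration J+1 */
        track_read(2, 5); track_write(2, 6);   /* iteration J+2 */

        /* Iteration J+2 read A[5], which iteration J wrote: squash. */
        printf("squash J+2? %s\n", must_squash(0, 2) ? "yes" : "no");
        return 0;
    }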

  12. TLS Overheads
     • Squash & restart: re-executing the squashed threads
     • Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative
     • Dispatch & commit: writing speculative data back to memory and starting the next speculative thread
     • Load imbalance: a processor waits for its thread to become non-speculative before it can commit

  13. Coping with Overheads: a Cost Model!
     • Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS
       • Speculative parallelization can deliver significant speedup or slowdown
       • There are several speculation overheads
       • The overheads are hard to estimate (e.g., squashes)
     • A prediction of the actual speedup value can be useful
       • e.g., in a multi-tasking environment:
         • program A wants to run speculatively in parallel on 4 cores (predicted speedup: 1.8)
         • other programs are waiting to be scheduled
         • the OS decides the speculative run does not pay off

  14. TLS Overheads
     • Squash & restart: re-executing the squashed threads
       • Hard to estimate because violations are highly unpredictable
     • Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative
       • Hard to estimate because write-sets are somewhat unpredictable
     • Dispatch & commit: writing speculative data back to memory and starting the next speculative thread
       • Hard to estimate because write-sets are somewhat unpredictable
     • Load imbalance: a processor waits for its thread to become non-speculative before it can commit
       • Hard to estimate because workloads are very unpredictable, and order matters due to the in-order commit requirement

  15. Our Compiler Cost Model: Highlights
     • The first fully static compiler cost model for TLS
     • Handles all TLS overheads in a single framework
       • Including load imbalance, which no other cost model handles
     • Produces not just a qualitative ("good" or "bad") assessment of the TLS benefits but a quantitative value (i.e., the expected speedup/slowdown); the sketch below shows the general shape of such an estimate
     • Can be easily integrated into most compilers at the intermediate-representation level
     • Simple and fast to compute
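
The published model (PACT'04, TACO'07) is not reproduced on the slides; the C sketch below only illustrates the shape of a quantitative TLS speedup estimate, with all parameter names ours:

    /* Illustrative only: estimated speedup of a speculatively
       parallelized loop, from per-loop times the compiler would
       derive from its intermediate representation. */
    double estimate_tls_speedup(double t_seq,       /* sequential execution time   */
                                double t_thread,    /* longest speculative thread  */
                                double t_squash,    /* expected re-execution time  */
                                double t_overflow,  /* expected buffer-full stalls */
                                double t_commit,    /* dispatch + commit cost      */
                                double t_imbalance) /* in-order-commit wait time   */
    {
        double t_tls = t_thread + t_squash + t_overflow
                     + t_commit + t_imbalance;
        return t_seq / t_tls;   /* > 1 predicts speedup, < 1 slowdown */
    }

A runtime system or OS could compare such an estimate against the cores requested, as in the 4-core, 1.8x scheduling example on slide 13.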

  16. Speedup Distribution
     [Figure: distribution of measured speedups and slowdowns across benchmarks, showing very varied speedup/slowdown behavior]

  17. Model Accuracy (I): Outcomes
     • Most speedups/slowdowns are correctly predicted by the model
     • Only 17% false positives (performance degradations)
     • Negligible false negatives (missed opportunities)

  18. Current Developments
     • Done:
       • Completed the implementation of a TLS code generator in GCC
     • Doing:
       • Implementing the cost model in this TLS GCC
       • Profiling TLS program behavior (with IBM and the U. of Manchester)
     • To Do:
       • Develop hybrid cost models based on static and profile information
       • Develop "intelligent" cost models based on machine learning (with the U. of Manchester)

  19. Summary
     • Paraphrasing M. Snir† (UIUC): "parallel programming will have to become synonymous with programming"
     • However, getting there requires:
       • Better (and unsafe) data dependence analysis tools
       • Explicit (and speculative) parallel programming models
       • Auto-parallelizing (speculative) compilers
     • Much work still needs to be done. At the U. of Edinburgh:
       • Auto-parallelizing TLS compilers
       • TLS hardware
       • STM (software transactional memory)

     † Director of Intel and Microsoft's UPCRC

  20. Acknowledgments
     • Research Team and Collaborators
       • Jialin Dou
       • Salman Khan
       • Polychronis Xekalakis
       • Nikolas Ioannou
       • Fabricio Goes
       • Constantino Ribeiro
       • Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
       • Prof. Diego Llanos (U. of Valladolid)
     • Funding
       • UK EPSRC: GR/R65169/01, EP/G000697/1

  21. Beyond Auto-Parallelization: Compilers for Many-Core Systems
      Marcelo Cintra, University of Edinburgh
      http://www.homepages.inf.ed.ac.uk/mc
