
Preparing for Petascale and Beyond


Presentation Transcript


  1. Preparing for Petascale and Beyond
  Celso L. Mendes, http://charm.cs.uiuc.edu/people/cmendes
  Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

  2. Presentation Outline
  • Present Status
  • HPC Landscape, Petascale, Exascale
  • Parallel Programming Lab
  • Mission and approach
  • Programming methodology
  • Scalability results for S&E applications
  • Other extensions and opportunities
  • Some ongoing research directions
  • Happening at Illinois
  • Blue Waters, NCSA/IACAT
  • Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …

  3. Current HPC Landscape (source: top500.org)
  • The petascale era has started!
  • Roadrunner@LANL (#1 in the Top500):
    • Linpack: 1.026 Pflops, Peak: 1.375 Pflops
  • Heterogeneous systems starting to spread (Cell, GPUs, …)
  • Multicore processors widely used
  • Current trends: [Top500 performance-growth charts shown on the slide]

  4. Current HPC Landscape (cont.)
  • Processor counts:
    • #1 Roadrunner@LANL: 122K
    • #2 BG/L@LLNL: 212K
    • #3 BG/P@ANL: 163K
  • Exascale: sooner than we imagine…
  • U.S. Department of Energy town hall meetings in 2007:
    • LBNL (April), ORNL (May), ANL (August)
    • Goals: discuss exascale possibilities and how to accelerate them
    • Sections: Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware
    • Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf

  5. Current HPC Landscape (cont.)
  • Current reality:
    • Steady increase in processor counts
    • Systems becoming multicore or heterogeneous
    • "Memory wall" effects worsening
    • MPI programming model still dominant
  • Challenges (now and into the foreseeable future):
    • How to exploit the new systems' power
    • Capacity vs. capability: different problems
      • Capacity is a concern for system managers
      • Capability is a concern for users
    • How to program in parallel effectively, on both multicore (desktop) and million-core (supercomputer) systems

  6. Parallel Programming Lab

  7. Parallel Programming Lab (PPL)
  • http://charm.cs.uiuc.edu
  • One of the largest research groups at Illinois
  • Currently:
    • 1 faculty member, 3 research scientists, 4 research programmers
    • 13 graduate students, 1 undergraduate student
  • Open positions :)
  [Photo: PPL group, April 2008]

  8. PPL Mission and Approach
  • To enhance performance and productivity in programming complex parallel applications
    • Performance: scalable to thousands of processors
    • Productivity: of human programmers
    • Complex: irregular structure, dynamic variations
  • Application-oriented yet CS-centered research
    • Develop enabling technology for a wide collection of applications
    • Develop, use, and test it in the context of real applications
    • Embody it in easy-to-use abstractions
  • Implementation: Charm++
    • Object-oriented runtime infrastructure
    • Freely available for non-commercial use

  9. Application-Oriented Parallel Abstractions
  • Synergy between Computer Science research and applications has been beneficial to both
  [Diagram: applications (NAMD, ChaNGa, LeanCP, rocket simulation, space-time meshing, others) feed issues to Charm++ and receive techniques & libraries in return]

  10. Programming Methodology

  11. Benefits of Virtualization
  Methodology: migratable objects
  • Programmer: [over]decomposition into objects ("virtual processors", VPs)
  • Runtime: assigns VPs to real processors dynamically, during execution
    • User view: objects; system implementation: objects mapped onto real processors
    • Enables adaptive runtime strategies
  • Implementations: Charm++, AMPI (a minimal Charm++ sketch follows below)
  Benefits:
  • Software engineering
    • The number of virtual processors can be controlled independently of the processor count
    • Separate VP sets for different modules in an application
  • Message-driven execution
    • Adaptive overlap of computation and communication
  • Dynamic mapping
    • Heterogeneous clusters: vacate, adjust to speed, share
    • Automatic checkpointing: change the set of processors used
  • Automatic dynamic load balancing
  • Communication optimization
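
To make the migratable-objects style concrete, here is a minimal Charm++ sketch (not from the talk): it over-decomposes the work into a 1D chare array with several times more elements than physical processors and lets the runtime place and migrate them. The module name vpdemo, the Worker class, and the 8x over-decomposition factor are illustrative assumptions.

    // ---- vpdemo.ci (interface file, processed by charmc) ----
    // mainmodule vpdemo {
    //   readonly CProxy_Main mainProxy;
    //   mainchare Main {
    //     entry Main(CkArgMsg *m);
    //     entry [reductiontarget] void done();
    //   };
    //   array [1D] Worker {
    //     entry Worker();
    //     entry void compute();
    //   };
    // };

    // ---- vpdemo.C ----
    #include "vpdemo.decl.h"

    /* readonly */ CProxy_Main mainProxy;

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        // Over-decompose: many more objects (VPs) than physical processors.
        int numVPs = 8 * CkNumPes();
        CProxy_Worker workers = CProxy_Worker::ckNew(numVPs);
        workers.compute();                  // broadcast to all array elements
      }
      void done() { CkExit(); }             // reduction target: all VPs finished
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      Worker(CkMigrateMessage *m) : CBase_Worker(m) {}   // needed for migration
      void pup(PUP::er &p) {
        CBase_Worker::pup(p);
        // pack/unpack per-object state here so the runtime can migrate it
      }
      void compute() {
        // ... this object's share of the application's work goes here ...
        contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };

    #include "vpdemo.def.h"

It would be built with charmc (roughly: charmc vpdemo.ci, then charmc vpdemo.C -o vpdemo). Because each Worker carries its state through pup(), the runtime is free to migrate the elements for load balancing or checkpointing.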

  12. Adaptive MPI (AMPI): MPI + Virtualization
  • MPI "processes" are implemented as virtual processors (user-level migratable threads), many per real processor
  • Each virtual process is a user-level thread embedded in a Charm++ object
  • Globals and statics must be handled properly, since many virtual ranks share one address space (analogous to what is needed in OpenMP); see the sketch below
  • But thread context switching is much faster than with other techniques
  [Figure: MPI "processes" as virtual processors mapped onto real processors]
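
Because several AMPI virtual ranks can share one address space, a file-scope global would be clobbered across ranks. A minimal sketch of the usual fix, moving per-rank state into an object that is passed explicitly (RankState and solve_step are hypothetical names); the same code runs unchanged under plain MPI or AMPI.

    // Before: a file-scope global would be shared by all virtual ranks in a
    // process under AMPI, so each rank could overwrite the others' value:
    //   static double tolerance;          // unsafe under AMPI
    //
    // After: keep per-rank state in an object owned by the rank's call chain.
    #include <mpi.h>
    #include <cstdio>

    struct RankState {            // hypothetical per-rank state bundle
      int rank = 0;
      double tolerance = 1e-6;
    };

    static void solve_step(RankState &s) {
      // Uses only the state passed in, so it is safe for user-level-thread ranks.
      std::printf("rank %d using tolerance %g\n", s.rank, s.tolerance);
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      RankState s;
      MPI_Comm_rank(MPI_COMM_WORLD, &s.rank);
      solve_step(s);
      MPI_Finalize();
      return 0;
    }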

  13. Parallel Decomposition and Processors
  • MPI style:
    • Encourages decomposition into P pieces, where P is the number of physical processors available
    • If the natural decomposition is a cube, then the number of processors must be a cube
    • Overlapping computation and communication is the user's responsibility
  • Charm++/AMPI style: "virtual processors"
    • Decompose into the natural objects of the application
    • Let the runtime map them to physical processors
    • Decouple decomposition from load balancing

  14. Decomposition Independent of the Number of Cores
  • Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  [Diagram: the MPI version pairs one Solid and one Fluid piece per processor (1 … P); the Charm++/AMPI version uses many pieces (Solid1 … Solidn, Fluid1 … Fluidm) mapped by the runtime]
  • Benefits: load balance, communication optimizations, modularity

  15. Dynamic Load Balancing
  • Based on the principle of persistence
    • Computational loads and communication patterns tend to persist, even in dynamic computations
    • The recent past is a good predictor of the near future
  • Implementation in Charm++:
    • Computational entities (nodes, structured-grid points, particles, …) are partitioned into objects
    • The load of each object can be measured during execution
    • Objects are migrated across processors to balance load
      • A much smaller problem than repartitioning the entire dataset
    • Several policies are available for load-balancing decisions; a simple greedy strategy is sketched below
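
A minimal, generic sketch of such a decision (not Charm++'s actual balancer code): given per-object loads measured over recent timesteps, a greedy strategy assigns the heaviest objects first, each to the currently least-loaded processor.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Returns assignment[i] = processor chosen for object i,
    // given measured per-object loads (e.g. seconds per timestep).
    std::vector<int> greedy_assign(const std::vector<double> &objLoad, int numPes) {
      std::vector<int> order(objLoad.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
      // Consider the heaviest objects first.
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return objLoad[a] > objLoad[b]; });

      // Min-heap of (accumulated load, processor id).
      using Entry = std::pair<double, int>;
      std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
      for (int p = 0; p < numPes; ++p) pes.push({0.0, p});

      std::vector<int> assignment(objLoad.size());
      for (int obj : order) {
        Entry e = pes.top(); pes.pop();      // least-loaded processor so far
        assignment[obj] = e.second;
        e.first += objLoad[obj];
        pes.push(e);
      }
      return assignment;
    }

    int main() {
      std::vector<double> load = {4.0, 2.5, 2.5, 1.0, 1.0, 0.5};
      std::vector<int> where = greedy_assign(load, 3);
      for (size_t i = 0; i < where.size(); ++i)
        std::printf("object %zu -> PE %d\n", i, where[i]);
    }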

  16. Typical Load Balancing Phases
  [Timeline: regular timesteps, then instrumented timesteps followed by a detailed, aggressive load balancing step, then periodic refinement load balancing]

  17. Examples of Science & Engineering Charm++ Applications

  18. NAMD: A Production MD Program
  • Fully featured program
  • NIH-funded development
  • Distributed free of charge (~20,000 registered users)
    • Binaries and source code
  • Installed at NSF centers
    • About 20% of the cycles at NCSA and PSC
  • User training and support
  • Large published simulations
  • Gordon Bell award in 2002
  • URL: www.ks.uiuc.edu/Research/namd

  19. Spatial Decomposition Via Charm++
  • Atoms are distributed to cubes ("cells", or "patches") based on their location
  • Size of each cube: just a bit larger than the cut-off radius
    • Each patch communicates only with its neighbors
    • Work: one object per pair of neighboring patches
    • Communication-to-computation ratio: O(1)
  • However: load imbalance and limited parallelism
    • Charm++ is useful to handle this
  (A sketch of the binning step follows.)
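
A minimal sketch of the binning step, assuming a periodic box and a patch side of at least the cutoff radius; the names, box setup, and cutoff value are illustrative assumptions, not NAMD code.

    #include <array>
    #include <cmath>
    #include <cstdio>

    struct Box { double len[3]; };   // periodic box dimensions

    // Bin an atom into a patch (cube) whose side is >= the cutoff radius,
    // so that all interactions within the cutoff involve only neighbor patches.
    std::array<int, 3> patch_of(const double pos[3], const Box &box, double cutoff) {
      std::array<int, 3> idx;
      for (int d = 0; d < 3; ++d) {
        int nPatches = (int)std::floor(box.len[d] / cutoff);   // patch side >= cutoff
        double frac = pos[d] / box.len[d];
        frac -= std::floor(frac);                              // wrap into [0, 1)
        idx[d] = (int)(frac * nPatches);
      }
      return idx;
    }

    int main() {
      Box box = {{64.0, 64.0, 64.0}};
      double atom[3] = {10.3, 70.1, -5.2};
      std::array<int, 3> p = patch_of(atom, box, 12.0);  // 12 A cutoff -> 5 patches/side
      std::printf("atom -> patch (%d, %d, %d)\n", p[0], p[1], p[2]);
    }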

  20. Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
  • Now there are many objects over which to balance load:
    • Each "diamond" (pairwise compute object) can be assigned to any processor
    • Number of diamonds in 3D: 14 * number of patches (each patch pairs with its 26 neighbors, each pair counted once, plus its self-interaction: 26/2 + 1 = 14)
  • 2-away variation: half-size cubes, 5x5x5 interactions
  • 3-away variation: 7x7x7 interactions
  • Prototype NAMD versions created for Cell and GPUs

  21. Performance of NAMD: STMV
  [Plot: time per step (ms) vs. number of cores for STMV, a ~1-million-atom system]

  22. (figure-only slide; no transcribed text)

  23. ChaNGa: Cosmological Simulations
  • Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington
  • Components: gravity (done), gas dynamics (almost done)
  • Barnes-Hut tree code
    • Particles are represented hierarchically in a tree according to their spatial position
    • "Pieces" of the tree are distributed across processors
  • Gravity computation (see the sketch below):
    • "Nearby" particles: computed precisely
    • "Distant" particles: approximated by the remote node's center of mass
    • A software caching mechanism is critical for performance
  • Multi-timestepping: update only the fastest particles frequently (see Jetley et al., IPDPS 2008)
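
A minimal sketch of the near/far decision in a Barnes-Hut walk: a tree node whose size is small relative to its distance from the particle can be replaced by its center of mass, otherwise it must be opened and its children visited. The Node structure and the theta parameter are illustrative assumptions, not ChaNGa's code.

    #include <cmath>
    #include <cstdio>

    struct Node {
      double com[3];     // center of mass
      double mass;
      double size;       // linear extent of the node's bounding box
      // children omitted in this sketch
    };

    // Standard opening criterion: if size / distance < theta, treat the node
    // as a single point mass at its center of mass; otherwise open it.
    bool can_approximate(const double p[3], const Node &n, double theta) {
      double d2 = 0.0;
      for (int i = 0; i < 3; ++i) {
        double dx = n.com[i] - p[i];
        d2 += dx * dx;
      }
      return n.size * n.size < theta * theta * d2;
    }

    int main() {
      Node n = {{50.0, 0.0, 0.0}, 1.0e4, 4.0};
      double particle[3] = {0.0, 0.0, 0.0};
      std::printf("approximate? %s\n",
                  can_approximate(particle, n, 0.7) ? "yes" : "no");
    }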

  24. ChaNGa Performance
  [Plot: results obtained on Blue Gene/L, with no multi-timestepping and simple load balancers]

  25. Other Opportunities

  26. MPI Extensions in AMPI
  • Automatic load balancing
    • MPI_Migrate(): collective operation, after which migration may occur
  • Asynchronous collective operations
    • e.g. MPI_Ialltoall()
    • Post the operation, test/wait for completion, and do useful work in between (see the sketch below)
  • Checkpointing support
    • MPI_Checkpoint(): checkpoint to disk
    • MPI_MemCheckpoint(): checkpoint in memory, with remote redundancy
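
A minimal sketch of the post/overlap/wait pattern behind the asynchronous collectives. MPI_Ialltoall, an AMPI extension at the time of this talk, has since become standard in MPI-3, so this compiles with any recent MPI; the AMPI-specific calls (MPI_Migrate, MPI_MemCheckpoint, named as on the slide) are only indicated in a comment.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int nranks;
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      std::vector<double> sendbuf(nranks, 1.0), recvbuf(nranks, 0.0);
      MPI_Request req;

      // Post the collective...
      MPI_Ialltoall(sendbuf.data(), 1, MPI_DOUBLE,
                    recvbuf.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD, &req);

      // ...do independent computation here while the exchange progresses...

      MPI_Wait(&req, MPI_STATUS_IGNORE);   // then wait for completion

      // At a safe point in the timestep loop an AMPI program would also call
      // the extensions listed above, e.g. MPI_Migrate() to allow load
      // balancing, or MPI_MemCheckpoint() for an in-memory checkpoint.

      MPI_Finalize();
      return 0;
    }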

  27. Performance Tuning for Future Machines
  • For example, Blue Waters will arrive in 2011
    • But applications must be prepared for it starting now
  • Even for existing machines:
    • The full-size machine may not be available as often as needed for tuning runs
  • A simulation-based approach is needed
  • Our approach: BigSim
    • Based on the Charm++ virtualization approach
    • Full-scale program emulation
    • Trace-driven simulation
    • History: developed for Blue Gene predictions

  28. BigSim Simulation System
  • General system organization
  • Emulation:
    • Run an existing, full-scale MPI, AMPI or Charm++ application
    • Uses an emulation layer that pretends to be (say) 100K cores
    • Target cores are emulated as Charm++ virtual processors
  • Resulting traces (logs) capture:
    • Characteristics of SEBs (Sequential Execution Blocks)
    • Dependences between SEBs and messages

  29. BigSim Simulation System (cont.)
  • Trace-driven parallel simulation
    • Typically run on tens to hundreds of processors
    • Multiple-resolution simulation of sequential execution: from a simple scaling factor to cycle-accurate modeling
    • Multiple-resolution simulation of the network: from a simple latency/bandwidth model to detailed packet- and switching-port-level modeling (see the sketch below)
    • Generates timing traces just as a real application would on the full-scale machine
  • Phase 3: analyze performance
    • Identify bottlenecks, even without predicting exact performance
    • Carry out various "what-if" analyses
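
For illustration, the simplest end of that network-model spectrum is a latency/bandwidth model: predicted message time = per-message latency + size / bandwidth. The parameter values below are assumptions, not measurements.

    #include <cstdio>

    // Simplest network model: alpha (latency) + bytes / beta (bandwidth).
    double predict_msg_time(double bytes, double alpha_s, double beta_bytes_per_s) {
      return alpha_s + bytes / beta_bytes_per_s;
    }

    int main() {
      // e.g. 5 microseconds latency, 1 GB/s per-link bandwidth
      double t = predict_msg_time(64.0 * 1024, 5e-6, 1e9);
      std::printf("predicted time for a 64 KB message: %.2f us\n", t * 1e6);
    }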

  30. Projections: Performance Visualization

  31. BigSim Validation: BG/L Predictions

  32. Some Ongoing Research Directions

  33. Load Balancing for Large Machines: I
  • Centralized balancers achieve the best balance
    • They collect the object-communication graph on one processor
    • But they will not scale beyond tens of thousands of nodes
  • Fully distributed load balancers
    • Avoid the bottleneck, but achieve poor load balance
    • Not adequately agile
  • Hierarchical load balancers
    • Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers

  34. Load Balancing for Large Machines: II
  • Interconnect topology starts to matter again
    • It was hidden by wormhole routing, etc.
    • Latency variation is still small, but bandwidth occupancy (link contention) is a problem
  • Topology-aware load balancers (see the sketch after this slide)
    • Some general heuristics have shown good performance, but may require too much compute power
    • Special-purpose heuristics work fine when applicable
  • Preliminary results: see Bhatele & Kale's paper, LSPP@IPDPS 2008
  • Still, many open challenges
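
A minimal sketch of the kind of metric such balancers weigh: "hop-bytes", the communication volume between two objects multiplied by the number of torus links between their processors. The torus dimensions and traffic values are illustrative assumptions, not code from the cited paper.

    #include <cstdio>
    #include <cstdlib>

    struct Coord { int x, y, z; };

    // Shortest-path hop count between two nodes of a dims-sized 3D torus.
    int torus_hops(Coord a, Coord b, Coord dims) {
      auto wrap = [](int d, int size) {
        int v = std::abs(d) % size;
        return v < size - v ? v : size - v;   // go the shorter way around the ring
      };
      return wrap(a.x - b.x, dims.x) + wrap(a.y - b.y, dims.y) + wrap(a.z - b.z, dims.z);
    }

    int main() {
      Coord dims = {8, 8, 8};
      Coord objA = {0, 0, 0}, objB = {7, 2, 4};
      double bytes = 2.0e6;                    // traffic between the two objects
      double hopBytes = bytes * torus_hops(objA, objB, dims);
      std::printf("hop-bytes contribution: %.0f\n", hopBytes);
    }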

  35. Major Challenges in Applications
  • NAMD:
    • Scalable PME (long-range forces): 3D FFT
  • Specialized balancers for multi-resolution cases
    • Example: ChaNGa running highly clustered cosmological datasets with multi-timestepping
  [Timelines of processor activity: (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing]

  36. BigSim: Challenges
  • BigSim's simple diagram hides many complexities
  • Emulation:
    • Automatic out-of-core support for applications with large memory footprints
  • Simulation:
    • Accuracy vs. cost tradeoffs
    • Interpolation mechanisms for predicting serial performance
    • Memory management optimizations
    • I/O optimizations for handling (many) large trace files
  • Performance analysis:
    • Scalable tools are needed
    • An active area of research

  37. Fault Tolerance
  • Automatic checkpointing
    • Migrate objects to disk
    • In-memory checkpointing as an option
    • Both schemes are available in Charm++ (an in-memory sketch follows)
  • Proactive fault handling
    • Migrate objects to other processors upon detecting an imminent fault
    • Adjust processor-level parallel data structures
    • Rebalance load after migrations
    • HiPC'07 paper: Chakravorty et al.
  • Scalable fault tolerance
    • When one processor out of 100,000 fails, the other 99,999 should not have to roll back to their checkpoints!
    • Sender-side message logging
    • Restart can be sped up by spreading the failed processor's objects across other processors
    • IPDPS'07 paper: Chakravorty & Kale
    • Ongoing effort to minimize logging-protocol overheads
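
A minimal sketch of the double in-memory checkpointing idea: each rank keeps a copy of a "buddy" rank's checkpoint, so a single failure can be recovered without touching disk. The ring-buddy choice, the assumption of equal state sizes, and the use of plain MPI are illustrative simplifications, not the Charm++ implementation.

    #include <mpi.h>
    #include <vector>

    // Exchange checkpoints around a ring: send my state to my buddy while
    // receiving my neighbor's state as remote redundancy.
    void memory_checkpoint(const std::vector<double> &state,
                           std::vector<double> &buddyCopy, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int buddy = (rank + 1) % size;              // holds my redundant copy
      int from  = (rank - 1 + size) % size;       // I hold this rank's copy

      buddyCopy.resize(state.size());             // assumes equal state sizes
      MPI_Sendrecv(state.data(), (int)state.size(), MPI_DOUBLE, buddy, 0,
                   buddyCopy.data(), (int)buddyCopy.size(), MPI_DOUBLE, from, 0,
                   comm, MPI_STATUS_IGNORE);
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      std::vector<double> state(1000, 3.14), buddyCopy;
      memory_checkpoint(state, buddyCopy, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
    }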

  38. Higher Level Languages & Interoperability

  39. HPC at Illinois

  40. HPC at Illinois
  • Many other exciting developments
  • Microsoft/Intel parallel computing research center
  • Parallel programming classes
    • CS-420: Parallel Programming: Science and Engineering
    • ECE-498: NVIDIA/ECE collaboration
  • HP/Intel/Yahoo! institute
  • NCSA's Blue Waters system, approved for 2011
    • see http://www.ncsa.uiuc.edu/BlueWaters/
  • New NCSA/IACAT institute
    • see http://www.iacat.uiuc.edu/

  41. Microsoft/Intel UPCRC
  • Universal Parallel Computing Research Center
  • 5-year funding, 2 centers: Univ. of Illinois & Univ. of California, Berkeley
  • Joint effort by Intel/Microsoft: $2M/year
  • Mission: conduct research to make parallel programming broadly accessible and "easy"
  • Focus areas: programming, translation, execution, applications
  • URL: http://www.upcrc.illinois.edu/

  42. Parallel Programming Classes
  • CS-420: Parallel Programming
    • Introduction to fundamental issues in parallelism
    • Students from both CS and other engineering areas
    • Offered every semester, by CS Profs. Kale or Padua
  • ECE-498: Programming Massively Parallel Processors
    • Focus on GPU programming techniques
    • Taught by ECE Prof. Wen-Mei Hwu and NVIDIA's Chief Scientist David Kirk
    • URL: http://courses.ece.uiuc.edu/ece498/al1

  43. HP/Intel/Yahoo! Initiative
  • Cloud Computing Testbed, worldwide
  • Goal: study Internet-scale systems, focusing on data-intensive applications using distributed computational resources
  • Areas of study: networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
  • Illinois/CS testbed site:
    • 1,024-core HP system with 200 TB of disk space
    • External access via an upcoming proposal-selection process
  • URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html

  44. Our Sponsors

  45. PPL Funding Sources
  • National Science Foundation: BigSim, cosmology, languages
  • Department of Energy: Charm++ (load balancing, fault tolerance), quantum chemistry
  • National Institutes of Health: NAMD
  • NCSA/NSF, NCSA/IACAT: Blue Waters project, applications
  • Department of Energy / UIUC Rocket Center: AMPI, applications
  • NASA: cosmology/visualization

  46. Obrigado! (Thank you!)
