Scalable Molecular Dynamics for Large Biomolecular Systems Robert Brunner, James C. Phillips, Laxmikant Kale, Department of Computer Science and Theoretical Biophysics Group, University of Illinois at Urbana-Champaign
Parallel Computing with Data-driven Objects Laxmikant (Sanjay) Kale, Parallel Programming Laboratory, Department of Computer Science http://charm.cs.uiuc.edu
Overview • Context: approach and methodology • Molecular dynamics for biomolecules • Our program, NAMD • Basic parallelization strategy • NAMD performance optimizations • Techniques • Results • Conclusions: summary, lessons and future work
Parallel Programming Laboratory • Objective: Enhance performance and productivity in parallel programming • For complex, dynamic applications • Scalable to thousands of processors • Theme: • Adaptive techniques for handling dynamic behavior • Strategy: Look for optimal division of labor between human programmer and the “system” • Let the programmer specify what to do in parallel • Let the system decide when and where to run them • Data driven objects as the substrate: Charm++
System Mapped Objects (diagram: numbered objects assigned to processors by the runtime system)
Data Driven Execution (diagram: a scheduler and a message queue on each processor)
Charm++ • Parallel C++ with data driven objects • Object Arrays and collections • Asynchronous method invocation • Object Groups: • global object with a “representative” on each PE • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
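The execution model behind the two diagrams above can be made concrete with a small, single-process sketch in plain C++. This is only an illustration of data-driven scheduling and asynchronous method invocation, not the actual Charm++ API (chares, proxies, and entry methods are not shown); all names here are made up for the example.

```cpp
// Illustrative sketch (plain C++, not the Charm++ API) of data-driven execution:
// work arrives as messages addressed to objects, and a per-processor scheduler
// invokes the target object's method for each message.
#include <cstdio>
#include <queue>
#include <vector>

struct Message {
    int target;        // index of the destination object
    double payload;    // data carried by the asynchronous invocation
};

class Chunk {          // stands in for one migratable, data-driven object
public:
    explicit Chunk(int id) : id_(id) {}
    void compute(double x) const { std::printf("chunk %d processed %g\n", id_, x); }
private:
    int id_;
};

int main() {
    std::vector<Chunk> chunks = {Chunk(0), Chunk(1), Chunk(2)};
    std::queue<Message> mq;  // one message queue per (simulated) processor

    // "Asynchronous method invocation": enqueue a message instead of calling directly.
    for (int i = 0; i < 3; ++i) mq.push({i, 1.5 * i});

    // Scheduler loop: whichever object has a message ready gets to run.
    while (!mq.empty()) {
        const Message m = mq.front();
        mq.pop();
        chunks[m.target].compute(m.payload);
    }
    return 0;
}
```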
Multi-partition Decomposition • Writing applications with Charm++ • Decompose the problem into a large number of chunks • Implement chunks as objects • Or, now, as MPI threads (AMPI on top of Charm++) • Let Charm++ map and remap objects • Allow for migration of objects • If desired, specify potential migration points
Load Balancing Mechanisms • Re-map and migrate objects • Registration mechanisms facilitate migration • Efficient message delivery strategies • Efficient global operations • Such as reductions and broadcasts • Several classes of load balancing strategies provided • Incremental • Centralized as well as distributed • Measurement based
Principle of Persistence • An observation about CSE applications • An extension of the principle of locality • The behavior of objects, including computational load and communication patterns, tends to persist over time • Application-induced imbalance: • Abrupt, but infrequent, or • Slow and cumulative • Rarely: frequent, large changes • Our framework deals with this case as well • Measurement-based strategies
Measurement-Based Load Balancing Strategies • Collect timing data for several cycles • Run heuristic load balancer • Several alternative ones • Robert Brunner’s recent Ph.D. thesis: • Instrumentation framework • Strategies • Performance comparisons
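A minimal sketch of what "collect timing data for several cycles" means in practice, in plain C++ rather than the Charm++ instrumentation framework; the LoadRecorder class and the dummy workloads are assumptions made up for illustration.

```cpp
// Illustrative instrumentation sketch (not the Charm++ framework): record how
// long each object spends computing over several timesteps; the accumulated
// per-object loads are what a measurement-based balancer consumes.
#include <chrono>
#include <cstdio>
#include <vector>

struct LoadRecorder {
    std::vector<double> seconds;                 // accumulated load per object
    explicit LoadRecorder(std::size_t n) : seconds(n, 0.0) {}

    template <typename Work>
    void timed(std::size_t objectId, Work&& work) {
        auto t0 = std::chrono::steady_clock::now();
        work();                                  // run the object's step
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        seconds[objectId] += dt.count();         // persistence: past load predicts future load
    }
};

int main() {
    LoadRecorder rec(2);
    for (int step = 0; step < 3; ++step) {       // "collect timing data for several cycles"
        rec.timed(0, [] { volatile double x = 0; for (int i = 0; i < 100000; ++i) x += i; });
        rec.timed(1, [] { volatile double x = 0; for (int i = 0; i < 300000; ++i) x += i; });
    }
    std::printf("object 0: %.4f s, object 1: %.4f s\n", rec.seconds[0], rec.seconds[1]);
    return 0;
}
```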
Molecular Dynamics ApoA-I: 92k Atoms
Molecular Dynamics and NAMD • MD is used to understand the structure and function of biomolecules • Proteins, DNA, membranes • NAMD is a production-quality MD program • Active use by biophysicists (published science) • 50,000+ lines of C++ code • 1000+ registered users • Features include: • CHARMM and XPLOR compatibility • PME electrostatics and multiple timestepping • Steered and interactive simulation via VMD
NAMD Contributors • PIs: • Laxmikant Kale, Klaus Schulten, Robert Skeel • NAMD Version 1: • Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson • NAMD2: • M. Bhandarkar, R. Brunner, Justin Gullingsrud, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan, Gengbin Zheng, .. • Theoretical Biophysics Group, supported by NIH
Molecular Dynamics • Collection of [charged] atoms, with bonds • Newtonian mechanics • At each time-step: • Calculate forces on each atom • Bonds • Non-bonded: electrostatic and van der Waals • Calculate velocities and advance positions • 1 femtosecond time-step, millions needed! • Thousands of atoms (1,000 - 100,000)
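A schematic version of the per-timestep loop just described, for orientation only: the force routines are empty stubs, the update is a simplified explicit integrator, and the data layout is an assumption, not NAMD's actual code.

```cpp
// Schematic MD timestep (illustrative only; NAMD's integrator and force field
// are far more involved): zero forces, evaluate bonded and non-bonded terms,
// then advance velocities and positions.
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

struct System {
    std::vector<Vec3> pos, vel, force;
    std::vector<double> mass;
};

// Stubs for the two force classes named on the slide.
void computeBondedForces(System&)    { /* bonds, angles, dihedrals (omitted) */ }
void computeNonbondedForces(System&) { /* electrostatics + van der Waals, cut-off based (omitted) */ }

void timestep(System& s, double dt) {            // dt is about 1 femtosecond
    for (Vec3& f : s.force) f = Vec3{0, 0, 0};
    computeBondedForces(s);
    computeNonbondedForces(s);
    for (std::size_t i = 0; i < s.pos.size(); ++i) {
        s.vel[i].x += dt * s.force[i].x / s.mass[i];
        s.vel[i].y += dt * s.force[i].y / s.mass[i];
        s.vel[i].z += dt * s.force[i].z / s.mass[i];
        s.pos[i].x += dt * s.vel[i].x;
        s.pos[i].y += dt * s.vel[i].y;
        s.pos[i].z += dt * s.vel[i].z;
    }
}

int main() {
    System s;
    s.pos = {{0, 0, 0}};
    s.vel = {{0, 0, 0}};
    s.force = {{0, 0, 0}};
    s.mass = {12.0};
    for (int step = 0; step < 10; ++step) timestep(s, 1e-15);  // 1 fs steps
    return 0;
}
```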
Cut-off Radius • Use of a cut-off radius to reduce work • 8 - 14 Å • Far-away atoms ignored! (screening effects) • 80-95% of the work is non-bonded force computation • Some simulations need far-away contributions • Particle-Mesh Ewald (PME) • Even so, cut-off based computations are important: • Near-atom calculations constitute the bulk of the above • Multiple time-stepping is used: 1 PME step per k steps • So, (k-1) of every k steps do just cut-off based computation
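To make the cut-off idea concrete, here is a minimal, self-contained pair loop that skips atoms beyond the cut-off radius. It is not NAMD code; the 12 Å value, the bare Coulomb term, and the naive all-pairs loop are assumptions for illustration.

```cpp
// Illustrative cut-off check: pairs farther apart than the cut-off radius are
// skipped entirely, which is why short-range non-bonded work dominates and why
// only nearby atoms need to be communicated.
#include <cmath>
#include <cstdio>
#include <vector>

struct Atom { double x, y, z, charge; };

double shortRangeEnergy(const std::vector<Atom>& atoms, double rcut) {
    const double rcut2 = rcut * rcut;
    double energy = 0.0;
    for (std::size_t i = 0; i < atoms.size(); ++i) {
        for (std::size_t j = i + 1; j < atoms.size(); ++j) {
            const double dx = atoms[i].x - atoms[j].x;
            const double dy = atoms[i].y - atoms[j].y;
            const double dz = atoms[i].z - atoms[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > rcut2) continue;              // far-away atoms ignored
            energy += atoms[i].charge * atoms[j].charge / std::sqrt(r2);
        }
    }
    return energy;
}

int main() {
    std::vector<Atom> atoms = {{0, 0, 0, 1.0}, {3, 0, 0, -1.0}, {40, 0, 0, 1.0}};
    std::printf("E = %g\n", shortRangeEnergy(atoms, 12.0));  // 12 A cut-off (assumed)
    return 0;
}
```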
Early Methods • Atom replication: • Each processor has data for all atoms • Force calculations parallelized • Collection of forces: O(N log P) communication • Computation: O(N/P) • Communication/computation ratio: O(P log P): not scalable • Atom decomposition: • Partition the atom array across processors • Nearby atoms may not be on the same processor • Communication: O(N) per processor • Ratio: O(N) / (N/P) = O(P): not scalable
Force Decomposition • Distribute the force matrix across processors • Matrix is sparse, non-uniform • Each processor has one block • Communication: O(N/√P) per processor • Ratio: O(√P) • Better scalability in practice • Can use 100+ processors • Plimpton: • Hwang, Saltz, et al: • 6% efficiency on 32 processors • 36% on 128 processors • Yet not scalable in the sense defined here!
Spatial Decomposition • Allocate close-by atoms to the same processor • Three variations possible: • Partitioning into P boxes, 1 per processor • Good scalability, but hard to implement • Partitioning into fixed size boxes, each a little larger than the cut-off distance • Partitioning into smaller boxes • Communication: O(N/P) • Communication/Computation ratio: O(1) • So, scalable in principle
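A minimal sketch of the "boxes a little larger than the cut-off" variant described above. The 10% margin, the std::map of boxes, and the function names are assumptions for illustration; NAMD's patch machinery is, of course, much richer.

```cpp
// Illustrative spatial decomposition: space is divided into boxes slightly
// larger than the cut-off radius, so each atom interacts only with atoms in
// its own box and the 26 neighboring boxes.
#include <array>
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

struct Atom { double x, y, z; };
using BoxIndex = std::array<int, 3>;

std::map<BoxIndex, std::vector<Atom>> decompose(const std::vector<Atom>& atoms,
                                                double rcut) {
    const double boxSize = rcut * 1.1;   // "a little larger than the cut-off" (assumed margin)
    std::map<BoxIndex, std::vector<Atom>> boxes;
    for (const Atom& a : atoms) {
        BoxIndex idx = {int(std::floor(a.x / boxSize)),
                        int(std::floor(a.y / boxSize)),
                        int(std::floor(a.z / boxSize))};
        boxes[idx].push_back(a);          // nearby atoms land in the same box
    }
    return boxes;
}

int main() {
    std::vector<Atom> atoms = {{1, 1, 1}, {2, 2, 2}, {30, 30, 30}};
    auto boxes = decompose(atoms, 12.0);
    std::printf("%zu occupied boxes\n", boxes.size());
    return 0;
}
```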
Ongoing work • Plimpton, Hendrickson: • new spatial decomposition • NWChem (PNL) • Peter Kollman, Yong Duan et al: • microsecond simulation • AMBER version (SANDER)
Spatial Decomposition in NAMD • But the load-balancing problems are still severe
FD + SD • Now we have many more objects to load balance: • Each diamond can be assigned to any processor • Number of diamonds (3D): 14 × number of patches
Bond Forces • Multiple types of forces: • Bonds (2 atoms), angles (3), dihedrals (4), .. • Luckily, each involves atoms in neighboring patches only • Straightforward implementation: • Send a message to all neighbors, receive forces from them • 26 × 2 messages per patch!
Bond Forces • Assume one patch per processor • An angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in patch (max{xi}, max{yi}, max{zi}) (see the sketch below)
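The ownership rule stated on this slide can be written down directly. The helper below is illustrative only; names such as owningPatch are made up for the example and are not NAMD identifiers.

```cpp
// Sketch of the rule above: a bonded term whose atoms live in several patches
// is computed exactly once, in the patch whose index is the coordinate-wise
// maximum of the atoms' patch indices.
#include <algorithm>
#include <array>
#include <cstdio>

using PatchIndex = std::array<int, 3>;

PatchIndex owningPatch(const PatchIndex& a, const PatchIndex& b, const PatchIndex& c) {
    return {std::max({a[0], b[0], c[0]}),
            std::max({a[1], b[1], c[1]}),
            std::max({a[2], b[2], c[2]})};
}

int main() {
    PatchIndex p = owningPatch({1, 2, 3}, {2, 2, 3}, {1, 3, 3});
    std::printf("angle computed in patch (%d,%d,%d)\n", p[0], p[1], p[2]);  // prints (2,3,3)
    return 0;
}
```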
NAMD Implementation • Multiple objects per processor • Different types: patches, pairwise forces, bonded forces • Each may have its data ready at different times • Need ability to map and remap them • Need prioritized scheduling • Charm++ supports all of these
Load Balancing • A major challenge for this application • Especially for a large number of processors • Unpredictable workloads • Each diamond (force “compute” object) and patch encapsulates a variable amount of work • Static estimates are inaccurate • Very slow variations across timesteps • Measurement-based load balancing framework
Load Balancing Strategy • Greedy variant (simplified; see the sketch below): • Sort compute objects (diamonds) by load • Repeat until all are assigned: • S = set of processors that are not overloaded and would generate the least new communication • P = least loaded processor in S • Assign the heaviest remaining compute to P • Refinement: • Repeat: pick a compute from the most overloaded PE and assign it to a suitable underloaded PE • Until no movement
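A much-simplified sketch of the greedy phase, assuming load alone matters; the overload and communication filters from the slide, and the refinement pass, are noted in comments but omitted. The actual NAMD/Charm++ balancers also account for communication between objects.

```cpp
// Simplified greedy assignment: place the heaviest computes first, each on the
// currently least-loaded processor.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Compute { int id; double load; };

std::vector<int> greedyAssign(std::vector<Compute> computes, int numProcs) {
    std::vector<double> procLoad(numProcs, 0.0);
    std::vector<int> assignment(computes.size(), -1);
    // Sort computes by decreasing load so the biggest pieces are placed first.
    std::sort(computes.begin(), computes.end(),
              [](const Compute& a, const Compute& b) { return a.load > b.load; });
    for (const Compute& c : computes) {
        // "Least loaded processor in S"; here S is simply all processors, since the
        // overload and communication filters are omitted in this sketch.
        int p = int(std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
        assignment[c.id] = p;
        procLoad[p] += c.load;
    }
    return assignment;
}

int main() {
    std::vector<Compute> cs = {{0, 5.0}, {1, 3.0}, {2, 2.5}, {3, 1.0}};
    auto a = greedyAssign(cs, 2);
    for (std::size_t i = 0; i < a.size(); ++i)
        std::printf("compute %zu -> processor %d\n", i, a[i]);
    return 0;
}
```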
Speedups in 1998 ApoA-I: 92k atoms
Optimizations • Series of optimizations • Examples discussed here: • Grainsize distributions (bimodal) • Integration: message sending overheads • Several other optimizations • Separation of bond/angle/dihedral objects • Inter-patch and intra-patch • Prioritization • Local synchronization to avoid interference across steps
Grainsize and Amdahl’s Law • A variant of Amdahl’s law, for objects: • The fastest time can be no shorter than the time for the biggest single object! • How did it apply to us? • Sequential step time was 57 seconds • To run on 2k processors, no object should take more than 28 msec (see the worked numbers below) • Should be even shorter • Grainsize analysis via Projections showed that this was not so..
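A quick check of the numbers on this slide, under the assumption that "2k processors" means P = 2048:

```latex
% Perfect speedup on P processors requires T_{\text{par}} = T_{\text{seq}}/P,
% and T_{\text{par}} can never be smaller than the largest single object, so
t_i \;\le\; \frac{T_{\text{seq}}}{P} \;=\; \frac{57\ \text{s}}{2048} \;\approx\; 28\ \text{ms per object.}
```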
Grainsize Analysis • Problem: some compute objects had far too much work (the grainsize distribution was bimodal) • Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms (see the sketch below)
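A hedged sketch of such a splitting heuristic. The pair-count criterion, the threshold value, and the halving scheme are assumptions made up for illustration; NAMD's actual rule is more elaborate.

```cpp
// Illustrative splitting: a pairwise compute whose two patches contain too many
// interacting atom pairs is divided into smaller pieces until each is small enough.
#include <cstdio>
#include <vector>

struct PairCompute { int patchA, patchB, atomsA, atomsB; };

std::vector<PairCompute> splitIfTooBig(const PairCompute& c, long maxPairs) {
    const long pairs = static_cast<long>(c.atomsA) * c.atomsB;
    if (c.atomsA <= 1 || pairs <= maxPairs) return {c};   // small enough (or unsplittable)
    // Split patch A's atoms into two halves and recurse on each piece.
    PairCompute lo = c, hi = c;
    lo.atomsA = c.atomsA / 2;
    hi.atomsA = c.atomsA - lo.atomsA;
    auto pieces = splitIfTooBig(lo, maxPairs);
    auto right = splitIfTooBig(hi, maxPairs);
    pieces.insert(pieces.end(), right.begin(), right.end());
    return pieces;
}

int main() {
    PairCompute big = {0, 1, 4000, 4000};          // hypothetical patch sizes
    auto pieces = splitIfTooBig(big, 1000000);     // assumed work threshold
    std::printf("split into %zu pieces\n", pieces.size());
    return 0;
}
```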
Performance Audit • Through the optimization process, an audit was kept to decide where to look to improve performance:
Component     Ideal    Actual
Total         57.04    86
nonBonded     52.44    49.77
Bonds          3.16     3.9
Integration    1.44     3.05
Overhead       0        7.97
Imbalance      0       10.45
Idle           0        9.25
Receives       0        1.61
(Note: integration time doubled)
Integration Overhead Analysis • Problem: integration time had doubled compared with the sequential run
Integration Overhead Example • The Projections views showed that the overhead was associated with sending messages • Many cells (patches) were sending 30-40 messages • Even so, the overhead was too large compared with the cost of the messages themselves • Code analysis: memory allocations! • An identical message was being sent to 30+ processors • Simple multicast support was added to Charm++ • It mainly eliminates memory allocations (and some copying); see the sketch below
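An illustrative before/after comparison, in plain C++ rather than Charm++'s actual multicast interface, of why building the message once pays off: the per-destination allocation and copy disappear.

```cpp
// Before: one allocation + copy per destination.  After: build once, send to all.
#include <cstdio>
#include <memory>
#include <vector>

struct Message { std::vector<double> coordinates; };

void sendTo(int pe, const Message& m) {                 // stand-in for a real send
    std::printf("send %zu coords to PE %d\n", m.coordinates.size(), pe);
}

// Before: an identical message is allocated and copied for each destination processor.
void sendSeparately(const Message& m, const std::vector<int>& pes) {
    for (int pe : pes) {
        auto copy = std::make_unique<Message>(m);       // 30+ allocations per patch
        sendTo(pe, *copy);
    }
}

// After: the message is built once and shared across the destination list.
void multicast(const Message& m, const std::vector<int>& pes) {
    for (int pe : pes) sendTo(pe, m);                   // no per-destination copies
}

int main() {
    Message m{std::vector<double>(300, 0.0)};
    std::vector<int> pes = {1, 2, 3};
    sendSeparately(m, pes);
    multicast(m, pes);
    return 0;
}
```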
ApoA-I on ASCI Red 57 ms/step
BC1 on ASCI Red 58.4 GFlops
Lessons Learned • Need to downsize objects! • Choose the smallest grainsize that amortizes the overhead • One of the biggest challenges was getting time for performance-tuning runs on parallel machines
Future and Planned Work • Increased speedups on 2k-10k processors • Smaller grainsizes • Parallelizing integration further • New algorithms for reducing communication impact • New load balancing strategies • Further performance improvements for PME • With multiple timestepping • Needs multi-phase load balancing • Speedup on small molecules! • Interactive molecular dynamics
More Information • Charm++ and associated framework: • http://charm.cs.uiuc.edu • NAMD and associated biophysics tools: • http://www.ks.uiuc.edu • Both include downloadable software
Parallel Programming Laboratory • Funding: • Dept of Energy (via the Rocket center) • National Science Foundation • National Institutes of Health • Group members and affiliated (NIH/Biophysics) members: Jim Phillips, Kirby Vandivoort, Joshua Unger, Gengbin Zheng, Jay Desouza, Sameer Kumar, Chee wai Lee, Milind Bhandarkar, Terry Wilmarth, Orion Lawlor, Neelam Saboo, Arun Singla, Karthikeyan Mahesh
The Parallel Programming Problem • Is there one? • We can all write MPI programs, right? • Several Large Machines in use • But: • New complex apps with dynamic and irregular structure • Should all application scientists also be experts in parallel computing?