NAMD and BG/L
Chee Wai Lee (cheelee@uiuc.edu)
Parallel Programming Laboratory, Computer Science Department
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Outline
• BG/L Platform overview
• Optimization Efforts: Context
• Optimization Efforts: Approaches
  • Topology Awareness
  • Load Balancing
  • Parallelism
  • Computation/Communication Overlap
• Results
Blue Gene/L Platform Review
• Hardware characteristics:
  • 32-bit PowerPC 440 processors at 700 MHz
  • 2 processors per node, with no cache coherence between them
  • 4 MB L3 cache
  • 512 MB memory per node
  • 6 outgoing FIFO links per node
  • 3D torus interconnect
Blue Gene/L Platform Review (2)
• Other characteristics:
  • Lightweight microkernel on the compute nodes, so minimal OS interference.
Outline
• BG/L Platform overview
• Optimization Efforts: Context
• Optimization Efforts: Approaches
  • Topology Awareness
  • Load Balancing
  • Parallelism
  • Computation/Communication Overlap
• Results
Objectives
• Scale the 92,000-atom ApoA1 benchmark as far as possible.
• Understand the scaling issues involved on the BG/L machine.
Outline
• BG/L Platform overview
• Optimization Efforts: Context
• Optimization Efforts: Approaches
  • Topology Awareness
  • Load Balancing
  • Parallelism
  • Computation/Communication Overlap
• Results
Topology Awareness
• Distribute patches according to the torus topology.
• Logically align NAMD's 3D patch grid with BG/L's 3D processor grid (see the sketch below).
• The patch grid is divided by an Orthogonal Recursive Bisection (ORB) scheme.
• The processor grid is divided in the same proportions and assigned to the corresponding patch subgrids.
• Topology-aware spanning trees for multicasts.
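To make the ORB mapping concrete, here is a minimal C++ sketch of the idea above. It is illustrative only, not NAMD's actual code: the Box struct, the function names, and the grid sizes are all invented for the example. Both grids are cut along the same axis in matching proportions, so each block of patches lands on the corresponding block of the processor torus.

    #include <cstdio>

    struct Box { int lo[3], hi[3]; };  // half-open [lo, hi) ranges in x, y, z

    int longestAxis(const Box& b) {
      int axis = 0, len = b.hi[0] - b.lo[0];
      for (int d = 1; d < 3; ++d)
        if (b.hi[d] - b.lo[d] > len) { len = b.hi[d] - b.lo[d]; axis = d; }
      return axis;
    }

    // Recursively bisect the processor box along its longest axis and cut
    // the patch box at the same fractional position; at the leaves, every
    // patch in the patch box is assigned to the single remaining processor.
    void orbMap(Box patches, Box procs) {
      int axis = longestAxis(procs);
      int plen = procs.hi[axis] - procs.lo[axis];
      if (plen <= 1) {  // leaf: processor box is a single processor
        printf("patches [%d,%d)x[%d,%d)x[%d,%d) -> proc (%d,%d,%d)\n",
               patches.lo[0], patches.hi[0], patches.lo[1], patches.hi[1],
               patches.lo[2], patches.hi[2],
               procs.lo[0], procs.lo[1], procs.lo[2]);
        return;
      }
      int procCut  = procs.lo[axis] + plen / 2;
      int patLen   = patches.hi[axis] - patches.lo[axis];
      int patchCut = patches.lo[axis] + patLen * (plen / 2) / plen;
      Box pL = patches, pR = patches, qL = procs, qR = procs;
      pL.hi[axis] = patchCut; pR.lo[axis] = patchCut;
      qL.hi[axis] = procCut;  qR.lo[axis] = procCut;
      orbMap(pL, qL);
      orbMap(pR, qR);
    }

    int main() {
      Box patches = {{0,0,0}, {12,12,12}};  // example NAMD patch grid
      Box procs   = {{0,0,0}, {8,8,8}};     // example BG/L torus partition
      orbMap(patches, procs);
    }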
Load Balancing
• Framework optimizations:
  • The memory footprint had to be reduced to accommodate the desired number of processors.
  • A spanning tree was implemented to handle the large number of messages arriving at PE 0 (see the sketch below).
• Spread non-migratable work better:
  • Bonded computations (e.g., dihedrals) are placed away from processors that own patch work where possible.
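The spanning tree can be pictured with a small sketch. This is not the actual Charm++ load balancer code; the branching factor and function names are assumptions for illustration. The point is that with a k-ary tree, PE 0 receives at most k messages per reduction step instead of one from every processor.

    #include <cstdio>

    const int ARITY = 4;  // branching factor (illustrative choice)

    // In a k-ary spanning tree rooted at PE 0, each PE forwards its load
    // statistics to its parent, so no PE receives more than ARITY messages.
    int parentOf(int pe)     { return pe == 0 ? -1 : (pe - 1) / ARITY; }
    int firstChildOf(int pe) { return pe * ARITY + 1; }

    int main() {
      const int numPes = 16;
      for (int pe = 0; pe < numPes; ++pe) {
        printf("PE %2d -> parent %2d, children:", pe, parentOf(pe));
        for (int c = firstChildOf(pe);
             c < firstChildOf(pe) + ARITY && c < numPes; ++c)
          printf(" %d", c);
        printf("\n");
      }
    }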
More Parallelism
• 2-away computation: patches interact with neighbors of neighbors (illustrated below).
  • A user-tunable configuration option.
• Break up compute objects.
  • Another user-tunable configuration option.
  • Balances the tradeoff between grain size and overhead.
• PME pencil decomposition efforts.
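As a rough illustration of why 2-away computation adds parallelism, the sketch below counts pairwise compute objects for 1-away versus 2-away interactions on a fixed patch grid. This is a simplification of what NAMD actually does (the 2-away options also refine the patch grid itself); the grid size and names are invented for the example.

    #include <cstdio>

    // Count unordered patch pairs within 'away' grid steps of each other
    // (self-pairs included); each pair corresponds to one compute object.
    long countPairs(int nx, int ny, int nz, int away) {
      long pairs = 0;
      for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y)
          for (int z = 0; z < nz; ++z)
            for (int dx = -away; dx <= away; ++dx)
              for (int dy = -away; dy <= away; ++dy)
                for (int dz = -away; dz <= away; ++dz) {
                  int X = x + dx, Y = y + dy, Z = z + dz;
                  if (X < 0 || X >= nx || Y < 0 || Y >= ny ||
                      Z < 0 || Z >= nz)
                    continue;
                  // count each unordered pair exactly once
                  if (dx > 0 || (dx == 0 && dy > 0) ||
                      (dx == 0 && dy == 0 && dz >= 0))
                    ++pairs;
                }
      return pairs;
    }

    int main() {
      printf("1-away: %ld compute objects\n", countPairs(8, 8, 8, 1));
      printf("2-away: %ld compute objects\n", countPairs(8, 8, 8, 2));
    }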
Overlap of Computation and Communication
• Hurt by the lack of cache coherence.
• One processor could serve as a communication co-processor if the L1 caches were flushed for large messages, but that costs too much.
• Instead, make use of the FIFO link buffers: every so often in NAMD's outer loop, we make AdvanceCommunication() calls (sketched below).
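The shape of those calls might look like the sketch below. Apart from the AdvanceCommunication() name mentioned above, everything here is hypothetical (the chunking of work, the poll interval, the stub bodies); it only illustrates interleaving explicit network progress with compute work when no co-processor is pumping the FIFOs.

    #include <cstdio>

    // Stubs standing in for the real routines (hypothetical bodies).
    static void computeChunk(int i)    { /* one slice of force computation */ }
    static void AdvanceCommunication() { /* poll and drain torus FIFO buffers */ }

    // One outer-loop step: do compute work in chunks and periodically pump
    // the network so outgoing messages make progress while we compute.
    void outerLoopStep(int numChunks) {
      const int pollInterval = 4;  // tuning knob: how often to pump (assumed)
      for (int i = 0; i < numChunks; ++i) {
        computeChunk(i);
        if (i % pollInterval == 0)
          AdvanceCommunication();
      }
      AdvanceCommunication();      // final drain before the step completes
    }

    int main() { outerLoopStep(32); }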
Outline
• BG/L Platform overview
• Optimization Efforts: Context
• Optimization Efforts: Approaches
• Results
Results (ApoA1 benchmark on the Watson BG/L; "co" = co-processor mode)

Nodes   Processors   Mode   Time per step
32      32           co     347 ms
128     128          co     97.2 ms
512     512          co     23.7 ms
1024    1024         co     13.8 ms
2048    2048         co     8.6 ms
4096    4096         co     6.2 ms

Scaling to 8192 processors was achieved at 5.2 ms per step.
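As a rough consistency check on these numbers: going from 32 to 4096 processors (a 128x increase) reduces the time per step from 347 ms to 6.2 ms, a speedup of about 56x, i.e., roughly 56/128 ≈ 44% parallel efficiency relative to the 32-processor run.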