
ANTON, D. E. Shaw Research

Presentation Transcript


  1. ANTON, D. E. Shaw Research

  2. Force Fields: Typical Energy Functions • Bond stretches • Angle bending • Torsional rotation • Improper torsion (sp2) • Electrostatic interaction • Lennard-Jones interaction
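
The terms listed on this slide correspond to a potential energy function of roughly the following form (a generic sketch; the exact functional forms and parameters vary between force fields such as AMBER or CHARMM and are not specified in the slides):

U = \sum_{\text{bonds}} k_b (b - b_0)^2
  + \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
  + \sum_{\text{torsions}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]
  + \sum_{\text{impropers}} k_\omega (\omega - \omega_0)^2
  + \sum_{i<j} \left[ \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}
  + 4\varepsilon_{ij} \left( \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12}
  - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right) \right]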

  3. Molecular Dynamics • Solve Newton’s equations of motion for a molecular system: m_i d²r_i/dt² = F_i = −∇_i U(r_1, …, r_N)

  4. Integrator: Verlet Algorithm • Start with {r(t), v(t)} and integrate to {r(t+Δt), v(t+Δt)} • The new position at t+Δt: r(t+Δt) = r(t) + v(t)Δt + ½a(t)Δt² (1) • Similarly, the old position at t−Δt: r(t−Δt) = r(t) − v(t)Δt + ½a(t)Δt² (2) • Adding (1) and (2): r(t+Δt) = 2r(t) − r(t−Δt) + a(t)Δt² (3) • Thus the velocity at t is: v(t) = [r(t+Δt) − r(t−Δt)] / (2Δt) (4)
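
A minimal Python sketch of this update (assuming NumPy arrays of shape (N, 3) for positions and forces and shape (N,) for masses; the function name and array layout are our own, not from the slides):

import numpy as np

def verlet_step(r, r_prev, forces, masses, dt):
    """One Verlet step: r(t+dt) = 2 r(t) - r(t-dt) + a(t) dt^2."""
    a = forces / masses[:, None]            # a(t) = F(t) / m, per atom
    r_next = 2.0 * r - r_prev + a * dt**2   # new positions, equation (3)
    v = (r_next - r_prev) / (2.0 * dt)      # velocities at t, equation (4)
    return r_next, v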

  5. Molecular Dynamics • Integrate Newton’s laws of motion • Iterate, iterate, iterate ... and iterate ... and iterate

  6. Two Distinct Problems Problem 1: Simulate many short trajectories Problem 2: Simulate one long trajectory

  7. Simulating Many Short Trajectories • Can answer a surprising number of interesting questions • Can be done using • Many slow computers • Distributed processing approach • Little inter-processor communication • E.g., Pande’s Folding@Home project

  8. Simulating One Long Trajectory • Harder problem • Essential to elucidate many biologically interesting processes • Requires a single machine with • Extremely high performance • Truly massive parallelism • Lots of inter-processor communication

  9. DESRES Goal • Single, millisecond-scale MD simulations (long trajectories) • Protein with 64K or more atoms • Explicit water molecules • Why? • That’s the time scale at which many biologically interesting things start to happen

  10. Protein Folding Image: Istvan Kolossvary & Annabel Todd, D. E. Shaw Research

  11. What Will It Take to Simulate a Millisecond? • We need an enormous increase in speed • Current (single processor): ~100 ms / fs • Goal will require < 10 µs / fs • Required speedup: > 10,000x faster than current single-processor speed, ~1,000x faster than current parallel implementations • Can’t accept >10,000x the power (~5 megawatts)!
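
As a back-of-the-envelope check of these figures (our arithmetic, not from the slides): one millisecond of simulated time at a 1 fs timestep is about 10^12 steps. At ~100 ms of wall-clock time per step that is roughly 3,000 years; at ~10 µs per step it drops to on the order of four months.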

  12. What Takes So Long? • Inner loop of force field evaluation looks at all pairs of atoms (within distance R) • On the order of 64K atoms in a typical system • Repeat ~10^12 times • Current approaches are too slow by several orders of magnitude • What can be done?
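
For concreteness, the inner loop being described is essentially the following (a naive Python sketch; the cutoff handling and the pair_force routine are illustrative placeholders, and production codes use cell or neighbor lists rather than scanning all pairs):

import numpy as np

def pairwise_forces(r, cutoff, pair_force):
    """Naive loop over all atom pairs within the cutoff radius R (O(N^2) work)."""
    n = len(r)
    forces = np.zeros_like(r)
    for i in range(n):
        for j in range(i + 1, n):
            d = r[j] - r[i]
            dist = np.linalg.norm(d)
            if dist < cutoff:
                f = pair_force(d, dist)    # e.g. electrostatics + Lennard-Jones
                forces[i] += f
                forces[j] -= f             # Newton's third law
    return forces

With ~64K atoms and ~10^12 time steps, this pair loop dominates the runtime, which is why Anton moves it into dedicated hardware.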

  13. MD Simulator Requirements • Parallelization (getting an idea of the level of computation needed) • 1) For every time step, every atom must exchange data with every other atom within its cutoff radius • 2) A large amount of inter-processor communication that scales well is needed

  14. MD Simulator Requirements • Parallelization (getting an idea of the level of computation needed) • The whole system is broken down into boxes (one per processing node) • Each node handles the bonded interactions within its box • The NT method handles non-bonded interactions (by far the most numerous) • The NT method is also used for atom migration between boxes
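
A rough Python sketch of the box decomposition (simplified: a cubic periodic system, one box per node, and no treatment of the NT import regions; the names and array shapes are our own):

import numpy as np

def assign_to_boxes(r, box_length, boxes_per_side):
    """Map each atom to the home box (processing node) that owns its position,
    for a cubic periodic system divided into boxes_per_side**3 boxes."""
    side = box_length / boxes_per_side
    idx = np.floor((r % box_length) / side).astype(int)   # (N, 3) integer box indices
    idx = np.clip(idx, 0, boxes_per_side - 1)             # guard against rounding at the edge
    return (idx[:, 0] * boxes_per_side + idx[:, 1]) * boxes_per_side + idx[:, 2]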

  15. Why Specialized Hardware? (Needs: memory, communication, computation) • 1) Need a huge number of arithmetic processing elements • 2) A large amount of inter-processor communication that scales well is needed • 3) Memory is not an issue: with 25,000 atoms (64 bytes each), the total is 1.6 MB; spread over 512 nodes, that is about 3.2 KB per node, which fits in most L1 caches

  16. Why Specialized Hardware? (Needs: memory, communication, computation) • Consider Moore’s Law: roughly 10x improvement in 5 years, vs. Anton’s ~1,000x in one year. Can great discoveries wait? • Can use custom pipelines with more precision and faster datapath logic in less silicon area • Tailored ISAs for geometric calculations • Programmability for accommodating various force fields and integration algorithms • Dedicated memory for each particle to accumulate forces

  17. ANTON Strategy • New architectures • Design a specialized machine • Enormously parallel architecture • Based on special-purpose ASICs • Dramatically faster for MD, but less flexible • Projected completion: 2008 • New algorithms • Applicable to • Conventional clusters • Our own machine • Scale to very large # of processing elements

  18. Interdisciplinary Lab

  19. Alternative Machine Architectures • Conventional cluster of commodity processors • General-purpose scientific supercomputer • Special-purpose molecular dynamics machine

  20. Conventional Cluster of Commodity Processors • Strengths: • Flexibility • Mass market economies of scale • Limitations • Doesn’t exploit special features of the problem • Communication bottlenecks • Between processor and memory • Among processors • Insufficient arithmetic power

  21. General-Purpose Scientific Supercomputer • E.g., IBM Blue Gene • More demanding goal than ours • General-purpose scientific supercomputing • Fast for wide range of applications • Strengths: • Flexibility • Ease of programmability • Limitations for MD simulations • Expensive • Still not fast enough for our purposes

  22. Anton: Special-Purpose MD Machine • Strengths: • Several orders of magnitude faster for MD • Excellent cost/performance characteristics • Limitations: • Not designed for other scientific applications • They’d be difficult to program • Still wouldn’t be especially fast • Limited flexibility

  23. Anton System-Level Organization • Multiple segments (probably 8 in the first machine) • 512 nodes (each consisting of one ASIC plus DRAM) per segment • Organized in an 8 x 8 x 8 toroidal mesh • Each ASIC delivers performance roughly equivalent to 500 general-purpose microprocessors • Each ASIC’s power is similar to a single microprocessor’s

  24. 3D Torus Network

  25. Why a 3D Torus? • Topology reflects physical space being simulated: • Three-dimensional nearest neighbor connections • Periodic boundary conditions • Bulk of communications is to near neighbors • No switching to reach immediate neighbors
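
For example, each node's six nearest neighbors on the 8 x 8 x 8 torus are found by wrapping coordinates periodically (a small sketch of the neighbor arithmetic; the actual routing is of course done in hardware):

def torus_neighbors(x, y, z, dim=8):
    """Six nearest-neighbor node coordinates on a dim x dim x dim 3D torus."""
    return [((x + 1) % dim, y, z), ((x - 1) % dim, y, z),
            (x, (y + 1) % dim, z), (x, (y - 1) % dim, z),
            (x, y, (z + 1) % dim), (x, y, (z - 1) % dim)]

A corner node such as (0, 0, 0) therefore reaches (7, 0, 0), (0, 7, 0), and (0, 0, 7) directly, mirroring the periodic boundary conditions of the simulated system.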

  26. Source of Speedup on Our Machine • Judicious use of arithmetic specialization • Flexibility, programmability only where needed • Elsewhere, hardware tailored for speed • Tables and parameters, but not programmable • Carefully choreographed communication • Data flows to just where it’s needed • Almost never need to access off-chip memory

  27. Two Subsystems on Each ASIC • Flexible Subsystem: programmable, general-purpose; efficient geometric operations; modest clock rate • Specialized Subsystem: pairwise point interactions; enormously parallel; aggressive clock rate

  28. Anton • 33M-gate ASIC • Two computational subsystems connected by a communication ring • Hardware datapaths compute over 25 billion interactions/s • 13 embedded processors • Full machine has 512 ASICs in a 3D torus

  29. Where We Use Specialized Hardware • Specialized hardware (with tables, parameters) where: • Inner loop • Simple, regular algorithmic structure • Unlikely to change • Examples: • Electrostatic forces • Van der Waals interactions

  30. Example: Particle Interaction Pipeline (one of 32)

  31. High-Throughput Interaction Subsystem • Executes non-bonded MD interaction calculations (charge spreading and force interpolation) • Accumulates forces on each particle as data streams through • The interaction control block (ICB) controls the flow of data through the HTIS, provides programmable ISA extensions, and acts as a buffering, prefetching, synchronization, and write-back controller
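
In essence, each pipeline evaluates a per-pair force of the following shape (a simplified software sketch; Anton's pipelines use fixed-function arithmetic with tabulated functions, and the constants and units below are illustrative only):

import numpy as np

def nonbonded_pair_force(d, qi, qj, sigma, eps, k_coulomb=1.0):
    """Force on atom i due to atom j (Lennard-Jones + Coulomb), with d = r_j - r_i."""
    r2 = np.dot(d, d)
    inv_r2 = 1.0 / r2
    sr6 = (sigma * sigma * inv_r2) ** 3
    lj = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) * inv_r2   # scalar such that F_LJ on i = -lj * d
    coul = k_coulomb * qi * qj * inv_r2 / np.sqrt(r2)    # scalar such that F_C  on i = -coul * d
    return -(lj + coul) * d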

  32. Array of 32 Particle Interaction Pipelines

  33. Advantages of Particle Interaction Pipelines • Save area that would have been allocated to • Cache • Control logic • Wires • Achieve extremely high arithmetic density • Save time that would have been spent on • Cache misses, • Load/store instructions • Misc. data shuffling

  34. Where We Use Flexible Hardware • Use programmable hardware where: • Algorithm less regular • Smaller % of total computation • E.g., local interactions (fewer of them) • More likely to change • Examples: • Bonded interactions • Bond length constraints • Experimentation with • New, short-range force field terms • Alternative integration techniques

  35. Forms of Parallelism in Flexible Subsystem • The Flexible Subsystem exploits three forms of parallelism: • Multi-core parallelism (4 Tensilicas, 8 Geometry Cores) • Instruction-level parallelism • SIMD parallelism – calculate on 3D and 4D vectors as single operation
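
As a loose software analogy to that SIMD style (not Anton's ISA), a single NumPy operation applied to whole arrays of 3D vectors plays the role of one instruction operating on a vector register:

import numpy as np

a = np.random.rand(1000, 3)             # 1,000 three-component vectors
b = np.random.rand(1000, 3)

sums = a + b                            # element-wise 3D vector addition in one operation
dots = np.einsum('ij,ij->i', a, b)      # per-vector dot products in one call
norms = np.linalg.norm(a, axis=1)       # per-vector lengths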

  36. Overview of the Flexible Subsystem GC = Geometry Core (each a VLIW processor)

  37. Geometry Core (one of 8; 64 pipelined lanes/chip)

  38. But Communication is Still a Bottleneck • Scalability is limited by inter-chip communication • To execute a single millisecond-scale simulation, • Need a huge number of processing elements • Must dramatically reduce the amount of data transferred between these processing elements • Can’t do this without fundamentally new algorithms: • A family of neutral territory (NT) methods that significantly reduce the communication load of pair interactions • A new variant of the Ewald method for distant interactions, Gaussian Split Ewald (GSE), which simplifies both calculation and communication • These are the subject of a different talk

  39. Anton in Action

  40. Simulation Evaluations • ~500x vs. NAMD • ~80-100x vs. Desmond • ~100x vs. Blue Matter

  41. GPU + FPGA? [Diagram: a proposed GPU + FPGA system handling FFT and LJ calculations, with 6x GDDR5, LVDS, 16x PCIe, and high-speed serial I/O at up to 2 Tbit/s between the FPGA and GPU]
