Learning From the Stanford/DOE Visualization Cluster

Presentation Transcript


  1. Learning From the Stanford/DOE Visualization Cluster Mike Houston, Greg Humphreys, Randall Frank, Pat Hanrahan

  2. Outline
  • Stanford’s current cluster
    • Design decisions
    • Performance evaluation
    • Bottleneck evaluation
  • Cluster “Landscape”
    • General classification
    • Bottleneck evaluation
  • Stanford’s next cluster
    • Design goals
    • Research directions

  3. Stanford/DOE Visualization Cluster The Chromium Cluster

  4. Cluster Configuration (Jan. 2000)
  • Cluster: 32 graphics nodes + 4 server nodes
  • Computer: Compaq SP750
    • 2 processors (800 MHz PIII Xeon, 133 MHz FSB)
    • i840 core logic (big issue for vis clusters): simultaneous fast graphics and networking
    • Network: 64-bit, 66 MHz PCI
    • Graphics: AGP-4x
    • 256 MB memory
    • 18 GB SCSI-160 disk (+ 3 x 36 GB on servers)
  • Graphics (Sept. 2002)
    • 16 NVIDIA GeForce3 w/ DVI (64 MB)
    • 16 NVIDIA GeForce4 Ti 4200 w/ DVI (128 MB)
  • Network
    • Myrinet 64-bit, 66 MHz (LANai 7)

  5. Graphics Evaluation
  • NVIDIA GeForce3
    • 25 MTri/s triangle rate observed
    • 680 MPix/s fill rate observed
  • NVIDIA GeForce4
    • 60 MTri/s triangle rate observed
    • 800 MPix/s fill rate observed
  • Read Pixels performance
    • 35 MPix/s (140 MB/s) RGBA
    • 22 MPix/s (87 MB/s) Depth
  • Draw Pixels performance
    • 45 MPix/s (180 MB/s) RGBA
    • 21 MPix/s (85 MB/s) Depth
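The Read/Draw Pixels figures above are the kind of numbers a simple timing loop around glReadPixels produces. Below is a minimal sketch of such a readback microbenchmark; it assumes GLUT for context creation, and the window size, iteration count, and RGBA format are illustrative choices, not the deck's actual benchmark settings.

```c
/* Minimal readback microbenchmark sketch (assumes GLUT for the GL context). */
#include <GL/glut.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    const int w = 1024, h = 768, iters = 50;      /* illustrative sizes */
    unsigned char *buf = malloc((size_t)w * h * 4);

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH);
    glutInitWindowSize(w, h);
    glutCreateWindow("readback benchmark");

    glFinish();                                   /* drain pending GL work */
    double t0 = now_sec();
    for (int i = 0; i < iters; i++)
        glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    glFinish();                                   /* wait for the last read */
    double dt = now_sec() - t0;

    double mpix = (double)w * h * iters / dt / 1e6;
    printf("RGBA readback: %.1f MPix/s (%.1f MB/s)\n", mpix, mpix * 4);
    free(buf);
    return 0;
}
```

The draw-path numbers can be measured the same way with glDrawPixels in place of glReadPixels.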

  6. Network Evaluation
  • Myrinet LANai 7 PCI64A boards
    • Theoretical limit: 160 MB/s
    • 142 MB/s observed peak under Linux
    • ~100 MB/s observed sustained under Linux
  • ServerNet not chosen
    • Driver support
    • Large switching infrastructure required
  • Gigabit Ethernet
    • Performance and scalability concerns
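For reference, the "observed peak / sustained" figures come from point-to-point streaming tests. The sketch below shows the general measurement pattern over a plain TCP socket; the original measurements used Myrinet's native API, and the peer address, port, and transfer sizes here are placeholders.

```c
/* Generic point-to-point bandwidth probe (TCP stand-in for the real test). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 1 << 20;                  /* 1 MB per send */
    const int    iters = 1024;                     /* 1 GB total */
    char *buf = calloc(1, chunk);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                 /* placeholder port */
    inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr);/* placeholder peer */
    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) != 0) {
        perror("connect");
        return 1;
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        size_t off = 0;
        while (off < chunk) {                      /* handle short writes */
            ssize_t n = write(fd, buf + off, chunk - off);
            if (n <= 0) { perror("write"); return 1; }
            off += (size_t)n;
        }
    }
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("%.1f MB/s\n", (double)chunk * iters / sec / 1e6);
    close(fd);
    free(buf);
    return 0;
}
```

The receiving side only needs to accept the connection and drain the bytes (a netcat listener is enough for a rough number).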

  7. Myrinet Issues
  • Fairness: clients starved of network resources
    • Implemented a credit scheme to minimize congestion
  • Lack of buffering in the switching fabric
    • Causes poor performance under high load
    • Open issue
  [Charts: partitioned cluster vs. unpartitioned cluster]
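A rough sketch of the kind of credit-based flow control mentioned above: the receiver grants a fixed number of credits, the sender consumes one per message and stalls at zero, and acknowledgements replenish the pool. The names and credit limit are illustrative; this is not the actual Chromium/WireGL implementation.

```c
/* Credit-based flow control sketch (names and limits are illustrative). */
#include <stdio.h>

#define MAX_CREDITS 16              /* outstanding, unacknowledged messages */

typedef struct {
    int credits;                    /* sends the receiver has granted */
} cr_link;

/* Sender side: only transmit when the receiver has granted a credit. */
static int cr_try_send(cr_link *l)
{
    if (l->credits == 0)
        return 0;                   /* would congest the fabric: caller waits */
    l->credits--;
    /* ... hand one message to the NIC here ... */
    return 1;
}

/* Called when the receiver acknowledges that it drained a message. */
static void cr_on_ack(cr_link *l)
{
    if (l->credits < MAX_CREDITS)
        l->credits++;
}

int main(void)
{
    cr_link link = { MAX_CREDITS };
    int sent = 0, blocked = 0;

    /* Pretend the receiver acknowledges only every other send attempt. */
    for (int i = 0; i < 40; i++) {
        if (cr_try_send(&link)) sent++; else blocked++;
        if (i % 2 == 0) cr_on_ack(&link);
    }
    printf("sent=%d blocked=%d\n", sent, blocked);
    return 0;
}
```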

  8. i840 Chipset Evaluation
  • 66 MHz, 64-bit PCI performance is not full speed:
    • 210 MB/s PCI read (40% of theoretical peak)
    • 288 MB/s PCI write (54% of theoretical peak)
    • Combined read/write: ~121 MB/s
  • AGP
    • Fast Writes / Side Band Addressing unstable under Linux
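For context, the percentages above are taken against the 64-bit, 66 MHz PCI theoretical peak of roughly 533 MB/s (8 bytes per clock at 66.66 MHz), as the small calculation below shows.

```c
/* Where the percentages on this slide come from: 8 bytes/clock at 66.66 MHz. */
#include <stdio.h>

int main(void)
{
    const double peak_mb = 66.66e6 * 8 / 1e6;     /* ~533 MB/s theoretical */
    const double read_mb = 210, write_mb = 288;   /* measured i840 numbers */

    printf("peak  = %.0f MB/s\n", peak_mb);
    printf("read  = %.1f%% of peak\n", 100 * read_mb  / peak_mb);  /* ~40% */
    printf("write = %.1f%% of peak\n", 100 * write_mb / peak_mb);  /* ~54% */
    return 0;
}
```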

  9. Sort-First Performance
  • Configuration
    • Application runs on the client node
    • Primitives distributed to servers
  • Tiled display
    • 4x3 tiles @ 1024x768
    • Total resolution: 4096x2304 (9 megapixels)
  • Quake 3: 50 fps
  • Atlantis: 450 fps
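The heart of sort-first distribution is deciding which tile servers each primitive must be sent to. The sketch below classifies a screen-space bounding box against the 4x3 mosaic of 1024x768 tiles described above; the names and structure are illustrative, not the WireGL/Chromium code.

```c
/* Sort-first tile classification sketch for a 4x3 mosaic of 1024x768 tiles. */
#include <stdio.h>

#define TILE_W   1024
#define TILE_H   768
#define TILES_X  4
#define TILES_Y  3

typedef struct { int x0, y0, x1, y1; } bbox;      /* screen space, pixels */

/* Mark which tiles (i.e. which servers) the bounding box touches. */
static void tiles_overlapped(bbox b, int hit[TILES_Y][TILES_X])
{
    int tx0 = b.x0 / TILE_W, tx1 = b.x1 / TILE_W;
    int ty0 = b.y0 / TILE_H, ty1 = b.y1 / TILE_H;

    for (int ty = ty0; ty <= ty1 && ty < TILES_Y; ty++)
        for (int tx = tx0; tx <= tx1 && tx < TILES_X; tx++)
            hit[ty][tx] = 1;
}

int main(void)
{
    int hit[TILES_Y][TILES_X] = { { 0 } };
    bbox b = { 900, 700, 1200, 900 };             /* straddles four tiles */

    tiles_overlapped(b, hit);
    for (int ty = 0; ty < TILES_Y; ty++) {
        for (int tx = 0; tx < TILES_X; tx++)
            printf("%d ", hit[ty][tx]);
        printf("\n");
    }
    return 0;
}
```

Primitives that touch only one tile generate traffic to only one server, which is why sort-first scales well when the working set is screen-space coherent.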

  10. Sort-Last Performance
  • Configuration
    • Parallel rendering on multiple nodes
    • Composite to the final display node
  • Volume rendering on 16 nodes
    • 1.57 GVox/s [Humphreys 02]
    • 1.82 GVox/s (tuned, Sept. 2002)
    • 256x256x1024 volume rendered twice (data courtesy of G.A. Johnson, G.P. Cofer, S.L. Gewalt, and L.W. Hedlund, Duke Center for In Vivo Microscopy, an NIH/NCRR National Resource)
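On the display node, the partial images arriving from the render nodes are blended in depth order. The sketch below shows a minimal back-to-front "over" composite of premultiplied RGBA partial images; the tiny buffer size and two-node setup are illustrative, and the real system streams tiles over the network rather than compositing whole in-memory frames.

```c
/* Minimal back-to-front "over" compositing sketch for sort-last rendering. */
#include <stdio.h>

#define W 4
#define H 4

typedef struct { float r, g, b, a; } rgba;        /* premultiplied alpha */

/* dst = src OVER dst, applied to one full partial image. */
static void composite_over(rgba dst[H][W], const rgba src[H][W])
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float k = 1.0f - src[y][x].a;
            dst[y][x].r = src[y][x].r + k * dst[y][x].r;
            dst[y][x].g = src[y][x].g + k * dst[y][x].g;
            dst[y][x].b = src[y][x].b + k * dst[y][x].b;
            dst[y][x].a = src[y][x].a + k * dst[y][x].a;
        }
}

int main(void)
{
    rgba final[H][W] = { 0 };                     /* display node's buffer */
    rgba partial[2][H][W] = { 0 };                /* two nodes' partial images */

    partial[0][0][0] = (rgba){ 0.0f, 0.5f, 0.0f, 0.5f };   /* far brick  */
    partial[1][0][0] = (rgba){ 0.5f, 0.0f, 0.0f, 0.5f };   /* near brick */

    composite_over(final, partial[0]);            /* back ...      */
    composite_over(final, partial[1]);            /* ... to front  */
    printf("pixel(0,0) = %.2f %.2f %.2f %.2f\n",
           final[0][0].r, final[0][0].g, final[0][0].b, final[0][0].a);
    return 0;
}
```

The ordering comes from each brick's depth along the view direction, which is why the compositing step, not the rendering, tends to stress the network and the readback path.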

  11. Cluster Accomplishments
  • Development Platform
    • WireGL
    • Chromium
    • Cluster configuration replicated
  • Interactive Performance
    • 256x512x1024 volume @ 15 fps
    • 9-megapixel Quake 3 @ 50 fps

  12. Sources of Bottlenecks
  • Sort-First
    • Packing speed (processor)
    • Primitive distribution (network and bus)
    • Rendering (processor and graphics chip)
  • Sort-Last
    • Rendering (graphics chip)
    • Composite (network, bus, and read/draw pixels)

  13. Bottleneck Evaluation – Stanford • Sort-First: Processor and Network • Sort-Last: Network and Read/Draw

  14. The Landscape of Graphics Clusters
  • Many Options
    • Low end: <$2500/node
    • Mid end: ~$5000/node
    • High end: >$7500/node
  • Tradeoffs
    • Different bottlenecks
    • Price/Performance
    • Scalability
    • Usage
  • Evaluation
    • Based on published benchmarks and specs

  15. Cluster Interconnect Options
  • Many choices
    • GigE: ~100 MB/s
    • Myrinet 2000 (http://www.myrinet.com): 245 MB/s
    • SCI/Dolphin (http://www.dolphinics.com): 326 MB/s
    • Quadrics (http://www.quadrics.com): 340 MB/s
  • Future options
    • 10 GigE
    • InfiniBand
    • HyperTransport

  16. Low End
  • General Definition
    • Single CPU
    • Consumer mainboard
    • Integrated graphics
    • High-speed commodity network
  • Example Node Configuration
    • NVIDIA nForce2
    • AMD Athlon 2400+
    • 512 MB DDR
    • GigE and 10/100
    • 1U rack chassis
    • Estimated price: $1500

  17. Bottleneck Evaluation – Low End • Bus/Network limited

  18. Mid End
  • General Definition
    • Dual processor
    • “Workstation” mainboard
    • High-performance bus (64-bit PCI or PCI-X)
    • High-speed commodity / low-end cluster interconnect
    • High-end consumer graphics board
  • Example Node Configuration
    • Intel i860
    • Dual Intel P4 Xeon 2.4 GHz
    • 2 GB RDRAM
    • ATI Radeon 9700
    • GigE onboard + Myrinet 2000
    • 2U rack chassis
    • Estimated price: $4000

  19. Bottleneck Evaluation – Mid End • Sort-First: Network limited • Sort-Last: Read/Draw and Network limited

  20. High End
  • General Definition
    • Dual or quad processor
    • Cutting-edge bus (PCI-X, HyperTransport, PCI Enhanced)
    • High-speed commodity / high-end cluster interconnect
    • “Professional” graphics board
    • RAID system
  • Example Node Configuration
    • ServerWorks GC-WS
    • Dual P4 Xeon 2.6 GHz
    • NVIDIA Quadro4 900XGL
    • 4 GB DDR
    • GigE onboard + InfiniBand
    • Estimated price: $7500

  21. Bottleneck Evaluation – High End • Sort-First: Well balanced • Sort-Last: Read/Draw limited

  22. Balanced System is Key • Only as fast as slowest component • Spend money where it matters!
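A back-of-envelope illustration of the point, using the per-stage rates measured earlier in the deck (140 MB/s readback, ~100 MB/s sustained network, 180 MB/s draw) applied to one 1024x768 RGBA partial image per node: whichever stage is slowest caps the achievable composite rate. The single-stage, per-node model below is deliberately simplistic; real compositing overlaps stages and partitions the screen.

```c
/* "Only as fast as the slowest component": per-stage rate vs. pipeline rate. */
#include <stdio.h>

int main(void)
{
    const double frame_mb  = 1024.0 * 768.0 * 4 / 1e6;    /* ~3.1 MB RGBA frame */
    const double rate_mb_s[] = { 140.0, 100.0, 180.0 };   /* readback, net, draw */
    const char  *name[]      = { "readback", "network", "draw" };

    double slowest = rate_mb_s[0];
    for (int i = 1; i < 3; i++)
        if (rate_mb_s[i] < slowest) slowest = rate_mb_s[i];

    for (int i = 0; i < 3; i++)
        printf("%-8s alone: %5.1f frames/s\n", name[i], rate_mb_s[i] / frame_mb);
    printf("pipeline bound by slowest stage: %.1f frames/s\n", slowest / frame_mb);
    return 0;
}
```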

  23. Goals for Next Cluster
  • Performance
    • Sort-Last: 5 GVox/s, 1 GTri/s
    • Sort-First at 4096x2304: Quake 3 @ >100 fps
  • Research
    • Remote visualization
    • Time-varying datasets
    • Compositing

  24. What we plan to build
  • 16-node cluster, 1U nodes
  • Mainboard chipsets
    • Intel Placer
    • ServerWorks GC-WS
    • AMD Hammer
  • Memory: 2-4 GB
  • Graphics chip
    • NVIDIA NV30
    • ATI R300/350
  • Interconnect: InfiniBand, Quadrics
  • Disk: IDE RAID or SCSI

  25. Continuing Chipset Issues
  • Why do chipsets perform so poorly?
  • “Workstation”
    • Intel i860: 215 MB/s read (40% of theoretical), 300 MB/s write (56% of theoretical)
    • AMD 760MPX: 300 MB/s read (56% of theoretical), 312 MB/s write (59% of theoretical)
  • “Server”
    • ServerWorks ServerSet III LE: 423 MB/s read (79% of theoretical), 486 MB/s write (91% of theoretical)
  • Why can’t a “server” have an AGP slot?
  Performance numbers from http://www.conservativecomputer.com

  26. Ongoing Bottlenecks
  • Readback performance
    • Will be fixed “soon”
    • Hardware compositing?
  • Chipset performance
    • Only a fraction of theoretical peak is achieved
    • Need faster buses in commodity chipsets
  • Network performance
    • Scalability
    • Fast is VERY expensive

  27. Conclusions
  • What we still need
    • More vendors
    • More chipsets
    • More performance
  • Graphics clusters are getting better
    • Chipsets
    • Interconnects
    • Form factor
    • Processing
    • Graphics chips
  • Things are really starting to get interesting!
