High Performance Linux Clusters Guru Session, Usenix, Boston June 30, 2004 Greg Bruno, SDSC
Overview of San Diego Supercomputer Center • Founded in 1985 • Non-military access to supercomputers • Over 400 employees • Mission: Innovate, develop, and deploy technology to advance science • Recognized as an international leader in: • Grid and Cluster Computing • Data Management • High Performance Computing • Networking • Visualization • Primarily funded by NSF
My Background • 1984 - 1998: NCR - Helped to build the world’s largest database computers • Saw the transition from proprietary parallel systems to clusters • 1999 - 2000: HPVM - Helped build Windows clusters • 2000 - Now: Rocks - Helping to build Linux-based clusters
Cluster Pioneers • In the mid-1990s, Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question: Can You Build a High Performance Machine From Commodity Components?
The Answer is: Yes Source: Dave Pierce, SIO
Types of Clusters • High Availability • Generally small (less than 8 nodes) • Visualization • High Performance • Computational tools for scientific computing • Large database machines
High Availability Cluster • Composed of redundant components and multiple communication paths
Visualization Cluster • Each node in the cluster drives a display
High Performance Cluster • Constructed with many compute nodes and often a high-performance interconnect
Cluster Processors • Pentium/Athlon • Opteron • Itanium
Processors: x86 • Most prevalent processor used in commodity clustering • Fastest integer processor on the planet: • 3.4 GHz Pentium 4, SPEC2000int: 1705
Processors: x86 • Capable floating point performance • #5 machine on Top500 list built with Pentium 4 processors
Processors: Opteron • Newest 64-bit processor • Excellent integer performance • SPEC2000int: 1655 • Good floating point performance • SPEC2000fp: 1691 • #10 machine on Top500
Processors: Itanium • First systems released June 2001 • Decent integer performance • SPEC2000int: 1404 • Fastest floating-point performance on the planet • SPEC2000fp: 2161 • Impressive Linpack efficiency: 86%
But What Do You Really Build? • Itanium: Dell PowerEdge 3250 • Two 1.4 GHz CPUs (1.5 MB cache) • 11.2 Gflops peak • 2 GB memory • 36 GB disk • $7,700 • Two 1.5 GHz CPUs (6 MB cache) bring the system cost to ~$17,700 • 1.4 GHz vs. 1.5 GHz • ~7% slower • ~56% cheaper (the 1.5 GHz system costs ~130% more)
Opteron • IBM eServer 325 • Two 2.0 GHz Opteron 246 • 8 Gflops peak • 2 GB memory • 36 GB disk • $4,539 • Two 2.4 GHz CPUs: $5,691 • 2.0 GHz vs. 2.4 GHz • ~17% slower • ~25% cheaper
Pentium 4 Xeon • HP DL140 • Two 3.06 GHz CPUs • 12 Gflops peak • 2 GB memory • 80 GB disk • $2,815 • Two 3.2 GHz: $3,368 • 3.06 GHz vs. 3.2 GHz • ~4% slower • ~20% cheaper
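A quick way to compare the three example nodes above is peak Gflops per dollar. The sketch below is a minimal illustration using only the 2004 list prices and peak figures quoted on these slides (peak, not sustained, performance):

```c
/* Price/performance sketch for the three example systems above.
 * All figures are taken from the slides (2004 list prices, peak Gflops);
 * nothing here is measured. */
#include <stdio.h>

struct node {
    const char *name;
    double peak_gflops;   /* peak, per dual-CPU node */
    double price_usd;     /* quoted node price */
};

int main(void)
{
    struct node nodes[] = {
        { "Itanium (2 x 1.4 GHz, Dell PowerEdge 3250)", 11.2, 7700.0 },
        { "Opteron (2 x 2.0 GHz, IBM eServer 325)",      8.0, 4539.0 },
        { "P4 Xeon (2 x 3.06 GHz, HP DL140)",           12.0, 2815.0 },
    };
    int i;

    for (i = 0; i < 3; i++) {
        printf("%-45s %5.1f Gflops / $%-7.0f = %.2f Mflops per dollar\n",
               nodes[i].name, nodes[i].peak_gflops, nodes[i].price_usd,
               1000.0 * nodes[i].peak_gflops / nodes[i].price_usd);
    }
    return 0;
}
```

By this metric the Xeon node delivers roughly three times the peak flops per dollar of the Itanium node, which lines up with the shipment numbers on the next slides.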
What People Are Buying • Gartner study • Servers shipped in 1Q04 • Itanium: 6,281 • Opteron: 31,184 • Opteron shipped 5x more servers than Itanium
What Are People Buying? • Gartner study • Servers shipped in 1Q04 • Itanium: 6,281 • Opteron: 31,184 • Pentium: 1,000,000 • Pentium shipped 30x more than Opteron
Interconnects • Ethernet • Most prevalent on clusters • Low-latency interconnects • Myrinet • Infiniband • Quadrics • Ammasso
Why Low-Latency Interconnects? • Performance • Lower latency • Higher bandwidth • Accomplished through OS-bypass
How Low Latency Interconnects Work • Decrease per-packet latency by reducing the number of memory copies per packet
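The latency and bandwidth figures quoted on the interconnect slides that follow are the kind of numbers produced by an MPI ping-pong microbenchmark. The sketch below is a minimal version of such a test (standard MPI calls only); it is an illustration, not the benchmark behind the quoted numbers. Small payloads expose latency; large payloads expose bandwidth.

```c
/* Minimal MPI ping-pong: ranks 0 and 1 bounce a buffer back and forth.
 * Half the average round-trip time approximates one-way latency, and
 * bytes / one-way time approximates bandwidth.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000, bytes = 1 << 20;   /* 1 MB payload */
    char *buf;
    double t0, t1, one_way;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        one_way = (t1 - t0) / (2.0 * iters);   /* seconds per one-way trip */
        printf("one-way time: %.1f us, bandwidth: %.1f MB/s\n",
               one_way * 1e6, (bytes / one_way) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks (e.g. `mpirun -np 2 ./pingpong`); shrink the payload to a few bytes to measure latency rather than bandwidth.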
Bisection Bandwidth • Definition: if you split the system in half, what is the maximum amount of data that can pass between the two halves? • Assuming 1 Gb/s links: • Bisection bandwidth = 1 Gb/s
Bisection Bandwidth • Assuming 1 Gb/s links: • Bisection bandwidth = 2 Gb/s
Bisection Bandwidth • Definition: a network has full bisection bandwidth when its topology can support N/2 simultaneous communication streams • That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed.
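Stated as arithmetic: for N nodes with links of rate R, full bisection bandwidth means the worst-case cut can carry N/2 simultaneous full-rate streams, i.e. (N/2) x R. A minimal sketch of that definition (my own illustration; the node counts are examples, not from the slides):

```c
/* Full bisection bandwidth: N/2 simultaneous streams at full link rate. */
#include <stdio.h>

static double full_bisection_gbps(int nodes, double link_gbps)
{
    return (nodes / 2) * link_gbps;
}

int main(void)
{
    printf("  8 nodes @ 1 Gb/s -> %3.0f Gb/s across the bisection\n",
           full_bisection_gbps(8, 1.0));
    printf("128 nodes @ 1 Gb/s -> %3.0f Gb/s across the bisection\n",
           full_bisection_gbps(128, 1.0));
    return 0;
}
```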
Large Networks • When you run out of ports on a single switch, you must add another network stage • In the example above: assuming 1 Gb/s links, the uplinks from stage 1 switches to stage 2 switches must carry at least 6 Gb/s
Large Networks • With low-port-count switches, you need many switches on large systems to maintain full bisection bandwidth • A 128-node system with 32-port switches requires 12 switches and 256 total cables
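The 12-switch / 256-cable figure follows from each leaf switch splitting its 32 ports evenly between node links and uplinks. A small sketch of that arithmetic (assuming the even split implied on this slide):

```c
/* Two-stage full-bisection tree: each leaf switch uses half its ports
 * for nodes and half for uplinks, so every node link is matched by an
 * uplink of the same rate.  Checks the 128-node / 32-port example. */
#include <stdio.h>

int main(void)
{
    int nodes = 128, ports = 32;

    int nodes_per_leaf = ports / 2;                    /* 16 down, 16 up */
    int leaf_switches  = nodes / nodes_per_leaf;       /* 128 / 16 = 8   */
    int uplinks        = leaf_switches * (ports / 2);  /* 8 * 16 = 128   */
    int spine_switches = uplinks / ports;              /* 128 / 32 = 4   */
    int cables         = nodes + uplinks;              /* 128 + 128 = 256 */

    printf("leaf: %d  spine: %d  total switches: %d  cables: %d\n",
           leaf_switches, spine_switches, leaf_switches + spine_switches,
           cables);
    return 0;
}
```

With 16 uplinks per leaf, each of the 8 leaf switches can spread 4 uplinks across each of the 4 spine switches, exactly filling the spines' 32 ports.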
Myrinet • Long-time interconnect vendor • Delivering products since 1995 • Deliver single 128-port full bisection bandwidth switch • MPI Performance: • Latency: 6.7 us • Bandwidth: 245 MB/s • Cost/port (based on 64-port configuration): $1000 • Switch + NIC + cable • http://www.myri.com/myrinet/product_list.html
Myrinet • Recently announced 256-port switch • Available August 2004
Myrinet • #5 System on Top500 list • System sustains 64% of peak performance • But smaller Myrinet-connected systems hit 70-75% of peak
Quadrics • QsNetII E-series • Released at the end of May 2004 • Deliver 128-port standalone switches • MPI Performance: • Latency: 3 us • Bandwidth: 900 MB/s • Cost/port (based on 64-port configuration): $1800 • Switch + NIC + cable • http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227
Quadrics • #2 on Top500 list • Sustains 86% of peak • Other Quadrics-connected systems on Top500 list sustain 70-75% of peak
Infiniband • Newest cluster interconnect • Currently shipping 32-port switches and 192-port switches • MPI Performance: • Latency: 6.8 us • Bandwidth: 840 MB/s • Estimated cost/port (based on 64-port configuration): $1700 - 3000 • Switch + NIC + cable • http://www.techonline.com/community/related_content/24364
Ethernet • Latency: 80 us • Bandwidth: 100 MB/s • The Top500 list has Ethernet-based systems sustaining between 35% and 59% of peak
Ethernet • What we did with 128 nodes and a $13,000 Ethernet network • $101 / port • $28 / port with our latest Gigabit Ethernet switch • Sustained 48% of peak • With Myrinet, the system would have sustained ~1 Tflop • At a cost of ~$130,000 for the Myrinet network • Roughly 1/3 the cost of the system
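Scaling the per-port prices quoted on the preceding interconnect slides up to this 128-node cluster gives a rough network-cost comparison. The sketch below uses only those quoted 2004 per-port figures (switch + NIC + cable) plus the $101/port Ethernet figure above:

```c
/* Rough network cost for a 128-node cluster at the per-port prices
 * quoted on the preceding slides (switch + NIC + cable, 2004 list). */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double per_port; } nets[] = {
        { "Gigabit Ethernet (Rockstar)", 101.0 },
        { "Myrinet",                    1000.0 },
        { "Infiniband (low estimate)",  1700.0 },
        { "Quadrics QsNetII",           1800.0 },
    };
    int i, nodes = 128;

    for (i = 0; i < 4; i++)
        printf("%-30s $%8.0f total ($%.0f/port)\n",
               nets[i].name, nets[i].per_port * nodes, nets[i].per_port);
    return 0;
}
```

The Myrinet line reproduces the ~$130,000 figure on this slide.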
Rockstar Topology • 24-port switches • Not a symmetric network • Best case - 4:1 bisection bandwidth • Worst case - 8:1 • Average - 5.3:1
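The bisection ratios above come from how each 24-port switch splits its ports between node links and uplinks. The slide does not give the exact Rockstar port split, so the 20/4 split in the sketch below is a hypothetical example of the arithmetic, not the actual wiring:

```c
/* Oversubscription (bisection) ratio at a leaf switch: node-facing
 * links divided by uplinks.  The 20/4 split is an illustrative example
 * for a 24-port switch, not the actual Rockstar wiring. */
#include <stdio.h>

int main(void)
{
    int ports = 24;
    int uplinks = 4;                   /* hypothetical split */
    int node_links = ports - uplinks;  /* 24 - 4 = 20        */

    printf("%d node links over %d uplinks -> %.1f:1 oversubscription\n",
           node_links, uplinks, (double)node_links / uplinks);
    return 0;
}
```

A 20/4 split gives 5:1, close to the 5.3:1 average quoted above; different splits across the switches account for the 4:1 best case and 8:1 worst case.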
Low-Latency Ethernet • Brings OS-bypass to Ethernet • Projected performance: • Latency: less than 20 us • Bandwidth: 100 MB/s • Potentially could merge the management and high-performance networks • Vendor: Ammasso
Local Storage • Exported to compute nodes via NFS
Network Attached Storage • A NAS box is an embedded NFS appliance
Storage Area Network • Provides a disk block interface over a network (Fibre Channel or Ethernet) • Moves the shared disks out of the servers and onto the network • Still requires a central service to coordinate file system operations
Parallel Virtual File System • PVFS version 1 has no fault tolerance • PVFS version 2 (in beta) has fault tolerance mechanisms