Commodity Computing Clusters - next generation supercomputers? Paweł Pisarczyk, ATM S. A. pawel.pisarczyk@atm.com.pl
Agenda • Introduction • Supercomputer classification • Architecture and implementations • Commodity clusters • Processors • Operating systems • Summary
Supercomputer • “A supercomputer is a device for turning compute-bound problems into I/O-bound problems” - Seymour Cray • A supercomputer is a computer system that leads the world in terms of processing capacity, particularly speed of calculations, at the time of its introduction. Source: http://en.wikipedia.org
Supercomputer History (1) • 1945-50 - Manchester Mark I • 1950-55 - MIT Whirlwind • 1955-60 - IBM 7090 - 210 KFLOPS • 1960-65 - CDC 6600 - 10.24 MFLOPS • 1965-70 - CDC 7600 - 32.27 MFLOPS • 1970-75 - CDC Cyber 76
Supercomputer History (2) • 1975-80 - Cray-1 - 160 MFLOPS • 1980-85 - Cray X-MP - 500 MFLOPS • 1985-90 - Cray Y-MP - 1.3 GFLOPS • 1990-95 - Fujitsu Numerical Wind Tunnel - 236 GFLOPS • 1995-00 - Intel ASCI Red - 2.150 TFLOPS • 2000-02 - IBM ASCI White, SP Power3 375 MHz - 7.226 TFLOPS • 2002-03 - NEC Earth Simulator - 35 TFLOPS
Supercomputer Classes (1) • General-purpose supercomputers: • vector processing machines - the same operation is carried out on a large amount of data simultaneously • tightly connected cluster computers (NUMA) - communication-oriented architectures engineered from the ground up, based on high-speed interconnects and a large number of processors • commodity clusters - a collection of a large number of commodity PCs (COTS) interconnected by a high-bandwidth, low-latency network
Supercomputer Classes (2) • Special-purpose supercomputers - high-performance computing devices with a hardware architecture dedicated to solving a single problem (equipped with custom ASICs or FPGA chips) • Examples: • Deep Blue • GRAPE for astrophysics
Flynn taxonomy - 1972 (1) • SISD - Single Instruction Single Data (DEC, Sun Microsystems, PC) • SIMD - Single Instruction Multiple Data • computers with a large number of processing units (i.e. ALUs) - CPP DAP Gamma II, Quadrics Apemille • vector processing machines - NEC SX-6, IA32 MMX • MISD - Multiple Instruction Single Data • theoretical model, no practical implementation
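The difference between SISD and SIMD execution can be modelled in a few lines of Python. This is only a conceptual sketch: the 4-lane vector width and both function names are illustrative assumptions, and plain Python does not issue real vector instructions. It simply counts how many "instructions" each model needs for the same work.

```python
# Conceptual model of SISD vs SIMD execution (not real vector hardware).
# VECTOR_WIDTH and both function names are illustrative assumptions.
VECTOR_WIDTH = 4  # number of data lanes handled by one SIMD instruction


def sisd_add(a, b):
    """Add element by element: one instruction per data item."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b, width=VECTOR_WIDTH):
    """Add in chunks of `width` lanes: one instruction per chunk."""
    result, instructions = [], 0
    for start in range(0, len(a), width):
        lanes_a = a[start:start + width]
        lanes_b = b[start:start + width]
        # A single "instruction" operates on all lanes of the chunk.
        result.extend(x + y for x, y in zip(lanes_a, lanes_b))
        instructions += 1
    return result, instructions


a = list(range(10))
b = [10] * 10
scalar_result, scalar_count = sisd_add(a, b)
vector_result, vector_count = simd_add(a, b)
print(scalar_count, vector_count)  # 10 scalar instructions vs 3 vector ones
```

Both functions return the same sums; only the instruction count differs, which is the point of the SIMD class.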
Flynn taxonomy - 1972 (2) • MIMD - Multiple Instruction Multiple Data • SM-MIMD - Shared Memory MIMD • global address space • SMP systems and ccNUMA systems • DM-MIMD - Distributed Memory MIMD • many nodes with local address spaces • high-bandwidth, low-latency communication • common NUMA architectures (Non-Uniform Memory Access) • the operating system has to be communication oriented (Mach project)
SM-MIMD implementations • S-COMA - Simple Cache-Only Memory Architecture • common SMP systems • ccNUMA - Cache Coherent NUMA • SGI Origin 3000 • SGI Altix 3000 • HP SuperDome
S-COMA (SMP) [Diagram: CPU 0, CPU 1, ..., CPU N, each with its own L2 cache, all attached to a single shared RAM]
ccNUMA [Diagram: nodes 0 to K, each with its own RAM and L3 cache serving two CPUs with separate L2 caches; node 0 holds CPU 0 and CPU 1 with RAM 0, node K holds CPU N-1 and CPU N with RAM K]
ccNUMA implementation SGI Altix 3000 (ccNUMA) • 64 Itanium 2 (IA64) processors • C-brick modules with 2 CPUs and ASIC SHUB • NUMAflex, NUMAlink interconnects (6.4 GB/s, 2.4 GB/s) • Modified Linux kernel (2.6 NUMA support)
DM-MIMD implementations • Massively parallel systems (NUMA) • communication-oriented architecture • low-latency, high-bandwidth interconnects • topologies: hypercube, torus, tree • butterfly networks, Omega networks • communication engineered from the ground up
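For two of the topologies listed above, hypercube and torus, the set of directly connected nodes follows from simple arithmetic on node identifiers. The sketch below assumes nodes are numbered by integer id (hypercube) or by coordinate tuple (torus); the function names and the example sizes are illustrative, not taken from any particular machine.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors differ from `node` in exactly one bit of the node id."""
    return [node ^ (1 << d) for d in range(dimensions)]


def torus_neighbors(coord, shape):
    """Neighbors are one step away along each axis, with wraparound."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            moved = list(coord)
            moved[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(moved))
    return neighbors


# A 3-dimensional hypercube has 8 nodes, each with 3 links.
print(hypercube_neighbors(5, 3))

# A node in a 4 x 4 x 4 3-D torus has 6 links; edges wrap around.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

In a hypercube the number of links per node grows with the logarithm of the node count, while in a 3-D torus it stays fixed at six regardless of machine size.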
DM-MIMD implementations • Commodity clusters • a cluster is a collection of connected, independent computers working in unison to solve a problem • COTS technology • nodes are interconnected by Ethernet LAN, Myrinet, QsNet ELAN etc. • computation can be performed by using popular programming toolkits and frameworks: OpenMP, MPI • clusters require dedicated management software
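A typical pattern for cluster computation with a message-passing toolkit is to give each node a contiguous block of the problem and then combine the partial results. The sketch below shows only the decomposition logic and runs the "ranks" one after another in a single process; it does not call MPI. In a real program each rank would run on its own node and the final sum would be a collective reduction. The names `block_range` and `partial_sum_of_squares` are hypothetical.

```python
def block_range(n_items, n_ranks, rank):
    """Return the [start, stop) slice of work owned by `rank`.

    Leftover items are handed out one each to the lowest ranks,
    so block sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def partial_sum_of_squares(start, stop):
    """Work done locally by one rank on its own block."""
    return sum(i * i for i in range(start, stop))


N_ITEMS = 10
N_RANKS = 4  # illustrative cluster size

# Simulate every rank in turn (a real cluster runs them concurrently).
blocks = [block_range(N_ITEMS, N_RANKS, r) for r in range(N_RANKS)]
partials = [partial_sum_of_squares(start, stop) for start, stop in blocks]

# Stand-in for the reduction step that combines the partial results.
total = sum(partials)
print(blocks, partials, total)
```

Because each rank touches only its own block, the only communication needed is the final reduction, which is why this style suits clusters with comparatively slow interconnects.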
NUMA implementations Cray T3E-1350 • Processor: Alpha 21164 675 MHz • Number of CPUs: 40 - 2176 • 3-D Torus topology • Operating system: UNICOS/mk - microkernel based • Peak performance: 3 TFLOPS
Commodity cluster implementation (1) Linux Networx/Quadrics • Processor: Intel Xeon 2.4 GHz • CPUs: 2304 • Interconnections: QsNet ELAN3 • Operating system: Linux + management tools + Lustre Cluster File System • Linpack performance: 7.6 TFLOPS • 3rd on the TOP500 list (June 2003) • Developed for Lawrence Livermore National Laboratory in 2002
Commodity cluster implementation (2) HP XC6000 Cluster (XC3000 Cluster) • Processor: Intel Itanium 2 6M 1.5 GHz (Intel Xeon 3 GHz) • Node: HP Integrity rx2600 (HP ProLiant DL380) • Number of processors: 34-512 • Interconnections: QsNet ELAN3 (Myricom Myrinet XP) • Operating system: Linux + SSI Middleware + management tools + Lustre Cluster File System • Peak performance: 34 CPUs - 204 GFLOPS, 512 CPUs - 3 TFLOPS
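The peak figures quoted for the XC6000 follow from a simple product of CPU count, clock frequency and floating-point operations per cycle. The sketch below reproduces them under one assumption that is not stated on the slide: that an Itanium 2 completes 4 floating-point operations per cycle (two fused multiply-add units). The function name is hypothetical.

```python
def peak_gflops(n_cpus, clock_ghz, flops_per_cycle):
    """Theoretical peak = CPUs x clock (GHz) x floating-point ops per cycle."""
    return n_cpus * clock_ghz * flops_per_cycle


ITANIUM2_CLOCK_GHZ = 1.5       # from the slide
ITANIUM2_FLOPS_PER_CYCLE = 4   # assumption: two fused multiply-add units

small = peak_gflops(34, ITANIUM2_CLOCK_GHZ, ITANIUM2_FLOPS_PER_CYCLE)
large = peak_gflops(512, ITANIUM2_CLOCK_GHZ, ITANIUM2_FLOPS_PER_CYCLE)

print(small)          # GFLOPS for the 34-CPU configuration
print(large / 1000)   # TFLOPS for the 512-CPU configuration
```

Under that assumption the 34-CPU system comes out at 204 GFLOPS and the 512-CPU system at about 3 TFLOPS, matching the slide. Peak values are upper bounds; measured performance on real workloads is lower.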
Commodity Clusters - software • Operating system - Linux or SSI Linux (Single System Image) • Platform for specialized applications for science, engineering and business (simulation, modeling, data mining) • Distributed computation environments are used for software development (OpenMP, MPI) • Common supercomputer applications require porting to clusters
Performance Scaling • Scale Right • Scale-Up (SMP, ccNUMA) • Scale-Out (Cluster)
Processors (1) • Many types of existing processors are used in supercomputers • Microprocessor development directions: • Increasing clock frequency and the speed of instruction stream processing • Processing a large collection of data in a single processor instruction - SIMD • Control path multiplication – multithreading
Processors (2) • Vector processors • NEC SX-6 • Cray (Cray X1) • RISC processors • MIPS • IBM Power4 • Alpha • CISC processors • IA32 • AMD x86-64 • VLIW processors • IA64
Intel Itanium 2 features • State-of-the-art, unconventional 64-bit architecture • New programming model implementing the VLIW paradigm • EPIC technology – Explicitly Parallel Instruction Computing – the compiler determines instruction dependencies and tells the processor how to process the instruction stream in parallel • Many registers (128 64-bit), register stack management • 6 GFLOPS peak performance • The full advantages of the processor can be exploited by a dedicated compiler
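The EPIC idea, that the compiler rather than the hardware finds independent instructions, can be illustrated with a toy grouping pass. This is a heavily simplified sketch, not how a production IA-64 compiler works: it scans instructions in order and starts a new group whenever an instruction reads or writes a register already written in the current group. The `Instr` type, the register names and `form_groups` are all hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written
    srcs: tuple   # registers read
    text: str     # human-readable form


def form_groups(program):
    """Greedily split a program into groups of mutually independent
    instructions (no read-after-write or write-after-write inside a group)."""
    groups, current, written = [], [], set()
    for instr in program:
        depends = instr.dest in written or any(s in written for s in instr.srcs)
        if depends:
            # Close the group: this instruction must wait for earlier results.
            groups.append(current)
            current, written = [], set()
        current.append(instr)
        written.add(instr.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("r1", ("r10", "r11"), "r1 = r10 + r11"),
    Instr("r2", ("r12", "r13"), "r2 = r12 * r13"),
    Instr("r3", ("r1", "r2"), "r3 = r1 - r2"),
    Instr("r4", ("r14",), "r4 = r14 + 1"),
    Instr("r5", ("r3", "r4"), "r5 = r3 * r4"),
]

groups = form_groups(program)
for number, group in enumerate(groups):
    print(number, [instr.text for instr in group])
```

Instructions inside one group could be issued in parallel; the boundaries between groups are the information the compiler hands to the processor. A real compiler would also reorder instructions to build larger groups, which this sketch does not attempt.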
Operating systems • Monolithic kernel based OSs - UNIX (modification of existing solutions) • BSD • Solaris • Irix • Linux • Microkernel based OSs • Mach
Microkernel architecture [Diagram: Task A, Task B and Task C layered above Kernel and Hardware blocks]
Summary • Today there are many supercomputer architectures • Both vector processors and common RISC, CISC and VLIW chips are used in supercomputers • Commodity clusters running Linux are an attractive way to implement a supercomputer
TOP 500 list (1) 1. Earth Simulator, NEC - 35.86 TFLOPS 2. HP Alphaserver SC, HP - 13.88 TFLOPS 3. Linux Networx / Quadrics IA32 - 7.634 TFLOPS
Source: http://www.top500.org/list/2003/06/ Top 500 list (2)