Computer Architecture II

chessa

Presentation Transcript


  1. Computer Architecture II

  2. Contents • Preliminaries • Top500 • Scalability • Blue Gene • History • BG/L, BG/C, BG/P, BG/Q • Scalable OS for BG/L • Scalable file systems • BG/P at Argonne National Lab • Conclusions

  3. Top500

  4. Generalities • Published twice a year since 1993: June and November • Ranking of the most powerful computing systems in the world • Ranking criterion: performance on the LINPACK benchmark • Driving force: Jack Dongarra • Website: www.top500.org

  5. HPL: High-Performance Linpack • Solves a dense system of linear equations • Variant of LU factorization of matrices of size N • Measures a computer's floating-point rate of execution • Computation done in 64-bit floating-point arithmetic • Rpeak: theoretical system performance • Upper bound for the real performance (in MFLOPS) • Ex: Intel Itanium 2 at 1.5 GHz, 4 floating-point operations per cycle -> 6 GFLOPS • Nmax: obtained by varying N and choosing the size that gives the maximum performance • Rmax: maximum real performance achieved for Nmax • N1/2: size of problem needed to achieve ½ of Rmax
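The Rpeak example on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function name `rpeak_flops` and its parameters are illustrative and not part of HPL or the Top500 tooling.

```python
def rpeak_flops(clock_hz, flops_per_cycle, n_processors=1):
    """Theoretical peak rate: clock frequency times floating-point
    operations per cycle, times the number of processors."""
    return clock_hz * flops_per_cycle * n_processors


# Itanium 2 example from the slide: 1.5 GHz, 4 floating-point ops per cycle.
itanium2 = rpeak_flops(1.5e9, 4)
print(f"Rpeak = {itanium2 / 1e9:.1f} GFLOPS")
```

Rmax is measured by running the benchmark, so it cannot be derived this way; Rpeak only bounds it from above.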

  6. Jack Dongarra's slide

  7. Amdahl's law • Suppose a fraction f of your application is not parallelizable • 1-f: parallelizable on p processors • Speedup(p) = T1/Tp ≤ T1/(f·T1 + (1-f)·T1/p) = 1/(f + (1-f)/p) ≤ 1/f

  8. Amdahl’s Law (for 1024 processors)
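To get a feel for the bound at 1024 processors, the formula 1/(f + (1-f)/p) can be evaluated for a few serial fractions. This is a small sketch; the function name and the chosen values of f are illustrative.

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup when a fraction f of the work is serial
    and the remaining 1 - f is spread evenly over p processors."""
    return 1.0 / (f + (1.0 - f) / p)


# Serial fractions chosen for illustration only.
for f in (0.0, 0.001, 0.01, 0.1):
    print(f"f = {f:<5}  speedup on 1024 processors <= {amdahl_speedup(f, 1024):.1f}")
```

Even a 1% serial fraction keeps the speedup below 100 on 1024 processors, because the bound can never exceed 1/f.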

  9. Load Balance • Speedup ≤ Sequential Work / Max Work on any Processor • Work: data access, computation • Not just equal work, but must be busy at same time • Ex: Speedup ≤ 1000/400 = 2.5
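The load-balance bound divides the total sequential work by the largest share assigned to any one processor. The sketch below assumes a hypothetical split of 1000 work units over four processors, with the busiest one holding 400; only the totals 1000 and 400 come from the slide.

```python
def load_balance_bound(work_per_processor):
    """Speedup bound: sequential work divided by the maximum work
    assigned to any single processor."""
    sequential_work = sum(work_per_processor)
    return sequential_work / max(work_per_processor)


# Hypothetical distribution: 1000 units in total, 400 on the busiest processor.
print(load_balance_bound([400, 300, 200, 100]))
```

With a perfectly even split of 250 units each, the same function returns 4.0, the number of processors.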

  10. Communication and synchronization • Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost) • Communication is expensive! • Measure: communication to computation ratio • Inherent communication • Determined by assignment of tasks to processes • Actual communication may be larger (artifactual) • One principle: Assign tasks that access same data to same process • [Figure: timelines of Process 1, Process 2 and Process 3 showing communication, work, synchronization point and synchronization wait time]
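The refined bound adds synchronization wait time and communication cost to each process's own work before taking the maximum. The sketch below uses made-up per-process numbers for three processes; none of them come from the slides.

```python
def speedup_bound(work, sync_wait, comm_cost):
    """Sequential work divided by the largest per-process total of
    work + synchronization wait time + communication cost."""
    sequential_work = sum(work)
    per_process_time = [w + s + c for w, s, c in zip(work, sync_wait, comm_cost)]
    return sequential_work / max(per_process_time)


# Hypothetical timings for three processes (arbitrary time units).
work = [400, 300, 300]
sync_wait = [0, 100, 100]
comm_cost = [100, 100, 100]
print(speedup_bound(work, sync_wait, comm_cost))
```

Here every process is occupied for 500 units, so three processes yield a bound of only 2.0 on 1000 units of sequential work.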

  11. Blue Gene

  12. Blue Gene partners • IBM • “Blue”: The corporate color of IBM • “Gene”: The intended use of the Blue Gene clusters – Computational biology, specifically, protein folding • Lawrence Livermore National Lab • Department of Energy • Academia

  13. BG History

  14. Family • BG/L • BG/C • BG/P • BG/Q

  15. Blue Gene/L system • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB • Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB • Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB • Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB

  16. Technical specifications • 64 cabinets which contain 65,536 high-performance compute nodes (chips) • 1,024 I/O nodes • 32-bit PowerPC processors • 5 networks • The main memory has a size of 33 terabytes • Maximum performance of 183.5 TFLOPS when using one processor for computation and the other one for communication, and 367 TFLOPS if using both for computation
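The headline numbers follow from the packaging figures on the previous slide: 64 racks, 32 node cards per rack, 32 chips per node card, and 2.8 or 5.6 GF/s per chip. A quick arithmetic check, with variable names chosen for illustration:

```python
RACKS = 64
NODE_CARDS_PER_RACK = 32
CHIPS_PER_NODE_CARD = 32
GF_PER_CHIP_ONE_CPU = 2.8    # one processor computes, the other communicates
GF_PER_CHIP_BOTH_CPUS = 5.6  # both processors compute

compute_nodes = RACKS * NODE_CARDS_PER_RACK * CHIPS_PER_NODE_CARD
peak_one_cpu_tf = compute_nodes * GF_PER_CHIP_ONE_CPU / 1000
peak_both_cpus_tf = compute_nodes * GF_PER_CHIP_BOTH_CPUS / 1000

print(compute_nodes)
print(f"{peak_one_cpu_tf:.1f} TFLOPS, {peak_both_cpus_tf:.1f} TFLOPS")
```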

  17. Blue Gene / L • Networks: • 3D Torus • Collective Network • Global Barrier/Interrupt • Gigabit Ethernet (I/O & Connectivity) • Control (system boot, debug, monitoring)

  18. Networks • Three dimensional torus • - Compute nodes • Global tree • - collective communication • - I/O • Ethernet • Control network

  19. Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.
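What makes the network a torus rather than a plain mesh is the wraparound at the edges: every node has exactly six neighbors, including nodes on the boundary. A minimal sketch, assuming the 64x32x32 system dimensions quoted earlier and a hypothetical (x, y, z) coordinate scheme:

```python
def torus_neighbors(x, y, z, dims=(64, 32, 32)):
    """Return the six nearest neighbors of node (x, y, z) in a 3D torus.
    The modulo arithmetic provides the wraparound links at the edges."""
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


# A corner node still has six neighbors thanks to the wraparound links.
print(torus_neighbors(0, 0, 0))
```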

  20. Blue Gene / L • Processor: PowerPC 440 at 700 MHz • Low power allows dense packaging • External memory: 512 MB or 1 GB SDRAM per node • Slow embedded core at a clock speed of 700 MHz • 32 KB L1 cache • L2 is a small prefetch buffer • 4 MB embedded DRAM L3 cache

  21. PowerPC 440 core

  22. BG/L compute ASIC • Non-cache coherent L1 • Pre-fetch buffer L2 • Shared 4MB DRAM (L3) • Interface to external DRAM • 5 network interfaces • Torus, collective, global barrier, Ethernet, control

  23. Block diagram

  24. Blue Gene / L • Compute Nodes: • Dual processor, 1024 per Rack • I/O Nodes: • Dual processor, 16-128 per Rack

  25. Blue Gene / L • Compute Nodes: • Proprietary kernel (tailored to processor design) • I/O Nodes: • Embedded Linux • Front-end and service nodes: • SUSE SLES 9 Linux (familiar to users)

  26. Blue Gene / L • Performance: • Peak performance per rack: 5.73 TFlops • Linpack performance per rack: 4.71 TFlops

  27. Blue Gene / C • a.k.a. Cyclops64 • Massively parallel (first supercomputer on a chip) • Processors connected by a 96-port, 7-stage, non-internally-blocking crossbar switch • Theoretical peak performance (chip): 80 GFlops

  28. Blue Gene / C • Cellular architecture • 64-bit Cyclops64 chip: • 500 MHz • 80 processors (each has 2 thread units and an FP unit) • Software • Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.

  29. Blue Gene / C • Picture of BG/C • Performance: • Board: 320 GFlops • Rack: 15.76 TFlops • System: 1.1 PFlops

  30. Blue Gene / P • Similar architecture to BG/L, but • Cache-coherent L1 cache • 4 cores per node • 10 Gbit Ethernet external I/O infrastructure • Scales up to 3 PFLOPS • More energy efficient • 167 TF/s by 2007, 1 PF/s by 2008

  31. Blue Gene / Q • Continuation of Blue Gene/L and /P • Targeting 10 PF/s by 2010/2011 • Higher frequency at similar performance per watt • Similar number of nodes • Many more cores • More generally useful • Aggressive compiler • New network: scalable and cheap

  32. Motivation for a scalable OS • Blue Gene/L is currently the world’s fastest and most scalable supercomputer • Several system components contribute to that scalability • The operating systems for the different nodes of Blue Gene/L are among the components responsible for that scalability • The OS overhead on one node affects the scalability of the whole system • Goal: design a scalable solution for the OS

  33. High level view of BG/L • Principle: the structure of the software should reflect the structure of the hardware.

  34. BG/L Partitioning • Space-sharing • Divided along natural boundaries into partitions • Each partition can run only one job • Each node can be in one of these modes • Coprocessor: one processor assists the other • Virtual node: two separate processors, each of them with its own memory space

  35. OS • Compute nodes: dedicated OS • I/O nodes: dedicated OS • Service nodes: conventional off-the-shelf OS • Front-end nodes: program compilation, debugging, job submission • File servers: store data, not specific to BG/L

  36. BG/L OS solution • Components: I/O nodes, service nodes, CNK • The compute and I/O nodes are organized into logical entities called processing sets or psets: 1 I/O node + a collection of CNs • 8, 16, 64, 128 CNs • Logical concept • Should reflect physical proximity => fast communication • Job: collection of N compute processes (on CNs) • Own private address space • Message passing • MPI: ranks 0 to N-1

  37. High level view of BG/L

  38. BG/L OS solution: CNK • Compute nodes run only compute processes, and all the compute nodes of a particular partition execute in one of two modes: • Coprocessor mode • Virtual node mode • Compute Node Kernel (CNK): simple OS • Creates the address space(s) • Loads code and initializes data • Transfers processor control to the loaded executable

  39. CNK • consumes 1MB • Creates either • One address space of 511/1023MB • 2 address spaces of 255/511MB • No virtual memory, no paging • The entire mapping fits into the TLB of PowerPC • Load in push mode: 1 CN reads the executable from FS and sends to all the others • One image loaded and then stays out of the way!!!
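The address-space sizes on this slide follow from subtracting the 1 MB kernel footprint from the node memory (512 MB or 1 GB) and, in virtual node mode, splitting the remainder between the two processors. A minimal sketch; the function name and the rounding down to whole megabytes are assumptions made to match the slide's figures:

```python
CNK_FOOTPRINT_MB = 1  # memory consumed by the kernel, per the slide


def cnk_address_spaces(node_memory_mb, virtual_node_mode):
    """Sizes in MB of the address space(s) CNK creates on one node."""
    usable = node_memory_mb - CNK_FOOTPRINT_MB
    if virtual_node_mode:
        # Two processes, one per processor, each gets half (rounded down).
        return [usable // 2, usable // 2]
    # Coprocessor mode: a single address space.
    return [usable]


for memory_mb in (512, 1024):
    print(memory_mb,
          cnk_address_spaces(memory_mb, False),
          cnk_address_spaces(memory_mb, True))
```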

  40. CNK • No OS scheduling (one thread) • No memory management (No TLB overhead) • No local file services • User level execution until: • Process requests a system call • Hardware interrupts: timer (requested by application), abnormal events • Syscall • Simple: handled locally (getting the time, set an alarm) • Complex: forward to I/O nodes • Unsupported (fork/mmap): error
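The three-way handling of system calls can be pictured as a small dispatch routine. This is an illustrative sketch, not CNK code: the function name, the returned labels, and the choice to treat every call outside the two listed sets as "complex" are assumptions.

```python
# Examples named on the slide; the real kernel's lists are longer.
LOCAL_SYSCALLS = {"gettimeofday", "alarm"}  # simple: handled on the compute node
UNSUPPORTED_SYSCALLS = {"fork", "mmap"}     # rejected with an error


def dispatch_syscall(name):
    """Classify a system call the way the slide describes."""
    if name in UNSUPPORTED_SYSCALLS:
        return "error"
    if name in LOCAL_SYSCALLS:
        return "local"
    # Assumption: anything else (for example file I/O) counts as complex
    # and is shipped to the I/O node of the pset.
    return "forward_to_io_node"


for call in ("gettimeofday", "write", "fork"):
    print(call, "->", dispatch_syscall(call))
```

Keeping only trivial calls on the compute node is what lets the kernel stay small and avoid scheduling or file-service work of its own.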

  41. Benefits of the simple solution • Robustness: simple design, implementation, test, debugging • Scalability: no interference among compute nodes • Low system noise • Performance measurements

  42. I/O node • Two roles in Blue Gene/L: • Acts as an effective master of its corresponding pset • Offers services requested by compute nodes in its pset • Mainly I/O operations on locally mounted FSs • Only one processor used, due to the lack of memory coherency • Executes an embedded version of the Linux operating system: • Does not use any swap space • Has an in-memory root file system • Uses little memory • Lacks the majority of Linux daemons

  43. I/O node • Complete TCP/IP stack • Supported FS: NFS, GPFS, Lustre, PVFS • Main process: Control and I/O daemon (CIOD) • Launch a job • Job manager sends the request to the service node • Service node contacts the CIOD • CIOD sends the executable to all processes in pset

  44. System calls

  45. Service nodes • Run the Blue Gene/L control system • Tight integration with CNs and IONs • CNs and IONs: stateless, no persistent memory • Responsible for operating and monitoring the CNs and IONs • Create system partitions and isolate them • Compute network routing for torus, collective and global interrupt networks • Load OS code for CNs and IONs

  46. Problems • Not fully POSIX compliant • Many applications need • Process/thread creation • Full server sockets • Shared memory segments • Memory mapped files

  47. File systems for BG systems • Need for scalable file systems: NFS is not a solution • Most supercomputers and clusters in top 500 use one of these parallel file systems • GPFS • Lustre • PVFS2

  48. GPFS/PVFS/Lustre mounted on the I/O nodes • File system servers
