
Case study

  1. Case study • IBM Blue Gene/L system • InfiniBand

  2. Interconnect family share in the Top 500 supercomputers on 06/2011 and 11/2012 (one chart per date)

  3. Overview of the IBM Blue Gene/L System Architecture • Design objectives • Hardware overview • System architecture • Node architecture • Interconnect architecture

  4. Highlights • A 64K-node, highly integrated supercomputer based on system-on-a-chip technology • Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL) • Distributed-memory, massively parallel processing (MPP) architecture • Uses the message-passing programming model (MPI) • 360 Tflops peak performance • Optimized for cost/performance
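
The programming model named on this slide is ordinary message passing. A minimal sketch in C using standard MPI (not specific to Blue Gene/L; run with at least two ranks):

    /* Minimal message-passing sketch: rank 0 sends one integer to rank 1.
       Requires an MPI implementation and at least two ranks (e.g. mpirun -np 2). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                        /* data lives in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }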

  5. Design objectives • Objective 1: a 360-Tflops supercomputer • Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops • Objective 2: power efficiency • Performance/rack = performance/watt * watt/rack • Watts/rack is roughly constant at about 20 kW • Performance/watt therefore determines performance/rack

  6. Power efficiency • 360 Tflops would require about 20 megawatts with conventional processors • Need a low-power processor design (2-10 times better power efficiency)
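
A quick check using only the figures on slides 5 and 6: 360 Tflops at 20 MW is about 18 Mflops per watt, and 20 MW at roughly 20 kW per rack would mean on the order of 1,000 racks, hence the push for processors with 2-10 times better power efficiency.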

  7. Design objectives (continued) • Objective 3: extreme scalability • Optimizing for cost/performance means using low-power, less powerful processors, which in turn requires a large number of processors • Up to 65,536 processors • Interconnect scalability

  8. Blue Gene/L system components

  9. Blue Gene/L Compute ASIC • 2 PowerPC 440 cores with floating-point enhancements • 700 MHz • All the features of a typical superscalar processor • Pipelined microarchitecture with dual instruction fetch and decode, and out-of-order issue, dispatch, execution, and completion • 1 W each through extensive power management

  10. Blue Gene/L Compute ASIC

  11. Memory system on a BG/L node • BG/L supports only the distributed-memory paradigm • No need for efficient hardware cache coherence on each node; coherence is enforced by software if needed • The two cores operate in one of two modes: • Communication coprocessor mode: coherence is needed and is managed in system-level libraries • Virtual node mode: memory is physically partitioned (not shared)

  12. Blue Gene/L networks • Five networks • 100 Mbps Ethernet control network for diagnostics, debugging, and other control functions • 1000 Mbps Ethernet for I/O • Three high-bandwidth, low-latency networks for data transmission and synchronization: • 3-D torus network for point-to-point communication • Collective network for global operations • Barrier network • All network logic is integrated in the BG/L node ASIC • Memory-mapped interfaces from user space

  13. 3-D torus network • Supports point-to-point communication • Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s aggregate) • 64x32x32 torus: diameter 32+16+16 = 64 hops, worst-case hardware latency 6.4 us • Cut-through routing • Adaptive routing
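
A small sketch of where the 64-hop diameter on this slide comes from: each torus dimension wraps around, so the farthest node in a ring of size n is n/2 hops away. The function names and the check below are illustrative.

    /* Hop count on a 3-D torus with wrap-around links; dimensions follow the
       64x32x32 system on the slide. Worst case: half-way around every ring. */
    #include <stdio.h>
    #include <stdlib.h>

    static int ring_hops(int a, int b, int dim) {
        int d = abs(a - b);
        return d < dim - d ? d : dim - d;   /* take the shorter way around the ring */
    }

    int main(void) {
        const int X = 64, Y = 32, Z = 32;
        int diameter = ring_hops(0, X / 2, X)
                     + ring_hops(0, Y / 2, Y)
                     + ring_hops(0, Z / 2, Z);
        printf("diameter = %d hops\n", diameter);   /* prints 32 + 16 + 16 = 64 */
        return 0;
    }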

  14. Collective network • Binary tree topology, static routing • Link bandwidth: 2.8 Gb/s • Maximum hardware latency: 5 us • Arithmetic and logic hardware in the tree can perform integer operations on the data • Efficient support for reduce, scan, global sum, and broadcast operations (see the sketch after slide 15) • Floating-point operations can be done in two passes

  15. Barrier network • Hardware support for global synchronization • 1.5 us for a barrier across 64K nodes
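
A sketch of the global operations that slides 14 and 15 say are supported in hardware, expressed with standard MPI. Whether a particular MPI runtime actually offloads these calls to the collective and barrier networks is an implementation detail, not something this code controls.

    /* Global integer sum (collective network) and a barrier (barrier network),
       expressed as ordinary MPI collectives. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, local, global_sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank + 1;                 /* each node contributes one integer */
        MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);      /* global synchronization point */
        if (rank == 0)
            printf("global sum = %d\n", global_sum);
        MPI_Finalize();
        return 0;
    }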

  16. IBM Blue Gene/L summary • Optimizing for cost/performance limits the range of applications • Low-power design: lower frequency, system-on-a-chip • Excellent performance-per-watt • Scalability support • Hardware support for global communication and barriers • Low-latency, high-bandwidth support

  17. Case 2: InfiniBand architecture • Specification (InfiniBand Architecture Specification Release 1.2.1, January 2008/Oct. 2006) available from the InfiniBand Trade Association (http://www.infinibandta.org)

  18. InfiniBand architecture overview

  19. InfiniBand architecture overview • Components: • Links • Channel adapters • Switches • Routers • The specification allows InfiniBand wide-area networks, but it is mostly adopted as a system/storage area network • Topology: • Irregular • Regular: fat tree, hypercube, torus, etc. • Link speed: • Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X) • Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X) • Quad data rate (QDR): 40 Gbps (4X) • Fourteen data rate (FDR): 56 Gbps (4X)
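
A small sketch of how the figures on this slide combine a per-lane rate with a lane count (1X, 4X, 12X). The per-lane rates are those implied by the slide (2.5, 5, 10, and roughly 14 Gbps); the slide lists only a subset of the combinations printed here, and encoding overhead such as 8b/10b is ignored.

    /* Raw link rate = per-lane signalling rate x number of lanes. */
    #include <stdio.h>

    int main(void) {
        const char  *gen[]   = { "SDR", "DDR", "QDR", "FDR" };
        const double lane[]  = { 2.5, 5.0, 10.0, 14.0 };   /* Gbps per lane (FDR approx.) */
        const int    width[] = { 1, 4, 12 };

        for (int g = 0; g < 4; g++)
            for (int w = 0; w < 3; w++)
                printf("%s %2dX: %5.1f Gbps\n", gen[g], width[w], lane[g] * width[w]);
        return 0;
    }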

  20. Layers: somewhat similar to TCP/IP • Physical layer • Link layer • Error detection (CRC checksums) • Flow control (credit-based) • Switching, virtual lanes (VLs) • Forwarding tables computed by the subnet manager • Single-path deterministic routing (not adaptive) • Network layer: routing across subnets • Not used in the cluster environment • Transport layer • Reliable/unreliable, connection/datagram • Verbs: the interface between adapters and the OS/users
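
A minimal sketch of the credit-based flow control named on this slide, under the usual scheme: the sender transmits only while it holds credits for the target virtual lane, and the receiver returns credits as it frees buffers. The structure and function names are illustrative, not taken from the specification.

    /* Per-virtual-lane transmit state for credit-based flow control. */
    #include <stdbool.h>

    struct vl_tx_state {
        int credits;                       /* receive buffers currently advertised */
    };

    static bool try_send(struct vl_tx_state *vl) {
        if (vl->credits == 0)
            return false;                  /* stall: no buffer space at the receiver */
        vl->credits--;                     /* consume one credit per packet sent */
        /* ... transmit one packet on this virtual lane ... */
        return true;
    }

    static void on_credit_return(struct vl_tx_state *vl, int returned) {
        vl->credits += returned;           /* receiver freed buffers and returned credits */
    }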

  21. InfiniBand link layer packet format: • Local Route Header (LRH): 8 bytes, used for local routing by switches within an IBA subnet • Global Route Header (GRH): 40 bytes, used for routing between subnets • Base Transport Header (BTH): 12 bytes, for IBA transport • Extended transport headers: • Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagrams • Datagram Extended Transport Header (DETH): 8 bytes • RDMA Extended Transport Header (RETH): 16 bytes • Atomic, ACK, and Atomic ACK extended transport headers • Immediate Data extended transport header: 4 bytes, optimized for small packets • Invariant CRC and variant CRC: CRCs over the fields that do not change in flight and those that may change, respectively

  22. Local Route Header: • Switching is based on the destination port's local identifier (LID) • Multipath switching is possible by allocating multiple LIDs to one port
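
A sketch of the LID-based switching described on slides 20 and 22: each switch holds a forwarding table, programmed by the subnet manager, that maps a destination LID to a single output port. The names and table size below are illustrative, not taken from the specification.

    /* Destination-LID to output-port lookup inside a switch. */
    #include <stdint.h>

    #define LFT_SIZE 49152                       /* illustrative unicast LID range */

    struct forwarding_table {
        uint8_t out_port[LFT_SIZE];              /* filled in by the subnet manager */
    };

    static uint8_t route_packet(const struct forwarding_table *lft, uint16_t dlid) {
        if (dlid >= LFT_SIZE)
            return 0xFF;                         /* no route programmed (illustrative) */
        return lft->out_port[dlid];              /* one deterministic output port per DLID */
    }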

  23. Subnet management • Initializes the network • Discovers the subnet topology and topology changes, computes the paths, assigns LIDs, distributes the routes, and configures devices • Related devices and entities: • Devices: channel adapters (CAs), host channel adapters (HCAs), switches, routers • Subnet manager (SM): discovers, configures, activates, and manages the subnet • A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management • The SM exchanges control packets with SMAs through the subnet management interface (SMI)

  24. Subnet management phases: • Topology discovery: sending direct-routed SMPs to every port and processing the responses • Path computation: computing valid paths between each pair of end nodes • Path distribution: configuring the forwarding tables

  25. Base transport header:

  26. Verbs • The OS and users access the adapter through verbs • Communication mechanism: the queue pair (QP) • Users queue up sets of work requests that the hardware executes • Each QP is a pair of queues: one for send, one for receive • Users post send requests to the send queue and receive requests to the receive queue • Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), and MEMORY BINDING • One receive operation (matching a SEND)
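
A sketch of posting a SEND work request with the libibverbs API, one common realization of the verbs interface. It assumes the QP, the completion queue, and the registered memory region (mr) were already created during setup, as described on the next slide.

    /* Post one SEND work request describing a registered buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,          /* registered user buffer */
            .length = len,
            .lkey   = mr->lkey,                 /* local key from memory registration */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,                    /* user tag, returned in the completion */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,          /* consumed by a RECEIVE posted on the peer */
            .send_flags = IBV_SEND_SIGNALED,    /* generate a completion entry */
        };
        struct ibv_send_wr *bad_wr = NULL;

        return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
    }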

  27. To communicate: • Make system calls to set everything up (create the QP, bind it to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.) • Post send/receive requests as user-level instructions • Check for completion
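
A sketch of the "check completion" step with libibverbs: completions are harvested by polling the completion queue from user space, with no system call on the fast path. It assumes the completion queue (cq) was created during the setup step above.

    /* Busy-poll the completion queue until one work completion arrives. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int wait_for_completion(struct ibv_cq *cq) {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);        /* user-level poll; 0 means CQ is empty */
        } while (n == 0);

        if (n < 0 || wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "work request %llu failed\n", (unsigned long long) wc.wr_id);
            return -1;
        }
        return 0;                               /* the posted request has completed */
    }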

  28. InfiniBand has an almost perfect software/network interface: • The network subsystem realizes most user-level functionality • The network supports in-order delivery and fault tolerance • Buffer management is pushed out to the user • OS bypass: user-level access to the network interface; a few machine instructions accomplish a transmission without involving the OS
