
Impact of Large-Scale Computer Systems Today

Explore the impact of large-scale computer systems in various fields such as low-energy defibrillation, genome sequencing, public content generation, and online gaming.


Presentation Transcript


  1. Advanced computer systems (Chapter 12) http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_12.ppt

  2. Large-Scale Computer Systems Today • Low-energy defibrillation • Saves lives • Affects >2M people/year • Studies involving both laboratory experiments and computational simulation Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

  3. Large-Scale Computer Systems Today • Genome sequencing • May save lives • The $1,000 barrier • Large-scale molecular dynamics simulations • Tectonic plate movement • May save lives • Adaptive fine mesh simulations • Using 200,000 processors Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

  4. Large-Scale Computer Systems Today • Public Content Generation • Wikipedia • Affects how we think about collaborations • “The distribution of effort has increasingly become more uneven, unequal” (Sorin Adam Matei, Purdue University) Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002

  5. Large-Scale Computer Systems Today • Online Gaming • World of Warcraft, Zynga • Affects >250M people • “As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people.” • 75,000 cores • Upkeep: >$135,000/day (?) Sources: http://www.gamasutra.com/php-bin/news_index.php?story=25307 and http://spectrum.ieee.org/consumer-electronics/gaming/engineering-everquest/0 and http://35yards.wordpress.com/2011/03/01/world-of-warcraft-by-the-numbers/

  6. Why parallelism (1/4) • Fundamental laws of nature: • example: channel widths are becoming so small that quantum properties are going to determine device behaviour • signal propagation time increases when channel widths shrink

  7. Why parallelism (2/4) • Engineering constraints: • Phase transition time of a component is a good measure for the maximum obtainable computing speed • example: optical or superconducting devices can switch in 10^-12 seconds • optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12) is possible • However, we must calculate something • assume we need 10 phase transitions per instruction: 0.1 TIPS (worked out below)
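
A back-of-the-envelope check of the slide's numbers (a sketch, assuming exactly 10 phase transitions are needed per instruction):

    t_{\text{switch}} \approx 10^{-12}\,\mathrm{s}
    \;\Rightarrow\;
    \text{raw switching rate} \approx \frac{1}{10^{-12}\,\mathrm{s}} = 10^{12}\ \text{transitions/s}

    \text{instruction rate} \approx \frac{10^{12}}{10} = 10^{11}\ \text{IPS} = 0.1\ \text{TIPS}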

  8. Why parallelism (3/4) • But what about memory? • It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS • However, in silicon, signal speed is about 10 times slower, resulting in 6 GIPS
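
Written out (a sketch assuming one 0.5 cm signal traversal per memory access and signals at the speed of light in vacuum):

    t = \frac{d}{c} = \frac{0.5 \times 10^{-2}\,\mathrm{m}}{3 \times 10^{8}\,\mathrm{m/s}}
      \approx 1.7 \times 10^{-11}\,\mathrm{s} \approx 17\,\mathrm{ps}

    \text{rate} \approx \frac{1}{t} \approx 6 \times 10^{10}\ \text{IPS} = 60\ \text{GIPS};
    \qquad \text{in silicon } (\sim 10\times \text{ slower}):\ \approx 6\ \text{GIPS}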

  9. Why parallelism (4/4) • Speed of sequential computers is limited to a few GIPS • Improvements by using parallelism: • multiple functional units (instruction-level parallelism) • multiple CPUs (parallel processing)

  10. Quantum Computing? • “Qubits are quantum bits that can be in an “on”, “off”, or “both” state due to fuzzy physics at the atomic level.” • Does surrounding noise matter? (Wim van Dam, Nature Physics 2007) • May 25, 2011: Lockheed Martin buys a D-Wave One, 128 qubits ($10M) Source: http://www.engadget.com/2011/05/29/d-wave-sells-first-commercial-quantum-computer-to-lockheed-marti/

  11. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations

  12. Classification of computers (Flynn Taxonomy) • Single Instruction, Single Data (SISD) • conventional system • Single Instruction, Multiple Data (SIMD) • one instruction on multiple data objects • Multiple Instruction, Multiple Data (MIMD) • multiple instruction streams on multiple data streams • Multiple Instruction, Single Data (MISD) • ?????

  13. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations

  14. SIMD (Array) Processors • An instruction issuing unit broadcasts each instruction to many processing elements (PE = Processing Element), which execute it in lockstep on different data • Examples: Connection Machine CM-2 (’87; peak 28 GFLOPS, sustainable 5-10%) and CM-5 (’91) Sources: http://cs.adelaide.edu.au/~sacpc/hardware.html#cm5 and http://www.paulos.net/other/cm2.html and http://boards.straightdope.com/sdmb/archive/index.php/t-515675.html (about the blinking leds)

  15. MIMD: Uniform Memory Access (UMA) architecture • Any processor can directly access any memory • Processors P1 ... Pm and memory modules M1 ... Mk are connected through an interconnection network; every memory location is equally far from every processor • Uniform Memory Access (UMA) computer

  16. MIMD: NUMA architecture • Any processor can directly access any memory • Each processor Pi has a memory module Mi close by; accesses to other modules go through the interconnection network and take longer • Non-Uniform Memory Access (NUMA) computer • Realization in hardware or in software (distributed shared memory)

  17. MIMD: Distributed memory architecture • Any processor can access any memory, but sometimes through another processor (via messages) • Each processor Pi has its own private memory Mi; processors communicate over the interconnection network

  18. Example 1: Graphics Processing Units (GPUs): CPU versus GPU • CPU: much cache and control logic • GPU: much compute logic

  19. GPU Architecture SIMD architecture • Multiple SIMD units • SIMD pipelining • Simple processors • High branch penalty • Efficient operation on • parallel data • regular streaming

  20. Example 2: Cell B.E. • Distributed memory architecture • One PowerPC core plus 8 identical SPE cores

  21. Example 3: Intel Quad-core Shared Memory MIMD

  22. Example 4: Large MIMD Clusters BlueGene/L

  23. Supercomputers Over Time Source: http://www.top500.org

  24. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks (I/O) • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations

  25. Interconnection networks (I/O between processors) • Difficulty in building systems with many processors: the interconnections • Important parameters: • Diameter: • Maximal distance between any two processors • Degree: • Maximal number of connections per processor • Total number of connections (Cost) • Bisection width: • Minimum number of links that must be cut to split the network into two equal halves; it bounds the number of simultaneous messages between the two halves

  26. Multiple bus • (Multiple) bus structures • Figure: processors and memories attached to two shared buses (Bus 1, Bus 2)

  27. Cross bar • Cross-bar interconnection network: connecting N processors to N memories requires N^2 switches • Example: Sun E10000 Source: http://www.cray-cyber.org/systems/E10k_detail.php

  28. Multi-stage networks (1/4) • Three switch stages (stage 1, stage 2, stage 3) connect 8 modules, each identified by a 3-bit id (P0 ... P7) • Figure: the path from P5 to P3 through the three stages

  29. Multi-stage networks (2/4) • Shuffle network • “Shuffle”: split the inputs into two half decks and interleave them, like riffle-shuffling a deck of cards • Figure: connections P4-P0 and P5-P3 both need the same link, so they cannot be realized at the same time

  30. Multi-stage network (3/4) • Multistage networks: multiple steps • Example: Shuffle or Omega network • Every processor is identified by a three-bit number (in general, an n-bit number) • A message from one processor to another contains the identifier of the destination • Routing algorithm: in every stage, • inspect one bit of the destination • if 0: use upper output • if 1: use lower output • (a small routing sketch follows below)
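
A minimal sketch of that routing rule in C, assuming an omega network with N = 2^n inputs whose switches inspect the destination bits from most significant to least significant; the function name omega_route and the 0 = upper / 1 = lower encoding are illustrative, not from the slides.

    #include <stdio.h>

    /* Route a message through an omega (shuffle) network with N = 2^n inputs.
     * At stage i we inspect the i-th most significant bit of the destination:
     * 0 -> take the upper switch output, 1 -> take the lower one. */
    static void omega_route(unsigned dest, unsigned n_stages)
    {
        for (unsigned stage = 0; stage < n_stages; stage++) {
            unsigned bit = (dest >> (n_stages - 1 - stage)) & 1u;
            printf("stage %u: destination bit = %u -> take %s output\n",
                   stage + 1, bit, bit ? "lower" : "upper");
        }
    }

    int main(void)
    {
        omega_route(3u, 3u);   /* slide example: route to P3 (binary 011) */
        return 0;
    }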

  31. Multi-stage network (4/4) • Properties: • Let N = 2^n be the number of processing elements • Number of stages: n = log2 N • Number of switches per stage: N/2 • Total number of (2x2) switches: N (log2 N) / 2 • Not every pair of connections can be simultaneously realized • Blocking
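
For the 8-processor network of the previous slides, these formulas give:

    N = 8,\ n = \log_2 8 = 3:\qquad
    \text{stages} = 3,\quad
    \text{switches per stage} = \frac{N}{2} = 4,\quad
    \text{total } 2\times 2 \text{ switches} = \frac{N \log_2 N}{2} = 12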

  32. Hypercubes (1/3) • Non-uniform delay, so suited to NUMA architectures • An n-dimensional hypercube has 2^n nodes, n·2^(n-1) connections, and a maximum distance of n hops • Connected PEs differ by 1 bit of their label • Routing (example: 000 -> 111): • scan bits from right to left • if a bit differs, send to the neighbor whose label differs in exactly that bit • repeat until the destination is reached • Figures: n = 2 (nodes 00, 01, 10, 11) and n = 3 (nodes 000 ... 111)
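
The same bit-correcting rule in C, as a sketch (the function name hypercube_route and the decision to correct bits from least significant upward are illustrative choices consistent with the slide):

    #include <stdio.h>

    /* Route in an n-dimensional hypercube: compare the current node's label
     * with the destination, bit by bit from the right, and correct each
     * differing bit by hopping to the neighbour that differs only in that bit. */
    static void hypercube_route(unsigned src, unsigned dest, unsigned n)
    {
        unsigned node = src;
        for (unsigned bit = 0; bit < n; bit++) {
            if (((node ^ dest) >> bit) & 1u) {
                node ^= 1u << bit;            /* hop along dimension 'bit' */
                printf("hop to node %u (labels differ in bit %u)\n", node, bit);
            }
        }
    }

    int main(void)
    {
        hypercube_route(0u, 7u, 3u);          /* slide example: 000 -> 111 */
        return 0;
    }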

  33. Hypercubes (2/3) • Question: what is the average distance between two nodes in a hypercube?
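
The slide leaves the question open; one way to work it out (a sketch, not from the slides): the distance between two nodes is the number of bit positions in which their n-bit labels differ, and each of the n bits differs in half of all node pairs, so

    \bar d \;=\; \sum_{i=1}^{n} \Pr[x_i \neq y_i] \;=\; \frac{n}{2}

If the two nodes are required to be distinct, the average becomes n·2^(n-1) / (2^n - 1), only slightly larger than n/2.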

  34. Mesh Constant number of connections per node

  35. Torus mesh with wrap-around connections

  36. Tree

  37. Fat tree … Nodes have multiple parents

  38. Local networks • Ethernet • based on collision detection • upon collision, back off and randomly try later (a toy backoff sketch follows below) • speeds up to 100 Gb/s (Terabit Ethernet?) • Token ring • based on token circulation on a ring of PCs • possession of the token allows putting a message on the ring
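
A toy illustration of the "back off and randomly try later" rule, assuming the classic truncated binary exponential backoff of CSMA/CD Ethernet (wait a random number of slot times in 0 .. 2^min(c,10) - 1 after the c-th collision); this is a simplified sketch, not a model of any real network adapter:

    #include <stdio.h>
    #include <stdlib.h>

    /* Choose how many slot times to wait after the given number of collisions. */
    static unsigned backoff_slots(unsigned collisions)
    {
        unsigned k = collisions < 10 ? collisions : 10;   /* cap the window */
        unsigned window = 1u << k;                        /* 2^k slot times */
        return (unsigned)(rand() % window);
    }

    int main(void)
    {
        for (unsigned c = 1; c <= 5; c++)
            printf("after collision %u: wait %u slot times\n", c, backoff_slots(c));
        return 0;
    }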

  39. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations

  40. Memory organization (1/2) • UMA architectures • Figure: each node contains a processor with a secondary cache and a network interface; all memory is reached through the network

  41. Memory organization (2/2) • NUMA architectures • Figure: each node contains a processor with a secondary cache, a local memory, and a network interface; remote memories are reached through the network

  42. Cache coherence • Problem: caches in multiprocessors may have copies of the same variable • Copies must be kept identical • Cache coherence: all copies of a shared variable have the same value • Solutions: • write through to shared memory and all caches • invalidate cache entries in all other caches • Snoopy caches: • Processing elements sense (snoop) writes on the bus and either update their cached copy or invalidate it (a minimal sketch follows below)
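
A minimal sketch of the write-invalidate snooping idea, assuming a toy cache structure (struct cache, snoop_write, and the fixed number of lines are illustrative, not the protocol of any particular machine): every cache watches the shared bus, and when another processor writes an address it holds, it marks its own copy invalid so the next read fetches the fresh value.

    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 4

    struct cache {
        unsigned addr[LINES];
        bool     valid[LINES];
    };

    /* Called by a snooping cache for every write it observes on the bus. */
    static void snoop_write(struct cache *c, unsigned written_addr)
    {
        for (int i = 0; i < LINES; i++)
            if (c->valid[i] && c->addr[i] == written_addr)
                c->valid[i] = false;          /* invalidate the stale copy */
    }

    int main(void)
    {
        struct cache c1 = { .addr = {0x10, 0x20}, .valid = {true, true} };
        snoop_write(&c1, 0x20);               /* another processor writes 0x20 */
        printf("copy of 0x20 still valid? %s\n", c1.valid[1] ? "yes" : "no");
        return 0;
    }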

  43. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations

  44. Parallelism • Language construct: PARBEGIN ... PAREND • PARBEGIN task_1; task_2; ...; task_n; PAREND • The tasks listed between PARBEGIN and PAREND may execute in parallel; execution continues past PAREND only when all of them have finished (a thread-based sketch follows below)
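
PARBEGIN/PAREND is a language construct, not a library call; a minimal sketch of the same behaviour with POSIX threads (task_1 and task_2 are illustrative placeholders):

    #include <pthread.h>
    #include <stdio.h>

    static void *task_1(void *arg) { (void)arg; puts("task 1"); return NULL; }
    static void *task_2(void *arg) { (void)arg; puts("task 2"); return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, task_1, NULL);   /* PARBEGIN: start the tasks */
        pthread_create(&t2, NULL, task_2, NULL);
        pthread_join(t1, NULL);                    /* PAREND: wait for all tasks */
        pthread_join(t2, NULL);
        return 0;
    }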

  45. Shared variables (1/4) • Two tasks, Task_1 and Task_2, each execute ... STW R2, SUM(0) ... : both store into the same variable SUM in shared memory

  46. Shared variables (2/4) • Suppose both processors 1 and 2 execute: LW A,R0 /* A is a variable in main memory */ ADD R1,R0 STW R0,A • Initially: • A = 100 • R1 in processor 1 is 20 • R1 in processor 2 is 40 • What is the final value of A? 120, 140, 160? Now consider that the final value of A is your bank account balance. (A runnable version of this race is sketched below.)
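
The same lost-update problem in C with two threads (a sketch; the deposit function mirrors the slide's load/add/store sequence, and a real run may or may not expose the race):

    #include <pthread.h>
    #include <stdio.h>

    static int A = 100;

    static void *deposit(void *amount)
    {
        int r0 = A;               /* LW  A,R0  */
        r0 += *(int *)amount;     /* ADD R1,R0 */
        A = r0;                   /* STW R0,A  */
        return NULL;
    }

    int main(void)
    {
        int d1 = 20, d2 = 40;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, deposit, &d1);
        pthread_create(&t2, NULL, deposit, &d2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("final A = %d\n", A);   /* 160 if no race, 120 or 140 if the updates collide */
        return 0;
    }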

  47. Shared variables (3/4) • So there is a need for mutual exclusion: • different components of the same program need exclusive access to a data structure to ensure consistent values • Occurs in many situations: • access to shared variables • access to a printer • A solution: a single instruction (Test&Set) that • tests whether somebody else accesses the variable • if so, continue testing (busy waiting) • if not, marks the variable as being accessed, so others will wait (a lock built on this idea is sketched below)
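
One realization of the Test&Set idea in standard C, using the C11 atomic_flag type whose atomic_flag_test_and_set operation atomically sets the flag and returns its previous value (the acquire/release names are illustrative; the slides use pseudocode instructions instead):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void acquire(void)
    {
        while (atomic_flag_test_and_set(&lock))   /* T&S: spin while already set */
            ;                                      /* busy waiting */
    }

    static void release(void)
    {
        atomic_flag_clear(&lock);                  /* CLR LOCK */
    }

    int main(void)
    {
        acquire();
        /* critical section, e.g. the STW R2, SUM(0) update on the next slide */
        release();
        return 0;
    }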

  48. Shared variables (4/4) • Both tasks now guard the update of SUM with a lock variable LOCK, also in shared memory: crit: T&S LOCK,crit /* busy-wait until the lock is free */ ...... STW R2, SUM(0) ..... CLR LOCK /* release the lock */

  49. Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers [earlier, see Token Ring et al.] • A Programmer’s View • Performance Considerations

  50. Example program • Compute the dot product of two vectors with • a sequential program • two tasks with shared memory • two tasks with distributed memory using messages • Primitives in parallel programs: • create_thread() (create a (sub)process) • mypid() (who am I?) • (a shared-memory sketch follows below)
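
A sketch of the shared-memory variant with two tasks, mapped onto pthreads (create_thread() and mypid() on the slide are course pseudocode; here pthread_create plays create_thread() and the id argument plays mypid(), and the vectors a, b and array partial are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    static double a[N] = {1,2,3,4,5,6,7,8};
    static double b[N] = {8,7,6,5,4,3,2,1};
    static double partial[2];               /* one slot per task: no lock needed */

    /* Each task sums its half of the element-wise products. */
    static void *dot_task(void *arg)
    {
        int id = *(int *)arg;               /* plays the role of mypid() */
        double s = 0.0;
        for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
            s += a[i] * b[i];
        partial[id] = s;
        return NULL;
    }

    int main(void)
    {
        int id0 = 0, id1 = 1;
        pthread_t t0, t1;
        pthread_create(&t0, NULL, dot_task, &id0);   /* plays create_thread() */
        pthread_create(&t1, NULL, dot_task, &id1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("dot product = %g\n", partial[0] + partial[1]);
        return 0;
    }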
