Explore the impact of large-scale computer systems in various fields such as low-energy defibrillation, genome sequencing, public content generation, and online gaming.
Advanced Computer Systems (Chapter 12) http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_12.ppt
Large-Scale Computer Systems Today • Low-energy defibrillation • Saves lives • Affects >2M people/year • Studies involving both laboratory experiments and computational simulation Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today • Genome sequencing • May save lives • The $1,000 barrier • Large-scale molecular dynamics simulations • Tectonic plate movement • May save lives • Adaptive fine mesh simulations • Using 200,000 processors Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today • Public Content Generation • Wikipedia • Affects how we think about collaborations • “The distribution of effort has increasingly become more uneven, unequal” (Sorin Adam Matei, Purdue University) Source: TeraGrid science highlights 2010, https://www.teragrid.org/c/document_library/get_file?uuid=e950f0a1-abb6-4de5-a509-46e535040ecf&groupId=14002
Large-Scale Computer Systems Today • Online Gaming • World of Warcraft, Zynga • Affects >250M people • “As an organization, World of Warcraft utilizes 20,000 computer systems, 1.3 petabytes of storage, and more than 4600 people.” • 75,000 cores • Upkeep: >$135,000/day (?) Sources: http://www.gamasutra.com/php-bin/news_index.php?story=25307 and http://spectrum.ieee.org/consumer-electronics/gaming/engineering-everquest/0 and http://35yards.wordpress.com/2011/03/01/world-of-warcraft-by-the-numbers/
Why parallelism (1/4) • Fundamental laws of nature: • example: channel widths are becoming so small that quantum properties are going to determine device behaviour • signal propagation time increases when channel widths shrink
Why parallelism (2/4) • Engineering constraints: • Phase transition time of a component is a good measure for the maximum obtainable computing speed • example: optical or superconducting devices can switch in 10^-12 seconds • optimistic suggestion: 1 TIPS (Tera Instructions Per Second, 10^12) is possible • However, we must calculate something • assume we need 10 phase transitions per instruction: 0.1 TIPS
Why parallelism (3/4) • But what about memory? • It takes light approximately 16 picoseconds to cross 0.5 cm, yielding a possible execution rate of 60 GIPS • However, in silicon, signals travel about 10 times slower, resulting in 6 GIPS
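The arithmetic behind these numbers (a worked version, assuming c ≈ 3 × 10^8 m/s and one 0.5 cm signal crossing per instruction):

```latex
t = \frac{0.5\ \text{cm}}{c}
  = \frac{5 \times 10^{-3}\ \text{m}}{3 \times 10^{8}\ \text{m/s}}
  \approx 1.7 \times 10^{-11}\ \text{s} \approx 16\ \text{ps},
\qquad
\text{rate} = \frac{1}{t} \approx 6 \times 10^{10}\ \text{IPS} = 60\ \text{GIPS}.
```

With signals in silicon roughly 10 times slower than light, the same distance gives about 6 GIPS.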
Why parallelism (4/4) • Speed of sequential computers is limited to a few GIPS • Improvements by using parallelism: • multiple functional units (instruction-level parallelism) • multiple CPUs (parallel processing)
Quantum Computing? • “Qubits are quantum bits that can be in an “on”, “off”, or “both” state due to fuzzy physics at the atomic level.” • Does surrounding noise matter? • Wim van Dam, Nature Physics 2007 • May 25, 2011: Lockheed Martin (10M$) buys the D-Wave One, 128 qubits Source: http://www.engadget.com/2011/05/29/d-wave-sells-first-commercial-quantum-computer-to-lockheed-marti/
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations
Classification of computers (Flynn Taxonomy) • Single Instruction, Single Data (SISD) • conventional system • Single Instruction, Multiple Data (SIMD) • one instruction on multiple data objects • Multiple Instruction, Multiple Data (MIMD) • multiple instruction streams on multiple data streams • Multiple Instruction, Single Data (MISD) • rarely used in practice; few, if any, real systems fit this class
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations
SIMD (Array) Processors • [Diagram: an Instruction Issuing Unit broadcasts one instruction (e.g., INCR) to many Processing Elements (PEs), which all execute it on their own data] • Examples: CM-2 (1987), peak 28 GFLOPS, 5-10% sustainable; CM-5 (1991) • PE = Processing Element Sources: http://cs.adelaide.edu.au/~sacpc/hardware.html#cm5 and http://www.paulos.net/other/cm2.html and http://boards.straightdope.com/sdmb/archive/index.php/t-515675.html (about the blinking LEDs)
MIMD: Uniform Memory Access (UMA) architecture • Any processor can directly access any memory. • [Diagram: processors P1, P2, ..., Pm connected through an interconnection network to memory modules M1, M2, ..., Mk, which together hold the address space 0..N]
MIMD: Non-Uniform Memory Access (NUMA) architecture • Any processor can directly access any memory. • Realization in hardware or in software (distributed shared memory) • [Diagram: each processor Pi has a nearby memory module Mi; processors and memories are connected by an interconnection network, so remote accesses take longer than local ones]
MIMD: Distributed memory architecture • Any processor can access any memory, but sometimes through another processor (via messages). • [Diagram: each processor Pi has its own private memory Mi; the processors are connected by an interconnection network]
Example 1: Graphics Processing Units (GPUs): CPU versus GPU • CPU: much cache and control logic • GPU: much compute logic
GPU Architecture: SIMD architecture • Multiple SIMD units • SIMD pipelining • Simple processors • High branch penalty • Efficient operation on • parallel data • regular streaming
Example 3: Cell B.E. • Distributed memory architecture • One PowerPC core (the PPE) plus 8 identical SPE cores
Example 3: Intel Quad-core Shared Memory MIMD
Example 4: Large MIMD Clusters BlueGene/L
Supercomputers Over Time Source: http://www.top500.org
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks (I/O) • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations
Interconnection networks (I/O between processors) • Difficulty in building systems with many processors: the interconnections • Important parameters: • Diameter: • Maximal distance between any two processors • Degree: • Maximal number of connections per processor • Total number of connections (cost) • Bisection width: • Minimum number of links that must be cut to split the network into two equal halves; it bounds the largest number of simultaneous messages between the halves
Multiple bus • [Diagram: (multiple) bus structures: processors and memories attached to more than one shared bus (Bus 1, Bus 2)]
Cross bar • An N x N crossbar interconnection network requires N^2 switches • Example: Sun E10000 Source: http://www.cray-cyber.org/systems/E10k_detail.php
Multi-stage networks (1/4) • [Diagram: a three-stage switching network (stage 1, stage 2, stage 3) connecting 8 modules with 3-bit identifiers P0..P7; a highlighted path runs from P5 to P3]
Multi-stage networks (2/4) • [Diagram: shuffle network (stages 1-3); the connections P4-P0 and P5-P3 conflict because both need the same switch output (0)] • “Shuffle”: like cutting a card deck into two half decks and interleaving them
Multi-stage network (3/4) • Multistage networks: multiple steps • Example: Shuffle or Omega network • Every processor is identified by a three-bit number (in general, an n-bit number) • A message from one processor to another contains the identifier of the destination • Routing algorithm: in every stage, • inspect one bit of the destination • if 0: use the upper output • if 1: use the lower output • (see the routing sketch below)
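A minimal sketch of this routing rule in C (the function name and printed text are illustrative, not from the slides):

```c
#include <stdio.h>

/* Omega-network routing: at stage i, inspect bit (n-1-i) of the
 * destination id; 0 selects the upper switch output, 1 the lower. */
static void route(unsigned dest, unsigned n_stages) {
    for (unsigned stage = 0; stage < n_stages; stage++) {
        unsigned bit = (dest >> (n_stages - 1 - stage)) & 1u;
        printf("stage %u: bit %u -> %s output\n",
               stage + 1, bit, bit ? "lower" : "upper");
    }
}

int main(void) {
    route(3, 3);   /* route a message to P3 = 011 in an 8-processor network */
    return 0;
}
```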
Multi-stage network (4/4) • Properties: • Let N = 2^n be the number of processing elements • Number of stages: n = log2 N • Number of switches per stage: N/2 • Total number of (2x2) switches: N (log2 N) / 2 • Not every pair of connections can be simultaneously realized • Blocking
Hypercubes (1/3) • Non-uniform delay, so suited to NUMA architectures • An n-dimensional hypercube has 2^n nodes, n*2^(n-1) connections, and a maximum distance of n hops • Connected PEs differ in exactly 1 bit of their identifier • Routing: scan the bits of the current and destination ids from right to left; if a bit differs, send the message to the neighbor whose id differs in exactly that bit; repeat until the destination is reached • Example (n = 3): 000 -> 111 follows 000 -> 001 -> 011 -> 111 • [Diagram: a square (nodes 00, 01, 10, 11) for n = 2 and a cube (nodes 000..111) for n = 3]
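The same routing procedure in C (a sketch; the binary-printing helper is illustrative):

```c
#include <stdio.h>

/* Print an n-bit node id as a binary string. */
static void print_id(unsigned id, unsigned n) {
    for (int i = (int)n - 1; i >= 0; i--)
        putchar(((id >> i) & 1u) ? '1' : '0');
}

/* Hypercube routing as on the slide: scan bits right to left and,
 * wherever the current node and the destination differ, hop to the
 * neighbor across that dimension. */
static void route(unsigned src, unsigned dest, unsigned n) {
    unsigned cur = src;
    print_id(cur, n);
    for (unsigned i = 0; i < n; i++) {
        if (((cur ^ dest) >> i) & 1u) {
            cur ^= 1u << i;        /* flip bit i: one hop */
            printf(" -> ");
            print_id(cur, n);
        }
    }
    putchar('\n');
}

int main(void) {
    route(0u, 7u, 3u);   /* prints: 000 -> 001 -> 011 -> 111 */
    return 0;
}
```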
Hypercubes (2/3) • Question: what is the average distance between two nodes in a hypercube?
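For reference, a short derivation (assuming the two node ids are independent and uniformly random; the hop distance equals the Hamming distance between the n-bit ids):

```latex
\mathbb{E}[d] \;=\; \sum_{i=1}^{n} \Pr[\text{bit } i \text{ differs}]
\;=\; \sum_{i=1}^{n} \tfrac{1}{2} \;=\; \frac{n}{2}.
```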
Mesh • Constant number of connections per node
Torus • A mesh with wrap-around connections
Fat tree • Nodes have multiple parents
Local networks • Ethernet • based on collision detection • upon collision, back off and randomly try later (see the sketch below) • speeds up to 100 Gb/s (Terabit Ethernet?) • Token ring • based on token circulation on a ring • possession of the token allows putting a message on the ring
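A sketch of the classic binary exponential backoff used after Ethernet collisions (slot-time handling and retry limits omitted; the constants are the usual ones, not taken from the slides):

```c
#include <stdio.h>
#include <stdlib.h>

/* After the k-th consecutive collision, wait a random number of slot
 * times drawn uniformly from [0, 2^min(k,10) - 1]. */
static unsigned backoff_slots(unsigned collisions) {
    unsigned cap = collisions < 10 ? collisions : 10;
    unsigned window = 1u << cap;              /* 2^min(k,10) */
    return (unsigned)(rand() % window);
}

int main(void) {
    for (unsigned k = 1; k <= 5; k++)
        printf("after collision %u: wait %u slot(s)\n",
               k, backoff_slots(k));
    return 0;
}
```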
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations
Memory organization (1/2) • UMA architectures • [Diagram: each node holds a processor and a secondary cache, attached through a network interface to the network; all memory is reached over the network with uniform latency]
Memory organization (2/2) • NUMA architectures • [Diagram: each node holds a processor, a secondary cache, and a local memory, attached through a network interface to the network; remote memory is reached over the network]
Cache coherence • Problem: caches in multiprocessors may have copies of the same variable • Copies must be kept identical • Cache coherence: all copies of a shared variable have the same value • Solutions: • write through to shared memory and all caches • invalidate cache entries in all other caches • Snoopy caches: processing elements observe (snoop) writes on the bus and either update or invalidate their own copy (see the sketch below)
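A minimal sketch of the invalidate approach in C (hypothetical state names; real protocols such as MSI/MESI track more states and bus transactions):

```c
#include <stdio.h>

/* Hypothetical per-cache-line state for a write-invalidate protocol. */
typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Snooping: on observing another processor's write to an address this
 * cache holds, drop the now-stale local copy. */
static void snoop_remote_write(line_state *line) {
    if (*line != INVALID)
        *line = INVALID;
}

/* A local write is broadcast on the bus (so others invalidate) and
 * marks the local copy dirty. */
static void local_write(line_state *line) {
    *line = MODIFIED;
}

int main(void) {
    line_state cache0 = SHARED, cache1 = SHARED;  /* both hold a copy */
    local_write(&cache0);          /* P0 writes the shared variable   */
    snoop_remote_write(&cache1);   /* P1 snoops the write             */
    printf("cache1 %s\n", cache1 == INVALID ? "invalidated" : "stale!");
    return 0;
}
```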
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers • A Programmer’s View • Performance Considerations
Parallelism • Language construct: PARBEGIN / PAREND

PARBEGIN
  task_1;
  task_2;
  ....
  task_n;
PAREND

• [Diagram: control forks at PARBEGIN into task 1 ... task n, which all join again at PAREND]
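A rough pthreads rendering of PARBEGIN/PAREND (a sketch; the task bodies and the task count are placeholders):

```c
#include <pthread.h>
#include <stdio.h>

#define N_TASKS 3   /* placeholder for n */

static void *task(void *arg) {
    printf("task %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid[N_TASKS];
    /* PARBEGIN: start all tasks in parallel */
    for (long i = 0; i < N_TASKS; i++)
        pthread_create(&tid[i], NULL, task, (void *)i);
    /* PAREND: continue only after every task has finished */
    for (int i = 0; i < N_TASKS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```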
Shared variables (1/4) • [Diagram: tasks T1 and T2 each execute STW R2, SUM(0); both stores target the same variable SUM in shared memory]
Shared variables (2/4) • Suppose processors 1 and 2 both execute: LW A,R0 /* A is a variable in main memory */ ADD R1,R0 STW R0,A • Initially: • A = 100 • R1 in processor 1 is 20 • R1 in processor 2 is 40 • What is the final value of A? 120, 140, 160? Now consider that the final value of A is your bank account balance. (A C version of this race appears below.)
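The same lost-update race in C with two threads (a sketch; an unsynchronized read-modify-write like this can lose one of the two updates):

```c
#include <pthread.h>
#include <stdio.h>

static int A = 100;   /* shared variable, e.g., a bank balance */

/* The unprotected read-modify-write from the slide. */
static void *add(void *arg) {
    int r0 = A;               /* LW  A,R0  */
    r0 += (int)(long)arg;     /* ADD R1,R0 */
    A = r0;                   /* STW R0,A  */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add, (void *)20L);
    pthread_create(&t2, NULL, add, (void *)40L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("A = %d (120, 140, or 160)\n", A);
    return 0;
}
```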
Shared variables (3/4) • So there is a need for mutual exclusion: • different components of the same program need exclusive access to a data structure to ensure consistent values • Occurs in many situations: • access to shared variables • access to a printer • A solution: a single atomic instruction (Test&Set) that • tests whether somebody else is accessing the variable • if so, continues testing (busy waiting) • if not, marks the variable as being accessed
Shared variables (4/4) • Task_1 and Task_2 both execute:

crit: T&S LOCK,crit    ; test-and-set LOCK; loop (busy-wait) until acquired
      ......
      STW R2, SUM(0)   ; critical section: update the shared SUM
      .....
      CLR LOCK         ; release the lock

• [Diagram: T1 and T2 share the variables SUM and LOCK in shared memory] • (A C version with atomics follows.)
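The same spin lock written with C11 atomics (a minimal sketch of the idea, not the original assembly):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int sum = 0;   /* the shared variable being protected */

static void *worker(void *arg) {
    (void)arg;
    /* T&S: atomically set the flag and test its previous value;
     * spin (busy-wait) while another thread holds the lock. */
    while (atomic_flag_test_and_set(&lock))
        ;                          /* busy waiting */
    sum += 1;                      /* critical section */
    atomic_flag_clear(&lock);      /* CLR LOCK: release */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %d\n", sum);     /* always 2 */
    return 0;
}
```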
Agenda • Introduction • The Flynn Classification of Computers • Types of Multi-Processors • Interconnection Networks • Memory Organization in Multi-Processors • Program Parallelism and Shared Variables • Multi-Computers [earlier, see Token Ring et al.] • A Programmer’s View • Performance Considerations
Example program • Compute the dot product of two vectors with • a sequential program • two tasks with shared memory • two tasks with distributed memory, using messages • Primitives in parallel programs: • create_thread() (create a (sub)process) • mypid() (who am I?) • A sketch of the shared-memory variant follows.
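A sketch of the shared-memory variant with two pthreads (pthread_create and the thread-id argument stand in for the slides' generic create_thread() and mypid(); the vectors are made-up example data):

```c
#include <pthread.h>
#include <stdio.h>

#define N 8
#define N_TASKS 2

static double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
static double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
static double partial[N_TASKS];   /* one slot per task: no data race */

/* Each task computes the dot product over its own half of the vectors. */
static void *dot_task(void *arg) {
    long pid = (long)arg;                     /* plays the role of mypid() */
    long lo = pid * (N / N_TASKS);
    long hi = lo + N / N_TASKS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i] * b[i];
    partial[pid] = s;
    return NULL;
}

int main(void) {
    pthread_t tid[N_TASKS];
    for (long p = 0; p < N_TASKS; p++)        /* create_thread() */
        pthread_create(&tid[p], NULL, dot_task, (void *)p);
    double sum = 0.0;
    for (long p = 0; p < N_TASKS; p++) {
        pthread_join(tid[p], NULL);
        sum += partial[p];                    /* combine partial results */
    }
    printf("dot product = %g\n", sum);        /* 120 for this data */
    return 0;
}
```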