Computer Architecture

Computer Architecture MIMD Parallel Processors Iolanthe II racing in Waitemata Harbour

Classification of Parallel Processors • Flynn’s Taxonomy • Classifies according to instruction and data stream • Single Instruction Single Data • Sequential processors • Single Instruction Multiple Data • CM-2 – multiple small processors • Vector processors • Parts of commercial processors - MMX, Altivec • Multiple Instruction Single Data • ? • Multiple Instruction Multiple Data • General Parallel Processors

MIMD Systems • Recipe • Buy a few high performance commercial PEs • DEC Alpha • MIPS R10000 • UltraSPARC • Pentium? • Put them together with some memory and peripherals on a common bus • Instant parallel processor! • How to program it?

Programming Model • Problem not unique to MIMD • Even sequential machines need one • von Neuman (stored program) model • Parallel - Splitting the work load • Data • Distribute data to PEs • Instructions • Distribute tasks to PEs • Synchronization • Having divided the data & tasks,how do we synchronize tasks?

Programming Model • Shared Memory Model • Flavour of the year • Generally thought to be simplest to manage • All PEs see a common(virtual) address space • PEs communicateby writing into the common address space

Data Distribution • Trivial • All the data sits in the common address space • Any PE can access it! • Uniform Memory Access(UMA) systems • All PEs access all data with same tacc • Non-UMA (NUMA) systems • Memory is physically distributed • Some PEs are “closer” to some addresses • More later!

Synchronisation • Read static shared data • No problem! • Update problem • PE0 writes x • PE1 reads x • How to ensure thatPE1 reads the lastvalue written by PE0? • Semaphores • Lock resources(memory areas or ...)while being updatedby one PE

Synchronisation • Semaphore • Data structure in memory • Count of waiters • -1 = resource free • >= 0 resource in use • Pointer to list of waiters • Two operations • Wait • Proceed immediately if resource free(waiter count = -1) • Notify • Advise semaphore that you have finished with resource • Decrement waiter count • First waiter will be given control

Semaphores - Implementation • Scenario • Semaphore free (-1) • PE0: wait .. • Resource free, so PE0 uses it (sets 0) • PE1: wait .. • Reads count (0) • Starts to increment it .. • PE0notify .. • Gets bus and writes -1 • PE1:(finishing wait) • Adds 1 to 0, writes 1 to count, adds PE1 TCB to list • Stalemate! • Who issues notify to free the resource?

Atomic Operations • Problem • PE0 wrote a new value (-1) after PE1 had read the counter • PE1 increments the value it read (0) and writes it back • Solution • PE1’s read and update must be atomic • No other PE must gain access to counterwhile PE1 is updating • Usually an architecture will provide • Test and set instruction • Read a memory location, test it,if it’s 0, write a new value,else do nothing • Atomic or indivisible .. No other PE can access the value until the operation is complete

Atomic Operations • Test & Set • Read a memory location, test it,if it’s 0, write a new value,else do nothing • Can be used to guard a resource • When the location contains 0 -access to the resource is allowed • Non-zero value means the resource is locked • Semaphore: • Simple semaphore (no wait list) • Implement directly • Waiter “backs off” and tries again (rather than being queued) • Complex semaphore (with wait list) • Guards the wait counter

Atomic Operations • Processor must provide an atomic operation for • Multi-tasking or multi-threading on a single PE • Multiple processes • Interrupts occur at arbitrary points in time • including timer interrupts signaling end of time-slice • Any process can be interrupted in the middle of a read-modify-write sequence • Shared memory multi-processors • One PE can lose control of the bus after the read of a read-modify-write • Cache? • Later!

Atomic Operations • Variations • Provide equivalent capability • Sometimes appear in strange guises! • Read-modify-write bus transactions • Memory location is read, modified and written back as a single, indivisible operation • Test and exchange • Check register’s value, if 0, exchange with memory • Reservation Register (PowerPC) • lwarx - load word and reserve indexed • stwcx - store word conditional indexed • Reservation register stores address of reserved word • Reservation and use can be separated by sequence of instructions

Barriers • In shared memoryenvironment • PEs must know whenanother PE hasproduced a result • Simplest case:barrier for all PEs • Must be inserted byprogrammer • Potentially expensive • All PEs stall and waste time in the barrier

Cache? • What happens to cachedlocations?

Multiple Caches • Coherence • PEA reads location xfrom memory • Copy in cache A • PEB reads location x from memory • Copy in cache B • PEA adds 1

Multiple Caches - Inconsistent states • Coherence • PEA reads location xfrom memory • Copy in cache A • PEB reads location x from memory • Copy in cache B • PEA adds 1 • A’s copy now 201 • PEB reads location x • reads 200 from cache B

Multiple Caches - Inconsistent states • Coherence • PEA reads location xfrom memory • Copy in cache A • PEB reads location x from memory • Copy in cache B • PEA adds 1 • A’s copy now 201 • PEB reads location x • reads 200 from cache B • Caches and memory are now inconsistent ornotcoherent

Cache - Maintaining Coherence • Invalidate on write • PEA reads location xfrom memory • Copy in cache A • PEB reads location x from memory • Copy in cache B • PEA adds 1 • A’s copy now 201 • Issues invalidate x • Cache B marks x invalid • Invalidate is address transaction only

Cache - Maintaining Coherence • Reading the new value • PEB reads location x • Main memoryis wrong also • PEAsnoops read • Realises it hasvalid copy • PEA issues retry

Cache - Maintaining Coherence • Reading the new value • PEB reads location x • Main memoryis wrong also • PEAsnoops read • Realises it hasvalid copy • PEA issues retry • PEA writes x back • Memory now correct • PEB reads location x again • Reads latest version

Coherent Cache - Snooping • SIU “snoops” bus for transactions • Addresses compared with local cache • Matches • Initiate retries • Local copy is modified • Local copy is written to bus • Invalidate local copies • Another PE is writing • Mark local copies shared • second PE is readingsame value

Coherent Cache - MESI protocol • Cache line has 4 states • Invalid • Modified • Only valid copy • Memory copy is invalid • Exclusive • Only cached copy • Memory copy is valid • Shared • Multiple cached copies • Memory copy is valid

MESI State Diagram • Note the number of bus transactions needed! WH Write Hit WM Write Miss RH Read Hit RMS Read Miss Shared RME Read Miss Exclusive SHW Snoop Hit Write

Coherent Cache - The Cost • Cache coherency transactions • Additional transactions needed • Shared • Write Hit • Other caches must be notified • Modified • Other PE read • Push-out needed • Other PE write • Push-out needed - writing one word of n-word line • Invalid - modified in other cache • Read or write • Wait for push-out

Clusters • A bus which is too long becomes slow! • eg PCI is limited to 10 TTL loads • Lots of processors? • On the same bus • Bus speed must be limited • Low communication rate • Better to use a single PE! • Clusters • ~8 processors on a bus

Clusters £8 cache coherent (CC) processors on a bus Interconnect network ~100? clusters

Clusters Network Interface Unit Detects requests for “remote” memory

Clusters Message despatched to remote cluster’s NIU Memory Request Message

Clusters - Shared Memory • Non Uniform Memory Access • Access time to memory depends on location! From PEs in this cluster This memory is much closer than this one!

Clusters - Shared Memory • Non Uniform Memory Access • Access time to memory depends on location! Worse! NIU needs to maintain cache coherence across the entire machine

Clusters - Maintaining Cache Coherence • NIU (or equivalent) maintains directory • Directory Entries • All lines from local memory cached elsewhere • NIU software (firmware) • Checks memory requests against directory • Update directory • Send invalidate messages to other clusters • Fetch modified (dirty) lines from other clusters • Remote memory access cost • 100s of cycles! Directory (Cluster 2) Address Status Clusters 4340 S 1, 3, 8 5260 E 9

Clusters - “Off the shelf” • Commercial clusters • Provide page migration • Make copy of a remote page on the local PE • Programmer remains responsible for coherence • Don’t provide hardware support for cache coherence (across network) • Fully CC machines may never be available! • Software Systems • .... è

Shared Memory Systems • Software Systems • eg Treadmarks • Provide shared memory on page basis • Software • detects references to remote pages • moves copy to local memory • Reduces shared memory overhead • Provides some of the shared memory model convenience • Without swamping interconnection network with messages • Message overhead is too high for a single word! • Word basis is too expensive!!

Shared Memory Systems - Granularity • Granularity • Word basis is too expensive!! • Sharing data at low granularity • Fine grain sharing • Access / sharing for individual words • Overheads too high • Number of messages • Message overhead is high for one word • Compare • Burst access to memory • Don’t fetch a single word - • Overhead (bus protocol) is too high • Amortize cost of access over multiple words

Shared Memory Systems - Granularity • Coarse Grain Systems • Transferring data from cluster to cluster • Overhead • Messages • Updating directory • Amortise the overhead over a whole page • Lower relative overhead • Applies to thread size also • Split program into small threads of control • Parallel Overhead • cost of setting up & starting each thread • cost of synchronising at the end of a set of threads • Can be more efficient to run a single sequential thread!

Coarse Grain Systems • So far ... • Most experiments suggest that fine grain systems are impractical • Larger, coarser grain • Blocks of data • Threads of computation • needed to reduce overall computation time by using multiple processors • Too Fine grain parallel systems • can run slower than a single processor!

Parallel Overhead • Ideal • Time = 1/n • Add Overhead • Time > optimal • No point to usemore than4 PEs!!

Parallel Overhead • Shared memory systems • Best results if you • Share on large block basis eg page • Split program into coarse grain(long running) threads • Give away some parallelismto achieve any parallel speedup! • Coarse grain • Data • Computation There’s parallelism at the instruction level too!The instruction issue unit in a sequential processor is trying to exploit it!

Clusters - Improving multiple PE performance • Bandwidth to memory • Cache reduces dependency on the memory-CPU interface • 95% cache hits • 5% of memory accesses crossing the interface • but add • a few PEs and • a few CC transactions • even if the interface was coping before,it won’t in a multiprocessor system! A major bottleneck!

Clusters - Improving multiple PE performance • Bus protocols add to access time • Request / Grant / Release phases needed • “Point-to-point” is faster! • Cross-bar switch interface to memory • No PE contends with any other for the common bus Cross-bar? Name taken from old telephone exchanges!

Clusters - Memory Bandwidth • Modern Clusters • Use “Point-to-point” X-bar interfaces to memory to get bandwidth! • Cache coherence? • Now really hard!! • How does each cachesnoop all transactions?

Programming Model • Distributed Memory • Message passing • Alternative to shared memory • Each PE has own address space • PEs communicate with messages • Messages providesynchronisation • PE can block orwait for a message

Programming Model - Distributed Memory • Distributed Memory Systems • Hardware is simple! • Network can be as simple as ethernet • Networks of Workstations model • Commodity (cheap!) PEs • Commodity Network • Standard • Ethernet • ATM • Proprietary • Myrinet • Achilles (UWA!)

Programming Model - Distributed Memory • Distributed Memory Systems • Software is considered harder • Programmer responsible for • Distributing data to individual PEs • Explicit Thread control • Starting, stopping & synchronising • At least two commonly available systems • Parallel Virtual Machine (PVM) • Message Passing Interface (MPI) • Built on two operations • Send data, destPE, block | don’t block • Receive data, srcPE, block | don’t block • Blocking ensures synchronisation

Programming Model - Distributed Memory • Distributed Memory Systems • Performance generally better (versus shared memory) • Shared memory has hidden overheads • Grain size poorly chosen • eg data doesn’t fit into pages • Unnecessary coherencetransactions • Updating a shared region (each page)before end of computation • MP system waits and updates page when computation is complete

Programming Model - Distributed Memory • Distributed Memory Systems • Performance generally better (versus shared memory) • False sharing • Severely degrades performance • May not be apparent on superficial analysis Memory page PEa accesses this data This whole page ping-pongs between PEa and PEb PEb accesses this data

Distributed Memory - Summary • Simpler (almost trivial) hardware • Software • More programmer effort • Explicit data distribution • Explicit synchronisation • Performance generally better • Programmer knows more about the problem • Communicates only when necessary • Communication grain size can be optimum • Lower overheads

Data Flow • Conventional programming models are control driven • Instruction sequence is precisely specified • Sequence specifies control • which instruction the CPU will execute next • Execution rule: • Execute an instruction when its predecessor has completed • s1: r = a*b;s2: s = c*d;s3: y = r + s; s2 executes when s1 is complete s3 executes when s2 is complete

Computer Architecture