
Lecture 24: Intro to Multi-processors



  1. Lecture 24: Intro to Multi-processors. Michael B. Greenwald, Computer Architecture, CIS 501, Fall 1999

  2. Administrative stuff • Final exam will be in room Moore 216, 8:30-10:30am on Thursday, December 16th. • HW #6 delayed until Thursday, Dec. 9th. Q #3: the cache is write-allocate. Q #3: “adding proc[s] … not worthwhile” => you define, but be consistent! • Penn CISter’s women’s luncheon on Wednesday, December 8th, 12:30-2:30, Polar Bear Lounge (129 Pender), hosted by Professors Martha Palmer & Susan Davidson • Questions? Yblubin@seas.upenn.edu

  3. Multiprocessors • Processing organization: • SISD, SIMD, MISD, MIMD • Memory hardware organization: • UMA vs. NUMA • Memory software structure: • Shared Memory vs. Message Passing

  4. Fundamental Issues • 4 Issues to characterize parallel machines/systems 1) Naming 2) Synchronization 3) Latency and Bandwidth 4) Consistency

  5. Example: Small-Scale MP Designs • Memory: centralized with uniform memory access time (“UMA”), bus interconnect, and I/O • Caches serve to: • Increase bandwidth versus bus/memory • Reduce latency of access • Valuable for both private data and shared data • What about cache consistency? [Figure: several CPUs, each with 0 or more levels of cache, sharing a bus to main memory]

  6. Large-Scale MP Designs • Memory: distributed with nonuniform memory access time (“NUMA”) and a scalable interconnect (distributed memory) • Interconnect: low latency, high reliability • What about cache consistency? [Figure: per-node access costs of 1 cycle (cache), 40 cycles (local memory), and 100 cycles (remote memory over the interconnect)]

  7. Fundamental Issues • 4 issues to characterize parallel machines: 1) Naming 2) Synchronization 3) Latency and Bandwidth 4) Consistency • Details: 1) Snoopy cache 2) Directory cache 3) Synchronization 4) Memory consistency models

  8. Impact of multiple CPUs on cache • If location X is in more than one cache, it must have the same value in every cache! [Figure: four CPUs, each with its own cache, above a shared memory; a block may be dirty/migrated in one cache or replicated across several]

  9. Cache Coherence and consistency • Coherence: every cache/CPU must have a coherent view of memory • If P writes X to A and then reads A, and no other processor writes A in between, then P reads X • If P1 writes X to A, and no other processor writes to A, then P2 will eventually read X from A • If P1 writes X to A and P2 writes Y to A, then every processor sees the two writes in the same order: all read X then Y, or all read Y then X • Consistency: the memory consistency model tells us when writes to different locations will be seen by readers
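A minimal sketch of the stale-copy problem these rules forbid (a toy model with hypothetical names, not a hardware description): two private caches over one location, with no coherence mechanism, violate the second rule above.

#include <stdio.h>

int memory = 5;           /* the shared location A */
int cache[2];             /* each CPU's private copy */
int valid[2] = {0, 0};    /* does this cache currently hold A? */

int cpu_read(int cpu) {
    if (!valid[cpu]) {                 /* miss: fill from memory */
        cache[cpu] = memory;
        valid[cpu] = 1;
    }
    return cache[cpu];                 /* hit: return (possibly stale) copy */
}

void cpu_write(int cpu, int v) {
    cache[cpu] = v;                    /* write-back cache: memory and the */
    valid[cpu] = 1;                    /* other cache are left untouched   */
}

int main(void) {
    printf("P2 reads %d\n", cpu_read(1));   /* prints 5 */
    cpu_write(0, 7);                        /* P1 writes 7 to A */
    printf("P2 reads %d\n", cpu_read(1));   /* still prints 5: incoherent */
    return 0;
}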

  10. Potential HW Coherency Solutions • Snooping Solution (Snoopy Bus): • Send all requests for data to all processors • Processors snoop to see if they have a copy and respond accordingly • Requires broadcast, since caching information is at the processors • Works well with a bus (natural broadcast medium) • Dominates for small-scale machines (most of the market) • Directory-Based Schemes: • Keep track of what is being shared in one centralized place • Distributed memory => distributed directory for scalability (avoids bottlenecks) • Send point-to-point requests to processors via network • Scales better than snooping • Actually existed BEFORE snooping-based schemes

  11. Implementing Coherency Protocols • 2 basic approaches: • Write-invalidate • Write-update (a.k.a. write broadcast) • Write-invalidate: • Acquire exclusive ownership before writing (invalidate other cache copies) • Need to arbitrate between two attempted writes • Write-broadcast: • Broadcast the new value so every cached copy is updated in place (both approaches are sketched below)
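A minimal sketch of the two approaches over four snooping caches, reusing the toy array-of-copies model from above (names hypothetical):

#include <stdio.h>

#define NCPUS 4
int copy[NCPUS];                      /* per-cache copy of one block */
int has_copy[NCPUS] = {1, 1, 1, 1};   /* every cache starts as a sharer */

void write_invalidate(int writer, int v) {
    for (int i = 0; i < NCPUS; i++)   /* one bus transaction:        */
        if (i != writer)              /* invalidate all other copies */
            has_copy[i] = 0;
    copy[writer] = v;                 /* writer now owns the only copy */
}

void write_update(int writer, int v) {
    for (int i = 0; i < NCPUS; i++)   /* broadcast the new value:      */
        if (has_copy[i])              /* every sharer updates in place */
            copy[i] = v;
}

int main(void) {
    write_invalidate(0, 7);
    for (int i = 0; i < NCPUS; i++)
        printf("cache %d: %s\n", i, has_copy[i] ? "valid" : "invalidated");
    return 0;
}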

  12. Basic Snoopy Protocols • Write Invalidate Protocol: • Multiple readers, single writer • Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies • Read Miss: • Write-through: memory is always up-to-date • Write-back: snoop in caches to find most recent copy • Write Broadcast Protocol (typically write through): • Write to shared data: broadcast on bus, processors snoop, and update any copies • Read miss: memory is always up-to-date • Write serialization: bus serializes requests! • Bus is single point of arbitration

  13. Basic Snoopy Protocols • Write Invalidate versus Broadcast: • Invalidate requires one transaction per write-run • Invalidate uses spatial locality: one transaction per block • Broadcast has lower latency between write and read • Broadcast: BW (increased) vs. latency (decreased) tradeoff

  Name        | Protocol type    | Memory-write policy                       | Machines using / notes
  Write Once  | Write invalidate | Write back after first write              | First snoopy protocol
  Synapse N+1 | Write invalidate | Write back                                | 1st cache-coherent MPs
  Berkeley    | Write invalidate | Write back                                | Berkeley SPUR
  Illinois    | Write invalidate | Write back                                | SGI Power and Challenge
  “Firefly”   | Write broadcast  | Write back private, write through shared | SPARCCenter 2000
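For example, a write-run in which one processor writes a block ten times before any other processor touches it costs one bus transaction under write-invalidate (the initial invalidate) but ten under write-broadcast (one update per write).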

  14. Snooping Cache Variations • Basic Protocol: Exclusive, Shared, Invalid • Berkeley Protocol: Owned Exclusive, Owned Shared, Shared, Invalid; the owner can update via a bus invalidate operation, and the owner must write back when the block is replaced in the cache • Illinois Protocol: Private Dirty, Private Clean, Shared, Invalid; if a read is sourced from memory, the block becomes Private Clean, if sourced from another cache, Shared; can write in cache if held Private Clean or Private Dirty • MESI Protocol: Modified (private, ≠ memory), eXclusive (private, = memory), Shared (shared, = memory), Invalid
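For reference, the four state sets written as C enums (a sketch; names only, since the transition rules differ per protocol):

enum basic    { B_EXCLUSIVE, B_SHARED, B_INVALID };
enum berkeley { K_OWNED_EXCLUSIVE, K_OWNED_SHARED, K_SHARED, K_INVALID };
enum illinois { I_PRIVATE_DIRTY, I_PRIVATE_CLEAN, I_SHARED, I_INVALID };
enum mesi     { M_MODIFIED, M_EXCLUSIVE, M_SHARED, M_INVALID };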

  15. An Example Snoopy Protocol • Invalidation protocol, write-back cache • Each block of memory is in one state: • Clean in all caches and up-to-date in memory (Shared) • OR Dirty in exactly one cache (Exclusive) • OR Not in any caches • Each cache block is in one state (track these): • Shared: block can be read • OR Exclusive: cache has the only copy, it’s writeable, and dirty • OR Invalid: block contains no data • Read misses: cause all caches to snoop the bus • Writes to a clean line are treated as misses

  16. Snoopy-Cache State Machine, simple model: CPU requests • State machine for CPU requests, for each cache block (Shared = read only, Exclusive = read/write): • Invalid, CPU read: place read miss on bus; go to Shared • Invalid, CPU write: place write miss on bus; go to Exclusive • Shared, CPU read hit: stay Shared • Shared, CPU read miss: place read miss on bus; stay Shared • Shared, CPU write: place write miss on bus; go to Exclusive • Exclusive, CPU read hit or write hit: stay Exclusive • Exclusive, CPU read miss: write back block, place read miss on bus; go to Shared • Exclusive, CPU write miss: write back cache block, place write miss on bus; stay Exclusive
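The transition list above translates directly into C. A sketch, with hypothetical bus_*() stubs standing in for real bus transactions:

#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };
enum op    { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS };

static void bus_read_miss(void)  { puts("  place read miss on bus");  }
static void bus_write_miss(void) { puts("  place write miss on bus"); }
static void write_back(void)     { puts("  write back block");        }

enum state cpu_request(enum state s, enum op o) {
    switch (s) {
    case INVALID:
        if (o == READ_MISS)  { bus_read_miss();  return SHARED;    }
        if (o == WRITE_MISS) { bus_write_miss(); return EXCLUSIVE; }
        break;
    case SHARED:
        if (o == READ_HIT)   return SHARED;               /* no bus traffic */
        if (o == READ_MISS)  { bus_read_miss(); return SHARED; }
        if (o == WRITE_HIT || o == WRITE_MISS)            /* writes to a clean */
            { bus_write_miss(); return EXCLUSIVE; }       /* line are misses   */
        break;
    case EXCLUSIVE:
        if (o == READ_HIT || o == WRITE_HIT) return EXCLUSIVE;
        if (o == READ_MISS)  { write_back(); bus_read_miss();  return SHARED;    }
        if (o == WRITE_MISS) { write_back(); bus_write_miss(); return EXCLUSIVE; }
        break;
    }
    return s;   /* other combinations keep the current state */
}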

  17. Snoopy-Cache State Machine, simple model: remote (bus) operations • State machine for bus requests, for each cache block • Appendix E gives details of bus requests • Shared, write miss for this block: go to Invalid • Exclusive, write miss for this block: write back block (abort memory access); go to Invalid • Exclusive, read miss for this block: write back block (abort memory access); go to Shared
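A companion sketch for the bus side, continuing the code above (it reuses enum state and write_back() from the previous sketch):

/* Snoop a bus request against our copy of the block. */
enum state bus_snoop(enum state s, int remote_write, int addr_matches) {
    if (!addr_matches || s == INVALID)
        return s;                        /* not our block: ignore */
    if (s == SHARED)
        return remote_write ? INVALID : SHARED;
    /* s == EXCLUSIVE: we hold the only up-to-date copy */
    write_back();
    puts("  (abort memory access)");     /* memory's reply would be stale */
    return remote_write ? INVALID : SHARED;
}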

  18. More Accurate Snoopy-Cache State Machine • State machine for CPU and bus requests combined, for each memory block • Invalid state if in memory • Invalid, CPU read: place read miss on bus; go to Shared • Invalid, CPU write: place write miss on bus; go to Exclusive • Shared, CPU read hit: stay Shared • Shared, CPU write: place write miss on bus; go to Exclusive • Shared, remote write or miss due to address conflict: go to Invalid • Exclusive, CPU read hit or write hit: stay Exclusive • Exclusive, remote read: write back block; go to Shared • Exclusive, remote write or miss due to address conflict: write back block; go to Invalid

  19. Snoop Cache Extensions • Fourth state: Ownership • Shared -> Modified: needs an invalidate only (upgrade request), don’t read memory (Berkeley Protocol) • Clean exclusive state: no miss for private data on write (MESI Protocol) • Cache supplies data when in shared state: no memory access (Illinois Protocol) [Figure: extended state machine over Invalid, Shared (read only), Exclusive (read only), and Modified (read/write), with some bus actions marked tentative, e.g. “Place Data on Bus?”]

  20. Snooping Coherency Implementation Complications • Write Races: • Cannot update cache until bus is obtained • Otherwise, another processor may get bus first, and then write the same cache block! • Two step process: • Arbitrate for bus • Place miss on bus and complete operation • If miss occurs to block while waiting for bus, handle miss (invalidate may be needed) and then restart. • Split transaction bus: • Bus transaction is not atomic: can have multiple outstanding transactions for a block • Multiple misses can interleave, allowing two caches to grab block in the Exclusive state • Must track and prevent multiple misses for one block • Must support interventions and invalidations
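A minimal sketch of the two-step rule (all names hypothetical): the state is rechecked after winning arbitration, because a rival’s write miss may have invalidated the copy while this processor waited for the bus.

#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };
static enum state line = SHARED;            /* our copy of the block */

static void arbitrate_for_bus(void) { }     /* blocks until we own the bus */
static void place_write_miss(void)  { puts("place write miss on bus"); }

void coherent_write(int value) {
    arbitrate_for_bus();                    /* step 1: get the bus first;  */
                                            /* do NOT update the cache yet */
    if (line != EXCLUSIVE)                  /* recheck: state may have     */
        place_write_miss();                 /* changed while we waited     */
    line = EXCLUSIVE;                       /* step 2: complete operation  */
    printf("cache updated to %d\n", value); /* safe while we hold the bus  */
}

int main(void) { coherent_write(7); return 0; }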

  21. Implementing Snooping Caches • Multiple processors must be on the bus, with access to both addresses and data • Add a few new commands to perform coherency, in addition to read and write • Processors continuously snoop on the address bus • If an address matches a tag, either invalidate or update • Since every bus transaction checks cache tags, snooping could interfere with the CPU just to check: • solution 1: duplicate the set of tags for the L1 caches to allow checks in parallel with the CPU • solution 2: use the L2 cache, which already duplicates the L1 tags and is underutilized, provided L2 obeys inclusion with the L1 cache • block size and associativity of L2 then constrain L1

  22. Implementing Snooping Caches • The bus serializes writes: getting the bus ensures no one else can perform a memory operation • On a miss in a write-back cache, another cache may have the desired copy, and it may be dirty, so it must reply • Add an extra state bit to the cache to determine shared or not • Add a 4th state (MESI)

  23. Example • Trace over Bus, Processor 1, Processor 2, and Memory • Assumes the initial cache state is Invalid and A1 and A2 map to the same cache block, but A1 ≠ A2 [Repeats the state diagram from slide 18]

  24. Example: Step 1 • Same diagram and assumptions as above; the active transition is highlighted

  25. Example: Step 2 • Same diagram and assumptions as above

  26. Example: Step 3 • Same diagram and assumptions as above

  27. Example: Step 4 • Same diagram and assumptions as above

  28. Example: Step 5 • Same diagram and assumptions as above
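The per-step processor/bus/memory entries were drawn graphically in the original slides and are not recoverable from this transcript. Purely as an illustration, a hypothetical five-step trace (P1 writes A1; P1 reads A1; P2 reads A1; P2 writes A1; P2 writes A2) can be run through cpu_request() and bus_snoop() from the sketches after slides 16 and 17, assuming both are in scope:

int main(void) {
    enum state p1 = INVALID, p2 = INVALID;  /* one block frame per processor */

    /* Step 1: P1 writes A1 */
    p1 = cpu_request(p1, WRITE_MISS);       /* Invalid -> Exclusive */
    p2 = bus_snoop(p2, 1, 1);               /* P2 has no copy: stays Invalid */

    /* Step 2: P1 reads A1 */
    p1 = cpu_request(p1, READ_HIT);         /* stays Exclusive, no bus traffic */

    /* Step 3: P2 reads A1 */
    p2 = cpu_request(p2, READ_MISS);        /* Invalid -> Shared */
    p1 = bus_snoop(p1, 0, 1);               /* P1 writes back: Exclusive -> Shared */

    /* Step 4: P2 writes A1 */
    p2 = cpu_request(p2, WRITE_HIT);        /* Shared -> Exclusive (write miss on bus) */
    p1 = bus_snoop(p1, 1, 1);               /* P1: Shared -> Invalid */

    /* Step 5: P2 writes A2 (conflicts with A1 in the same frame) */
    p2 = cpu_request(p2, WRITE_MISS);       /* write back A1, then Exclusive for A2 */
    p1 = bus_snoop(p1, 1, 0);               /* different address: P1 unaffected */

    return 0;
}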
