
LECTURE 5 A Brief History of TM



  1. LECTURE 5: A Brief History of TM

  2. Precursors of Computing: ENIAC • 5000 ops/second • $486K in 1946 • 19K vacuum tubes • 200K watts • 67 cubic meters

  3. Latest trends: Intel Nehalem • 1.9 billion transistors • 12 billion ops per second • 4 microprocessors • 8 MB of on-chip memory • 100 W • 246 square millimeters

  4. The Way: Not just Chip Frequency! • 1970s: Programmable controllers, single chip microprocessors • 1980s: Instruction pipelines, cache hierarchies • 1990s: Speculative execution, Superscalar processors • 2000s: Multicore chips, embedded computing

  5. Pipelining • Split the processing of an instruction into a series of independent steps • Classic pipeline • Instruction Fetch (IF) • Instruction Decode (ID) • Execute (EX) • Memory Access (MEM) • Register Write Back (WB)

  6. Pipelining Different parts of the CPU used for different stages of the pipeline

  7. Pipelining • Throughput is determined by the slowest pipeline stage, not by the latency of a whole instruction • More expensive design • Performance of a pipelined processor depends on the executing program, and is harder to predict than that of a non-pipelined processor
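
The throughput point above can be made concrete with a minimal sketch; the stage latencies and latch overhead below are illustrative assumptions, not figures from the lecture:

```java
// Hedged sketch: pipeline cycle time is set by the slowest stage, so
// throughput improves even though single-instruction latency does not.
public class PipelineThroughput {

    // Cycle time of a pipelined design: slowest stage plus latch overhead.
    static int cycleTime(int[] stageLatencies, int latchOverhead) {
        int slowest = 0;
        for (int latency : stageLatencies) slowest = Math.max(slowest, latency);
        return slowest + latchOverhead;
    }

    public static void main(String[] args) {
        // Classic 5-stage pipeline (IF, ID, EX, MEM, WB), latencies in ps.
        int[] stages = {200, 150, 250, 300, 100};
        // Unpipelined: one instruction per 1000 ps (sum of all stages).
        // Pipelined: one instruction per 320 ps (slowest stage + 20 ps latch).
        System.out.println("Pipelined cycle time: " + cycleTime(stages, 20) + " ps");
    }
}
```

Note how the 300 ps MEM stage alone determines the pipelined rate: shortening any other stage would not help throughput at all.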

  8. Superscalar • Executes multiple instructions per clock cycle by simultaneously dispatching to redundant functional units • Think of it as multiple parallel pipelines, each processing instructions from a single stream • Limitation: Degree of intrinsic parallelism in the stream

  9. Out of Order Execution (OOE) • Multiple instructions fetched • Instructions dispatched to an instruction queue (also called instruction buffer or reservation stations) • Instruction waits in the queue until the input operands are available • Note that the instruction may leave the queue before earlier instructions • Results are queued
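
The queue behavior above can be sketched as a toy issue loop; instruction and register names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of out-of-order issue: an instruction leaves the queue as
// soon as its input operands are ready, possibly before earlier instructions.
class OooQueue {
    static class Instr {
        final String name; final List<String> inputs; final String output;
        Instr(String name, List<String> inputs, String output) {
            this.name = name; this.inputs = inputs; this.output = output;
        }
    }

    // Repeatedly issues every instruction whose inputs are all available;
    // an issued instruction's result becomes available to later passes.
    static List<String> issue(List<Instr> queue, Set<String> ready) {
        List<Instr> waiting = new ArrayList<>(queue);
        List<String> order = new ArrayList<>();
        boolean progress = true;
        while (!waiting.isEmpty() && progress) {
            progress = false;
            for (Iterator<Instr> it = waiting.iterator(); it.hasNext(); ) {
                Instr i = it.next();
                if (ready.containsAll(i.inputs)) {
                    order.add(i.name);
                    ready.add(i.output);
                    it.remove();
                    progress = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Set<String> ready = new LinkedHashSet<>(List.of("r1", "r2", "r5"));
        List<Instr> q = List.of(
            new Instr("I1", List.of("r1", "r2"), "r3"),
            new Instr("I2", List.of("r9"), "r4"),   // waits on I3's result
            new Instr("I3", List.of("r5"), "r9"));
        // I2 is older than I3 but issues after it: out-of-order execution.
        System.out.println(issue(q, ready));        // [I1, I3, I2]
    }
}
```

I2 waits in the queue until its operand r9 is produced, while the younger I3 proceeds — exactly the "may leave the queue before earlier instructions" behavior described above.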

  10. Speculation in ILP • Pipelining, OOE, and superscalar execution all involve a degree of “speculation”, e.g. branch prediction • There has always been some speculation “circuitry” in processors

  11. Forms of parallelism • Functional: Perform tasks that are functionally different in parallel, e.g. building a house – plumber, carpenter, electrician • Pipeline: Perform tasks that are different in a particular order, e.g. lunch buffet • Data: Perform the same task on different data, e.g. grading exams, MapReduce

  12. Limitations of ILP • Finite amount of ILP in any sequence of instructions • Another possibility: Thread Level Parallelism (Functional parallelism) • How to get multiple threads? • Write parallel programs • Thread level speculation • Code parallelization

  13. Thread Level Speculation • Takes a sequence of instructions • Arbitrarily breaks it into a sequenced group of threads that may run in parallel • Allows for oblivious parallelization of sequential programs • Parallelization by speculation dynamically finds parallelism at runtime, and thus is not conservative

  14. Code parallelization • Implemented in compilers, e.g. SUIF • Problems: Hard to identify dependencies between pieces of code and data at compile time

  15. CMP (Chip Multiprocessors) • Forward data between parallel threads • Detect when reads occur too early • Safely discard speculative state after violations • Retire speculative writes in correct order • Examples: Stanford HYDRA, Wisconsin Multiscalar, CMU Stampede (1995-2000)

  16. Cache Coherence • Consistency of data stored in local caches of a shared resource (Wiki definition) • Protocols • MESI • MOESI • MOSI • MSI

  17. [Diagram: four processors P1–P4, each with a private CACHE, connected through an INTERCONNECTION NETWORK to MAIN MEMORY]

  18. 2-State Invalidation Cache Protocol • Write-through, no-allocate; the Valid bit indicates cache presence • Notation X / Y: action X triggers reaction Y • PrRd: processor read • PrWr: processor write • BusRd: fetch a cache block • BusWr: write through one word • --: no action • Transitions: on PrRd, INVALID goes to VALID with a BusRd; a PrRd hit in VALID needs no bus action; a PrWr issues a BusWr from either state (writes go straight through, with no allocation); a snooped BusWr moves VALID to INVALID
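
A hedged sketch of this protocol as a per-line state machine (method names are ours, not from the slides):

```java
// 2-state write-through, no-allocate invalidation protocol. Writes go
// straight through to memory (BusWr) and never change the line's state;
// a snooped BusWr from another processor invalidates our copy.
class TwoStateLine {
    enum State { VALID, INVALID }
    State state = State.INVALID;

    // Processor read: a miss fetches the block and validates the line.
    String prRd() {
        if (state == State.INVALID) { state = State.VALID; return "BusRd"; }
        return "--";                  // read hit, no bus action
    }

    // Processor write: always written through; no allocation on a miss.
    String prWr() { return "BusWr"; }

    // Bus snoop: another processor's write-through invalidates this copy.
    void snoopBusWr() { state = State.INVALID; }
}
```

This makes the cost visible: every single write appears on the bus, which is exactly the bandwidth problem slide 19 points out.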

  19. 2-State Protocol • Simple hardware and protocol • Requires high bandwidth (every write goes on the bus!)

  20. 3-state Protocol (MSI) • Modified • Shared • Invalid

  21. MSI State Diagram • INVALID: PrRd / BusRd moves to SHARED; PrWr / BusRdX moves to MODIFIED • SHARED: PrRd / -- (hit); PrWr / BusRdX moves to MODIFIED; a snooped BusRd needs no action; a snooped BusRdX moves to INVALID • MODIFIED: PrRd / -- and PrWr / -- (hits); a snooped BusRd triggers a write-back (BusWB) and moves to SHARED; a snooped BusRdX triggers BusWB and moves to INVALID
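
The MSI transitions can be sketched as a per-cache-line state machine; method names are ours, and the returned strings are the bus actions the cache issues in response ("--" means no action):

```java
// Hedged sketch of the MSI protocol. prRd/prWr model processor accesses;
// the snoop methods model bus transactions observed from other processors.
class MsiLine {
    enum State { MODIFIED, SHARED, INVALID }
    State state = State.INVALID;

    String prRd() {
        if (state == State.INVALID) { state = State.SHARED; return "BusRd"; }
        return "--";                       // read hit in SHARED or MODIFIED
    }

    String prWr() {
        if (state != State.MODIFIED) { state = State.MODIFIED; return "BusRdX"; }
        return "--";                       // write hit in MODIFIED
    }

    // Another processor reads the block: a MODIFIED copy is written back
    // (BusWB) and demoted to SHARED.
    String snoopBusRd() {
        if (state == State.MODIFIED) { state = State.SHARED; return "BusWB"; }
        return "--";
    }

    // Another processor reads-for-ownership: give up the line, writing it
    // back first if it was MODIFIED.
    String snoopBusRdX() {
        String reaction = (state == State.MODIFIED) ? "BusWB" : "--";
        state = State.INVALID;
        return reaction;
    }
}
```

Unlike the 2-state protocol, repeated writes to a MODIFIED line generate no bus traffic at all, which is where the bandwidth saving comes from.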

  22. Further Improvements • MESI: Illinois protocol • MOESI

  23. FIRST TRANSACTIONAL MEMORIES

  24. Precursors: Knight (1986) • Idea of TLS • Two caches per processor • The first proposal to use caches and cache coherence to maintain and enforce ordering among speculatively parallelized regions of sequential code in the presence of unknown memory dependencies

  25. The term “Transactional Memory” • Introduced by Herlihy and Moss in 1991 • Idea: Adapt the cache coherence protocol so that transactional accesses are monitored

  26. ISCA 93 • Six new instructions • Load-transactional • Load-transactional-exclusive • Store-transactional • Commit • Abort • Validate • New processor flags • Tactive: Is a transaction currently active? • Tstatus: Is the active transaction in progress, or aborted?

  27. Transactional Cache • States: MESI • Additional transactional tags: EMPTY, NORMAL, XCOMMIT, XABORT • Transactional operations create two entries: one with XCOMMIT and one with XABORT • Modifications made to XABORT on Store

  28. Three extra bus cycles • T_READ: On a transactional load • T_RFO: On a transactional load exclusive, or a store • BUSY: Returned when the cache is full, or for other reasons (prevents deadlocks and mutual aborts)

  29. Load_transactional • LT: • Search the transactional cache for an XABORT entry; return its data if one exists • If there is no XABORT entry, search for a NORMAL entry; change it to XABORT and allocate a second entry tagged XCOMMIT with the same data • Else, issue a T_READ cycle, which behaves like a read in Goodman’s protocol; two entries are created, tagged XABORT and XCOMMIT

  30. Load_transactional_exclusive • Similar to LT • On a miss, a T_RFO cycle is issued instead of T_READ

  31. Store • Similar to LTX • Changes the XABORT entry’s data too

  32. Validate • Returns the TSTATUS flag • If the TSTATUS flag is FALSE • Sets TSTATUS to TRUE • Sets TACTIVE to FALSE

  33. Abort • Discards the XABORT entries (sets their tags as EMPTY) • Sets the tags of XCOMMIT entries as NORMAL • Sets the TSTATUS to TRUE • Sets the TACTIVE to FALSE

  34. Commit • Discards the XCOMMIT entries (sets their tags to EMPTY) • Sets the tags of XABORT entries to NORMAL • Sets TSTATUS to TRUE • Sets TACTIVE to FALSE
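
Commit and abort thus reduce to flipping tags; a hedged sketch over a toy entry list (the Entry layout is an assumption for illustration, following slides 27, 33, and 34: XABORT entries hold the new speculative values, XCOMMIT entries the backups):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the transactional cache's commit/abort tag handling.
class TxCache {
    enum Tag { EMPTY, NORMAL, XCOMMIT, XABORT }

    static class Entry {
        Tag tag; int addr; int data;
        Entry(Tag tag, int addr, int data) {
            this.tag = tag; this.addr = addr; this.data = data;
        }
    }

    final List<Entry> entries = new ArrayList<>();

    // Commit: drop the backups, keep the speculative values as NORMAL data.
    void commit() {
        for (Entry e : entries) {
            if (e.tag == Tag.XCOMMIT) e.tag = Tag.EMPTY;
            else if (e.tag == Tag.XABORT) e.tag = Tag.NORMAL;
        }
    }

    // Abort: drop the speculative values, restore the backups as NORMAL data.
    void abort() {
        for (Entry e : entries) {
            if (e.tag == Tag.XABORT) e.tag = Tag.EMPTY;
            else if (e.tag == Tag.XCOMMIT) e.tag = Tag.NORMAL;
        }
    }
}
```

Because only tags change, both commit and abort complete locally in the cache, with no data movement on either path.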

  35. Digression • Why transactional memories instead of locks? • Locks create several problems and require programmers to use them properly • Priority inversion: A lower-priority process holding a lock is preempted while a higher-priority process needs that lock • Convoying: A process holding a lock is descheduled, and no other process that needs the lock can make progress • Deadlock: Two or more processes attempt to lock the same set of objects in different orders

  36. Digression • Transactional memory was invented as a faster means of performing lock-free synchronization • That is why the earliest TM implementations have no misspeculation: their aborts come from capacity constraints (in HTM) or lock contention

  37. Speculative Lock Elision (SLE) • Another reason to use TM! • Speculatively execute critical sections guarded by locks • Use cache coherence and rollback for recovery from misspeculation

  38. Hardware TMs in general • Great idea, efficient implementations • Limitations • High cost of implementation • Small transactional buffer sizes • Context switches • Solutions: Unbounded HTM

  39. SOFTWARE TM

  40. Advantage • More flexible than hardware, and allows experimenting with a variety of algorithms • Fewer limitations imposed by fixed-size hardware structures such as caches

  41. Access Granularity • Detects conflicting accesses on objects / words / regions • Object: Easy to implement, but many false conflicts • Word: Fewer false conflicts • Region: Lower overhead than word granularity

  42. Update • How the global memory is updated: Direct / deferred • Direct: The transaction modifies the object in place, logging the original value so it can be restored on abort • Deferred: The transaction makes local modifications, and changes global memory only on commit
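
A minimal sketch of the deferred policy, assuming a word-addressed shared memory array and a per-transaction write buffer (names are ours):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of deferred update: writes stay in a local buffer until
// commit, so abort is cheap (discard the buffer, nothing to undo).
class DeferredTx {
    private final int[] memory;                            // shared memory
    private final Map<Integer, Integer> writeSet = new HashMap<>();

    DeferredTx(int[] memory) { this.memory = memory; }

    // Reads see the transaction's own buffered writes first.
    int read(int addr) { return writeSet.getOrDefault(addr, memory[addr]); }

    // Writes stay local until commit.
    void write(int addr, int value) { writeSet.put(addr, value); }

    // Commit publishes the write set to global memory.
    void commit() {
        for (Map.Entry<Integer, Integer> w : writeSet.entrySet())
            memory[w.getKey()] = w.getValue();
        writeSet.clear();
    }

    void abort() { writeSet.clear(); }                     // discard changes
}
```

The trade-off against direct update is symmetric: deferred makes abort trivial but commit must copy the write set; direct makes commit trivial but abort must replay the undo log.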

  43. Conflict Detection • When are conflicts detected: Eager / lazy / mixed • What is a conflict: Multiple accesses to the same location, at least one of which is a write • To commit, a transaction must acquire every location it updates: eager if acquired at the first update operation, lazy if acquired at commit time • Mixed: Eagerly detects write/write conflicts, and lazily detects read/write conflicts
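
The "multiple accesses, one of them a write" rule can be sketched as a read/write-set intersection check (a simplification: real STMs detect conflicts incrementally, per access or at acquisition, rather than by comparing whole sets):

```java
import java.util.Set;

// Hedged sketch of the conflict rule: two transactions conflict when they
// access a common location and at least one of the accesses is a write.
class ConflictCheck {
    static boolean conflicts(Set<Integer> readsA, Set<Integer> writesA,
                             Set<Integer> readsB, Set<Integer> writesB) {
        return intersects(writesA, writesB)    // write/write conflict
            || intersects(writesA, readsB)     // A wrote what B read
            || intersects(readsA, writesB);    // B wrote what A read
    }

    private static boolean intersects(Set<Integer> x, Set<Integer> y) {
        for (int e : x) if (y.contains(e)) return true;
        return false;
    }
}
```

Read/read overlap is deliberately not a conflict, which is why the mixed policy above can afford to detect read/write conflicts lazily.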

  44. STM: 1995 • The memory locations to be accessed in a transaction must be known in advance • Lock-free: Transactions help each other • Motivation: Replace N-word CAS, implement lock-free data structures, etc.

  45. The System Model We assume that every shared memory location supports these four operations: • Write_i(L,v) - thread i writes v to L • Read_i(L,v) - thread i reads v from L • LL_i(L,v) - thread i reads v from L and marks that L was read by i • SC_i(L,v) - thread i writes v to L and returns success if L is still marked as read by i; otherwise it returns failure
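
A hedged single-location sketch of the LL/SC pair from this model (coarse synchronization stands in for the hardware, and a single mark is kept rather than one per thread, a simplification; any store clears the mark):

```java
// Hedged sketch of Load-Linked / Store-Conditional on one memory word.
class LLSCWord {
    private int value;
    private int marker = -1;        // id of the last load-linking thread

    // LL: read the value and mark the location as read by this thread.
    synchronized int ll(int threadId) { marker = threadId; return value; }

    // SC: succeeds only if this thread's mark is still intact, i.e. no
    // store intervened since its LL.
    synchronized boolean sc(int threadId, int newValue) {
        if (marker != threadId) return false;
        value = newValue;
        marker = -1;                // the store clears the mark
        return true;
    }

    // A plain write (e.g. by a competing thread) also clears the mark.
    synchronized void write(int v) { value = v; marker = -1; }

    synchronized int get() { return value; }
}
```

An LL/SC pair that fails can simply be retried, which is the building block the STM's helping mechanism relies on.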

  46. Thread

  class Rec {
      boolean stable = false;
      (boolean, int) status = (false, 0);  // can have two values…
      boolean allWritten = false;
      int version = 0;
      int size = 0;
      int locs[] = {null};
      int oldValues[] = {null};
  }

  Each thread is defined by an instance of the Rec class (short for record). The Rec instance defines the current transaction the thread is executing (only one transaction at a time).

  47. The STM Object [Diagram: the shared Memory array alongside an Ownerships array whose entries point to thread records Rec1, Rec2, …, Recn; each record holds status, version, size, locs[], and oldValues[]]

  48. Flow of a transaction [Flowchart: Thread i calls startTransaction, which runs initialize and then transaction; the transaction acquires Ownerships. On (Null, 0) it proceeds through agreeOldValues, calcNewValues, and updateMemory, releases Ownerships, and reports Success. On (Failure, failed loc) it releases Ownerships and, if it is the initiator (isInitiator? = T), initiates a helping transaction for the failed location with isInitiator := F]

  49. The STM Object

  public class STM {
      int memory[];
      Rec ownerships[];
      public (boolean, int[]) startTransaction(Rec rec, int[] dataSet) {...};
      private void initialize(Rec rec, int[] dataSet) {...};
      private void transaction(Rec rec, int version, boolean isInitiator) {...};
      private void acquireOwnerships(Rec rec, int version) {...};
      private void releaseOwnerships(Rec rec, int version) {...};
      private void agreeOldValues(Rec rec, int version) {...};
      private void updateMemory(Rec rec, int version, int[] newvalues) {...};
  }

  50. Implementation

  rec - the record of the thread that executes this transaction.
  dataSet - the locations in memory it needs to own.

  public (boolean, int[]) startTransaction(Rec rec, int[] dataSet) {
      initialize(rec, dataSet);
      rec.stable = true;   // notifies other threads that this thread can be helped
      transaction(rec, rec.version, true);
      rec.stable = false;
      rec.version++;
      if (rec.status) return (true, rec.oldValues);
      else return false;
  }
