A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASPLOS’06) • Rastislav Bodik, Mark D. Hill, Min Xu • Shimin Chen, LBA Reading Group Presentation
Why Do You Need a Recorder?

• % gcc sim.c
• % a.out
• Segmentation fault
• % gdb a.out
• gdb> run
• Program received SIGSEGV.
• In get() at hash.c:45
• 45 a = bucket->d;

• % gcc para-sim.c
• % a.out
• Segmentation fault
• % gdb a.out
• gdb> run
• Program exited normally.

• % gcc para-sim.c
• % a.out
• Segmentation fault
• Race recorded in “log”
• % gdb a.out log
• gdb> run
• Program received SIGSEGV.
• In get() at para-hash.c:67
• 67 a = bucket->d;
Applicability: • Programs: data races • Systems: non-SC • Ideally … • Long recording: small log • Low runtime overhead • Low cost
Flight Data Recorder (ISCA’03) • Full-system Record-Replay • Recording memory races: • Assumes Sequential Consistency (SC) • Records the order of instruction interleaving • Targets cache-coherent multiprocessor servers • Piggybacks on the coherence protocol: little extra H/W • Recording system states: SafetyNet • Recording I/Os • Results: • Non-trivial recording interval: 1 second • Negligible runtime overhead: less than 2% • Can be “Always On”
RTR • Better memory race log compression • 1 byte per kilo-instruction • Dealing with Total Store Ordering (TSO) • In this talk, I will try to give a full picture combining FDR and RTR.
Outline • Introduction • Recording System State • Recording Input/Output • Recording Memory Races • Dealing with TSO • Summary
Recording System State (based on SafetyNet) • Purpose: reconstruct the initial state (registers, TLB, main memory) at the beginning of the replay interval • Policy: FDR’s 1-second replay interval • Take a logical checkpoint every 1/3 second • Reserve memory space to store logs for 4 checkpoints • Logical checkpoint: • Quiesce the entire system to take a physical checkpoint • Register and TLB state (4248 bytes/processor on SPARC V9) • Log the old value of a cache line upon its first update • Add an “already-updated” bit per cache line
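The "log old value upon first update" bullet can be sketched as a small undo log. This is only an illustrative model, not SafetyNet's implementation: the class name `CheckpointLog`, the dict-based memory, and the use of undo-log membership in place of the per-line "already-updated" bit are all assumptions made for the example.

```python
_MISSING = object()  # marks a block that did not exist at checkpoint time


class CheckpointLog:
    """Hypothetical undo log covering one checkpoint interval."""

    def __init__(self, memory):
        self.memory = memory  # dict: block address -> value
        self.undo = {}        # old values; a key here == "already-updated" bit set

    def store(self, addr, value):
        # Only the first update of a block in this interval is logged.
        if addr not in self.undo:
            self.undo[addr] = self.memory.get(addr, _MISSING)
        self.memory[addr] = value

    def rollback(self):
        # Restore the state as of the start of the interval.
        for addr, old in self.undo.items():
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo.clear()
```

Repeated stores to the same block add no further log entries, which is what keeps the log small enough to hold several checkpoints in reserved memory.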
Recording I/O • Instruction count + interrupt number • DMA store values • I/O loads
Log All Dependences
• Thread I: 1 ld A • 2 st B • 3 st C • 4 ld D • 5 sub • 6 ld B
• Thread J: 1 add • 2 st C • 3 ld B • 4 st A • 5 st C • 6 st D
• Dependence log (16 bytes per entry): Log J: 2→3, 1→4, 3→5, 4→6 • Log I: 2→3
• Log size: 5*16 = 80 bytes (10 integers)
• But too many dependences
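To make the "log all dependences" step concrete, here is a minimal sketch that replays the two instruction streams from this slide and records every cross-thread conflict (write-read, read-write, write-write). The slide does not give the interleaving, so `SCHEDULE` is an assumed order chosen to be consistent with the logged dependences; the function name and the tuple layout are likewise hypothetical.

```python
THREAD_I = ["ld A", "st B", "st C", "ld D", "sub", "ld B"]
THREAD_J = ["add", "st C", "ld B", "st A", "st C", "st D"]
# Assumed interleaving: I1 I2 J1 J2 I3 I4 I5 I6 J3 J4 J5 J6
SCHEDULE = "IIJJIIIIJJJJ"


def log_all_dependences(programs, schedule):
    """Return (sender, sender_ic, receiver, receiver_ic) for every conflict."""
    ic = {t: 0 for t in programs}  # per-thread instruction count
    last_writer = {}               # addr -> (thread, ic) of the last store
    readers = {}                   # addr -> {thread: ic} of loads since that store
    deps = []
    for t in schedule:
        ic[t] += 1
        parts = programs[t][ic[t] - 1].split()
        if parts[0] not in ("ld", "st"):
            continue               # non-memory instructions only advance IC
        op, addr = parts
        writer = last_writer.get(addr)
        if writer and writer[0] != t:
            deps.append((writer[0], writer[1], t, ic[t]))
        if op == "st":
            for r, r_ic in readers.get(addr, {}).items():
                if r != t:
                    deps.append((r, r_ic, t, ic[t]))
            last_writer[addr] = (t, ic[t])
            readers[addr] = {}
        else:
            readers.setdefault(addr, {})[t] = ic[t]
    return deps
```

Calling `log_all_dependences({"I": THREAD_I, "J": THREAD_J}, SCHEDULE)` yields five dependences, which at 16 bytes each gives the 80-byte log on the slide.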
Netzer’s Transitive Reduction (TR), approximated by FDR
• Thread I: 1 ld A • 2 st B • 3 st C • 4 ld D • 5 sub • 6 ld B
• Thread J: 1 add • 2 st C • 3 ld B • 4 st A • 5 st C • 6 st D
• TR-reduced log: Log J: 2→3, 3→5, 4→6 • Log I: 2→3
• TR reduced: 1→4 is dropped because 2→3 already implies it
• Log size: 64 bytes (8 integers)
• How to further reduce log size?
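The reduction on this slide can be sketched as a simple filter: a receiver skips any dependence whose sender instruction count is not newer than the largest one it has already logged from that sender (the per-sender value the H/W slides call VIC). This is a two-thread simplification written for illustration, not FDR's exact hardware algorithm; the function name and tuple layout are assumptions.

```python
def transitive_reduction(deps):
    """Filter (sender, sender_ic, receiver, receiver_ic) tuples.

    A dependence is kept only if its sender_ic exceeds the largest
    sender_ic this receiver has already logged from that sender.
    """
    vic = {}  # (receiver, sender) -> largest sender IC already logged
    log = []
    # Each receiver must see its dependences in its own program order.
    for sender, s_ic, receiver, r_ic in sorted(deps, key=lambda d: (d[2], d[3])):
        if s_ic > vic.get((receiver, sender), 0):
            log.append((sender, s_ic, receiver, r_ic))
            vic[(receiver, sender)] = s_ic
    return log


ALL_DEPS = [
    ("I", 2, "J", 3),
    ("I", 1, "J", 4),
    ("I", 3, "J", 5),
    ("I", 4, "J", 6),
    ("J", 2, "I", 3),
]
```

On `ALL_DEPS`, the filter drops only the dependence from I's instruction 1 to J's instruction 4, leaving four entries (64 bytes at 16 bytes per entry).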
RTR • Actively creating artificial dependencies • Stricter • Vectorized
The Intuition of the RTR Algorithm • “Regulate” replay • [Figure: dependences after reduction, shown as vectors from I to J and from J to I]
Stricter Dependences to Aid Vectorization
• Thread I: 1 ld A • 2 st B • 3 st C • 4 ld D • 5 sub • 6 ld B
• Thread J: 1 add • 2 st C • 3 ld B • 4 st A • 5 st C • 6 st D
• New reduced log: Log J: 2→3, 4→5 (stricter) • Log I: 2→3
• Reduced log size: 48 bytes (6 integers)
• Fewer dependencies to log
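Why the single stricter dependence 4→5 can stand in for both 3→5 and 4→6 follows from program order: sender instruction 3 runs before 4, and receiver instruction 5 runs before 6. The sketch below checks that property for the logs on this slide. The helper names are hypothetical, and the check deliberately does not cover whether a stricter edge is safe to add (that it introduces no cycle).

```python
BYTES_PER_DEP = 16  # two 8-byte integers per logged dependence, as on the slides


def implied_by(dep, logged):
    """True if enforcing logged = (s2, r2) also enforces dep = (s, r).

    Both pairs must belong to the same sender/receiver thread pair.
    Program order gives s <= s2 on the sender and r2 <= r on the receiver.
    """
    s, r = dep
    s2, r2 = logged
    return s <= s2 and r2 <= r


def covers(strict_log, original_log):
    """True if every original dependence is implied by some logged one."""
    return all(any(implied_by(d, l) for l in strict_log) for d in original_log)


TR_LOG_J, TR_LOG_I = [(2, 3), (3, 5), (4, 6)], [(2, 3)]
RTR_LOG_J, RTR_LOG_I = [(2, 3), (4, 5)], [(2, 3)]
```

Here `covers(RTR_LOG_J, TR_LOG_J)` holds, so the three stricter-log entries (48 bytes) replace the four TR entries (64 bytes).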
Compress Vectorized Dependencies
• Thread I: 1 ld A • 2 st B • 3 st C • 4 ld D • 5 sub • 6 ld B
• Thread J: 1 add • 2 st C • 3 ld B • 4 st A • 5 st C • 6 st D
• Vectorized log: Log J: x=3,5, ∆=1 • Log I: x=3, ∆=1
• Log size: 40 bytes (5 integers)
• TR → RTR: fewer deps + fewer bytes/dep
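The size arithmetic on this slide can be reproduced with a small sketch that groups dependences sharing the same ∆ (receiver count minus sender count) and stores ∆ once per group. The encoding below mirrors only the slide's "x=…, ∆=…" notation; the actual RTR log format may differ, and the function names are assumptions.

```python
from collections import defaultdict

BYTES_PER_INT = 8  # the slides count 16 bytes per two-integer dependence


def vectorize(deps):
    """Group (sender_ic, receiver_ic) pairs by delta = receiver_ic - sender_ic.

    Returns {delta: [receiver_ic, ...]}; each sender_ic is x - delta.
    """
    vectors = defaultdict(list)
    for s, r in deps:
        vectors[r - s].append(r)
    return dict(vectors)


def encoded_ints(vectors):
    """One integer per x value plus one integer for each shared delta."""
    return sum(len(xs) + 1 for xs in vectors.values())


log_j = vectorize([(2, 3), (4, 5)])  # stricter Log J
log_i = vectorize([(2, 3)])          # Log I
total_bytes = (encoded_ints(log_j) + encoded_ints(log_i)) * BYTES_PER_INT
```

Log J becomes x=3,5 with ∆=1 (3 integers) and Log I becomes x=3 with ∆=1 (2 integers), for 5 integers in total.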
H/W Considerations • IC: instruction count per core (easy) • VIC[p]: largest time stamp previously seen from each sender p, used for transitive reduction • CTS[b]: time stamp per cache block b, i.e. record IC when a load/store commits • At commit time: • Figure out the memory address: how difficult? • Write CTS: decoupled timestamp memory
H/W Considerations Cont’d • Piggyback on cache coherence messages • FDR: CTS[b] • RTR: CTS[b] & sender’s IC • Logic to perform algorithm at the receiver side • FDR: integer comparison, update VIC[sender], generate log record • RTR: in addition, max/min, integer subtraction • Augment directory structure • Record last owner for evicted blocks • Cache must respond to inquiries about evicted blocks: reply with CTS[SET/LRU]
Total Store Ordering • FIFO write buffer • A store commits by placing its value into the write buffer • A store is ordered when it exits the write buffer and updates memory • Stores are ordered in commit order (FIFO) • A load can obtain its value from the write buffer or from the memory system
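The write-buffer rules above can be modelled in a few lines. This is a toy model for illustration only: the class name `TSOCore`, the shared dict standing in for memory, and the default value 0 for unwritten addresses are all assumptions.

```python
from collections import deque


class TSOCore:
    """Toy core with a FIFO write buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory         # shared dict: address -> value
        self.write_buffer = deque()  # committed but not yet ordered stores

    def store(self, addr, value):
        # Commit: the store enters the write buffer, invisible to other cores.
        self.write_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory.
        for a, v in reversed(self.write_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Order: the oldest buffered store leaves the buffer and updates memory.
        if self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.memory[addr] = value
```

With two cores sharing one memory, if core 0 stores to x and core 1 stores to y before either buffer drains, each core's load of the other's variable still returns the old value, while a load of its own variable is forwarded from the buffer. That outcome is impossible under SC, which is why a recorder that assumes SC needs extra handling for TSO.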
Problems with TSO • [Figure: two code examples annotated with /* XXX */ comments that give the memory order] • The two examples create cycles that will result in replay deadlocks
Solution • Identify problematic load instructions • Monitor invalidation in [t1, t2] • t1: the load (or the previous store that feeds the load) is ordered at memory • t2: all preceding instructions are ordered • Log load values and replay these load instructions by values • HW: similar to the misspeculation detection circuitry in SC systems (e.g. MIPS R10000) • Insufficient for supporting Processor Consistency and other more relaxed models
Conclusion • RTR: 1 byte/kilo-instruction • Based on Netzer’s transitive reduction • Creates stricter dependencies • Vectorizes dependencies to compress the log • Avoids overly strict dependencies, hence no replay deadlock