ECE 569 High Performance Processors and Systems
• Administrative
• HW3: presentations
• Topics?
• Presentation dates? Options:
  • Tues 3/11
  • Tues 3/18
  • Thurs 3/20
• Rank your choice: 1st, 2nd, 3rd
ECE 569 -- 04 Mar 2014

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
• By:
  • Ravi Rajwar
  • James Goodman
  • U. of Wisconsin, Madison
• Where:
  • 34th International Symposium on Microarchitecture (MICRO)
  • Dec 2001, Austin TX

High performance requires parallel execution
• Typically threads running across cores
[Figure: threads T1 to T4, four cores, memory]

Locks are often needed to ensure correct execution
• e.g. when threads are accessing a shared resource
• Locks can significantly impact performance
[Figure: threads T1 to T4 with Lock/Unlock, four cores, memory]

Key insight:
• Acquiring lock is often unnecessary for correct execution
Why is lock unnecessary? First…
• Threads don't enter critical section at the same time
[Figure: threads T1 to T4 with Lock/Unlock]

Why is lock unnecessary? Second…
• Threads access different parts of shared resource
• Example: hash table
• Locks are used to prevent race conditions, but a race only occurs when keys hash to the same index
[Figure: threads T2 and T3 accessing a hash table at hash(x) and hash(y)]

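The hash table point can be sketched in a few lines of Python. This is only an illustration: the table size, the modulo hash function, and the helper names `bucket_index` and `inserts_conflict` are assumptions, not taken from the slides or the paper.

```python
# Hypothetical sketch: which pairs of concurrent inserts actually conflict?
NUM_BUCKETS = 8  # assumed table size, for illustration only


def bucket_index(key: int) -> int:
    """Toy hash function (an assumption): map an integer key to a bucket."""
    return key % NUM_BUCKETS


def inserts_conflict(key_a: int, key_b: int) -> bool:
    """Two concurrent inserts race only if they touch the same bucket."""
    return bucket_index(key_a) == bucket_index(key_b)


# Thread T2 inserts x while thread T3 inserts y.
x, y = 3, 12  # buckets 3 and 4: different indices, so no race
print(inserts_conflict(x, y))

z = 11        # 11 % 8 == 3: same bucket as x, so a real conflict
print(inserts_conflict(x, z))
```

In the first case a single table-wide lock would have serialized two inserts that never touch the same data.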
Idea:
• Let hardware dynamically identify lock
• HW speculatively executes critical section without lock
• If HW detects memory conflict, discards state & re-executes with lock
• If HW reaches unlock, commits state & skips unlock
Uses existing cache coherence protocols and speculation support: new HW not required!

How?
• Let hardware dynamically identify lock
  • Identify: the lock acquire (store) in Load-Locked/Store-Conditional
• HW speculatively executes critical section without lock
  • Execute: like branch prediction, perform speculative execution
• If HW detects memory conflict, discards state & re-executes with lock
  • Detect: use cache coherence to detect (1) data read is modified by another thread, or (2) data written is read/written by another thread
• If HW reaches unlock, skips unlock & commits state
  • Identify: store to same location as Store-Conditional, at which point commit state, exit speculation

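The execute, detect, and commit steps can be modeled in software. The sketch below is a toy model under stated assumptions, not the hardware mechanism itself: the `Speculation` class, its method names, the dictionary standing in for memory, and `increment_with_sle` are all hypothetical, and `observe_remote` merely stands in for a coherence request from another core.

```python
from dataclasses import dataclass, field


@dataclass
class Speculation:
    """Hypothetical model of one thread's elided critical section."""
    memory: dict                                      # shared memory: address -> value
    read_set: set = field(default_factory=set)
    write_buffer: dict = field(default_factory=dict)  # private until commit
    aborted: bool = False

    def load(self, addr):
        self.read_set.add(addr)
        if addr in self.write_buffer:   # see our own speculative write first
            return self.write_buffer[addr]
        return self.memory[addr]

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def observe_remote(self, addr, is_write):
        """Stand-in for a coherence request from another core."""
        if addr in self.write_buffer:             # data we wrote is read/written
            self.aborted = True
        elif is_write and addr in self.read_set:  # data we read is modified
            self.aborted = True

    def commit(self):
        """Reached the unlock: publish buffered writes unless a conflict occurred."""
        if self.aborted:
            return False
        self.memory.update(self.write_buffer)
        return True


def increment_with_sle(memory, addr, remote_events):
    """Try the critical section without the lock; fall back to locking on conflict."""
    spec = Speculation(memory)
    spec.store(addr, spec.load(addr) + 1)
    for remote_addr, is_write in remote_events:
        spec.observe_remote(remote_addr, is_write)
    if spec.commit():
        return "elided"
    # Misspeculation: the buffered state is dropped and the work is redone under the lock.
    memory["lock"] = 1
    memory[addr] += 1
    memory["lock"] = 0
    return "locked"


memory = {"lock": 0, "counter": 0, "other": 5}
print(increment_with_sle(memory, "counter", [("other", True)]))     # no overlap
print(increment_with_sle(memory, "counter", [("counter", False)]))  # conflict
```

In both calls the counter ends up incremented; the difference is only whether the lock word was ever written.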
Silent store-pair elision:
• By skipping both stores (lock acquire and lock release), lock remains unlocked so no one waits!

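A small sketch of why the two stores can be dropped as a pair: the release restores the value the lock held before the acquire, so skipping both leaves the lock word unchanged. The `FREE`/`HELD` encoding and both function names are assumptions made for illustration.

```python
FREE, HELD = 0, 1  # assumed lock encoding


def is_silent_pair(value_before, acquire_value, release_value):
    """The acquire and release stores cancel out if the release restores
    the value the lock held before the acquire."""
    return acquire_value != value_before and release_value == value_before


def visible_lock_values(value_before, elide):
    """Lock values other threads can observe across one critical section."""
    if elide:
        return [value_before]          # both stores skipped
    return [value_before, HELD, FREE]  # acquire store, then release store


print(is_silent_pair(FREE, HELD, FREE))
print(visible_lock_values(FREE, elide=True))
print(visible_lock_values(FREE, elide=False))
```

With elision, other threads never observe `HELD`, so none of them spin or block on the lock.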
Results?
• CMP: chip multiprocessor
• SMP: shared-memory multiprocessor
• DSM: distributed shared memory
Taller is better!

Results?
• Normalized execution time (we want < 1.0, lower is better)
• Portion spent accessing & waiting on locks
• None were slower
• Many were faster
• A few significantly faster

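For reference, normalized execution time is assumed here to mean the time with SLE divided by the time of the lock-based baseline. The function name and the cycle counts below are made up for illustration and are not results from the paper.

```python
def normalized_time(cycles_with_sle: float, cycles_baseline: float) -> float:
    """Execution time with SLE divided by the time of the lock-based baseline."""
    return cycles_with_sle / cycles_baseline


# Made-up cycle counts for illustration; not measurements from the paper.
baseline_cycles = 1_000_000
sle_cycles = 800_000
ratio = normalized_time(sle_cycles, baseline_cycles)
print(f"normalized time = {ratio:.2f}")  # below 1.0 means SLE was faster
```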
Benchmarks:
• Barnes: high lock and data contention
• Cholesky: shared work queues with high contention
• Mp3D: frequent locking but with little contention in critical section
• Radiosity: shared work queues with high contention
• Water-nsq: little contention
• Ocean-cont: conditional locking

Summary:
• It works!
• Has been added to recent high-end Intel processors
• Performance gains:
  • Less waiting for locks ==> more parallelism
  • Fewer locks ==> less waiting & cache disruption ==> reduced latency
  • Fewer accesses to locks ==> less memory traffic

Drawbacks? Costs?
• ?
• ?

Presentation
• Always have:
  • Motivation
  • Idea
  • Results
  • Pros and cons
  • Any insights beyond what paper says
• Text on slides + Visuals
• Demos if possible