
DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism

Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. Department of Computer Science, University of Illinois at Urbana-Champaign.




  1. DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism. Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. Department of Computer Science, University of Illinois at Urbana-Champaign

  2. Motivation • Shared memory is the de facto model for multicore SW and HW • BUT … • Complex SW: data races, unstructured parallelism, memory model, … • Inefficient HW: complex coherence/consistency, unnecessary traffic, … • Recent work on disciplined shared memory • SW: Easier programming model • HW: If SW is more disciplined, can we build more efficient HW? • DeNovo: Holistic rethinking of the entire memory hierarchy

  3. Disciplined Shared Memory • Disciplined Shared-Memory = Global address space + explicit, structured side-effects (in place of implicit, anywhere communication and synchronization)

  4. Disciplined Shared Memory • Disciplined shared memory: explicit effects + structured parallel control • Deterministic Parallel Java (DPJ), OOPSLA ‘09 – strong safety properties • Determinism-by-default, simple semantics • DeNovo, PACT ‘11 – performance, complexity and power efficient • Simplify coherence and consistency

  5. Limitation • DeNovo for deterministic programs • Important assumptions • No conflicting concurrent accesses, only barrier synchronization • Known side-effects • Allowed DeNovo to eliminate design complexity and inefficiency • Challenges for nondeterministic programs • The assumptions do not hold any more • Can have conflicting concurrent accesses, support lock synchronization • Side-effects unknown in critical sections • Applications with lock-based non-determinism are common

  6. Contribution • Disciplined shared memory: explicit effects + structured parallel control • Deterministic Parallel Java (DPJ) – strong safety properties • Determinism-by-default, simple semantics • Explicit & safe non-determinism, POPL ‘11 • DeNovoND: Non-deterministic codes with benefits of DeNovo • Minimal additional HW for non-determinism • Comparable performance to MESI • 30% lower network traffic than MESI • PLUS all advantages of DeNovo for deterministic codes

  7. Outline • Motivation • Background • DPJ/DeNovo for deterministic codes • DPJ support for disciplined non-determinism • DeNovoND Design • DeNovoND Implementation • Evaluation • Conclusion and Future Work

  8. DPJ for Deterministic Codes • Structured parallel control • Fork-join parallelism • Explicit region and effect • Regions divide heap • Read or write effects on regions • Data-race freedom guarantee • Simple, modular type checking [Figure: ST and LD operations on the heap, annotated “write effect”]

  9. DPJ for Deterministic Codes • Hardware – simplify coherence problems! • Java-compatible type system • Structured parallel control • Fork-join parallelism • Explicit region and effect • Regions divide heap • Read or write effects on regions • Data-race freedom guarantee • Simple, modular type checking [Figure: ST and LD operations on the heap, annotated “write effect”]
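The region-and-effect idea on slides 8 and 9 can be illustrated with a small check. DPJ performs this statically in its type checker; the sketch below only mimics the rule at run time on declared effects, and the names `interferes` and `check_parallel_phase` are hypothetical, not part of DPJ. It assumes two tasks conflict when they touch the same region and at least one of them writes it.

```python
from itertools import combinations

def interferes(effects_a, effects_b):
    """Each argument maps a region name to 'read' or 'write'.

    Two effect sets interfere if they share a region and at least
    one of the two effects on that region is a write.
    """
    for region, kind_a in effects_a.items():
        kind_b = effects_b.get(region)
        if kind_b is not None and "write" in (kind_a, kind_b):
            return True
    return False

def check_parallel_phase(task_effects):
    """Return the pairs of task names whose declared effects interfere."""
    return [
        (name_a, name_b)
        for (name_a, eff_a), (name_b, eff_b) in combinations(task_effects.items(), 2)
        if interferes(eff_a, eff_b)
    ]

# Hypothetical fork-join phase: t1 and t2 write disjoint regions and
# both read a shared region, so the phase is free of conflicts.
phase = {
    "t1": {"Left": "write", "Shared": "read"},
    "t2": {"Right": "write", "Shared": "read"},
}
conflicts = check_parallel_phase(phase)
```

Read-read sharing is allowed by this rule, which is why both tasks may read `Shared` in the same phase.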

  10. DeNovo for Deterministic Codes • Coherence Enforcement • Invalidate stale copies in private cache • Track up-to-date copy • Explicit effects • Compiler knows all writeable regions in this parallel phase • Cache can self-invalidate before next parallel phase • Registration • Directory keeps track of one up-to-date copy • Writer registers itself before next parallel phase
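A minimal software sketch of the two mechanisms on slide 10, self-invalidation and registration, is given below. It is a toy model under stated assumptions, not the authors' hardware: the class and function names are hypothetical, the store registers immediately instead of by the end of the phase, and the shared L2 is assumed to hold either the data or the id of the registered core for each word.

```python
from enum import Enum

class State(Enum):
    INVALID = "Invalid"
    VALID = "Valid"
    REGISTERED = "Registered"

class L1Cache:
    """Toy private cache: word address -> (state, value)."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.lines = {}

    def state(self, addr):
        return self.lines.get(addr, (State.INVALID, None))[0]

class SharedL2:
    """Toy shared L2: per word, either data or the registered core id."""
    def __init__(self):
        self.data = {}        # addr -> value
        self.registrant = {}  # addr -> core id holding the up-to-date copy

def load(cache, l2, caches, addr):
    st, val = cache.lines.get(addr, (State.INVALID, None))
    if st is not State.INVALID:
        return val                           # hit on a Valid or Registered word
    owner = l2.registrant.get(addr)
    if owner is not None:
        val = caches[owner].lines[addr][1]   # fetch from the registered core
    else:
        val = l2.data.get(addr, 0)           # fetch from the L2 data array
    cache.lines[addr] = (State.VALID, val)
    return val

def store(cache, l2, addr, value):
    # Simplification: register right away rather than at the phase boundary.
    cache.lines[addr] = (State.REGISTERED, value)
    l2.registrant[addr] = cache.core_id

def self_invalidate(cache, written_region):
    """At a phase boundary, drop Valid copies of words in written regions."""
    for addr in list(cache.lines):
        st, _ = cache.lines[addr]
        if addr in written_region and st is State.VALID:
            cache.lines[addr] = (State.INVALID, None)
```

Registered words survive self-invalidation because the writing core is known to hold the up-to-date copy, so no invalidation message ever has to be sent to other cores.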

  11. DeNovo for Deterministic Codes • No space overhead • Keep valid data or registered core id • LLC data arrays double as directory • No transient states • No invalidation traffic • No false sharing [Figure: state diagram over Invalid, Valid and Registered, with Read and Write transitions]

  12. Example Run [Figure: L1 of Core 1, L1 of Core 2 and Shared L2; X and Y in DeNovo-region. Labels: ST, Registration, Ack, self-invalidate( ). Legend: Registered, Valid, Invalid]

  13. DPJ Support for Safe Non-Determinism • Nondeterminism comes from conflicting concurrent accesses • Isolate these accesses as “atomic” • Enclosed in “atomic” sections • “Atomic” regions and effects • “Disciplined” non-determinism • Race freedom, strong isolation • Determinism-by-default semantics • DeNovoND converts “atomic” statements into locks

  14. Outline • Motivation • Background • DeNovoND Design • Memory Consistency Model • Distributed Queue-based Lock • DeNovoND Implementation • Evaluation • Conclusion and Future Work

  15. Memory Consistency Model • Deterministic accesses: a load sees the last store to the same address from • The same task in this parallel phase • Or before this parallel phase • Enforced by the DeNovo coherence mechanism [Figure: ST 0xa followed by LD 0xa across a parallel phase]

  16. Memory Consistency Model • Non-deterministic accesses: a load sees the last store to the same address from • The same task in this parallel phase • Or before this parallel phase • Or in preceding critical sections [Figure: ST 0xa in a parallel phase, ST 0xa in a critical section, then LD 0xa]

  17. Coherence for non-deterministic data • Coherence Enforcement • Invalidate stale copies in private cache • Track up-to-date copy • When to invalidate? • Between the start of critical section and any read • What to invalidate? • Entire cache? regions with “atomic” effect? • Track atomic writes in a signature, transfer with lock • Registration • Writer updates before next critical section
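The "track atomic writes in a signature, transfer with lock" step on slide 17 can be sketched as follows. This is an illustrative model only: a plain Python set stands in for the hardware signature, stores write through to the shared level for brevity, and all names (`Core`, `SharedL2`, `atomic_store`, `atomic_load`, `transfer_lock`) are hypothetical.

```python
class SharedL2:
    """Toy shared level: data plus the name of the registered writer."""
    def __init__(self):
        self.data = {}
        self.owner = {}

class Core:
    def __init__(self, name):
        self.name = name
        self.cache = {}         # addr -> cached value
        self.signature = set()  # stand-in for the per-core write signature
        self.misses = 0

def atomic_store(core, l2, addr, value):
    core.cache[addr] = value
    core.signature.add(addr)     # remember what was written atomically
    l2.data[addr] = value        # simplification: write through
    l2.owner[addr] = core.name   # registration

def transfer_lock(releaser, acquirer):
    # The signature travels with the lock and is merged at the acquirer.
    acquirer.signature |= releaser.signature

def atomic_load(core, l2, addr):
    registered = l2.owner.get(addr) == core.name
    # A cached copy is suspect if the signature names it and this core
    # is not the registered writer: self-invalidate and refetch.
    stale = addr in core.signature and not registered
    if addr not in core.cache or stale:
        core.cache[addr] = l2.data.get(addr, 0)
        core.misses += 1
    return core.cache[addr]
```

A core that never receives the signature keeps reading its old copy, which is why the invalidation has to happen between the start of the critical section and the first read.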

  18. Distributed Queue-based Lock • Lock primitive that works on DeNovoND • No directory, no write invalidation ⇒ no spinning for lock • Modeled after QOSB Lock • Lock requests form a distributed queue • But much simpler • Details in the paper
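The slide defers the lock protocol to the paper, so the following is only one plausible reading of "lock requests form a distributed queue", written as a sequential toy. It assumes the shared level remembers the most recent requester (the tail) and that each waiting core learns its successor, so a release hands the lock directly to the next core without any spinning. `QueueLock` and `LockCore` are hypothetical names.

```python
class LockCore:
    def __init__(self, cid):
        self.cid = cid
        self.holds_lock = False
        self.next_waiter = None   # successor in the distributed queue

class QueueLock:
    def __init__(self):
        self.tail = None          # last requester, tracked at the shared level

    def acquire(self, core):
        """Enqueue the request; return True if the lock was granted at once."""
        prev, self.tail = self.tail, core
        if prev is None:
            core.holds_lock = True
        else:
            prev.next_waiter = core   # previous tail now knows its successor
        return core.holds_lock

    def release(self, core):
        """Pass the lock to the successor, if any, and return that core."""
        assert core.holds_lock
        core.holds_lock = False
        succ, core.next_waiter = core.next_waiter, None
        if succ is not None:
            succ.holds_lock = True    # direct core-to-core lock transfer
        elif self.tail is core:
            self.tail = None          # queue is empty again
        return succ
```

Because each core only stores a pointer to its successor, the queue state is spread across the waiting cores instead of being held in a central directory.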

  19. Outline • Motivation • Background • DeNovoND Design • DeNovoND Implementation • Evaluation • Conclusion and Future Work

  20. Access Signatures • Simple and small hardware Bloom filter per core • Track accesses with “atomic” effects only • Only 256 bits suffice • Operations on Bloom filter • On write: insert address • On read: query filter for address for self-invalidation
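A 256-bit Bloom filter with the two operations listed on slide 20 can be sketched in a few lines. The hash functions here are an assumption made for illustration (hardware would use much simpler hashes), as are the class name `AccessSignature` and the choice of two hash positions per address.

```python
import hashlib

class AccessSignature:
    """Toy 256-bit Bloom filter over addresses."""
    BITS = 256

    def __init__(self, num_hashes=2):
        self.bits = 0                 # the whole filter fits in one integer
        self.num_hashes = num_hashes

    def _positions(self, addr):
        # Assumed hashing scheme: derive each bit position from SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.BITS

    def insert(self, addr):
        """On an atomic write: set the bits for this address."""
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_contain(self, addr):
        """On an atomic read: a True answer means self-invalidate."""
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def reset(self):
        self.bits = 0
```

A Bloom filter can report false positives but never false negatives, so the worst case is an unnecessary self-invalidation, never a missed one.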

  21. Example Run [Figure: L1 of Core 1, L1 of Core 2 and Shared L2; X and Y in DeNovo-region, Z and W in atomic DeNovo-region. Labels: LD, ST, Read miss, Registration, Ack, lock transfer, self-invalidate( ), reset filter]

  22. Optimization to reduce self-invalidation • Loads in Registered state • “Touched-atomic” bit • Set on first atomic load • Subsequent loads don’t self-invalidate • More in the paper [Figure: X and Y in DeNovo-region, Z and W in atomic DeNovo-region; ST and LD sequence with self-invalidate( )]
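The effect of the “touched-atomic” bit can be shown by counting refetches over a sequence of loads inside one critical section. The sketch assumes that every load whose address hits in the signature triggers a self-invalidation unless the word's bit is already set; `count_refetches` is a hypothetical helper, and Python sets stand in for the signature and the per-word bits.

```python
def count_refetches(loads, signature, use_touched_bit):
    """Count self-invalidations for a sequence of atomic loads.

    loads: addresses read inside one critical section, in order
    signature: addresses the signature reports as possibly written
    use_touched_bit: whether the per-word touched-atomic bit is modeled
    """
    touched = set()   # words whose touched-atomic bit is set
    refetches = 0
    for addr in loads:
        already_fresh = use_touched_bit and addr in touched
        if addr in signature and not already_fresh:
            refetches += 1       # self-invalidate and refetch the word
            touched.add(addr)    # the first atomic load sets the bit
    return refetches

loads = ["Z", "Z", "W", "Z"]
without_bit = count_refetches(loads, {"Z", "W"}, use_touched_bit=False)
with_bit = count_refetches(loads, {"Z", "W"}, use_touched_bit=True)
```

With the bit, only the first load of each word pays for a refetch; later loads of the same word hit in the cache.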

  23. Overheads • Hardware Bloom filter • 256 bits per core • Storage overhead • One additional state, but no storage overhead (2 bits) • “Touched-atomic” bit per word in L1 • Communication overhead • Bloom filter piggybacked on lock transfer message • Writeback messages for locks • Lock writebacks carry more info

  24. Evaluation Methodology • Simulator: Simics + GEMS + Garnet • System Parameters • 16 in-order cores • Workloads • SPLASH-2, PARSEC and STAMP • Unchanged except region/effect and self-invalidation • Protocols • MESI and DeNovoND • With idealized locks and realistic locks

  25. MESI vs. DeNovoND: Idealized lock • DeNovoND performs comparably to MESI for all apps • For both DIL-INF and DIL-256 [Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

  26. MESI vs. DeNovoND: Realistic lock • pthread lock vs. distributed queue-based lock • DeNovoND performs comparably to or better than MESI [Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

  27. Network Traffic (Realistic lock) • DeNovoND has 33% less traffic than MESI (67% max) • No invalidation traffic • Reduced load misses due to lack of false sharing [Figure: per-application results for barnes, ocean, water, fluidanimate, streamcluster, tsp, kmeans, ssca2]

  28. Conclusions and Future Work • DeNovoND: Efficient HW support for non-determinism • Minimal additional HW for safe non-determinism • Comparable performance to MESI • 30% lower network traffic than MESI • PLUS all advantages of DeNovo for deterministic codes • Future work: broaden the application space further • Pipeline parallelism, “lock-free” data structures, OS, legacy codes…
