
Dynamic Performance Tuning of Word-Based Software Transactional Memory
Pascal Felber, Christof Fetzer, Torvald Riegel
Prepared by Gil Sadis


Presentation Transcript


  1. Dynamic Performance Tuning of Word-Based Software Transactional Memory. Pascal Felber, Christof Fetzer, Torvald Riegel. Prepared by Gil Sadis. Transactional memory

  2. Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline

  3. Glossary • TL2 – One of the fastest word-based software transactional memories designed by David Dice, Ori Shalev, and Nir Shavit in 2006 • TM – Transactional Memory • STM – Software Transactional Memory • Encounter-time locking – memory writes are done by first temporarily acquiring a lock for a given location, writing the value directly, and logging it in the undo log • Commit-time locking – locks memory locations only during the commit phase Introduction

  4. TM has been proposed as a lightweight mechanism to synchronize threads. TM alleviates many of the problems associated with locking. TM offers the benefits of transactions without incurring the overhead of a database: it makes memory behave transactionally, much like a database does. Introduction

  5. “There is no ‘one-size-fits-all’ STM implementation and adaptive mechanisms are necessary to make the most of an STM infrastructure.” • The performance of STM implementations depends on several factors: • Design – word-based vs. object-based, lock-based vs. non-blocking, write-through vs. write-back • Configuration parameters – for example, the number of locks or the mapping of locks to memory addresses • Workload – for example, the ratio of update to read-only transactions Introduction

  6. A new idea: TinySTM, a lightweight and highly efficient lock-based STM implementation that dynamically tunes its performance at runtime. It introduces novel mechanisms to reduce the validation cost for large read sets without increasing the abort rate. Introduction

  7. Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline

  8. Word-based TM • Access memory at the granularity of machine words or larger chunks of memory • More widely applicable, for example in applications that do not explicitly specify associated objects and run in unmanaged environments • Most word-based STM designs rely upon a shared array of locks to manage concurrent accesses to memory Related work

  9. Object-based TM • Access memory only at object granularity • Require the TM to be aware of the object associated with every access • An example of an object-based TM is the Lazy Snapshot Algorithm (LSA), which verifies at each object access that the view observed by the transaction is consistent Related work

  10. Time-based TM (TBTM) • Based on a notion of time or progress • A global time base to reason about the consistency of data accessed by transactions and about the order in which transactions commit • The simplest implementation for a global time base is a shared integer counter • On large systems in which contention on this counter results in a significant bottleneck, external clocks or multiple synchronized physical clocks can be used as scalable time bases Related work

  11. Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline

  12. Word-based STM implementation that uses locks to protect shared memory locations. Uses a time-based design. Uses a single-version, word-based variant of the LSA algorithm; it is very similar to TL2's algorithm but follows different design strategies on some key aspects. TinySTM

  13. Uses encounter-time locking for 2 main reasons: • Empirical observations indicate that detecting conflicts early often increases transaction throughput because transactions do not perform useless work. Commit-time locking may help avoid some read-write conflicts, but in general conflicts discovered at commit time cannot be resolved without aborting at least one transaction • It allows reads-after-writes to be handled efficiently without requiring expensive or complex mechanisms TinySTM

  14. TinySTM implements two strategies for accesses to memory: • Write-through – transactions directly write to memory and revert their updates in case they need to abort • Write-back – transactions delay their updates to memory until commit time TinySTM

  15. Like most word-based STM designs, TinySTM relies upon a shared array of locks to manage concurrent accesses to memory. Each lock covers a portion of the address space. Each lock is the size of an address, and its least significant bit is used to indicate whether the lock is owned. TinySTM – basic algorithm (locks and versions)

  16. If it is not owned, we store in the remaining bits a version number that corresponds to the commit timestamp of the transaction that last wrote to one of the memory locations covered by the lock • If the lock is owned, we store in the remaining bits an address to either the owner transaction (when using write-through), or an entry in the write set of the owner transaction (when using write-back). TinySTM – basic algorithm (locks and versions)

  17. When writing to a memory location, a transaction first identifies the lock entry that covers the memory address and atomically reads its value • If the lock bit is set, the transaction checks if it is the owner of the lock. In that case, it simply writes the new value and returns. Otherwise, the transaction can try to wait for some time or abort immediately; TinySTM uses the latter. • If the lock bit is not set, the transaction tries to acquire the lock by writing a new value in the entry TinySTM – basic algorithm (Reads & writes)
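The write path above can be sketched as follows. This is a simplified, single-threaded illustration with invented names: `locks[i]` holds the owning transaction or `None`, where a real word-based STM would perform an atomic compare-and-swap on a packed lock word.

```python
class Tx:
    def __init__(self):
        self.undo_log = []          # old values, for write-through abort

def stm_write(tx, locks, memory, addr, value):
    """Sketch of encounter-time locking on the write path."""
    idx = addr % len(locks)         # lock entry covering this address
    owner = locks[idx]              # atomic read of the lock word
    if owner is tx:                 # we already own it: write in place
        memory[addr] = value
        return True
    if owner is not None:           # owned by another transaction:
        return False                # abort immediately (TinySTM's choice)
    locks[idx] = tx                 # acquire the lock (CAS in reality)
    tx.undo_log.append((addr, memory.get(addr)))
    memory[addr] = value            # write-through: update memory directly
    return True
```

A caller would abort the transaction (reverting from `undo_log`) whenever `stm_write` returns `False`.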

  18. When reading a memory location, a transaction must verify that the lock is not owned. To that end, the transaction reads the lock, then the memory location, and finally the lock again. If the lock is not owned and its value (i.e., version number) did not change between both reads, then the value read is consistent. TinySTM – basic algorithm (Reads & writes)
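The lock–value–lock read sequence can be illustrated with a small sketch, assuming the lock-word convention that a set least significant bit means owned (names are invented for the sketch):

```python
def stm_read(locks, memory, addr):
    """Consistent read: lock, value, lock again."""
    idx = addr % len(locks)
    before = locks[idx]            # first read of the lock word
    if before & 1:                 # lock owned: cannot read consistently
        return None                # caller would abort or retry
    value = memory.get(addr)       # read the memory location itself
    after = locks[idx]             # second read of the lock word
    if after != before:            # version changed or lock was acquired
        return None
    return value
```

If both lock samples are equal and unowned, no committed writer can have changed the location between the two samples, so the value is consistent.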

  19. Write-through access • Updates are written directly to memory and previous values are stored in an undo log to be reinstated upon abort • Has lower commit-time overhead • Write-back access • Updates are stored in a write log and written to memory upon commit • Has lower abort overhead TinySTM – basic algorithm (Write-through vs. write-back)
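The two strategies differ in which log they keep and when memory is touched. A minimal sketch (hypothetical class names, with a plain dict standing in for memory):

```python
class WriteThroughTx:
    """Writes go straight to memory; an undo log restores old values on abort."""
    def __init__(self, memory):
        self.memory, self.undo_log = memory, []
    def write(self, addr, value):
        self.undo_log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value
    def abort(self):
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old          # reinstate previous values
    def commit(self):
        self.undo_log.clear()                # nothing to do: cheap commit

class WriteBackTx:
    """Writes are buffered in a write log and applied only at commit."""
    def __init__(self, memory):
        self.memory, self.write_log = memory, {}
    def write(self, addr, value):
        self.write_log[addr] = value
    def abort(self):
        self.write_log.clear()               # memory untouched: cheap abort
    def commit(self):
        self.memory.update(self.write_log)   # publish all updates at once
```

The trade-off from the slide is visible directly: `WriteThroughTx.commit` is trivial while its `abort` replays the undo log, and vice versa for `WriteBackTx`.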

  20. Using dynamic memory within transactions is not trivial: • Consider the case of a transaction that inserts an element in a dynamic data structure such as a linked list • If memory is allocated but the transaction fails, it might not be properly reclaimed, which results in memory leaks • One cannot free memory in a transaction unless one can guarantee that it will not abort • TinySTM provides memory-management functions that allow transactional code to use dynamic memory TinySTM – basic algorithm (Memory Management)

  21. TinySTM uses a shared counter as clock. In case contention on this global counter becomes a bottleneck in large systems, more scalable time bases such as an external clock or multiple synchronized physical clocks can be used. TinySTM – basic algorithm (Clock Management)

  22. TinySTM – Hierarchical Locking

  23. TinySTM maintains a smaller hierarchical array of h << l counters As atomic operations are costly on most architectures, the size of the hierarchical array must be chosen with care: larger h values reduce the validation overhead but may require more atomic operations TinySTM – Hierarchical Locking

  24. Memory addresses are mapped to the counters using a hash function. A counter covers multiple locks and the associated memory addresses. Two memory locations that are mapped to the same lock are also mapped to the same counter. TinySTM – Hierarchical Locking

  25. Calculation: • When choosing l as a multiple of h, typically l = 2^i, h = 2^j, i > j • lock index = (hash(addr) mod l) • counter index = (hash(addr) mod h) TinySTM – Hierarchical Locking
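A small sketch of this mapping, using illustrative values of l and h and a plain modulo standing in for the hash function: because h divides l, two addresses that collide on a lock necessarily collide on its counter.

```python
L_LOCKS = 2 ** 10       # l, number of locks (illustrative value)
H_COUNTERS = 2 ** 4     # h, size of the hierarchical array; h divides l

def lock_index(addr):
    return addr % L_LOCKS       # hash(addr) mod l

def counter_index(addr):
    return addr % H_COUNTERS    # hash(addr) mod h
```

If `addr1 % L_LOCKS == addr2 % L_LOCKS`, then `addr1 - addr2` is a multiple of l, hence also a multiple of h, so the two addresses share a counter, which is exactly the invariant the slide states.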

  26. Each transaction additionally maintains 2 private data structures: a read mask and a write mask of h bits each Read sets are partitioned into h independent parts When reading or writing a memory location, a transaction will first determine to which shared counter i in the hierarchical array it maps TinySTM – Hierarchical Locking
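The per-transaction masks can be sketched as plain h-bit integers (hypothetical names; a real implementation would derive the counter index from the lock index rather than hashing the address again):

```python
class TxMasks:
    """Bit i of a mask is set when the transaction touched a
    location covered by shared counter i of the hierarchical array."""
    def __init__(self, h):
        self.h = h
        self.read_mask = 0
        self.write_mask = 0
    def on_read(self, addr):
        self.read_mask |= 1 << (addr % self.h)
    def on_write(self, addr):
        self.write_mask |= 1 << (addr % self.h)
```

Partitioning the read set by counter index means validation only needs to revisit the partitions whose bit is set and whose shared counter has changed.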

  27. Evaluation used the same red-black tree benchmark application as used for the evaluation of TL2 and also a linked list All tests were run on an 8-core Intel Xeon machine at 2 GHz running Linux 2.6.18-4 (64-bit) TinySTM – Experimental Evaluation

  28. TinySTM – Experimental Evaluation

  29. TinySTM – Experimental Evaluation

  30. Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline

  31. TinySTM’s most important tuning parameters: • The hash function to map a memory location to a lock. TinySTM right-shifts the address and computes the rest modulo the size of the lock array (#shifts) • The number of entries in the lock array (l or #locks) • The size of the array used for the hierarchical locking (h) Dynamic Tuning

  32. Dynamic Tuning

  33. Dynamic Tuning

  34. The first observation is that with an increasing number of locks, we get an increase in throughput • A smaller number of locks could reduce the validation time of an update transaction (because we need to check fewer locks), but the performance penalty of false sharing dominates Dynamic Tuning

  35. The shift tuning parameter improves the sharing of locks within a transaction The number of shifts specifies how many consecutive words are assigned to the same lock Dynamic Tuning
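The effect of the shift parameter can be sketched as follows (treating `addr` as a word index; the function name and values are illustrative): right-shifting by s before taking the modulo makes 2^s consecutive words share a lock.

```python
def lock_index(addr, num_locks, shifts):
    # Dropping the low `shifts` bits maps runs of 2**shifts
    # consecutive words onto the same lock array entry.
    return (addr >> shifts) % num_locks
```

With `shifts=2`, word indices 0 through 3 all map to lock 0 and indices 4 through 7 to lock 1, so a transaction scanning consecutive words acquires far fewer locks.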

  36. A small array limits the overhead of atomic operations and permits a quick check of whether an update transaction can commit. However, too small an array will result in many false positives. Dynamic Tuning

  37. Tuning strategy: • Start with a sensible number of locks, 2^16; shift of 0; hierarchical array of size 1 • 8 possible moves: (1-2) double/halve the number of locks, (3-4) increase/decrease the number of shifts, (5-6) double/halve the size of the hierarchical array, (7) a nop, and (8) reverse • Reverse occurs when: • 2% performance decrease • 10% away from the configuration with the highest throughput so far Dynamic Tuning
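The hill-climbing loop described on this slide might look roughly like the sketch below. It is an assumption-laden reconstruction, not the paper's code: `measure` stands for a throughput measurement of a configuration, the reverse move is modeled as falling back to the best configuration seen so far, and the thresholds follow the 2%/10% rules from the slide.

```python
import random

def tune(measure, steps=100):
    """Hypothetical tuner: random-walk over configurations,
    reversing to the best-known one when throughput degrades."""
    cfg = {'locks': 2 ** 16, 'shifts': 0, 'h': 1}   # sensible start
    moves = [
        lambda c: {**c, 'locks': c['locks'] * 2},        # double locks
        lambda c: {**c, 'locks': max(1, c['locks'] // 2)},
        lambda c: {**c, 'shifts': c['shifts'] + 1},      # more shifts
        lambda c: {**c, 'shifts': max(0, c['shifts'] - 1)},
        lambda c: {**c, 'h': c['h'] * 2},                # grow hierarchy
        lambda c: {**c, 'h': max(1, c['h'] // 2)},
        lambda c: c,                                     # nop
    ]
    best_cfg, best_thr = cfg, measure(cfg)
    prev_thr = best_thr
    for _ in range(steps):
        nxt = random.choice(moves)(cfg)
        thr = measure(nxt)
        # Reverse: 2% drop vs. the previous step, or 10% below the best.
        if thr < 0.98 * prev_thr or thr < 0.9 * best_thr:
            nxt, thr = best_cfg, measure(best_cfg)
        if thr > best_thr:
            best_cfg, best_thr = nxt, thr
        cfg, prev_thr = nxt, thr
    return best_cfg
```

By construction the returned configuration never performs worse than the starting one, since `best_cfg` is only replaced by strictly better measurements.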

  38. Dynamic Tuning

  39. Introduction • Related Work • TinySTM • Basic Algorithm • Hierarchical Locking • Experimental Evaluation • Dynamic Tuning • Conclusions Outline

  40. Automatic tuning and adaptivity are especially important given that there is no agreement on what constitutes a typical workload or a good benchmark for transactional memory. It allows us to exploit the full potential of current TM designs, while being ready for workload classes yet to be identified. Conclusions

  41. Thank You!
