Hoard: A Scalable Memory Allocator for Multithreaded Applications



  1. Hoard: A Scalable Memory Allocator for Multithreaded Applications Berger*, McKinley+, Blumofe*, Wilson* (*UT Austin, +University of Massachusetts) ASPLOS 2000. Presented by Bogdan Simion

  2. Dynamic Memory Allocation • Highly parallel applications are common • e.g., databases, web servers, assignment #2 • Dynamic memory allocation is ubiquitous • malloc, free, new, delete, etc. • Serial memory allocators are inadequate • Sufficient for correctness • Do not scale for multithreaded applications • Existing concurrent allocators do not meet all requirements

  3. Parallel Allocator Requirements • Speed • As fast as a serial allocator on a single-processor system • Scalability • Performance scales linearly with the number of processors • False sharing avoidance • Does not introduce false sharing of cache lines • Low fragmentation • Keep the ratio (memory allocated from the OS / memory in use by the application) low • How does fragmentation affect performance?

  4. Parallel Allocator Requirements • Speed • As fast as a serial allocator on a single-processor system • Scalability • Performance scales linearly with the number of processors • False sharing avoidance • Does not introduce false sharing of cache lines • Low fragmentation • Keep the ratio (memory allocated from the OS / memory in use by the application) low • Fragmentation affects performance through locality & swapping
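
A quick worked example of the fragmentation ratio (the numbers are illustrative, not from the paper): if the allocator holds 8 MB obtained from the OS while the application's live data peaks at 5 MB, fragmentation = 8 / 5 = 1.6. An ideal allocator keeps this ratio near 1; a high ratio spreads live data thin, hurting locality and potentially forcing swapping.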

  5. False Sharing • Multiple processors use distinct bytes that happen to sit on the same cache line, without actually sharing any data • CPU 1 uses the int @ 0x1000 and CPU 2 uses the int @ 0x1004 • Program induced: data passed between threads • How can an allocator avoid this?

  6. False Sharing • Multiple processors use distinct bytes that happen to sit on the same cache line, without actually sharing any data • CPU 1 uses the int @ 0x1000 and CPU 2 uses the int @ 0x1004 • Program induced: data passed between threads • How can an allocator avoid this? • Allocator induced: • Active: malloc returns heap objects on the same cache line to different threads • Passive: free allows a future malloc to produce false sharing
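
A minimal sketch of *active* allocator-induced false sharing, as a toy Python model (toy_malloc, CACHE_LINE, and the addresses are assumptions for illustration, not Hoard's or any real allocator's API): because the allocator draws consecutive blocks from one shared region, two threads receive blocks on the same cache line.

```python
# Toy model of allocator-induced *active* false sharing (illustrative,
# not Hoard): a bump allocator with one shared region hands adjacent
# blocks to whichever thread asks next.
CACHE_LINE = 64    # assumed cache-line size in bytes
BLOCK = 8          # block size handed out by the toy allocator

next_addr = 0x1000
def toy_malloc():
    """Return the next BLOCK-sized address from a single shared region."""
    global next_addr
    addr = next_addr
    next_addr += BLOCK
    return addr

a = toy_malloc()   # block handed to thread 1
b = toy_malloc()   # block handed to thread 2

print(hex(a), hex(b))                       # 0x1000 0x1008
print(a // CACHE_LINE == b // CACHE_LINE)   # True: same line => false sharing
```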

  7. Blowup • A special case of fragmentation: blowup = (max memory allocated by the parallel allocator) / (max memory allocated by an ideal uniprocessor allocator) • Can be unbounded, or grow linearly with the number of CPUs • Caused by a parallel allocator not using freed memory to satisfy future allocation requests • E.g., thread-private heaps with no ownership under a producer-consumer workload • All data is allocated on the producer's heap and released to the consumer's heap
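
A toy Python model of this producer-consumer blowup (the heap structures and counts are illustrative assumptions, not Hoard): with pure private heaps, the consumer's frees land on the consumer's heap, which the producer never touches, so every iteration takes fresh memory from the "OS".

```python
# Toy model (not Hoard) of blowup with pure private heaps: freed blocks
# are stranded on the consumer's heap and never reused by the producer.
from collections import defaultdict

heaps = defaultdict(list)   # per-thread free lists
os_blocks = 0               # blocks ever obtained from the OS

def malloc(thread):
    global os_blocks
    if heaps[thread]:                  # reuse only from the caller's own heap
        return heaps[thread].pop()
    os_blocks += 1                     # otherwise take fresh memory
    return object()

def free(thread, block):
    heaps[thread].append(block)        # freed to the *calling* thread's heap

for _ in range(1000):                  # producer allocates, consumer frees
    free("consumer", malloc("producer"))

print(os_blocks)                       # 1000: memory use grows without bound
print(len(heaps["consumer"]))          # 1000 blocks stranded, never reused
```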

  8. Related Work (i.e., What to Avoid in Assignment 2) • Serial single heap • 1 locked free list • Low fragmentation, quite fast • Lock contention => poor scaling • Active false sharing • Concurrent single heap • 1 locked free list per block size • Reduces to a serial single heap in the common case (most allocated objects are of only a few sizes) • Active false sharing • Too many expensive locks or atomic operations

  9. Related Work (i.e., What to Avoid in Assignment 2) • Multiple heaps: • Heap assignment: 1-to-1, round-robin, or a mapping function • 1. Pure private heaps • Blocks are freed to the calling thread's heap • Unbounded memory use under producer-consumer • Passive false sharing

  10. Related Work (i.e., What to Avoid in Assignment 2) • 2. Private heaps with ownership • Blocks are freed to the allocating thread's heap • Has O(P) blowup unless there is some mechanism for redistributing freed memory • Some actively induce false sharing • 3. Private heaps with thresholds • Vee and Hsu, DYNIX • Efficient and scalable • A hierarchy of per-processor heaps and shared heaps • O(1) blowup • Passively induce false sharing

  11. Related Work (i.e., What to Avoid in Assignment 2)

  12. Hoard's Scalable Memory Allocator • Fast (performance) • Highly scalable (with P, the number of processors) • Avoids false sharing • Memory efficient (low fragmentation, avoids blowup)

  13. Hoard's Design • Per-processor heaps and a single global heap • Threads are mapped to a processor's heap • N.B., a thread that is scheduled on another processor still uses its original processor's heap • Heaps are divided into page-aligned superblocks • Each superblock is divided into blocks • All of a superblock's blocks have the same size • Block sizes are powers of b: b, b², b³, b⁴, … • This bounds internal fragmentation
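
A small sketch of the power-of-b size classes in Python (the paper only requires block sizes b, b², b³, …; b = 2 and MIN_SIZE = 8 here are assumptions for illustration):

```python
# Round each request up to the nearest power-of-b block size, bounding
# internal fragmentation (wasted space per block) by a factor of b.
MIN_SIZE = 8   # smallest block handed out, in bytes (assumed)

def size_class(request):
    """Round a request up to the next power of two (b = 2 assumed)."""
    size = max(request, MIN_SIZE)
    return 1 << (size - 1).bit_length()

print(size_class(24))    # 32
print(size_class(100))   # 128: waste is bounded by a factor of b
```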

  14. Bounding Blowup • A heap owns some superblocks • Assigned from the global heap on allocation • A heap only allocates from superblocks it owns • When no memory is available in any superblock on a thread's heap: • Obtain a superblock from the global heap, if available • If not (the global heap is empty too), request a new superblock from the OS and add it to the thread's heap • Empty superblocks are not returned to the OS • They are reused instead

  15. Bounding Blowup • Superblocks are returned to the global heap when more than a fraction f (the empty fraction) of a heap's blocks are not in use • A superblock is moved only if the heap is more than f empty and holds more than K (a fixed constant) superblocks' worth of free space; otherwise no superblocks are moved to the global heap • Intuitively, these conditions maintain an invariant bounding the proportion of wasted space on each heap

  16. Bounding Blowup • Superblocks are returned to the global heap when more than a fraction f (the empty fraction) of a heap's blocks are not in use • Gives O(1) blowup • Also limits false sharing, since released superblocks are guaranteed to be at least f empty • "Fullness" groups: • bins of superblocks within the same fullness range • LIFO order => reuse a superblock that is already in memory, and likely a block that is already in cache => maintains good locality
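
A sketch of the release test behind these rules, in Python (the paper's per-heap invariant is u_i ≥ (1 − f)·a_i or u_i ≥ a_i − K·S; F and K below match the example on the next slide, while S is an assumed superblock size):

```python
# Decide whether heap i must release a superblock to the global heap:
# release when the heap is both more than f empty and carrying more
# than K superblocks' worth of slack.
F = 0.25     # empty fraction
K = 0        # fixed constant: superblocks' worth of allowed slack
S = 8192     # superblock size in bytes (assumed)

def should_release(u_i, a_i):
    """u_i = bytes in use on heap i; a_i = bytes the heap holds."""
    return u_i < (1 - F) * a_i and u_i < a_i - K * S

print(should_release(u_i=7000, a_i=8192))   # False: heap is not yet f empty
print(should_release(u_i=4000, a_i=8192))   # True: release an f-empty superblock
```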

  17. Example • f = 0.25 • K = 0

  18. Avoiding False Sharing • Heap allocations are made from superblocks • Different superblocks lie on different cache lines • Each superblock is owned by one heap • Avoids false sharing • Freed memory returns to its allocating superblock • Avoids passively-induced false sharing • How can the allocator still induce false sharing?

  19. Avoiding False Sharing • Heap allocations are made from superblocks • Different superblocks lie on different cache lines • Each superblock is owned by one heap • Avoids false sharing • Freed memory returns to its allocating superblock • Avoids passively-induced false sharing • How can the allocator still induce false sharing? • Multiple running threads may use the same heap • Superblocks returned to the global heap aren't completely empty, so their remaining blocks can end up shared across threads

  20. Malloc Algorithm
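
The pseudocode figure from this slide is not in the transcript; below is a self-contained toy Python model of the malloc flow (class names, nblocks = 16, and P = 2 are illustrative assumptions; real Hoard manages raw memory, takes a per-heap lock, and keeps per-size-class fullness groups). The order of attempts matches the design above: the thread's heap first, then the global heap, then a fresh superblock.

```python
# Toy model of Hoard's malloc path: fullest-usable superblock on the
# thread's heap, else the global heap, else a new superblock.
class Superblock:
    def __init__(self, block_size, nblocks=16):
        self.block_size = block_size
        self.free = list(range(nblocks))     # indices of free blocks
        self.total = nblocks
    def fullness(self):
        return 1 - len(self.free) / self.total

class Heap:
    def __init__(self):
        self.superblocks = []                # superblocks this heap owns
    def find_superblock(self, block_size):
        # Prefer the fullest usable superblock: keeps memory densely packed.
        usable = [s for s in self.superblocks
                  if s.block_size == block_size and s.free]
        return max(usable, key=Superblock.fullness, default=None)

P = 2
heaps = [Heap() for _ in range(P)]           # per-processor heaps
global_heap = Heap()                         # "heap 0" in the paper

def hoard_malloc(thread, block_size):
    heap = heaps[hash(thread) % P]           # map the thread to a heap
    sb = heap.find_superblock(block_size)
    if sb is None:                           # 2nd choice: the global heap
        sb = global_heap.find_superblock(block_size)
        if sb is not None:
            global_heap.superblocks.remove(sb)
            heap.superblocks.append(sb)      # transfer ownership
    if sb is None:                           # last resort: a fresh superblock
        sb = Superblock(block_size)
        heap.superblocks.append(sb)
    return sb, sb.free.pop()                 # (superblock, block index)

sb, idx = hoard_malloc("thread-1", 64)
print(sb.block_size, idx, round(sb.fullness(), 3))
```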

  21. Free Algorithm
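
Continuing the toy model from the malloc sketch above (again illustrative, not the paper's pseudocode; owner_of is a toy stand-in for the owner pointer real Hoard keeps in each superblock's header): free returns the block to the superblock it came from, then applies the release test with F = 0.25 and K = 0 as on slide 17.

```python
# Toy model of Hoard's free path, reusing Superblock, Heap, heaps, and
# global_heap from the malloc sketch above.
F, K = 0.25, 0

def owner_of(sb):
    for h in heaps + [global_heap]:
        if sb in h.superblocks:
            return h

def hoard_free(sb, idx):
    sb.free.append(idx)                      # back to *its* superblock
    heap = owner_of(sb)
    in_use = sum(s.total - len(s.free) for s in heap.superblocks)
    held   = sum(s.total for s in heap.superblocks)
    # More than f empty AND more than K superblocks of slack =>
    # move the emptiest superblock to the global heap.
    if in_use < (1 - F) * held and in_use < held - K * sb.total:
        victim = min(heap.superblocks, key=Superblock.fullness)
        heap.superblocks.remove(victim)
        global_heap.superblocks.append(victim)

# Freeing enough blocks eventually trips the test and releases a superblock:
allocs = [hoard_malloc("thread-1", 64) for _ in range(8)]
for s, i in allocs:
    hoard_free(s, i)
print(len(global_heap.superblocks))          # 1 once the heap crosses f empty
```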

  22. Avoiding Contention • Lock contention is low for scalable applications • Allocation by one thread and freeing by another is uncommon • Producer-consumer is the realistic worst case • Memory operations are then serialized for at most two threads • The global heap is rarely accessed • Steady-state memory use is within a constant factor of maximum memory use

  23. Results • In general, Hoard performs & scales very well • Performance & scalability are poor only when there are few objects relative to the number of distinct block sizes • Most requests then result in superblock creation • This is an uncommon memory allocation pattern

  24. Results: Performance

  25. Results: Avoiding False Sharing

  26. Read the paper for full details! • Assignment 2… coming soon!
