Optimizing RAM-latency Dominated Applications Yandong Mao, Cody Cutler, Robert Morris MIT CSAIL
RAM-latency may dominate performance • RAM-latency dominated applications • follow long pointer chains • working set >> on-chip cache • Many cache misses -> stalls on RAM fetches • Example: Garbage Collector • identifies live objects by following inter-object pointers • spends much of its time stalling on RAM as it follows those pointers
Addressing the RAM-latency bottleneck • View RAM as we view disk • high latency • a similar set of optimization techniques applies • Batching • Sorting • Parallel and asynchronous access
Outline • Hardware Background • Three techniques to address RAM-latency • Linearization: Garbage Collector • Interleaving: Masstree • Parallelization: Masstree • Discussion • Conclusion
Three Relevant Hardware Features • Intel Xeon X5690 • 1. Fetch RAM before it is needed • hardware prefetcher: sequential or strided access patterns • software prefetch • out-of-order execution • 2. Parallel accesses to different memory channels (the RAM controller spreads requests across Channel 0, 1, 2) • 3. Row buffer cache inside each memory channel
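For concreteness, a minimal sketch of a software prefetch using the GCC/Clang __builtin_prefetch intrinsic (the pointer-array walk is a hypothetical example, not code from the talk):

    #include <cstddef>

    // Sum values reached through an array of pointers. The targets sit at
    // unpredictable addresses, so we prefetch a few slots ahead: the RAM
    // fetch overlaps with the work on the current element.
    long sum_indirect(long* const slots[], size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 4 < n)
                __builtin_prefetch(slots[i + 4]);  // hint only; never faults or stalls
            total += *slots[i];
        }
        return total;
    }

Out-of-order execution gives a similar overlap automatically, but only within a window of a few tens of instructions; an explicit prefetch can reach much further ahead.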
Per-array row buffer cache • Each channel contains many DRAM arrays • Each array has an additional row: the row buffer (4096 bytes) • Memory access: check the row buffer first, reload it on a miss • A hit in the row buffer is 2x-5x faster than a miss! • Sequential access: 3.5x higher throughput than random access!
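The sequential-vs-random gap is easy to reproduce; a rough microbenchmark sketch (hypothetical, and it conflates row-buffer locality with cache and hardware-prefetcher effects, but the working set is far larger than any on-chip cache):

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = (1ull << 30) / sizeof(uint64_t);   // 1 GB of 64-bit words
        std::vector<uint64_t> data(n, 1);
        std::vector<uint32_t> order(n);
        std::iota(order.begin(), order.end(), 0u);          // identity permutation
        std::shuffle(order.begin(), order.end(), std::mt19937(42));

        auto run = [&](bool seq) {
            auto t0 = std::chrono::steady_clock::now();
            uint64_t sum = 0;
            for (size_t i = 0; i < n; i++)
                sum += data[seq ? i : order[i]];            // same work, different order
            std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
            std::printf("%s: %.2f s (sum=%llu)\n", seq ? "sequential" : "random",
                        dt.count(), (unsigned long long)sum);
        };
        run(true);    // row-buffer and prefetcher friendly
        run(false);   // mostly row-buffer misses
    }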
Linearizing memory accesses for the Garbage Collector • Garbage Collector goals • find live objects (tracing) • starts from roots (stack, global variables) • follows the object pointers of live objects • reclaim the space of unreachable objects • Bottleneck of tracing: RAM latency • pointer addresses are unpredictable and non-sequential • each access -> cache miss -> stall for a RAM fetch (sketched below)
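A sketch of the loop in question, with a simplified object layout (the layout and the explicit mark stack are assumptions of this sketch, not Hotspot's implementation):

    #include <cstddef>
    #include <vector>

    struct Obj {
        bool   marked;
        size_t nfields;
        Obj**  fields;       // simplified: pointer fields live in a side array
    };

    // Classic tracing: every pop chases a pointer to an effectively random
    // address, so nearly every object header read is a cache miss the CPU
    // must wait out before it can discover the next objects to visit.
    void mark_from_roots(std::vector<Obj*> stack) {
        while (!stack.empty()) {
            Obj* o = stack.back();        // unpredictable address: likely RAM stall
            stack.pop_back();
            if (o->marked) continue;
            o->marked = true;
            for (size_t i = 0; i < o->nfields; i++)
                if (o->fields[i]) stack.push_back(o->fields[i]);
        }
    }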
Observation • Arrange objects in tracing order during garbage collection • Subsequent tracing then accesses memory in sequential order • Takes advantage of two hardware features • hardware prefetchers: prefetch into cache • higher row buffer hit rate
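A minimal sketch of one way to get this ordering: a Cheney-style copying collector, which evacuates objects to to-space in exactly the (breadth-first) order tracing visits them. The object layout is a simplified assumption, and the talk's collector is Hotspot's, not this code:

    #include <cstddef>
    #include <cstring>

    struct Obj {
        Obj*   forward;       // forwarding pointer once copied, else nullptr
        size_t size;          // total object size in bytes
        size_t nfields;       // number of pointer fields
        Obj*   fields[];      // fields stored inline (GNU flexible-array extension)
    };

    // Copy o into to-space (bump allocation) and leave a forwarding pointer.
    static Obj* evacuate(Obj* o, char*& alloc) {
        if (o->forward) return o->forward;          // already copied
        Obj* copy = (Obj*)alloc;
        std::memcpy(copy, o, o->size);
        alloc += o->size;
        o->forward = copy;
        return copy;
    }

    // Breadth-first (Cheney) tracing: the scan pointer doubles as the BFS
    // queue, and objects land in to-space in tracing order, so the next
    // collection reads them back sequentially.
    void trace(Obj** roots, size_t nroots, char* tospace) {
        char* alloc = tospace;
        char* scan  = tospace;
        for (size_t i = 0; i < nroots; i++)
            roots[i] = evacuate(roots[i], alloc);
        while (scan < alloc) {
            Obj* o = (Obj*)scan;
            for (size_t f = 0; f < o->nfields; f++)
                if (o->fields[f])
                    o->fields[f] = evacuate(o->fields[f], alloc);
            scan += o->size;
        }
    }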
Benchmark and result • Time to trace 1.8 GB of live data • HSQLDB 2.2.9: an RDBMS engine in Java • compacting collector of the Hotspot JVM from OpenJDK7u6 • use copy collection to reorder objects into tracing order • Result: tracing in sequential order is 1.3X faster than in random order • Future work • a better linearizing algorithm than copy collection (which uses twice the memory!) • measure application-level performance improvement
Interleaving on Masstree • Not always possible to linearize memory accesses • Masstree: a high-performance in-memory key-value store for multi-core • all cores share a single B+tree • each core runs a dedicated worker thread • scales well on multi-core • Focus on single-threaded Masstree for now
Single-threaded Masstree is RAM-latency dominated • Careful design to avoid RAM fetches • trie of B+trees, with key fragments and children inlined in tree nodes • accessing one fat B+tree node costs one RAM latency • Still RAM-latency dominated! • each key lookup follows a random path • one dependent RAM fetch (hundreds of cycles) per tree level • about a million lookups per second
Batch and interleave tree lookups • Batch key lookups • Interleave computation and RAM fetch using software prefetch
Example: interleaving two lookups (keys A and X) in a tree rooted at E, with children B and F • 1. Find the child containing A in E; prefetch(B) • 2. Find the child containing X in E; prefetch(F) • 3. Find the child containing A in B; prefetch(A). B is already in cache! • 4. Find the child containing X in F; prefetch(X). F is already in cache! • Perform a batch of lookups w/o stalling on RAM fetches! • Works as long as the computation (inspecting a batch of nodes) > RAM latency • 30% improvement with a batch of five (see the sketch below)
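A sketch of the batched, interleaved descent (the fanout-16 node layout and the find_child helper are hypothetical simplifications, not Masstree's actual nodes):

    #include <cstddef>
    #include <cstdint>

    constexpr size_t BATCH = 5;      // the deck reports ~30% with a batch of five

    struct Node {                    // hypothetical simplified B+tree node
        bool     leaf;
        int      nkeys;
        uint64_t keys[15];
        Node*    child[16];
    };

    // Pick the child covering k; the node is already in cache when this runs.
    static Node* find_child(const Node* n, uint64_t k) {
        int i = 0;
        while (i < n->nkeys && k >= n->keys[i]) i++;
        return n->child[i];
    }

    // One node's worth of computation per key, then a prefetch for the next
    // node on that key's path, then on to the next key. By the time control
    // returns to a key, its prefetched node has usually arrived from RAM.
    void lookup_batch(Node* root, const uint64_t keys[BATCH], Node* leaves[BATCH]) {
        Node* pos[BATCH];
        __builtin_prefetch(root);                      // first node for every key
        for (size_t i = 0; i < BATCH; i++) pos[i] = root;
        bool progress = true;
        while (progress) {
            progress = false;
            for (size_t i = 0; i < BATCH; i++) {
                if (pos[i]->leaf) continue;            // this key reached its leaf
                pos[i] = find_child(pos[i], keys[i]);
                __builtin_prefetch(pos[i]);            // overlap fetch with other keys
                progress = true;
            }
        }
        for (size_t i = 0; i < BATCH; i++) leaves[i] = pos[i];
    }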
Parallelizing Masstree • Interesting observation • these applications are limited by RAM latency, not by CPU • but adding more cores helps! • Reason • RAM is a parallel system • more cores keep RAM busier • Compare with the interleaving technique • same effect: keep RAM busier • difference: interleaving issues parallel loads from one core; parallelization issues them from many cores (sketched below)
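A generic sketch of the point (the thread partitioning is an illustration, not Masstree's actual worker design): each core stalls on its own misses, but misses from different cores overlap inside the RAM system, so aggregate throughput rises.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Split independent lookups across cores. Each lookup(i) is a dependent
    // pointer chase, so a single core would pay one RAM latency at a time;
    // ncores cores keep ncores misses in flight across the memory channels.
    template <typename Lookup>
    void parallel_lookups(Lookup lookup, size_t nkeys, unsigned ncores) {
        std::vector<std::thread> workers;
        for (unsigned c = 0; c < ncores; c++)
            workers.emplace_back([=] {
                for (size_t i = c; i < nkeys; i += ncores)
                    lookup(i);
            });
        for (auto& w : workers) w.join();
    }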
Parallelization improves performance by issuing more RAM loads
Interleaving and Parallelization can be complementary • beats Masstree by 12-30% • improvement decreases with more cores • parallelization alone can saturate RAM
Discussion • Applicability • Lessons • interleaving seems more general than linearization • could it be applied to the Garbage Collector? • interleaving is more difficult than parallelization • requires batching and concurrency control • Challenges in automatic interleaving • need to identify and resolve conflicting accesses • difficult or impossible without the programmer's help
Discussion • Interleaving on other data structures • Data structures and potential applications • B+tree: Masstree • do other applications use in-memory B+trees? • Hashtable: Memcached • a single hashtable • multi-get API: natural batching and interleaving (sketched below) • Preliminary result: interleaving the hashtable improves throughput by 1.3X
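A sketch of why multi-get batches naturally, on a chained hashtable (the bucket layout is a hypothetical simplification, not Memcached's):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    struct Entry {                   // hypothetical chained-bucket entry
        std::string key, value;
        Entry*      next;
    };

    // Multi-get: hash every key and prefetch all the bucket heads up front,
    // then walk the chains. The RAM fetches for the buckets overlap with each
    // other instead of being paid serially, one latency per key.
    void multi_get(Entry* const buckets[], size_t nbuckets,
                   const std::vector<std::string>& keys,
                   std::vector<const std::string*>& out) {
        std::vector<Entry*> head(keys.size());
        for (size_t i = 0; i < keys.size(); i++) {
            head[i] = buckets[std::hash<std::string>{}(keys[i]) % nbuckets];
            __builtin_prefetch(head[i]);             // issue all fetches first
        }
        out.assign(keys.size(), nullptr);
        for (size_t i = 0; i < keys.size(); i++)     // heads are (mostly) cached now
            for (Entry* e = head[i]; e; e = e->next)
                if (e->key == keys[i]) { out[i] = &e->value; break; }
    }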
Discussion • Profiling tools • Linux perf • look at the most expensive functions • inspect them manually • maybe misleading: computation-limited or RAM-latency-limited? • a tool based on RAM stalls would help
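One check that is already possible with stock perf counters (./app is a stand-in for the program under test; exact event availability varies by CPU):

    perf stat -e cycles,instructions,cache-misses ./app

A low instructions-per-cycle ratio combined with a high cache-miss count suggests RAM-latency stalls rather than a computation limit, though a tool that attributed stalled cycles directly would make the diagnosis unambiguous.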
Related Work • PALM [Jason11]: a B+tree using the same interleaving technique • RAM parallelization at different levels: regularities considered harmful [Park13]
Conclusion • Identified a class of applications dominated by RAM latency • Three techniques to address the RAM-latency bottleneck, applied to two applications • Could you improve your program similarly?
Single-threaded Masstree is RAM-latency dominated (diagram) • trie of B+trees: each trie level is indexed by a fixed-length key fragment • top level: a B+tree indexed by k[0:7] • next level: B+trees indexed by k[8:15]