Course: CSCI 780 – Advanced Topics on Caching

The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms, by Ali R. Butt, Chris Gniady, and Y. Charlie Hu, SIGMETRICS ’05 Course: CSCI 780 – Advanced Topics on Caching Techniques in Computer and Distributed Systems Presenter: Chuan Yue

Outline • The Buffer Cache • Linux Kernel Prefetching • Adapted Buffer Cache Replacement Algorithms • Simulation Results • Conclusions • Discussions

Buffer Cache in Main Memory • Two kinds of I/O operations: • Direct-access read()/write() calls use the block-based buffer cache • Memory-mapped I/O shares the page cache with the virtual memory system • This naturally leads to two separate buffers • Problems: • Double buffering • Inconsistencies [Figure labels: virtual memory, memory-mapped I/O, I/O using read/write, page cache, buffer cache, disk]

Unification of Buffer Cache and Page Cache • A unified buffer cache uses the same page cache to store • Virtual memory pages • Memory-mapped pages • Ordinary file system I/O • Issues: • Complex interactions between the file system and VM [Figure labels: virtual memory, memory-mapped I/O, I/O using read/write, unified buffer cache, disk]

Buffer Cache Management • Designing effective buffer cache replacement algorithms is a fundamental challenge in improving system performance • Traditional file I/O system • Virtual memory system • Various buffer cache replacement algorithms • LRU replacement is widely used • LRU’s inability to cope with access patterns with weak locality • Other well-known algorithms that utilize recency information: LRU-2, 2Q, LIRS, LRFU, MQ, ARC

Prefetching • Prefetching is another highly effective technique used for improving the I/O performance • The main motivation for prefetching is to overlap computation with I/O and thus reduce exposed latency of I/O • Various prefetching techniques: • Prefetching using user inserted hints of I/O access patterns • Drawback: placing burden on programmer • File system kernel-driven prefetching in modern operating systems • Synchronous read-ahead to amortize seek cost • Asynchronous prefetching after detecting sequential access patterns

The impact of kernel prefetching on buffer cache replacement algorithms’ performance • Caching and prefetching interact closely • Prefetching file blocks into the cache can be harmful (P. Cao et al., 1995) • Replacement policy & prefetching → buffer cache hit ratio • Hit ratio, prefetching & clustering → disk I/O traffic • Disk I/O traffic → file system performance • Almost all proposed buffer cache replacement algorithms did not take kernel-driven prefetching into account • The work in this paper: • Shows the potential performance impact of kernel prefetching on buffer cache replacement algorithms • Presents simulation results for 8 adapted replacement algorithms

Kernel components on the path from file system operations to the disk

Kernel Prefetching in Linux • Prefetching is based on the pattern of accesses to the file • Only considers prefetching for read accesses • Beneficial for sequential accesses to a file • Read-ahead Group and Read-ahead Window • Synchronous Prefetching and Asynchronous Prefetching [Figure: read-ahead windows and groups over a numbered block sequence; labels: window, group, new group]
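The idea of detecting sequential access and growing a read-ahead window can be illustrated with a toy sketch. This is not the Linux implementation: the class name `ReadAhead`, the window sizes, and the doubling rule are all assumptions made for illustration, and the sketch models neither the group/window distinction nor synchronous versus asynchronous requests.

```python
class ReadAhead:
    """Toy per-file read-ahead state (illustrative only, not the Linux code)."""

    def __init__(self, min_window=4, max_window=32):
        self.min_window = min_window   # hypothetical sizes, in blocks
        self.max_window = max_window
        self.window = 0                # 0 means read-ahead is currently off
        self.next_expected = None      # block that would continue the run
        self.prefetched_upto = -1      # highest block already requested

    def on_read(self, block):
        """Return the blocks to prefetch for a demand read of `block`."""
        if block == self.next_expected:
            # Sequential access: grow the window, capped at max_window.
            self.window = min(max(self.window * 2, self.min_window),
                              self.max_window)
        else:
            # Non-sequential access: switch read-ahead off and start over.
            self.window = 0
            self.prefetched_upto = block
        self.next_expected = block + 1
        start = max(block, self.prefetched_upto) + 1
        end = block + self.window      # last block covered by the window
        to_fetch = list(range(start, end + 1))
        if to_fetch:
            self.prefetched_upto = end
        return to_fetch
```

With `ReadAhead(min_window=2, max_window=8)`, reading blocks 0, 1, 2, 3 in order prefetches nothing, then blocks 2 to 3, then 4 to 6, then 7 to 11; a jump to an unrelated block resets the window.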

Belady’s algorithm can be non-optimal given kernel prefetching • Access sequence: a c e g i k m o a b c d e f g h i j k l m n o p • Without prefetching: Belady’s Alg. 16 cache misses; LRU 23 cache misses • With prefetching: Belady’s Alg. 8 cache misses; LRU 6 cache misses
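The miss counts quoted for the no-prefetching case can be checked with a short simulation. The slide does not state the cache size; the sketch below assumes a cache of 8 blocks, which is the size at which this simulation reproduces both quoted numbers. The function names are hypothetical, and the with-prefetching counts are not modeled here because they depend on the kernel's read-ahead behavior.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for LRU replacement with no prefetching."""
    cache = OrderedDict()              # least recently used block first
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)   # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict the LRU block
            cache[block] = True
    return misses

def belady_misses(trace, capacity):
    """Count misses for Belady's algorithm with no prefetching."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                # Position of the next reference to b, or infinity if none.
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            # Evict the block whose next reference is farthest away.
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

trace = "a c e g i k m o a b c d e f g h i j k l m n o p".split()
CACHE_BLOCKS = 8   # assumption: not given on the slide
```

Under that assumption, `lru_misses(trace, CACHE_BLOCKS)` gives 23 and `belady_misses(trace, CACHE_BLOCKS)` gives 16, matching the slide.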

Prefetching has been ignored in algorithm design • Caching algorithms have been proposed and studied without considering prefetching • OPT • LRU • LRU-K [SIGMOD 1993] • 2Q [VLDB 1994] • LRFU [TC 2001] • MQ [USENIX 2001] • LIRS [SIGMETRICS 2002] • ARC [FAST 2003] • Changes to OPT, LRU, 2Q, LIRS will be explained

OPT • OPT is based on Belady’s cache replacement algorithm • Off-line, has knowledge of future references • In the presence of Linux kernel prefetching • Prefetched blocks are assumed to be accessed most recently and are inserted into the cache according to the original OPT algorithm • However, OPT is extended to immediately identify wrong prefetches, i.e., prefetched blocks that • will not be accessed on-demand at all, or • will be accessed further in the future than all other blocks in the cache • Wrongly prefetched blocks become immediate candidates for removal
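The two conditions for a wrong prefetch can be written as a small predicate. This is a minimal sketch, assuming the future on-demand reference string is available as a list; the function names are hypothetical and not taken from the paper.

```python
def next_demand_use(block, future):
    """Index of the next on-demand access to `block`, or None if never used."""
    for i, b in enumerate(future):
        if b == block:
            return i
    return None

def is_wrong_prefetch(block, cached, future):
    """True if a prefetched `block` meets either wrong-prefetch condition."""
    own = next_demand_use(block, future)
    if own is None:
        return True                    # never accessed on demand
    for other in cached:
        if other == block:
            continue
        use = next_demand_use(other, future)
        if use is None or use > own:
            # Another cached block is needed later (or never), so the
            # prefetched block is not the farthest one in the future.
            return False
    return True                        # farther than every other cached block
```

For example, with future references `["a", "b", "c"]` and a cache holding `a`, `b`, and a prefetched `c`, the predicate flags `c` because it is needed after every other cached block.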

LRU • LRU is the most widely used replacement policy • In the presence of kernel prefetching, adapted LRU: • On each access, the kernel determines the number of blocks that need to be prefetched • Prefetched blocks are inserted at the MRU position just like regular blocks
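The adapted LRU can be sketched as an ordinary LRU cache whose access call also accepts the blocks the kernel decided to prefetch. The class name `PrefetchLRU` and its interface are hypothetical, and the choice to leave an already cached block untouched when it is prefetched again is an assumption, since the slide does not cover that case.

```python
from collections import OrderedDict

class PrefetchLRU:
    """LRU cache in which prefetched blocks enter at the MRU position."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()    # LRU block first, MRU block last

    def _insert_mru(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)    # evict the LRU block
        self.blocks[block] = True

    def access(self, block, prefetch=()):
        """On-demand access; `prefetch` lists blocks chosen by the kernel."""
        hit = block in self.blocks
        self._insert_mru(block)
        for p in prefetch:
            # Assumption: a block that is already cached is not re-prefetched.
            if p not in self.blocks:
                self._insert_mru(p)
        return hit
```

Because prefetched blocks take MRU slots, an inaccurate prefetch can push out blocks that would otherwise have been hits, which is one way prefetching changes the behavior of the replacement policy.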

2Q • Three buffers and the algorithm: • A1in queue: all missed blocks are initially placed here • A1out queue: when blocks are replaced from the A1in queue in FIFO order, their addresses are temporarily placed here • Am queue: when a block is re-referenced and its address is in the A1out queue, it is promoted to the Am queue [Figure: example with block accesses 10, 11, 12, 13, 14, 11, 12, 22, showing the A1in, Am, and A1out (address only) queues]

2Q – With Adaptation (in the presence of kernel prefetching) • Prefetched blocks are treated as on-demand blocks: • A prefetched block is placed into the A1in queue initially • On the subsequent on-demand access, the block stays in the A1in queue • If the prefetched block is evicted from the A1in queue before any on-demand access, it is simply discarded, as opposed to being moved into the A1out queue • If a block currently in the A1out queue is prefetched, it is promoted into the Am queue as if it were accessed on-demand [Figure: example with demand and prefetch blocks 10, 11, 12, 11, 13, 14, 11, 22, 23, showing the A1in, Am, and A1out (address only) queues]
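The four adaptation rules can be expressed in a simplified 2Q sketch. This is not the paper's simulator: the class name `Adapted2Q`, the fixed per-queue capacities, and the decision to treat a prefetch of an already resident block as a no-op are assumptions made to keep the example short.

```python
from collections import OrderedDict

class Adapted2Q:
    """Simplified 2Q with the prefetch adaptations listed above."""

    def __init__(self, a1in_size, a1out_size, am_size):
        self.a1in_size = a1in_size
        self.a1out_size = a1out_size
        self.am_size = am_size
        self.a1in = OrderedDict()    # FIFO: block -> demanded at least once?
        self.a1out = OrderedDict()   # FIFO of addresses only
        self.am = OrderedDict()      # LRU: MRU block last

    def _promote_to_am(self, block):
        self.a1out.pop(block, None)
        if len(self.am) >= self.am_size:
            self.am.popitem(last=False)          # evict LRU block of Am
        self.am[block] = True

    def _insert_a1in(self, block, demanded):
        if len(self.a1in) >= self.a1in_size:
            victim, was_demanded = self.a1in.popitem(last=False)
            if was_demanded:
                self.a1out[victim] = True        # remember the address
                if len(self.a1out) > self.a1out_size:
                    self.a1out.popitem(last=False)
            # A prefetched block that was never demanded is just discarded.
        self.a1in[block] = demanded

    def _touch(self, block, demanded):
        if block in self.am:
            if demanded:
                self.am.move_to_end(block)
            return True
        if block in self.a1in:
            if demanded:
                self.a1in[block] = True          # stays in place in the FIFO
            return True
        if block in self.a1out:
            self._promote_to_am(block)           # same for demand and prefetch
            return False
        self._insert_a1in(block, demanded)
        return False

    def access(self, block):
        """On-demand access; returns True on a cache hit."""
        return self._touch(block, True)

    def prefetch(self, block):
        """Kernel prefetch of `block`."""
        self._touch(block, False)
```

Tracking a per-block "demanded" flag in A1in is one straightforward way to distinguish a discarded prefetch from a normal FIFO eviction; other bookkeeping schemes would work equally well.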

LIRS • Dynamically and responsively maintains the LIR block set and the HIR block set, and keeps the LIR block set in the cache • In the presence of kernel prefetching, adapted LIRS: • Prefetched blocks are not inserted into the LIRS stack S; they are only inserted into the HIR stack Q • If a prefetched block does not have an existing entry in LIRS stack S, the first on-demand access to the block causes it to be inserted onto the top of LIRS stack S as an HIR block • If a prefetched block already exists in LIRS stack S, the first on-demand access to the block is treated as an LIR block access

Performance Evaluation • Trace collection • Interception of I/O system calls (using a modified Linux strace utility) • Collect I/O access type, time, file identifier (inode), and I/O size • Timing-accurate trace simulator • Detailed implementation of kernel prefetching and clustering • Interface with the DiskSim simulator to simulate I/O time • Implementation of: OPT, LRU, LRU-2, LRFU, LIRS, MQ, 2Q, ARC • Metrics • Hit ratio • Aggregated synchronous and asynchronous disk I/O requests • Actual running time

Applications and Trace Statistics (Concurrent applications: Multi1: cscope, gcc; Multi2: cscope, gcc, viewperf; Multi3: glimpse, TPC-H.)

Hit ratio results for cscope • Kernel prefetching has a significant impact on the hit ratio • The improvement differs across algorithms • Prefetching can result in significant changes in the relative performance of replacement algorithms

Disk requests results for cscope • The clustering of I/O requests in the presence of prefetching results in a significant reduction in the number of disk requests • The effect is complex and closely tied to the file access patterns

Execution time results for cscope • Reduction in the # of disk requests due to kernel prefetching does not necessarily translate into reduction in execution time.

Results for other three sequential access applications • Glimpse • It also benefits from prefetching • The changes in the relative behavior of different algorithms observed in cscope with prefetching are also observed in glimpse • Viewperf • It benefits the most from prefetching • The behavior of different cache replacement algorithms is similar to that observed in cscope • Gcc • Many accesses are to small files, little opportunity for prefetching • All three performance metrics are almost identical with and without prefetching

Hit ratio results for tpc-h • Prefetching provides little improvement on the hit ratio for random access pattern

Disk requests results for tpc-h • Most of the prefetched blocks are not accessed, and as a result the number of disk requests is doubled

Execution time results for tpc-h • The significant increase in the number of I/Os translates into a significant increase in the execution time

Results for concurrent applications • Multi1: cscope, gcc • Similar to cscope • Multi2: cscope, gcc, viewperf • Similar to Multi1; however, prefetching does not improve the execution time because viewperf is CPU-bound • Multi3: glimpse, TPC-H • Similar to tpc-h

Number and size of synchronous and asynchronous disk I/Os in cscope at 128MB cache size • The total number of disk requests with prefetching is at least 30% lower than without prefetching for all schemes except OPT • Most of the reduction in disk requests comes from issuing asynchronous disk requests, which can be overlapped with CPU time

Conclusions • In this research work, the authors • Proposed prefetching implementation for different replacement algorithms • Built a timing simulator to evaluate relative performances • The paper shows • Prefetching impacts hit ratio, disk requests, execution time • Comparison of hit ratios is insufficient • Kernel prefetching can narrow the performance gap of different replacement algorithms • Kernel prefetching can also change the relative performance benefits of different replacement algorithms • Future buffer caching research should • Take into consideration prefetching and I/O clustering • Simulate execution time

Discussions (1) • Good points • No new algorithm, but the paper is the first to simulate and compare the impact of kernel prefetching on well-known buffer cache replacement algorithms • The results are not surprising, since the general outcome for sequential and random workloads can be guessed, but this paper is the first to report them • Bad points • The simulation is based only on I/O traces; results based on VM traces would strengthen it • The simulation results for concurrent applications are not analyzed in detail (in this paper itself) • The unification of buffer cache and page cache in many OSes should be considered, and the competition between process page accesses and file cache page accesses should be simulated and analyzed

Discussions (2) • Some questions: • Regarding Belady’s anomaly: • In the LIRS paper: Belady's anomaly appears in 2Q and ARC for the glimpse workload • In this paper: without prefetching, the simulation results do not show Belady's anomaly; with prefetching, Belady's anomaly appears in ARC for the glimpse workload • Why the difference? LRU does not exhibit Belady's anomaly. What about the other algorithms? • Regarding simulations: • Is there any relationship between the cache sizes selected for simulation and the real environment where the trace was collected? • Is performance under thrashing conditions still worth simulating?

References • P. Cao et al., “A Study of Integrated Prefetching and Caching Strategies”, ACM SIGMETRICS, 1995 • S. Jiang and X. Zhang, “Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance”, IEEE Transactions on Computers, vol. 54, no. 9, September 2005 • S. Jiang, F. Chen, and X. Zhang, “CLOCK-Pro: An Effective Improvement of the CLOCK Replacement”, Proceedings of the 2005 USENIX Annual Technical Conference (USENIX ’05) • Rik van Riel, “Page Replacement in Linux 2.4 Memory Management”, Proceedings of the 2001 USENIX Technical Conference, FREENIX track • Rik van Riel, “Towards an O(1) VM: Making Linux virtual memory management scale towards large amounts of physical memory”, Proceedings of the Linux Symposium, July 2003 • “Journal File Systems in Linux”, June 21, 2005 (http://bulma.net/impresion.phtml?nIdNoticia=1154) • “The Buffer Cache”, June 21, 2005 (http://www.faqs.org/docs/linux_admin/buffer-cache.html) • Chris Gniady et al. (Purdue University), “The Performance Impact of Kernel Prefetching on Buffer Cache Replacement”, ACM SIGMETRICS 2005 presentation slides • “More on File System”, lecture notes, June 22, 2005 (http://www.cs.rochester.edu/~kshen/csc256-spring2005/lectures/lecture16-file2.pdf)

Thank you!
