
Informed Prefetching and Caching


Presentation Transcript


  1. Informed Prefetching and Caching R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, Jim Zelenka

  2. Contribution • Balance caching against prefetching • Distribute cache buffers among competing applications • Basic functions of a file system: management of disk accesses, and management of main-memory file buffers • Approach: use hints from I/O-intensive applications to prefetch aggressively enough to eliminate I/O stall time while maximizing buffer availability for caching, and allocate cache buffers dynamically among competing hinting and non-hinting applications for the greatest performance benefit

  3. Motivation • Storage parallelism is increasingly available (disk arrays) • Faster CPUs make overall performance depend more heavily on I/O • Bigger caches help only insofar as they raise cache-hit ratios • I/O-intensive applications: amount of data processed >> file cache size; locality is poor or limited; accesses are frequently non-sequential; I/O stall time is a large fraction of total execution time; yet access patterns are largely predictable • How can I/O workloads be improved to take full advantage of the hardware that already exists?

  4. ASAP: the four virtues of I/O workloads • Avoidance: not a scalable solution to the I/O bottleneck • Sequentiality: scales for writes but not for reads • Asynchrony: scalable through write buffering; scaling for reads depends on prefetching aggressiveness • Parallelism: scalable for explicitly parallel I/O requests, but for serial workloads scalable parallelism is achieved only by scaling the number of asynchronous requests • Asynchrony eliminates write latency and parallelism provides throughput, but no existing technique scalably relieves the I/O bottleneck for reads; hence the case for aggressive prefetching

  5. Prefetching • Aggressive prefetching for reads is the counterpart of write buffering for writes

  6. Hints • Historical information: e.g., the LRU cache replacement algorithm • Sequential readahead: prefetch up to 64 blocks ahead when long sequential runs are detected • Disclosure: hints based on advance knowledge of future accesses • A mechanism for portable I/O optimizations • Provides evidence for a policy decision • Conforms to software-engineering principles of modularity (an illustrative sketch of such an interface follows)
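
As an illustration of what disclosure looks like in practice, here is a hypothetical sketch in Python (the names, types, and the disclose() call are invented for illustration; the real TIP interface is a kernel call with its own format): the application states what it will read and in what order, and leaves the decisions about when to prefetch and what to cache to the file system.

from dataclasses import dataclass

@dataclass
class HintSegment:
    # One contiguous region the application will read, listed in access order.
    path: str
    offset: int    # byte offset of the first byte to be read in this region
    length: int    # number of bytes that will be read sequentially from there

def disclose(hints: list[HintSegment]) -> None:
    # Hypothetical disclosure call: a real system would hand this list to the
    # kernel (e.g., through an ioctl); here we simply print the hints.
    for h in hints:
        print(f"hint: {h.path} bytes [{h.offset}, {h.offset + h.length})")

# Example: a search tool such as agrep discloses that it will read each file
# from beginning to end, in command-line order.
disclose([
    HintSegment("msg.0001", 0, 4 * 8192),
    HintSegment("msg.0002", 0, 2 * 8192),
])

Note that the hint describes future accesses; it does not command the system to prefetch, which is what keeps the optimization portable across cache sizes and disk configurations.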

  7. Informed Prefetching • System: TIP-1, implemented in OSF/1, which already provides the two I/O optimizations above (LRU caching and sequential readahead) • Applications: 5 I/O-intensive benchmarks, single-threaded, with data fetched from the file system • Hardware: DEC 3000/500 workstation, one 150 MHz Alpha 21064 processor, 128 MB RAM, 5 KZTSA fast SCSI-2 adapters, each hosting 3 HP 2247 1 GB disks; 12 MB (1536 x 8 KB) file cache • Stripe unit: 64 KB • Cluster prefetch: 5 prefetches • Disk scheduler: striper SCAN • [Figure: TIP-1 buffer pool: 512 buffers (1/3 of the cache) are set aside for unread hinted prefetches; the remaining buffers are managed by LRU]

  8. Agrep • Reads each file from beginning to end • agrep woodworking 224_newsgroup_msg: 358 disk blocks

  9. Agrep (cont’d) • Elapsed time for the sum of 4 searches is reduced by up to 84%

  10. Postgres • Join of two relations • Outer relation: 20,000 unindexed tuples (3.2 MB) • Inner relation: 200,000 tuples (32 MB) and indexed (5 MB) • Output about 4,000 tuples written sequentially

  11. Postgres (cont’d)

  12. Postgres (cont’d) • Elapsed time reduced by up to 55%

  13. MCHF Davidson algorithm • MCHF: a suite of computational-chemistry programs used for atomic-structure calculations • Davidson algorithm: an element of MCHF that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk • Matrix size: 17 MB • The algorithm repeatedly accesses the same large file sequentially.

  14. MCHF Davidson algorithm (cont’d)

  15. MCHF Davidson algorithm (cont’d) • Hints disclose only sequential access in one large file. • OSF/1’s aggressive readahead performs better than TIP-1. • Neither OSF/1 nor informed prefetching alone uses the 12 MB of cache buffers well. • The LRU replacement algorithm flushes all of the blocks before any of them are reused.

  16. Informed caching • Goal: allocate cache buffers to minimize application elapsed time • Approach: estimate the impact on execution time of alternative buffer allocations and then choose the best allocation • 3 broad uses for each buffer: • Caching recently used data in the traditional LRU queue • Prefetching data according to hints • Caching data that a predictor indicates will be reused in the future

  17. Three uses of cache buffers • It is difficult to estimate the performance impact of alternative allocations at a global level

  18. Cost-benefit analysis • System model: from which the various cost and benefit estimates are derived • Derivations: for each component • Comparison: how to compare the estimates at a global level to find the globally least valuable buffer and the globally most beneficial consumer

  19. System assumptions • Modern OS with a file buffer cache, running on a uniprocessor with enough memory to provide a substantial number of cache buffers • Workload emphasis on read-intensive applications • Every application I/O access requests a single file block that can be read in one disk access, and requests are not too bursty • System parameters are constant • Enough disk parallelism that there is no congestion

  20. System model • Elapsed time: T_elapsed = N_IO × (T_app + T_IO), where N_IO is the number of I/O requests, T_app is the average application CPU time between requests, and T_IO is the average time to service an I/O request • T_IO includes the overhead T_driver of allocating a buffer, queuing the request at the drive, and servicing the interrupt when the I/O completes

  21. Cost of deallocating LRU buffer
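
The slide's derivation is not in the transcript. As a hedged sketch consistent with the system-model slide above (the paper's exact expression may differ): giving up one buffer from an LRU cache of n buffers turns a fraction H(n) - H(n-1) of accesses from hits into misses, where H(n) is the hit ratio with n buffers, so the expected extra elapsed time per demand access is roughly

\[
\Delta T_{LRU}(n) = \bigl(H(n) - H(n-1)\bigr)\,\bigl(T_{miss} - T_{hit}\bigr),
\qquad T_{miss} \approx T_{hit} + T_{driver} + T_{disk}
\]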

  22. The benefit of prefetching • [Figure: timeline of a block prefetched x accesses before it is consumed] • Prefetching a block can mask some of the latency of a disk read, so the full disk latency is the upper bound on the benefit of prefetching a block • If the prefetch could be delayed and still complete before the block is needed, we consider there to be no benefit from starting the prefetch now • There is no benefit from prefetching further ahead than the prefetch horizon P

  23. The prefetch horizon
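
The body of this slide is missing from the transcript. A hedged reconstruction from the system model and the previous slide (symbols as on the system-model slide; the paper's exact formula may differ): a prefetch issued x accesses before the block is needed overlaps roughly x(T_app + T_hit + T_driver) of the disk time T_disk, so the remaining stall, and with it any benefit from prefetching deeper, drops to zero at the prefetch horizon

\[
\mathrm{stall}(x) \approx \max\bigl(0,\; T_{disk} - x\,(T_{app} + T_{hit} + T_{driver})\bigr),
\qquad
P \approx \frac{T_{disk}}{T_{app} + T_{hit} + T_{driver}}
\]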

  24. Comparison of LRU cost to prefetching benefit • Shared resource: cache buffers • Common currency: the change in elapsed time per buffer, obtained by scaling the per-access estimates by the rate of unhinted demand accesses and the rate of hinted accesses • A buffer should be reallocated from the LRU cache for prefetching when the prefetching benefit exceeds the LRU cost
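
One plausible reading of the slide's "common currency" line, with hypothetical symbols r_demand and r_hint for the two access rates it lists (this normalization is an assumption, not taken verbatim from the paper): each per-access estimate is scaled by the rate of the accesses it affects, and a buffer moves from the LRU cache to prefetching when the benefit per buffer exceeds the cost per buffer,

\[
r_{hint}\,\Delta T_{pf}(x) \;>\; r_{demand}\,\Delta T_{LRU}(n)
\]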

  25. The cost of flushing a hinted block • [Figure: a hinted block flushed now, y accesses before its use, and prefetched back P accesses before that use, so the buffer is free for y - P accesses] • When should we flush a hinted block? • Cost: the extra overhead of prefetching the block back before it is needed, spread over the y - P accesses during which the buffer is free (see the sketch below)
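
A hedged reconstruction of the missing cost expression, under the model's no-congestion assumptions (the paper's exact form may include additional terms): flushing the hinted block frees one buffer for y - P accesses, and prefetching it back at the horizon adds one extra driver overhead, so the cost per freed buffer-access is roughly

\[
\Delta T_{eject}(y) \approx \frac{T_{driver}}{y - P}
\]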

  26. Putting it all together: global min-max • Separate value estimators for the LRU cache and for each independent stream of hints • Three estimates: the cost of shrinking the LRU cache, the benefit of prefetching, and the cost of flushing a hinted block • Which block should be replaced when a buffer is needed for prefetching or to service a demand request? The globally least valuable block in the cache. • Should a cache buffer be used to prefetch data now? Prefetch if the expected benefit is greater than the expected cost of flushing or stealing the least valuable block.

  27. Value estimators: a global min-max valuation of blocks • LRU cache: a block's value is derived from its position i in the LRU queue • Hint estimators: a hinted block's value is derived from how soon the hints say it will be accessed • Global value = max(value_LRU, value_hint) • Globally least valuable block = the block with the minimum global value (illustrative sketch below)
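
To make the min-max valuation concrete, here is a small illustrative sketch in Python (block names, the toy value functions, and the prefetch-horizon constant are all invented; this is not the TIP implementation): every cached block gets a value from the LRU estimator and/or from the hint estimator of the stream that will reuse it, its global value is the larger of the two, and the cache manager gives up the block with the smallest global value, prefetching only when the expected benefit exceeds that value.

from dataclasses import dataclass
from typing import Optional

PREFETCH_HORIZON = 16  # P: assumed constant, for illustration only

@dataclass
class Block:
    name: str
    lru_position: Optional[int] = None          # position i in the LRU queue (0 = most recent)
    accesses_until_reuse: Optional[int] = None  # y, taken from a hint stream

def lru_value(block: Block) -> float:
    # Toy LRU estimator: recently used blocks are worth more. In the paper the
    # value comes from the cost estimate for shrinking the LRU cache.
    if block.lru_position is None:
        return 0.0
    return 1.0 / (block.lru_position + 1)

def hint_value(block: Block) -> float:
    # Toy hint estimator: a hinted block is worth more the sooner it will be
    # reused; beyond the prefetch horizon it can be fetched back cheaply.
    if block.accesses_until_reuse is None:
        return 0.0
    return 1.0 / max(block.accesses_until_reuse - PREFETCH_HORIZON, 1)

def global_value(block: Block) -> float:
    # A block cached for two reasons is as valuable as its best reason.
    return max(lru_value(block), hint_value(block))

def least_valuable(cache: list[Block]) -> Block:
    return min(cache, key=global_value)

def should_prefetch(prefetch_benefit: float, cache: list[Block]) -> bool:
    # Prefetch only if the expected benefit exceeds the value of the block
    # that would have to be given up.
    return prefetch_benefit > global_value(least_valuable(cache))

if __name__ == "__main__":
    cache = [
        Block("a", lru_position=0),                              # just referenced
        Block("b", lru_position=300),                            # deep in the LRU queue
        Block("c", accesses_until_reuse=20),                     # hinted, reused soon
        Block("d", lru_position=500, accesses_until_reuse=2000), # hinted, but far off
    ]
    print("eject:", least_valuable(cache).name)
    print("prefetch now?", should_prefetch(prefetch_benefit=0.05, cache=cache))

The toy value functions only preserve the ordering implied by the previous slides; in the paper the actual values are the cost and benefit estimates expressed in the common currency.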

  28. Informed caching example: MRU • The informed cache manager discovers MRU caching without being specifically coded to implement this policy: when a file larger than the cache is read sequentially over and over, the hints show that the block just read is the one whose next use is furthest away, so it is the globally least valuable block and is the one replaced.

  29. Implementation of informed caching and prefetch

  30. Implementation of informed caching and prefetch(cont’d)

  31. Performance improvement by informed caching

  32. Balance contention

  33. Future work • Richer hint languages to disclose future accesses • Strategies for dealing with imprecise but still useful hints • A cost-benefit model adapted to non-uniform bandwidths • Extensibility, e.g., a VM estimator to track VM pages
