
Cache Pipelining with Partial Operand Knowledge


Presentation Transcript


  1. Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison http://www.ece.wisc.edu/~pharm

  2. Cache Power Consumption • Increasing on-chip cache size • Increasing cache power consumption • Increasing clock frequency • Increasing dynamic power • Lots of prior work to reduce cache power consumption

  3. Prior Work • Cache subbanking, bitline segmentation[Su et al. 1995, Ghose et al. 2001] • Cache decomposition [Huang et al. 2001] • Block buffering [Su et al. 1995] • Reducing Leakage power • Drowsy caches [Flautner et al. 2002, Kim et al. 2002] • Cache decay [Kaxiras et al. 2001] • Gated Vdd [Powell et al. 2000]

  4. Cache Subbanking • Proposed by Su et al. 1995 • Fetching only requested subline • Partitioned data array vertically into several subbanks • Further study by Ghose et al. 2001 • Partitioned data array vertically and horizontally • Only activate the requested subbanks
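
  To make the subbank-selection idea concrete, here is a minimal C sketch, assuming a hypothetical 64 B block split into eight 8 B sublines; only the subbank holding the requested subline is driven:

    #include <stdint.h>

    #define BLOCK_BYTES   64   /* cache block size in bytes              */
    #define SUBLINE_BYTES  8   /* width fetched from a single subbank    */

    /* Cache subbanking: only the subbank that holds the requested
     * subline is activated; the other subbanks stay idle.               */
    static unsigned subbank_select(uint32_t addr)
    {
        unsigned block_offset = addr & (BLOCK_BYTES - 1);  /* offset within the block */
        return block_offset / SUBLINE_BYTES;               /* 0 .. 7 for 64 B / 8 B   */
    }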

  5. Bit-sliced ALU • Originally proposed by Hsu et al. 1985 • Slices the addition operation • e.g., a 32-bit addition -> four 8-bit additions • Avoids waiting for the full-width addition • Bypasses partial operand results • Has been successfully implemented in the Pentium 4 staggered adder
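
  A minimal C sketch of the slicing idea (not the Pentium 4 circuit): a 32-bit add carried out as four 8-bit slice additions, so each slice result exists, and can be bypassed, before the full-width sum completes:

    #include <stdint.h>

    /* 32-bit addition performed as four 8-bit slices, low slice first.
     * Each slice result is ready before the full-width sum is done, so
     * the hardware can forward it to a dependent operation early.       */
    static uint32_t bitsliced_add(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        unsigned carry = 0;
        for (int slice = 0; slice < 4; slice++) {
            unsigned a8 = (a >> (8 * slice)) & 0xFF;
            unsigned b8 = (b >> (8 * slice)) & 0xFF;
            unsigned s  = a8 + b8 + carry;           /* one 8-bit slice add */
            sum  |= (uint32_t)(s & 0xFF) << (8 * slice);
            carry = s >> 8;                          /* carry into next slice */
        }
        return sum;
    }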

  6. Outline • Motivation • Prior Work • Bit-sliced Cache • Experiment Results • Conclusion

  7. Power Consumption in Cache • Row decoding consumes up to 40% of active power

  8. Bit-sliced Cache • Extends cache subbanking technique • Saves decoding power • Enables only row decoders that are accessed • Serializes subarray decoding with row decoding • Uses low order index bits to select row decoder • Minimal changes to subbanking technique
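
  A rough C sketch of the serialized decode, using the simulated L1 D-cache parameters from slide 16 (8 KB, 4-way, 64 B blocks) and an assumed split into four data subarrays: the low-order index bits pick the subarray first, and only that subarray's row decoder is driven with the remaining index bits:

    #include <stdint.h>

    #define BLOCK_BYTES    64
    #define NUM_SETS       32   /* 8 KB, 4-way, 64 B blocks              */
    #define NUM_SUBARRAYS   4   /* assumed data-array partitioning       */

    struct decode {
        unsigned subarray;   /* decoded first, from low-order index bits */
        unsigned row;        /* decoded second, inside that subarray     */
    };

    /* Serialized decode: low-order index bits select the subarray, so
     * only that subarray's row decoder is enabled for the row decode.   */
    static struct decode bitsliced_decode(uint32_t addr)
    {
        unsigned set = (addr / BLOCK_BYTES) % NUM_SETS;
        struct decode d;
        d.subarray = set % NUM_SUBARRAYS;
        d.row      = set / NUM_SUBARRAYS;
        return d;
    }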

  9. Pipelining the Cache Access • Cache access time increases due to: • Serializing subarray decoder with row decoder • Pipeline the access to hide the delay • Need to balance the latency of each stage • Choose operations for each stage carefully • Provide more throughput • Same throughput as a conventional cache with n ports

  10. Pipelined Cache's Access Steps • Cycle 1: start subarray decoding for data and tag • Cycle 2: activate the necessary row decoders; read the tag array while waiting • Cycle 3: read the data array; concurrently, do partial tag comparison • Cycle 4: compare the rest of the tag bits; use the tag comparison result to select data
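
  The same steps as a toy functional model in C, assuming a hypothetical direct-mapped array with four subarrays and one tag byte compared early; the cycle boundaries live only in the comments, the code just shows the dataflow:

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BYTES    64
    #define NUM_SETS       32
    #define NUM_SUBARRAYS   4
    #define ROWS           (NUM_SETS / NUM_SUBARRAYS)

    static uint32_t tag_array [NUM_SUBARRAYS][ROWS];   /* stored tags     */
    static uint32_t data_array[NUM_SUBARRAYS][ROWS];   /* one word / line */

    static bool pipelined_read(uint32_t addr, uint32_t *out)
    {
        uint32_t set = (addr / BLOCK_BYTES) % NUM_SETS;
        uint32_t tag =  addr / (BLOCK_BYTES * NUM_SETS);

        /* Cycle 1: subarray decode for the data and tag arrays          */
        unsigned sub = set % NUM_SUBARRAYS;

        /* Cycle 2: activate only that subarray's row decoder; the tag
         * array is read while the data-array decode finishes            */
        unsigned row = set / NUM_SUBARRAYS;
        uint32_t stored_tag = tag_array[sub][row];

        /* Cycle 3: read the data array; in parallel, compare the
         * low-order (partial) tag bits                                   */
        uint32_t data = data_array[sub][row];
        bool partial_hit = (stored_tag & 0xFF) == (tag & 0xFF);

        /* Cycle 4: compare the remaining tag bits and use the full
         * comparison result to select the data on a hit                  */
        bool hit = partial_hit && ((stored_tag >> 8) == (tag >> 8));
        if (hit)
            *out = data;
        return hit;
    }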

  11. Bit-sliced Cache

  12. Bit-sliced Cache + Bit-sliced ALU • Optimal performance benefit • Cache access starts sooner • As soon as the first slice is available • Limited number of subarrays • According to the number of bits per slice • When the bitslice is too small • Unable to achieve optimal power saving
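
  A small C sketch of why the subarray count is bounded by the slice width, assuming 64 B blocks and 8-bit ALU slices: the first address slice carries only two index bits beyond the block offset, so at most four subarrays can be selected before the rest of the effective address arrives:

    #include <stdint.h>

    #define SLICE_BITS    8   /* bits produced per bit-sliced ALU slice  */
    #define OFFSET_BITS   6   /* 64 B blocks -> 6 block-offset bits      */

    /* Index bits available in the first address slice: 8 - 6 = 2, so
     * at most 2^2 = 4 subarrays can be selected from the early slice.   */
    static unsigned early_subarray_select(uint8_t first_slice)
    {
        return (first_slice >> OFFSET_BITS) &
               ((1u << (SLICE_BITS - OFFSET_BITS)) - 1);
    }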

  13. Pipelining with Bit-sliced Cache • Example instruction sequence: lw R1, 0(R3); add R3, R2, R1; addi R3, R3, 4; lw R4, 4(R3) • Pipeline diagrams compare: Pipelined Execution Stage with Pipelined Cache; Bit-sliced Execution Stage with Pipelined Cache; Bit-sliced Execution Stage with Bit-sliced Cache

  14. Cache Model Simulation • Estimates energy consumption and cache latency • Uses a modified version of CACTI 3.0 • Parameters: Ntbl, Ndbl, Ntwl, Ndwl. • Enumerates all possible configurations • Chooses the one with the best weighted value (cycle time and energy consumption) • Simulates: • Various cache sizes (8K-512K), 64 B blocks • DM, 2-way, 4-way, and 8-way • Uses 0.18 um technology
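
  A hypothetical C sketch of the configuration search; the parameter names mirror CACTI's array-partitioning parameters, but the search range, the model interface, and the equal weighting below are assumptions, not the actual settings used in the study:

    #include <math.h>

    struct cfg { int ndwl, ndbl, ntwl, ntbl; };

    /* Enumerate candidate partitionings and keep the one with the best
     * weighted combination of cycle time and energy.  The 'model'
     * callback and the 50/50 weights are assumed stand-ins.             */
    struct cfg best_config(void (*model)(struct cfg, double *cycle, double *energy))
    {
        struct cfg best = {1, 1, 1, 1};
        double best_score = INFINITY;

        for (int ndwl = 1; ndwl <= 8; ndwl *= 2)
        for (int ndbl = 1; ndbl <= 8; ndbl *= 2)
        for (int ntwl = 1; ntwl <= 8; ntwl *= 2)
        for (int ntbl = 1; ntbl <= 8; ntbl *= 2) {
            struct cfg c = {ndwl, ndbl, ntwl, ntbl};
            double cycle, energy;
            model(c, &cycle, &energy);                  /* assumed interface */
            double score = 0.5 * cycle + 0.5 * energy;  /* assumed weights   */
            if (score < best_score) {
                best_score = score;
                best = c;
            }
        }
        return best;
    }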

  15. Processor Simulation • Estimates performance benefit • Uses a heavily modified SimpleScalar 3.0 • Supports bit-sliced execution stage • Supports speculative slice execution • Benchmarks • Eight Spec2000 Integer benchmarks • Full reference input set • Fast forward 500M, simulate 100M

  16. Machine Configuration • 4-wide fetch, issue, commit • 128 entry ROB • 32 entry scheduler • 20 stage pipeline • 64K-entry gshare • L1 I-Cache: 32KB, 2-way, 64B block • L1 D-Cache: 8KB, 4-way, 64B block • L2 Cache: 512KB, 8-way, 128B block

  17. Energy Consumption / Access

  18. Cycle Time Comparison

  19. Speed Up Comparison

  20. Speed Up Comparison

  21. Conclusion • Bit-sliced cache • Achieves significant power reduction • Without adding much complexity • Adds some delay to access latency • Pipelined bit-sliced cache • Reduces cycle time • Provides more bandwidth • Measurable speedup (w/ bit-sliced ALU)

  22. Questions? Thank you
