
Lecture on High Performance Processor Architecture ( CS05162 )



Presentation Transcript


  1. Lecture on High Performance Processor Architecture (CS05162)
  Introduction on Novel Memory-on-chip Architectures
  An Hong, han@ustc.edu.cn
  Fall 2007
  University of Science and Technology of China, Department of Computer Science and Technology

  2. Outline
  • Cache
    • History
    • Problems and solutions
    • 2K papers on caches by Y2K: do we need more?
  • Processor-in-Memory (PIM)
    • Understand what PIM architectures are
    • Understand the motivation for PIM architectures
      • Performance issues
      • Technological motivations/constraints
    • Be able to discuss different PIM architectures
      • Different applications
      • Different implementations
  • IRAM
  CS of USTC AN Hong

  3. Cache history
  • Caches were introduced (commercially) more than 30 years ago, in the IBM 360/85
    • There was already a processor-memory gap
  • Oblivious to the ISA
    • Caches were organization, not architecture
  • Many different organizations
    • Direct-mapped, set-associative, skewed-associative, sector, decoupled sector, etc.
  • Caches are ubiquitous
    • On-chip, off-chip
    • But also disk caches, web caches, trace caches, etc.
  • Multilevel cache hierarchies
    • With inclusion or exclusion
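To make the organization terms concrete: a direct-mapped lookup simply slices the address into tag, set index, and block offset. A minimal sketch (the 32 KB / 64-byte-line geometry is illustrative, not from the slides):

```python
def split_address(addr, line_size=64, num_sets=512):
    """Decompose a byte address for a direct-mapped cache:
    the low bits select the byte within the line, the middle bits
    pick the set, and the remaining high bits form the tag that is
    stored and compared on each lookup."""
    offset_bits = (line_size - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

A set-associative cache uses the same split with fewer sets, comparing one tag per way within the selected set; a skewed-associative cache instead hashes the address differently for each way.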

  4. Cache history
  • Cache exposed to the ISA
    • Prefetch, fence, purge, etc.
  • Cache exposed to the compiler
    • Code and data placement
  • Cache exposed to the O.S.
    • Page coloring
  • Many different write policies
    • Copy-back, write-through, fetch-on-write, write-around, write-allocate, etc.
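The difference between the write policies shows up directly as memory traffic. A toy single-line model (hypothetical, only to contrast the two policies named above: write-through sends every store to memory, copy-back writes a dirty line only on eviction):

```python
def memory_writes(trace, policy):
    """Count writes that reach memory for a cache holding one line.
    trace: list of line addresses being written.
    policy: 'write-through' or 'copy-back'."""
    line, dirty, writes = None, False, 0
    for addr in trace:
        if policy == "write-through":
            writes += 1                    # every store goes to memory
        else:  # copy-back
            if line is not None and line != addr and dirty:
                writes += 1                # write back the dirty victim
            dirty = True                   # mark the resident line dirty
        line = addr
    if policy == "copy-back" and dirty:
        writes += 1                        # final flush of the dirty line
    return writes
```

With a trace that repeatedly hits the same line, copy-back collapses the repeated stores into a single write-back, which is exactly the bandwidth argument behind it.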

  5. Cache history
  • Numerous cache assists, for example:
    • For storage: write buffers, victim caches, temporal/spatial caches
    • For overlap: lockup-free (non-blocking) caches
    • For latency reduction: prefetching
    • For better cache utilization: bypass mechanisms, dynamic line sizes
    • etc.

  6. Cache history: caches and parallelism
  • Cache coherence
    • Directory schemes
    • Snoopy protocols
  • Synchronization
    • Test-and-test-and-set
    • Load-linked / store-conditional
  • Models of memory consistency
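Test-and-test-and-set is worth a sketch: waiters spin on an ordinary load (a cache hit once the line is local) and issue the expensive atomic read-modify-write only when the lock looks free, which keeps coherence traffic down. In this Python sketch the hardware atomic is simulated with an internal lock, since Python exposes no raw test-and-set:

```python
import threading

class AtomicFlag:
    """Stand-in for a hardware test-and-set word (an internal lock
    simulates the atomicity of the read-modify-write)."""
    def __init__(self):
        self._guard = threading.Lock()
        self.value = 0

    def test_and_set(self):
        """Atomically set the flag to 1 and return its previous value."""
        with self._guard:
            old, self.value = self.value, 1
            return old

    def clear(self):
        with self._guard:
            self.value = 0

class TTASLock:
    """Test-and-test-and-set spinlock: spin on a plain read first,
    so waiters hit their local cache instead of generating bus
    traffic, and attempt the atomic only when the lock looks free."""
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        while True:
            while self._flag.value == 1:        # cheap read: the first "test"
                pass
            if self._flag.test_and_set() == 0:  # atomic "test-and-set"
                return                          # we got the lock

    def release(self):
        self._flag.clear()
```

Load-linked / store-conditional achieves the same effect differently: the store simply fails if any other processor wrote the line between the load and the store.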

  7. When were the 2K papers written?
  • A few facts:
    • 1980 textbook: < 10 pages on caches (2%)
    • 1996 textbook: > 120 pages on caches (20%)
    • Smith survey (1982): about 40 references on caches
    • Uhlig and Mudge survey on trace-driven simulation (1997): about 25 references specific to cache performance alone, many more on performance tools, etc.

  8. Cache research vs. time
  [Chart; annotations: largest number (14), 1st session on caches]

  9. The Memory Latency Problem
  • Technological trend: memory latency is getting longer relative to microprocessor speed (50% per year)
  • Problem: the conventional memory hierarchy is insufficient:
    • Many applications have large data sets that are accessed non-contiguously
    • Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
  • Domain: benchmarks with large data sets: symbolic, signal-processing, and scientific programs

  10. Present Latency Solutions and Limitations

  11. The Memory Bandwidth Problem
  • It's expensive!
  • Often ignored
  • Processor-centric optimizations bridge the latency gap but lead to memory-bandwidth problems:
    • Prefetching
    • Speculation
    • Multithreading
  • These hide latency, but can we always just trade bandwidth for latency?

  12. Present Bandwidth Solutions
  • Wider/faster connections to memory
    • Rambus DRAM
    • Use higher signaling rates on existing pins
    • Use more pins for the memory interface
  • Larger on-chip caches
    • Fewer requests to DRAM
    • Only effective if the larger cache improves the hit rate
  • Traffic-efficient requests
    • Only request what you need
    • Caches are "guessing" that you might need adjacent data
    • Compression?

  13. Present Bandwidth Solutions
  • More efficient on-chip caches
    • Only 1/20 to 1/3 of the data in a cache is live
    • Again, caches are "guessing" what will be used again
    • Spatial vs. temporal vs. no locality
  • Logic/DRAM integration (processor-centric architectures), e.g. IRAM
    • Put the memory on the processor
    • On-chip bandwidth is cheaper than pin bandwidth
    • You will still probably have external DRAM as well
  • Memory-centric architectures, e.g. PIM
    • "Smart" memory: put processing elements wherever there is memory

  14. Modern DRAM
  • Not truly random access
  • Synchronous
  • Three-dimensional addressing
    • Bank
    • Row
    • Column
  • Shared address lines
  • Shared data lines
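"Not truly random access" is mostly about the row buffer: a second access to the currently open row pays only the column (CAS) latency, while switching rows costs precharge plus activate on top. A toy single-bank model with illustrative timing values (not taken from any real part):

```python
class DRAMBank:
    """Toy open-row model of one DRAM bank: accesses to the open
    row are fast (column access only); any other row pays activate,
    plus precharge if a different row is currently open."""
    T_CAS, T_RCD, T_RP = 15, 15, 15   # illustrative ns, not a real device

    def __init__(self):
        self.open_row = None          # no row open after power-up

    def access(self, row):
        """Return the latency (ns) of one access and update bank state."""
        if row == self.open_row:
            return self.T_CAS                  # row-buffer hit
        latency = self.T_RCD + self.T_CAS      # activate the new row
        if self.open_row is not None:
            latency += self.T_RP               # precharge the old row first
        self.open_row = row
        return latency
```

This is why access order matters so much for modern DRAM: streaming through one row is several times faster per access than hopping between rows, even within a single bank.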

  15. DRAM Throughput and Latency

  16. What is a PIM Architecture?
  • PIM = Processor-in-Memory (also Processing in Memory, Computing in Memory)
  • Traditional system architecture: processor on one chip, memory on others
  • Processor-in-memory architecture: integrate processor and memory onto the same die
    • Depends on the ability to fabricate DRAM on the same chip as the processor
    • Enables huge improvements in latency and bandwidth

  17. Motivation
  • Growing gap between processor and memory performance
    • Memory (DRAM) speed increases 7%/year
    • Processor speed increases 55%/year
    • Caches help some, but not enough
  • Increase in the number of transistors per chip
    • 1 GB DRAMs predicted for 2001 (might not make this)
    • Processor architectures approaching 1000M transistors
  • Bandwidth loss at chip crossings
    • DRAM chips are not getting wider at the same rate their capacity increases
    • Time to read/write the entire contents of a chip is increasing with each DRAM generation
  • Maybe!! Logic close to memory "may" provide high-bandwidth, low-latency access to memory
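Compounding those two annual rates shows why caches alone can't close the gap: the processor-to-DRAM speed ratio roughly doubles every two years. A one-line check of the arithmetic:

```python
def perf_gap(years, cpu_rate=0.55, dram_rate=0.07):
    """Relative processor-to-DRAM speed after compounding the two
    annual improvement rates from the slide for `years` years."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years

# After a decade at these rates the gap is roughly 40x.
```

The per-year ratio is 1.55 / 1.07 ≈ 1.45, and 1.45² ≈ 2.1, hence the "doubles every two years" rule of thumb.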

  18. Motivation
  • System issues
    • Memory demands of applications are increasing at half the rate of DRAM density
    • System memory size is being set by bandwidth needs, not memory capacity requirements
  • Would shifting data-intensive computations to the memory system help? ...
  • Advances in fabrication technology make integration of logic and memory practical

  19. Technological Hurdles
  • DRAM fabrication processes are traditionally very different from logic processes
    • DRAM: minimize the size of the bit cell
      • Maximize capacitances for memory stability and slow refresh rates
      • Many polysilicon layers to build capacitors
      • The regular layout doesn't require many metal layers
    • Logic: optimize for circuit speed and density
      • Minimize capacitances for high-speed operation
      • Many metal layers to connect gates
      • Transistors optimized for speed
      • Logic also has a much less regular structure than memory
  • Effect:
    • Implementing DRAM in a logic process hurts density
    • Implementing logic in a DRAM process hurts speed (20-100%)
  • Need fabrication processes that provide both DRAM and logic features
    • IBM has a production fab line optimized for integrating DRAM and logic

  20. What to do with PIM?
  • Simplest approach: just integrate the system's main memory onto the die
    • Relatively easy to design
    • Reduces system cost
    • Provides at best incremental performance improvements
  • Evolutionary approach: optimize the memory system for PIM
    • Much wider connections between levels
    • Can make the structure of main memory visible to the processor
  • Revolutionary approach: move computing into memory
    • Powerful main processor
    • Less powerful processors associated with memory
    • Distribute computation across the different resources

  21. Classification: PIM-Based Systems
  • Integral parts of larger systems, augmenting system performance
    • e.g. Active Pages, FlexRAM, Smart Memories
  • Separate standalone systems for embedded processing
    • e.g. IRAM
  • Arrays for high-performance computing
    • e.g. Blue Gene

  22. Classification: PIM-Based Systems
  • Davis Active Pages (Chong et al. 1998)
    • Associates a small processor with each page of DRAM
    • Effectively, a multiprocessor on each memory chip
    • Co-processor paradigm; reconfigurable logic in memory; apps such as scatter-gather
    • A memory + FPGA
  • Illinois FlexRAM (Torrellas et al. 1999)
    • Places a processor on each memory chip
    • Divides the computation
    • Memory chip = simple multiprocessor + superscalar + banks of DRAM; memory-intensive apps
  • Blue Gene (IBM, 1999)

  23. Classification: PIM-Based Systems
  • Berkeley IRAM (Patterson et al. 1997)
    • Integrates memory onto the main processor die
    • Explores on-chip memory architectures
    • Vector processor; data-stream apps; low power
  • FBRAM (Deering et al. 1994)
    • Graphics in memory
  • 1T-SRAM (1999)
    • Simple logic + DRAM => behaves as if it were an SRAM chip
    • Includes an on-chip refresh controller and a small true-SRAM cache

  24. IRAM
  • Computing target
    • Computer architecture research is biased towards desktop and server applications
    • IRAM research is inclined towards personal mobile computing
  • IRAM goals
    • High performance for multimedia functions
    • Small size and power efficiency
    • Low design complexity
  • Merging the technology of processor and memory
    • All memory accesses remain within a single chip
    • Bandwidth can be as high as 100 to 200 GBytes/sec
    • Access latency is less than 20 ns
    • A good solution for data-intensive streaming applications

  25. IRAM: Architecture
  • Treating on-chip memory as main memory
  [Figure: Berkeley V-IRAM]

  26. IBM Blue Gene (1999)
  • On December 6, 1999, IBM announced it would spend $100 million to build a supercomputer code-named "Blue Gene" for large-scale molecular biology simulation, in particular to study one of the most fundamental problems in computational biology: the mechanism of protein folding.
  • Target: 10^15 floating-point operations per second (1 Petaop/s), which at the time was:
    • 500 times the fastest computer;
    • 1000 times Deep Blue;
    • 40 times the world's top 40 computers combined;
    • 2 million times a desktop PC.
  • By Moore's Law, supercomputers would need at least 15 years to reach this level; IBM decided to achieve it in 5 years.

  27. IBM Blue Gene (cont.)
  • 64 racks, each 6 feet tall
  • 8 boards per rack
  • 64 chips per board
  • 32 processors per chip
  • About 1 million processors in total

  28. Comparison
  • Applicability
  • Cost-performance benefit
  • Ease of programming
  • Scalability
  • Chip area and power requirements
  • Extent of success achieved
  • Other issues like ...

  29. Current Industry Response
  • Manufacturing techniques for memory are different from those for logic
  • PIM would "de-commoditize" the memory market and hurt interoperability
  • Power consumption: effect on DRAM
  • Memory chip-to-chip communication can become a primary bottleneck
  • Research in high-speed memory interfaces, for faster performance than PIM, might bring in new structural design issues
  • At Intel (CRL), PIM is not being pursued as a hot option right now

  30. Future of PIM
  • Optimal PIM ISA and organization
  • Simple, strong, accurate PIM CAD tools
  • Programming model
  • Operating system support
  • Compiler support
  • Algorithms and data structures
  • Static/dynamic load balancing
  • Interface between PIM and non-PIM systems

  31. Issues
  • Bridging the gap!! Can we really do it?
  • How do we program these things?
    • IRAM: standard model of computation
    • FlexRAM: compiler allocates work to processors
      • Views PIM as a multiprocessor system
    • Active Pages: it's the programmer's problem
      • Programmer responsible for selecting work for each processor
    • BlueGene
  • What is the potential for PIM to influence industry?
