Pushing Performance, Efficiency and Scalability of Microprocessors

CERCS IAB Meeting, Fall 2006. Gabriel Loh.


Presentation Transcript


  1. Pushing Performance, Efficiency and Scalability of Microprocessors • CERCS IAB Meeting, Fall 2006 • Gabriel Loh

  2. Research Overview • Funding from the state of GA, Intel, MARCO • Currently 2 PhD students, 2 MS students • Active undergrad research as well • Collaborations • Universities: PSU, UO, Rutgers • Industry: Intel, IBM

  3. Research Focus • “Near-term” microprocessor design issues • ~ 5-year time scale • Power/performance/complexity • Traditional uniprocessor performance • Multi-core performance • “Longer-term” • Keeping Moore’s Law alive for the longer term • Primarily, 3D integration for now

  4. Scaling Performance and Efficiency • Multi-cores are here, but single-thread perf still matters • Intel Core 2 Duo is multi-core, but… • Each core is more out-of-order (OOO) than ever • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipeline and decoders • But power also really matters • Lower clock speeds, transistors with different channel lengths, more uop fusion, …

  5. Research Focus • Maximum performance within bounds • Bounds = power, area, TDP, … • Single-core performance helps multi-core performance, too • Future multi-core systems need a good balance between single-thread (1T) and multi-thread (MT) performance • Most of our research is at the microarchitecture (uarch) level • Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.

  6. Highlight: Traditional Caching [MICRO’06] • It is well known that different applications respond differently to different replacement policies • Previous work in the OS domain has described adaptive replacement with provable performance bounds • We adapted these techniques for on-chip caches

  7. Idea…

  8. Adaptive Cache Implementation • Theoretical guarantees • Miss rate is provably bounded to within a factor of two of the better algorithm’s • In practice, it does much better than this bound
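The slides do not spell out the mechanism, so the following is only a minimal sketch of one way to build an adaptive cache, under assumptions that are not taken from the talk: the two component policies are LRU and LFU (a hypothetical choice), the cache is fully associative, each policy is tracked by a tag-only "shadow" directory, and a short miss-history window (the hypothetical `window` parameter) decides which policy currently leads. On a miss, the real cache evicts a resident block that the leading policy would not be holding. The sketch does not reproduce the factor-of-two proof.

```python
from collections import OrderedDict, deque


class LRUShadow:
    """Tag-only, fully associative directory managed with LRU replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, tag) -> bool:
        if tag in self.tags:
            self.tags.move_to_end(tag)  # most recently used goes to the end
            return True
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)  # evict least recently used
        self.tags[tag] = None
        return False

    def __contains__(self, tag) -> bool:
        return tag in self.tags


class LFUShadow:
    """Tag-only directory managed with LFU replacement (ties: older use evicted)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.count = {}
        self.last_use = {}
        self.clock = 0

    def access(self, tag) -> bool:
        self.clock += 1
        hit = tag in self.count
        if not hit:
            if len(self.count) >= self.capacity:
                victim = min(self.count, key=lambda t: (self.count[t], self.last_use[t]))
                del self.count[victim], self.last_use[victim]
            self.count[tag] = 0
        self.count[tag] += 1
        self.last_use[tag] = self.clock
        return hit

    def __contains__(self, tag) -> bool:
        return tag in self.count


class AdaptiveCache:
    """Real cache whose evictions follow whichever shadow missed less recently."""

    def __init__(self, capacity: int, window: int = 32):
        self.capacity = capacity
        self.lru, self.lfu = LRUShadow(capacity), LFUShadow(capacity)
        self.history = deque(maxlen=window)  # (lru_missed, lfu_missed) per access
        self.blocks = OrderedDict()  # tags actually resident, in recency order
        self.misses = 0

    def _leader(self):
        lru_misses = sum(m for m, _ in self.history)
        lfu_misses = sum(m for _, m in self.history)
        return self.lru if lru_misses <= lfu_misses else self.lfu

    def access(self, tag) -> bool:
        # Every access updates both shadows, so each tracks its policy's contents.
        self.history.append((not self.lru.access(tag), not self.lfu.access(tag)))
        if tag in self.blocks:
            self.blocks.move_to_end(tag)
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            leader = self._leader()
            # The leader now holds `tag`, which is not resident, so at least one
            # resident block is absent from it; evict the least recently used one.
            victim = next(t for t in self.blocks if t not in leader)
            del self.blocks[victim]
        self.blocks[tag] = None
        return False
```

Because the shadows hold tags only, their cost in a hardware version of this sketch would be tag storage rather than data storage.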

  9. Current Research • Working on multi-core generalizations of adaptive caching and other ways to manage shared resources • Uniprocessor microarchitecture • Scalable memory scheduling [MICRO’06] • Memory dependence prediction [HPCA’06] • Branch prediction […] • And more…

  10. Longer-Term Processor Scaling • Limitations/Obstacles • Wire scaling • Latency/performance • Power • Feature size • Lithography, parametric variations • Off-chip communication

  11. 3D Integration • Addresses the obstacles above: wire (power/perf.), off-chip communication, feature size (limitations, variations) • Die/wafer stacking • Less RC → faster, lower-power • [Figure: two stacked dies, each with an active layer and metal layers, connected by die-to-die vias]

  12. Example: Caches • Simplified 2D SRAM array • 3D Wordline Stacking • Wordline length halved • In our studies, the wordline (WL) was critical for latency • 3D Bitline Stacking • Bitline length halved • Bitline (BL) reduction has a greater impact on power savings • Split decoder → no activity stacking • We’ve studied a wide variety of other CPU building blocks
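A first-order way to see why halving a wordline or bitline helps is to model each line as an unrepeated, uniformly distributed RC wire, whose Elmore delay is rcL²/2 and whose switched capacitance is cL. The sketch below assumes square unit cells, unit r and c per cell (hypothetical values), and full-swing switching, and it ignores drivers, access-transistor loading, and sense amplifiers, so it illustrates only the trend, not the numbers from the studies above. The names `Wire` and `array_wires` are hypothetical. Under these assumptions, halving a line's length cuts its wire delay to a quarter and its switched capacitance, and hence dynamic energy, in half.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Wire:
    """A uniformly distributed RC wire (first-order model, no repeaters)."""
    length: float   # in cell pitches (assumed square unit cells)
    r: float = 1.0  # hypothetical resistance per unit length
    c: float = 1.0  # hypothetical capacitance per unit length

    def elmore_delay(self) -> float:
        # Elmore delay of a distributed RC line: (r*L) * (c*L) / 2
        return 0.5 * self.r * self.c * self.length ** 2

    def switch_energy(self, vdd: float = 1.0) -> float:
        # Energy drawn from the supply for one full-swing 0->1 transition: C * Vdd^2
        return self.c * self.length * vdd ** 2


def array_wires(rows: int, cols: int, stack: str = "none") -> tuple[Wire, Wire]:
    """Return (wordline, bitline) for a rows x cols SRAM array.

    stack = "none": planar 2D array.
    stack = "wordline": columns split across two dies, so each wordline halves.
    stack = "bitline": rows split across two dies, so each bitline halves.
    """
    wordline, bitline = float(cols), float(rows)
    if stack == "wordline":
        wordline /= 2
    elif stack == "bitline":
        bitline /= 2
    elif stack != "none":
        raise ValueError(f"unknown stacking option: {stack!r}")
    return Wire(wordline), Wire(bitline)


if __name__ == "__main__":
    for option in ("none", "wordline", "bitline"):
        wl, bl = array_wires(256, 256, option)
        print(f"{option:>8}: WL delay {wl.elmore_delay():8.0f}  "
              f"BL delay {bl.elmore_delay():8.0f}  BL energy {bl.switch_energy():5.0f}")
```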

  13. Uarch-level 3D design • Example: 4-die significance-partitioned datapath • Smaller footprint → faster and lower-power • Width-based gating → even lower power, close to the original power density • Use a uarch prediction mechanism for early determination of width • Overall: 47% performance gain at only a 2-degree temperature increase
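The width-gating idea can be sketched in software. Beyond the 4-die split, everything here is an assumption rather than the design from the talk: a 64-bit datapath divided into four 16-bit slices, one per die, with die 0 holding the least-significant bits; a value "needs" k dies if it fits in 16k bits as a signed integer; and a hypothetical last-width predictor, indexed by instruction address (PC), decides how many dies to power before the result is known. Under-predictions are assumed to trigger a replay on all four dies, while over-predictions are safe but waste power. The names `dies_needed`, `LastWidthPredictor`, and `simulate` are illustrative only.

```python
from collections import defaultdict
from typing import Iterable, Tuple

SLICE_BITS = 16  # bits handled by each die (assumed)
NUM_DIES = 4     # 4 x 16 = 64-bit datapath; die 0 holds the low-order bits


def dies_needed(value: int) -> int:
    """Fewest low-order slices that hold `value` as a signed (two's complement) integer."""
    for k in range(1, NUM_DIES + 1):
        half_range = 1 << (SLICE_BITS * k - 1)
        if -half_range <= value < half_range:
            return k
    raise ValueError("value does not fit in the 64-bit datapath")


class LastWidthPredictor:
    """Hypothetical per-PC predictor: guess the width this PC produced last time."""

    def __init__(self) -> None:
        # Unseen PCs default to full width, which is always safe.
        self.table = defaultdict(lambda: NUM_DIES)

    def predict(self, pc: int) -> int:
        return self.table[pc]

    def update(self, pc: int, actual: int) -> None:
        self.table[pc] = actual


def simulate(trace: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Replay (pc, result) pairs; return (die_activations, under_predictions)."""
    predictor = LastWidthPredictor()
    activations = under = 0
    for pc, value in trace:
        guess = predictor.predict(pc)
        actual = dies_needed(value)
        if actual > guess:
            # Needed dies were gated off: count the gated attempt plus a full replay.
            under += 1
            activations += guess + NUM_DIES
        else:
            # Over-prediction is correct but powers more dies than necessary.
            activations += guess
        predictor.update(pc, actual)
    return activations, under


if __name__ == "__main__":
    trace = [(0x40, 5), (0x40, 7), (0x40, 1 << 20), (0x44, -3)]
    print(simulate(trace))  # (14, 1); always powering all 4 dies would cost 16
```

A last-value predictor is only the simplest choice here; any mechanism that guesses width early enough to gate the upper dies would fit the same accounting.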

  14. 3D Research Summary • Circuit-level [ICCD’05, ISVLSI’06, ISCAS’06, GLSVLSI’06] • Uarch-level [MICRO’06 (w/ ), HPCA’07] • Tutorial papers [JETC’06] • Tutorial [MICRO’06] • Tools [DATE’06, TCAD’07] w/ GTCAD & • Parametric variations w/ Jim Meindl • Funding, equip from ,

  15. Summary • loh@cc • http://www.cc.gatech.edu/~loh • Lots of exciting work going on here
