
Hybrid System Emulation




  1. Hybrid System Emulation Taeweon Suh Computer Science Education Korea University January 2010

  2. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  3. Scope • A typical computer system (up to Core 2) [Diagram: CPU - FSB (Front-Side Bus) - North Bridge - Main Memory (DDR2); North Bridge - DMI (Direct Media I/F) - South Bridge]

  4. Scope (Cont.) • A Nehalem-based computer system [Diagram: CPU - Main Memory (DDR3); CPU - QuickPath (Intel) or HyperTransport (AMD) - North Bridge - DMI (Direct Media I/F) - South Bridge]

  5. Scope (Cont.) [Diagram: two CPUs (each with cores and L1, L2 caches) on the FSB - North Bridge - Main Memory (DDR2); North Bridge - DMI - South Bridge. The CPUs and the FSB are the scope of this talk]

  6. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  7. Background • Computer architecture research has been done mostly with software simulation • Pros • Relatively easy to implement • Flexibility • Observability • Debuggability • Cons • Simulation time • Difficulty modeling real-world behavior such as I/O

  8. Background (Cont.) • What is an alternative? • FPGA (Field-Programmable Gate Array) • Reconfigurability • Programmable hardware • Short turn-around time • High operation frequency • Observability and debuggability • Many IPs provided • CPUs, memory controller, etc.

  9. Background (Cont.) • FPGA capability example • Reconfigurable Pentium [Photos: a reconfigurable Pentium implemented on an FPGA, shown next to a real Pentium]

  10. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  11. Related Work • MemorIES (2000) • Memory Instrumentation and Emulation System from IBM T.J. Watson • L3 Cache and/or coherence protocol emulation • Plugged into 6xx bus of RS/6000 SMP machine • Passive emulator

  12. Related Work (Cont.) • RAMP • Research Accelerator for Multiple Processors • Parallel computer architecture • Multi-core HW/SW research • Full emulator • Multi-disciplinary project by UC-Berkeley, Stanford, CMU, UT-Austin, MIT and Intel [Photo: BEE2 board with FPGAs]

  13. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  14. Hybrid System Emulation • Combination of an FPGA and a real system • FPGA is deployed in a system of interest • FPGA interacts with the system • Monitors transactions from the system • Provides feedback to the system • System-level active emulation • Run workloads on a real system • Research, measure, and evaluate the emulated components in a full-system configuration • FPGA is deployed on the FSB in this research

  15. Hybrid System Emulation: Experiment Setup • Use an Intel server system equipped with two Pentium-IIIs • Replace one Pentium-III with an FPGA • The FPGA actively participates in transactions on the FSB [Photos: Intel server system and FPGA board; Diagram: Pentium-III and FPGA on the front-side bus (FSB) - North Bridge - 2GB SDRAM]

  16. Hybrid System Emulation: Front-Side Bus (FSB) • FSB protocol • 7-stage pipelined bus (Pentium-III) • Request1, request2, error1, error2, snoop, response, data • How does the FPGA participate in FSB transactions? • Snoop stall • Part of the cache coherence mechanism • Delaying the snoop response • Cache-to-cache transfer • Part of the cache coherence mechanism • Providing data from a processor's cache to the requester via the FSB

  17. Hybrid System Emulation: Cache Coherence Protocol • Example: MESI protocol • Snoop-based protocol • Intel implements MESI [Animated diagram: MESI states (Modified, Exclusive, Shared, Invalid) with shared, cache-to-cache transfer, and invalidation transitions. Two Pentium-IIIs (P0, P1) and main memory holding 1234: 1. P0 read: P0 E 1234; 2. P1 read: P0 S 1234, P1 S 1234; 3. P1 write (abcd): P0 I, P1 M abcd; 4. P0 read: "snoop stall", then cache-to-cache transfer; P0 S abcd, P1 S abcd, memory updated to abcd]
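The four-step example above can be replayed in software. The following is a minimal two-cache MESI sketch for a single cache line, for illustration only; the real protocol runs in hardware during the FSB snoop phase, and the class and function names here are made up.

```python
# Minimal two-cache MESI sketch for one cache line (illustrative only).
M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self, name):
        self.name, self.state, self.data = name, I, None

def read(requester, others, memory):
    """Local read: fetch from a peer cache (cache-to-cache) or from memory."""
    if requester.state != I:
        return requester.data                       # hit, no bus traffic
    supplier = next((c for c in others if c.state in (M, E, S)), None)
    if supplier is not None:
        if supplier.state == M:
            memory[0] = supplier.data               # write back dirty data
        supplier.state = S                          # snooped read demotes to Shared
        requester.state, requester.data = S, supplier.data
    else:
        requester.state, requester.data = E, memory[0]  # sole copy: Exclusive
    return requester.data

def write(writer, others, value):
    """Local write: invalidate all other copies, go to Modified."""
    for c in others:
        c.state, c.data = I, None
    writer.state, writer.data = M, value

# Replay the slide's example: memory initially holds "1234"
memory = ["1234"]
p0, p1 = Cache("P0"), Cache("P1")
read(p0, [p1], memory)       # 1. P0 read  -> P0: E
read(p1, [p0], memory)       # 2. P1 read  -> P0: S, P1: S
write(p1, [p0], "abcd")      # 3. P1 write -> P0: I, P1: M
read(p0, [p1], memory)       # 4. P0 read  -> cache-to-cache transfer, both S
print(p0.state, p1.state, p0.data, memory[0])  # S S abcd abcd
```

Step 4 is where the FPGA later intervenes: the supplier's snoop response (here, the `supplier is not None` branch) is exactly the point where a snoop stall or cache-to-cache transfer can be injected.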

  18. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation┼ • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions ┼Erico Nurvitadhi, Jumnit Hong, and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008

  19. L3 Cache Emulation: Methodology • L3 cache emulation methodology • Implement the L3 tags in the FPGA • On a miss, inject snoop stalls and record the line in the L3 tags • Emulated L3 miss latency = snoop stalls + memory access latency • On a hit, inject no snoop stall • Emulated L3 hit latency = memory access latency [Diagram: FPGA holding the L3 tags and a Pentium-III (L1, L2) on the front-side bus (FSB); misses trigger snoop stalls, and data is served by the North Bridge from 2GB SDRAM]
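The tag-only trick above can be sketched in a few lines. The geometry and latencies below (1 MB direct-mapped emulated L3, 32 B lines, 100-cycle memory access, 40-cycle stall) are illustrative assumptions, not numbers from the slides:

```python
# Sketch of tag-only L3 emulation. Only tags live in the FPGA; data always
# comes from main memory, and latency is shaped by injected snoop stalls.
LINE = 32                    # bytes per cache line (assumed)
SETS = (1 << 20) // LINE     # 1 MB direct-mapped -> 32768 sets (assumed)
MEM_LATENCY = 100            # baseline memory access, in FSB cycles (assumed)
STALL = 40                   # snoop-stall penalty on an emulated miss (assumed)

tags = [None] * SETS         # the tag array -- no data storage needed

def access(addr):
    """Return the latency the CPU observes for one memory read."""
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    if tags[index] == tag:
        return MEM_LATENCY            # emulated L3 hit: no snoop stall
    tags[index] = tag                 # allocate the line in the tag array
    return STALL + MEM_LATENCY        # emulated L3 miss: stall + memory access

first = access(0x1000)   # cold miss
second = access(0x1000)  # hit in the emulated L3
print(first, second)     # 140 100
```

The key design point is that a *hit* costs plain memory latency while a *miss* costs memory latency plus stalls, so the emulated L3's hit latency equals the real memory latency, exactly as the slide states.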

  20. L3 Cache Emulation: Experiment Environment • Operating system • Windows XP • Validation of emulated L3 cache • RightMark Memory Analyzer┼ ┼RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml

  21. L3 Cache Emulation: Experiment Result • RightMark Memory Analyzer result [Chart: access latency (nsec and CPU cycles) vs. working set size, with plateaus for the L1 cache, L2 cache, emulated L3 cache, and main memory]

  22. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency┼ • HW/SW Co-Simulation • Conclusions ┼Taeweon Suh, Shih-Lien Lu, and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007

  23. Evaluation of Coherence Traffic Efficiency: Methodology • Evaluation methodology • Implement an L2 cache in the FPGA • Save evicted cache lines into the cache • Supply data using cache-to-cache transfer when the P-III requests it next time • Measure execution time of benchmarks and compare with the baseline [Diagram: FPGA (D$) and Pentium-III (MESI) on the front-side bus (FSB); the FPGA serves reads via "cache-to-cache transfer"; North Bridge - 2GB SDRAM]
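The eviction-capture scheme can be sketched as follows. The geometry (direct-mapped, 256 KB, 32 B lines) matches the largest configuration mentioned in the HW design slide; the dictionary merely stands in for FPGA block RAM, and the function names are made up:

```python
# Sketch of the FPGA-side evaluation cache: capture write-backs off the FSB,
# then answer later reads with a cache-to-cache transfer.
LINE, SIZE = 32, 256 * 1024
SETS = SIZE // LINE
cache = {}                   # index -> (tag, data); stands in for FPGA BRAM
stats = {"c2c_transfers": 0}

def split(addr):
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    return index, tag

def on_eviction(addr, data):
    """FSB write-back observed: capture the evicted line into the FPGA cache."""
    index, tag = split(addr)
    cache[index] = (tag, data)

def on_read(addr):
    """FSB read observed: if we hold the line, supply it cache-to-cache."""
    index, tag = split(addr)
    if index in cache and cache[index][0] == tag:
        stats["c2c_transfers"] += 1
        return cache[index][1]        # FPGA answers instead of main memory
    return None                       # miss: the North Bridge serves the read

on_eviction(0x2000, b"evicted line")
hit = on_read(0x2000)
print(hit, stats["c2c_transfers"])   # b'evicted line' 1
```

Because the FPGA's answer replaces a plain memory access, comparing benchmark run time with and without this cache isolates the intrinsic cost of cache-to-cache transfers.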

  24. Evaluation of Coherence Traffic Efficiency: Experiment Environment • Operating system • Redhat Linux 2.4.20-8 • Natively run SPEC2000 benchmarks • Selection of benchmarks does not affect the evaluation as long as a reasonable amount of bus traffic is generated • FPGA sends statistics to a PC via UART • # cache-to-cache transfers per second • # invalidation transactions per second

  25. Evaluation of Coherence Traffic Efficiency: Experiment Results • Average # cache-to-cache transfers / second [Chart: average # cache-to-cache transfers/sec for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; labeled values include 804.2K/sec and 433.3K/sec]

  26. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average execution time increase • Baseline: benchmark execution on a single P-III without the FPGA • Data is always supplied from main memory [Chart: execution-time increase per benchmark; labeled values include 191 seconds and 171 seconds; average execution time: 5635 seconds (93 min)]

  27. Evaluation of Coherence Traffic Efficiency: Run-time Breakdown • Run-time estimation with a 256KB cache in the FPGA [Chart: estimated breakdown; labeled ranges include 69 ~ 138 seconds and 381 ~ 762 seconds] • Note that the execution time increased by 171 seconds on average, out of an average total baseline execution time of 5635 seconds • Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase • Cache-to-cache transfer on the P-III server system is NOT as efficient as main memory access!

  28. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation┼ • Conclusions ┼Taeweon Suh, Hsien-Hsin S. Lee, and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006

  29. HW/SW Co-Simulation: Motivation • Gain advantages from both software simulation and hardware emulation • Flexibility • High speed • Idea • Offload heavy software routines into the FPGA • The remaining simulator interacts with the FPGA

  30. HW/SW Co-Simulation: Communication Method • Communication between the P-III and the FPGA • Use the FSB as the communication medium • Allocate one page in memory for communication • Send data to the FPGA: write-through cache mode ("write" bus transaction) • Receive data from the FPGA: "read" bus transaction answered by a "cache-to-cache transfer" [Diagram: FPGA and Pentium-III (MESI) on the front-side bus (FSB) - North Bridge - 2GB SDRAM]
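The mailbox idea can be modeled in a few lines. Everything here is hypothetical (class names, offsets, the placeholder routine); the real design maps one physical page through a Linux device driver and the FPGA watches that page's FSB traffic:

```python
# Toy model of the FSB mailbox: writes reach the FPGA via write-through
# stores; reads are answered by the FPGA with a cache-to-cache transfer.
class FpgaMailbox:
    """Stands in for the FPGA watching one physical page on the FSB."""
    def __init__(self):
        self.inbox = {}

    def fsb_write(self, offset, value):
        # Write-through mode: every store appears on the FSB immediately,
        # so the FPGA sees the argument as soon as the simulator writes it.
        self.inbox[offset] = value

    def fsb_read(self, offset):
        # The FPGA claims the read during the snoop phase and supplies the
        # result via cache-to-cache transfer instead of main memory.
        arg = self.inbox.get(0)           # argument word written earlier
        return compute_in_fpga(arg)

def compute_in_fpga(addr):
    # Placeholder for the offloaded routine (e.g. mem_access_latency);
    # the constant is made up for illustration.
    return 100 if addr is None else 100 + addr % 4

fpga = FpgaMailbox()
fpga.fsb_write(0, 0x1ABC)        # simulator passes an argument
latency = fpga.fsb_read(8)       # simulator reads back the FPGA's answer
print(latency)                   # 100
```

The design choice to exploit is that both directions reuse ordinary cacheable loads and stores, so no extra bus protocol is needed; the cost is that every exchange pays a full FSB transaction, which is why the next slide reports a slowdown.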

  31. HW/SW Co-Simulation: Co-Simulation Results • Preliminary experiment with SimpleScalar for correctness checking • Implement a simple function (mem_access_latency) in the FPGA

  Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
  mcf         2:18:38            2:20:50                 + 0:02:12
  bzip2       3:03:58            3:06:50                 + 0:02:52
  crafty      2:56:38            2:59:28                 + 0:02:50
  eon-cook    2:43:52            2:45:45                 + 0:01:53
  gcc-166     3:45:30            3:48:56                 + 0:03:26
  parser      3:34:57            3:37:27                 + 0:02:30
  perl        2:42:30            2:45:50                 + 0:03:20
  twolf       2:43:30            2:45:28                 + 0:01:58

  32. HW/SW Co-Simulation: Analysis & Learnings • Reasons for the slowdown • FSB access is expensive • The offloaded function (mem_access_latency) is too simple • Device driver overhead • Success criteria • Time-consuming software routines • Reasonable FPGA access frequency

  33. HW/SW Co-Simulation: Research Opportunity • Multi-core research • Implement distributed lowest-level caches, and an interconnection network such as a ring or mesh, in the FPGA [Diagram: eight CPUs (CPU0-CPU7), each with L1/L2 and an L3 slice, connected through ring interfaces; the L3 slices and ring interconnect are implemented in the FPGA]

  34. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  35. Conclusions • Hybrid system emulation • Deploy an FPGA at a place of interest in a system • System-level active emulation • Take advantage of an existing system • Presented three use cases in computer architecture research • L3 cache emulation • Evaluation of coherence traffic efficiency • HW/SW co-simulation • FPGA-based emulation provides an alternative to software simulation

  36. Questions, Comments? Thanks for your attention!

  37. Backup Slides

  38. Evaluation of Coherence Traffic Efficiency: Cache Coherence Protocol • Example: MESI protocol • Snoop-based protocol • Intel implements MESI [Animated diagram: MESI states (Modified, Exclusive, Shared, Invalid) with shared, cache-to-cache, and invalidate transitions. Two processors (P0, P1) and main memory holding 1234: 1. P0 read; 2. P1 read; 3. P1 write (abcd); 4. P0 read]

  39. L3 Cache Emulation: Motivation • Software simulation has limitations • Simulation time • Reduced datasets and workloads • Results could be off by 100% or more • Passive emulation has limitations • Monitors transactions only • The impact of emulated components on the system cannot be modeled • Full simulation requires much more effort • Takes much longer to develop • Develop a full system • Adapt workloads to a new system

  40. L3 Cache Emulation: Motivation (Cont.) • Active Cache Emulation (ACE) • Take advantage of an existing system • Deploy an emulated component to a place of interest

  41. L3 Cache Emulation: HW Design • Implemented modules in the FPGA • State machines • Keep track of up to 8 FSB transactions • L3 tags • Emulated L3 size varies from 1MB to 64MB • Block size varies from 32B to 512B • Statistics module [Diagram: FPGA (Xilinx Virtex-II) containing 8 state machines tracking the FSB pipeline, the L3 cache tags, and registers for statistics; connected to the front-side bus (FSB), a logic analyzer, and a PC via UART]
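As a back-of-the-envelope check on the tag-array cost across these configurations, one can compute the tag width per line. The 36-bit physical address and the direct-mapped organization are assumptions for illustration; the slides do not state either:

```python
# Tag storage estimate for the emulated L3 across the slide's size range.
ADDR_BITS = 36               # P-III physical address width (assumed)

def tag_bits(cache_bytes, block_bytes):
    """Tag width for a direct-mapped cache: address minus index and offset bits."""
    sets = cache_bytes // block_bytes
    offset = block_bytes.bit_length() - 1     # log2(block size)
    index = sets.bit_length() - 1             # log2(number of sets)
    return ADDR_BITS - index - offset

for size_mb, block in [(1, 32), (64, 32), (1, 512), (64, 512)]:
    size = size_mb << 20
    sets = size // block
    bits = tag_bits(size, block)
    print(f"{size_mb:2d}MB / {block:3d}B blocks: {sets} sets x {bits}-bit tags "
          f"= {sets * bits // 8 // 1024} KB of tag storage")
```

This is why a tag-only design is attractive: even a 64 MB emulated L3 needs only a few megabits of tag state, which is plausible for a Virtex-II-class FPGA, whereas 64 MB of data could not fit on chip.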

  42. Evaluation of Coherence Traffic Efficiency: HW Design • Implemented modules in the FPGA • State machines • Keep track of FSB transactions • Taking evicted data from the FSB • Initiating cache-to-cache transfers • Direct-mapped caches • Cache size in the FPGA varies from 1KB to 256KB • Note that the Pentium-III has a 256KB 4-way set-associative L2 • Statistics module [Diagram: Xilinx Virtex-II FPGA containing 8 state machines (write-back capture, cache-to-cache, the rest), a direct-mapped cache (tag and data), and registers for statistics; connected to the front-side bus (FSB), a logic analyzer, and a PC via UART]

  43. HW/SW Co-Simulation: Implementation • Hardware (FPGA) implementation • State machines • Monitoring bus transactions on the FSB • Checking bus transaction types (read or write) • Managing cache-to-cache transfers • Software functions offloaded to the FPGA • Statistics counters • Software implementation • Linux device driver • A specific physical address is needed for communication • Allocate one page of memory for FPGA access via the Linux device driver • Simulator modification for accessing the FPGA

  44. L3 Cache Emulation: Experiment Results (Cont.) • Comparison with SimpleScalar simulation

  45. Evaluation of Coherence Traffic Efficiency: Motivation • Evaluation of coherence traffic efficiency • Why important? • Understand the impact of coherence traffic on system performance • Reflect into communication architecture • Problems with traditional methods • Evaluation of protocols themselves • Software simulations • Experiments on SMP machines: ambiguous • Solution • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency

  46. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average increase of invalidation traffic / second [Chart: average increase of invalidation traffic/sec for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; labeled values include 306.8K/sec and 157.5K/sec]

  47. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average hit rate in the FPGA's cache • Hit rate = (# cache-to-cache transfers) / (# full-cache-line data reads) [Chart: average hit rate (%) per benchmark; labeled values include 64.89% and 16.9%]

  48. Motivation • Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by state transitions of the coherence protocol • Trace-based simulations were mostly used for protocol evaluation • Software simulations are too slow for broad-range analysis of system behavior • In addition, it is very difficult to model the real world exactly, such as I/O • The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems • This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA

  49. Motivation and Contribution • Evaluation of coherence traffic efficiency • Motivation • The memory wall becomes higher • Important to understand the impact of communication among processors • Traditionally, evaluation of coherence protocols focused on the protocols themselves • Software-based simulation • FPGA technology • The original Pentium fits into one Xilinx Virtex-4 LX200 • Recent emulation effort • RAMP consortium • Contribution • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique [Photos: MemorIES (ASPLOS 2000) and the BEE2 board]

  50. Cache Coherence Protocols • Well-known technique for data consistency among multiprocessors with caches • Classification • Snoop-based protocols • Rely on broadcasting on a shared bus • Based on shared memory • Symmetric access to main memory • Limited scalability • Used to build small-scale multiprocessor systems • Very popular in servers and workstations • Directory-based protocols • Message-based communication via an interconnection network • Based on distributed shared memory (DSM) • Cache-coherent non-uniform memory access (ccNUMA) • Scalable • Used to build large-scale systems • Actively studied in the 1990s
