
Hybrid System Emulation




  1. Hybrid System Emulation Taeweon Suh Computer Science Education Korea University January 2010

  2. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  3. Scope • A typical computer system (up to Core 2) [Diagram: CPU - FSB (Front-Side Bus) - North Bridge - Main Memory (DDR2); North Bridge - DMI (Direct Media I/F) - South Bridge]

  4. Scope (Cont.) • A Nehalem-based computer system [Diagram: CPU - Main Memory (DDR3); CPU - QuickPath (Intel) or HyperTransport (AMD) - North Bridge - DMI (Direct Media I/F) - South Bridge]

  5. Scope (Cont.) [Diagram: two CPUs (each with cores and L1, L2 caches) on the FSB - North Bridge - Main Memory (DDR2); North Bridge - DMI - South Bridge. The CPUs and the FSB are the scope of this talk]

  6. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  7. Background • Computer architecture research has been done mostly with software simulation • Pros • Relatively easy to implement • Flexibility • Observability • Debuggability • Cons • Simulation time • Difficulty modeling real-world behavior such as I/O

  8. Background (Cont.) • What is an alternative? • FPGA (Field-Programmable Gate Array) • Reconfigurability • Programmable hardware • Short turn-around time • High operation frequency • Observability and debuggability • Many IPs provided • CPUs, memory controller, etc.

  9. Background (Cont.) • FPGA capability example • Reconfigurable Pentium [Photos: a reconfigurable Pentium implemented on an FPGA, shown next to a real Pentium]

  10. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  11. Related Work • MemorIES (2000) • Memory Instrumentation and Emulation System from IBM T.J. Watson • L3 Cache and/or coherence protocol emulation • Plugged into 6xx bus of RS/6000 SMP machine • Passive emulator

  12. Related Work (Cont.) • RAMP • Research Accelerator for Multiple Processors • Parallel computer architecture • Multi-core HW/SW research • Full emulator • Multi-disciplinary project by UC-Berkeley, Stanford, CMU, UT-Austin, MIT and Intel [Photo: BEE2 board with FPGAs]

  13. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  14. Hybrid System Emulation • Combination of an FPGA and a real system • FPGA is deployed in a system of interest • FPGA interacts with the system • Monitors transactions from the system • Provides feedback to the system • System-level active emulation • Run workloads on a real system • Research, measure, and evaluate the emulated components in a full-system configuration • FPGA is deployed on the FSB in this research

  15. Hybrid System Emulation: Experiment Setup • Use an Intel server system equipped with two Pentium-IIIs • Replace one Pentium-III with an FPGA • The FPGA actively participates in transactions on the FSB [Photos: Intel server system and FPGA board; Diagram: Pentium-III and FPGA on the front-side bus (FSB) - North Bridge - 2GB SDRAM]

  16. Hybrid System Emulation: Front-Side Bus (FSB) • FSB protocol • 7-stage pipelined bus (Pentium-III) • Request1, request2, error1, error2, snoop, response, data • How does the FPGA participate in FSB transactions? • Snoop stall • Part of the cache coherence mechanism • Delaying the snoop response • Cache-to-cache transfer • Part of the cache coherence mechanism • Providing data from a processor's cache to the requester via the FSB

  17. Hybrid System Emulation: Cache Coherence Protocol • Example: MESI protocol • Snoop-based protocol • Intel implements MESI [Animated diagram: MESI states (Modified, Exclusive, Shared, Invalid) with shared, cache-to-cache transfer, and invalidation transitions. Two Pentium-IIIs (P0, P1) and main memory holding 1234: 1. P0 read: P0 E 1234; 2. P1 read: P0 S 1234, P1 S 1234; 3. P1 write (abcd): P0 I, P1 M abcd; 4. P0 read: "snoop stall", then cache-to-cache transfer; P0 S abcd, P1 S abcd, memory updated to abcd]
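The four-step example above can be replayed in software. The following is a minimal two-cache MESI sketch for a single cache line, for illustration only; the real protocol runs in hardware during the FSB snoop phase, and the class and function names here are made up.

```python
# Minimal two-cache MESI sketch for one cache line (illustrative only).
M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self, name):
        self.name, self.state, self.data = name, I, None

def read(requester, others, memory):
    """Local read: fetch from a peer cache (cache-to-cache) or from memory."""
    if requester.state != I:
        return requester.data                       # hit, no bus traffic
    supplier = next((c for c in others if c.state in (M, E, S)), None)
    if supplier is not None:
        if supplier.state == M:
            memory[0] = supplier.data               # write back dirty data
        supplier.state = S                          # snooped read demotes to Shared
        requester.state, requester.data = S, supplier.data
    else:
        requester.state, requester.data = E, memory[0]  # sole copy: Exclusive
    return requester.data

def write(writer, others, value):
    """Local write: invalidate all other copies, go to Modified."""
    for c in others:
        c.state, c.data = I, None
    writer.state, writer.data = M, value

# Replay the slide's example: memory initially holds "1234"
memory = ["1234"]
p0, p1 = Cache("P0"), Cache("P1")
read(p0, [p1], memory)       # 1. P0 read  -> P0: E
read(p1, [p0], memory)       # 2. P1 read  -> P0: S, P1: S
write(p1, [p0], "abcd")      # 3. P1 write -> P0: I, P1: M
read(p0, [p1], memory)       # 4. P0 read  -> cache-to-cache transfer, both S
print(p0.state, p1.state, p0.data, memory[0])  # S S abcd abcd
```

Step 4 is where the FPGA later intervenes: the supplier's snoop response (here, the `supplier is not None` branch) is exactly the point where a snoop stall or cache-to-cache transfer can be injected.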

  18. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation┼ • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions ┼Erico Nurvitadhi, Jumnit Hong, and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008

  19. L3 Cache Emulation: Methodology • L3 cache emulation methodology • Implement the L3 tags in the FPGA • On a miss, inject snoop stalls and record the line in the L3 tags • Emulated L3 miss latency = snoop stalls + memory access latency • On a hit, inject no snoop stall • Emulated L3 hit latency = memory access latency [Diagram: FPGA holding the L3 tags and a Pentium-III (L1, L2) on the front-side bus (FSB); misses trigger snoop stalls, and data is served by the North Bridge from 2GB SDRAM]
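The tag-only trick above can be sketched in a few lines. The geometry and latencies below (1 MB direct-mapped emulated L3, 32 B lines, 100-cycle memory access, 40-cycle stall) are illustrative assumptions, not numbers from the slides:

```python
# Sketch of tag-only L3 emulation. Only tags live in the FPGA; data always
# comes from main memory, and latency is shaped by injected snoop stalls.
LINE = 32                    # bytes per cache line (assumed)
SETS = (1 << 20) // LINE     # 1 MB direct-mapped -> 32768 sets (assumed)
MEM_LATENCY = 100            # baseline memory access, in FSB cycles (assumed)
STALL = 40                   # snoop-stall penalty on an emulated miss (assumed)

tags = [None] * SETS         # the tag array -- no data storage needed

def access(addr):
    """Return the latency the CPU observes for one memory read."""
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    if tags[index] == tag:
        return MEM_LATENCY            # emulated L3 hit: no snoop stall
    tags[index] = tag                 # allocate the line in the tag array
    return STALL + MEM_LATENCY        # emulated L3 miss: stall + memory access

first = access(0x1000)   # cold miss
second = access(0x1000)  # hit in the emulated L3
print(first, second)     # 140 100
```

The key design point is that a *hit* costs plain memory latency while a *miss* costs memory latency plus stalls, so the emulated L3's hit latency equals the real memory latency, exactly as the slide states.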

  20. L3 Cache Emulation: Experiment Environment • Operating system • Windows XP • Validation of emulated L3 cache • RightMark Memory Analyzer┼ ┼RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml

  21. L3 Cache Emulation: Experiment Result • RightMark Memory Analyzer result [Chart: access latency (nsec and CPU cycles) vs. working set size, with plateaus for the L1 cache, L2 cache, emulated L3 cache, and main memory]

  22. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency┼ • HW/SW Co-Simulation • Conclusions ┼Taeweon Suh, Shih-Lien Lu, and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007

  23. Evaluation of Coherence Traffic Efficiency: Methodology • Evaluation methodology • Implement an L2 cache in the FPGA • Save evicted cache lines into the cache • Supply data using cache-to-cache transfer when the P-III requests it next time • Measure execution time of benchmarks and compare with the baseline [Diagram: FPGA (D$) and Pentium-III (MESI) on the front-side bus (FSB); the FPGA serves reads via "cache-to-cache transfer"; North Bridge - 2GB SDRAM]
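The eviction-capture scheme can be sketched as follows. The geometry (direct-mapped, 256 KB, 32 B lines) matches the largest configuration mentioned in the HW design slide; the dictionary merely stands in for FPGA block RAM, and the function names are made up:

```python
# Sketch of the FPGA-side evaluation cache: capture write-backs off the FSB,
# then answer later reads with a cache-to-cache transfer.
LINE, SIZE = 32, 256 * 1024
SETS = SIZE // LINE
cache = {}                   # index -> (tag, data); stands in for FPGA BRAM
stats = {"c2c_transfers": 0}

def split(addr):
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    return index, tag

def on_eviction(addr, data):
    """FSB write-back observed: capture the evicted line into the FPGA cache."""
    index, tag = split(addr)
    cache[index] = (tag, data)

def on_read(addr):
    """FSB read observed: if we hold the line, supply it cache-to-cache."""
    index, tag = split(addr)
    if index in cache and cache[index][0] == tag:
        stats["c2c_transfers"] += 1
        return cache[index][1]        # FPGA answers instead of main memory
    return None                       # miss: the North Bridge serves the read

on_eviction(0x2000, b"evicted line")
hit = on_read(0x2000)
print(hit, stats["c2c_transfers"])   # b'evicted line' 1
```

Because the FPGA's answer replaces a plain memory access, comparing benchmark run time with and without this cache isolates the intrinsic cost of cache-to-cache transfers.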

  24. Evaluation of Coherence Traffic Efficiency: Experiment Environment • Operating system • Redhat Linux 2.4.20-8 • Natively run SPEC2000 benchmarks • Selection of benchmarks does not affect the evaluation as long as a reasonable amount of bus traffic is generated • FPGA sends statistics to a PC via UART • # cache-to-cache transfers per second • # invalidation transactions per second

  25. Evaluation of Coherence Traffic Efficiency: Experiment Results • Average # cache-to-cache transfers / second [Chart: average # cache-to-cache transfers/sec for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; labeled values include 804.2K/sec and 433.3K/sec]

  26. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average execution time increase • Baseline: benchmark execution on a single P-III without the FPGA • Data is always supplied from main memory [Chart: execution-time increase per benchmark; labeled values include 191 seconds and 171 seconds; average execution time: 5635 seconds (93 min)]

  27. Evaluation of Coherence Traffic Efficiency: Run-time Breakdown • Run-time estimation with a 256KB cache in the FPGA [Chart: estimated breakdown; labeled ranges include 69 ~ 138 seconds and 381 ~ 762 seconds] • Note that the execution time increased by 171 seconds on average, out of an average total baseline execution time of 5635 seconds • Cache-to-cache transfer is responsible for at least a 33-second (171 - 138) increase • Cache-to-cache transfer on the P-III server system is NOT as efficient as main memory access!

  28. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation┼ • Conclusions ┼Taeweon Suh, Hsien-Hsin S. Lee, and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006

  29. HW/SW Co-Simulation: Motivation • Gain advantages from both software simulation and hardware emulation • Flexibility • High speed • Idea • Offload heavy software routines into the FPGA • The remaining simulator interacts with the FPGA

  30. HW/SW Co-Simulation: Communication Method • Communication between the P-III and the FPGA • Use the FSB as the communication medium • Allocate one page in memory for communication • Send data to the FPGA: write-through cache mode ("write" bus transaction) • Receive data from the FPGA: "read" bus transaction answered by a "cache-to-cache transfer" [Diagram: FPGA and Pentium-III (MESI) on the front-side bus (FSB) - North Bridge - 2GB SDRAM]
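The mailbox idea can be modeled in a few lines. Everything here is hypothetical (class names, offsets, the placeholder routine); the real design maps one physical page through a Linux device driver and the FPGA watches that page's FSB traffic:

```python
# Toy model of the FSB mailbox: writes reach the FPGA via write-through
# stores; reads are answered by the FPGA with a cache-to-cache transfer.
class FpgaMailbox:
    """Stands in for the FPGA watching one physical page on the FSB."""
    def __init__(self):
        self.inbox = {}

    def fsb_write(self, offset, value):
        # Write-through mode: every store appears on the FSB immediately,
        # so the FPGA sees the argument as soon as the simulator writes it.
        self.inbox[offset] = value

    def fsb_read(self, offset):
        # The FPGA claims the read during the snoop phase and supplies the
        # result via cache-to-cache transfer instead of main memory.
        arg = self.inbox.get(0)           # argument word written earlier
        return compute_in_fpga(arg)

def compute_in_fpga(addr):
    # Placeholder for the offloaded routine (e.g. mem_access_latency);
    # the constant is made up for illustration.
    return 100 if addr is None else 100 + addr % 4

fpga = FpgaMailbox()
fpga.fsb_write(0, 0x1ABC)        # simulator passes an argument
latency = fpga.fsb_read(8)       # simulator reads back the FPGA's answer
print(latency)                   # 100
```

The design choice to exploit is that both directions reuse ordinary cacheable loads and stores, so no extra bus protocol is needed; the cost is that every exchange pays a full FSB transaction, which is why the next slide reports a slowdown.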

  31. HW/SW Co-Simulation: Co-Simulation Results • Preliminary experiment with SimpleScalar for correctness checking • Implement a simple function (mem_access_latency) in the FPGA

  Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
  mcf         2:18:38            2:20:50                 + 0:02:12
  bzip2       3:03:58            3:06:50                 + 0:02:52
  crafty      2:56:38            2:59:28                 + 0:02:50
  eon-cook    2:43:52            2:45:45                 + 0:01:53
  gcc-166     3:45:30            3:48:56                 + 0:03:26
  parser      3:34:57            3:37:27                 + 0:02:30
  perl        2:42:30            2:45:50                 + 0:03:20
  twolf       2:43:30            2:45:28                 + 0:01:58

  32. HW/SW Co-Simulation: Analysis & Learnings • Reasons for the slowdown • FSB access is expensive • The offloaded function (mem_access_latency) is too simple • Device driver overhead • Success criteria • Time-consuming software routines • Reasonable FPGA access frequency

  33. HW/SW Co-Simulation: Research Opportunity • Multi-core research • Implement distributed lowest-level caches, and an interconnection network such as a ring or mesh, in the FPGA [Diagram: eight CPUs (CPU0-CPU7), each with L1/L2 and an L3 slice, connected through ring interfaces; the L3 slices and ring interconnect are implemented in the FPGA]

  34. Agenda • Scope • Background • Related Work • Hybrid System Emulation • Case Studies • L3 Cache Emulation • Evaluation of Coherence Traffic Efficiency • HW/SW Co-Simulation • Conclusions

  35. Conclusions • Hybrid system emulation • Deploy an FPGA at a place of interest in a system • System-level active emulation • Take advantage of an existing system • Presented three use cases in computer architecture research • L3 cache emulation • Evaluation of coherence traffic efficiency • HW/SW co-simulation • FPGA-based emulation provides an alternative to software simulation

  36. Questions, Comments? Thanks for your attention!

  37. Backup Slides

  38. Evaluation of Coherence Traffic Efficiency: Cache Coherence Protocol • Example: MESI protocol • Snoop-based protocol • Intel implements MESI [Animated diagram: MESI states (Modified, Exclusive, Shared, Invalid) with shared, cache-to-cache, and invalidate transitions. Two processors (P0, P1) and main memory holding 1234: 1. P0 read; 2. P1 read; 3. P1 write (abcd); 4. P0 read]

  39. L3 Cache Emulation: Motivation • Software simulation has limitations • Simulation time • Reduced datasets and workloads • Results could be off by 100% or more • Passive emulation has limitations • Monitors transactions only • The impact of emulated components on the system cannot be modeled • Full simulation requires much more effort • Takes much longer to develop • Develop a full system • Adapt workloads to a new system

  40. L3 Cache Emulation: Motivation (Cont.) • Active Cache Emulation (ACE) • Take advantage of an existing system • Deploy an emulated component to a place of interest

  41. L3 Cache Emulation: HW Design • Implemented modules in the FPGA • State machines • Keep track of up to 8 FSB transactions • L3 tags • Emulated L3 size varies from 1MB to 64MB • Block size varies from 32B to 512B • Statistics module [Diagram: FPGA (Xilinx Virtex-II) containing 8 state machines tracking the FSB pipeline, the L3 cache tags, and registers for statistics; connected to the front-side bus (FSB), a logic analyzer, and a PC via UART]
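As a back-of-the-envelope check on the tag-array cost across these configurations, one can compute the tag width per line. The 36-bit physical address and the direct-mapped organization are assumptions for illustration; the slides do not state either:

```python
# Tag storage estimate for the emulated L3 across the slide's size range.
ADDR_BITS = 36               # P-III physical address width (assumed)

def tag_bits(cache_bytes, block_bytes):
    """Tag width for a direct-mapped cache: address minus index and offset bits."""
    sets = cache_bytes // block_bytes
    offset = block_bytes.bit_length() - 1     # log2(block size)
    index = sets.bit_length() - 1             # log2(number of sets)
    return ADDR_BITS - index - offset

for size_mb, block in [(1, 32), (64, 32), (1, 512), (64, 512)]:
    size = size_mb << 20
    sets = size // block
    bits = tag_bits(size, block)
    print(f"{size_mb:2d}MB / {block:3d}B blocks: {sets} sets x {bits}-bit tags "
          f"= {sets * bits // 8 // 1024} KB of tag storage")
```

This is why a tag-only design is attractive: even a 64 MB emulated L3 needs only a few megabits of tag state, which is plausible for a Virtex-II-class FPGA, whereas 64 MB of data could not fit on chip.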

  42. Evaluation of Coherence Traffic Efficiency: HW Design • Implemented modules in the FPGA • State machines • Keep track of FSB transactions • Taking evicted data from the FSB • Initiating cache-to-cache transfers • Direct-mapped caches • Cache size in the FPGA varies from 1KB to 256KB • Note that the Pentium-III has a 256KB 4-way set-associative L2 • Statistics module [Diagram: Xilinx Virtex-II FPGA containing 8 state machines (write-back capture, cache-to-cache, the rest), a direct-mapped cache (tag and data), and registers for statistics; connected to the front-side bus (FSB), a logic analyzer, and a PC via UART]

  43. HW/SW Co-Simulation: Implementation • Hardware (FPGA) implementation • State machines • Monitoring bus transactions on the FSB • Checking bus transaction types (read or write) • Managing cache-to-cache transfers • Software functions offloaded to the FPGA • Statistics counters • Software implementation • Linux device driver • A specific physical address is needed for communication • Allocate one page of memory for FPGA access via the Linux device driver • Simulator modification for accessing the FPGA

  44. L3 Cache Emulation: Experiment Results (Cont.) • Comparison with SimpleScalar simulation

  45. Evaluation of Coherence Traffic Efficiency: Motivation • Evaluation of coherence traffic efficiency • Why important? • Understand the impact of coherence traffic on system performance • Reflect into communication architecture • Problems with traditional methods • Evaluation of protocols themselves • Software simulations • Experiments on SMP machines: ambiguous • Solution • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency

  46. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average increase of invalidation traffic / second [Chart: average increase of invalidation traffic/sec for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, and the average; labeled values include 306.8K/sec and 157.5K/sec]

  47. Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.) • Average hit rate in the FPGA's cache • Hit rate = (# cache-to-cache transfers) / (# full-cache-line data reads) [Chart: average hit rate (%) per benchmark; labeled values include 64.89% and 16.9%]

  48. Motivation • Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by state transitions of the coherence protocol • Trace-based simulations were mostly used for protocol evaluation • Software simulations are too slow for broad-range analysis of system behavior • In addition, it is very difficult to model the real world exactly, such as I/O • The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems • This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA

  49. Motivation and Contribution • Evaluation of coherence traffic efficiency • Motivation • The memory wall becomes higher • Important to understand the impact of communication among processors • Traditionally, evaluation of coherence protocols focused on the protocols themselves • Software-based simulation • FPGA technology • The original Pentium fits into one Xilinx Virtex-4 LX200 • Recent emulation effort • RAMP consortium • Contribution • A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique [Photos: MemorIES (ASPLOS 2000) and the BEE2 board]

  50. Cache Coherence Protocols • Well-known technique for data consistency among multiprocessors with caches • Classification • Snoop-based protocols • Rely on broadcasting on a shared bus • Based on shared memory • Symmetric access to main memory • Limited scalability • Used to build small-scale multiprocessor systems • Very popular in servers and workstations • Directory-based protocols • Message-based communication via an interconnection network • Based on distributed shared memory (DSM) • Cache-coherent non-uniform memory access (ccNUMA) • Scalable • Used to build large-scale systems • Actively studied in the 1990s
