MacSim Simulator HPArch Research Group
MacSim Tutorial
• Part 2: Overview of MacSim
  • Introduction
  • For black-box approach users
• Part 3: Details of MacSim
  • For computer architecture researchers
• Part 4:
  • MacSim-SST case studies
  • Ocelot-MacSim case studies
  • Research using Ocelot
  • Research using MacSim
MacSim Tutorial (In ISCA-39, 2012)
Introduction of MacSim
• Heterogeneous architecture simulator (x86 + PTX)
• Developed at Georgia Tech
• Trace-driven simulator
  • Internal RISC-style micro-op generation module
  • x86 traces are generated with Pin; PTX traces are generated with GPUOcelot
• Cycle-level simulator
  • Cores, caches, and memory systems are modeled
• Supports various simulations: single-/multi-threaded applications, multi-program, and heterogeneous (CPU+GPU)
MacSim’s Target Architectures
Flexible design to support various platforms
Integration with a parallel simulator (SST) to support high-performance computing systems
From mobile to exascale computing systems
Simulator Infrastructure
[Figure: trace-generation flow feeding the Heterogeneous Architecture Timing & Power Simulator]
• CUDA code (.cu) → NVCC (compiler) → PTX code → GPUOcelot trace generator (GPUOcelot: Prof. Yalamanchili, Georgia Tech)
• x86 binaries → PIN trace generator
• OpenGL code → PIN (API generator) → Attila (OpenGL emulator): ongoing work
• The trace generators pass instruction and thread information to the simulator
Getting MacSim & Build
• Getting MacSim
  • Stable version (Google Code project): http://macsim.googlecode.com/files/macsim-1.0.tar.gz
  • Latest code from the SVN repository
  • Directions are explained at http://code.google.com/p/macsim/wiki/GettingMacsim
• How to build
  • http://code.google.com/p/macsim/wiki/BuildingMacsim
  • Chapter 2 of the manual provides build instructions
  • README file in the simulator directory
Other Software Packages…
• MacSim package
  • IRIS (NoC simulator from Prof. Yalamanchili’s group) is included
• CPU trace generator
  • Download PIN separately. The trace generator tool is in the MacSim package
• GPU trace generator
  • Download Ocelot separately. The trace generator is in the Ocelot package
• MacSim-SST
  • SST needs to be downloaded separately
• Energy Introspector (from Prof. Yalamanchili’s group)
  • EI is a power model based on McPAT and HotSpot. Because of a McPAT license issue, EI cannot currently be distributed, but we will resolve this issue soon
Simulation
MacSim Run
• Once the build succeeds, the binary is created at
  • macsim-top/trunk/bin/macsim
• [Screenshot of a simulation]
• Next question: how do we configure the simulation models?
Setting up Architectures
[Figure: cores of three types (core type 1, core type 2, core type 3) connected to memory]
• Knob variables need to be set up (3 ways)
  • Default value in the source code
  • params.in
  • Command line
Example) 4-Core 2-way SMT
• Configuration
  • 4 cores
  • 2-way SMT
.def:
  param<NUM_SIM_CORES, num_sim_cores, int, 4>
params.in:
  num_sim_cores 4 // 4 cores
  num_sim_small_cores 0
  num_sim_medium_cores 0
  num_sim_large_cores 4
  max_threads_per_large_core 2
  large_core_type x86
  repeat_trace 1
Command line:
  ./macsim --num_sim_cores=4
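The three ways of setting a knob can be pictured as layers that are merged into one configuration. The sketch below is only an illustration, not MacSim's own code: it assumes that params.in overrides the source-code default and that the command line overrides params.in, that params.in uses `name value // comment` lines, and that command-line knobs are written as `--name=value`. The function names are hypothetical.

```python
def parse_params(text):
    """Parse 'name value // comment' lines (assumed params.in layout)."""
    knobs = {}
    for line in text.splitlines():
        line = line.split("//", 1)[0].strip()  # drop trailing comment
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip lines without a value
        knobs[parts[0]] = parts[1].strip()
    return knobs


def parse_cli(argv):
    """Parse '--name=value' arguments (assumed command-line syntax)."""
    knobs = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            knobs[name] = value
    return knobs


def resolve_knobs(defaults, params_text, argv):
    """Merge the three sources. ASSUMPTION: later sources win."""
    knobs = dict(defaults)
    knobs.update(parse_params(params_text))
    knobs.update(parse_cli(argv))
    return knobs


if __name__ == "__main__":
    defaults = {"num_sim_cores": "4"}
    params_in = "num_sim_large_cores 4\nmax_threads_per_large_core 2\n"
    print(resolve_knobs(defaults, params_in, ["--num_sim_cores=4"]))
```

Values are kept as strings here; a real knob table would also carry the type given in the .def entry.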
Example) CPU+GPU Heterogeneous
  num_sim_cores 8 // 4 CPUs + 4 GPUs
  num_sim_small_cores 4 // 4 GPUs
  num_sim_medium_cores 0
  num_sim_large_cores 4 // 4 CPUs
  core_type ptx // specify small cores
  large_core_type x86
  cpu_frequency 3
  gpu_frequency 1.5
  repeat_trace 1
• Usually, we use small cores for the GPU and large cores for the CPU
• A GPU core internally has multiple processing elements (N-wide SIMD)
• To configure a CPU+GPU architecture
  • Set the number of cores and the core types accordingly
Example) Multi-Program Simulation
• Multiple applications
  • Set up from trace_file_list
trace_file_list:
  4 <- number of applications
  /sample/mcf/trace.txt <- appl 1
  /sample/gcc/trace.txt <- appl 2
  /sample/mm/trace.txt <- appl 3
  /sample/blackscholes/trace.txt <- appl 4
[Figure: Blackscholes, MCF, GCC, MM thread 1, MM thread 2]
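The trace_file_list layout shown on this slide (a count on the first line, then one trace path per application) can be read with a few lines of code. This is a minimal sketch under that assumption only; the function name is hypothetical, and the `<-` annotations are treated as slide commentary rather than part of the file.

```python
def parse_trace_file_list(text):
    """Return the list of trace paths from a trace_file_list-style text.

    ASSUMPTION: the first non-empty line is the number of applications,
    and every following non-empty line is one trace path.
    """
    entries = []
    for raw in text.splitlines():
        entry = raw.split("<-", 1)[0].strip()  # drop slide annotations
        if entry:
            entries.append(entry)
    if not entries:
        raise ValueError("empty trace file list")
    count = int(entries[0])
    paths = entries[1:]
    if len(paths) != count:
        raise ValueError(f"expected {count} traces, found {len(paths)}")
    return paths


if __name__ == "__main__":
    sample = """4
/sample/mcf/trace.txt
/sample/gcc/trace.txt
/sample/mm/trace.txt
/sample/blackscholes/trace.txt
"""
    for app_id, path in enumerate(parse_trace_file_list(sample)):
        print(app_id, path)
```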
Repeating Traces
[Figure: Program 1 runs mcf once; Program 2 repeats gcc four times; Program 3 repeats bfs five times]
• The execution time of each application is different.
• MacSim provides an option to repeat short traces until the longest trace ends.
• Open question: is this the right way to simulate?
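The effect of repeating short traces can be estimated with simple arithmetic: each trace restarts until the longest one has finished. The sketch below assumes every trace has a fixed, known length in cycles, which a real simulation does not guarantee because co-running programs interfere with each other. The cycle counts are made-up illustrative numbers.

```python
def repeat_counts(trace_cycles):
    """Number of times each trace starts before the longest trace ends.

    trace_cycles maps a program name to its (assumed fixed) length in
    cycles. The last repetition of a short trace may be cut off when
    the longest trace ends.
    """
    longest = max(trace_cycles.values())
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive ints.
    return {name: -(-longest // cycles) for name, cycles in trace_cycles.items()}


if __name__ == "__main__":
    # Hypothetical lengths, chosen to mirror the slide's picture.
    print(repeat_counts({"mcf": 1000, "gcc": 250, "bfs": 200}))
```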
Sample Configuration Files
• Sample configuration files are in
  • macsim-top/trunk/params
Limited Support of Multi-thread Applications
[Figure: host thread / main thread; thread spawn / GPU kernel invocation; barrier; four cores]
• Thread spawn is modeled.
• Locks are not modeled.
Trace Generation
• It will be covered in Part-III.
• The trace generator generates thread execution information automatically. Users do not need to worry about this.
Clock Domain
• MacSim has 5 different clock domains
  • CPU
  • GPU
  • Last-level cache
  • Interconnection network
  • DRAM
# Clock
  clock_cpu 3
  clock_gpu 1.5
  clock_l3 1
  clock_noc 1
  clock_mc 1.6
Microarchitecture
[Figure: Pin (with XED) → trace → MacSim trace decoder → uops → timing/power simulator]
• The trace holds macro instructions with decoded information from Pin’s XED
• x86 instructions are mapped to uops
• PTX instructions are mapped to uops (almost 1-to-1 mapping)
• Pipeline stages: Front-end, Decode, Rename, Schedule, Execution, Retire, Memory
Microarchitecture Setup-I
• Front-end, DEC/Rename: just a simple FIFO queue.
  • fetch_latency 5 // front-end depth
  • alloc_latency 5 // decode/allocation depth
  • width // pipeline width (same width for the whole pipeline)
  • bp_dir_mech gshare
  • bp_hist_length 14 // branch history length
• Rename: create RAW dependency (map structure)
  • rob_size 96 // ROB size
• Scheduler // in-order scheduler, ooo scheduler
  • schedule io, ooo // instruction scheduling policy
Microarchitecture Setup-II
• Execution latency
  • Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
  • Variable latency: cache/memory latency
• Instruction scheduling rates
  • isched_rate 4 // # of integer inst. that can be executed per cycle
  • msched_rate 2 // # of memory inst. that can be executed per cycle
  • fsched_rate 2 // # of FP inst. that can be executed per cycle
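The scheduling rates act as per-cycle caps on how many instructions of each class may start execution. The sketch below illustrates only that capping idea, using the values from this slide; it is not MacSim's scheduler. It ignores pipeline width, operand readiness, and the in-order versus out-of-order policy, and the names `SCHED_RATE` and `schedule_cycle` are hypothetical.

```python
# Per-cycle caps taken from the slide: isched_rate, msched_rate, fsched_rate.
SCHED_RATE = {"int": 4, "mem": 2, "fp": 2}


def schedule_cycle(ready, rates=SCHED_RATE):
    """Pick uops to issue in one cycle, oldest first, under per-class caps.

    ready is a list of (uop_id, kind) pairs in age order, where kind is
    "int", "mem", or "fp". Returns (issued_ids, still_waiting).
    """
    used = {kind: 0 for kind in rates}
    issued, waiting = [], []
    for uop_id, kind in ready:
        if used[kind] < rates[kind]:
            used[kind] += 1
            issued.append(uop_id)
        else:
            waiting.append((uop_id, kind))  # class cap reached this cycle
    return issued, waiting


if __name__ == "__main__":
    ready = [(0, "mem"), (1, "mem"), (2, "mem"), (3, "int"), (4, "fp")]
    print(schedule_cycle(ready))
```

With three ready memory uops and a memory rate of 2, the third one waits for the next cycle while the integer and FP uops still issue.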
Microarchitecture Setup-III
• Cache configuration
  • # of sets, associativity, line size, # of banks, etc. (see manual)
  • Cache size = # of sets × assoc × line_size × # of tiles (L3 only)
• DRAM configuration
  • Frequency, bus width, column/activate/precharge latency
  • # of memory controllers, # of banks, # of channels, row buffer size, DRAM scheduling policy
  • Simple but fast DRAM model that models key features
• MacSim is connected with DRAM-SIM2
  • Users can use DRAM-SIM2 for a detailed DRAM timing simulation
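The cache size formula on this slide is a plain product, so a worked example is easy to check by hand. The parameter values below are illustrative only and are not MacSim defaults; the function name is hypothetical.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = # of sets x assoc x line_size x # of tiles."""
    return num_sets * assoc * line_size * num_tiles


if __name__ == "__main__":
    # Hypothetical L3: 1024 sets, 8-way, 64-byte lines, 4 tiles.
    size = cache_size_bytes(1024, 8, 64, num_tiles=4)
    print(size, "bytes =", size // (1024 * 1024), "MiB")
```

For these numbers the product is 1024 × 8 × 64 × 4 = 2,097,152 bytes, or 2 MiB.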
Simulation Outputs
• Statistics
  • Simulation outputs: *.stat.out
  • Files in macsim/trunk/def hold the stat definitions (more details in Part-III)
• Important stats
  • IPC = INST_COUNT_TOT / CYC_COUNT_TOT
  • CPI = CYC_COUNT_TOT / INST_COUNT_TOT
• Per-core stats
  • IPC for core 0 = INST_COUNT_CORE_0 / CYC_COUNT_CORE_0
• Multiple-application stats
  • *.stat.out.<application_id>, e.g., memory.stat.out.0, bp.stat.out.1
  • Each stat file contains stats only for the first run (repeated runs are ignored)
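The IPC and CPI definitions above can be computed directly once the counters are read from a stat file. The sketch below assumes a simple `NAME value` layout per line, which is a guess for illustration; the actual *.stat.out format is described in the MacSim documentation. The function names are hypothetical.

```python
def read_stats(text):
    """Collect 'NAME value ...' lines into a dict (assumed file layout)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            stats[parts[0]] = float(parts[1])
        except ValueError:
            continue  # not a counter line
    return stats


def ipc(stats, core=None):
    """IPC for the whole run, or for one core when core is given."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"INST_COUNT_{suffix}"] / stats[f"CYC_COUNT_{suffix}"]


def cpi(stats, core=None):
    """CPI is the reciprocal of IPC."""
    return 1.0 / ipc(stats, core)


if __name__ == "__main__":
    # Made-up counter values, only to exercise the formulas.
    sample = """INST_COUNT_TOT 2000
CYC_COUNT_TOT 1000
INST_COUNT_CORE_0 500
CYC_COUNT_CORE_0 1000
"""
    s = read_stats(sample)
    print(ipc(s), cpi(s), ipc(s, core=0))
```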
Other Stats
• Memory systems
  • L[1-3]_HIT_CPU / L[1-3]_HIT_GPU
  • L[1-3]_MISS_CPU / L[1-3]_MISS_GPU
• Front-end
  • BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]
• Instruction profiling
  • Based on instruction category: inst.stat.out
• More details regarding statistics are in the documentation
• We will provide a simple script file to fetch stat data
GPGPU Support
GPGPU Support Features
• Multi-threading support is already there.
• Different ISAs: using micro-ops
• Warp?
  • One warp is treated as one thread. Each thread generates its own trace file. Active bit information is included
  • Trace format will be explained in Part-III
• Thread and block scheduling
  • Block-level barrier, block-level scheduling/retirement
  • More details will be explained in Part-III
• Different memory structures
  • Memory systems
Handling Vector Memory Operations
[Figure: a SIMD load instruction with per-thread addresses Addr 0 to Addr 7]
• Coalesced: one mem inst with 128B size; the trace file holds a single TraceInst
• Uncoalesced: a 64B request and two 32B requests; the trace file holds
  TraceInst_begin <- start of memory instruction marker
  TraceMem1
  TraceMem2
  TraceMem3
  TraceInst_end <- end of memory instruction marker
• Include the memory access by each thread of a warp as a separate instruction in the trace
• In the trace, mark these accesses as coming from the same warp
Handling Vector Memory Operations
[Figure: trace file records TraceInst_begin (start of memory instruction marker), TraceMem1 … TraceMemN, TraceInst_end (end of memory instruction marker); MacSim turns them into one parent uop (Mem_type: ld, #children: 8) with children uops addr0 … addrN]
• During simulation, form a “parent” uop that holds all the individual memory accesses as its child uops
• The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to memory
• The parent uop is ready for retirement when all children have completed
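The parent/child relationship described above can be sketched as a small bookkeeping class: the parent records which children are still outstanding and becomes retirable once none remain. This is an illustrative model only; the class name, method names, and addresses are hypothetical and do not come from the MacSim source.

```python
class ParentUop:
    """One memory instruction that owns several per-thread child accesses."""

    def __init__(self, mem_type, child_addrs):
        self.mem_type = mem_type              # e.g. "ld"
        self.child_addrs = list(child_addrs)  # one address per child uop
        self.pending = set(range(len(self.child_addrs)))

    def issue_children(self):
        """Memory stage: hand every child access to the memory system."""
        return list(enumerate(self.child_addrs))

    def complete_child(self, child_id):
        """Called when the memory system finishes one child access."""
        self.pending.discard(child_id)

    def ready_to_retire(self):
        """The parent can retire only after all children have completed."""
        return not self.pending


if __name__ == "__main__":
    # Eight made-up addresses, matching "#children: 8" on the slide.
    parent = ParentUop("ld", [0x1000 + 0x80 * i for i in range(8)])
    for child_id, addr in parent.issue_children():
        parent.complete_child(child_id)
    print(parent.ready_to_retire())
```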
Enhanced MacSim
More Features with MacSim
• IRIS (from Prof. Yalamanchili’s group)
  • Flit-level interconnection network simulator
  • Virtual channels, credit-based flow control, deadlock avoidance, …
  • Part-IV will cover more.
[Figure: nodes connected through routers; topology (ring, mesh, torus, ...)]
• MacSim-SST
  • Parallel simulation