CS 838: Pervasive Parallelism
Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Instructor: Mark D. Hill
Jake Adriaens jtadriaens@wisc.edu
Dan Gibson gibson@cs.wisc.edu
Problem
• Simulation is (really) slow!
  • Simics alone runs at ~5 MIPS (fast!)
  • Add Ruby: ~50 KIPS
  • Add Opal: ~20 KIPS
• Fast simulations lead to faster evaluation of new ideas.
  • Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, but less useful for development.
• Fast simulations are useful for educational purposes
  • Remember how long it took to simulate HW 5, HW 6?
• Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware
CS 838
More Motivation – Why Parallelize?
Chips currently look like this:
[Figure: Dual-Core AMD Opteron die photo, annotated: a couple of cores; memory & I/O control; on-chip cache. From: Microprocessor Report: Best Servers of 2004]
More Motivation – Why Parallelize?
Soon, chips may look like this:
[Figure: block diagram with eight cores (CORE) and four cache banks ($ BANK) joined by an interconnect. More cores! Many more threads.]
The free lunch is over: to get speedup out of multithreaded processors, programmers must implement parallel programs. (for now)
Summary
• Good News: Found parallelism in GEMS
  • Ruby’s event queue often contains independent events
  • Opal has some implicit parallelism, as it simulates many logically independent processors
• Bad News: Speedup potential is limited
  • In most cases, execution within Simics dominates execution time
  • Amdahl’s Law suggests parallelization of GEMS will yield small increases in performance
• Good News: Discovered inefficiencies
  • The way GEMS uses Simics greatly affects Simics performance
  • Isolated troublesome API calls and stalled processor effects
• Bad News: Simics isn’t very thread-friendly
  • No thread-safe functionality
  • Calling the Simics API requires a (costly) thread switch!
Summary
• More Bad News: Parallelization of Ruby was not (entirely) successful
  • Demonstrated little/no performance gain
  • Suffers from deadlock
    • We have a good excuse for this…
  • Nondeterministic
    • Fixable, minor effect
  • Assumptions of non-concurrent execution
    • Ready()/Operate() pairs
What Next?
• Overview of Simics/Ruby/Opal
  • Lengthy example
• Profiling Experiments
  • Description of profiling experiments
  • Results
• Effects Ruby / Opal have on Simics
  • “Null” module experiments
• Parallel Ruby
  • …and its catastrophic failure
• Observations
• Conclusions
Simics / Ruby / Opal Overview - 1
[Figure: GEMS block diagram. Labels: Random Tester; Opal (Detailed Processor Model); Simics; Deterministic; Contended locks; Trace file; Microbenchmarks; Simics loadable modules; boxes E1 to E5.]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Labels: Install Module (twice); Start Sim; Instruction Fetches; I-Fetch Complete (twice). Stage markers: F F F F.]
Instruction listing shown on the slide: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Labels: API Calls for Decoding; Instruction Fetches. Stage markers: D D D D F F F F. Instruction listing unchanged.]
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Label: Step 1 Instr. Stage markers: C X X W X X S X D D D D X S S X D D D D F F F F. Instruction listing unchanged.]
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Labels: Step 3 Instrs.; ld A; ld B. Stage markers: C C C M M S W X S X X. Instruction listing unchanged.]
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Labels: A=1; B=1; ld C. Stage markers: S S S W S S M W. Instruction listing unchanged.]
Opal + Ruby + Simics Operation
[Figure: Opal (Detailed Processor Model) and Simics, with boxes E1 to E5. Labels: Step 4 Instrs.; I-Fetch; call. Stage markers: C C C C X F. Instruction listing unchanged.]
Simple, Right?
Finding Parallelism
• Lots of parallelism opportunities in the example!
  • Ruby/Opal (as described) could be run by separate threads!
• Ruby is a discrete event simulator…
  • Can we apply Fujimoto’s PDES strategies directly?
• Places we found parallelism:
  • Ruby’s Event Queue (Experiment 1)
  • Opal in general, on a per-processor basis (Experiment 2)
  • Modular structure (not explored)
• But how much speedup can we gain through parallelism?
Experiment 1: Ruby’s Event Queue
• Ruby is already a discrete event simulator (DES)
  • Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up!
  • Already has an implicit lookahead of 1, due to existing event scheduling constraints.
• How many events are available for processing in a given cycle of the event queue?
  • Too few could limit lookahead properties
• How long does a typical event execute?
  • Short events could make the queue itself a bottleneck
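The first question above (how many events are available in a given cycle) can be measured by draining a cycle-ordered queue in batches. The following is a minimal sketch of that measurement, not Ruby's actual event queue: the `Event` struct, `EventQueue` alias, and `eventsPerCycleHistogram` function are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <vector>

// Hypothetical event record; only the scheduled cycle matters for this profile.
struct Event {
    std::uint64_t cycle;  // simulated cycle at which the event fires
    int id;               // arbitrary identifier
};

// Orders the priority queue so the event with the smallest cycle is on top.
struct LaterCycleFirst {
    bool operator()(const Event& a, const Event& b) const {
        return a.cycle > b.cycle;
    }
};

using EventQueue =
    std::priority_queue<Event, std::vector<Event>, LaterCycleFirst>;

// Drains a copy of the queue one simulated cycle at a time and returns a
// histogram mapping "events available in a cycle" to "number of such cycles".
std::map<std::size_t, std::size_t> eventsPerCycleHistogram(EventQueue queue) {
    std::map<std::size_t, std::size_t> histogram;
    while (!queue.empty()) {
        const std::uint64_t now = queue.top().cycle;
        std::size_t batch = 0;
        // Every event scheduled for the same cycle is a candidate for
        // concurrent processing, provided the events are independent.
        while (!queue.empty() && queue.top().cycle == now) {
            queue.pop();
            ++batch;
        }
        ++histogram[batch];
    }
    return histogram;
}
```

A histogram dominated by batches of size 1 would mean there is little same-cycle work to hand to additional threads.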
Results 1 – Ruby’s Event Queue
[Figure: two charts, “Event Counts” and “Event Duration”; axis: Percentage of All Events]
Simics Time = SimTime – RubyTime = ~80%
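The ~80% figure can be plugged into Amdahl's Law to bound the payoff of parallelizing Ruby. The sketch below assumes, for illustration only, that the Simics share is entirely serial and that the remaining ~20% parallelizes perfectly; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law: speedup when a fraction `parallelFraction` of the runtime is
// spread perfectly over `workers` threads and the rest stays serial.
double amdahlSpeedup(double parallelFraction, double workers) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(workers >= 1.0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
}

// Upper bound on speedup as the number of workers grows without limit.
double amdahlLimit(double parallelFraction) {
    assert(parallelFraction >= 0.0 && parallelFraction < 1.0);
    return 1.0 / (1.0 - parallelFraction);
}
```

Under these assumptions, `amdahlSpeedup(0.2, 4.0)` is about 1.18 and `amdahlLimit(0.2)` is 1.25, so even an ideal parallel Ruby could not make the whole simulation more than 25% faster.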
Experiment 2 – Opal’s Per-Processor Parallelism
• Opal simulates multiple logically independent processors
  • Simulated processor independence => Parallelism
• Use one thread per simulated processor?
  • Raises work imbalance issues
  • In practice, the work imbalance is tolerable
• Processors are only logically independent
  • A common sequential bottleneck is shared between all Opal processors: Simics
Experiment 2 – Opal’s Per-Processor Parallelism
[Figure: pie chart of execution time in an Opal+Simics simulation. Slices: Opal; SIM_continue API call; SIM_read_phys_memory API call; SIM_break_simulation API call; all other API calls.]
Best parallel Opal speedup <= 40%!
Experiment 2 – Opal’s Per-Processor Parallelism
• Why is SIM_continue so slow?
  • Opal uses SIM_continue to logically progress the simulation by a small number (1-4) of instructions at a time.
  • SIM_continue performs extensive start-up and tear-down optimization, expecting large (10,000+) step sizes
  • Increasing Opal’s stepping size decreases total SIM_continue time significantly, but makes fine-grained simulation difficult
• Why is SIM_read_phys_memory so slow?
  • One call to SIM_read_phys_memory takes ~1 µs of execution time
  • Reads from a proprietary-format compressed file
  • Used by Opal once for every load instruction
  • Loads are quite frequent!
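The step-size effect described above is ordinary amortization of a fixed per-call cost. Here is a minimal cost model, assuming each stepping call pays a constant start-up and tear-down overhead plus a per-instruction cost; `StepCostModel` and `totalTimeNs` are hypothetical names, and all costs are illustrative inputs rather than measured Simics values.

```cpp
#include <cassert>
#include <cstdint>

// All costs are hypothetical inputs in nanoseconds, not measured values.
struct StepCostModel {
    std::uint64_t perCallOverheadNs;  // fixed start-up + tear-down per call
    std::uint64_t perInstructionNs;   // cost of simulating one instruction
};

// Total time to simulate `instructions` when each stepping call advances the
// simulation by `stepSize` instructions.
std::uint64_t totalTimeNs(const StepCostModel& model,
                          std::uint64_t instructions,
                          std::uint64_t stepSize) {
    assert(stepSize > 0);
    // Ceiling division: a final partial step still pays the full overhead.
    const std::uint64_t calls = (instructions + stepSize - 1) / stepSize;
    return calls * model.perCallOverheadNs +
           instructions * model.perInstructionNs;
}
```

In this model the overhead term shrinks in proportion to the step size while the per-instruction term is unchanged, which is why larger steps help most when the per-call overhead dominates.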
Experiment 3 – Simics API Calls
• Can there be more bad news?
  • Yes.
• How does Simics react to alien threads using its API?
Our thread’s output, having just returned from an API call:
Thread 5 returned from Simics.
Simics output that follows:
patch PC: 0x1034e68 0x1034e64
*** ASSERTION ERROR: in line 7530 of file 'v9_service_routines_1.c' with RCSID '@(#) $Id: v9.sg,v 33.0.2.31 2004/10/08 12:23:07 am Exp $'
Please report this. Simics will now self-signal an abort.
patch NPC: 0x1034e6c 0x1034e68
*** Simics getting shaky, switching to 'safe' mode.
*** Simics (thread 31) received an abort signal, probably an assertion.
*** thread 31 exiting.
Something our thread does crashes one of the Simics threads!
Experiment 3 – Simics API Calls
• Simics forbids calling the Simics API from alien threads
• SIM_thread_safe_callback is the only mechanism for using the interface from other threads
  • Slow (see table)
  • Non-blocking
  • Must have released “Main Simics Thread” (MST)
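The general pattern behind a callback of this kind is a queue that any thread may post to, while only one owning thread ever runs the posted work. The sketch below shows that pattern in portable C++ as an analogy; it does not use or reproduce the Simics API, and `MainThreadCallbackQueue`, `post`, and `drain` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Generic "run this on the main thread" queue (an analogy, not Simics code).
class MainThreadCallbackQueue {
public:
    // Safe to call from any thread. Non-blocking in the sense that it only
    // enqueues the work; the callback runs later on the owning thread.
    void post(std::function<void()> callback) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push(std::move(callback));
    }

    // Called only by the owning (main) thread. Runs every callback queued so
    // far and returns how many were executed.
    std::size_t drain() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            std::swap(batch, pending_);  // take the work, release the lock
        }
        std::size_t executed = 0;
        while (!batch.empty()) {
            batch.front()();
            batch.pop();
            ++executed;
        }
        return executed;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> pending_;
};
```

Because a poster cannot observe the result until the owner drains the queue, every API interaction routed this way costs at least one hand-off between threads, which is consistent with the cost noted on the slide.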
Intermediate Conclusions
• Interactions with Simics limit our ability to exploit parallelism in Ruby and Opal
  • Simics is fast without Ruby and/or Opal
  • Ruby and Opal in isolation are reasonably fast
  • Ruby and Opal cause slowdowns in Simics
• The interactions between the GEMS modules and Simics result in performance loss
Experiment 4 – “NULL” Modules
• To study Simics slowdown, we use “NULL” modules:
  • Empty, trivial modules that use interfaces similar to Ruby and Opal
  • Modules contribute very little to runtime directly
  • Effectively isolates Simics performance from module performance
• NullRUBY( X )
  • A simple memory timing model, using the same interface as Ruby
  • Models a memory with a constant latency (X cycles per access)
• NullOPAL( IPC )
  • A trivial processor model, using an interface similar to Opal’s
  • Steps Simics (with SIM_continue) by IPC instructions per cycle
Experiment 4 – “NULL” Modules
[Figure: chart not reproduced]
NullRUBY(0) increases execution time by 2x-3x on average. This is logically equivalent to having no timing model installed.
Experiment 4 – “NULL” Modules
[Figure: chart not reproduced]
Runtime increases ~linearly (or greater) as memory latency increases. Processors stalled on memory requests are costly to simulate!
Experiment 4 – “NULL” Modules
[Figure: chart not reproduced]
Using SIM_continue with a stepping quantum of 10 is 3x-7x faster than the Opal default of 1!
Experiment 4 – “NULL” Modules
[Figure: chart not reproduced]
Ruby (with simulated memory latency of 300 cycles) slows Simics about as much as NullRUBY(200)
Experiment 4 – “NULL” Modules
[Figure: chart not reproduced]
In agreement with the pie chart, the runtime of SIM_continue accounts for about half of the Opal+Simics runtime
“NULL” Module Observations
• Simulations are slow because of interactions between Simics, Ruby, and Opal
  • T(Simics+Modules) != T(Simics) + T(Modules)
• Little or no speedup is possible from parallelizing Ruby and/or Opal with the current Simics interfaces
• Suggested improvements dramatically affect fidelity of simulations
  • Increasing Opal’s step size reduces accuracy
  • Optimizing Simics memory stall time requires coarse-grain simulation
Parallelizing Ruby
• Despite overwhelming likelihood of failure, parallelize anyway!
• Obstacles:
  • Assumptions of non-concurrency
  • Portions of Ruby are auto-generated
  • Simics threading hurdles
  • 48,059 lines of C++ in 312 separate files.
Parallelizing Ruby
• Final implementation suffers from frequent deadlock
• Fine-grained locking leads to many deadlock opportunities
  • Can’t always acquire locks in the same order:
    • Lock ordering by meaning of protected object: Locks have different semantic meanings for different logical events (input vs. output queues)
    • Lock ordering by address of the lock: May need to acquire a lock in order to determine which locks are needed
    • Lock ordering by simulated chip topology: Need knowledge of “where” a particular event is occurring in simulated chip
• Coarse-grained locking has worse performance than a single thread
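For reference, lock ordering by address looks like the sketch below when both locks are known before either is taken. `OrderedPairLock` is a hypothetical helper, not code from the parallel Ruby implementation; it always locks the mutex at the lower address first, so two threads that need the same pair can never hold one lock each.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// RAII guard that locks two distinct mutexes in a globally consistent order
// (lower address first) and unlocks them in reverse order.
class OrderedPairLock {
public:
    OrderedPairLock(std::mutex& a, std::mutex& b)
        : first_(std::less<std::mutex*>()(&a, &b) ? a : b),
          second_(std::less<std::mutex*>()(&a, &b) ? b : a) {
        assert(&a != &b);  // locking the same std::mutex twice would deadlock
        first_.lock();
        second_.lock();
    }

    ~OrderedPairLock() {
        second_.unlock();
        first_.unlock();
    }

    OrderedPairLock(const OrderedPairLock&) = delete;
    OrderedPairLock& operator=(const OrderedPairLock&) = delete;

    // Exposed so the ordering can be inspected.
    const std::mutex* firstLocked() const { return &first_; }

private:
    std::mutex& first_;
    std::mutex& second_;
};
```

This only helps when the full lock set is known up front. The slide's objection is exactly that it is not: one lock may have to be held just to discover which other locks are needed, at which point the address order may already have been violated.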
Parallelizing Ruby
• Occasionally (for very short simulations), no deadlock occurs (solution: coarse-grained locks)
• Some non-determinism, but results are actually quite close to the sequential version
• Almost no speedup
Parallelizing Ruby
• Other challenges:
  • Ready()/Operate() pairs violate object-encapsulated synchronization
    • Ready() status may change between calls of Ready() and Operate()
  • Fine-grained locking with object-encapsulated synchronization is greatly simplified by Solaris-only lock recursion
    • x86-64 pthread libraries on main simulation machines do not support lock recursion
  • Unidentified sharing leads to difficult races
  • Interactions with Simics require extreme synchronization
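The Ready()/Operate() problem is a check-then-act race: even if each method locks internally, another thread can consume the work between the two calls. The sketch below uses a hypothetical `MessageBuffer` class (not a Ruby class) to contrast the racy pair with a single call that checks and acts under one lock acquisition.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical component with a Ready()/Operate() style interface.
class MessageBuffer {
public:
    void enqueue(int message) {
        std::lock_guard<std::mutex> guard(mutex_);
        messages_.push_back(message);
    }

    // Individually thread-safe, but "if (isReady()) operate()" is still racy:
    // the answer can be stale by the time the caller acts on it.
    bool isReady() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return !messages_.empty();
    }

    // Check and act in one critical section. Returns false, leaving `out`
    // untouched, if nothing was available.
    bool tryOperate(int& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (messages_.empty()) {
            return false;
        }
        out = messages_.front();
        messages_.pop_front();
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::deque<int> messages_;
};
```

Merging the pair this way keeps synchronization encapsulated in the object, but it changes the interface, which is costly when many call sites (some auto-generated) assume separate Ready() and Operate() calls.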
Closing Remarks
• Improvements must be made to Ruby/Simics and Opal/Simics interfaces
• Parallelization of Ruby requires a substantial re-write of Ruby’s event queue and associated classes
  • Incorporate knowledge of network topology to provide a lock acquisition order
  • Replace “event” abstraction with “active object” abstraction, which is race-free.
• Parallel programming is hard
  • Chip manufacturers should be worried
The End
[Figure: Opal (Detailed Processor Model) and Simics, with question marks in place of the other modules]