
CS 838: Pervasive Parallelism
Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Instructor: Mark D. Hill
Jake Adriaens (jtadriaens@wisc.edu), Dan Gibson (gibson@cs.wisc.edu)

Problem
• Simulation is (really) slow!
  • Simics alone runs at ~5 MIPS (fast!)
  • Add Ruby: ~50 KIPS
  • Add Opal: ~20 KIPS
• Faster simulations lead to faster evaluation of new ideas
  • Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, but less useful for development
• Fast simulations are useful for educational purposes
  • Remember how long it took to simulate HW 5, HW 6?
• Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware

More Motivation – Why Parallelize?
Chips currently look like this:
• A couple of cores
• Memory & I/O control
• On-chip cache
[Figure: Dual-Core AMD Opteron die photo. From: Microprocessor Report: Best Servers of 2004]

More Motivation – Why Parallelize?
Soon, chips may look like this:
• More cores!
• Many more threads
[Figure: eight cores and four cache ($) banks joined by an interconnect]
The free lunch is over: to get speedup out of multithreaded processors, programmers must implement parallel programs (for now).

Summary
• Good News: Found parallelism in GEMS
  • Ruby’s event queue often contains independent events
  • Opal has some implicit parallelism, as it simulates many logically independent processors
• Bad News: Speedup potential is limited
  • In most cases, execution within Simics dominates execution time
  • Amdahl’s Law suggests parallelization of GEMS will yield small increases in performance
• Good News: Discovered inefficiencies
  • The way GEMS uses Simics greatly affects Simics
  • Isolated troublesome API calls and stalled processor effects
• Bad News: Simics isn’t very thread-friendly
  • No thread-safe functionality
  • Calling Simics API requires a (costly) thread switch!
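As a rough illustration of the Amdahl's Law bullet, the sketch below computes the overall speedup when only part of the runtime can be parallelized. The 20% parallel fraction is an assumption taken from the "Simics Time = ~80%" figure on the Results 1 slide; the function names are illustrative and not part of GEMS.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of runtime scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Assumption: Simics (serial) is ~80% of runtime, so ~20% is parallelizable.
    for workers in (2, 4, 8):
        print(workers, round(amdahl_speedup(0.2, workers), 3))
    print("limit", round(amdahl_limit(0.2), 3))
```

Under that assumption, no number of threads can push the overall speedup past 1.25x, which is why the slides call the potential "limited".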

Summary
• More Bad News: Parallelization of Ruby was not (entirely) successful
  • Demonstrated little/no performance gain
  • Suffers from deadlock
    • We have a good excuse for this…
  • Nondeterministic
    • Fixable, minor effect
  • Assumptions of non-concurrent execution
    • Ready()/Operate() pairs

What Next?
• Overview of Simics/Ruby/Opal
  • Lengthy example
• Profiling Experiments
  • Description of profiling experiments
  • Results
• Effects Ruby / Opal have on Simics
  • “Null” module experiments
• Parallel Ruby
  • …and its catastrophic failure
• Observations
• Conclusions

Simics / Ruby / Opal Overview - 1
[Figure labels: Random Tester (Deterministic), Opal (Detailed Processor Model), Simics, Microbenchmarks, Contended locks, Trace file, Simics loadable modules]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Steps shown: Install Module (Opal, Detailed Processor Model); Install Module; Start Sim; Instruction Fetches; I-Fetch Complete
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: four instructions marked F in Opal, above Simics]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Steps shown: Instruction Fetches; API Calls for Decoding
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: four instructions marked D and four marked F in Opal (Detailed Processor Model), above Simics]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Step shown: Step 1 Instr.
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: in-flight instructions in Opal (Detailed Processor Model) with stage markers C, X, W, S, D, and F, above Simics]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Steps shown: Step 3 Instrs.; ld A; ld B
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: in-flight instructions in Opal (Detailed Processor Model) with stage markers C, M, S, W, and X, above Simics]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Steps shown: A=1; B=1; ld C
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: in-flight instructions in Opal (Detailed Processor Model) with stage markers S, W, and M, above Simics]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Opal + Ruby + Simics Operation
Steps shown: Step 4 Instrs.; I-Fetch call
Example program: loop: add R2 R2 R3; beqz R2 loop; add R1 R2 R3; sub R7 R8 R9; ld R2 A; ld R8 B; beq R2 R8 eq; call my_func1; eq: ld R2 A; beq R2 R4 eq; call my_func2; my_func1: ld R8 C; ret
[Figure: in-flight instructions in Opal (Detailed Processor Model) with stage markers C, X, and F, above Simics]
Simple, Right?
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems

Finding Parallelism
• Lots of parallelism opportunities in the example!
  • Ruby/Opal (as described) could be run by separate threads!
• Ruby is a discrete event simulator…
  • Can we apply Fujimoto’s PDES strategies directly?
• Places we found parallelism:
  • Ruby’s Event Queue (Experiment 1)
  • Opal in general, on a per-processor basis (Experiment 2)
  • Modular structure (not explored)
• But how much speedup can we gain through parallelism?

Experiment 1: Ruby’s Event Queue
• Ruby is already a discrete event simulator (DES)
  • Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up!
  • Already has an implicit lookahead of 1, due to existing event scheduling constraints
• How many events are available for processing in a given cycle of the event queue?
  • Too few could limit lookahead properties
• How long does a typical event execute?
  • Short events could make the queue itself a bottleneck
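The first question above (how many events are available per cycle) can be sketched with a toy discrete event queue. This is a minimal illustration, not Ruby's actual event queue: the `EventQueue` class and its methods are hypothetical, and the "lookahead of 1" is modeled simply by refusing to schedule an event less than one cycle in the future.

```python
import heapq
from collections import Counter


class EventQueue:
    """Toy discrete event queue that counts how many events fire in each cycle."""

    def __init__(self):
        self._heap = []   # entries are (fire_time, sequence_number, action)
        self._seq = 0     # tie-breaker so actions are never compared
        self.now = 0

    def schedule(self, delay, action):
        # Lookahead of 1: an event may only be scheduled for a later cycle.
        if delay < 1:
            raise ValueError("events must be scheduled at least 1 cycle ahead")
        heapq.heappush(self._heap, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        """Process all events in time order; return a cycle -> event count map."""
        per_cycle = Counter()
        while self._heap:
            fire_time, _, action = heapq.heappop(self._heap)
            self.now = fire_time
            per_cycle[fire_time] += 1
            action(self)  # an action may schedule further events
        return per_cycle


if __name__ == "__main__":
    q = EventQueue()
    for _ in range(3):
        q.schedule(1, lambda queue: None)
    q.schedule(2, lambda queue: queue.schedule(1, lambda inner: None))
    print(dict(q.run()))
```

Because nothing can be scheduled into the current cycle, all events sharing a cycle are candidates for concurrent execution, provided they touch no shared state; the per-cycle counts bound how much of that parallelism exists.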

Results 1 – Ruby’s Event Queue
[Figure: event counts and event duration, each as a percentage of all events]
Simics Time = SimTime – RubyTime = ~80%

Experiment 2 – Opal’s Per-Processor Parallelism
• Opal simulates multiple logically independent processors
  • Simulated processor independence => Parallelism
• Use one thread per simulated processor?
  • Raises work imbalance issues
  • In practice, the work imbalance is tolerable
• Processors are only logically independent
  • A common sequential bottleneck is shared between all Opal processors: Simics

Experiment 2 – Opal’s Per-Processor Parallelism
[Figure: pie chart of execution time in an Opal+Simics simulation. Slices: SIM_continue API call, SIM_read_phys_memory API call, SIM_break_simulation API call, all other API calls, Opal]
Best parallel Opal speedup <= 40%!

Experiment 2 – Opal’s Per-Processor Parallelism
• Why is SIM_continue so slow?
  • Opal uses SIM_continue to logically progress the simulation by a small number (1-4) of instructions at a time
  • SIM_continue performs extensive start-up and tear-down optimization, expecting large (10,000+) step sizes
  • Increasing Opal’s stepping size decreases total SIM_continue time significantly, but makes fine-grained simulation difficult
• Why is SIM_read_phys_memory so slow?
  • One call to SIM_read_phys_memory takes ~1 µs of execution time
  • Reads from a proprietary-format compressed file
  • Used by Opal once for every load instruction
  • Loads are quite frequent!
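The step-size tradeoff can be illustrated with a toy cost model in which every stepping call pays a fixed start-up/tear-down overhead on top of per-instruction work. The function name and all numeric values below are invented for illustration; they are not measurements of SIM_continue.

```python
def total_step_cost(instructions, step_size, per_call_overhead, per_instruction_cost):
    """Total cost of simulating `instructions` when stepping `step_size` at a time.

    Each call pays `per_call_overhead` once, regardless of how many
    instructions it advances.
    """
    if step_size < 1:
        raise ValueError("step_size must be >= 1")
    calls = -(-instructions // step_size)  # ceiling division
    return calls * per_call_overhead + instructions * per_instruction_cost


if __name__ == "__main__":
    # Hypothetical costs in arbitrary units: overhead 10 per call, 1 per instruction.
    for step in (1, 4, 10, 10_000):
        print(step, total_step_cost(1_000_000, step, 10.0, 1.0))
```

In this model the fixed overhead dominates at small step sizes and becomes negligible at large ones, which matches the direction of the effect described above, at the price of coarser-grained simulation.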

Experiment 3 – Simics API Calls
• Can there be more bad news?
  • Yes.
• How does Simics react to alien threads using its API?
Our thread’s output, having just returned from an API call, followed by the Simics abort:
  Thread 5 returned from Simics.
  patch PC: 0x1034e68 0x1034e64
  *** ASSERTION ERROR: in line 7530 of file 'v9_service_routines_1.c' with RCSID '@(#) $Id: v9.sg,v 33.0.2.31 2004/10/08 12:23:07 am Exp $'
  Please report this. Simics will now self-signal an abort.
  patch NPC: 0x1034e6c 0x1034e68
  *** Simics getting shaky, switching to 'safe' mode.
  *** Simics (thread 31) received an abort signal, probably an assertion.
  *** thread 31 exiting.
Something our thread does crashes one of the Simics threads!

Experiment 3 – Simics API Calls
• Simics forbids calling the Simics API from alien threads
• SIM_thread_safe_callback is the only mechanism for using the interface from other threads
  • Slow (see table)
  • Non-blocking
  • Must have released “Main Simics Thread” (MST)

Intermediate Conclusions
• Interactions with Simics limit our ability to exploit parallelism in Ruby and Opal
• Simics is fast without Ruby and/or Opal
• Ruby and Opal in isolation are reasonably fast
• Ruby and Opal cause slowdowns in Simics
  • The interactions between the GEMS modules and Simics result in performance loss

Experiment 4 – “NULL” Modules
• To study Simics slowdown, we use “NULL” modules:
  • Empty, trivial modules that use interfaces similar to Ruby and Opal
  • Modules contribute very little to runtime directly
  • Effectively isolates Simics performance from module performance
• NullRUBY( X )
  • A simple memory timing model, using the same interface as Ruby
  • Models a memory with a constant latency (X cycles per access)
• NullOPAL( IPC )
  • A trivial processor model, using an interface similar to Opal’s
  • Steps Simics (with SIM_continue) by IPC instructions per cycle
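To make the NullRUBY( X ) idea concrete, here is a minimal sketch of a constant-latency memory timing model. The class name, the `request` method, and its arguments are hypothetical; this does not reproduce the real Simics timing-model interface that Ruby and NullRUBY use.

```python
class NullMemoryModel:
    """Constant-latency memory timing model, in the spirit of NullRUBY(X)."""

    def __init__(self, latency_cycles: int):
        if latency_cycles < 0:
            raise ValueError("latency_cycles must be non-negative")
        self.latency_cycles = latency_cycles
        self.accesses = 0

    def request(self, address: int, issue_cycle: int) -> int:
        """Return the cycle at which the access completes.

        The address is ignored: every access costs the same fixed latency,
        so the model itself does almost no work.
        """
        self.accesses += 1
        return issue_cycle + self.latency_cycles


if __name__ == "__main__":
    memory = NullMemoryModel(latency_cycles=200)
    print(memory.request(address=0x1000, issue_cycle=5))
```

Because such a model contributes almost nothing to runtime, any slowdown observed with it installed can be attributed to Simics and the interface rather than to the module.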

Experiment 4 – “NULL” Modules
[Figure] NullRUBY(0) increases execution time by 2x-3x on average. This is logically equivalent to having no timing model installed.

Experiment 4 – “NULL” Modules
[Figure] Runtime increases ~linearly (or greater) as memory latency increases. Processors stalled on memory requests are costly to simulate!

Experiment 4 – “NULL” Modules
[Figure] Using SIM_continue with a stepping quantum of 10 is 3x-7x faster than the Opal default of 1!

Experiment 4 – “NULL” Modules
[Figure] Ruby (with simulated memory latency of 300 cycles) slows Simics about as much as NullRUBY(200)

Experiment 4 – “NULL” Modules
[Figure] In agreement with the pie chart, the runtime of SIM_continue accounts for about half of the Opal+Simics runtime

“NULL” Module Observations
• Simulations are slow because of interactions between Simics, Ruby, and Opal
  • T(Simics+Modules) != T(Simics) + T(Modules)
• Little or no speedup is possible from parallelizing Ruby and/or Opal with the current Simics interfaces
• Suggested improvements dramatically affect fidelity of simulations
  • Increasing Opal’s step size reduces accuracy
  • Optimizing Simics memory stall time requires coarse-grain simulation

Parallelizing Ruby
• Despite overwhelming likelihood of failure, parallelize anyway!
• Obstacles:
  • Assumptions of non-concurrency
  • Portions of Ruby are auto-generated
  • Simics threading hurdles
  • 48,059 lines of C++ in 312 separate files

Parallelizing Ruby
• Final implementation suffers from frequent deadlock
  • Fine-grained locking leads to many deadlock opportunities
  • Can’t always acquire locks in the same order:
    • Lock ordering by meaning of protected object: Locks have different semantic meanings for different logical events (input vs. output queues)
    • Lock ordering by address of the lock: May need to acquire a lock in order to determine which locks are needed
    • Lock ordering by simulated chip topology: Need knowledge of “where” a particular event is occurring in simulated chip
• Coarse-grained locking has worse performance than a single thread
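As an illustration of "lock ordering by address of the lock", the sketch below acquires a set of locks in a single global order, using Python's `id()` as a stand-in for the lock's address. The helper name `acquire_in_order` is hypothetical, and this is not how the parallel Ruby was implemented; it only shows the technique in the easy case.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire all `locks` in one global order (by id) to rule out lock cycles."""
    unique = {id(lock): lock for lock in locks}       # drop duplicate locks
    ordered = [unique[key] for key in sorted(unique)]
    acquired = []
    try:
        for lock in ordered:
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):
            lock.release()


if __name__ == "__main__":
    input_lock, output_lock = threading.Lock(), threading.Lock()
    counter = [0]

    def worker(first, second):
        # The two workers name the locks in opposite orders on purpose.
        for _ in range(1000):
            with acquire_in_order(first, second):
                counter[0] += 1

    threads = [
        threading.Thread(target=worker, args=(input_lock, output_lock)),
        threading.Thread(target=worker, args=(output_lock, input_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter[0])
```

The scheme only works when every needed lock is known before the first one is taken. That is exactly the limitation noted above: if a lock must be held just to discover which other locks are needed, the global order cannot be respected.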

Parallelizing Ruby
• Occasionally (for very short simulations), no deadlock occurs (solution: coarse-grain locks)
• Some non-determinism, but results are actually quite close to the sequential version
• Almost no speedup

Parallelizing Ruby
• Other challenges:
  • Ready()/Operate() pairs violate object-encapsulated synchronization
    • Ready() status may change between calls of Ready() and Operate()
  • Fine-grained locking with object-encapsulated synchronization greatly simplified by Solaris-only lock recursion
    • x86-64 pthread libraries on main simulation machines do not support lock recursion
  • Unidentified sharing leads to difficult races
  • Interactions with Simics require extreme synchronization
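The Ready()/Operate() problem is a check-then-act race: each call can be individually synchronized, yet the state may change between them. The sketch below uses a hypothetical `MessageBuffer` class (not Ruby's code) and Python's recursive `threading.RLock` to show one way to combine the two steps under a single lock acquisition.

```python
import threading
from collections import deque


class MessageBuffer:
    """Toy buffer with object-encapsulated synchronization."""

    def __init__(self):
        self._lock = threading.RLock()  # recursive: the owner may re-acquire it
        self._items = deque()

    def enqueue(self, item):
        with self._lock:
            self._items.append(item)

    def ready(self):
        with self._lock:
            return bool(self._items)

    def operate(self):
        with self._lock:
            return self._items.popleft()

    def operate_if_ready(self):
        """Check and act under one lock, so no other thread can slip in between.

        ready() and operate() re-acquire the lock this thread already holds,
        which is why a recursive lock is needed here.
        """
        with self._lock:
            if self.ready():
                return True, self.operate()
            return False, None


if __name__ == "__main__":
    buffer = MessageBuffer()
    buffer.enqueue("msg")
    print(buffer.operate_if_ready())
    print(buffer.operate_if_ready())
```

With a non-recursive lock, `operate_if_ready` would deadlock against itself when it calls `ready()`, which illustrates why the lack of lock recursion made object-encapsulated fine-grained locking harder.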

Closing Remarks
• Improvements must be made to Ruby/Simics and Opal/Simics interfaces
• Parallelization of Ruby requires a substantial re-write of Ruby’s event queue and associated classes
  • Incorporate knowledge of network topology to provide a lock acquisition order
  • Replace “event” abstraction with “active object” abstraction, which is race-free
• Parallel programming is hard
  • Chip manufacturers should be worried
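A minimal sketch of the "active object" idea follows: each object owns a private thread and a message queue, so its state is only ever touched by that one thread. The `ActiveObject` class and its methods are hypothetical and are not a proposal for Ruby's actual redesign.

```python
import queue
import threading


class ActiveObject:
    """Object whose state is mutated only by its own worker thread."""

    _STOP = object()  # private sentinel telling the worker to exit

    def __init__(self):
        self._inbox = queue.Queue()
        self.handled = []  # written only by the worker thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Called from any thread; only enqueues, never touches object state."""
        self._inbox.put(message)

    def stop(self):
        """Ask the worker to finish the queued messages, then wait for it."""
        self._inbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._inbox.get()
            if message is self._STOP:
                break
            self.handled.append(message)  # stand-in for real event handling


if __name__ == "__main__":
    component = ActiveObject()
    for cycle in range(5):
        component.send(cycle)
    component.stop()
    print(component.handled)
```

Other threads interact with the object only through `send`, so the object's own state needs no locks; the thread-safe queue is the single synchronization point.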

The End
[Figure: Opal (Detailed Processor Model) and Simics, with question marks]
Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems
