Embedded Computer Architecture

Embedded Computer Architecture Exploiting ILP VLIW architectures TU/e 5KK73 Henk Corporaal

What are we talking about?
ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
VLIW = Very Long Instruction Word architecture
Instruction format example of a 5-issue VLIW:
| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
Embedded Computer Architecture H. Corporaal and B. Mesman

Single Issue RISC vs VLIW
[Figure: a compiler maps a stream of instructions (instr) onto VLIW instructions made of operation slots (op); unused slots hold a nop. RISC CPU: execute 1 instr/cycle. 3-issue VLIW: execute 1 instr/cycle, 3 ops/cycle.]

Topics Overview
• How to speed up your processor?
  • What options do you have?
• Operation/Instruction Level Parallelism
  • Limits on ILP
• VLIW
  • Examples
• Clustering
• Code generation (2nd slide-set)
• Hands-on

Speed-up: Pipelined Execution of Instructions
Simple 5-stage pipeline:
• IF: Instruction Fetch
• DC: Instruction Decode
• RF: Register Fetch
• EX: Execute instruction
• WB: Write Result Register

INSTRUCTION   CYCLE 1   2   3   4   5   6   7   8
1                   IF  DC  RF  EX  WB
2                       IF  DC  RF  EX  WB
3                           IF  DC  RF  EX  WB
4                               IF  DC  RF  EX  WB

• Purpose of pipelining:
  • Reduce #gate_levels in critical path
  • Reduce CPI close to one (instead of a large number for the multicycle machine)
  • More efficient hardware
• Problems
  • Hazards: pipeline stalls
    • Structural hazards: add more hardware
    • Control hazards, branch penalties: use branch prediction
    • Data hazards: bypassing required

Speed-up: Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages
• Superpipelining degree S:
  $S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
  where:
  f(Op) is the frequency of operation Op
  lt(Op) is the latency of operation Op
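The superpipelining degree is a frequency-weighted average latency, which a few lines of Python can make concrete. This is only a sketch: the operation classes, frequencies, and latencies below are hypothetical values chosen for illustration, not measurements from the course.

```python
def superpipelining_degree(mix):
    """Return S = sum over operations of f(Op) * lt(Op).

    mix maps an operation name to (frequency, latency in cycles).
    Frequencies are expected to sum to 1.
    """
    total = sum(f for f, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("operation frequencies must sum to 1")
    return sum(f * lt for f, lt in mix.values())


# Hypothetical instruction mix (assumption, for illustration only).
mix = {
    "alu": (0.50, 1),     # half of all operations, 1-cycle latency
    "load": (0.25, 2),    # a quarter are loads with 2-cycle latency
    "branch": (0.25, 1),  # a quarter are branches, 1-cycle latency
}

degree = superpipelining_degree(mix)
print(f"S = {degree}")
```

With this mix, only the loads have a latency above one cycle, so S ends up slightly above 1.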

Speed-up: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
  for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];
or
  c = a + 5*b
Assembly:
  set   vl,64
  ldv   v1,0(r2)
  mulvi v2,v1,5
  ldv   v1,0(r1)
  addv  v3,v1,v2
  stv   v3,0(r3)
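To see that the six vector instructions compute the same result as the 64-iteration scalar loop, the sequence can be emulated in Python. This is a sketch only: the helper names `mulvi` and `addv` mirror the assembly mnemonics but are hypothetical Python functions, and the input data is made up.

```python
VL = 64  # vector length, as set by "set vl,64"

# Made-up input arrays (assumption, for illustration only).
a = list(range(VL))
b = [2 * i for i in range(VL)]

# Scalar version: one multiply and one add per loop iteration.
c_scalar = [0] * VL
for i in range(VL):
    c_scalar[i] = a[i] + 5 * b[i]


def mulvi(v, imm):
    """Multiply every element of vector v by the immediate imm."""
    return [x * imm for x in v]


def addv(v1, v2):
    """Element-wise addition of two vectors."""
    return [x + y for x, y in zip(v1, v2)]


# Vector version: each statement stands for one vector instruction.
v1 = b[:VL]          # ldv   v1,0(r2)   (r2 is assumed to point to b)
v2 = mulvi(v1, 5)    # mulvi v2,v1,5
v1 = a[:VL]          # ldv   v1,0(r1)   (r1 is assumed to point to a)
v3 = addv(v1, v2)    # addv  v3,v1,v2
c_vector = v3        # stv   v3,0(r3)   (r3 is assumed to point to c)

print(c_vector == c_scalar)
```

The scalar loop needs 64 iterations of loop control, whereas the vector version needs a fixed number of instructions regardless of the vector length.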

Speed-up: Powerful Instructions (1)
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)
[Figure: SIMD execution method. Instruction 1, Instruction 2, Instruction 3, ..., Instruction n along the time axis; node1, node2, ..., node-K along the other axis.]

Speed-up: Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on restricted scale
  • Used for multi-media instructions
• Examples
  • MMX, SSX, SUN-VIS, HP MAX-2, AMD 3Dnow, Trimedia II
  • Example: $\sum_{i=1}^{4} |a_i - b_i|$

Speed-up: Powerful Instructions (2)
MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per FU):
  FU 1: sub  r8, r5, 3
  FU 2: and  r1, r5, 12
  FU 3: mul  r6, r5, r2
  FU 4: ld   r3, 0(r5)
  FU 5: bnez r5, 13

VLIW architecture: central Register File
[Figure: a shared, multi-ported register file feeding Exec units 1-9, grouped under Issue slots 1-3.]
Q: How many ports does the register file need for n-issue?
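One way to approach the question is to count register accesses per operation. The sketch below assumes a RISC-like operation that reads two source registers and writes one result, so each issue slot needs three ports; the function name and defaults are illustrative assumptions, and operations with other operand counts would change the result.

```python
def register_file_ports(issue_width, reads_per_op=2, writes_per_op=1):
    """Ports needed if every issue slot can access the register file each cycle.

    Assumes each operation reads `reads_per_op` registers and writes
    `writes_per_op` registers (a typical RISC-like operation format).
    """
    return issue_width * (reads_per_op + writes_per_op)


for n in (1, 3, 5, 8):
    print(f"{n}-issue: {register_file_ports(n)} ports")
```

Under this assumption a 5-issue machine needs 15 ports, which agrees with the 15-port register file listed for the 5-issue TriMedia TM1000 later in these slides.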

Philips oldie: TriMedia TM32A processor
[Figure: die floorplan showing the I/O interface, I-cache (32K) and D-cache (16K) with tags, sequencer / decode, register file (128 regs x 32 bits), and the function units ALU0-ALU4, DSPALU0, DSPALU2, DSPMUL1, DSPMUL2, IFMUL1 (float), IFMUL2 (float), FALU0, FALU3, FCOMP2, FTOUGH1, SHIFTER0, SHIFTER1.]
• 0.18 micron
• area: 16.9 mm2
• 200 MHz (typ)
• 1.4 W
• 7 mW/MHz (MIPS processor: 0.9 mW/MHz)

Speed-up: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support
  • Short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions (up to 1024 bits)
• Not binary compatible !!!
• But good compilers exist

Speed-up: Multiple instruction issue (per cycle)
Who guarantees semantic correctness?
• Which instructions can be executed in parallel?
• User: specifies multiple instruction streams
  • Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: run-time detection of ready instructions
  • Superscalar
• Compiler: compile into dataflow representation
  • Dataflow processors

Multiple instruction issue: Three Approaches
Example code:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;
Translation to DDG (Data Dependence Graph):
[Figure: DDG. ld &b and the constant 15 feed +, whose result goes to st &a. ld &d and the constant 3.14 feed *, whose result goes to st &c and to /. ld &f also feeds /, whose result goes to st &e.]

Generated Code

Instr.  Sequential Code     Dataflow Code
I1      ld   r1,M(&b)       ld   M(&b)  -> I2
I2      addi r1,r1,15       addi 15     -> I3
I3      st   r1,M(&a)       st   M(&a)
I4      ld   r1,M(&d)       ld   M(&d)  -> I5
I5      muli r1,r1,3.14     muli 3.14   -> I6, I8
I6      st   r1,M(&c)       st   M(&c)
I7      ld   r2,M(&f)       ld   M(&f)  -> I8
I8      div  r1,r1,r2       div         -> I9
I9      st   r1,M(&e)       st   M(&e)

• 3 approaches:
  • An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
    • No dependencies between streams; in practice, communication and synchronization are required between streams
  • A superscalar issues multiple instructions from the sequential stream
    • Obey dependencies (true and name dependencies)
    • Reverse engineering of the DDG needed at run-time
  • Dataflow code is a direct representation of the DDG
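The "reverse engineering of the DDG" that a superscalar performs at run-time can be sketched by scanning the sequential code and linking each register read to the most recent writer of that register. This is a simplified illustration: the tuple encoding of the instructions is an assumption made here, and memory dependences are ignored because all addresses in the example are distinct.

```python
# (name, opcode, destination register or None, source registers)
code = [
    ("I1", "ld",   "r1", []),
    ("I2", "addi", "r1", ["r1"]),
    ("I3", "st",   None, ["r1"]),
    ("I4", "ld",   "r1", []),
    ("I5", "muli", "r1", ["r1"]),
    ("I6", "st",   None, ["r1"]),
    ("I7", "ld",   "r2", []),
    ("I8", "div",  "r1", ["r1", "r2"]),
    ("I9", "st",   None, ["r1"]),
]


def true_dependences(instructions):
    """Return the RAW (true) dependence edges as (producer, consumer) pairs."""
    last_writer = {}  # register -> name of the latest instruction writing it
    edges = []
    for name, _opcode, dst, srcs in instructions:
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], name))
        if dst is not None:
            last_writer[dst] = name
    return edges


edges = true_dependences(code)
for producer, consumer in edges:
    print(f"{producer} -> {consumer}")
```

The recovered edges are exactly the arrows written out in the dataflow code. No edge connects I1-I3 to I4-I9: the reuse of r1 at I4 is only a name dependence, which is why the two streams can run independently.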

Multiple Instruction Issue: Data flow processor
[Figure: dataflow processor organization with Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations, FU-1, FU-2, ..., FU-K, and Result Tokens fed back.]

Instruction Pipeline Overview
[Figure: pipeline diagrams for CISC (no pipelining), RISC, Superscalar (k parallel IF/DC/ISSUE/RF/EX/ROB/WB pipelines), Superpipelined (IF1..IFs, DC, RF, EX1..EX5, WB), DATAFLOW, and VLIW (single IF and DC feeding k parallel RF/EX/WB pipelines).]

Four dimensional representation of the architecture design space <I, O, D, S>
[Figure: axes Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining Degree 'S'; positions shown for CISC, RISC, Superscalar, Superpipelined, VLIW, Vector, SIMD, MIMD, and Dataflow.]
Note: MIMD would be better treated as a separate, 5th dimension!

Architecture design space
Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures:

Architecture      K     I     O     D      S     Mpar
CISC              1     0.2   1.2   1.1    1     0.26
RISC              1     1     1     1      1.2   1.2
VLIW              10    1     10    1      1.2   12
Superscalar       3     3     1     1      1.2   3.6
Superpipelined    1     1     1     1      3     3
Vector            7     0.1   1     64     5     32
SIMD              1024  1     1     1024   1.2   1229
MIMD              32    32    1     1      1.2   38
Dataflow          10    10    1     1      1.2   12

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
$M_{par} = I \cdot O \cdot D \cdot S$
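The Mpar column follows directly from the product I*O*D*S, which can be checked by recomputing it from the other columns. The sketch below copies the slide's values into a Python list; the variable and function names are illustrative choices, and the listed Mpar values are rounded on the slide.

```python
# (architecture, K, I, O, D, S, Mpar as listed on the slide)
rows = [
    ("CISC",           1,    0.2, 1.2, 1.1,  1.0, 0.26),
    ("RISC",           1,    1,   1,   1,    1.2, 1.2),
    ("VLIW",           10,   1,   10,  1,    1.2, 12),
    ("Superscalar",    3,    3,   1,   1,    1.2, 3.6),
    ("Superpipelined", 1,    1,   1,   1,    3,   3),
    ("Vector",         7,    0.1, 1,   64,   5,   32),
    ("SIMD",           1024, 1,   1,   1024, 1.2, 1229),
    ("MIMD",           32,   32,  1,   1,    1.2, 38),
    ("Dataflow",       10,   10,  1,   1,    1.2, 12),
]


def mpar(i, o, d, s):
    """Parallelism measure Mpar = I * O * D * S."""
    return i * o * d * s


computed = {name: mpar(i, o, d, s) for name, _k, i, o, d, s, _listed in rows}

for name, *_rest, listed in rows:
    print(f"{name:15s} computed = {computed[name]:8.2f}   listed = {listed}")
```

Note that K does not enter the formula: it indicates how much hardware is available, while Mpar expresses how much parallelism the architecture can exploit.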

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism (ILP)
  • Limits on ILP
• VLIW
  • Examples
• Clustering
• Code generation
• Hands-on

General organization of an ILP architecture
[Figure: CPU containing an instruction fetch unit, an instruction decode unit, FU-1 to FU-5, a bypassing network, and a register file; connected to instruction memory and data memory.]

Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  • multi-media (image, audio, video, 3-D, holographic)
  • intelligent search and filtering engines
  • neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: $P = f C V_{dd}^2$

Low power through parallelism
• Sequential Processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • $P = f C V^2$
• Parallel Processor (two times the number of units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • $P = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
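The saving comes entirely from the lower supply voltage, since the factors f/2 and 2C cancel. A small numeric sketch makes this visible; the normalized values and the choice V' = 0.7 V are hypothetical and only serve as an illustration.

```python
def dynamic_power(f, c, v):
    """Dynamic power P = f * C * V^2 (arbitrary, normalized units)."""
    return f * c * v * v


# Normalized sequential processor: f = 1, C = 1, V = 1.
p_seq = dynamic_power(f=1.0, c=1.0, v=1.0)

# Parallel processor: twice the units at half the frequency.
# The reduced voltage V' = 0.7 is an assumed value for illustration.
p_par = dynamic_power(f=0.5, c=2.0, v=0.7)

ratio = p_par / p_seq
print(f"parallel / sequential power = {ratio:.2f}")
```

With these assumed numbers the parallel design delivers the same throughput at roughly half the power, because the ratio equals (V'/V)^2.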

Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
  • Using existing compiler
  • Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
      • register dependence
      • memory dependence
    • Check for correct branch prediction
      • if prediction correct, continue
      • if wrong, flush schedule and start in next cycle

Trace analysis
How parallel can you execute this code?

Program:
  For i := 0..2
    A[i] := i;
  S := X+3;

Compiled code:
        set  r1,0
        set  r2,3
        set  r3,&A
  Loop: st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3

Trace:
  set  r1,0
  set  r2,3
  set  r3,&A
  st   r1,0(r3)
  add  r1,r1,1
  add  r3,r3,4
  brne r1,r2,Loop
  st   r1,0(r3)
  add  r1,r1,1
  add  r3,r3,4
  brne r1,r2,Loop
  st   r1,0(r3)
  add  r1,r1,1
  add  r3,r3,4
  brne r1,r2,Loop
  add  r1,r5,3

Trace analysis
Parallel Trace (one row per cycle):
  1: set r1,0       set r2,3       set r3,&A
  2: st r1,0(r3)    add r1,r1,1    add r3,r3,4
  3: st r1,0(r3)    add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
  4: st r1,0(r3)    add r1,r1,1    add r3,r3,4    brne r1,r2,Loop
  5: brne r1,r2,Loop
  6: add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
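The 6-cycle parallel trace can be reproduced with a small scheduler over the 16-instruction trace. The sketch below makes several assumptions that are not spelled out on the slide: only register RAW dependences are tracked, every instruction has a latency of one cycle, stores never conflict, and the last brne (the loop exit) is treated as mispredicted, so the instruction after it starts in the next cycle.

```python
def schedule(trace, mispredicted=()):
    """Assign each instruction the earliest cycle allowed by RAW dependences.

    trace: list of (text, destination register or None, source registers).
    mispredicted: indices of branches after which issue restarts next cycle.
    """
    ready = {}    # register -> cycle in which its latest value is produced
    barrier = 0   # later instructions must issue after this cycle
    cycles = []
    for index, (_text, dst, srcs) in enumerate(trace):
        cycle = max([barrier] + [ready.get(reg, 0) for reg in srcs]) + 1
        cycles.append(cycle)
        if dst is not None:
            ready[dst] = cycle
        if index in mispredicted:
            barrier = cycle
    return cycles


body = [
    ("st r1,0(r3)", None, ["r1", "r3"]),
    ("add r1,r1,1", "r1", ["r1"]),
    ("add r3,r3,4", "r3", ["r3"]),
    ("brne r1,r2,Loop", None, ["r1", "r2"]),
]
trace = (
    [("set r1,0", "r1", []), ("set r2,3", "r2", []), ("set r3,&A", "r3", [])]
    + body * 3
    + [("add r1,r5,3", "r1", ["r5"])]
)

# Assumption: the third brne (index 14) is the mispredicted loop exit.
cycles = schedule(trace, mispredicted={14})
l_serial, l_parallel = len(trace), max(cycles)
print(f"Lserial = {l_serial}, Lparallel = {l_parallel}, "
      f"speedup = {l_serial / l_parallel:.1f}")
```

Without the misprediction barrier, the final add would have no producer to wait for and could issue in cycle 1, which shows how strongly branch behaviour limits the measured ILP.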

Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal
Also:
• unlimited number of instructions issued/cycle (unlimited resources), and
• unlimited instruction window
• perfect caches
• 1 cycle latency for all instructions (FP *,/)
Programs were compiled using the MIPS compiler with maximum optimization level

Upper Limit to ILP: Ideal Processor
[Figure: IPC chart.]
Integer: 18 - 60 IPC
FP: 75 - 150 IPC

Window Size and Branch Impact
• Change from an infinite window to a window of 2000 instructions, issuing at most 64 instructions per cycle
[Figure: IPC per branch prediction scheme: Perfect, Tournament, BHT (512), Profile, No prediction.]
FP: 15 - 45 IPC
Integer: 6 - 12 IPC

Limiting nr. of Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)
[Figure: IPC per number of renaming registers: Infinite, 256, 128, 64, 32.]
FP: 11 - 45 IPC
Integer: 5 - 15 IPC

Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
[Figure: IPC per alias analysis model: Perfect, Global/stack perfect, Inspection, None.]
FP: 4 - 45 IPC (Fortran, no heap)
Integer: 4 - 9 IPC

Reducing Window Size
• Assumptions: perfect disambiguation, 1K selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window
[Figure: IPC per window size: Infinite, 256, 128, 64, 32, 16, 8, 4.]
FP: 8 - 45 IPC
Integer: 6 - 12 IPC

How to Exceed ILP Limits of This Study?
• WAR and WAW hazards through memory: register renaming eliminated WAW and WAR hazards on registers, but not those through memory
• Unnecessary dependences
  • the compiler did not unroll loops, so the dependence on the iteration variable remains
• Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
• Address value prediction and speculation: predict addresses and speculate by reordering loads and stores. Could provide better aliasing analysis

Conclusions
• Amount of parallelism is limited
  • higher in multi-media and signal processing applications
  • higher in kernels
• Trace analysis detects all types of parallelism
  • task, data and operation types
• Detected parallelism depends on
  • quality of compiler
  • hardware
  • source-code transformations

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  • C6
  • TM
  • IA-64: Itanium, ....
  • TTA
• Clustering
• Code generation
• Hands-on

VLIW: general concept
A VLIW architecture with 7 FUs
[Figure: instruction memory feeding an instruction register; function units: 3 Int FU, 2 LD/ST, 2 FP FU; an Int register file and a floating point register file; data memory.]

VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC-like operation support
  • Short cycle times
  • Easier to compile for
• Flexible: can implement any FU mixture
• Extensible / Scalable
However:
• tight inter-FU connectivity required
• not binary compatible !!
  • (new long instruction format)
• low code density

VelociTI C6x datapath
[Figure: datapath diagram.]

VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (of 32 bits) per instruction (256 bits)
• Two clusters
  • 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
  • 2 x 16 registers
  • One bus available to write in the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

VLIW example: Philips TriMedia TM1000
[Figure: PC and instruction cache (32 kB) feeding an instruction register (5 issue slots); five exec units connected to the register file and the data cache (16 kB).]
• Register file: 128 regs, 32 bit, 15 ports
• Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

Intel EPIC Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC)
• IA-64 architecture -> Itanium, first realization 2001
Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers
See http://en.wikipedia.org/wiki/Itanium

EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
  • 3 * 41-bit instruction
  • 5 template bits, indicate type and stop location
• Each 41-bit instruction
  • starts with 4-bit opcode, and
  • ends with 6-bit guard (boolean) register-id
• Supports speculative loads
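The bundle arithmetic (3 * 41 + 5 = 128 bits) can be illustrated by packing and unpacking a bundle as a Python integer. This is a sketch under an assumed bit layout, with the template in the lowest 5 bits followed by the three instruction slots; the function names and the example values are hypothetical, and the internal encoding of each 41-bit instruction is not modelled.

```python
SLOT_BITS = 41       # one instruction
TEMPLATE_BITS = 5    # template field
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 128


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a bundle back into its template and three instructions."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & mask
        for index in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


# Arbitrary example values (assumption, not real IA-64 encodings).
example = pack_bundle(0b10000, [0x123, (1 << 41) - 1, 0x1ABCDEF])
print(BUNDLE_BITS, unpack_bundle(example))
```

Because the template is shared by the three instructions, unit type and stop information costs 5 bits per bundle rather than being repeated in every instruction.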

Itanium organization
[Figure: block diagram.]

Itanium 2: McKinley
[Figure: block diagram.]

EPIC Architecture: IA-64
• EPIC allows for more binary compatibility than a plain VLIW:
  • Function unit assignment performed at run-time
  • Lock when FU results not available
• See other website (course 5MD00) for more info on IA-64:
  • www.ics.ele.tue.nl/~heco/courses/ACA
  • (look at related material)

What did we talk about?
ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
VLIW = Very Long Instruction Word architecture
Example instruction format (5-issue):
| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

VLIW evaluation
[Figure: general ILP organization (instruction memory, instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file, data memory) with the annotations "Control problem", "O(N^2)", and "O(N)-O(N^2)", with N function units.]

VLIW evaluation
Strong points of VLIW:
• Scalable (add more FUs)
• Flexible (an FU can be almost anything; e.g. multimedia support)
Weak points:
• With N FUs:
  • Bypassing complexity: O(N^2)
  • Register file complexity: O(N)
  • Register file size: O(N^2)
• Register file design restricts FU flexibility
Solution: .................................................. ?

Solution
TTA: Transport Triggered Architecture
Mirroring the Programming Paradigm
[Figure: operation graphs built from +, -, >, * and st nodes.]
