
Computer Architecture Principles Dr. Mike Frank

CDA 5155, Summer 2003. Module #21: Multiple Issue Pipelining: Superscalar, VLIW, etc. Goal: Enable multiple instructions to be issued in a single clock cycle. (Can get CPI < 1!)



Presentation Transcript

  1. Computer Architecture Principles. Dr. Mike Frank, CDA 5155, Summer 2003. Module #21: Multiple Issue Pipelining: Superscalar, VLIW, etc.

  2. Multiple Issue

  3. Multiple Issue (3.6)
     • Goal: Enable multiple instructions to be issued in a single clock cycle. (Can get CPI < 1!)
     • Two basic “flavors” of multiple-issue:
       • Superscalar:
         • Operates on an ordinary serial instruction stream format.
         • Instructions per clock varies widely.
         • Scheduling can be either dynamic or static.
       • VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing):
         • New format: Parallel instructions grouped into blocks.
         • Instructions per clock fairly well fixed (by block size).
         • Mostly statically scheduled by compiler.

  4. Code Example to be Used
     C code fragment:
         double *p;
         do { *(p--) += c; } while (p);
     DLX code fragment:
         Loop: LD   F0,0(R1)    ; F0 = *p
               ADDD F4,F0,F2    ; F4 = F0 + c
               SD   0(R1),F4    ; *p = F4
               SUBI R1,R1,#8    ; p--
               BNEZ R1,Loop     ; until p=0

  5. Simple Superscalar RISC
     • Typical superscalar: 1-8 insts. issued per cycle
       • Actual number depends on dependences, hazards
     • Our simple example: At most 2 insts./cycle
     • Instructions statically pre-paired to ease decoding:
       • 1st: One load/store/branch/integer-ALU op.
       • 2nd: One floating-point op.

  6. Some Issues with this Approach
     • If FP ops are multiple-cycle, then issuing 1 FP inst./cycle requires multiple or pipelined FP functional units (or both)
       • FP ops may finish out-of-order; the usual issues arise
     • Why parallelize integer ops against FP ops?
       • Each uses different registers & functional units
       • Resource contention is minimized
       • Only contention is for FP registers on loads/stores
     • Must detect & deal with structural & data hazards
     • Note issue with load latency:
       • Result available 1 cycle after EX (in MEM)
       • Not available for next 3 instructions!
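
The "next 3 instructions" figure on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function name and its parameters are hypothetical, and it assumes the load sits in the first slot of its issue packet unless told otherwise.

```python
def load_shadow_slots(issue_width: int, load_delay_cycles: int, slot_index: int = 0) -> int:
    """Count later instruction slots that cannot use a load's result without stalling.

    issue_width: instructions issued per clock cycle.
    load_delay_cycles: whole cycles after the load's issue cycle before the value is usable.
    slot_index: position of the load inside its issue packet (0 = first); an assumption.
    """
    if not 0 <= slot_index < issue_width:
        raise ValueError("slot_index must lie inside the issue packet")
    rest_of_packet = issue_width - 1 - slot_index       # issued alongside the load
    delayed_packets = issue_width * load_delay_cycles   # full packets inside the delay window
    return rest_of_packet + delayed_packets


print(load_shadow_slots(1, 1))  # single-issue pipeline with a 1-cycle load delay
print(load_shadow_slots(2, 1))  # the 2-issue machine described on this slide
```

With one instruction per cycle the shadow is a single slot; doubling the issue width adds the instruction paired with the load plus one more in the following cycle.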

  7. Unrolled Loop, Superscalar Version
     • 5 elements/iteration, 2.4 clock cycles/element

  8. Pipeline details in this Example
     F = inst. Fetch    D = inst. Decode    E = integer execution
     1234 = FP exec. stages    M = mem. access    W = writeback regs.

     Loop: LD   F0,0(R1)       FDEMW
           LD   F6,-8(R1)      FDEMW
           LD   F10,-16(R1)    FDEMW
           ADDD F4,F0,F2       FD1234MW
           LD   F14,-24(R1)    FDEMW
           ADDD F8,F6,F2       FD1234MW
           LD   F18,-32(R1)    FDEMW
           ADDD F12,F10,F2     FD1234MW
           SD   0(R1),F4       FDEMW
           ADDD F16,F14,F2     FD1234MW
           SD   -8(R1),F8      FDEMW
           ADDD F20,F18,F2     FD1234MW
           SD   -16(R1),F12    FDEMW
           SUBI R1,R1,#40      FDEMW
           SD   16(R1),F16     FDEMW
           BNEZ R1,Loop        FDEMW
           SD   8(R1),F20      FDEMW
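
As a rough cross-check of the 2.4 cycles/element figure, the listing can be grouped into issue packets and counted. The grouping below is an assumption read off the listing order (one load/store/branch/integer op plus at most one FP op per cycle, one packet per cycle with no extra stalls); the slide does not state it explicitly, and the names `PACKETS` and `summarize` are made up for this sketch.

```python
# Assumed issue packets: (integer-slot op, FP-slot op or None), one packet per cycle.
PACKETS = [
    ("LD F0,0(R1)", None),
    ("LD F6,-8(R1)", None),
    ("LD F10,-16(R1)", "ADDD F4,F0,F2"),
    ("LD F14,-24(R1)", "ADDD F8,F6,F2"),
    ("LD F18,-32(R1)", "ADDD F12,F10,F2"),
    ("SD 0(R1),F4", "ADDD F16,F14,F2"),
    ("SD -8(R1),F8", "ADDD F20,F18,F2"),
    ("SD -16(R1),F12", None),
    ("SUBI R1,R1,#40", None),
    ("SD 16(R1),F16", None),
    ("BNEZ R1,Loop", None),
    ("SD 8(R1),F20", None),
]
ELEMENTS_PER_ITERATION = 5  # loop unrolled five times
ISSUE_WIDTH = 2             # at most 2 insts./cycle


def summarize(packets, elements, width):
    """Return cycle, instruction, and utilization figures for one loop iteration."""
    cycles = len(packets)
    instructions = sum(1 for pair in packets for op in pair if op is not None)
    return {
        "cycles": cycles,
        "instructions": instructions,
        "cycles_per_element": cycles / elements,
        "cpi": cycles / instructions,
        "slot_utilization": instructions / (cycles * width),
    }


stats = summarize(PACKETS, ELEMENTS_PER_ITERATION, ISSUE_WIDTH)
for name, value in stats.items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the CPI comes out below 1 even though many FP slots stay empty, since only five of the seventeen instructions are FP ops.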

  9. Multiple Issue + Dynamic Sched.
     • Why? Usual advantages of dyn. scheduling…
       • Compiler independent, data-dependent scheduling
     • Multiple-issue Tomasulo:
       • Issue 1 integer + 1 FP instruction to RS each cycle
     • Problem (again) issuing FP loads + ops simult.
       • If instructions dependent, hazard detection is broken.
     • Two solutions to this problem:
       • 1. Enter inst. into tables in only 1/2 a clock
       • 2. Statically schedule loads/stores far enough away from instructions that they have dependences with, then queue them till ready, like in Tomasulo load-store buffers.
     • Another approach: only queue loads/stores
       • Use static scheduling for all other instructions

  10. Timing of Mult.-Issue, Dyn.
      I = Issue    E = Execute    M = Memory    ~ = stall    > = still in Execute    W = Writeback
      Clock cycles 1-12

      LD   F0,0(R1)      IEM
      ADDD F4,F0,F2      I~~E>W
      SD   0(R1),F4      IE~~~M
      SUBI R1,R1,#8      IEW
      BNEZ R1,Loop       IE
      LD   F0,0(R1)      IE~M
      ADDD F4,F0,F2      I~~~E>W
      SD   0(R1),F4      IE~~~~M
      SUBI R1,R1,#8      IEW
      BNEZ R1,Loop       IE

  11. VLIW (Very Long Instruction Word)
      • Also called EPIC (Explicitly Parallel Instruction Computing) by Intel in the IA-64.
      • Basic idea:
        • Have compiler determine multiple independent instructions that can execute simultaneously.
        • Package them into a wide, fixed-size bundle.
        • Slots in bundle correspond to functional units.
      • Advantages:
        • Permits very wide multiple-issue (high parallelism)
        • Avoids runtime complexity of dynamic scheduling

  12. VLIW Example
      • Each VLIW word contains:
        • 2 slots for memory-reference instructions
        • 2 slots for floating-point instructions
        • 1 slot for an integer operation or branch
      • See fig. 4.29 on p. 286 (no electronic version available yet)
        • Shows our old familiar loop
        • Unrolled & packed into VLIW words
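
To make the slot layout concrete, here is a minimal sketch of a VLIW word with the slot counts listed on this slide. Everything else is an assumption for illustration: the `Bundle` class, the greedy in-order packer, and the placeholder op names are hypothetical, and the packer treats all ops as mutually independent, which a real compiler cannot assume.

```python
from dataclasses import dataclass, field

# Slot capacities per VLIW word, as listed on the slide.
SLOT_CAPACITY = {"mem": 2, "fp": 2, "int": 1}


@dataclass
class Bundle:
    """One VLIW word: a list of placed ops for each kind of slot."""
    slots: dict = field(default_factory=lambda: {kind: [] for kind in SLOT_CAPACITY})

    def try_place(self, kind: str, op: str) -> bool:
        """Place op in a free slot of the given kind; return False if none is free."""
        if len(self.slots[kind]) < SLOT_CAPACITY[kind]:
            self.slots[kind].append(op)
            return True
        return False

    def nop_count(self) -> int:
        """Unfilled slots, which an uncompressed encoding pads with no-ops."""
        return sum(SLOT_CAPACITY[kind] - len(ops) for kind, ops in self.slots.items())


def pack_in_order(ops):
    """Greedy in-order packing; assumes every op is independent of the others."""
    bundles = [Bundle()]
    for kind, op in ops:
        if not bundles[-1].try_place(kind, op):
            bundles.append(Bundle())      # current word has no free slot of this kind
            bundles[-1].try_place(kind, op)
    return bundles


# Placeholder ops: three memory references, one FP op, one integer op.
ops = [("mem", "mem0"), ("mem", "mem1"), ("mem", "mem2"), ("fp", "fp0"), ("int", "int0")]
bundles = pack_in_order(ops)
print(len(bundles), "words,", sum(b.nop_count() for b in bundles), "empty slots")
```

The third memory reference forces a second word because only two memory slots exist per word, and the slots left empty in both words are the wasted bits that make VLIW code larger.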

  13. More on VLIW
      • A technique for multiple-issue
        • Statically scheduled by compiler
      • Difference vs. statically scheduled superscalar:
        • Compiler pre-collects instructions into issue packets
        • Avoids or marks dependences within issue packet
        • Avoids need for dynamic dependence detection
      • See example in 3rd ed., fig. 4.5, p. 318:
        • Unrolled loop on a 5-way VLIW
        • 2 memory references, 2 FP ops, and 1 int. op per clock
        • Achieves 9/7 = ~1.3 cycles per array element!
        • 60% efficiency vs. peak instruction issue rate
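
The 9/7 figure can be set beside the 2.4 cycles/element quoted earlier for the 2-issue superscalar loop. The sketch below is plain arithmetic on those two quoted numbers; the helper name is hypothetical, and the 12-cycle count for the superscalar loop is inferred from 2.4 cycles/element times 5 elements.

```python
from fractions import Fraction


def cycles_per_element(cycles: int, elements: int) -> Fraction:
    """Exact cycles spent per array element in one unrolled iteration."""
    return Fraction(cycles, elements)


vliw = cycles_per_element(9, 7)          # 5-way VLIW figure quoted on this slide
superscalar = cycles_per_element(12, 5)  # 2.4 cycles/element, i.e. 12 cycles for 5 elements
speedup = superscalar / vliw             # how much faster the VLIW loop runs per element

print(float(vliw), float(superscalar), float(speedup))
```

On these numbers the VLIW schedule processes an element a little under twice as fast as the 2-issue schedule.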

  14. Difficulties w. early VLIW
      • Increased code size:
        • Aggressively unrolling loops to expand basic blocks
        • Unfilled instruction slots are wasted bits in VLIW
        • Can be dealt w. by alternative encodings or in-memory compression
      • Lockstep operation:
        • All instructions in packet proceed in lockstep
        • Entire pipe must stall if one functional unit does
        • Difficult to statically predict some stalls
          • e.g., cache misses
      • Binary code incompatibilities
        • Code layout depends on microarchitecture version!

  15. Limits to Multiple-Issue
      • Can we continue increasing the issue width (and decreasing CPI) indefinitely?
        • No, not for serially-written programs in general.
      • Some problems with increasing issue width:
        • Inherent limitations of ILP in programs.
        • Difficulty of scaling shared reg. file or memory.
        • Dynamic scheduling complexity in superscalar.
        • Code size, binary incompatibility in VLIW.

  16. An Important Lesson
      • Multiple-issue increases parallelism without requiring programmer effort, but it has limits!
      • Beyond a certain point, increasing parallelism requires programmer participation!
        • Multithreaded shared-memory programming.
        • Better: Distributed multiprocessor algorithms that take communications limits into account.
      • Maximal performance requires a programming model based on the structure of physics!
        • My proposal: a 3-D mesh of reversible (maybe quantum-coherent) processing elements.

  17. HW support for more ILP (4.6)
      • Static techniques described in 4.5 may miss a lot of ILP opportunities that occur dynamically.
      • In this section we introduce some HW techniques:
        • Conditional or predicated instructions
        • Compiler speculation w. HW support:
          • HW/SW cooperation for speculation
          • Speculation with Poison Bits
          • Speculative instructions with renaming
        • Hardware-based speculation

  18. Dyn. Mult.-Iss. Timing Example
      (Example from 3rd ed., figs. 3.25-3.26, pp. 221-223.)
      With only 1 integer adder unit, shared by ALU instructions and EA calculations for load/stores.
      Clock cycles 1-19; the leading number is the loop iteration.

      1  L.D    F0,0(R1)      IEMW
      1  ADD.D  F4,F0,F2      I EEEW
      1  S.D    F4,0(R1)      IE M
      1  DADDIU R1,R1,#-8     I EW
      1  BNE    R1,R2,Loop    I E
      2  L.D    F0,0(R1)      I EMW
      2  ADD.D  F4,F0,F2      I EEEW
      2  S.D    F4,0(R1)      I E M
      2  DADDIU R1,R1,#-8     I EW
      2  BNE    R1,R2,Loop    I E
      3  L.D    F0,0(R1)      I EMW
      3  ADD.D  F4,F0,F2      I EEEW
      3  S.D    F4,0(R1)      I E M
      3  DADDIU R1,R1,#-8     I EW
      3  BNE    R1,R2,Loop    I E

      Legend: RAW hazards; structural hazards; control hazards

  19. Timing Example w. Extra Adder
      (Example from 3rd ed., figs. 3.27-3.28, pp. 223-225.)
      With 2 integer adder units, one for ALU instructions, another for EA calculations for load/stores.
      Clock cycles 1-19; the leading number is the loop iteration.

      1  L.D    F0,0(R1)      IEMW
      1  ADD.D  F4,F0,F2      I EEEW
      1  S.D    F4,0(R1)      IE M
      1  DADDIU R1,R1,#-8     IEW
      1  BNE    R1,R2,Loop    I E
      2  L.D    F0,0(R1)      I EMW
      2  ADD.D  F4,F0,F2      I EEEW
      2  S.D    F4,0(R1)      I E M
      2  DADDIU R1,R1,#-8     IEW
      2  BNE    R1,R2,Loop    I E
      3  L.D    F0,0(R1)      I EMW
      3  ADD.D  F4,F0,F2      I EEEW
      3  S.D    F4,0(R1)      I E M
      3  DADDIU R1,R1,#-8     IEW
      3  BNE    R1,R2,Loop    I E

      Legend: RAW hazards; structural hazards; control hazards
