
Pipelining Datapath

Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU).



Presentation Transcript


  1. Pipelining Datapath Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley) and Hank Walker (TAMU)

  2. Pipelining is Natural! • Laundry Example • Ann, Brian, Cathy, Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes

  3. Sequential Laundry • Sequential laundry takes 6 hours for 4 loads • [Figure: timeline from 6 PM to midnight; loads A to D in task order, each running its 30-, 40-, and 20-minute stages back to back]

  4. Pipelined Laundry: Start work ASAP • Pipelined laundry takes 3.5 hours for 4 loads • [Figure: timeline from 6 PM to midnight; loads A to D overlap in task order, with intervals of 30, 40, 40, 40, 40, and 20 minutes along the time axis]
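The two totals can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and it assumes a load may wait between machines for as long as needed (for example, wet clothes sit in a basket until the dryer is free).

```python
def sequential_minutes(stages, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each load starts a stage as soon as both the load and the machine are free."""
    finish = [0] * len(stages)  # finish[s]: when machine s finished its latest load
    for _ in range(loads):
        ready = 0  # when the current load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


STAGES = [30, 40, 20]  # washer, dryer, "folder" (minutes)
print(sequential_minutes(STAGES, 4) / 60)  # 6.0 hours
print(pipelined_minutes(STAGES, 4) / 60)   # 3.5 hours
```

The latency of a single load is 90 minutes in both cases; only the total time for the four loads changes.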

  5. Pipelining Lessons • Latency vs. Throughput • Question • What is the latency in both cases? • What is the throughput in both cases? • Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload

  6. Pipelining Lessons [contd…] • Question • What is the fastest operation in the example? • What is the slowest operation in the example? • Pipeline rate is limited by the slowest pipeline stage
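To see the slowest stage set the pace, one can list when each load finishes. The sketch below is illustrative only (the helper name is an assumption, and loads are again allowed to wait between machines): once the pipeline is full, finished loads come out one dryer time apart.

```python
def completion_times(stages, loads):
    """Return the minute at which each load leaves the last stage."""
    finish = [0] * len(stages)  # when each machine finished its latest load
    done = []
    for _ in range(loads):
        ready = 0
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
        done.append(ready)
    return done


times = completion_times([30, 40, 20], 4)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(times)  # [90, 130, 170, 210]
print(gaps)   # [40, 40, 40], the dryer time
```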

  7. Pipelining Lessons [contd…] • Multiple tasks operate simultaneously using different resources

  8. Pipelining Lessons [contd…] • Question • Would the speedup increase if we had more steps? • Potential speedup = Number of pipe stages

  9. Pipelining Lessons [contd…] • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes • Question • Would it matter if the “Folder” also took 40 minutes? • Unbalanced lengths of pipe stages reduce speedup
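The cost of imbalance can be sketched by comparing two pipelines that do the same 90 minutes of work per load. This is a hedged illustration, not from the slides: it uses the closed form "sum of stages + (n - 1) × slowest stage" for the pipelined time, which holds when all loads are identical and may wait between stages, and the balanced 30/30/30 split is a hypothetical comparison case.

```python
def speedup(stages, loads):
    """Sequential time divided by pipelined time for identical loads."""
    sequential = loads * sum(stages)
    pipelined = sum(stages) + (loads - 1) * max(stages)
    return sequential / pipelined


print(round(speedup([30, 40, 20], 4), 3))  # unbalanced stages from the slides
print(round(speedup([30, 30, 30], 4), 3))  # hypothetical balanced stages
```

With the same total work, the balanced pipeline reaches a speedup of 2.0 for four loads, while the 30/40/20 pipeline reaches about 1.71.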

  10. Pipelining Lessons [contd…] • Time to “fill” the pipeline and time to “drain” it reduce speedup
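Fill and drain time matter less as the number of tasks grows. A minimal sketch, assuming k perfectly balanced stages of one time unit each (the function name is illustrative): n tasks take n × k units sequentially and k + n - 1 units pipelined, so the speedup approaches k only for large n.

```python
def balanced_speedup(stages, tasks):
    """Speedup of a perfectly balanced pipeline, including fill and drain time."""
    sequential = tasks * stages
    pipelined = stages + tasks - 1  # first result after `stages` units, then one per unit
    return sequential / pipelined


for n in (1, 4, 100, 10_000):
    print(n, round(balanced_speedup(5, n), 3))
```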

  11. Five Stages of an Instruction (Load) • Ifetch: Instruction Fetch • Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec: Calculate the memory address • Mem: Read the data from the Data Memory • Wr: Write the data back to the register file • [Figure: a Load occupying Ifetch, Reg/Dec, Exec, Mem, Wr in Cycles 1 to 5]

  12. Conventional Pipelined Execution Representation • [Figure: six instructions in program-flow order along the time axis, each passing through IFetch, Dcd, Exec, Mem, WB and overlapping with its neighbors]
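The conventional representation can be regenerated as text. The sketch below is only an illustration: it assumes an ideal pipeline in which each instruction enters one cycle after the previous one and never stalls, and the instruction labels are placeholders.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(instructions):
    """Return (name, cells) rows; cells[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters the pipeline in cycle i + 1
        rows.append((name, cells))
    return rows


for name, cells in pipeline_diagram(["instr 1", "instr 2", "instr 3"]):
    print(f"{name:<8}" + "".join(f"{cell:<7}" for cell in cells))
```

Reading down any column of the printed table shows which stage each instruction occupies in that cycle.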

  13. Example

  14. Example [contd…] • Time (pipelined) = Time (non-pipelined) / Number of pipe stages • Assumptions • Stages are perfectly balanced • Ideal conditions • Ideally, time per instruction = 8 ns / 5 = 1.6 ns, i.e. a speedup of 5 • Most cases are not ideal!

  15. Example [contd…] • Speedup in this case = 24/14 ≈ 1.7 • Let’s add 1000 more instructions • Time (non-pipelined) = 1000 × 8 + 24 ns = 8024 ns • Time (pipelined) = 1000 × 2 + 14 ns = 2014 ns • Speedup = 8024 / 2014 ≈ 3.98 ≈ 4 = 8/2 • Instruction throughput, rather than the latency of an individual instruction, is the important metric, because real programs execute billions of instructions
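The arithmetic on this slide can be reproduced with a short calculation. This is a sketch under the slide's figures (8 ns per instruction without pipelining, a 2 ns pipelined clock, 5 stages); the function name and parameter names are illustrative.

```python
def execution_times_ns(instructions, unpipelined_ns=8, cycle_ns=2, stages=5):
    """Return (non-pipelined, pipelined) execution time in nanoseconds."""
    nonpipelined = instructions * unpipelined_ns
    # The first instruction needs `stages` cycles; each later one adds one cycle.
    pipelined = (stages + instructions - 1) * cycle_ns
    return nonpipelined, pipelined


for n in (3, 1003):
    slow, fast = execution_times_ns(n)
    print(n, slow, fast, round(slow / fast, 2))
```

For 3 instructions this gives 24 ns and 14 ns; for 1003 instructions it gives 8024 ns and 2014 ns, a speedup close to 8/2 = 4.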

  16. Pipeline Hazards • Structural Hazard • [Figure: six overlapping instructions in program-flow order, each passing through IFetch, Dcd, Exec, Mem, WB]

  17. Pipeline Hazard [contd…] • Control Hazard • Example • add $4, $5, $6 • beq $1, $2, 40 • lw $3, 300($0)

  18. Pipeline Hazard [contd…] • Data Hazards • Example • add $s0, $t0, $t1 • sub $t2, $s0, $t3

  19. Summary Pipelining Lessons • Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload • Pipeline rate is limited by the slowest pipeline stage • Multiple tasks operate simultaneously using different resources • Potential speedup = Number of pipe stages • Unbalanced lengths of pipe stages reduce speedup • Time to “fill” the pipeline and time to “drain” it reduce speedup • Stall for dependences • [Figure: pipelined laundry timeline for loads A to D, starting at 6 PM]

  20. Summary of Pipeline Hazards • Structural Hazards • Hardware design • Control Hazard • Decision based on results • Data Hazard • Data Dependency

  21. Control Signals for existing Datapath • The right-to-left control can lead to hazards

  22. Place registers between each step

  23. Example • 10: lw r1, r2(35) • 14: addI r2, r2, 3 • 20: sub r3, r4, r5 • 24: beq r6, r7, 100 • 30: ori r8, r9, 17 • 34: add r10, r11, r12 • 100: and r13, r14, 15

  24. Start: Fetch 10 • [Datapath figure: PC = 10; the instruction at address 10 is being fetched from Inst. Mem; the Decode, Exec, Mem Access, and write-back stages are marked n (no instruction yet)]

  25. Fetch 14, Decode 10 • [Datapath figure: PC = 14; lw r1, r2(35) is in Decode, reading register 2; the later stages are still marked n]

  26. Fetch 20, Decode 14, Exec 10 • [Datapath figure: PC = 20; addI r2, r2, 3 is in Decode; lw r1 is in Exec with r2 and the immediate 35 as operands; the last two stages are still marked n]

  27. Fetch 24, Decode 20, Exec 14, Mem 10 • [Datapath figure: PC = 24; sub r3, r4, r5 is in Decode, reading registers 4 and 5; addI r2, r2, 3 is in Exec; lw r1 is in Mem with address r2+35; the write-back stage is still marked n]

  28. Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 • [Datapath figure: PC = 30; beq r6, r7, 100 is in Decode, reading registers 6 and 7; sub r3 is in Exec with operands r4 and r5; addI r2 is in Mem holding r2+3; lw r1 is in WB holding M[r2+35]]

  29. Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14 • [Datapath figure: PC = 100; ori r8, r9, 17 is in Decode; beq is in Exec comparing r6 and r7, with target 100; sub r3 is in Mem holding r4-r5; addI r2 is in WB holding r2+3; r1 = M[r2+35] has been written to the register file]

  30. Pipelining Load Instruction • [Figure: the 1st, 2nd, and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, overlapping across Cycles 1 to 7 of the clock] • The five independent functional units in the pipeline datapath are: • Instruction Memory for the Ifetch stage • Register File’s Read ports (bus A and bus B) for the Reg/Dec stage • ALU for the Exec stage • Data Memory for the Mem stage • Register File’s Write port (bus W) for the Wr stage

  31. Pipelining the R Instruction • Ifetch: Instruction Fetch • Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec: • ALU operates on the two register operands • Update PC • Wr: Write the ALU output back to the register file • [Figure: an R-type instruction occupying Ifetch, Reg/Dec, Exec, Wr in Cycles 1 to 4]

  32. Pipelining Both L and R type • [Figure: the sequence R-type, R-type, Load, R-type, R-type across Cycles 1 to 9; each R-type uses Ifetch, Reg/Dec, Exec, Wr and the Load uses Ifetch, Reg/Dec, Exec, Mem, Wr] • Oops! We have a problem! • We have a pipeline conflict or structural hazard: • Two instructions try to write to the register file at the same time! • Only one write port

  33. Important Observations • Each functional unit can only be used once per instruction • Each functional unit must be used at the same stage for all instructions: • Load uses Register File’s Write Port during its 5th stage • R-type uses Register File’s Write Port during its 4th stage • [Figure: Load stages 1 to 5: Ifetch, Reg/Dec, Exec, Mem, Wr; R-type stages 1 to 4: Ifetch, Reg/Dec, Exec, Wr]

  34. Solution • Delay R-type’s register write by one cycle: • Now R-type instructions also use Reg File’s write port at Stage 5 • Mem stage is a NOOP stage: nothing is being done. • [Figure: R-type now has stages 1 to 5 (Ifetch, Reg/Dec, Exec, Mem, Wr); the sequence R-type, R-type, Load, R-type, R-type runs across Cycles 1 to 9]
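The write-port conflict and its fix can be checked mechanically. This is a minimal sketch with illustrative names: it assumes one instruction is fetched per cycle starting at Cycle 1, so instruction number i (counting from 0) reaches its stage s in cycle i + s.

```python
WRITE_STAGE_BEFORE = {"load": 5, "rtype": 4}  # R-type writes in its 4th stage
WRITE_STAGE_AFTER = {"load": 5, "rtype": 5}   # R-type write delayed by one cycle


def write_port_conflicts(kinds, write_stage):
    """Map each cycle to the instructions writing in it, keeping only clashes."""
    users = {}
    for i, kind in enumerate(kinds):
        cycle = i + write_stage[kind]
        users.setdefault(cycle, []).append(i)
    return {cycle: who for cycle, who in users.items() if len(who) > 1}


SEQUENCE = ["rtype", "rtype", "load", "rtype", "rtype"]
print(write_port_conflicts(SEQUENCE, WRITE_STAGE_BEFORE))  # {7: [2, 3]}
print(write_port_conflicts(SEQUENCE, WRITE_STAGE_AFTER))   # {}
```

Before the fix, the Load and the R-type behind it both need the single write port in Cycle 7; once every instruction writes in its 5th stage, no two writes share a cycle.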

  35. Datapath (Without Pipeline) • IR <- Mem[PC]; PC <- PC+4; • A <- R[rs]; B <- R[rt] • S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond PC <- PC+SX; • M <- Mem[S]; Mem[S] <- B • R[rd] <- S; R[rt] <- S; R[rd] <- M; • [Datapath figure: Next PC, PC, Inst. Mem, IR, Reg. File, registers A and B, Exec with Equal output, register S, Mem Access, Data Mem, register M, Reg. File]

  36. Datapath (With Pipeline) • IR <- Mem[PC]; PC <- PC+4; • A <- R[rs]; B <- R[rt] • S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond PC <- PC+SX; • M <- S; M <- Mem[S]; Mem[S] <- B; M <- S • R[rd] <- M; R[rt] <- M; R[rd] <- M; • [Datapath figure: Next PC, PC, Inst. Mem, IR, Reg. File, registers A and B, Exec with Equal output, register S, Mem Access, Data Mem, register M, Reg. File]

  37. Structural Hazard and Solution • [Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order over time (clock cycles), each using Mem, Reg, ALU, Mem, Reg in successive cycles]

  38. Control Hazard - #1 Stall • Stall: wait until decision is clear • Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow • [Figure: Add, Beq, Load in instruction order over time (clock cycles); the gap after Beq is labeled “Lost potential”]

  39. Control Hazard - #2 Predict • Predict: guess one direction, then back up if wrong • Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ≈ 50% of time) • More dynamic scheme: history of 1 branch • [Figure: Add, Beq, Load in instruction order over time (clock cycles)]

  40. Control Hazard - #3 Delayed Branch • Delayed Branch: redefine branch behavior (takes place after next instruction) • Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time) • [Figure: Add, Beq, Misc, Load in instruction order over time (clock cycles)]
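The three schemes can be compared by their expected lost cycles per branch. This is a rough sketch using the figures quoted on the slides; the function name is illustrative, and the cost of one wasted cycle for an unfilled delay slot is an assumption, since the slides only give the cost when the slot is filled.

```python
def expected_lost_cycles(p_good, cost_good, cost_bad):
    """Average lost cycles per branch for a two-outcome scheme."""
    return p_good * cost_good + (1 - p_good) * cost_bad


stall = expected_lost_cycles(1.0, 2, 2)    # always lose 2 cycles
predict = expected_lost_cycles(0.5, 0, 1)  # right about half of the time
delayed = expected_lost_cycles(0.5, 0, 1)  # ASSUMED: unfilled slot wastes 1 cycle

print(stall, predict, delayed)  # 2.0 0.5 0.5
```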

  41. Data Hazards (RAW) • Dependencies backwards in time are hazards • [Figure: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order, each passing through IF, ID/RF, EX, MEM, WB over time (clock cycles)]

  42. Data Hazards [contd…] • “Forward” result from one stage to another • [Figure: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order, each passing through IF, ID/RF, EX, MEM, WB over time (clock cycles)]

  43. Data Hazards [contd…] • Dependencies backwards in time are hazards • Can’t solve with forwarding: • Must delay/stall instruction dependent on loads • [Figure: lw r1,0(r2) followed by sub r4,r1,r3 through IF, ID/RF, EX, MEM, WB over time (clock cycles), with a Stall before the sub]
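A load-use check can be sketched as follows. The tuple encoding of instructions and the function name are made up for illustration, and the sketch assumes the classic five-stage pipeline with full forwarding, where only an instruction that immediately follows a load and reads its destination needs a one-cycle stall.

```python
def load_use_stalls(program):
    """Count one-cycle stalls needed for load-use pairs, assuming full forwarding.

    Each instruction is (opcode, destination register, source registers).
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        opcode, dest, _ = prev
        _, _, sources = curr
        if opcode == "lw" and dest in sources:
            stalls += 1  # loaded value is not available in time for the next EX
    return stalls


load_then_use = [("lw", "r1", ("r2",)), ("sub", "r4", ("r1", "r3"))]
alu_then_use = [("add", "r1", ("r2", "r3")), ("sub", "r4", ("r1", "r3"))]
print(load_use_stalls(load_then_use))  # 1
print(load_use_stalls(alu_then_use))   # 0, forwarding covers this case
```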

  44. Hazard Detection • [Figure: pipeline timing examples for each hazard type: Structural Hazard, Control Hazard (Jump), RAW (read after write) Data Hazard, WAW (write after write) Data Hazard, WAR (write after read) Data Hazard]

  45. Hazard Detection • Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. • A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j) • A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j) • A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j) • [Figure: window on execution showing instruction movement from New Inst through Inst I and Inst J; only pending instructions can cause hazards]
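These conditions are intersections of the read and write register sets of the two instructions. A minimal sketch, where the dictionary encoding of an instruction and the function name are assumptions made for illustration:

```python
def hazards(i, j):
    """Registers involved in each hazard; i is about to issue, j is still in flight."""
    return {
        "RAW": i["reads"] & j["writes"],   # i reads what j has yet to write
        "WAW": i["writes"] & j["writes"],  # both write the same register
        "WAR": i["writes"] & j["reads"],   # i overwrites what j still has to read
    }


# add $s0, $t0, $t1 followed by sub $t2, $s0, $t3
add = {"reads": {"$t0", "$t1"}, "writes": {"$s0"}}
sub = {"reads": {"$s0", "$t3"}, "writes": {"$t2"}}
print(hazards(sub, add))
```

For this pair only the RAW set is non-empty: it contains `$s0`, the register that `sub` reads before `add` has written it back.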

  46. Computing CPI • Start with base CPI • Add stalls • Suppose: • CPIbase = 1 • Freqbranch = 20%, Freqload = 30% • Suppose branches always cause a 1-cycle stall • Loads cause a 2-cycle stall • Then: CPI = 1 + (1 × 0.20) + (2 × 0.30) = 1.8
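The same calculation, written as a small function. The name and the (frequency, stall cycles) pair layout are illustrative choices rather than anything from the slides:

```python
def effective_cpi(base_cpi, stall_sources):
    """Base CPI plus the average stall cycles contributed per instruction.

    stall_sources is a list of (instruction frequency, stall cycles) pairs.
    """
    return base_cpi + sum(freq * cycles for freq, cycles in stall_sources)


cpi = effective_cpi(1.0, [(0.20, 1), (0.30, 2)])  # branches, loads
print(f"{cpi:.2f}")  # 1.80
```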

  47. Summary • Control signals need to be propagated • Insert registers between every stage to “remember” and “propagate” values • Solutions to control hazards are Stall, Predict and Delayed Branch • The solution to data hazards is “Forwarding”, plus a stall for instructions that depend on a load • Effective CPI = CPIideal + CPIstall
