Chapter Six: Pipelining Hazards
Mario Côrtes - MO401 - IC/Unicamp - 2004s2    1998 Morgan Kaufmann Publishers

Data Dependencies
• Problem with starting the next instruction before the first is finished
  – dependencies that “go backward in time” are data hazards
  – which instructions will receive the wrong value of $2 (old 10, new -20)?

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

[Figure: pipeline diagram (IM, Reg, DM, Reg stages) of the five instructions in program execution order over clock cycles CC 1 to CC 9. Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9.]

Error due to the data dependency
• and and or read the wrong result (old value 10)
• sw is clearly to the right (later in time) and reads the correct result (-20)
• the hazard on add can be avoided if the register file write (in CC 5) is performed at mid-cycle (on the falling clock edge), so that a read later in the same cycle sees the new value
[Figure: register write (Reg WR) and register read (Reg RD) timing across CC 4, CC 5 and CC 6.]

Software Solution
• Have the compiler guarantee no hazards
• Where do we insert the “nops” in the sequence above? How many?
• Problem: this really slows us down!
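The “where and how many nops” question can be explored with a small Python sketch. It assumes no forwarding and a register file that writes at mid-cycle, as suggested for the add case, so a consumer must sit at least 3 instruction slots after its producer. The tuple encoding `(op, dest, sources)` and the name `insert_nops` are illustrative choices, not part of the course material.

```python
NOP = ("nop", None, ())

def insert_nops(program, min_distance=3):
    """Insert nops so every reader is at least min_distance slots after the writer.

    min_distance=3 is an assumption: 5-stage pipeline, no forwarding,
    register file written in the first half of WB and read in the second half of ID.
    """
    out = []
    last_write = {}  # register name -> position in `out` of its latest writer
    for op, dest, sources in program:
        needed = 0
        for reg in sources:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                needed = max(needed, min_distance - gap)
        out.extend([NOP] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((op, dest, sources))
    return out

# The sequence from the slides; sw writes no register, so its dest is None.
PROGRAM = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or", "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),
]

scheduled = insert_nops(PROGRAM)
print([op for op, _, _ in scheduled])
# -> ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```

Under these assumptions two nops after sub are enough: once and has been pushed back, or, add and sw are already far enough from sub.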

Forwarding
• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding

    Clock cycle        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
    Value of $2        10    10    10    10    10/-20  -20   -20   -20   -20
    Value of EX/MEM    X     X     X     -20   X       X     X     X     X
    Value of MEM/WB    X     X     X     X     -20     X     X     X     X

[Figure: pipeline diagram of sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over CC 1 to CC 9, with forwarding paths. Annotation on the figure: “what if this $2 was $13?”]

Forwarding
[Figure: pipelined datapath with a forwarding unit. Signals shown: IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd, MEM/WB.RegisterRd.]

Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
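The forwarding choice and the load case that forwarding cannot cover can be sketched as two small Python predicates. This is a simplified model, not the exact logic of the datapath figure: the `RegWrite` inputs and the check that register 0 is never forwarded follow the usual textbook formulation and are assumptions here, and the function names are hypothetical.

```python
def forward_source(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Pick where an ALU operand comes from: 'EX/MEM', 'MEM/WB' or 'REG'."""
    # The most recent result (in EX/MEM) takes priority over the older one.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REG"  # no hazard: use the value read from the register file

def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register a load in EX will write."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# sub $2, $1, $3 followed by and $12, $2, $5: operand $2 comes from EX/MEM.
print(forward_source(2, True, 2, False, 0))   # -> EX/MEM
# lw $2, 20($1) followed by and $4, $2, $5: the data is not available yet.
print(load_use_stall(True, 2, 2, 5))          # -> True
```

The second predicate is the reason a stall is needed: the loaded value only exists after the MEM stage, one cycle too late for the instruction immediately behind the load.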

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram of the five instructions above, in program execution order, over CC 1 to CC 9.]
• Thus, we need a hazard detection unit to “stall” the instruction that follows the load

Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram of the same sequence over CC 1 to CC 10, with a bubble inserted.]

Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. Signals shown: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, PCWrite, IF/IDWrite, EX/MEM.RegisterRd, MEM/WB.RegisterRd.]

Example, p. 493
[Figure: datapath snapshots for clock cycles 2 and 3. Stage contents:]

    Clock  IF              ID              EX              MEM        WB
    2      and $4, $2, $5  lw $2, 20($1)   before<1>       before<2>  before<3>
    3      or $4, $4, $2   and $4, $2, $5  lw $2, 20($1)   before<1>  before<2>

Example, p. 493
[Figure: datapath snapshots for clock cycles 4 and 5, after the hazard detection unit inserts a bubble. Stage contents:]

    Clock  IF              ID              EX              MEM         WB
    4      or $4, $4, $2   and $4, $2, $5  bubble          lw $2, ...  before<1>
    5      add $9, $4, $2  or $4, $4, $2   and $4, $2, $5  bubble      lw $2, ...

Branch Hazards
• When we decide to branch, other instructions are in the pipeline!

    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)

[Figure: pipeline diagram of the five instructions above, in program execution order, over CC 1 to CC 9.]
• We are predicting “branch not taken” (stalling is too slow)
  – need to add hardware for flushing instructions if we are wrong

Reducing the “branch taken” penalty
• in the previous scheme
  – the decision is only made in the MEM stage
  – if the branch is taken, the IF, ID and EX stages must be flushed
  – 3 clocks lost
• to reduce the penalty:
  – move the decision from the MEM stage up to the ID stage
  – saves two clocks
  – flush only the instruction being fetched from memory
• hardware changes:
  – branch target address calculation (PC + offset<<2)
  – register comparison
    • 32 XORs with an OR at the output
    • faster than using the ALU
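The two pieces of hardware moved into the ID stage can be modelled in a few lines of Python. This is a sketch under one stated assumption: `pc` is the address of the branch itself, so the base for the offset is the already incremented PC (`pc + 4`). The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def branch_target(pc, offset):
    """Target address of a branch at address pc with a signed word offset."""
    # Assumption: the offset is relative to the incremented PC (pc + 4).
    return (pc + 4 + (offset << 2)) & MASK32

def regs_equal(a, b):
    """Equality test built the way the slide describes: bitwise XOR, then OR-reduce."""
    diff = (a ^ b) & MASK32   # 32 XOR gates, one per bit
    return diff == 0          # OR of all 32 bits is 0 only when the registers match

# The beq at address 40 with offset 7 reaches the lw at address 72.
print(branch_target(40, 7))   # -> 72
print(regs_equal(10, 10))     # -> True
print(regs_equal(10, -20))    # -> False
```

The comparison needs no carry chain, which is why it can be faster than routing the operands through the ALU.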

Flushing Instructions
[Figure: pipelined datapath for flushing. Elements shown: the IF.Flush signal, the hazard detection unit, the sign-extend and shift-left-2 units, the register comparator (=), and the forwarding unit.]

Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

• Add a “branch delay slot”
  – the next instruction after a branch is always executed
  – rely on compiler to “fill” the slot with something useful
• Branch prediction:
  – try to guess correctly whether the branch will be taken or not
  – in loops, the branch is taken most of the time
  – branch history table
• Superscalar: start more than one instruction in the same cycle

Dynamic Scheduling
• The hardware performs the “scheduling”
  – hardware tries to find instructions to execute
  – out of order execution is possible
  – speculative execution and dynamic branch prediction
• All modern processors are very complicated
  – DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
  – PowerPC and Pentium: branch history table
  – Compiler technology important
• This class has given you the background you need to learn more
• Video: An Overview of Intel’s Pentium Processor (available from University Video Communications)

Performance comparison (p. 504)
• for the single-cycle, multicycle and pipelined implementations
• gcc: lw (23%), sw (13%), beq (19%), j (2%), rest (43%)
• pipeline, assumptions:
  – 2 ns (memory and ALU) and 1 ns for the register file (RD or WR)
  – 1/2 of the lw are followed by an instruction that uses the result
  – 1/4 of the beq are mispredicted (1 clock lost)
  – jumps lose one cycle
• pipeline, clocks per instruction:
  – lw: 1 clock without a hazard and 2 clocks with a hazard; average = 1.5
  – beq: 1 clock if predicted correctly (3/4) and 2 clocks if not (1/4); average = 1.25
  – jump: 2 clocks
  – other instructions: 1 clock
• CPI = 1.5*0.23 + 1*0.13 + 1*0.43 + 1.25*0.19 + 2*0.02 = 1.18
• clock = 2 ns; average execution time per instruction = 2*1.18 = 2.36 ns
• multicycle: 4.02*2 = 8.04 ns; single-cycle = 8 ns
• the pipeline is about 3.4 times faster than either the single-cycle or the multicycle implementation

Complete circuit
[Figure: complete pipelined datapath and control. Elements shown: Branch, IF.Flush, ID.Flush and EX.Flush signals; hazard detection unit; forwarding unit; Cause and Except PC registers; the constant 40000040 as a PC mux input; control signals ALUSrc, ALUOp, RegDst, RegWrite, MemRead, MemWrite, MemtoReg; instruction fields [25–21], [20–16], [15–11].]
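The arithmetic of the performance comparison (p. 504) can be reproduced with a short Python sketch. The instruction mix, the per-instruction clock counts, the 2 ns clock, the multicycle CPI of 4.02 and the 8 ns single-cycle time are taken from the slides; the dictionary layout and the function name are illustrative.

```python
# gcc instruction mix from the slides
MIX = {"lw": 0.23, "sw": 0.13, "beq": 0.19, "j": 0.02, "other": 0.43}

# Average clocks per instruction in the pipeline:
# lw: half of the loads stall one clock; beq: a quarter are mispredicted.
PIPELINE_CLOCKS = {
    "lw": 0.5 * 1 + 0.5 * 2,      # 1.5
    "sw": 1.0,
    "beq": 0.75 * 1 + 0.25 * 2,   # 1.25
    "j": 2.0,
    "other": 1.0,
}

def weighted_cpi(mix, clocks):
    """Average CPI: sum of (instruction frequency * clocks for that instruction)."""
    return sum(mix[name] * clocks[name] for name in mix)

CLOCK_NS = 2.0
pipeline_cpi = weighted_cpi(MIX, PIPELINE_CLOCKS)
pipeline_ns = pipeline_cpi * CLOCK_NS
multicycle_ns = 4.02 * CLOCK_NS   # multicycle CPI as given on the slide
single_cycle_ns = 8.0

print(f"pipeline CPI       = {pipeline_cpi:.4f}")
print(f"pipeline time      = {pipeline_ns:.3f} ns")
print(f"multicycle time    = {multicycle_ns:.2f} ns")
print(f"speedup vs single  = {single_cycle_ns / pipeline_ns:.2f}")
print(f"speedup vs multi   = {multicycle_ns / pipeline_ns:.2f}")
```

The unrounded CPI is 1.1825, which the slide rounds to 1.18 before multiplying by the clock period; both speedup ratios come out close to 3.4.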