Chapter Six
Pipelining
Hazards
Mario Côrtes - MO401 - IC/Unicamp- 2004s2
1998 Morgan Kaufmann Publishers
Data Dependencies
• Problem with starting the next instruction before the first is finished
– dependencies that “go backward in time” are data hazards
– which instructions will receive the wrong value of $2 (old 10, new -20)?
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, each instruction drawn as IM, Reg, DM, Reg]
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Error due to data dependency
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• the and and or instructions read the wrong result (old value, 10)
• sw is clearly to the right (later in time) and reads the correct result (-20)
• the hazard on add can be avoided if the register file write (in CC5) is done at mid-cycle (falling edge), before the read
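The timing argument above can be checked with a small sketch. It is an illustration under stated assumptions, not the textbook's method: a five-stage pipeline with no forwarding and no stalls, in which instruction k (counting from 0) reads its registers in clock cycle k + 2 (ID) and writes its result in cycle k + 5 (WB). The function name `stale_readers`, the tuple layout and the `split_cycle_regfile` flag are names chosen for this example.

```python
# Each instruction: (text, destination register or None, list of source registers).
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

def stale_readers(program, split_cycle_regfile=False):
    """Return the instructions that read a register before its producer wrote it.

    Assumed timing: instruction k reads in cycle k + 2 and writes in cycle k + 5.
    With split_cycle_regfile=True, a write and a read in the same cycle are safe
    (write in the first half of the cycle, read in the second half).
    """
    stale = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for k, (text, dest, srcs) in enumerate(program):
        read_cycle = k + 2
        for reg in srcs:
            if reg not in last_writer:
                continue
            write_cycle = last_writer[reg] + 5
            same_cycle_ok = split_cycle_regfile and read_cycle == write_cycle
            if read_cycle <= write_cycle and not same_cycle_ok:
                stale.append(text)
                break
        if dest is not None:
            last_writer[dest] = k
    return stale

print(stale_readers(PROGRAM))
print(stale_readers(PROGRAM, split_cycle_regfile=True))
```

Under these assumptions the first call lists and, or and add, and the second call lists only and and or, which matches the bullets above: sw always reads late enough, and add is rescued by the mid-cycle write.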
[Figure: register file write (Reg WR) and read (Reg RD) timing across CC4, CC5 and CC6]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
• How many?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Problem: this really slows us down!
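One way to answer "where" and "how many" is the sketch below. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer. The function `insert_nops` and its `min_distance` parameter are hypothetical names for this example, not part of any real compiler.

```python
# Each instruction: (text, destination register or None, list of source registers).
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

def insert_nops(program, min_distance=3):
    """Pad with nops so every consumer is at least min_distance slots after its producer.

    min_distance=3 assumes the split-cycle register file described above;
    use 4 if a write and a read of the same register cannot share a cycle.
    """
    out = []
    last_write_pos = {}  # register name -> position in `out` of its latest writer
    for text, dest, srcs in program:
        for reg in srcs:
            if reg in last_write_pos:
                while len(out) - last_write_pos[reg] < min_distance:
                    out.append("nop")
        if dest is not None:
            last_write_pos[dest] = len(out)
        out.append(text)
    return out

for line in insert_nops(PROGRAM):
    print(line)
```

For this sequence the sketch places two nops between sub and and; the later instructions are then already far enough from sub. Every such nop is a wasted issue slot, which is the slowdown the slide points out.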
Forwarding
• Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
Time (in clock cycles):   CC 1   CC 2   CC 3   CC 4   CC 5     CC 6   CC 7   CC 8   CC 9
Value of register $2:     10     10     10     10     10/-20   -20    -20    -20    -20
Value of EX/MEM:          X      X      X      -20    X        X      X      X      X
Value of MEM/WB:          X      X      X      X      -20      X      X      X      X
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: pipeline diagram of the five instructions (IM, Reg, DM, Reg)]
what if this $2 was $13?
Forwarding
[Figure: pipelined datapath with a forwarding unit. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Register numbers IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt and Rd. The forwarding unit receives Rs, Rt, EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the muxes on the ALU inputs.]
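The decision made by the forwarding unit for one ALU operand can be sketched as below. This follows the usual textbook formulation and is an assumption rather than something spelled out on the slide: the names `ex_mem_regwrite` and `mem_wb_regwrite` (the RegWrite control bits held in the pipeline registers) and the operand register taken from ID/EX do not appear in the diagram labels above, and the function name `forward_select` is made up for this example.

```python
def forward_select(id_ex_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose the source of one ALU operand.

    id_ex_reg is the register number (Rs or Rt) of the instruction now in EX.
    Returns "EX/MEM", "MEM/WB" or "REG" (the value read from the register file).
    Register 0 is never forwarded because it always reads as zero in MIPS.
    """
    # The most recent result wins, so EX/MEM is checked first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"
    return "REG"

# CC 4: "and $12, $2, $5" is in EX while "sub $2, $1, $3" is in MEM.
print(forward_select(2, True, 2, False, 0))
# CC 5: "or $13, $6, $2" is in EX, "and" is in MEM (Rd = 12), "sub" is in WB.
print(forward_select(2, True, 12, True, 2))
```

Checking EX/MEM before MEM/WB matters when two older instructions both write the same register: the operand must come from the younger of the two.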
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9]
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
• Thus, we need a hazard detection unit to “stall” the instruction that follows the load
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 10, with a bubble inserted]
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with hazard detection unit and forwarding unit. Hazard detection unit signals: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite, and a mux that selects 0 for the control signals entering ID/EX. Forwarding unit inputs: Rs, Rt, EX/MEM.RegisterRd, MEM/WB.RegisterRd.]
Example (p. 493)
[Figures: pipeline contents in clock cycles 2 to 5]
Clock   IF                ID                EX                MEM          WB
2       and $4, $2, $5    lw $2, 20($1)     before<1>         before<2>    before<3>
3       or $4, $4, $2     and $4, $2, $5    lw $2, 20($1)     before<1>    before<2>
4       or $4, $4, $2     and $4, $2, $5    bubble            lw $2, ...   before<1>
5       add $9, $4, $2    or $4, $4, $2     and $4, $2, $5    bubble       lw $2, ...
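The stall condition of the hazard detection unit can be sketched with the signal names that appear in the diagram (ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite). The function name `hazard_detect`, the dictionary it returns and the `insert_bubble` key are choices made for this example; the condition itself is the usual textbook load-use test and is assumed here rather than quoted from the slides.

```python
def hazard_detect(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Detect a load-use hazard between the instruction in EX and the one in ID.

    Stall when the instruction in EX is a load (MemRead set) and its destination
    (Rt) is a source register of the instruction waiting in ID.
    """
    stall = bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
    return {
        "PCWrite": not stall,      # hold the PC during a stall
        "IF/IDWrite": not stall,   # hold the instruction in IF/ID
        "insert_bubble": stall,    # zero the control signals entering ID/EX
    }

# Clock 3 of the example: lw $2, 20($1) is in EX, and $4, $2, $5 is in ID.
print(hazard_detect(True, 2, 2, 5))
# Clock 4: the bubble is in EX (MemRead = 0), so and $4, $2, $5 may proceed.
print(hazard_detect(False, 0, 2, 5))
```

Holding PC and IF/ID keeps and and or in place for one cycle, while the zeroed control signals travel down the pipeline as the bubble; after that the loaded value can be forwarded from MEM/WB.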
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9]
Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
• We are predicting “branch not taken” (stalling is too slow)
– need to add hardware for flushing instructions if we are wrong
Reducing the “branch taken” penalty
• in the previous scheme
– the decision is only made in the MEM stage
– if the branch is taken, the IF, ID and EX stages must be flushed
– 3 clocks lost
• to reduce the penalty:
– move the decision from the MEM stage up to the ID stage
– saves two clocks
– flush only the instruction being fetched from memory
• hardware changes:
– branch target address calculation (PC + offset<<2)
– register comparison
  • 32 XORs with an OR at the output
  • faster than the ALU
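Both pieces of hardware moved into ID can be sketched in a few lines. This is an illustration, not a description of the actual circuit: the function names are invented for the example, and the target calculation assumes the usual MIPS convention that the PC used in "PC + offset<<2" is the already incremented PC (address of the branch plus 4).

```python
MASK32 = 0xFFFFFFFF  # registers and addresses are 32 bits wide

def registers_equal(a, b):
    """Equality test as 32 XOR gates followed by an OR of all result bits."""
    diff = (a ^ b) & MASK32   # a bit is 1 wherever the two registers differ
    return diff == 0          # the OR of all bits is 0 only if nothing differs

def branch_target(branch_address, offset):
    """Target address: incremented PC plus the sign-extended offset shifted left by 2."""
    return (branch_address + 4 + (offset << 2)) & MASK32

# The branch from the Branch Hazards slide: "40 beq $1, $3, 7".
print(branch_target(40, 7))
print(registers_equal(10, 10), registers_equal(10, -20))
```

The first print gives 72, the address of the lw shown as the branch target on the Branch Hazards slide. The comparison needs no carry chain, which is why it can be faster than a subtraction in the ALU.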
Flushing Instructions
[Figure: pipelined datapath with the branch decision moved to the ID stage: IF.Flush signal from the control, register comparator (=), Sign extend and Shift left 2 feeding the branch target adder, hazard detection unit, forwarding unit]
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Add a “branch delay slot”
– the next instruction after a branch is always executed
– rely on compiler to “fill” the slot with something useful
• Branch prediction:
– try to guess correctly whether the branch will be taken or not
– in loops, the branch is taken most of the time
– branch history table
• Superscalar: start more than one instruction in the same cycle
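For the reordering bullet, a small sketch can count load-use stalls before and after swapping the two stores. It assumes a hazard detection unit that stalls whenever the instruction immediately after a lw uses the loaded register, including as store data; the function name `count_load_use_stalls` and the tuple layout are made up for this example.

```python
def count_load_use_stalls(program):
    """Count adjacent pairs where a lw is followed by a user of the loaded register."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _ = prev
        _, _, srcs = curr
        if op == "lw" and dest in srcs:
            stalls += 1
    return stalls

# Each instruction: (opcode, destination register or None, source registers).
ORIGINAL = [
    ("lw", "$t0", ["$t1"]),         # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),         # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),   # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),   # sw $t0, 4($t1)
]
# Swap the two stores: they write different addresses, so the result is unchanged.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

print(count_load_use_stalls(ORIGINAL), count_load_use_stalls(REORDERED))
```

In the original order sw $t2 directly follows the load of $t2; after the swap sw $t0 fills that slot, and $t0 was loaded one instruction earlier.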
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out of order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
– PowerPC and Pentium: branch history table
– Compiler technology important
• This class has given you the background you need to learn more
• Video: An Overview of Intel’s Pentium Processor (available from University Video Communications)
Performance comparison (p. 504)
• for single-cycle, multicycle and pipelined implementations
• gcc: lw (23%), sw (13%), beq (19%), j (2%), rest (43%)
• pipeline (assumptions):
– 2 ns (memory and ALU) and 1 ns for the register file (RD or WR)
– 1/2 of the lw are followed by instructions that use the result
– 1/4 of the beq are mispredicted (1 clock lost)
– jumps lose one cycle
• pipeline (clocks per instruction):
– lw: 1 clock without hazard and 2 clocks with hazard; average = 1.5
– beq: 1 clock if predicted correctly (3/4) and 2 clocks if not (1/4); average = 1.25
– jump: 2 clocks
– other instructions: 1 clock
• CPI = 1.5*0.23 + 1*0.13 + 1*0.43 + 1.25*0.19 + 2*0.02 = 1.18
• clock = 2 ns; average execution time per instruction = 2*1.18 = 2.36 ns
• multicycle: 4.02*2 = 8.04 ns; single-cycle = 8 ns
• pipeline is 3.4 times faster than single-cycle or multicycle
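The arithmetic above can be reproduced with a short script. The instruction mix, clock counts, 2 ns clock, 8 ns single-cycle time and multicycle CPI of 4.02 are the figures from the slide; the variable names are chosen for this example.

```python
# Instruction mix for gcc and average clocks per instruction class on the pipeline.
MIX = {"lw": 0.23, "sw": 0.13, "beq": 0.19, "j": 0.02, "other": 0.43}
CLOCKS = {
    "lw": 0.5 * 1 + 0.5 * 2,      # half of the loads stall one cycle
    "sw": 1.0,
    "beq": 0.75 * 1 + 0.25 * 2,   # a quarter of the branches lose one cycle
    "j": 2.0,
    "other": 1.0,
}
CLOCK_NS = 2.0

cpi = sum(MIX[c] * CLOCKS[c] for c in MIX)
pipeline_ns = cpi * CLOCK_NS
single_cycle_ns = 8.0
multicycle_ns = 4.02 * CLOCK_NS

print(f"CPI = {cpi:.4f}")
print(f"pipeline = {pipeline_ns:.3f} ns per instruction")
print(f"speedup vs single-cycle = {single_cycle_ns / pipeline_ns:.2f}")
print(f"speedup vs multicycle = {multicycle_ns / pipeline_ns:.2f}")
```

The unrounded CPI is 1.1825, so the pipeline time is 2.365 ns; the slide rounds the CPI to 1.18 first and gets 2.36 ns. Both speedups round to 3.4.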
Complete circuit
[Figure: complete pipelined datapath and control. Pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; hazard detection unit and control with IF.Flush, ID.Flush and EX.Flush; forwarding unit; exception support (Cause and Except PC registers, constant 40000040 as a PC mux input); instruction fields [25–21], [20–16], [15–11]; control signals Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg.]