CSCE 5610 Computer System Architecture: Instruction Level Parallelism
Clock cycle
1 2 3 4 5 6 7
sub $2, $1, $3 IF ID EX MEM WB
A more Detailed Look at the Pipeline
• We have to eliminate the hazards, so the AND and OR instructions in our example will use the correct value for register $2.
• Let’s look at when the data is actually produced and consumed.
— The SUB instruction produces its result in its EX stage, during cycle 3 in the diagram below.
— The AND and OR need the new value of $2 in their EX stages, during clock cycles 4-5 here.
• The actual result $1 - $3 is computed in clock cycle 3, before it’s needed in cycles 4 and 5.
• If we could somehow bypass the writeback and register read stages when needed, then we can eliminate these data hazards.
— Now, we’ll focus on hazards involving arithmetic instructions.
— Next time, we’ll examine the lw instruction.
• Essentially, we need to pass the ALU output from SUB directly to the AND and OR instructions, without going through the register file.
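As a quick check of the cycle numbers above, here is a minimal Python sketch. It assumes an ideal five-stage pipeline (IF, ID, EX, MEM, WB) with one instruction entering per cycle and no stalls. The function name `stage_cycle` is made up for this illustration.

```python
# Stage order of the assumed five-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(position, stage):
    """Clock cycle (1-based) in which the instruction at the given
    program position (0 = first) occupies the given stage, assuming
    one instruction enters per cycle and nothing stalls."""
    return position + STAGES.index(stage) + 1

# SUB is the first instruction, AND the second, OR the third.
for name, position in [("sub", 0), ("and", 1), ("or", 2)]:
    print(name, "is in EX during cycle", stage_cycle(position, "EX"))

# SUB writes the register file in WB, after AND has already read it in ID.
print("sub WB cycle:", stage_cycle(0, "WB"))
print("and ID cycle:", stage_cycle(1, "ID"))
```

Under these assumptions SUB computes its result in cycle 3 but does not write it back until cycle 5, while AND reads its registers in cycle 3. That gap is what passing the ALU output directly to the later instructions closes.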
[Pipeline diagrams: sub $2, $1, $3 followed by and $12, $2, $5, each drawn with the IM, Reg, DM, Reg stages]
if (MEM/WB.RegWrite = 1
    and MEM/WB.RegisterRd = ID/EX.RegisterRs
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs or EX/MEM.RegWrite = 0))
then ForwardA = 1

if (MEM/WB.RegWrite = 1
    and MEM/WB.RegisterRd = ID/EX.RegisterRt
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt or EX/MEM.RegWrite = 0))
then ForwardB = 1
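The two conditions above can be sketched in Python as follows. This is only an illustration of the MEM/WB forwarding case quoted here. The `PipelineRegs` container and the function name are hypothetical, and forwarding from EX/MEM is deliberately left out because its condition is not shown in this excerpt.

```python
from dataclasses import dataclass

@dataclass
class PipelineRegs:
    """Hypothetical snapshot of the pipeline-register fields used above."""
    ex_mem_reg_write: bool
    ex_mem_rd: int
    mem_wb_reg_write: bool
    mem_wb_rd: int
    id_ex_rs: int
    id_ex_rt: int

def mem_hazard_forwarding(r):
    """Return (ForwardA, ForwardB) for the MEM/WB case only:
    1 means 'take the operand from MEM/WB', 0 means 'not this case'."""
    def forward(src):
        mem_wb_match = r.mem_wb_reg_write and r.mem_wb_rd == src
        # A newer result for the same register in EX/MEM takes priority.
        ex_mem_newer = r.ex_mem_reg_write and r.ex_mem_rd == src
        return 1 if (mem_wb_match and not ex_mem_newer) else 0
    return forward(r.id_ex_rs), forward(r.id_ex_rt)

# sub $2,... is in MEM/WB, and $12,... is in EX/MEM, or $13, $12, $2 is in ID/EX.
regs = PipelineRegs(ex_mem_reg_write=True, ex_mem_rd=12,
                    mem_wb_reg_write=True, mem_wb_rd=2,
                    id_ex_rs=12, id_ex_rt=2)
print(mem_hazard_forwarding(regs))  # (0, 1)
```

Here only the second operand ($2) is taken from MEM/WB. Textbook versions of this test usually also require that the destination is not register $0. The condition quoted above does not, so the sketch leaves that check out.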
the pipeline.
• In general, you can always stall to avoid hazards—but dependencies are very
common in real code, and stalling often can reduce performance by a significant
amount.
• Notice that we’re still using forwarding in cycle 5, to get data from the
MEM/WB pipeline register to the ALU.
Stalling Delays the Entire Pipeline
• If we delay the second instruction, we’ll have to delay the third one too.
— Why?
— This is necessary to make forwarding work between AND and OR.
— It also prevents problems such as two instructions trying to write to the same register in the same cycle.
[Pipeline diagrams for lw $2, 20($3); and $12, $2, $5; sub $2, $1, $3; or $13, $12, $2, drawn with the IM, Reg, DM, Reg stages and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers]
• The effect of a load stall is to insert an empty or nop instruction into the pipeline.
Detecting Stalls, cont.
• When should stalls be detected?
[Pipeline diagram: lw $2, 20($3) followed by and $12, $2, $5, with the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers shown and the register read of the AND repeated]

if (ID/EX.MemRead = 1
    and (ID/EX.RegisterRt = IF/ID.RegisterRs or
         ID/EX.RegisterRt = IF/ID.RegisterRt))
then stall

Hazard detection unit
● To stall the pipeline, the instruction in the ID stage must be held back. We can do this by continuing to read the same PC.
● To insert bubbles or nops, we set all nine control signals to “0” so that no registers or memories are written, which creates a “do nothing” instruction.
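A minimal Python sketch of the load-use check and the two stall actions described above. The function names are invented for illustration. The nine control-signal names follow the usual textbook MIPS datapath, and `pc + 4` models normal sequential fetch. Both are assumptions, since the slides give neither.

```python
# Assumed names for the nine control signals (not listed in the slides).
CONTROL_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc", "Branch",
                   "MemRead", "MemWrite", "RegWrite", "MemtoReg")

def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EX is a load whose destination (rt)
    is a source register of the instruction currently in ID."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

def id_stage_outputs(stall, decoded_controls, pc):
    """Return (control signals sent down the pipeline, next PC).
    On a stall, every control signal is zeroed (a bubble) and the PC is
    held so the same instruction is fetched again."""
    if stall:
        return {name: 0 for name in CONTROL_SIGNALS}, pc
    return dict(decoded_controls), pc + 4

# lw $2, 20($3) is in EX (rt = 2); and $12, $2, $5 is in ID (rs = 2, rt = 5).
stall = load_use_stall(id_ex_mem_read=1, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(stall)  # True
```

Zeroing RegWrite and MemWrite is what makes the bubble harmless; the other signals are cleared along with them, as the slide describes.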
● Assuming $s1 holds the address of the array element with the highest address, $s2 is the address of the last element in the array, and $f2 contains the value s, the MIPS code for the above loop would be:
● Without any reordering/scheduling, the loop will execute as follows:
s.d $f4,-16($s1)
addi $s1,$s1,-24
● Branch Instruction: The pipeline cannot know what the next instruction should be!
[Pipeline diagram, clock cycles 1-8: beq $2, $3, Label followed by an unknown next instruction (???)]
• Here we just stall until cycle 4, after we make the branch decision.
Branch prediction
● Branch Prediction: A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds according to that assumption rather than waiting to ensure the actual outcome.
● Possible Static Approaches:
1. always predict that branches will be untaken.
2. always predict that branches at the bottom of the loops will be taken, and branches at the top of the loop will be untaken.
● Dynamic Branch Prediction: Keep history for each branch as taken or untaken, and then use the recent past behavior to predict the future.
• Another approach is to guess whether or not the branch is taken.
— In terms of hardware, it’s easier to assume the branch is not taken.
— This way we just increment the PC and continue execution, as for normal instructions.
• If we’re correct, then there is no problem and the pipeline keeps going at full speed.
[Pipeline diagram, clock cycles 1-7: beq $2, $3, Label followed by next instruction 1 and next instruction 2]
[Pipeline diagram: beq $2, $3, Label followed by next instruction 1 and next instruction 2, with flush markers on the instructions fetched after the branch]
IF ID EX MEM WB
   IF IF ID EX MEM WB --- lost 1 cycle
• If prediction:
— If Correct
IF ID EX MEM WB
   IF ID EX MEM WB -- no cycle lost
— If Misprediction:
IF ID EX MEM WB
   IF0 IF1 ID EX MEM WB --- 1 cycle lost

[2-bit prediction scheme: state diagram with states “00”, “01”, “11”, and “10”]
Given: Consider a loop that branches seven times in a row and then is not taken once. What is the prediction accuracy using the above 2-bit prediction scheme with the initial state of “00”?

Branch   Prediction correct?   State after
1        True                  0
2        True                  0
3        True                  0
4        True                  0
5        True                  0
6        True                  0
7        True                  0
8        False                 01
7/8 = 87.5%
[2-bit prediction scheme: state diagram with states “00”, “01”, “11”, and “10”]
Given: Consider a loop that branches seven times in a row and then is not taken once. What is the prediction accuracy using the above 2-bit prediction scheme with the initial state of “10”?

Branch   Prediction correct?   State after
1        False                 1
2        True                  0
3        True                  0
4        True                  0
5        True                  0
6        True                  0
7        True                  0
8        False                 01
6/8 = 75%
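Both worked examples can be reproduced with a short Python simulation. The state diagram itself did not survive in this text, so the sketch assumes a saturating 2-bit counter in which “00” and “01” predict taken, “10” and “11” predict not taken, a taken branch moves the state toward “00”, and a not-taken branch moves it toward “11”. That assumption is consistent with the 7/8 and 6/8 answers above, but it is not confirmed by the slides. The function name is made up.

```python
def simulate_two_bit(outcomes, initial_state):
    """Run an assumed 2-bit saturating-counter predictor.
    outcomes: list of booleans, True = branch taken.
    initial_state: two-character bit string such as "00".
    Returns (list of 'prediction correct?' flags, final state string)."""
    state = int(initial_state, 2)
    correct = []
    for taken in outcomes:
        predicted_taken = state < 2          # "00"/"01" predict taken (assumed)
        correct.append(predicted_taken == taken)
        if taken:
            state = max(state - 1, 0)        # move toward "00"
        else:
            state = min(state + 1, 3)        # move toward "11"
    return correct, format(state, "02b")

# Taken seven times in a row, then not taken once.
loop = [True] * 7 + [False]
for start in ("00", "10"):
    flags, final_state = simulate_two_bit(loop, start)
    print(start, flags, final_state, sum(flags) / len(flags))
```

Starting from “00” only the final not-taken branch is mispredicted (7/8). Starting from “10” the first branch is also mispredicted (6/8).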