Instruction Level Parallelism
Omid Fatemi
Advanced Computer Architecture
University of Tehran
Outline
• MIPS R4000
• Dynamic Scheduling
• Scoreboard Implications
• Scoreboard Example
Putting It All Together:
MIPS R4000
• 8 Stage Pipeline:
– IF–first half of fetching of instruction; PC selection happens here as well
as initiation of instruction cache access.
– IS–second half of access to instruction cache.
– RF–instruction decode and register fetch, hazard checking and also
instruction cache hit detection.
– EX–execution, which includes effective address calculation, ALU
operation, and branch target computation and condition evaluation.
– DF–data fetch, first half of access to data cache.
– DS–second half of access to data cache.
– TC–tag check, determine whether the data cache access hit.
– WB–write back for loads and register-register operations.
• 8 Stages: What is the impact on load delay? On branch delay? Why?
Case Study: MIPS R4000
TWO-cycle load latency:
IF IS RF EX DF DS TC WB
   IF IS RF EX DF DS TC
      IF IS RF EX DF DS
         IF IS RF EX DF
            IF IS RF EX
               IF IS RF
                  IF IS
                     IF
THREE-cycle branch latency (conditions evaluated during EX phase):
IF IS RF EX DF DS TC WB
   IF IS RF EX DF DS TC
      IF IS RF EX DF DS
         IF IS RF EX DF
            IF IS RF EX
               IF IS RF
                  IF IS
                     IF
• Delay slot plus two stalls
• Branch likely cancels delay slot if not taken
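The two latencies can be read off the stage order. The sketch below is only an illustration: it assumes that load data is forwarded at the end of DS (before the tag check in TC), that a dependent instruction needs its operand at the start of EX, and that a branch outcome known at the end of EX redirects the next IF. The helper name `delay` is hypothetical.

```python
# Stage order of the R4000 integer pipeline, as listed on the slide.
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def delay(produced_in: str, needed_in: str) -> int:
    """Number of instruction slots between a producer and the first
    instruction that can use its result without stalling."""
    return STAGES.index(produced_in) - STAGES.index(needed_in)

# Assumption: load data is available at the end of DS and consumed in EX.
load_delay = delay("DS", "EX")
# Assumption: the branch is resolved at the end of EX and steers the next IF.
branch_delay = delay("EX", "IF")

print(load_delay, branch_delay)
```

Under these assumptions the sketch gives a load delay of 2 and a branch delay of 3, matching the two diagrams.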
Branch Delay
[Figure: pipeline timing for a taken branch and for a not-taken branch]
MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:
Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers
MIPS FP Pipe Stages
FP instruction   Stages by cycle (1, 2, 3, ...)
Add, Subtract    U  S+A  A+R  R+S
Multiply         U  E+M  M  M  M  N  N+A  R
Divide           U  A  R  D^28  ...  D+A  D+R, D+R, D+A, D+R, A, R
Square root      U  E  (A+R)^108  ...  A  R
Negate           U  S
Absolute value   U  S
FP compare       U  A  R
(X^n: stage X repeated n times)
Stages:
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
Latencies
Multiply followed by Add
Add followed by a Multiply
No stall
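The multiply/add interaction can be checked directly from the FP pipe stage table. The sketch below is a simplification: it only reports whether two instructions would need the same stage in the same cycle if the second issued a given number of cycles after the first, and it does not model how long the resulting stall lasts. The names `ADD`, `MUL` and `conflicts` are illustrative.

```python
# Stage usage per cycle, copied from the FP pipe stage table
# (each set holds the stages an instruction occupies in that cycle).
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]
MUL = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]

def conflicts(first, second, gap):
    """Return (cycle, shared stages) pairs where both instructions need the
    same stage, if `second` issues `gap` cycles after `first`."""
    clashes = []
    for i, stages in enumerate(second):
        cycle = gap + i  # cycle counted from the issue of `first`
        if cycle < len(first) and first[cycle] & stages:
            clashes.append((cycle, sorted(first[cycle] & stages)))
    return clashes

# Issue distances (1..7) at which the second instruction would collide.
mul_then_add = [g for g in range(1, 8) if conflicts(MUL, ADD, g)]
add_then_mul = [g for g in range(1, 8) if conflicts(ADD, MUL, g)]
print(mul_then_add, add_then_mul)
```

With these tables an add collides with an earlier multiply only when it issues 4 or 5 cycles later (both need the adder's A and R stages), while a multiply issued after an add never collides, which is the "No stall" case.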
Add followed by a Divide
Divide followed by Add
R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
[Chart: CPI, on a scale from 0 to 4.5, for gcc, doduc, espresso, nasa7, ora, tomcatv, eqntott, li, spice2g6 and su2cor; each bar is split into base, load stalls, branch stalls, FP result stalls and FP structural stalls]
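The chart stacks the stall categories on top of the base CPI. A minimal sketch of that arithmetic follows; the per-instruction stall values are made-up placeholders, not measurements from the R4000 or from the chart.

```python
def cpi(base: float, stalls: dict) -> float:
    """Effective CPI = base CPI + average stall cycles per instruction."""
    return base + sum(stalls.values())

# Hypothetical stall cycles per instruction, for illustration only.
example_stalls = {
    "load": 0.25,
    "branch": 0.5,
    "fp_result": 0.125,
    "fp_structural": 0.0,
}

print(cpi(1.0, example_stalls))
```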
HW Schemes: Instruction Parallelism
• Why in HW at run time?
– Works when real dependences cannot be known at compile time
– Simpler compiler
– Code for one machine runs well on another
RISC or CISC
• For ILP?
– Example: A ← B + C
– CISC: 1 instruction
– RISC: 4 instructions
• RISC:
– More chance to schedule
Dynamic Scheduling
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
• 7-cycle divider
• 4-cycle adder
In-order:
Cycle                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
DIV.D F0, F2, F4     F  I  D  E  E  E  E  E  E  E  M  W
ADD.D F10, F0, F8       F  I  D  S  S  S  S  S  S  E  E  E  E  M  W
SUB.D F12, F8, F14         F  I  D  S  S  S  S  S  S  S  S  S  E  E  E  E  M  W
Explanation of I
• To be able to execute the SUB.D instruction
– A function unit must be available
» Adder is free in example
– There should be no data hazards preventing early
execution
» None in this example
– We must be able to recognize the two previous
conditions
» Must examine several instructions before deciding
on what to execute
• I represents the instruction window (or issue
window) in which this examination happens
– If every instruction starts execution in order, then I is
superfluous
– Otherwise:
» Instructions enter the issue window in order
» Several instructions may be in issue window at any
instant
» Execution can begin out of order
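To make the effect of the issue window concrete, the sketch below computes the cycle in which each instruction of the DIV.D / ADD.D / SUB.D example starts executing, once with in-order start and once with out-of-order start. It is a simplified model under stated assumptions: instruction n is fetched in cycle n (so it can execute no earlier than cycle n + 3), functional units are not pipelined, results are forwarded as soon as the producer finishes, and the code has no WAR or WAW hazards. The function `exec_start` is a hypothetical helper.

```python
import math

LAT = {"DIV.D": 7, "ADD.D": 4, "SUB.D": 4}            # 7-cycle divider, 4-cycle adder
UNIT = {"DIV.D": "div", "ADD.D": "add", "SUB.D": "add"}
PROG = [("DIV.D", "F0", ("F2", "F4")),
        ("ADD.D", "F10", ("F0", "F8")),
        ("SUB.D", "F12", ("F8", "F14"))]

def exec_start(prog, in_order):
    """First execute cycle of each instruction (cycles are 1-based)."""
    start = [None] * len(prog)
    # A destination is unavailable (infinity) until its producer starts.
    # Assumes no WAR/WAW hazards; otherwise this simple model can deadlock.
    reg_ready = {dest: math.inf for _, dest, _ in prog}
    unit_free = {}
    cycle = 0
    while None in start:
        cycle += 1
        for n, (op, dest, srcs) in enumerate(prog):
            if start[n] is not None:
                continue
            ready = (cycle >= n + 4          # F, I, D must come first
                     and all(reg_ready.get(s, 0) <= cycle for s in srcs)
                     and unit_free.get(UNIT[op], 0) <= cycle)
            if ready:
                start[n] = cycle
                reg_ready[dest] = cycle + LAT[op]
                unit_free[UNIT[op]] = cycle + LAT[op]
            if in_order:
                break  # only the oldest waiting instruction is examined
    return start

print(exec_start(PROG, in_order=True))
print(exec_start(PROG, in_order=False))
```

In this model the in-order run starts execution in cycles 4, 11 and 15, as in the timing diagram, while the out-of-order run lets SUB.D start in cycle 6 because the adder is free and it has no data hazard.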
HW Schemes: Instruction Parallelism
• Out-of-order execution divides ID stage:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
• Scoreboards allow an instruction to execute whenever
1 & 2 hold, without waiting for prior instructions
• CDC 6600: in-order issue, out-of-order execution,
out-of-order commit (also called completion)
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• WAR:
» DIVD F0,F2,F4
» ADDD F10,F0,F8
» SUBD F8,F10,F14
• WAW:
» DIVD F0,F2,F4
» ADDD F0,F5,F8
• Scoreboard keeps track of dependencies and the state of operations
– for WAW: stall in Issue until previous write completes
– for WAR: stall in Write Result until previous read completes
• Need to have multiple instructions in execution phase =>
multiple execution units or pipelined execution units
• Scoreboard replaces ID, EX, WB with 4 stages
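The WAR and WAW examples above can be classified mechanically from destination and source registers. The following is a small illustrative sketch; the `(dest, sources)` tuple encoding and the function name `hazards` are assumptions made here, not part of the scoreboard itself.

```python
def hazards(first, second):
    """Name the hazards between two instructions given as (dest, sources),
    where `first` precedes `second` in program order."""
    (d1, s1), (d2, s2) = first, second
    found = set()
    if d1 in s2:
        found.add("RAW")  # second reads what first writes
    if d2 in s1:
        found.add("WAR")  # second overwrites a register first still reads
    if d1 == d2:
        found.add("WAW")  # both write the same register
    return found

# WAR example: ADDD F10,F0,F8 followed by SUBD F8,F10,F14
print(hazards(("F10", ("F0", "F8")), ("F8", ("F10", "F14"))))
# WAW example: DIVD F0,F2,F4 followed by ADDD F0,F5,F8
print(hazards(("F0", ("F2", "F4")), ("F0", ("F5", "F8"))))
```

Note that the ADDD/SUBD pair carries both a WAR hazard on F8 and a RAW hazard on F10.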
Out-of-order Execution and
Renaming
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F10, F8, F14
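In this sequence ADD.D and SUB.D both write F10, so out-of-order completion creates a WAW hazard. Renaming removes it by giving every result its own name. The sketch below shows one simple way to do that; the tag names `T0`, `T1`, ... and the function `rename` are illustrative assumptions, not the mechanism of any particular machine.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh tag and redirect later reads to it."""
    tags = count()
    current = {}   # architectural register -> tag holding its latest value
    renamed = []
    for op, dest, srcs in program:
        # Look up sources before the destination mapping is updated.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        current[dest] = f"T{next(tags)}"
        renamed.append((op, current[dest], new_srcs))
    return renamed

program = [("DIV.D", "F0", ("F2", "F4")),
           ("ADD.D", "F10", ("F0", "F8")),
           ("SUB.D", "F10", ("F8", "F14"))]
for instr in rename(program):
    print(instr)
```

After renaming, ADD.D writes T1 and SUB.D writes T2, so the two writes no longer conflict, while the true dependence of ADD.D on the DIV.D result (T0) is preserved.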
What is a Scoreboard?
MIPS with a Scoreboard
Dynamic Scheduling with a
Scoreboard
• Original development in CDC 6600
• Simplified example in the book for MIPS FP
operations (Read Section C.7)
– Using neither renaming nor forwarding
» Values always move from registers to function units,
and from function units back to registers
– Out-of-order completion can give rise to WAR and WAW
hazards
» Machine “knows” original program order (needed for
hazard detection)
– Machine model
» 2 FP multipliers (10 cycles), 1 FP adder (2 cycles), 1
FP divider (40 cycles), all non-pipelined
» 1 integer unit for everything else (incl. memory
references)
Four Stages of Scoreboard
Control
1. Issue: decode instr. & check for structural hazards (ID1)
– If functional unit is free and no WAW hazard with other active instruction …
» … scoreboard issues the instruction to the functional unit and updates
its internal data structure.
– If a structural or WAW hazard exists …
» … instruction issue stalls
• no further instructions can issue until these hazards are cleared.
» CDC 6600 scoreboard would stall SUB.D until ADD.D reads ops
Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
Detailed Scoreboard Pipeline
Control
• Issue
– Wait until: not Busy(FU) and not Result(D)
– Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2'; Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU
• Read operands
– Wait until: Rj and Rk
– Bookkeeping: Rj ← No; Rk ← No
• Execution complete
– Wait until: functional unit done
• Write result
– Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
– Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
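The wait conditions and bookkeeping rules above translate almost line by line into code. The sketch below is a minimal illustration, assuming a dictionary per functional unit and a `result` dictionary for the register result status; the function names are hypothetical, and clearing Qj/Qk on write is a display choice made here rather than something the rules require.

```python
def empty():
    return dict(busy=False, op=None, Fi=None, Fj=None, Fk=None,
                Qj=None, Qk=None, Rj=False, Rk=False)

def can_issue(fus, result, fu, dest):
    # Not Busy(FU) and not Result(D): no structural and no WAW hazard.
    return not fus[fu]["busy"] and result.get(dest) is None

def issue(fus, result, fu, op, dest, s1, s2):
    qj, qk = result.get(s1), result.get(s2)   # units still producing s1, s2
    fus[fu] = dict(busy=True, op=op, Fi=dest, Fj=s1, Fk=s2,
                   Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
    result[dest] = fu

def read_operands(fus, fu):
    assert fus[fu]["Rj"] and fus[fu]["Rk"]    # wait until both are ready
    fus[fu]["Rj"] = fus[fu]["Rk"] = False

def can_write(fus, fu):
    # WAR check: no unit still has to read the old value of Fi(FU).
    fi = fus[fu]["Fi"]
    return all((f["Fj"] != fi or not f["Rj"]) and
               (f["Fk"] != fi or not f["Rk"]) for f in fus.values())

def write_result(fus, result, fu):
    for f in fus.values():                     # wake up waiting units
        if f["Qj"] == fu:
            f["Qj"], f["Rj"] = None, True
        if f["Qk"] == fu:
            f["Qk"], f["Rk"] = None, True
    result[fus[fu]["Fi"]] = None
    fus[fu] = empty()

# Replaying a few steps of the example: second load, MULTD, SUBD, load writes.
fus = {name: empty() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result = {}
issue(fus, result, "Integer", "Load", "F2", None, "R3")
issue(fus, result, "Mult1", "Mult", "F0", "F2", "F4")
issue(fus, result, "Add", "Sub", "F8", "F6", "F2")
waiting = (fus["Mult1"]["Qj"], fus["Add"]["Qk"])
write_result(fus, result, "Integer")
print(waiting, fus["Mult1"]["Rj"], fus["Add"]["Rk"])
```

Before the load writes, Mult1 and Add both wait on the Integer unit for F2; afterwards their ready flags are set, and a later ADDD still cannot issue because the Add unit is busy.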
Scoreboard Example
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Latencies: ADD 2 clock cycles, MUL 10 clock cycles, DIV 40 clock cycles
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
       FU
Scoreboard Example Cycle 1
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Latencies: ADD 2 clock cycles, MUL 10 clock cycles, DIV 40 clock cycles
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
1      FU                       Integer
Scoreboard Example Cycle 2
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2
LD     F2   45+  R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Latencies: ADD 2 clock cycles, MUL 10 clock cycles, DIV 40 clock cycles
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
2      FU                       Integer
Scoreboard Example Cycle 5
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
5      FU          Integer
Scoreboard Example Cycle 6
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
6      FU   Mult1  Integer
Scoreboard Example Cycle 7
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2   7
DIVD   F10  F0   F6
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
7      FU   Mult1  Integer               Add
Scoreboard Example Cycle 8b
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2   7
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
8      FU   Mult1                        Add  Divide
Scoreboard Example Cycle 9
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
9      FU   Mult1                        Add  Divide
Scoreboard Example Cycle 12
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
12     FU   Mult1                             Divide
Scoreboard Example Cycle 14
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
14     FU   Mult1               Add           Divide
Scoreboard Example Cycle 15
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
15     FU   Mult1               Add           Divide
Scoreboard Example Cycle 16
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14             16
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
16     FU   Mult1               Add           Divide
Scoreboard Example Cycle 17
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14             16
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
17     FU   Mult1               Add           Divide
Scoreboard Example Cycle 19
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14             16
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
19     FU   Mult1               Add           Divide
Scoreboard Example Cycle 20
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19                  20
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13     14             16
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
20     FU                       Add           Divide
Scoreboard Example Cycle 21
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19                  20
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8      21
ADDD   F6   F8   F2   13     14             16
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
21     FU                       Add           Divide
Scoreboard Example Cycle 22
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19                  20
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8      21
ADDD   F6   F8   F2   13     14             16                  22
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
40    Divide   Yes   Div   F10  F0   F6                     Yes  Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
22     FU                                     Divide
Scoreboard Example Cycle 61
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19                  20
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8      21             61
ADDD   F6   F8   F2   13     14             16                  22
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
61     FU                                     Divide
Scoreboard Example Cycle 62
Instruction status
Instruction j    k    Issue  Read operands  Execution complete  Write result
LD     F6   34+  R2   1      2              3                   4
LD     F2   45+  R3   5      6              7                   8
MULTD  F0   F2   F4   6      9              19                  20
SUBD   F8   F6   F2   7      9              11                  12
DIVD   F10  F0   F6   8      21             61                  62
ADDD   F6   F8   F2   13     14             16                  22
Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No
Register result status
Clock       F0     F2       F4  F6       F8   F10     F12  ...  F30
62     FU
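The final instruction status table can be reproduced with a short simulation. This is a sketch under simplifying assumptions, not the CDC 6600 logic: instead of the Busy/Qj/Rj fields it records the cycle of each event per instruction, lets an instruction advance at most one step per cycle, and makes an event visible to other instructions only in the following cycle. The load is given a 1-cycle execution latency, inferred from the table. All names (`Instr`, `simulate`, `happened`) are hypothetical.

```python
from dataclasses import dataclass

LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}   # units per kind

@dataclass
class Instr:
    name: str
    kind: str
    dest: str
    srcs: tuple
    issue: int = 0
    read: int = 0
    done: int = 0
    write: int = 0

def happened(t, cycle):
    """True if an event at time t took place before `cycle` (0 = never)."""
    return 0 < t < cycle

def simulate(program):
    cycle, next_issue = 0, 0
    while any(i.write == 0 for i in program):
        cycle += 1
        for n, ins in enumerate(program[:next_issue]):
            earlier = program[:n]
            if ins.read == 0:
                # Read operands: every earlier producer must have written.
                if happened(ins.issue, cycle) and all(
                        happened(e.write, cycle)
                        for e in earlier if e.dest in ins.srcs):
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.kind]
            elif ins.write == 0 and happened(ins.done, cycle):
                # Write result: wait while an earlier instruction still has
                # to read the old value of the destination (WAR).
                war = any(ins.dest in e.srcs and not happened(e.read, cycle)
                          for e in earlier)
                if not war:
                    ins.write = cycle
        if next_issue < len(program):
            # Issue in order: needs a free unit and no pending write (WAW).
            ins, earlier = program[next_issue], program[:next_issue]
            in_use = sum(e.kind == ins.kind and not happened(e.write, cycle)
                         for e in earlier)
            waw = any(e.dest == ins.dest and not happened(e.write, cycle)
                      for e in earlier)
            if in_use < UNITS[ins.kind] and not waw:
                ins.issue = cycle
                next_issue += 1
    return program

program = [Instr("LD", "Integer", "F6", ("R2",)),
           Instr("LD", "Integer", "F2", ("R3",)),
           Instr("MULTD", "Mult", "F0", ("F2", "F4")),
           Instr("SUBD", "Add", "F8", ("F6", "F2")),
           Instr("DIVD", "Divide", "F10", ("F0", "F6")),
           Instr("ADDD", "Add", "F6", ("F8", "F2"))]
table = [(i.issue, i.read, i.done, i.write) for i in simulate(program)]
for ins, row in zip(program, table):
    print(ins.name, row)
```

The rows match the cycle 62 table: ADDD cannot issue until cycle 13 because the single adder is busy (structural hazard), and it cannot write F6 until cycle 22 because DIVD only reads F6 in cycle 21 (WAR hazard).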
CDC 6600 Scoreboard
• Speedup:
– 1.7 from compiler (Fortran program);
– 2.5 by hand coded assembly programs.
– BUT slow memory (no cache) limits benefit.
• Scoreboard had about as much logic as one of the functional units.
• Limitations of 6600 scoreboard:
– No forwarding hardware.
– Limited to instructions in basic block (small
window).
– Small number of functional units (structural
hazards), especially integer/load store units.
– Do not issue on structural hazards.
– Wait for WAR hazards.
– Prevent WAW hazards.
Summary
• Instruction Level Parallelism (ILP) in HW
• HW exploiting ILP
– Works when dependences cannot be known at compile time
– Code for one machine runs well on another
• Key idea of Scoreboard: Allow instructions behind stall to proceed
(Decode => Issue instr & read operands)
– Enables out-of-order execution => out-of-order completion
– ID stage checked both for structural and WAW hazards;