CMP3010L07 Tomasulo

The document discusses dynamic multiple-issue processors, particularly focusing on superscalar processors and their ability to execute multiple instructions per clock cycle. It explains dynamic pipeline scheduling, the components involved, and introduces Tomasulo's Algorithm for out-of-order execution, which helps avoid data hazards. The document also details the life cycle of an instruction within this architecture, emphasizing the importance of reservation stations and register renaming.

Uploaded by

Mostafa Mohamed

CMP3010: Computer Architecture

L06: Multiple Issue Techniques

Dina Tantawy
Computer Engineering Department
Cairo University
Dynamic Multiple-Issue Processors
• Superscalar Processors
– An advanced pipelining technique that enables the processor to execute
more than one instruction per clock cycle by selecting them during
execution.

• In the simplest superscalar processors, instructions issue in order, and the
processor decides whether zero, one, or more instructions can
issue in a given clock cycle.
Difference between Simple Superscalar
and VLIW Processor
• In a superscalar processor, the hardware guarantees that the code executes
correctly.
• Compiled code will always run correctly, independent of the
issue rate or pipeline structure of the processor.
• In some VLIW designs, recompilation was required when
moving across different processor models.
In VLIW, the compiler does the scheduling.

How can we do this in a superscalar processor?
Dynamic Pipeline Scheduling
• Hardware support for reordering instruction
execution to avoid stalls.
• Chooses which instructions to execute in a given clock cycle
while trying to avoid hazards and stalls (out-of-order
execution).
Dynamic Pipeline Scheduling
• In the slide's example, the subtract does not need to wait for the addu
or the load.
Dynamic Pipeline Scheduling
• The pipeline is divided into 4 major units:
– Instruction fetch and issue unit
– Multiple functional units
– Write Result
– Commit unit
Dynamic Pipeline Scheduling
• The pipeline is divided into 4 major units:
– Instruction fetch and issue unit
• Fetches instructions, decodes them, and sends each instruction to a corresponding
functional unit for execution.
• If a reservation station is free (no structural hazard) and a reorder buffer slot is free, issue the
instruction and send the operands and the reorder buffer number for the destination (this stage is
sometimes called “dispatch”).
• Source operand values are issued with the instruction if they are already in the registers.
• If the reservation stations or the reorder buffer are full, the instruction stalls.
• If source operands are not yet in the registers, rename registers (this eliminates WAR and WAW
hazards) and keep track of the functional units that will produce the operands.
Dynamic Pipeline Scheduling
• The pipeline is divided into 4 major units:
– Instruction fetch and issue unit
– Multiple functional units (operate on operands (EX))
• Each functional unit has buffers, called reservation stations, that hold the
operands and the operation.
• If both operands are ready, execute.
• If not, watch the Common Data Bus for the result (avoids RAW hazards).
Dynamic Pipeline Scheduling
• The pipeline is divided into 4 major units:
– Instruction fetch and issue unit
– Multiple functional units
– Write Results
• When a result is complete, it is sent over the Common Data Bus to any reservation stations
waiting for it and to the commit unit, which buffers the result (in the reorder buffer)
until it is safe to put it into the register file or, for a store, into memory.
• Write on the Common Data Bus to all units; mark the reservation station available.
• Common Data Bus: broadcasts the data together with its source.
Dynamic Pipeline Scheduling
• The pipeline is divided into 4 major units:
– Instruction fetch and issue unit
– Multiple functional units
– Write results
– Commit unit
• The unit that decides when it is safe to release the result of an operation to
programmer-visible registers and memory.
• The unit is used to make sure that the results of all instructions will be written
in the same order that instructions are fetched.
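The in-order behaviour of the commit unit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ReorderBuffer`, its method names, and the example values are hypothetical and do not come from the slides.

```python
from collections import deque


class ReorderBuffer:
    """Hypothetical minimal reorder buffer: entries retire strictly in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction sits at the left end

    def allocate(self, tag, dest):
        # Called at issue time, so the queue order equals the program order.
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def write_result(self, tag, value):
        # Called when a functional unit broadcasts its result.
        for entry in self.entries:
            if entry["tag"] == tag and not entry["done"]:
                entry["value"], entry["done"] = value, True
                return

    def commit(self, registers):
        """Retire the head entry if its result is present; return its tag, else None."""
        if self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            registers[head["dest"]] = head["value"]
            return head["tag"]
        return None


rob = ReorderBuffer()
regs = {}
rob.allocate("M2", "F10")    # older instruction, e.g. a long divide
rob.allocate("A2", "F6")     # younger instruction
rob.write_result("A2", 7.0)  # the younger one finishes first
first = rob.commit(regs)     # nothing retires: the head (M2) has no result yet
rob.write_result("M2", 3.0)
second = rob.commit(regs)    # now M2 retires
third = rob.commit(regs)     # and A2 follows, in program order
```

A finished younger instruction stays in the buffer until every older instruction has committed, which is what keeps the register file update order equal to the fetch order.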
Tomasulo’s Algorithm
• A hardware algorithm for dynamic
scheduling of instructions that allows out-of-order execution,
designed to use multiple execution units efficiently.
• Developed for IBM (1966)
• Goal: High Performance without special compilers
Tomasulo’s Algorithm
• Tracks when operands are available to satisfy data dependences.
• Removes name dependences through register renaming.
• Very similar to what is used today: almost all modern high-performance
processors use a derivative of Tomasulo’s algorithm, and much of
the terminology survives today.
Tomasulo’s Algorithm
• Avoid RAW hazards
– Execute an instruction only when its operands are available
– Has a scheme to track when operands are available
• Avoid WAR and WAW hazards
– Support register renaming
– Renames all destination registers: an out-of-order write does not affect any instructions that
depend on an earlier value of an operand
DIVD F0, F2, F4
ADDD F6, F0, F8
SD F6, 0(R1)
SUBD F8, F10, F14
MULD F6, F10, F8
Tomasulo’s Algorithm
• The same sequence before and after register renaming (S and T are temporary registers):
Hazard       Original              Renamed
             DIVD F0, F2, F4       DIVD F0, F2, F4
             ADDD F6, F0, F8       ADDD S, F0, F8
             SD   F6, 0(R1)        SD   S, 0(R1)
WAR          SUBD F8, F10, F14     SUBD T, F10, F14
WAR & WAW    MULD F6, F10, F8      MULD F6, F10, T
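Register renaming can be sketched as a single pass over the instruction list. The sketch below is an assumption-laden illustration, not the hardware mechanism: the function `rename` and the temporary names `T0`, `T1`, ... are hypothetical, and unlike the slide (which renames only the two conflicting destinations to S and T) it gives every destination a fresh name.

```python
def rename(program):
    """Give every destination a fresh name and redirect later readers to it.

    program: list of (op, dest, srcs) tuples; dest is None for stores.
    """
    mapping = {}  # architectural register -> name of its latest value
    fresh = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources read the most recent name of each register (true dependences are kept).
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{fresh}"  # a new name removes WAR and WAW conflicts
            fresh += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F6", ("F0", "F8")),
    ("SD", None, ("F6", "R1")),
    ("SUBD", "F8", ("F10", "F14")),
    ("MULD", "F6", ("F10", "F8")),
]
renamed = rename(program)
```

After renaming, SUBD writes `T2` instead of F8, so it can no longer overwrite the F8 that ADDD still has to read, and MULD writes `T3` instead of F6, so it cannot clash with ADDD or SD.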
Tomasulo’s Algorithm
• FU buffers are called reservation stations (RS); they hold pending operands
• Registers in instructions are replaced by values or pointers to reservation
stations; this is called register renaming
– Avoids WAR and WAW hazards
• A Common Data Bus broadcasts results to all FUs
• Loads and stores are treated as FUs with reservation stations as well
Reservation Station Components
• Busy
– Indicates that the reservation station is busy
• Op
– Operation to perform in the unit (e.g., + or –)
• Vj, Vk
– Values of the source operands
– Store buffers have a V field with the result to be stored
• Qj, Qk
– Reservation stations producing the source operands (Qj, Qk = 0 => ready)
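The fields above map naturally onto a small record. The following Python sketch is illustrative only: the class `ReservationStation`, the methods `ready` and `capture`, and the value 2.5 are assumptions, and `None` stands in for the slide's "Qj, Qk = 0" convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of the first source operand
    vk: Optional[float] = None  # value of the second source operand
    qj: Optional[str] = None    # tag of the station producing vj (None = ready)
    qk: Optional[str] = None    # tag of the station producing vk (None = ready)

    def ready(self):
        # The operation may start only when neither operand is still pending.
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag, value):
        """Snoop a Common Data Bus broadcast and grab the value if we wait on this tag."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


# A multiply waiting on load station L2 for its first operand.
m1 = ReservationStation("M1", busy=True, op="MULT", vk=4.0, qj="L2")
before = m1.ready()   # still waiting on L2
m1.capture("L2", 2.5) # L2 broadcasts its result
after = m1.ready()    # both operands are now present
```

Waiting on a tag instead of a register name is what lets the station ignore any later write to the same architectural register.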
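[Figure: Tomasulo datapath. Fetch & Decode feeds a register file (F1 to F10; columns Val, Rs, O), the Load RS (L1 to L3; fields busy, Address) in front of Mem (LD → 2 cycles), the Add RS (A1 to A3; fields b, op, Vi, Vj, Qi, Qj) in front of the Add/Sub unit (Add/Sub → 2 cycles), and the Mult RS (M1, M2; same fields) in front of the Mult/div unit (Mult → 10 cycles, Div → 40 cycles). Results go to the Commit Unit (reorder buffer).]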
Life Cycle of one Instruction

Cycle #1: Fetch & Decode
Life Cycle of one Instruction: Clock Cycle 1
• Instruction: MULT F0, F2, F4
• Timing so far (issue / start / write): MUL 1
1. Is there a place in the relevant reservation station and reorder buffer?
• M1 is free, so it is marked busy with op M.
2. Get operands from the register file or reorder buffer
• M1 receives 2 (from F2) and 23 (F4 is tagged L1, and the reorder buffer holds L1 : 23).
3. Rename destination register
• The Rs field of F0 is set to M1.
[Figure: datapath state. All Load RS and Add RS entries are free; registers F0 to F10 hold 0 to 10; F4 has Rs = L1 and O = 1.]
Cycle #2: Execute Phase
Life Cycle of one Instruction: Clock Cycle 2
• Timing so far (issue / start / write): MUL 1 / 2
1. Are all operands ready?
2. Is the Mult unit busy?
• Both operands are in M1 and the Mult/div unit is free, so M1 starts executing in the Mult/div unit.
[Figure: datapath state. M1 = Yes, M, 2, 23; F0 has Rs = M1.]
Fast forward to Cycle #12: Write Result
Life Cycle of one Instruction: Clock Cycle 12
• Timing so far (issue / start / write): MUL 1 / 2 / 12
1. Is the common data bus free?
• The Mult/div unit has produced M1: 46.
2. Go to everyone listening to the bus
• The result M1: 46 is placed in the reorder buffer, reservation station M1 is freed, and the O flag of F0 (Rs = M1) is set to 1.
[Figure: datapath state after the broadcast.]
Fast forward to some later cycle
Life Cycle of one Instruction: Clock Cycle X
• Timing (issue / start / write): MUL 1 / 2 / 12
1. Is the top of the queue ready to commit?
• The reorder buffer holds M1 : 46; once it commits, the entry is removed and the Rs and O fields of F0 are cleared.
[Figure: datapath state before and after the commit; all reservation stations are free.]
Another Example
Instructions:
LD F6, 34(R2)
LD F2, 45(R3)
MULT F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2

Latencies (clock cycles):
LD 2
MULT 10
DIVD 40
ADDD, SUBD 2
Clock Cycle 1
• Program (as listed on each slide): LD F6, 34(R2); LD F2, 45(R3); MULT F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2; MULT F4, F1, F5
• Timing so far (issue / start / write): LD 1
• Before issue, every reservation station is free and F0 to F10 hold 0 to 10.
• The first LD issues to L1 (busy = Yes, Address = 34+R2); F6 gets Rs = L1.
Clock Cycle 2
• Timing so far (issue / start / write): LD 1 / 2; LD 2
• L1 starts executing in the Mem unit.
• The second LD issues to L2 (busy = Yes, Address = 45+R3); F2 gets Rs = L2.
Clock Cycle 3
• Timing so far (issue / start / write): LD 1 / 2; LD 2; MUL 3
• L1 is still executing in the Mem unit; L2 waits for the unit.
• MULT F0, F2, F4 issues to M1 with the value 4 (from F4) and the tag L2 in place of F2; F0 gets Rs = M1.
• Note that we wrote L2 instead of the register value.
Clock Cycle 4
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4; MUL 3; SUBD 4
• L1 writes its result: the reorder buffer holds L1:F6, the O flag of F6 is set to 1, and the L1 reservation station is freed.
• L2 starts executing in the Mem unit.
• SUBD F8, F6, F2 issues to A1: it reads the L1 value, V(L1), and waits on L2; F8 gets Rs = A1.
• M1 is still waiting on L2.
Clock Cycle 5
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4; MUL 3; SUBD 4; DIV 5
• The reorder buffer entry for L1 is freed: F6 now holds V(L1).
• DIVD F10, F0, F6 issues to M2 with the value V(L1) and the tag M1 in place of F0; F10 gets Rs = M2.
• L2 is still executing; A1 and M1 are still waiting on L2.
Clock Cycle 6
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3; SUBD 4; DIV 5; ADD 6
• L2 writes its result: the reorder buffer holds L2:F2, the O flag of F2 is set to 1, and the L2 reservation station is freed.
• ADDD F6, F8, F2 issues to A2 with the value V(L2) and the tag A1 in place of F8; F6 gets Rs = A2.
• A1 now holds V(L1) and V(L2): Sub is ready and will start executing next cycle.
• M1 now holds V(L2) and 4: Mult is also ready. What will happen next cycle?
• M2 still waits on M1.
Clock Cycle 7
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7; SUBD 4 / 7; DIV 5; ADD 6
• A1 starts executing in the Add/Sub unit and M1 starts executing in the Mult/div unit.
• MULT F4, F1, F5 cannot issue: there is no place in the Mult RS (M1 and M2 are busy), so it stalls.
• A2 waits on A1; M2 waits on M1.
Clock Cycle 8
• No change: A1 and M1 are still executing, and MULT F4, F1, F5 is still stalled.
Clock Cycle 9
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7; SUBD 4 / 7 / 9; DIV 5; ADD 6
• A1 writes its result: the reorder buffer holds A1:F8, the O flag of F8 is set to 1, and the A1 reservation station is freed.
• A2 now holds V(A1) and V(L2).
• M1 is still executing; M2 waits on M1; MULT F4, F1, F5 is still stalled.
Clock Cycle 10
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7; SUBD 4 / 7 / 9; DIV 5; ADD 6 / 10
• A2 starts executing in the Add/Sub unit.
• The reorder buffer holds A1:F8.
• M1 is still executing; M2 waits on M1; MULT F4, F1, F5 is still stalled (no place in the Mult RS).
Clock Cycle 11
• No change: A2 and M1 are still executing.
Clock Cycle 12
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7; SUBD 4 / 7 / 9; DIV 5; ADD 6 / 10 / 12
• A2 writes its result: the reorder buffer holds A1:F8 and A2:F6, the O flag of F6 is set to 1, and the A2 reservation station is freed.
• M1 is still executing; M2 waits on M1; MULT F4, F1, F5 is still stalled (no place in the Mult RS).
Clock Cycle 13
• No change: M1 is still executing.
Fast forward to clock #17: Write Result
Clock Cycle 17
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7 / 17; SUBD 4 / 7 / 9; DIV 5; ADD 6 / 10 / 12
• M1 writes its result: the reorder buffer holds M1:F0, A1:F8, and A2:F6, and the O flag of F0 is set to 1.
• M2 now holds V(M1) and V(L1).
• M1 is now free, so the stalled MULT F4, F1, F5 can be loaded next cycle.
Clock Cycle 18
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7 / 17; SUBD 4 / 7 / 9; DIV 5 / 18; ADD 6 / 10 / 12; MUL 18
• F0 now holds V(M1).
• M2 (the DIVD) starts executing in the Mult/div unit.
• MULT F4, F1, F5 issues to M1 with the values 1 and 5; F4 gets Rs = M1.
• Reorder buffer as shown on the slide: M1:F0, A1:F8, A2:F6.
• M1 is ready. Can it execute?
Clock Cycle 19
• Timing so far (issue / start / write): unchanged from cycle 18
• A1 can now be written to the register file: F8 holds V(A1).
• Reorder buffer as shown on the slide: A1:F8, A2:F6.
• M2 is still executing in the Mult/div unit; M1 holds the values 1 and 5.
Clock Cycle 20
• Timing so far (issue / start / write): unchanged from cycle 18
• The reorder buffer holds only A2:F6. Why is A2 still in the reorder buffer?
• M2 is still executing in the Mult/div unit; M1 holds the values 1 and 5.
Fast forward to clock #58: Write Result
Clock Cycle 58
• Timing so far (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7 / 17; SUBD 4 / 7 / 9; DIV 5 / 18 / 58; ADD 6 / 10 / 12; MUL 18 / 58
• M2 writes its result: the reorder buffer holds M2:F10 and A2:F6, the O flag of F10 is set to 1, and the M2 reservation station is freed.
• M1 is ready. Can it execute? The Mult/div unit is now free, and M1 starts executing in it.
Fast forward to clock #68: Write Result
Clock Cycle 68
• Timing (issue / start / write): LD 1 / 2 / 4; LD 2 / 4 / 6; MUL 3 / 7 / 17; SUBD 4 / 7 / 9; DIV 5 / 18 / 58; ADD 6 / 10 / 12; MUL 18 / 58 / 68
• M1 writes its result: the reorder buffer holds M1:F4, the O flag of F4 is set to 1, and the M1 reservation station is freed.
• F6 now holds V(A2) and F10 holds V(M2); all reservation stations are free.
Important Notes
• If an instruction is fetched at cycle #1 and takes 4 cycles to execute, it
starts execution at cycle #2, finishes at the end of cycle #5, and
writes back at cycle #6.
• The next instruction that requires the FU will start at #6.
• The next instruction that requires the RS will use it at #7.
• The next instruction that requires the data will use it in Exe at #7
(no forwarding).
• Execution units are non-pipelined unless stated otherwise.
• Only one execution unit is available per function unless stated otherwise.
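These timing rules are enough to recompute the issue / start / write cycles of the seven-instruction example. The Python sketch below is a simplified model written for illustration, not the lecture's own tool. Its assumptions: one instruction issues per cycle, the oldest ready instruction wins a functional unit, base registers of loads are always ready, Common Data Bus conflicts and commit are ignored, and the names `Instr`, `simulate`, and `example_program` are hypothetical.

```python
from dataclasses import dataclass, field

LATENCY = {"LD": 2, "MULT": 10, "DIVD": 40, "ADDD": 2, "SUBD": 2}
UNIT = {"LD": "load", "MULT": "mult", "DIVD": "mult", "ADDD": "add", "SUBD": "add"}
STATIONS = {"load": 3, "add": 3, "mult": 2}  # reservation stations per unit


@dataclass
class Instr:
    op: str
    dest: str
    srcs: tuple = ()   # floating-point source registers only (assumption)
    issue: int = 0
    start: int = 0
    write: int = 0     # 0 means "not known yet"
    waits_on: list = field(default_factory=list)


def simulate(program):
    producer = {}                             # register -> instruction that will write it
    unit_free_at = {u: 1 for u in STATIONS}   # first cycle a unit can start a new op
    issued, pending, cycle = [], list(program), 0
    while pending or any(i.start == 0 for i in issued):
        cycle += 1
        # Start execution, oldest first. Data written at cycle w is usable at
        # w + 1 (no forwarding); a unit freed at cycle w can start a new op at w.
        for ins in issued:
            unit = UNIT[ins.op]
            ready = all(0 < p.write < cycle for p in ins.waits_on)
            if ins.start == 0 and ins.issue < cycle and ready and unit_free_at[unit] <= cycle:
                ins.start = cycle
                ins.write = cycle + LATENCY[ins.op]
                unit_free_at[unit] = ins.write
        # In-order issue: stall if every station of the needed type is occupied.
        # A station freed by a write at cycle w is reusable from w + 1.
        if pending:
            ins = pending[0]
            unit = UNIT[ins.op]
            occupied = sum(1 for i in issued
                           if UNIT[i.op] == unit and (i.write == 0 or i.write >= cycle))
            if occupied < STATIONS[unit]:
                pending.pop(0)
                ins.issue = cycle
                ins.waits_on = [producer[s] for s in ins.srcs if s in producer]
                producer[ins.dest] = ins  # later readers wait on this instruction
                issued.append(ins)
    return [(i.issue, i.start, i.write) for i in program]


def example_program():
    return [
        Instr("LD", "F6"), Instr("LD", "F2"),
        Instr("MULT", "F0", ("F2", "F4")), Instr("SUBD", "F8", ("F6", "F2")),
        Instr("DIVD", "F10", ("F0", "F6")), Instr("ADDD", "F6", ("F8", "F2")),
        Instr("MULT", "F4", ("F1", "F5")),
    ]


timing = simulate(example_program())
```

Under these assumptions `timing` holds one (issue, start, write) tuple per instruction in program order, which can be compared row by row with the table built up on the example slides.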
Tomasulo’s Summary
• Prevents the registers from being a bottleneck
• Avoids different data hazards
• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store buffers

Performance is limited by the Common Data Bus. Why?
Tomasulo’s Summary
• Without re-order buffer
• In-order issue, out-of-order execution, and out-of-order completion
• What will happen in case of a control hazard?
• Tomasulo with a re-order buffer is called speculative Tomasulo
• What is the speedup of this processor compared to a similar architecture without
dynamic scheduling?

Instruction  issue  start  write
LD           1      2      4
LD           2      4      6
MUL          3      7      17
SUBD         4      7      9
DIV          5      18     58
ADD          6      10     12
MUL          18     58     68
Four Stages of Tomasulo’s Algorithm
1. Issue: get instruction from FP Op Queue
If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the
reorder buffer number for the destination (this stage is sometimes called “dispatch”).
2. Execution: operate on operands (EX)
When both operands are ready, execute; if not, watch the CDB for the result and execute once both are in the
reservation station. This stage checks RAW hazards (sometimes called “issue”).
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting FUs
and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
When the instruction at the head of the reorder buffer has its result present, update the register with the result
(or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the
reorder buffer (this stage is sometimes called “graduation”).
Speculative Tomasulo’s Algorithm
Important Terms
Exercise
• LD F1, 0(R2) // LD → 3 cycles
• ADD F2, F3, F1 // ALU → 1 cycle
• SUB F2, F4, F5
• XOR F4, F2, F1
• SW F1, 4(R1) // SW → 3 cycles
Thank you
