Lecture10 - chapter4-p2
[Figure: single-cycle datapath. Components: PC, Add (+4), Instruction Memory, Register File, Sign Extend (16 to 32), ALU (ovf, zero), Data Memory. Control signals: RegWrite, ALUSrc, ALU control, MemWrite, MemtoReg, MemRead]
Adding the Control
• Selecting the operations to perform (ALU,
Register File and Memory read/write)
• Controlling the flow of data (multiplexor inputs)
Instruction formats (bit positions in brackets)
R-type: op [31-26] rs [25-21] rt [20-16] rd [15-11] shamt [10-6] funct [5-0]
I-Type: op [31-26] rs [25-21] rt [20-16] address offset [15-0]
J-type: op [31-26] target address [25-0]
Observations
• op field always in bits 31-26
• addr of registers to be read are always specified by the rs
field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the
base register
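The fixed field positions can be illustrated with a short Python sketch. The function name `decode` and the dictionary layout are illustrative choices, not part of the lecture; the bit positions are the ones listed in the formats above.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into every possible field.

    All fields are extracted regardless of format; the op field decides
    which of them are meaningful for a given instruction.
    """
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31-26
        "rs": (instr >> 21) & 0x1F,      # bits 25-21
        "rt": (instr >> 16) & 0x1F,      # bits 20-16
        "rd": (instr >> 11) & 0x1F,      # bits 15-11 (R-type)
        "shamt": (instr >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct": instr & 0x3F,           # bits 5-0   (R-type)
        "imm": instr & 0xFFFF,           # bits 15-0  (I-type)
        "target": instr & 0x03FFFFFF,    # bits 25-0  (J-type)
    }


# add $t0, $t1, $t2 : op=0, rs=$t1 (9), rt=$t2 (10), rd=$t0 (8), funct=0x20
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
fields = decode(word)
```

Because rs and rt sit in the same bits for every format, the register file can start reading before the control unit has finished decoding the op field.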
[Figure: datapath with the write-register multiplexor. Instr[25-21] drives Read Addr 1, Instr[20-16] drives Read Addr 2, and a multiplexor controlled by RegDst chooses between Instr[20-16] and Instr[15-11] for Write Addr. Control signals shown: RegWrite, RegDst]
ALU control block
• The circuit for ALU control unit
• Obtained through combinational digital logic design method
[Figure: ALU control circuit. Inputs: ALUOp (ALUOp1, ALUOp0) and the funct field F(5–0), of which F3, F2, F1, F0 are used. Outputs: Operation (Operation2, Operation1, Operation0)]
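As a rough illustration of what such a combinational circuit computes, the sketch below uses the commonly taught textbook equations for the three Operation bits. These equations, and the ALU encodings in the comments (add = 010, subtract = 110, and = 000, or = 001, slt = 111), are assumptions taken from the standard MIPS subset and are not spelled out on the slide.

```python
def alu_control(alu_op1: int, alu_op0: int, funct: int) -> int:
    """Return the 3-bit Operation value (Operation2..Operation0).

    Assumed textbook equations:
      Operation2 = ALUOp0 OR (ALUOp1 AND F1)
      Operation1 = NOT ALUOp1 OR NOT F2
      Operation0 = ALUOp1 AND (F3 OR F0)
    """
    f0 = funct & 1
    f1 = (funct >> 1) & 1
    f2 = (funct >> 2) & 1
    f3 = (funct >> 3) & 1
    op2 = alu_op0 | (alu_op1 & f1)
    op1 = (1 - alu_op1) | (1 - f2)
    op0 = alu_op1 & (f3 | f0)
    return (op2 << 2) | (op1 << 1) | op0


# ALUOp = 00 (lw/sw) -> add, ALUOp = 01 (beq) -> subtract,
# ALUOp = 10 (R-type) -> operation chosen by the funct field.
lw_sw = alu_control(0, 0, 0)
beq = alu_control(0, 1, 0)
r_sub = alu_control(1, 0, 0b100010)
```

Only four of the six funct bits are needed because the upper two are the same for every R-type instruction in this subset.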
Implementing Main Control Unit (I)
• Input is the 6-bit op-code of each instruction
– op-code field (6 bits [31:26])
• Outputs (10-bit) are the control lines to control
– Memory (2 bits)
• MemRead, MemWrite
– Multiplexers (5 bits)
• RegDst, (Jump), Branch, MemtoReg, ALUSrc
– Registers (1 bit)
• RegWrite
– and 2-bit ALUOp (2 bits)
• ALUOp0, ALUOp1
Implementing Main Control Unit (II)
The Outputs of Main Control Unit
Implementing Main Control Unit (III)
Truth table for main control unit:
Implementing Main Control Unit (IV)
The circuit for main control unit:
[Figure: main control circuit. Inputs: Op5, Op4, Op3, Op2, Op1, Op0. Decoded instruction classes: R-format, lw, sw, beq. Outputs: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
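The main control unit can be thought of as a lookup from op-code to the nine output lines named above. The sketch below fills in the values usually given for this MIPS subset in textbooks; the op-code numbers and signal values are assumed from that standard table, and `None` stands for a don't-care (X) entry.

```python
# Columns of the assumed truth table, in the order the outputs are listed.
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0")

X = None  # don't care

# op-code -> output values (assumed standard values for R-format, lw, sw, beq)
TRUTH_TABLE = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 1, 0),  # R-format
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0, 0),  # lw
    0b101011: (X, 1, X, 0, 0, 1, 0, 0, 0),  # sw
    0b000100: (X, 0, X, 0, 0, 0, 1, 0, 1),  # beq
}


def main_control(opcode: int) -> dict:
    """Return the control-line settings for a 6-bit op-code."""
    return dict(zip(SIGNALS, TRUTH_TABLE[opcode]))


lw_signals = main_control(0b100011)
```

In hardware the same table becomes a small two-level circuit: each instruction class is an AND of the six op bits (or their complements), and each output is an OR of the classes in which it is 1.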
Handle the Jump Instruction
• For jump instruction, the target address can be formed with the
concatenation of
• For main control unit, add an output control signal Jump, which is “1” when
the 6-bit op-code matches that of instruction j.
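A minimal sketch of the jump-address formation, assuming the standard MIPS convention: the upper four bits of PC + 4, followed by the 26-bit target field, followed by two zero bits. The function name and the example addresses are hypothetical.

```python
def jump_target(pc: int, instr: int) -> int:
    """Form the 32-bit jump address under the assumed MIPS convention."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    target26 = instr & 0x03FFFFFF           # Instr[25-0]
    upper4 = pc_plus_4 & 0xF0000000         # (PC + 4)[31-28]
    return upper4 | (target26 << 2)         # shift left 2 appends "00"


# j with target field 0x0100004, fetched from address 0x00400000
j_instr = (0b000010 << 26) | 0x0100004
address = jump_target(0x00400000, j_instr)
```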
Implementation of a MIPS Processor
Instruction Critical Paths
What is the clock cycle time assuming negligible delays for
muxes, control unit, sign extend, PC access, shift left 2, wires,
setup and hold times except:
Instruction and Data Memory (200 ps)
ALU and adders (200 ps)
Register File access (reads or writes) (100 ps)
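One way to work this out is to add up the delays of the units each instruction class passes through and take the maximum. The unit lists per class below are an assumption based on the usual single-cycle datapath (for example, lw uses instruction memory, a register read, the ALU, data memory, and a register write); the delays are the ones given above.

```python
IM = DM = 200   # instruction and data memory (ps)
ALU = 200       # ALU and adders (ps)
REG = 100       # register file read or write (ps)

# Units on the critical path of each instruction class (assumed).
PATHS = {
    "R-type": [IM, REG, ALU, REG],
    "lw":     [IM, REG, ALU, DM, REG],
    "sw":     [IM, REG, ALU, DM],
    "beq":    [IM, REG, ALU],
    "jump":   [IM],
}

latency = {name: sum(units) for name, units in PATHS.items()}

# The single clock must fit the slowest instruction.
cycle_time = max(latency.values())
```

Under these assumptions lw is the slowest instruction, so every other instruction is given the same cycle length even though it needs less time.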
[Figure: clock (Clk) timing for Cycle 1 and Cycle 2, with lw in one cycle and sw followed by Waste in the other]
• Uses the clock cycle inefficiently – the clock cycle must
be timed to accommodate the slowest instruction
– especially problematic for more complex instructions like
floating point multiply
but
• Is simple and easy to understand
How Can We Make It Faster?
• Start fetching and executing the next instruction
before the current one has completed
– Pipelining – (all?) modern processors are pipelined for
performance
– Remember the performance equation:
CPU time = CPI * CC * IC
• Under ideal conditions and with a large number of
instructions, the speedup from pipelining is
approximately equal to the number of pipe stages
– A five stage pipeline is nearly five times faster because the
CC is nearly five times faster
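The "large number of instructions" condition can be made concrete with a small sketch. It assumes perfectly balanced stages and no stalls, and measures time in units of one stage delay; the function name is illustrative.

```python
def pipeline_speedup(n_instr: int, stages: int) -> float:
    """Ideal speedup of a pipelined over a non-pipelined design."""
    unpipelined = n_instr * stages        # each instruction runs start to finish
    pipelined = stages + (n_instr - 1)    # fill the pipe once, then one per cycle
    return unpipelined / pipelined


few = pipeline_speedup(1, 5)          # a single instruction gains nothing
some = pipeline_speedup(10, 5)
many = pipeline_speedup(10**6, 5)     # approaches the number of stages
```

The speedup only approaches the number of stages when the time to fill the pipeline is small compared with the length of the run.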
[Figure: pipelined datapath driven by the System Clock. Components: PC, Add (+4), Instruction Memory, Register File, Sign Extend (16 to 32), Shift left 2, Add, ALU, Data Memory, and the pipeline registers up to MEM/WB]
MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
– and held in the state registers between pipeline stages
[Figure: pipelined datapath with control. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. The Control block sits in the decode stage. Control signals shown: PCSrc, Branch, RegWrite, MemtoReg, ALUSrc, MemRead, ALUOp (into ALU cntrl), RegDst]
Pipeline Control
• IF Stage: read Instr Memory (always asserted)
and write PC (on System Clock)
• ID Stage: no optional control signals to set
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for Inst 0 to Inst 4 in program order]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) for five instructions in program order that share a single memory; annotation: Reading instruction from memory]
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for an unlabeled first instruction, Inst 1, Inst 2, and add $t2,$t1, in program order]
Register file access hazard: avoided by doing reads in the second half of the cycle and writes in the first half
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the following sequence]
add $t1,$s0,$s1
sub $t4,$t1,$t5
and $t6,$t1,$t7
or $t8,$t1,$t9
xor $t4,$t1,$t5
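A short sketch of which instructions in this sequence read $t1 too early. It assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second half, so a consumer is safe once it is at least three instructions behind the producer. The tuple layout and function name are illustrative.

```python
# (mnemonic, destination register, source registers)
program = [
    ("add", "$t1", ["$s0", "$s1"]),
    ("sub", "$t4", ["$t1", "$t5"]),
    ("and", "$t6", ["$t1", "$t7"]),
    ("or",  "$t8", ["$t1", "$t9"]),
    ("xor", "$t4", ["$t1", "$t5"]),
]


def raw_hazards(prog, window=2):
    """Return (producer index, consumer index) pairs that read a stale value.

    A producer fetched in cycle c writes back in cycle c + 4; a consumer d
    instructions later reads in cycle c + d + 1, so only d = 1 and d = 2
    are hazards under the stated assumptions.
    """
    found = []
    for i, (_, dest, _) in enumerate(prog):
        for j in range(i + 1, min(i + 1 + window, len(prog))):
            if dest in prog[j][2]:
                found.append((i, j))
    return found


hazards = raw_hazards(program)
names = [(program[i][0], program[j][0]) for i, j in hazards]
```

Here sub and and read $t1 before add has written it back, while or reads it in the same cycle as the write and xor reads it afterwards.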
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the following sequence]
lw $t1,4($t2)
sub $t4,$t1,$t5
and $t6,$t1,$t7
or $t8,$t1,$t9
xor $t4,$t1,$t5
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence beq, lw, Inst 3, Inst 4 in program order]
Summary
• All modern day processors use pipelining
• Pipelining doesn’t help the latency of a single task; it helps the
throughput of the entire workload
• Potential speedup: CPI = 1
• Pipeline rate limited by slowest pipeline stage
– Unbalanced pipe stages make for inefficiencies
– The time to “fill” pipeline and time to “drain” it can
impact speedup for deep pipelines and short code runs
• Must detect and resolve hazards
– Stalling negatively affects CPI (makes CPI larger than the
ideal of 1)
Problem
In a pipeline system, there are five stages:
- IF: Fetch an instruction from instruction memory.
- ID: Decode the instruction and read register values
- EX: Perform an ALU operation as specified by the instruction
- MEM: Access the memory to read or write data.
- WB: Write the result to one of the registers in the register file.
Line
1 Loop: addi $t2, $zero, 10
2 Loop2: addi $s2, $s2, 2
3 subi $t2, $t2, 1
4 bne $t2, $zero, Loop2
5 subi $t1, $t1, 1
6 bne $t1, $zero, Loop
Problem
• Each stage takes one clock cycle. The bne instruction will
finish in the 3rd stage, while addi and subi will finish in the
5th stage. Writing to a register is done in the first half of a
clock cycle while reading is done in the second half cycle.
Assume all branches are perfectly predicted (no control
hazards), and that NOP instructions take only 2 stages (IF
and ID). The individual stages of the datapath have the
following latencies:
IF ID EX MEM WB
400 ps 250 ps 200 ps 400 ps 140 ps
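Two quantities commonly derived from a latency table like this are the pipelined clock period (set by the slowest stage) and the single-cycle clock period (the sum of all stages). The sketch below computes both from the values above; treating these as the quantities of interest is an assumption.

```python
stage_ps = {"IF": 400, "ID": 250, "EX": 200, "MEM": 400, "WB": 140}

# Pipelined: every stage gets one clock, so the clock must fit the slowest stage.
pipelined_cycle = max(stage_ps.values())

# Non-pipelined (single-cycle): one clock must cover all five stages in sequence.
single_cycle = sum(stage_ps.values())

# Latency of one instruction that flows through all five pipeline stages.
pipelined_latency = len(stage_ps) * pipelined_cycle
```

Note that the pipelined latency of a single instruction is longer than the single-cycle latency because the stages are unbalanced; the gain is in throughput.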
Clock cycles 1 to 13; each row enters IF one cycle after the row above it.
Line  Instruction            IF cycle  IF  ID (reads)  EX  MEM  WB (writes)
1     L1: lw $t1, 40($t6)    1         IF  $t6         EX  MEM  $t1
2     beq $t2, $t3, L2       2         IF  $t2, $t3    EX  X    X
      NOP                    3         IF  ID          X   X    X
3     add $t1, $t1, $t4      4         IF  $t4, $t1    EX  X    $t1
      NOP                    5         IF  ID          X   X    X
      NOP                    6         IF  ID          X   X    X
4     L2: beq $t1, $t2, L1   7         IF  $t1, $t2    EX  X    X
5     sw $t2, 20($t4)        8         IF  $t2, $t4    EX  MEM  X
6     and $t1, $t1, $t4      9         IF  $t1, $t4    EX  X    $t1
X: stage not used by the instruction.
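The NOP placement for this sequence can be reproduced with a small sketch. It assumes no forwarding, register writes in the first half of a cycle and reads in the second half, and that every instruction (including a NOP) occupies one issue slot; under these assumptions a consumer must be at least three slots behind its producer. The function and tuple layout are illustrative.

```python
NOP = ("nop", None, [])

# (mnemonic, destination register or None, source registers)
program = [
    ("lw",  "$t1", ["$t6"]),
    ("beq", None,  ["$t2", "$t3"]),
    ("add", "$t1", ["$t1", "$t4"]),
    ("beq", None,  ["$t1", "$t2"]),
    ("sw",  None,  ["$t2", "$t4"]),
    ("and", "$t1", ["$t1", "$t4"]),
]


def insert_nops(prog, min_distance=3):
    """Pad the program with NOPs so every consumer trails its producer enough."""
    scheduled = []
    for instr in prog:
        sources = instr[2]
        needed = 0
        # Look back at the slots that are still too close to be read safely.
        for back, prev in enumerate(reversed(scheduled), start=1):
            if back >= min_distance:
                break
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([NOP] * needed)
        scheduled.append(instr)
    return scheduled


schedule = insert_nops(program)
mnemonics = [instr[0] for instr in schedule]

# With a five-stage pipeline, the last slot finishes 4 cycles after it starts.
total_cycles = len(schedule) + 4
```

The result has one NOP between the first beq and add (waiting for lw to write $t1) and two NOPs between add and the second beq, for 13 clock cycles in total.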
Problem
• Assume that branches are always not taken (predicted to
be FALSE). Calculate the number of clock cycles needed to
finish the instructions, after solving data hazards, if:
• Initially: $t1=0, $t2=$t3=$t4=50, $t6=10, and M[50]=75
Clock cycles 1 2 3 4 5 6 7 8 9 10
Line  Instruction            IF  ID (reads)  EX  MEM  WB (writes)
1     L1: lw $t1, 40($t6)    IF  $t6         EX  MEM  $t1
2     beq $t2, $t3, L2       IF  $t2, $t3    EX  X    X
3     add $t1, $t1, $t4      IF  $t4, $t1    EX  X    $t1
4     L2: beq $t1, $t2, L1   IF  $t1, $t2    EX  X    X
5     sw $t2, 20($t4)        IF  $t2, $t4    EX  MEM  X
X: stage not used by the instruction.