
Lecture10 - chapter4-p2

The document outlines the design and implementation of a single-cycle datapath for a MIPS processor, detailing the assembly of datapath segments, control lines, and multiplexors. It discusses the control unit's role in managing instruction execution, including ALU operations and memory access, while highlighting the advantages and disadvantages of single-cycle versus pipelined architectures. Pipelining is presented as a method to enhance performance by overlapping instruction execution stages, ultimately improving throughput despite not reducing individual instruction latency.

Uploaded by

Rosette Atteya

Creating a Single Datapath from the Parts

• Assemble the datapath segments and add control lines
and multiplexors as needed
• Single cycle design – fetch, decode and execute each
instruction in one clock cycle
– no datapath resource can be used more than once per
instruction, so some must be duplicated (e.g., separate
Instruction Memory and Data Memory, several adders)
– multiplexors needed at the input of shared elements with
control lines to do the selection
– write signals to control writing to the Register File and Data
Memory

• Cycle time is determined by length of the longest path


Fetch, R, and Memory Access Portions

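[Figure: datapath combining the instruction fetch portion (PC, Instruction Memory, adder computing PC + 4), the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), the ALU (zero and ovf outputs), a 16-to-32-bit Sign Extend unit, and the Data Memory (Address, Write Data, Read Data). Control lines shown: RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg.]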
Adding the Control
• Selecting the operations to perform (ALU,
Register File and Memory read/write)
• Controlling the flow of data (multiplexor inputs)
Instruction formats (bit positions):

R-type:  op [31-26]  rs [25-21]  rt [20-16]  rd [15-11]  shamt [10-6]  funct [5-0]
I-type:  op [31-26]  rs [25-21]  rt [20-16]  address offset [15-0]
J-type:  op [31-26]  target address [25-0]

• Observations
– op field always in bits 31-26
– addr of registers to be read are always specified by the rs field (bits 25-21) and
rt field (bits 20-16); for lw and sw rs is the base register
– addr. of register to be written is in one of two places – in rt (bits 20-16) for
lw; in rd (bits 15-11) for R-type instructions
– offset for beq, lw, and sw always in bits 15-0
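The field positions listed above can be checked with a few shifts and masks. The following is a minimal sketch in Python; the function name `decode_fields` and the dictionary keys are illustrative choices, not part of the lecture.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields named on the slide."""
    return {
        "op": (instr >> 26) & 0x3F,       # bits 31-26, present in every format
        "rs": (instr >> 21) & 0x1F,       # bits 25-21
        "rt": (instr >> 16) & 0x1F,       # bits 20-16
        "rd": (instr >> 11) & 0x1F,       # bits 15-11 (R-type only)
        "shamt": (instr >> 6) & 0x1F,     # bits 10-6  (R-type only)
        "funct": instr & 0x3F,            # bits 5-0   (R-type only)
        "imm16": instr & 0xFFFF,          # bits 15-0  (offset for beq, lw, sw)
        "target26": instr & 0x03FFFFFF,   # bits 25-0  (J-type only)
    }


# Example word: add $t1, $s0, $s1  (op=0, rs=$s0=16, rt=$s1=17, rd=$t1=9, funct=0x20)
word = (0 << 26) | (16 << 21) | (17 << 16) | (9 << 11) | (0 << 6) | 0x20
fields = decode_fields(word)
```

All fields are extracted unconditionally, which mirrors the hardware: the datapath reads every bit range in parallel and the control signals decide which ones are actually used.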
Single Cycle Datapath with Control Unit
[Figure: single-cycle datapath with the main Control Unit. Instr[31-26] feeds the Control Unit, which drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instr[25-21] and Instr[20-16] select the registers to read, a multiplexor chooses Instr[20-16] or Instr[15-11] as the write address, Instr[15-0] is sign-extended from 16 to 32 bits, and Instr[5-0] feeds the ALU control. PCSrc selects between PC + 4 and the branch target (PC + 4 plus the offset shifted left by 2).]
R-type Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, illustrating the data/control flow of an R-type instruction.]
Load Word Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, illustrating the data/control flow of a load word instruction.]
Branch Instruction Data/Control Flow
[Figure: the single-cycle datapath with control unit, illustrating the data/control flow of a branch instruction.]
Adding the Jump Operation
[Figure: the single-cycle datapath with control unit, extended for jump. Instr[25-0] (26 bits) is shifted left by 2 to give 28 bits and combined with PC+4[31-28] to form the 32-bit jump address; an additional multiplexor controlled by the new Jump signal selects between this address and the output of the PCSrc multiplexor.]
Fig.4.24 (p.329)
Implementing the Control Units
• The control unit includes mainly
1. ALU control unit
2. Main control unit
• Using only combinational circuits (simple!)
• Inputs are from each instruction’s
– op-code field (6 bits [31:26]) for all instructions, and
– funct field (6 bits [5:0]) for R-type instructions.
• Outputs are the control lines to control
– ALU,
– Multiplexers,
– registers, and
– memory
Implementing ALU Control Unit (I)
• Inputs: 6-bit function code, plus 2-bit ALUOp from
main control unit
• Outputs: 4-bit ALU control lines used to decide
which operation ALU performs
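As a sketch of what this unit computes, the mapping below follows the usual textbook encoding for this MIPS subset. The specific 4-bit codes and funct values are an assumption taken from the standard Patterson and Hennessy design, since the slides themselves do not list them; the names `alu_control` and `FUNCT_TO_ALU` are illustrative.

```python
# 4-bit ALU control codes (assumed: standard textbook encoding)
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# funct field (Instr[5-0]) of the R-type instructions in the subset
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Combine the 2-bit ALUOp with the 6-bit funct field into 4 ALU control bits."""
    if alu_op == 0b00:          # lw / sw: add base register and offset
        return ALU_ADD
    if alu_op == 0b01:          # beq: subtract and test the zero output
        return ALU_SUB
    if alu_op == 0b10:          # R-type: operation is chosen by funct
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:#04b}")
```

The two-level decode keeps the main control unit small: it only has to classify the instruction into one of three ALUOp classes, and the funct bits are examined only for R-type instructions.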
Implementing ALU Control Unit (II)
• Unit Inputs: ALUOp code (2 bits) and Funct field (6 bits)
– ALUOp code is generated at the main control unit
• Unit Outputs: ALU control lines (4 bits)

• The circuit for ALU control unit
• Obtained through combinational digital logic design method

[Figure: ALU control block. Inputs: ALUOp1, ALUOp0 and funct bits F3, F2, F1, F0 (from F(5-0)). Outputs: Operation2, Operation1, Operation0.]
Implementing Main Control Unit (I)
• Inputs are 6-bit op-code from all instructions
– op-code field (6 bits [31:26])
• Outputs (10-bit) are the control lines to control
– Memory (2 bits)
• MemRead, MemWrite
– Multiplexers (5 bits)
• RegDst, (Jump), Branch, MemtoReg, ALUSrc
– Registers (1 bit)
• RegWrite
– ALUOp (2 bits)
• ALUOp0, ALUOp1
Implementing Main Control Unit (II)
The Outputs of Main Control Unit
Implementing Main Control Unit (III)
Truth table for main control unit:
Implementing Main Control Unit (IV)
The circuit for main control unit:
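[Figure: main control circuit. Inputs: Op5, Op4, Op3, Op2, Op1, Op0. The opcode is decoded into four instruction lines (R-format, lw, sw, beq), which are combined to produce the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1 and ALUOp0.]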
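The behaviour of the main control unit can be written as a lookup table from opcode to control lines. The sketch below takes the signal values from the control table given later in these slides (Pipeline Control); the opcode numbers are the standard MIPS encodings, and the `Control` tuple and `MAIN_CONTROL` names are illustrative. `None` stands for a don't-care (X).

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    reg_dst: Optional[int]
    alu_src: int
    mem_to_reg: Optional[int]
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int  # two bits: ALUOp1 ALUOp0


MAIN_CONTROL = {
    0x00: Control(1,    0, 0,    1, 0, 0, 0, 0b10),  # R-format
    0x23: Control(0,    1, 1,    1, 1, 0, 0, 0b00),  # lw
    0x2B: Control(None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    0x04: Control(None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}


def main_control(opcode: int) -> Control:
    """Look up the control lines for Instr[31-26]."""
    return MAIN_CONTROL[opcode]
```

Because the unit is purely combinational, a table like this is an exact description of it: every output depends only on the six opcode bits.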
Handle the Jump Instruction

• For jump instruction, the target address can be formed with the
concatenation of

– The upper 4 bits of [PC]+4


– The 26-bit immediate field of the jump instruction
– The bits 00

• For main control unit, add an output control signal Jump, which is “1” when
the 6-bit op-code matches that of instruction j.
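The concatenation described above can be sketched in Python as follows. The function name `jump_target` and the example addresses are illustrative only.

```python
def jump_target(pc: int, instr: int) -> int:
    """Form the 32-bit jump address from PC + 4 and the 26-bit target field."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000      # upper 4 bits of PC + 4
    imm26 = instr & 0x03FFFFFF           # 26-bit immediate field, Instr[25-0]
    return upper4 | (imm26 << 2)         # shifting left by 2 appends the bits 00


# Illustrative example: a j instruction (opcode 2) at address 0x40000000
example = jump_target(0x40000000, (0x02 << 26) | 0x0100000)
```

Shifting by two is valid because instructions are word aligned, so the two low address bits are always zero and need not be stored in the instruction.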
Implementation of a MIPS Processor
Instruction Critical Paths
• What is the clock cycle time assuming negligible delays for
muxes, control unit, sign extend, PC access, shift left 2, wires,
setup and hold times except:
– Instruction and Data Memory (200 ps)
– ALU and adders (200 ps)
– Register File access (reads or writes) (100 ps)

(all delays in ps)
Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
R-type   200     100      200              100      600
load     200     100      200      200     100      800
store    200     100      200      200              700
beq      200     100      200                       500
jump     200                                        200
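The totals in the table are sums of the unit delays along each instruction's path, and the single-cycle clock must cover the largest of them. A small sketch of that arithmetic follows; the dictionary names are illustrative.

```python
# Unit delays from the slide, in picoseconds
DELAY_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Functional units on each instruction's critical path
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

totals = {name: sum(DELAY_PS[unit] for unit in units) for name, units in PATHS.items()}

# A single-cycle design has one clock period for all instructions,
# so it is set by the slowest one (the load).
cycle_time_ps = max(totals.values())
```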
Single Cycle Disadvantages & Advantages
• One instruction is completed in one single cycle
• Cycle time has to be chosen as the max time delay
– i.e., 800 ps (the load instruction)

[Figure: clock timeline over Cycle 1 and Cycle 2: lw, then sw, then "Waste"]

• Uses the clock cycle inefficiently – the clock cycle must
be timed to accommodate the slowest instruction
– especially problematic for more complex instructions like
multi-cycle floating-point operations
but
• Is simple and easy to understand
How Can We Make It Faster?
• Start fetching and executing the next instruction
before the current one has completed
– Pipelining – (all?) modern processors are pipelined for
performance
– Remember the performance equation:
CPU time = CPI * CC * IC
• Under ideal conditions and with a large number of
instructions, the speedup from pipelining is
approximately equal to the number of pipe stages
– A five stage pipeline is nearly five times faster because the
CC is nearly five times faster

• Fetch (and execute) more than one instruction at a time
– Superscalar processing – stay tuned
The Five Stages of Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

lw IFetch Dec Exec Mem WB

• IFetch: Instruction Fetch and Update PC


• Dec: Registers Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data
Memory
• WB: Write the result data into the register file
A Pipelined MIPS Processor
• Start the next instruction before the current one has completed
– improves throughput - total amount of work done in a given time
– instruction latency (execution time, delay time, response time - time
from the start of an instruction to its completion) is not reduced

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB

- clock cycle (pipeline stage time) is limited by the slowest stage
- some stages don't need the whole clock cycle (e.g., WB)
- for some instructions, some stages are wasted cycles (i.e., nothing is
done during that cycle for that instruction)
Single Cycle versus Pipeline
Single Cycle Implementation (CC = 800 ps):
[Figure: clock timeline over Cycle 1 and Cycle 2: lw, then sw, then "Waste"]

Pipeline Implementation (CC = 200 ps):
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB
(figure label: 400 ps)

• To complete an entire instruction in the pipelined case takes
1000 ps (as compared to 800 ps for the single cycle case). Why?
• How long does each take to complete 1,000,000 adds?
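Both questions come down to simple arithmetic, sketched below under the assumption of an ideal five-stage pipeline with no stalls. The function names are illustrative.

```python
def single_cycle_time_ps(n_instructions: int, cc_ps: int = 800) -> int:
    """Each instruction takes one full (long) clock cycle."""
    return n_instructions * cc_ps


def pipelined_time_ps(n_instructions: int, stages: int = 5, cc_ps: int = 200) -> int:
    """The first instruction needs `stages` cycles; each later one finishes a cycle after."""
    return (stages + n_instructions - 1) * cc_ps


# Latency of one instruction in the pipeline: 5 stages x 200 ps each
latency_ps = pipelined_time_ps(1)

n = 1_000_000
single = single_cycle_time_ps(n)     # 800,000,000 ps
piped = pipelined_time_ps(n)         # 200,000,800 ps
speedup = single / piped             # just under 4
```

The latency is 1000 ps because every stage is stretched to the 200 ps of the slowest stage, even the 100 ps register accesses. The speedup is about 4 rather than 5 for the same reason: the stages are not perfectly balanced.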
Pipelining the MIPS ISA
• What makes it easy
– all instructions are the same length (32 bits)
• can fetch in the 1st stage and decode in the 2nd stage
– few instruction formats (three) with symmetry across
formats
• can begin reading register file in 2nd stage
– memory operations occur only in loads and stores
• can use the execute stage to calculate memory addresses
– each instruction writes at most one result (i.e., changes
the machine state) and does it in the last few pipeline
stages (MEM or WB)
– operands must be aligned in memory so a single data
transfer takes only one data memory access
MIPS Pipeline Datapath Additions/Mods
• State registers between each pipeline stage to isolate them
[Figure: pipelined datapath divided into five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) separated by the pipeline state registers IF/ID, ID/EX, EX/MEM and MEM/WB, all driven by the System Clock. The datapath elements are those of the single-cycle design: PC, Instruction Memory, PC + 4 adder, Register File, Sign Extend (16 to 32), Shift left 2 and branch adder, ALU, Data Memory.]
MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
– and held in the state registers between pipeline stages
[Figure: the pipelined datapath with a Control block in the ID stage. The control signals are carried through the ID/EX, EX/MEM and MEM/WB registers to the stage that uses them: RegDst, ALUOp and ALUSrc (with the ALU cntrl block) in EX; Branch, MemRead and MemWrite in MEM, where PCSrc is formed; MemtoReg and RegWrite in WB.]
Pipeline Control
• IF Stage: read Instr Memory (always asserted)
and write PC (on System Clock)
• ID Stage: no optional control signals to set

       EX Stage                          MEM Stage                   WB Stage
       RegDst  ALUOp1  ALUOp0  ALUSrc    Brch  MemRead  MemWrite     RegWrite  MemtoReg
R      1       1       0       0         0     0        0            1         0
lw     0       0       0       1         0     1        0            1         1
sw     X       0       0       1         0     0        1            0         X
beq    X       0       1       0         1     0        0            0         X
Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the sequence of resources it uses: IM, Reg, ALU, DM, Reg]

• Can help with answering questions like:


– How many cycles does it take to execute this code?
– What is the ALU doing during cycle 4?
– Is there a hazard, why does it occur, and how can it be fixed?
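The first two questions can also be answered programmatically. The sketch below assumes an ideal pipeline in which each instruction enters one cycle after the previous one and never stalls; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage occupied in a given cycle (instr_index is 0-based, cycle is 1-based)."""
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def total_cycles(n_instructions: int) -> int:
    """Cycles to run n instructions: fill the pipeline once, then one per cycle."""
    return n_instructions + len(STAGES) - 1


program = ["lw", "sw", "R-type"]

# "What is the ALU doing during cycle 4?" -> whichever instruction is in EX
in_ex_cycle4 = [name for i, name in enumerate(program) if stage_in_cycle(i, 4) == "EX"]
```

For the three-instruction example, `total_cycles(3)` gives 7 and the ALU is working on the `sw` in cycle 4.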
Why Pipeline? For Performance!
[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order (Inst 0 to Inst 4) on the vertical axis; each instruction passes through IM, Reg, ALU, DM, Reg. The initial cycles are labelled "Time to fill the pipeline".]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Can Pipelining Get Us Into Trouble?
• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource by
two different instructions at the same time
– data hazards: attempt to use data before it is ready
• An instruction’s source operand(s) are produced by a prior
instruction still in the pipeline
– control hazards: attempt to make a decision about
program control flow before the condition has been
evaluated and the new PC target address calculated
• branch and jump instructions, exceptions

• Can usually resolve hazards by waiting
– pipeline control must detect the hazard
– and take action to resolve hazards
A Single Memory Would Be a Structural Hazard
[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order on the vertical axis, for lw followed by Inst 1 to Inst 4; each instruction passes through Mem, Reg, ALU, Mem, Reg. Labels: "Reading data from memory" (the lw) and "Reading instruction from memory" (a later instruction).]

• Fix with separate instr and data memories


How About Register File Access?
Fix register file access hazard by doing reads in the second half of the cycle and
writes in the first half

[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order on the vertical axis, for the sequence add $t1,... ; Inst 1; Inst 2; add $t2,$t1,... ; each instruction passes through IM, Reg, ALU, DM, Reg. Labels: "clock edge that controls register writing" and "clock edge that controls loading of pipeline state registers".]
Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction) for the sequence:]

add $t1,$s0,$s1
sub $t4,$t1,$t5
and $t6,$t1,$t7
or  $t8,$t1,$t9
xor $t4,$t1,$t5

• Read before write data hazard
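Which of the later instructions actually read `$t1` too early can be checked mechanically. The sketch below assumes no forwarding, and that the register file is written in the first half of a cycle and read in the second half, so a consumer is safe once it is at least three instructions behind the producer. The tuple layout and the name `raw_hazards` are illustrative.

```python
# (mnemonic, destination register, source registers)
program = [
    ("add", "$t1", ("$s0", "$s1")),
    ("sub", "$t4", ("$t1", "$t5")),
    ("and", "$t6", ("$t1", "$t7")),
    ("or",  "$t8", ("$t1", "$t9")),
    ("xor", "$t4", ("$t1", "$t5")),
]

# Producer WB is 4 cycles after its IF; consumer ID is 1 cycle after its IF.
# With write-then-read in the same cycle, a distance of 3 instructions is enough.
MIN_SAFE_DISTANCE = 3


def raw_hazards(prog):
    """List (producer, consumer, register) triples that read before the write."""
    hazards = []
    for j, (_, _, sources) in enumerate(prog):
        for i in range(max(0, j - (MIN_SAFE_DISTANCE - 1)), j):
            dest = prog[i][1]
            if dest is not None and dest in sources:
                hazards.append((prog[i][0], prog[j][0], dest))
    return hazards


found = raw_hazards(program)
```

Under these assumptions only `sub` and `and` are affected: `or` reads `$t1` in the same cycle in which `add` writes it, and `xor` reads it one cycle later.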


Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction, instruction order on the vertical axis) for the sequence:]

lw  $t1,4($t2)
sub $t4,$t1,$t5
and $t6,$t1,$t7
or  $t8,$t1,$t9
xor $t4,$t1,$t5

• Load-use data hazard


Branch Instructions Cause Control Hazards
• Dependencies backward in time cause hazards

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg per instruction, instruction order on the vertical axis) for the sequence beq, lw, Inst 3, Inst 4.]
Summary
• All modern day processors use pipelining
• Pipelining doesn’t help latency of single task, it helps
throughput of entire workload
• Potential speedup: CPI = 1
• Pipeline rate limited by slowest pipeline stage
– Unbalanced pipe stages make for inefficiencies
– The time to “fill” pipeline and time to “drain” it can
impact speedup for deep pipelines and short code runs
• Must detect and resolve hazards
– Stalling negatively affects CPI (makes CPI larger than the
ideal of 1)
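The effect of stalls on CPI and on total cycle count can be sketched as follows, assuming a five-stage pipeline in which every stall (or inserted NOP) costs exactly one cycle. The function names and the example numbers are illustrative.

```python
def effective_cpi(instruction_count: int, stall_cycles: int) -> float:
    """Ideal CPI of 1 plus the average number of stall cycles per instruction."""
    return 1 + stall_cycles / instruction_count


def pipelined_cycles(instruction_count: int, stall_cycles: int, stages: int = 5) -> int:
    """Fill time (stages - 1) plus one cycle per instruction and per stall."""
    return instruction_count + stall_cycles + stages - 1


# Illustrative run: 6 instructions that need 3 stall cycles in total
cpi = effective_cpi(6, 3)
cycles = pipelined_cycles(6, 3)
```

For short code runs the fill time dominates: here 4 of the 13 cycles are spent filling the pipeline, which is why the steady-state CPI alone understates the cost.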
Problem
In a pipeline system, there are five stages:
- IF: Fetch an instruction from instruction memory.
- ID: Decode the instruction and read register values
- EX: Perform an ALU operation as specified by the instruction
- MEM: Access the memory to read or write data.
- WB: Write the result to one of the registers in the register file.

Line
1  Loop:   addi $t2, $zero, 10
2  Loop2:  addi $s2, $s2, 2
3          subi $t2, $t2, 1
4          bne  $t2, $zero, Loop2
5          subi $t1, $t1, 1
6          bne  $t1, $zero, Loop
Problem
• Each stage takes one clock cycle. The bne instruction will
finish in the 3rd stage, while addi and subi will finish in the
5th stage. Writing to a register is done in the first half of a
clock cycle while reading is done in the second half cycle.
Assume all branches are perfectly predicted (no control
hazards), and that NOP instructions take only 2 stages (IF
and ID). The individual stages of the datapath have the
following latencies:
IF ID EX MEM WB
400 ps 250 ps 200 ps 400 ps 140 ps

Draw the pipeline system in your draft sheet and use it to
choose the best answer in the following questions.
Problem
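Clock cycle (1 to 10) in which each instruction occupies each stage. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX   MEM     WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3    4       5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4    5 (X)   6 (X)
3         add $t1, $t1, $t4    3    4 ($t4, $t1)   5    6 (X)   7 ($t1)
4  L2:    beq $t1, $t2, L1     4    5 ($t1, $t2)   6    7 (X)   8 (X)
5         sw  $t2, 20($t4)     5    6 ($t2, $t4)   7    8       9 (X)
6         and $t1, $t1, $t4    6    7 ($t1, $t4)   8    9 (X)   10 ($t1)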
Problem
• Assume all branches are perfectly predicted (no control
hazards). Find all data hazards, given initially: $t1=0,
$t2=$t3=$t4=50, $t6=10, and M[50]=75
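Clock cycle (1 to 10) in which each instruction occupies each stage. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX   MEM     WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3    4       5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4    5 (X)   6 (X)
3  L2:    beq $t1, $t2, L1     3    4 ($t1, $t2)   5    6 (X)   7 (X)
4         sw  $t2, 20($t4)     4    5 ($t2, $t4)   6    7       8 (X)
5         and $t1, $t1, $t4    5    6 ($t1, $t4)   7    8 (X)   9 ($t1)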
Problem
• Solve the hazards in (a) by inserting the minimum number of NOP
instructions. (Draw a figure showing the active pipeline stage at each
clock cycle).
• How many clock cycles does it take to finish execution of these
instructions?
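Clock cycle (1 to 10) in which each instruction occupies each stage. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX      MEM     WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3       4       5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4       5 (X)   6 (X)
3         NOP                  3    4              5 (X)   6 (X)   7 (X)
4  L2:    beq $t1, $t2, L1     4    5 ($t1, $t2)   6       7 (X)   8 (X)
5         sw  $t2, 20($t4)     5    6 ($t2, $t4)   7       8       9 (X)
6         and $t1, $t1, $t4    6    7 ($t1, $t4)   8       9 (X)   10 ($t1)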
Problem
• Repeat (a) and (b), given initially: $t1=0,
$t2=60, $t3=$t4=50, $t6=10, and M[50]=75
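Clock cycle (1 to 10) in which each instruction occupies each stage. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX   MEM     WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3    4       5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4    5 (X)   6 (X)
3         add $t1, $t1, $t4    3    4 ($t4, $t1)   5    6 (X)   7 ($t1)
4  L2:    beq $t1, $t2, L1     4    5 ($t1, $t2)   6    7 (X)   8 (X)
5         sw  $t2, 20($t4)     5    6 ($t2, $t4)   7    8       9 (X)
6         and $t1, $t1, $t4    6    7 ($t1, $t4)   8    9 (X)   10 ($t1)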
Problem

Clock cycle (1 to 13) in which each instruction occupies each stage. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX       MEM      WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3        4        5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4        5 (X)    6 (X)
          NOP                  3    4              5 (X)    6 (X)    7 (X)
3         add $t1, $t1, $t4    4    5 ($t4, $t1)   6        7 (X)    8 ($t1)
          NOP                  5    6              7 (X)    8 (X)    9 (X)
          NOP                  6    7              8 (X)    9 (X)    10 (X)
4  L2:    beq $t1, $t2, L1     7    8 ($t1, $t2)   9        10 (X)   11 (X)
5         sw  $t2, 20($t4)     8    9 ($t2, $t4)   10       11       12 (X)
6         and $t1, $t1, $t4    9    10 ($t1, $t4)  11       12 (X)   13 ($t1)
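The NOP counts in the exercises above can be reproduced with a short scheduling sketch. It assumes no forwarding, perfectly predicted branches (so the input is the sequence of instructions actually executed), register writes in the first half of a cycle and reads in the second half, and that every NOP occupies one issue slot. Under these assumptions a consumer must issue at least three slots after the instruction that produces its operand. The names `insert_nops` and `MIN_DISTANCE` are illustrative.

```python
NOP = ("nop", None, ())
MIN_DISTANCE = 3   # producer WB (slot + 5) must not come after consumer ID (slot + 2)
STAGES = 5


def insert_nops(trace):
    """Return the trace with the minimum number of NOPs needed to avoid RAW hazards."""
    scheduled = []
    last_write = {}                       # register -> slot of its latest writer
    for name, dest, sources in trace:
        needed = [last_write[r] + MIN_DISTANCE for r in sources if r in last_write]
        earliest = max([len(scheduled)] + needed)
        while len(scheduled) < earliest:  # pad with NOPs until the operand is ready
            scheduled.append(NOP)
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append((name, dest, sources))
    return scheduled


lw   = ("lw",  "$t1", ("$t6",))
beq1 = ("beq", None,  ("$t2", "$t3"))
add  = ("add", "$t1", ("$t1", "$t4"))
beq2 = ("beq", None,  ("$t1", "$t2"))
sw   = ("sw",  None,  ("$t2", "$t4"))
and_ = ("and", "$t1", ("$t1", "$t4"))

taken = insert_nops([lw, beq1, beq2, sw, and_])           # first branch taken: add is skipped
not_taken = insert_nops([lw, beq1, add, beq2, sw, and_])  # first branch falls through

cycles_taken = len(taken) + STAGES - 1
cycles_not_taken = len(not_taken) + STAGES - 1
```

With the first branch taken the sketch inserts one NOP and needs 10 cycles; with it not taken it inserts three NOPs and needs 13 cycles, matching the column counts of the two solved diagrams.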
Problem
• Assume that branches are always not taken (predicted to
be FALSE). Calculate number of clock cycles needed to
finish the instructions, after solving data hazards, if:
• Initially: $t1=0, $t2=$t3=$t4=50, $t6=10, and M[50]=75
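Clock cycle (1 to 10) in which each instruction occupies each stage, before the data hazards are solved. Registers read in ID and written in WB are given in parentheses; X marks a stage in which the instruction does no work.

#  Label  Instruction          IF   ID             EX   MEM     WB
1  L1:    lw  $t1, 40($t6)     1    2 ($t6)        3    4       5 ($t1)
2         beq $t2, $t3, L2     2    3 ($t2, $t3)   4    5 (X)   6 (X)
3         add $t1, $t1, $t4    3    4 ($t4, $t1)   5    6 (X)   7 ($t1)
4  L2:    beq $t1, $t2, L1     4    5 ($t1, $t2)   6    7 (X)   8 (X)
5         sw  $t2, 20($t4)     5    6 ($t2, $t4)   7    8       9 (X)
6         and $t1, $t1, $t4    6    7 ($t1, $t4)   8    9 (X)   10 ($t1)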
Problem
• Assume the individual stages of the datapath have the
following latencies:
IF ID EX MEM WB
240 ps 150 ps 120 ps 200 ps 140 ps
– What is the clock cycle time for the single cycle processor (no
pipeline)?
– What is the clock cycle time for the multi cycle processor?
• Using the clock cycle time from (e), how much time does it
take to process the instructions with pipelining in case (b)
and (c), and without pipelining (single cycle)?
• If we can split one stage of the pipelined datapath into two
new stages, each with half the latency of the original stage,
which stage would you split? What is the clock cycle time
for the pipelined processor?
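The cycle-time questions can be explored with the arithmetic below. It is a sketch under two assumptions: a single-cycle clock must cover the sum of all stage latencies, and a clock shared by separate stages is set by the slowest stage. The variable and function names are illustrative.

```python
latency_ps = {"IF": 240, "ID": 150, "EX": 120, "MEM": 200, "WB": 140}

# Single cycle: one instruction passes through every stage within one clock period.
single_cycle_cc = sum(latency_ps.values())

# One stage per clock: the period is set by the slowest stage.
per_stage_cc = max(latency_ps.values())


def split_longest(latency):
    """Split the slowest stage into two halves and return (stage, new clock period)."""
    longest = max(latency, key=latency.get)
    half = latency[longest] / 2
    new_latency = {k: v for k, v in latency.items() if k != longest}
    new_latency[longest + "a"] = half
    new_latency[longest + "b"] = half
    return longest, max(new_latency.values())


split_stage, new_cc = split_longest(latency_ps)
```

Splitting any stage other than the slowest would leave the clock period unchanged, which is why the longest stage is the one worth splitting; after the split, the next-slowest stage becomes the limit.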
