EC483_Fall2024_W7

Chapter 3 of 'Computer Architecture: A Quantitative Approach' discusses instruction-level parallelism (ILP) and its exploitation through pipelined architecture. It covers the concepts of data, name, and control dependences, as well as hazards that can occur in pipelined systems, and techniques for optimizing ILP such as loop unrolling and branch prediction. The chapter emphasizes the importance of minimizing stalls and maximizing instruction throughput in modern processors.

Computer Architecture

A Quantitative Approach, Sixth Edition

Chapter 3
Instruction-Level
Parallelism and Its
Exploitation

Copyright © 2019, Elsevier Inc. All rights reserved.
Un-pipelined Architecture

Unpipelined: start and finish a job before moving to the next.

[Figure: jobs proceed one at a time through Fetch, Decode, Execute; axes are Jobs vs. Time]
Pipelined Architecture

Pipelined: break the job into smaller stages.

[Figure: overlapped jobs, each passing through F (fetch), D (decode), X (execute) stages; axes are Jobs vs. Time]
5-Stage Pipeline

To enable pipelining, we need to hold the input to each stage stable. This requires latching the data and control signals feeding each stage in the pipeline → pipeline registers.
Clocks and Latches

[Figure: Stage 1 → latch → Stage 2 → latch, all driven by the clock Clk]

• Unpipelined: time to execute one instruction = T + Tovh
• For an N-stage pipeline, time per stage = T/N + Tovh
• Total time per instruction = N (T/N + Tovh) = T + N Tovh
• Clock cycle time = T/N + Tovh
• Clock speed = 1 / (T/N + Tovh)
• Ideal speedup = (T + Tovh) / (T/N + Tovh)
• Cycles to complete one instruction = N
• Average CPI (cycles per instr) = 1
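These timing relations are easy to sanity-check numerically. A minimal sketch in Python, assuming illustrative (hypothetical) values T = 10 ns of logic delay and Tovh = 0.5 ns of latch overhead:

```python
# Pipeline timing model from the slide: T = total unpipelined logic delay,
# Tovh = per-stage latch/clock-skew overhead, N = number of stages.
T, T_ovh = 10.0, 0.5  # ns; illustrative values, not from the slides

def cycle_time(n):
    """Clock cycle time = T/N + Tovh."""
    return T / n + T_ovh

def speedup(n):
    """Ideal speedup = (T + Tovh) / (T/N + Tovh)."""
    return (T + T_ovh) / cycle_time(n)

for n in (1, 5, 10):
    print(f"N={n:2d}  cycle={cycle_time(n):5.2f} ns  "
          f"clock={1 / cycle_time(n):4.2f} GHz  speedup={speedup(n):4.2f}")
```

Note that the speedup stays below N because the latch overhead Tovh does not shrink as stages are added.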
A 5-Stage Pipeline

[Figure: the five-stage pipeline datapath with pipeline registers between stages]
A 5-Stage Pipeline

Use the PC to access the I-cache and increment PC by 4.

[Figure: fetch stage; each cycle the PC indexes the I-cache and is incremented to PC+4]
A 5-Stage Pipeline

Read registers, compare registers, compute branch target; for now, assume branches take 2 cycles (there is enough work that branches can easily take more).
A 5-Stage Pipeline

ALU computation, effective address computation for load/store.

A 5-Stage Pipeline

Memory access to/from the data cache; stores finish in 4 cycles.

A 5-Stage Pipeline

Write the result of the ALU computation or load into the register file.
Introduction

• Pipelining became a universal technique in 1985
• Overlaps execution of instructions
• Exploits "instruction-level parallelism"

Two main approaches:

• Hardware-based dynamic approaches
  • Used in server and desktop processors
  • Not used as extensively in PMD (personal mobile device) processors
• Compiler-based static approaches
  • Not as successful outside of scientific applications
Instruction-Level Parallelism

• When exploiting instruction-level parallelism, the goal is to minimize pipeline CPI
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

• Parallelism within a basic block is limited
  • Typical size of a basic block = 3-6 instructions
  • Must optimize across branches
Instruction Dependences

• There are three different types of dependences:
  • Data dependence (true data dependence)
  • Name dependence (instructions using the same register names)
  • Control dependence (branches)

• An instruction j is data-dependent on instruction i if either of the following holds:
  – Instruction i produces a result that may be used by instruction j
  – Instruction j is data-dependent on instruction k, and instruction k is data-dependent on instruction i
Data Dependences

• Example of data dependence:

Lp: fld    f0,0(x1)   //f0=array element
    fadd.d f4,f0,f2   //add scalar in f2
    fsd    f4,0(x1)   //store result
    addi   x1,x1,-8   //decrement pointer 8 bytes
    bne    x1,x2,Lp   //branch if x1 ≠ x2
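The true (RAW) dependences in this loop body can be found mechanically. A minimal sketch, where the tuple encoding (opcode, destination, sources) is my own simplification rather than anything from the slides:

```python
# Minimal RAW (read-after-write) dependence finder for the loop body above.
insns = [
    ("fld",    "f0", ["x1"]),
    ("fadd.d", "f4", ["f0", "f2"]),
    ("fsd",    None, ["f4", "x1"]),  # a store writes memory, not a register
    ("addi",   "x1", ["x1"]),
    ("bne",    None, ["x1", "x2"]),
]

deps = []
for j, (op_j, _, srcs_j) in enumerate(insns):
    for i in range(j):                      # every earlier instruction
        op_i, dest_i, _ = insns[i]
        for s in srcs_j:
            if dest_i is not None and s == dest_i:
                deps.append((op_i, op_j, s))  # j reads what i wrote

for producer, consumer, reg in deps:
    print(f"{consumer} is data-dependent on {producer} through {reg}")
```

This finds the fld→fadd.d (f0), fadd.d→fsd (f4), and addi→bne (x1) chains within one iteration; the loop-carried dependence on x1 between iterations is not tracked by this within-body scan.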
Instruction Dependences

• Dependences are a property of programs
• Pipeline organization determines if a dependence is detected and if it causes a stall

• A data dependence conveys:
  – Possibility of a hazard
  – Order in which results must be calculated
  – Upper bound on exploitable instruction-level parallelism

• Dependences that flow through memory locations are difficult to detect
Name Dependences

• A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name

• Two types of name dependence:
  – Antidependence: Write After Read (WAR)
  – Output dependence: Write After Write (WAW)
Register Renaming

• Instructions with a name dependence can execute simultaneously or out of order if the registers are renamed (register renaming)

• Renaming can be done statically at compile time or dynamically by hardware at run time.
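Renaming can be sketched in a few lines: each write gets a fresh "physical" register and later reads are redirected to the newest mapping, which removes WAR and WAW dependences while preserving RAW. The instruction sequence and the p0, p1, … names below are illustrative assumptions, not from the slides:

```python
# Sketch of register renaming: every write receives a fresh physical register;
# later reads are redirected to the most recent mapping, so name (WAR/WAW)
# dependences disappear while true (RAW) dependences are preserved.
def rename(insns):
    mapping, next_phys, out = {}, 0, []
    for op, dest, srcs in insns:
        srcs = [mapping.get(s, s) for s in srcs]  # read the current mappings
        if dest is not None:
            mapping[dest] = f"p{next_phys}"       # allocate a fresh register
            next_phys += 1
            dest = mapping[dest]
        out.append((op, dest, srcs))
    return out

# WAW/WAR example: both fld instructions write f0, and the second fld also
# overwrites a register that fadd.d still reads.
prog = [("fld",    "f0", ["x1"]),
        ("fadd.d", "f4", ["f0", "f2"]),
        ("fld",    "f0", ["x1"]),       # WAW with first fld, WAR with fadd.d
        ("fmul.d", "f6", ["f0", "f2"])]
for op, d, s in rename(prog):
    print(op, d, s)
```

After renaming, the two fld instructions write distinct physical registers (p0 and p2), so they may execute out of order.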
Control Dependence

• A control dependence determines the ordering of an instruction i with respect to a branch instruction

if p1 {
  S1;
};
if p2 {
  S2;
}

• Instruction S1 is control-dependent on p1, and S2 is control-dependent on p2
• Control dependence is preserved by implementing control hazard detection that causes control stalls.
Control Dependence

• Can we move S1 after (if p2), or S2 before (if p1)?
• Yes! but without affecting the correctness of the program

if p1 {
  S1;
};
if p2 {
  S2;
}

• The two properties critical to program correctness are exception behavior and the data flow

add x2,x3,x4
beq x2,x0,L1
ld  x1,0(x2)
L1:

• The load instruction may cause a memory protection exception if moved before the branch
Control Dependence

• It is insufficient to just maintain data dependences, because an instruction may be data-dependent on more than one predecessor

   add x1,x2,x3
   beq x4,x0,L
   sub x1,x5,x6
L: ...
   or  x7,x1,x8

• The or instruction is data-dependent on both the add and sub instructions
• The data flow must be preserved.
• Speculation helps to lessen the impact of the control dependence while still maintaining the data flow
Value Liveness

• The property of whether a value will be used by an upcoming instruction is called liveness
• What if we knew that the register destination of the sub instruction (x4) was unused after the instruction labeled skip?

      add x1,x2,x3
      beq x12,x0,skip
      sub x4,x5,x6
      add x5,x4,x9
skip: or  x7,x8,x9

• Then we can move the sub before the beq
• This type of code scheduling is also a form of speculation, often called software speculation
Hazards

• Structural hazards: different instructions in different stages (or the same stage) conflicting for the same resource

• Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction

• Control hazards: fetch cannot continue because it does not know the outcome of an earlier branch – a special case of a data hazard – a separate category because they are treated in different ways
Structural Hazards

• Example: a unified instruction and data cache → stage 4 (MEM) and stage 1 (IF) can never coincide

• The later instruction and all its successors are delayed until a cycle is found when the resource is free → these are pipeline bubbles

• Structural hazards are easy to eliminate – increase the number of resources (for example, implement separate instruction and data caches)
Enabling and Optimizing ILP

• To enable ILP we need to:
  – Detect data dependences, either in software or hardware
  – Insert stalls whenever needed for a correct program result
  – Flush the pipeline whenever a branch is taken

• To optimize ILP we need to:
  – Minimize the number of stalls needed for a correct program result
    • Know when and how the ordering among instructions may be changed
  – Minimize flushing the pipeline
    • Predict branch outcomes.
Compiler Techniques for Exposing ILP

• Pipeline scheduling
  – Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
• Example
  ➢ C code:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

  ➢ Unscheduled RISC-V code:

Loop: fld    f0,0(x1)    //f0=array element x[i]
      fadd.d f4,f0,f2    //add scalar in f2=s
      fsd    f4,0(x1)    //store result
      addi   x1,x1,-8    //decrement pointer 8 bytes (per DW)
      bne    x1,x2,Loop  //branch if x1≠x2

Where are the data dependences in the above code? And of which type?
Compiler Techniques for Exposing ILP

➢ Scheduled RISC-V code (left: before scheduling, with stalls; right: after scheduling):

      Before scheduling              After scheduling
Loop: fld    f0,0(x1)          Loop: fld    f0,0(x1)
      stall                          addi   x1,x1,-8
      fadd.d f4,f0,f2                fadd.d f4,f0,f2
      stall                          stall
      stall                          stall
      fsd    f4,0(x1)                fsd    f4,8(x1)
      addi   x1,x1,-8                bne    x1,x2,Loop
      bne    x1,x2,Loop

Constraints: the store offset changes to 8(x1) because the addi now decrements x1 before the fsd executes.
Compiler Techniques for Exposing ILP

• Loop unrolling
  – Replicate the loop body multiple times, adjusting the loop termination code
  – Unroll by a factor of 4 (assume # elements is divisible by 4)
  – Eliminate unnecessary instructions

Loop: fld    f0,0(x1)
      fadd.d f4,f0,f2
      fsd    f4,0(x1)    //drop addi & bne
      fld    f6,-8(x1)
      fadd.d f8,f6,f2
      fsd    f8,-8(x1)   //drop addi & bne
      fld    f10,-16(x1)
      fadd.d f12,f10,f2
      fsd    f12,-16(x1) //drop addi & bne
      fld    f14,-24(x1)
      fadd.d f16,f14,f2
      fsd    f16,-24(x1)
      addi   x1,x1,-32
      bne    x1,x2,Loop

• Eliminates three branches and three decrements of x1
Compiler Techniques for Exposing ILP

• Pipeline-schedule the unrolled loop

Loop: fld    f0,0(x1)
      fld    f6,-8(x1)
      fld    f10,-16(x1)
      fld    f14,-24(x1)
      fadd.d f4,f0,f2
      fadd.d f8,f6,f2
      fadd.d f12,f10,f2
      fadd.d f16,f14,f2
      fsd    f4,0(x1)
      fsd    f8,-8(x1)
      fsd    f12,-16(x1)
      fsd    f16,-24(x1)
      addi   x1,x1,-32
      bne    x1,x2,Loop

◼ 14 cycles
◼ 3.5 cycles per element
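The quoted cycle counts follow from simple arithmetic, under the usual classroom assumption that every instruction or stall slot costs one cycle:

```python
# Scheduled but not unrolled loop body: fld, addi, fadd.d, stall, stall,
# fsd, bne -> 7 cycle slots per iteration, one array element per iteration.
cycles_scheduled = 7
per_element_scheduled = cycles_scheduled / 1

# Unrolled by 4 and scheduled: 14 instructions, no stalls remain,
# four array elements per iteration.
cycles_unrolled = 14
per_element_unrolled = cycles_unrolled / 4

print(per_element_scheduled)  # 7.0 cycles per element
print(per_element_unrolled)   # 3.5 cycles per element
```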
Compiler Techniques for Exposing ILP

❖ Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code
❖ Use different registers for different computations to avoid name dependences.
❖ Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
❖ Determine that the loads and stores in the unrolled loop can be interchanged if they are independent, i.e., they do not refer to the same address.
❖ Schedule the code, preserving any dependences needed to yield the same result as the original code.
Compiler Techniques: Limitations

❖ Loop overhead
  ❖ The amount of overhead that can be reduced decreases with each additional unroll
❖ Code size limitations
  ❖ Increase in code size → possible increase in cache miss rate
❖ Compiler limitations
  ❖ Potential shortfall in registers → register pressure.
Branch Prediction

❖ Basic 1-bit predictor:
  ❖ Predict not taken: just increment PC+4 (do nothing special)

[Figure: 2-state FSM; state 0 predicts not taken, state 1 predicts taken; a taken branch (T) moves toward state 1, a not-taken branch (N) moves toward state 0]
Basic 1-bit Predictor

▪ How does a basic 1-bit branch predictor behave on the following branch patterns?
▪ TTTTTTTTTTTNTTTTTTTTTTTTTTT…..
▪ NNNNNNNNNNNNTNNNNNNNNNNNNN….
▪ TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…..
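A short simulation makes the answers concrete. This is a sketch; the predictor's initial state is an assumption (the slides do not specify it):

```python
# Simulate the 1-bit predictor on the slide's branch patterns: the single
# bit simply remembers what the branch did last time.
def simulate_1bit(pattern, init="T"):
    state, correct = init, 0
    for outcome in pattern:
        correct += (state == outcome)  # predict the previous outcome
        state = outcome                # remember the latest outcome
    return correct / len(pattern)

print(simulate_1bit("T" * 11 + "N" + "T" * 15))  # mostly taken, one N
print(simulate_1bit("TN" * 15))                  # alternating pattern
```

On the first pattern the single N costs two mispredictions (the N itself plus the following T); on the alternating pattern almost every prediction is wrong.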

Basic 1-bit Predictor

▪ Assume 30% of instructions are branches and 60% of branches are mispredicted. Calculate the pipeline CPI if the branch misprediction penalty is 2 cycles.

Pipeline CPI
= 1 + %Branch Instructions × Branch Misprediction Rate × Branch Misprediction Penalty
= 1 + 0.3 × 0.6 × 2 = 1.36
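The same model, written as a function, also shows how sensitive CPI is to predictor accuracy (the 10% misprediction rate in the second call is an illustrative value, not from the slides):

```python
# Pipeline CPI model from the slide: stall cycles added by mispredictions.
def pipeline_cpi(branch_frac, mispredict_rate, penalty, base_cpi=1.0):
    return base_cpi + branch_frac * mispredict_rate * penalty

print(pipeline_cpi(0.3, 0.6, 2))  # slide's numbers -> 1.36
print(pipeline_cpi(0.3, 0.1, 2))  # better predictor -> 1.06
```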
Resources

▪ Memory Timing
▪ https://www.hardwaresecrets.com/understanding-ram-timings/
▪ Memory Architecture
▪ https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
▪ CS6810 Computer Architecture 87 Lectures by Rajeev
Balasubramonian
▪ https://www.youtube.com/playlist?list=PL8EC1756A7B1764F6

Resources

▪ HPCA short lecture series on High Performance Computer Architecture
  ▪ Part 1 (161 Lectures)
    ▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPmqpjgrmf4-DGlaeV0om4iP
  ▪ Part 2 (62 Lectures)
    ▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkNw98-MFodLzKgi6bYGjZs
  ▪ Part 3 (169 Lectures)
    ▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs
  ▪ Part 4 (120 Lectures)
    ▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj
  ▪ Part 5 (149 Lectures)
    ▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe
How Do We Implement a Basic 1-bit Predictor?

[Figure: the low 10 bits of the branch PC index a 1K-entry table; each entry is a single bit]

The table keeps track of what the branch did last time.
Basic 2-bit Branch Prediction

❖ Basic 2-bit predictor:
  ❖ For each branch:
    ❖ Predict taken or not taken
    ❖ Change the prediction only if it is wrong two consecutive times.

[Figure: 4-state FSM with states 00, 01, 10, 11; a taken branch (T) moves toward 11, a not-taken branch (N) moves toward 00]

▪ Check the following case, assuming we start from the 11 state:
TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…
▪ We get 50% correct prediction!
Basic 2-bit Branch Prediction

• For each branch, maintain a 2-bit saturating counter:
  if the branch is taken: counter = min(3, counter+1)
  if the branch is not taken: counter = max(0, counter-1)
• If (counter >= 2), predict taken; else predict not taken
• Advantage: a few atypical branch outcomes will not flip the prediction (a better measure of "the common case")
• Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
• Can be easily extended to N bits (in most processors, N=2)
• Prediction performance depends on both the prediction accuracy and the branch frequency
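The counter update above can be simulated directly to confirm the 50% figure from the previous slide (starting from state 11, i.e., counter = 3):

```python
# 2-bit saturating-counter predictor, exactly as described above:
# taken -> counter = min(3, counter+1); not taken -> counter = max(0, counter-1);
# predict taken when counter >= 2.
def simulate_2bit(pattern, counter=3):        # start in state 11, as on the slide
    correct = 0
    for outcome in pattern:
        predict = "T" if counter >= 2 else "N"
        correct += (predict == outcome)
        counter = min(3, counter + 1) if outcome == "T" else max(0, counter - 1)
    return correct / len(pattern)

print(simulate_2bit("TN" * 15))                  # alternating pattern -> 0.5
print(simulate_2bit("T" * 11 + "N" + "T" * 15))  # single N costs only 1 miss
```

Unlike the 1-bit predictor, a single atypical outcome in a long run of taken branches now costs only one misprediction instead of two.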
Basic 2-bit Branch Prediction

[Figure: the low 10 bits of the branch PC index a 1K-entry table; each entry is a 2-bit saturating counter]

The table keeps track of the common-case outcome for the branch.
