EC483_Fall2024_W7
Chapter 3: Instruction-Level Parallelism and Its Exploitation
[Figure: jobs vs. time; each job passes through Fetch, Decode, Execute sequentially]
Pipelined Architecture
Break the job into smaller stages
[Figure: jobs vs. time; the F, D, X stages of successive jobs overlap in the pipeline]
5-Stage Pipeline
[Figure: pipeline stages separated by latches (L), all driven by the clock (Clk)]
A 5-Stage Pipeline
[Figure: 5-stage pipeline datapath; the PC is incremented to PC+4 each cycle]
A 5-Stage Pipeline
Read registers, compare registers, and compute the branch target; for now, assume
branches take 2 cycles (there is enough work that branches can easily take more)
Introduction
Instruction-Level Parallelism
Instruction Dependences
Name Dependences
Register Renaming
Control Dependence
Structural Hazards
Enabling and Optimizing ILP
Compiler Techniques for Exposing ILP
• Pipeline Scheduling
– Separate a dependent instruction from the source instruction by
the pipeline latency of the source instruction
• Example
➢ C code:
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;
Compiler Techniques for Exposing ILP
• Loop unrolling
– Replicate the loop body multiple times and adjust the loop
termination code
– Unroll by a factor of 4 (assume the number of elements is divisible by 4)
– Eliminate unnecessary instructions
Loop: fld    f0,0(x1)
      fadd.d f4,f0,f2
      fsd    f4,0(x1)    // drop addi & bne
      fld    f6,-8(x1)
      fadd.d f8,f6,f2
      fsd    f8,-8(x1)   // drop addi & bne
      fld    f10,-16(x1)
      fadd.d f12,f10,f2
      fsd    f12,-16(x1) // drop addi & bne
      fld    f14,-24(x1)
      fadd.d f16,f14,f2
      fsd    f16,-24(x1)
      addi   x1,x1,-32
      bne    x1,x2,Loop
• Eliminates three branches and three decrements of x1
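The unrolled loop above corresponds to C along these lines (a sketch; the function name is illustrative, and the element count is assumed divisible by 4 as stated on the slide):

```c
/* Original loop: for (i = 999; i >= 0; i = i - 1) x[i] = x[i] + s;     */
/* Unrolled by 4: one decrement of i and one branch test cover four     */
/* elements, mirroring the single addi/bne per four fld/fadd.d/fsd.     */
void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i -= 4) {   /* n assumed divisible by 4 */
        x[i]     += s;   /* offset 0 in the assembly   */
        x[i - 1] += s;   /* offset -8                  */
        x[i - 2] += s;   /* offset -16                 */
        x[i - 3] += s;   /* offset -24                 */
    }
}
```

As on the slide, the loop-maintenance overhead is now amortized over four elements per iteration.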
Compiler Techniques for Exposing ILP
❖ Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code
❖ Use different registers for different computations to avoid name
dependence.
❖ Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code.
❖ Determine that the loads and stores in the unrolled loop can be
interchanged by observing that they are independent: they do not refer
to the same address.
❖ Schedule the code, preserving any dependences needed to yield
the same result as the original code.
Compiler Techniques Limitations
❖ Loop overhead
❖ The amount of overhead that can be reduced decreases with each
additional unroll
❖ Code size limitations
❖ Increase in code size → possible increase in cache miss rate
❖ Compiler limitations
❖ Potential shortfall in registers → register pressure
Branch Prediction
[Figure: 1-bit predictor FSM; states 0 (predict not taken) and 1 (predict taken), with taken (T) / not-taken (N) transitions]
Basic 1-bit predictor
Resources
▪ Memory Timing
▪ https://www.hardwaresecrets.com/understanding-ram-timings/
▪ Memory Architecture
▪ https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
▪ CS6810 Computer Architecture 87 Lectures by Rajeev
Balasubramonian
▪ https://www.youtube.com/playlist?list=PL8EC1756A7B1764F6
Basic 1-bit predictor
[Figure: 10 bits of the branch PC index a table of 1K entries; each entry is a single bit]
The table keeps track of what the branch did last time
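The single-bit entry described above can be sketched as a small simulation (illustrative C, not from the slides): the bit simply records the last outcome, so a loop branch that is taken several times and then falls through mispredicts twice per loop visit.

```c
/* One 1-bit predictor entry: predict whatever the branch did last time. */
typedef struct {
    int last;   /* 1 = taken, 0 = not taken */
} OneBit;

/* Returns 1 if the prediction matched the actual outcome, then updates the bit. */
int onebit_predict_update(OneBit *p, int actual) {
    int correct = (p->last == actual);
    p->last = actual;   /* remember what the branch did this time */
    return correct;
}
```

For the pattern T,T,T,N repeated (a short loop), the entry mispredicts both the final not-taken exit and the first taken branch of the next visit: 2 mispredictions per 4 branches.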
Basic 2-bit Branch Prediction
[Figure: 2-bit saturating-counter FSM; states 00, 01, 10, 11 with taken (T) / not-taken (N) transitions]
▪ Check the following case, assuming we start from the 11 state:
TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…
▪ We get 50% correct prediction!
Basic 2-bit Branch Prediction
• For each branch, maintain a 2-bit saturating counter:
if the branch is taken: counter = min(3, counter+1)
if the branch is not taken: counter = max(0, counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the
prediction (a better measure of “the common case”)
• Especially useful when multiple branches share the same
counter (some bits of the branch PC are used to index
into the branch predictor)
• Can be easily extended to N-bits (in most processors,
N=2)
• Prediction performance depends on both the prediction
accuracy and the branch frequency
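The counter rules above can be checked with a short simulation (illustrative C; the function name is an assumption):

```c
/* 2-bit saturating counter: value 0..3; predict taken when counter >= 2. */
typedef struct {
    int counter;
} TwoBit;

/* Returns 1 if the prediction was correct, then applies the update rules. */
int twobit_predict_update(TwoBit *p, int taken) {
    int predict_taken = (p->counter >= 2);
    int correct = (predict_taken == taken);
    if (taken)
        p->counter = p->counter < 3 ? p->counter + 1 : 3;  /* min(3, c+1) */
    else
        p->counter = p->counter > 0 ? p->counter - 1 : 0;  /* max(0, c-1) */
    return correct;
}
```

Starting from the 11 state (counter = 3) on the alternating pattern T,N,T,N,… the counter oscillates between 3 and 2, always predicts taken, and is right exactly half the time, matching the 50% figure on the earlier slide.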
Basic 2-bit Branch Prediction
[Figure: 10 bits of the branch PC index a table of 1K entries; each entry is a 2-bit saturating counter]
The table keeps track of the common-case outcome for the branch
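Indexing the 1K-entry table with 10 bits of the branch PC can be sketched as follows (illustrative C; dropping the low two bits assumes 4-byte instruction alignment, which the slide does not state):

```c
#include <stdint.h>

/* Map a branch PC to an index into a 1K-entry (2^10) predictor table.  */
/* The low 2 bits are dropped assuming 4-byte instruction alignment.    */
unsigned predictor_index(uint32_t pc) {
    return (pc >> 2) & 0x3FF;   /* keep 10 bits -> index in 0..1023 */
}
```

Branches whose PCs share these 10 bits alias to the same counter; this is the "multiple branches share the same counter" situation where a 2-bit counter's tolerance of occasional disagreement helps.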