Compiler Techniques for Exposing ILP

Keep the pipeline full: we need sequences of unrelated instructions.

Related instructions must be separated by a number of cycles that depends on the pipeline depth.

This section concentrates on using loop unrolling.


Consider the following code, where x and s are floating point and i is an int:

for (i = 999; i >= 0; i--)
    x[i] = x[i] + s;

The following is a MIPS implementation, assuming that s is in F2, R1 holds the address of the last element of the array (the highest address), and 8(R2) is the address of the first element (the last one the loop processes).
loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, loop

What happens when we execute this on a simple pipeline?


We have not discussed how floating point operations work, but we will assume the
following latencies:
Instruction producing result    Instruction using result    Latency in cycles
FP ALU op                       FP ALU op                   3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0

Here is the timing of these instructions.


                                        clock cycle issued
loop:   L.D     F0, 0(R1)               1
        stall                           2
        ADD.D   F4, F0, F2              3
        stall                           4
        stall                           5
        S.D     F4, 0(R1)               6
        DADDUI  R1, R1, #-8             7
        stall                           8
        BNE     R1, R2, loop            9

We assume the latencies from the table above.


We assume a latency of 1 cycle from an integer ALU operation to a dependent branch, since the branch address is calculated in ID, which occurs in the same cycle as the EX of the previous instruction.

We ignore other delays due to branches.
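As a sanity check on this timing, here is a minimal C sketch (not part of the original notes; the instruction names and stall counts are simply taken from the latency table and timing above) that reproduces the issue cycles:

#include <stdio.h>

/* Minimal sketch: reproduce the issue cycles of the original loop body.
   stalls[i] is the number of stall cycles inserted before instruction i,
   from the latency table: L.D -> ADD.D needs 1, ADD.D -> S.D needs 2,
   and DADDUI -> BNE needs 1. */
int main(void)
{
    const char *insn[]   = { "L.D", "ADD.D", "S.D", "DADDUI", "BNE" };
    const int   stalls[] = {   0,      1,      2,       0,      1  };

    int cycle = 0;
    for (int i = 0; i < 5; i++) {
        cycle += stalls[i] + 1;           /* stall cycles, then one issue cycle */
        printf("%-7s issues in cycle %d\n", insn[i], cycle);
    }
    /* Prints 1, 3, 6, 7, 9 -- the same issue cycles as the table above. */
    return 0;
}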
We can remove half of the stalls by moving the DADDUI up after the L.D; the S.D offset then becomes 8(R1) because R1 has already been decremented.
                                        clock cycle issued
loop:   L.D     F0, 0(R1)               1
        DADDUI  R1, R1, #-8             2
        ADD.D   F4, F0, F2              3
        stall                           4
        stall                           5
        S.D     F4, 8(R1)               6
        BNE     R1, R2, loop            7

The body of the loop takes 7 cycles.


Now we unroll the loop 4 times, assuming the number of iterations is divisible by 4:
                                        clock cycle issued
loop:   L.D     F0, 0(R1)               1
        ADD.D   F4, F0, F2              3
        S.D     F4, 0(R1)               6
        L.D     F6, -8(R1)              7
        ADD.D   F8, F6, F2              9
        S.D     F8, -8(R1)              12
        L.D     F10, -16(R1)            13
        ADD.D   F12, F10, F2            15
        S.D     F12, -16(R1)            18
        L.D     F14, -24(R1)            19
        ADD.D   F16, F14, F2            21
        S.D     F16, -24(R1)            24
        DADDUI  R1, R1, #-32            25
        BNE     R1, R2, loop            27

Each L.D has 1 stall, each ADD.D has 2, and the DADDUI has 1, for a total of 13 stall cycles and 27 clock cycles for the loop.
Without unrolling, the original would take 36 cycles for 4 iterations and the
rescheduled code would take 28 cycles.
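At the source level, this unrolling corresponds to something like the following C sketch (the array size of 1000 and the divisibility-by-4 assumption come from the example above; the surrounding function is just for illustration):

/* Source-level sketch of the 4-way unrolled loop.
   Assumes the trip count (1000 here) is divisible by 4,
   as the MIPS code above does. */
void add_scalar_unrolled4(double x[1000], double s)
{
    /* Four copies of the body per trip; the compiler gives each copy
       its own registers (F4/F8/F12/F16 in the MIPS code above). */
    for (int i = 999; i >= 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

The loop overhead (decrement and branch) is now paid once per four elements; the remaining cost is the stalls inside the unrolled body.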
We can do better by changing the order of the instructions:
                                        clock cycle issued
loop:   L.D     F0, 0(R1)               1
        L.D     F6, -8(R1)              2
        L.D     F10, -16(R1)            3
        L.D     F14, -24(R1)            4
        ADD.D   F4, F0, F2              5
        ADD.D   F8, F6, F2              6
        ADD.D   F12, F10, F2            7
        ADD.D   F16, F14, F2            8
        S.D     F4, 0(R1)               9
        S.D     F8, -8(R1)              10
        DADDUI  R1, R1, #-32            11
        S.D     F12, 16(R1)             12
        S.D     F16, 8(R1)              13
        BNE     R1, R2, loop            14

There are now no stalls at all: each L.D is at least one instruction ahead of its dependent ADD.D, each ADD.D is at least two ahead of its S.D, and the DADDUI is more than one ahead of the BNE. Note that the last two stores use offsets 16(R1) and 8(R1) because the DADDUI has already subtracted 32 from R1.


Summary of the 4 examples:

Description                 Cycles per iteration
ideal                       5
original                    9
scheduled                   7
unrolled                    6.75
unrolled and scheduled      3.5

The unrolled figures are per original iteration: 27/4 = 6.75 and 14/4 = 3.5. The ideal is the 5 instructions of the original body with no stalls.

Limitations of loop unrolling:

Diminishing savings as we unroll more:
    With an unroll factor of 4, 2 cycles out of 14 (14.3%) are loop overhead.
    With an unroll factor of 8, 2 cycles out of 26 (7.7%) are loop overhead.
    With an unroll factor of 16, 2 cycles out of 50 (4%) are loop overhead.

Increase in code size.

Limited number of registers.
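The examples above also assume the iteration count is a multiple of the unroll factor. When it is not, the usual approach is to pair the unrolled loop with a scalar remainder loop; a minimal C sketch under that assumption (the function and names are illustrative, not from the notes):

/* Sketch: handle a trip count n that need not be divisible by the
   unroll factor by peeling the leftovers into a remainder loop. */
void add_scalar(double *x, int n, double s)
{
    int i = n - 1;

    /* Unrolled-by-4 main loop: runs while at least 4 elements remain. */
    for (; i >= 3; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }

    /* Remainder loop: handles the at most 3 leftover elements. */
    for (; i >= 0; i--)
        x[i] = x[i] + s;
}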

http://vip.cs.utsa.edu/classes/cs3853f2012/notes/ch3-5.html
