Compiler Techniques For Exposing ILP
The following is a MIPS implementation of a loop that adds the scalar s to every
element of an array, assuming that s is in F2, R1 initially holds the address of the
last element of the array, and 8(R2) is the address of the first element of the
array.
loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, loop
Instruction producing result   Instruction using result   Latency in cycles
FP ALU Op                      FP ALU Op                  3
FP ALU Op                      Store double               2
Load double                    FP ALU Op                  1
Load double                    Store double               0
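Without scheduling, each iteration takes 9 clock cycles, 4 of which are stalls:

                                    clock cycle issued
loop:   L.D     F0, 0(R1)           1
        stall                       2
        ADD.D   F4, F0, F2          3
        stall                       4
        stall                       5
        S.D     F4, 0(R1)           6
        DADDUI  R1, R1, #-8         7
        stall                       8
        BNE     R1, R2, loop        9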
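Moving the DADDUI into the slot after the L.D, and changing the S.D offset to 8(R1)
to compensate for the earlier decrement, reduces this to 7 cycles:

                                    clock cycle issued
loop:   L.D     F0, 0(R1)           1
        DADDUI  R1, R1, #-8         2
        ADD.D   F4, F0, F2          3
        stall                       4
        stall                       5
        S.D     F4, 8(R1)           6
        BNE     R1, R2, loop        7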
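The issue cycles in the two listings above can be reproduced with a small timing model. This is only a sketch of an in-order, single-issue pipeline, not code from the notes: the function name `issue_cycles`, the tuple encoding of instructions, and the producer/consumer pairs in `LATENCY` are assumptions made for illustration. In particular, the one-cycle integer-to-branch entry is inferred from the stall before BNE in the unscheduled listing.

```python
# Hypothetical in-order, single-issue timing model (an illustration, not from the notes).
# LATENCY[(producer, consumer)] = stall cycles needed between dependent instructions.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # assumed: matches the stall between DADDUI and BNE
}

KIND = {"L.D": "load", "ADD.D": "fpalu", "S.D": "store",
        "DADDUI": "intalu", "BNE": "branch"}


def issue_cycles(program):
    """Return the clock cycle in which each instruction issues.

    program is a list of (opcode, destination register or None, source registers).
    """
    producer = {}  # register -> (issue cycle, kind) of its most recent writer
    cycles = []
    clock = 0
    for op, dest, sources in program:
        kind = KIND[op]
        earliest = clock + 1  # at most one instruction issues per cycle
        for reg in sources:
            if reg in producer:
                p_cycle, p_kind = producer[reg]
                wait = LATENCY.get((p_kind, kind), 0)  # unlisted pairs: no stall
                earliest = max(earliest, p_cycle + 1 + wait)
        clock = earliest
        cycles.append(clock)
        if dest is not None:
            producer[dest] = (clock, kind)
    return cycles


original = [
    ("L.D", "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]
scheduled = [
    ("L.D", "F0", ["R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("BNE", None, ["R1", "R2"]),
]

print(issue_cycles(original))   # [1, 3, 6, 7, 9]
print(issue_cycles(scheduled))  # [1, 2, 3, 6, 7]
```

The model tracks only true (read-after-write) dependences within one pass through the loop body, which is enough for these listings because every instruction issues in program order.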
Unrolling the loop four times, without scheduling, gives:

                                    clock cycle issued
loop:   L.D     F0, 0(R1)           1
        ADD.D   F4, F0, F2          3
        S.D     F4, 0(R1)           6
        L.D     F6, -8(R1)          7
        ADD.D   F8, F6, F2          9
        S.D     F8, -8(R1)          12
        L.D     F10, -16(R1)        13
        ADD.D   F12, F10, F2        15
        S.D     F12, -16(R1)        18
        L.D     F14, -24(R1)        19
        ADD.D   F16, F14, F2        21
        S.D     F16, -24(R1)        24
        DADDUI  R1, R1, #-32        25
        BNE     R1, R2, loop        27
Each L.D has 1 stall, each ADD.D has 2, and the DADDUI has 1, for a total of 13 stall
cycles. With 14 instructions issued, the loop takes a total of 27 clock cycles.
Without unrolling, the original code would take 36 cycles (4 x 9) for 4 iterations
and the rescheduled code would take 28 cycles (4 x 7).
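The cycle counts above follow from simple arithmetic on the listings. The sketch below spells that arithmetic out; the variable names are chosen here for illustration and do not come from the notes.

```python
# Counts read off the unrolled listing, which contains four copies of the loop body.
COPIES = 4

# L.D, ADD.D and S.D for each copy, plus one DADDUI and one BNE.
instructions = 3 * COPIES + 2

# One stall after each L.D, two after each ADD.D, one after the DADDUI.
stalls = 1 * COPIES + 2 * COPIES + 1

unrolled = instructions + stalls  # cycles to process four elements
original = 9 * COPIES             # the 9-cycle loop executed four times
rescheduled = 7 * COPIES          # the 7-cycle loop executed four times

for name, cycles in [("original", original),
                     ("rescheduled", rescheduled),
                     ("unrolled", unrolled)]:
    print(f"{name:12s} {cycles:3d} cycles, {cycles / COPIES:.2f} per element")
```

Seen this way, unrolling alone removes the loop overhead of three iterations but leaves every dependence stall in place, which is why it gains only one cycle over the rescheduled loop.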
We can do better by changing the order of the instructions:
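                                    clock cycle issued
loop:   L.D     F0, 0(R1)           1
        L.D     F6, -8(R1)          2
        L.D     F10, -16(R1)        3
        L.D     F14, -24(R1)        4
        ADD.D   F4, F0, F2          5
        ADD.D   F8, F6, F2          6
        ADD.D   F12, F10, F2        7
        ADD.D   F16, F14, F2        8
        S.D     F4, 0(R1)           9
        S.D     F8, -8(R1)          10
        DADDUI  R1, R1, #-32        11
        S.D     F12, 16(R1)         12
        S.D     F16, 8(R1)          13
        BNE     R1, R2, loop        14

The last two stores execute after the DADDUI has already subtracted 32 from R1, so
their offsets are 16(R1) and 8(R1) rather than -16(R1) and -24(R1). This schedule has
no stalls: 14 clock cycles for 4 elements, or 3.5 cycles per element.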
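Reordering is only valid if the loop still updates the same memory locations. A store that issues after the DADDUI sees R1 already reduced by 32, so its displacement must be 32 larger to reach the same element, which is the same adjustment as the 8(R1) in the single-iteration schedule. The sketch below checks this with a tiny interpreter for the five instructions used here. The interpreter, its tuple encoding, the function name `run`, and the 8-element test array are all assumptions made for illustration; the array length is assumed to be a multiple of 4.

```python
def run(program, memory, r1, r2, s):
    """Interpret a tiny MIPS subset until BNE falls through; return the final memory."""
    regs = {"R1": r1, "R2": r2, "F2": s}
    mem = dict(memory)
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        if op == "L.D":       # a = destination, b = displacement, c = base register
            regs[a] = mem[regs[c] + b]
        elif op == "S.D":     # a = source, b = displacement, c = base register
            mem[regs[c] + b] = regs[a]
        elif op == "ADD.D":   # a = b + c
            regs[a] = regs[b] + regs[c]
        elif op == "DADDUI":  # a = b + immediate c
            regs[a] = regs[b] + c
        elif op == "BNE":     # jump back to the first instruction when a != b
            if regs[a] != regs[b]:
                pc = 0
                continue
        pc += 1
    return mem


ORIGINAL = [
    ("L.D", "F0", 0, "R1"),
    ("ADD.D", "F4", "F0", "F2"),
    ("S.D", "F4", 0, "R1"),
    ("DADDUI", "R1", "R1", -8),
    ("BNE", "R1", "R2", None),
]
UNROLLED = [
    ("L.D", "F0", 0, "R1"), ("L.D", "F6", -8, "R1"),
    ("L.D", "F10", -16, "R1"), ("L.D", "F14", -24, "R1"),
    ("ADD.D", "F4", "F0", "F2"), ("ADD.D", "F8", "F6", "F2"),
    ("ADD.D", "F12", "F10", "F2"), ("ADD.D", "F16", "F14", "F2"),
    ("S.D", "F4", 0, "R1"), ("S.D", "F8", -8, "R1"),
    ("DADDUI", "R1", "R1", -32),
    ("S.D", "F12", 16, "R1"),  # -16 + 32: R1 has already been decremented
    ("S.D", "F16", 8, "R1"),   # -24 + 32
    ("BNE", "R1", "R2", None),
]
# Same schedule, but with the two late stores left at their pre-DADDUI displacements.
UNADJUSTED = list(UNROLLED)
UNADJUSTED[11] = ("S.D", "F12", -16, "R1")
UNADJUSTED[12] = ("S.D", "F16", -24, "R1")

# Eight doubles at byte addresses 8..64; R2 = 0 so that 8(R2) is the first element.
memory = {8 * i: float(i) for i in range(1, 9)}
expected = {addr: value + 0.5 for addr, value in memory.items()}

print(run(ORIGINAL, memory, 64, 0, 0.5) == expected)    # True
print(run(UNROLLED, memory, 64, 0, 0.5) == expected)    # True
print(run(UNADJUSTED, memory, 64, 0, 0.5) == expected)  # False
```

The interpreter checks only the values written to memory and says nothing about timing.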
http://vip.cs.utsa.edu/classes/cs3853f2012/notes/ch3-5.html