Instruction Scheduling
15-745
Copyright © Seth Copen Goldstein 2000-5
(some slides borrowed from M. Voss)

Instruction Scheduling
• Most modern processors have the ability to execute several adjacent instructions simultaneously.
  – Pipelined machines.
  – Very-long-instruction-word machines (VLIW).
  – Superscalar machines.
  – Dynamic scheduling/out-of-order machines.
• ILP is limited by several kinds of execution constraints:
  – Data dependence constraints.
  – Resource constraints ("hazards").
  – Control hazards.
What we will cover
• Scheduling basic blocks
  – List scheduling
  – Long-latency operations
  – Delay slots
• Scheduling for clustered architectures
• Software pipelining (next week)
• What we need to know
  – pipeline structure
  – data dependencies
  – register renaming

Instruction Scheduling
• In the von Neumann model of execution, an instruction starts only after its predecessor completes.
    instr 1 → instr 2        (time)
• This is not a very efficient model of execution.
  – von Neumann bottleneck or the memory wall.
Instruction Pipelines
• Almost all processors today use instruction pipelines to allow overlap of instructions (the Pentium 4 has a 20-stage pipeline!).
• The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.
  – F: Fetch instruction from cache or memory.
  – D: Decode instruction.
  – E: Execute: ALU operation or address calculation.
  – M: Memory access.
  – W: Write back result into register.
• Each of these stages completes its operation in one cycle (shorter than the cycle in the von Neumann model).
• An instruction still takes the same time to execute.

Instruction Pipelines
• However, we overlap these stages in time to complete an instruction every cycle.
  [Pipeline diagram: instrs 1–7 each pass through F D E M W, offset by one cycle — filling the pipeline, steady state, then draining the pipeline.]
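To make the overlap concrete, here is a minimal sketch (assuming an ideal pipeline with no stalls and one issue per cycle) comparing total cycles with and without pipelining:

  def unpipelined_cycles(n_instrs, n_stages):
      # von Neumann-style execution: each instruction finishes all stages
      # before the next one starts.
      return n_instrs * n_stages

  def pipelined_cycles(n_instrs, n_stages):
      # Ideal pipeline, no hazards: fill the pipeline once, then one
      # instruction completes every cycle.
      return n_stages + (n_instrs - 1)

  # 7 instructions on the 5-stage F/D/E/M/W pipeline from the slide:
  print(unpipelined_cycles(7, 5))  # 35 cycles
  print(pipelined_cycles(7, 5))    # 11 cycles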
Pipeline Hazards
• Data hazards
    r1 = [r2]
    r4 = r1 + r1
  – solved by forwarding and/or stalling
  – cache miss?
• Control hazards
  – jump & branch address not known until later in pipeline
  – solved by delay slot and/or prediction
Jump/Branch Delay Slot(s)
• One option is to stall the pipeline (hardware solution).
• Another option is to insert no-op instructions (software).
• Both degrade performance!
  [Pipeline diagrams: a jump followed by instr 2, with the fetch of instr 2 delayed; the same sequence with a nop issued after the jump.]

Jump/Branch Delay Slot(s)
• Another option is for the branch to take effect only after the delay slots.
• I.e., some instructions always get executed after the branch but before the branching takes effect.
  [Pipeline diagram: bra, then instr x and instr y in the delay slots, then instr 2, instr 3 from the branch target.]
Jump/Branch Delay Slots
• In other words, the instruction(s) in the delay slots of the jump/branch instruction always get(s) executed when the branch is executed (regardless of the branch result).
• Fetching from the branch target begins only after these instructions complete.
      bgt r3, L1
      :            ← delay-slot instructions, always executed
      :
    L1:

Branch Prediction
• Current processors will speculatively execute at conditional branches
  – if a branch direction is correctly guessed, great!
  – if not, the pipeline is flushed before instructions commit (WB).
• Why not just let the compiler schedule?
  – The average number of instructions per basic block in typical C code is about 5 instructions.
  – branches are not statically predictable
  – What happens if you have a 20-stage pipeline?
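Returning to the delay-slot idea: filling the slot is itself a small scheduling problem. A minimal sketch (not from the slides) under simplifying assumptions — a straight-line block of register-only instructions, each register defined at most once — that moves one instruction the branch does not depend on into the slot, or falls back to a nop:

  def fill_delay_slot(block):
      # block: list of (dest, srcs) tuples; the last entry is the branch,
      # with dest None and srcs holding its condition registers.
      _branch_dest, branch_srcs = block[-1]
      for i in range(len(block) - 2, -1, -1):
          dest, _srcs = block[i]
          # Registers read by the branch or by anything between the candidate
          # and the branch; delaying a definition of one of these would break
          # a flow dependence.
          read_later = set(branch_srcs)
          for _d, s in block[i + 1:-1]:
              read_later |= set(s)
          if dest not in read_later:
              candidate = block.pop(i)
              return block + [candidate]      # candidate now fills the delay slot
      return block + [(None, [])]             # nothing safe to move: emit a nop

  # The add into r5 is independent of the branch condition (r3), so it moves:
  block = [("r3", ["r1", "r2"]),   # r3 = r1 + r2
           ("r5", ["r6", "r7"]),   # r5 = r6 + r7
           (None, ["r3"])]         # bgt r3, L1
  print(fill_delay_slot(block))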
Data Dependences
• Anti-dependence (R → W), δ^a — a "false" dependence
• Output dependence (W → W), δ^o — a "false" dependence
• Input dependence (R → R), δ^i — not generally defined as a constraint

  [Latency diagrams: for r1 = r2 + r3 followed by r4 = r1 + r1, the sum is available right after E;
   for r1 = [r2] followed by r4 = r1 + r1, the loaded value is available only after M.]

Example:
  S1) a = 0;
  S2) b = a;
  S3) c = a + d + e;
  S4) d = b;
  S5) b = 5 + e;
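To see the definitions at work on S1–S5, here is a small Python sketch (def/use sets written out by hand) that lists every flow, anti-, and output dependence in the example:

  from itertools import combinations

  # (name, defs, uses) for the straight-line example S1–S5
  stmts = [("S1", {"a"}, set()),            # a = 0
           ("S2", {"b"}, {"a"}),            # b = a
           ("S3", {"c"}, {"a", "d", "e"}),  # c = a + d + e
           ("S4", {"d"}, {"b"}),            # d = b
           ("S5", {"b"}, {"e"})]            # b = 5 + e

  for (n1, d1, u1), (n2, d2, u2) in combinations(stmts, 2):
      if d1 & u2: print(f"{n1} -> {n2}: flow dependence on {d1 & u2}")
      if u1 & d2: print(f"{n1} -> {n2}: anti-dependence on {u1 & d2}")
      if d1 & d2: print(f"{n1} -> {n2}: output dependence on {d1 & d2}")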
Register Renaming Example
    original             renamed              renamed + scheduled
    r1 ← r2 + 1          r7 ← r2 + 1          r7 ← r2 + 1
    [fp+8] ← r1          [fp+8] ← r7          r1 ← r3 + 2
    r1 ← r3 + 2          r1 ← r3 + 2          [fp+8] ← r7
    [fp+12] ← r1         [fp+12] ← r1         [fp+12] ← r1

Phase ordering problem
• Can perform register renaming after register allocation
• Constrained by available registers
• Constrained by live on entry/exit
• Instead, do scheduling before register allocation

Scheduling a BB
• x ← w * 2 * x * y * z
• What do we need to know?
  – Latency of operations
  – # of registers
• Assume latencies: load 5, store 5, mult 2, others 1
• Also assume operations are non-blocking

    r1 ← [fp+w]
    r2 ← 2
    r1 ← r1 * r2
    r2 ← [fp+x]
    r1 ← r1 * r2
    r2 ← [fp+y]
    r1 ← r1 * r2
    r2 ← [fp+z]
    r1 ← r1 * r2
    [fp+w] ← r1
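A minimal renaming sketch (a hypothetical helper, not the course's renamer): give every register definition a fresh virtual name and rewrite later uses, which removes the anti- and output dependences that block reordering:

  import itertools

  def rename_registers(block):
      # block: list of (dest, srcs) register-transfer instructions.
      # Returns an equivalent block where every register definition writes a
      # fresh virtual register, so only true (flow) dependences remain.
      fresh = (f"v{i}" for i in itertools.count())
      current = {}                                  # architectural reg -> current name
      renamed = []
      for dest, srcs in block:
          new_srcs = [current.get(s, s) for s in srcs]
          if dest.startswith("r"):                  # rename register defs only
              new_dest = next(fresh)
              current[dest] = new_dest
          else:                                     # e.g. a store to [fp+8]
              new_dest = dest
          renamed.append((new_dest, new_srcs))
      return renamed

  block = [("r1", ["r2"]),        # r1 ← r2 + 1
           ("[fp+8]", ["r1"]),    # [fp+8] ← r1
           ("r1", ["r3"]),        # r1 ← r3 + 2
           ("[fp+12]", ["r1"])]   # [fp+12] ← r1
  for instr in rename_registers(block):
      print(instr)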
Lots of Heuristics
• forward or backward
• choose instructions on critical path
• ASAP or ALAP
• balanced paths
• depth in schedule graph

DLS (1995)
• Aim: avoid pipeline hazards in the load/store unit
  – load followed by use of target reg
  – store followed by load
• Simplifies in two ways
  – 1-cycle latency for load/store
  – includes all dependencies (WaW included)
The algorithm
• Construct the scheduling DAG
• Make the srcs of the DAG the initial candidates
• Pick a candidate
  – Choose an instruction with an interlock
  – Choose an instruction with a large number of successors
  – Choose the one with the longest path to the root
• Add newly available instructions to the candidate list

Example:
   1) ld  r1 ← [a]
   2) ld  r2 ← [b]
   3) add r1 ← r1 + r2
   4) ld  r2 ← [c]
   5) ld  r3 ← [d]
   6) mul r4 ← r2 * r3
   7) add r1 ← r1 + r4
   8) add r2 ← r2 + r3
   9) mul r2 ← r2 * r3
  10) add r1 ← r1 + r2
  11) st  [a] ← r1
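A sketch (not the DLS implementation itself) of building the scheduling DAG for this example — including the false WaR/WaW dependences, as the slide requires — and of the "longest path to root" priority used to rank candidates:

  from functools import lru_cache

  # (id, opcode, defs, uses) for instructions 1–11 of the example;
  # memory locations a–d are modeled like registers so the store gets ordered.
  instrs = [
      (1,  "ld",  {"r1"}, {"a"}),
      (2,  "ld",  {"r2"}, {"b"}),
      (3,  "add", {"r1"}, {"r1", "r2"}),
      (4,  "ld",  {"r2"}, {"c"}),
      (5,  "ld",  {"r3"}, {"d"}),
      (6,  "mul", {"r4"}, {"r2", "r3"}),
      (7,  "add", {"r1"}, {"r1", "r4"}),
      (8,  "add", {"r2"}, {"r2", "r3"}),
      (9,  "mul", {"r2"}, {"r2", "r3"}),
      (10, "add", {"r1"}, {"r1", "r2"}),
      (11, "st",  {"a"},  {"r1"}),
  ]

  # Edge i -> j whenever j must stay after i: flow (def-use), anti (use-def),
  # or output (def-def) dependence on some register or memory location.
  succs = {i: set() for i, *_ in instrs}
  for a in range(len(instrs)):
      for b in range(a + 1, len(instrs)):
          ia, _, da, ua = instrs[a]
          ib, _, db, ub = instrs[b]
          if (da & ub) or (ua & db) or (da & db):
              succs[ia].add(ib)

  @lru_cache(maxsize=None)
  def height(i):
      # Longest path (in edges) from i down to a sink of the DAG.
      return max((1 + height(j) for j in succs[i]), default=0)

  # Higher height = more urgent: the "longest path to root" heuristic.
  for i, *_ in instrs:
      print(i, "height", height(i), "succs", sorted(succs[i]))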
Trace Scheduling
• Basic blocks typically contain a small number of instrs.
• With many FUs, we may not be able to keep all the units busy with just the instructions of a BB.
• Trace scheduling allows scheduling across BBs: a likely path of basic blocks, called a trace, is selected.
  [CFG figures: a CFG with blocks A, B, C, D, E, F, G and split/join points S and J; a highlighted path (here through A, D, E, F, G) forms the trace.]

Trace Scheduling
• The trace is then scheduled as a single BB.
  – e.g. on a wide machine (TI C6x: Memory, ALU, FPU, Branch, ALU units), the sequence
        … ; a = b + c; x = x + 1; a = e * f; d = a - 3; …
    can issue x = x + 1 and a = e * f in the same cycle.
• Scalability issues?
Why Clusters?
  [Figure: register-file / functional-unit organizations — a single register file shared by all units
   (L1 S1 M1 D1 D2 M2 S2 L2) versus separate clusters, each with its own register file and I-MEM —
   trading unnecessary spilling against unnecessary communication plus intercluster moves.]

Some more details
• Example: C6x
  – delay 1 mostly, some overlap
  – Add on L, S, D
  – both srcs from same RF
  – if 2 D ops, srcs and dests must be to different RFs
  [Figure: TI C6x datapath — Data path A (.L1, .S1, .M1, .D1) with Register File A (A0–A15) and LD1/ST1
   ports; Data path B (.L2, .S2, .M2, .D2) with Register File B (B0–B15) and LD2/ST2 ports; DA1/DA2
   address paths, a 1X cross path between the files, long src/dst buses, address and data buses, and a
   control register file.]
Bottom-Up Greedy (BUG), 1985
• Assigns operations to clusters, then schedules
• Recurses down the DFG
  – assigns ops to clusters based on estimates of resource usage
  – assigns ops on the critical path first
  – tries to schedule as early as possible

Integrated Approaches
• Leupers, 2000
  – combine partitioning & scheduling
  – iterative approach
• B-init/B-iter, 2002
  – Initial binding/scheduling
  – Iterative improvement
• RHOP, 2003
  [Figure: a small example DFG (nodes 1, 10, 11, 12) shown under several alternative cluster assignments.]
Approach: Simulated Annealing (SA)
  [Flowchart: generate a random partitioning & schedule → pick one node to swap → re-schedule →
   accept the swap if the cost improves (or probabilistically), otherwise undo it → reduce T → repeat.]

Basic Algorithm

algorithm Partition
  input: DFG G with n nodes
  output: P: array [1..n] of 0,1
  var int i, r, cost, mincost; float T;
begin
  T := 10;
  P := RandomPartitioning;
  mincost := LISTSCHEDULING(G, P);
  WHILE_LOOP;
  return P;
end.

WHILE_LOOP:
  while T > 0.01 do
    for i := 1 to 50 do
      r := RANDOM(1, n);
      P[r] := 1 - P[r];
      cost := LISTSCHEDULING(G, P);
      delta := cost - mincost;
      if delta < 0 or RANDOM(0,1) < exp(-delta/T)
        then mincost := cost
        else P[r] := 1 - P[r]
      end if;
    end for;
    T := 0.9 * T;
  end while;
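A direct Python transcription of this sketch, with the list scheduler passed in as a cost function (any function that returns an estimated schedule length for a given partition will do):

  import math
  import random

  def sa_partition(graph, cost_fn, t_init=10.0, t_min=0.01, moves_per_temp=50, alpha=0.9):
      # Binary cluster assignment by simulated annealing.
      # graph: list of nodes; cost_fn(graph, partition) -> schedule length.
      n = len(graph)
      part = [random.randint(0, 1) for _ in range(n)]    # random initial partitioning
      mincost = cost_fn(graph, part)
      t = t_init
      while t > t_min:
          for _ in range(moves_per_temp):
              r = random.randrange(n)
              part[r] = 1 - part[r]                      # flip one node to the other cluster
              cost = cost_fn(graph, part)
              delta = cost - mincost
              # accept improvements, and worse moves with probability exp(-delta/T)
              if delta < 0 or random.random() < math.exp(-delta / t):
                  mincost = cost
              else:
                  part[r] = 1 - part[r]                  # undo the swap
          t *= alpha                                     # cool down
      return part, mincost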
Scheduling
• Use a list scheduler
• Tie breaker for the next ready node is min ALAP
• Heart of the routine is ScheduleNode

algorithm ListScheduling(G, P)
  input: DFG G; partition P
  output: length of schedule
  var m: DFG node; S: schedule
begin
  mark all nodes unscheduled;
  S = ∅;
  while (not all scheduled) do
    m = NextReadyNode(G);
    S = ScheduleNode(S, m, P);
    mark m as scheduled;
  end
  return Length(S)
end

ScheduleNode
• Goal: insert node m as early as possible
  – don't violate resource constraints
  – don't violate dependence constraints
• First try based on ASAP
• Until it is scheduled
  – See if there is an FU that can execute m
  – check source registers
    • if both are from the same RF as the FU, done
    • if not: must decide what to do
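A minimal list-scheduling sketch along these lines (the ALAP priorities, latencies, and a flat per-cycle FU count are inputs assumed for the sketch, not the exact routine from the slides):

  def list_schedule(nodes, preds, latency, alap, fu_count):
      # Greedy cycle-by-cycle list scheduling.
      # nodes: node ids; preds[n]: set of predecessors of n;
      # latency[n]: cycles until n's result is ready; alap[n]: ALAP priority
      # (smaller = more urgent); fu_count: functional units available per cycle.
      start = {}                        # node -> issue cycle
      unscheduled = set(nodes)
      cycle = 0
      while unscheduled:
          # ready = all predecessors issued and their results available
          ready = [n for n in unscheduled
                   if all(p in start and start[p] + latency[p] <= cycle
                          for p in preds[n])]
          ready.sort(key=lambda n: alap[n])       # min-ALAP tie breaker
          for n in ready[:fu_count]:              # resource constraint
              start[n] = cycle
              unscheduled.remove(n)
          cycle += 1
      return start, max(start[n] + latency[n] for n in start)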
Dealing with x-RF transfers
• Two ways to XFER:
  – Source can be Xfered this cycle
  – Source can be copied in the previous cycle
• If neither is true
  – maybe commutative?
  – try to schedule in the next cycle

Basic Scheduling of a Node

algorithm ScheduleNode(S, m, P)
  input: schedule S, node m, partition P
  output: new schedule with m
  var cs: control step
begin
  cs = EarliestControlStep(m) - 1;
  repeat
    cs++;
    f = GetNodeUnit(m, cs, P);
    if (f == ∅) continue;
    if (m needs arg from other RF) then
      CheckArgTransfer();
      if (no transfer possible) then continue;
      else TryScheduleTransfers();
  until (m is scheduled);
  …

CheckArgTransfer / TryScheduleTransfers:
• reuse first — if CSE, reuse
• move if possible — if there is room for a move, do so
• use X-path if possible — if the X-path is available and valid for the args, use it
• try commuting
• Can fail
RHOP Partitioning/Scheduling
• Chu, Fan, Mahlke, 2003
• Global scheduling and partitioning
• Based on graph partitioning

RHOP Approach
• Opposite approach to conventional clustering
• Global view
  – Graph partitioning strategy [Aletà '01, '02]
  – Identify tightly coupled operations and treat them uniformly
Region-based Hierarchical Operation Partitioning (RHOP)
• Flow: program → region → weight calculation → partitioning
• Weight calculation creates guides for good partitions
• Partitioning clusters operations based on the given weights

Edge Weights
• Slack distribution allocates slack to certain edges
  – Edge slack = lstart(dest) − latency(edge) − estart(src)
  – First come, first served method used
  [Figure: a 14-node example DFG with nodes labeled (estart, lstart), from node 1 at (0,0) down to
   node 14 at (4,4); edge weights encode criticality — 1: edge with slack, 8: no slack left after
   distribution, 10: critical edge.]
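A small sketch of the estart/lstart labels and the per-edge slack from the formula above (the first-come-first-served distribution of that slack is left out):

  from collections import deque

  def edge_slacks(nodes, edges, latency):
      # estart/lstart longest-path labels and per-edge slack for a DFG.
      # nodes: node ids; edges: list of (src, dst); latency[(src, dst)]: edge latency.
      succs = {n: [] for n in nodes}
      indeg = {n: 0 for n in nodes}
      for s, d in edges:
          succs[s].append(d)
          indeg[d] += 1

      # topological order (Kahn's algorithm)
      order, work = [], deque(n for n in nodes if indeg[n] == 0)
      while work:
          n = work.popleft()
          order.append(n)
          for d in succs[n]:
              indeg[d] -= 1
              if indeg[d] == 0:
                  work.append(d)

      # earliest start: longest path from any source
      estart = {n: 0 for n in nodes}
      for s in order:
          for d in succs[s]:
              estart[d] = max(estart[d], estart[s] + latency[(s, d)])

      # latest start that preserves the critical-path length
      length = max(estart.values())
      lstart = {n: length for n in nodes}
      for s in reversed(order):
          for d in succs[s]:
              lstart[s] = min(lstart[s], lstart[d] - latency[(s, d)])

      slack = {(s, d): lstart[d] - latency[(s, d)] - estart[s] for s, d in edges}
      return estart, lstart, slack

  # Tiny example: 1 -> 3 -> 4 and 2 -> 4, unit latencies; edge 2 -> 4 has slack 1.
  nodes = [1, 2, 3, 4]
  edges = [(1, 3), (3, 4), (2, 4)]
  lat = {e: 1 for e in edges}
  print(edge_slacks(nodes, edges, lat))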
RHOP – Partitioning Phase / Cluster Refinement
  [Figures: the 14-node example DFG split across two clusters, each with its own register file and
   I, F, M, B functional units; refinement then moves operations between the clusters.]
How Good is this Partition?
• Each cluster c gets a per-cycle weight Cwgt(c, t) built from the operations placed in it:
  – Twgt(c, t): the number of ops in c around cycle t, scaled by 1/(slack_ave + 1) and by shared_wgt(c)
  – Iwgt(c, t): the largest per-opgroup sum of operation weights τ(op, t)/(slack + 1)
  (see the RHOP paper for the exact definitions)
• The estimated schedule length takes the worst cluster in each cycle:
    SL = Σ_{t=0}^{max estart} max_c Cwgt(c, t)

Where Should Operations Move From?
• Example (per-cycle cluster weights for the 14-node DFG):
    cycle        0      1      2      3      4
    Cluster 1:   2.5    2.0    0.5    0.0    0.0    → Cluster_wgt1 = 5.0
    Cluster 2:   0.0    0.33   0.33   0.0    0.0    → Cluster_wgt2 = 0.67
  Taking the per-cycle maximum gives SL = 5.0, so operations should move out of the overloaded Cluster 1.
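A sketch of just the schedule-length estimate above, with the per-cycle cluster weights taken as given (computing them from Twgt/Iwgt is beyond this sketch):

  def schedule_length_estimate(cluster_wgts):
      # cluster_wgts: {cluster: [weight at cycle 0, weight at cycle 1, ...]}
      # SL sums, over all cycles, the heaviest cluster's weight.
      n_cycles = max(len(w) for w in cluster_wgts.values())
      return sum(max(w[t] if t < len(w) else 0.0 for w in cluster_wgts.values())
                 for t in range(n_cycles))

  # Per-cycle weights from the slide's example:
  wgts = {1: [2.5, 2.0, 0.5, 0.0, 0.0],
          2: [0.0, 0.33, 0.33, 0.0, 0.0]}
  print(schedule_length_estimate(wgts))   # 5.0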
Cluster Refinement: Move Gain
• Mgain = Egain + (Lgain * CRITICAL_EDGE_COST)
  [Example: per-cycle weights before and after moving a group of operations between the clusters;
   SL(before) = 5.0, Egain = −1.0.]

Evaluation
• Evaluated DSP kernels and SPECint2000
• 64 registers per cluster
• Latencies similar to Itanium
    Name      Configuration
    2-2111    2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
    4-2111    4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
• For more detailed results, …
2 Cluster Results vs 1 Cluster / 4 Cluster Results vs 1 Cluster
  [Bar charts: performance of BUG and RHOP normalized to a single-cluster machine (roughly 0.4–1.0)
   on DSP kernels (fir, dct, LU, heat, rls, lyapunov, fsed, halftone, sobel, atmcell, channel, huffman)
   and SPECint2000 benchmarks (164.gzip, 175.vpr, 181.mcf, 197.parser, 253.perlbmk, 254.gap,
   255.vortex, 256.bzip, 300.twolf), plus the average.]
Related Work
  [Table comparing clustering algorithms — UAS, Leupers, Capitanio, GP(B), B-ITER, BUG, RHOP — along
   several dimensions: partitioning during vs. before scheduling, local vs. region scope, real vs.
   pseudo schedule estimate, operation-count machine model, hierarchical vs. flat grouping, iterative.]

RHOP vs. BUG
• RHOP is a prescheduling technique
• Combines slack distribution with multilevel-KL partitioning
• Performs better as the number of resources increases
    Machine    RHOP vs. BUG
    4-1111     14.3%
    4-2111     15.3%
    4-H        8.0%