
Instruction Scheduling

This document discusses instruction-level parallelism and instruction scheduling. It introduces several techniques that modern processors use to execute multiple instructions simultaneously, such as pipelining, VLIW, superscalar, and out-of-order execution. Instruction scheduling aims to order instructions to maximize parallelism while respecting data dependences and resource constraints. Because fully optimal scheduling is NP-complete, compilers rely on heuristic methods, and hardware adds dynamic scheduling on top.


Instruction-level Parallelism  (15-745, © Seth Copen Goldstein 2000-5; some slides borrowed from M. Voss)

• Most modern processors have the ability to execute several adjacent instructions simultaneously:
  – Pipelined machines.
  – Very-long-instruction-word (VLIW) machines.
  – Superscalar machines.
  – Dynamic scheduling / out-of-order machines.
• ILP is limited by several kinds of execution constraints:
  – Data dependence constraints.
  – Resource constraints ("hazards").
  – Control hazards.

Execution Constraints

• Data-dependence constraints:
  – If instruction A computes a value that is read by instruction B, then B cannot execute before A is completed. For example:
        ld  [%fp-28], %o1
        add %o1, %l2, %l3
• Resource hazards:
  – Limited # of functional units.
    • If there are n functional units of a particular kind (e.g., n multipliers), then only n instructions that require that kind of unit can execute at once.
  – Limited instruction issue.
    • If the instruction-issue unit can issue only n instructions at a time, then this limits ILP.
  – Limited register set.
    • Any schedule of instructions must have a valid register allocation.

Instruction Scheduling

• The purpose of instruction scheduling (IS) is to order the instructions for maximum ILP:
  – Keep all resources busy every cycle.
  – If necessary, eliminate data dependences and resource hazards to accomplish this.
• The IS problem is NP-complete (and bad in practice).
  – So heuristic methods are necessary.
• (How can you tell this is an old slide?)
Instruction Scheduling

• There are many different techniques for IS.
  – Still an open area of research.
• Most optimizing compilers perform good local IS, and only simple global IS.
• The biggest opportunities are in scheduling the code for loops.

Should the Compiler Do IS?

• Many modern machines perform dynamic reordering of instructions.
  – Also called "out-of-order execution" (OOOE).
  – Not yet clear whether this is a good idea.
  – Pro:
    • OOOE can use additional registers and register renaming to eliminate data dependences that no amount of static IS can accomplish.
    • No need to recompile programs when the hardware changes.
  – Con:
    • OOOE means more complex hardware (and thus longer cycle times and more wattage).
    • And it can't be optimal, since IS is NP-complete.

What we will cover

• Scheduling basic blocks:
  – List scheduling
  – Long-latency operations
  – Delay slots
• Scheduling for clustered architectures
• Software pipelining (next week)
• What we need to know:
  – pipeline structure
  – data dependencies
  – register renaming

Instruction Scheduling

• In the von Neumann model of execution, an instruction starts only after its predecessor completes (instr 1, then instr 2, and so on in time).
• This is not a very efficient model of execution.
  – The von Neumann bottleneck, or the memory wall.
Instruction Pipelines

• Almost all processors today use instruction pipelines to allow overlap of instructions (the Pentium 4 has a 20-stage pipeline!).
• The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor:
  – F: Fetch instruction from cache or memory.
  – D: Decode instruction.
  – E: Execute. ALU operation or address calculation.
  – M: Memory access.
  – W: Write back result into register.
• Each of these stages completes its operation in one cycle (shorter than the cycle in the von Neumann model).
• An instruction still takes the same time to execute.

Instruction Pipelines

• However, we overlap these stages in time to complete an instruction every cycle.
• [Pipeline diagram: instructions 1-7 each pass through F D E M W, offset by one cycle; the first cycles fill the pipeline, the middle cycles are the steady state, and the last cycles drain the pipeline.]
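To make the overlap concrete, here is a small illustrative Python sketch (mine, not from the slides) that prints the cycle-by-cycle occupancy of the five-stage pipeline described above, assuming one instruction issues per cycle with no stalls:

  # Sketch: print a five-stage pipeline diagram (F D E M W), one new
  # instruction issued per cycle, assuming no stalls or hazards.
  STAGES = ["F", "D", "E", "M", "W"]

  def pipeline_diagram(num_instrs):
      total_cycles = num_instrs + len(STAGES) - 1
      for i in range(num_instrs):
          row = ["."] * total_cycles
          for s, stage in enumerate(STAGES):
              row[i + s] = stage          # instruction i occupies stage s in cycle i + s
          print("instr %d: %s" % (i + 1, " ".join(row)))

  pipeline_diagram(7)

In the printed diagram the first four cycles fill the pipeline, the middle cycles are the steady state with every stage busy, and the last four cycles drain it.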

Pipeline Hazards

• Structural hazards:
  – Two instructions need the same resource at the same time.
  – E.g., memory, or functional units in a superscalar.
• Data hazards:
  – An instruction needs the result of a previous instruction:
        r1 = r2 + r3
        r4 = r1 + r1
    or
        r1 = [r2]
        r4 = r1 + r1
  – Solved by forwarding and/or stalling.
  – What about a cache miss?
• Control hazards:
  – Jump and branch addresses are not known until later in the pipeline.
  – Solved by delay slots and/or prediction.

Jump/Branch Delay Slot(s)

• Control hazards, i.e. jump/branch instructions:
  – An unconditional jump address is available only after Decode.
  – A conditional branch address is available only after Execute.
• [Pipeline diagram: the jump/branch flows through F D E M W while instructions 2-4 behind it cannot be fetched from the target until the address is known.]
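As a rough illustration of the data-hazard cases above, the following Python sketch (my own; the stage numbers are an assumption modeled on a classic five-stage pipeline with full forwarding) computes how many stall cycles must separate a producer from its consumer:

  # Sketch: stalls needed between a producer and a dependent consumer in a
  # 5-stage pipeline (F=0, D=1, E=2, M=3, W=4), assuming results can be
  # forwarded as soon as they are produced. Stage numbers are illustrative.
  PRODUCE_STAGE = {"alu": 2, "load": 3}   # ALU result after E, load result after M
  CONSUME_STAGE = 2                        # operands are needed entering E

  def stalls_needed(producer_kind, distance=1):
      # distance = how many instructions separate producer and consumer
      ready = PRODUCE_STAGE[producer_kind] + 1   # relative cycle the value is forwardable
      needed = CONSUME_STAGE + distance          # relative cycle the consumer wants it
      return max(0, ready - needed)

  print(stalls_needed("alu"))    # r1 = r2 + r3 ; r4 = r1 + r1  -> 0 stalls with forwarding
  print(stalls_needed("load"))   # r1 = [r2]    ; r4 = r1 + r1  -> 1 stall (load-use hazard)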
Jump/Branch Delay Slot(s)

• One option is to stall the pipeline (hardware solution).
• Another option is to insert no-op instructions after the jump/branch (software solution).
• Both degrade performance!

Jump/Branch Delay Slot(s)

• Another option is for the branch to take effect only after its delay slots.
• I.e., some instructions always get executed after the branch but before the branching takes effect: the branch is followed by instr x and instr y in its delay slots, and only then does fetching continue at the target.

Jump/Branch Delay Slots

• In other words, the instruction(s) in the delay slot(s) of the jump/branch instruction always get executed when the branch is executed (regardless of the branch outcome).
• Fetching from the branch target begins only after these instructions complete.

      bgt r3, L1
      <delay-slot instruction(s)>
      :
      :
  L1:

• Which instruction(s) should the compiler put in the delay slot(s)?

Branch Prediction

• Current processors will speculatively execute at conditional branches:
  – If the branch direction is correctly guessed, great!
  – If not, the pipeline is flushed before the speculative instructions commit (WB).
• Why not just let the compiler schedule around branches?
  – The average number of instructions per basic block in typical C code is about 5.
  – Branches are not statically predictable.
  – What happens if you have a 20-stage pipeline?
Data Hazards

• Flow dependence through an ALU result:
      r1 = r2 + r3
      r4 = r1 + r1
  The value r2 + r3 is available at the end of the E stage, so it can be forwarded to the next instruction.
• Flow dependence through a load:
      r1 = [r2]
      r4 = r1 + r1
  The value [r2] is available only after the M stage, so the dependent instruction must stall or be scheduled later.

Defining Dependencies

• Flow dependence   (W -> R)  δf  ("true" dependence)
• Anti-dependence   (R -> W)  δa  ("false" dependence)
• Output dependence (W -> W)  δo
• Input dependence  (R -> R)  δi  (not generally defined as a constraint)

  S1) a = 0;
  S2) b = a;
  S3) c = a + d + e;
  S4) d = b;
  S5) b = 5 + e;

Example Dependencies

  S1) a = 0;
  S2) b = a;
  S3) c = a + d + e;
  S4) d = b;
  S5) b = 5 + e;

  S1 δf S2  due to a
  S1 δf S3  due to a
  S2 δf S4  due to b
  S3 δa S4  due to d
  S4 δa S5  due to b
  S2 δo S5  due to b
  S3 δi S5  due to e

Renaming of Variables

• Sometimes constraints are not "real," in the sense that a simple renaming of variables/registers can eliminate them:
  – Output dependence (WW): A and B write to the same variable.
  – Anti-dependence (RW): A reads from a variable to which B writes.
• In such cases, the order of A and B cannot be changed unless variables are renamed.
  – Can sometimes be done by the hardware, to a limited extent.
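The dependence list above can be computed mechanically from each statement's def and use sets. Below is a minimal Python sketch (my own illustration, not from the slides) whose def/use sets are transcribed from S1-S5:

  # Sketch: classify dependences between statements from their def/use sets.
  stmts = {
      "S1": ({"a"}, set()),            # a = 0
      "S2": ({"b"}, {"a"}),            # b = a
      "S3": ({"c"}, {"a", "d", "e"}),  # c = a + d + e
      "S4": ({"d"}, {"b"}),            # d = b
      "S5": ({"b"}, {"e"}),            # b = 5 + e
  }
  order = list(stmts)
  for i, s1 in enumerate(order):
      d1, u1 = stmts[s1]
      for s2 in order[i + 1:]:
          d2, u2 = stmts[s2]
          for v in d1 & u2: print("%s δf %s due to %s" % (s1, s2, v))  # W -> R
          for v in u1 & d2: print("%s δa %s due to %s" % (s1, s2, v))  # R -> W
          for v in d1 & d2: print("%s δo %s due to %s" % (s1, s2, v))  # W -> W
          for v in u1 & u2: print("%s δi %s due to %s" % (s1, s2, v))  # R -> R

This reproduces the list on the slide; it additionally reports S2 δi S3 (due to a), an input dependence the slide does not draw, which is harmless since input dependences impose no ordering constraint.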
Register Renaming Example

  Original:            Renamed:             Renamed and reordered:
  r1 ← r2 + 1          r7 ← r2 + 1          r7 ← r2 + 1
  [fp+8] ← r1          [fp+8] ← r7          r1 ← r3 + 2
  r1 ← r3 + 2          r1 ← r3 + 2          [fp+8] ← r7
  [fp+12] ← r1         [fp+12] ← r1         [fp+12] ← r1

• Phase ordering problem:
  – We can perform register renaming after register allocation,
    • but it is constrained by the available registers
    • and by the values live on entry/exit.
  – Instead, do scheduling before register allocation.

Scheduling a BB

• Example: x ← w * 2 * x * y * z
• What do we need to know?
  – Latency of operations
  – # of registers
• Assume the latencies: load 5, store 5, mult 2, others 1.
• Also assume operations are non-blocking.
• The code for the basic block:
      r1 ← [fp+w]
      r2 ← 2
      r1 ← r1 * r2
      r2 ← [fp+x]
      r1 ← r1 * r2
      r2 ← [fp+y]
      r1 ← r1 * r2
      r2 ← [fp+z]
      r1 ← r1 * r2
      [fp+x] ← r1

Scheduling a BB

• With the assumed latencies (load 5, store 5, mult 2, others 1) and non-blocking operations, issuing the code in its original order gives:

      cycle  1   r1 ← [fp+w]
      cycle  2   r2 ← 2
      cycle  6   r1 ← r1 * r2
      cycle  7   r2 ← [fp+x]
      cycle 12   r1 ← r1 * r2
      cycle 13   r2 ← [fp+y]
      cycle 18   r1 ← r1 * r2
      cycle 19   r2 ← [fp+z]
      cycle 24   r1 ← r1 * r2
      cycle 26   [fp+x] ← r1
      cycle 33   r1 can be used again

We Can Do Better

• Use more registers so the loads can be issued back to back:

      cycle  1   r1 ← [fp+w]
      cycle  2   r2 ← [fp+x]
      cycle  3   r3 ← [fp+y]
      cycle  4   r4 ← [fp+z]
      cycle  5   r5 ← 2
      cycle  6   r1 ← r1 * r5
      cycle  8   r1 ← r1 * r2
      cycle 10   r1 ← r1 * r3
      cycle 12   r1 ← r1 * r4
      cycle 14   [fp+x] ← r1
      cycle 19   r1 can be used again

• We can do even better if we assume what?
Defining Better

• Comparing the two schedules above: the original order issues the store at cycle 26 and frees r1 at cycle 33; the reordered version issues the store at cycle 14 and frees r1 at cycle 19. "Better" here means fewer total cycles.

The Scheduler

• Given:
  – Code to schedule
  – Resources available (FUs and # of registers)
  – Latencies of instructions
• Goal:
  – Correct code
  – Better code [fewer cycles, less power, fewer registers, ...]
  – Do it quickly

More Abstractly

• Given a graph G = (V, E) where:
  – nodes are operations,
    • each operation has an associated delay and a type,
  – edges between nodes represent dependencies,
  – and R(t) is the number of resources of type t.
• A schedule assigns to each node a cycle number S(n) such that:
  – S(n) ≥ 0,
  – if (n, m) ∈ E, then S(m) ≥ S(n) + delay(n),
  – |{ n | S(n) = x and type(n) = t }| ≤ R(t) for every cycle x and type t.
• The goal is the shortest-length schedule, where the length is
  L(S) = max over n of S(n) + delay(n).

List Scheduling

• Keep a list of available instructions: if we are at cycle k, the list contains every unscheduled instruction whose predecessors p all satisfy S(p) + delay(p) ≤ k.
• Pick some instruction n from the list such that there are free resources for type(n).
• Update the list of available instructions and continue.
• It is all in how we pick instructions (a small example follows).
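A minimal list scheduler in Python, as an illustration of the definitions above (my own sketch, not the course's reference code). It assumes a single-issue machine with non-blocking operations and uses the latency-weighted longest path to the end of the block as the priority; the DAG and latencies are the x ← w * 2 * x * y * z example from the earlier slides:

  LAT = {"load": 5, "store": 5, "mult": 2, "other": 1}

  # node: (kind, list of predecessors)
  dag = {
      "ld_w": ("load", []), "ld_x": ("load", []), "ld_y": ("load", []),
      "ld_z": ("load", []), "two": ("other", []),
      "m1": ("mult", ["ld_w", "two"]),
      "m2": ("mult", ["m1", "ld_x"]),
      "m3": ("mult", ["m2", "ld_y"]),
      "m4": ("mult", ["m3", "ld_z"]),
      "st": ("store", ["m4"]),
  }
  succs = {n: [] for n in dag}
  for n, (_, preds) in dag.items():
      for p in preds:
          succs[p].append(n)

  def priority(n):
      # longest latency-weighted path from n to the end of the block
      kind, _ = dag[n]
      return LAT[kind] + max((priority(s) for s in succs[n]), default=0)

  schedule, done_at = {}, {}
  cycle = 1
  while len(schedule) < len(dag):
      ready = [n for n in dag if n not in schedule
               and all(done_at[p] <= cycle for p in dag[n][1])]
      if ready:
          n = max(ready, key=priority)          # pick highest-priority ready op
          schedule[n] = cycle
          done_at[n] = cycle + LAT[dag[n][0]]   # result available after its latency
      cycle += 1

  for n in sorted(schedule, key=schedule.get):
      print("cycle %2d: %s" % (schedule[n], n))

With these priorities it issues the four loads first and starts the store at cycle 14 (completing at cycle 19), matching the "We can do better" schedule except that the constant happens to be issued before the last load.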
Lots of Heuristics

• Schedule forward or backward.
• Choose instructions on the critical path first.
• ASAP or ALAP placement.
• Balance path lengths.
• Use depth in the schedule graph.

DLS (1995)

• Aim: avoid pipeline hazards in the load/store unit:
  – a load followed by a use of the target register,
  – a store followed by a load.
• Simplifies the problem in two ways:
  – 1-cycle latency for load/store,
  – the dependence graph includes all dependencies (WaW included).

The Algorithm

• Construct the scheduling DAG.
• Make the sources of the DAG the initial candidates.
• Pick a candidate, preferring (a sketch of one possible priority function follows the example):
  – an instruction with an interlock,
  – an instruction with a large number of successors,
  – an instruction with the longest path to a root.
• Add newly available instructions to the candidate list, and repeat.

Example:

  1)  ld  r1 ← [a]
  2)  ld  r2 ← [b]
  3)  add r1 ← r1 + r2
  4)  ld  r2 ← [c]
  5)  ld  r3 ← [d]
  6)  mul r4 ← r2 * r3
  7)  add r1 ← r1 + r4
  8)  add r2 ← r2 + r3
  9)  mul r2 ← r2 * r3
  10) add r1 ← r1 + r2
  11) st  [a] ← r1
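One way to encode those three preferences is as a lexicographic priority key, sketched below in Python. This is my own illustration (not the DLS algorithm itself); the fields interlocks, num_successors, and longest_path are assumed to be precomputed from the scheduling DAG, and the example values are invented:

  # Sketch: lexicographic tie-breaking among candidate instructions.
  def candidate_key(c):
      return (
          c["interlocks"],        # 1. prefer instructions with an interlock
          c["num_successors"],    # 2. then instructions with the most successors
          c["longest_path"],      # 3. then instructions with the longest path to a root
      )

  candidates = [   # illustrative values only
      {"name": "ld r2 ← [c]",      "interlocks": True,  "num_successors": 2, "longest_path": 4},
      {"name": "add r2 ← r2 + r3", "interlocks": False, "num_successors": 1, "longest_path": 3},
  ]
  print(max(candidates, key=candidate_key)["name"])   # the load wins on the first criterion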
Trace Scheduling

• Basic blocks typically contain a small number of instructions.
• With many FUs, we may not be able to keep all the units busy with just the instructions of one BB.
• Trace scheduling allows scheduling across BBs.
• The basic idea is to determine dynamically (e.g., from profiles) which blocks are executed most frequently. A frequently executed path of BBs is called a trace.

Trace Scheduling

• [Figure: a CFG with blocks A through H; the trace A, B, D, E, G, H is selected, with split (S) and join (J) points where the off-trace blocks C and F leave and re-enter it.]
• The trace is then scheduled as a single BB.
• Blocks that are not part of the trace must be modified to restore program semantics if/when execution goes off-trace.

Trace Scheduling

• Two examples of the bookkeeping needed when code moves across a split or join point:
  – Moving a = b + c below the split at the branch x > 10?: the on-trace block keeps d = a - 3 together with the moved a = b + c, and a compensation copy of a = b + c is inserted on the off-trace path so f = a + 3 still sees the right value of a.
  – Moving d = a - 3 above a join point: the trace becomes a = b + c; x = x + 1; d = a - 3, and a compensation copy of d = a - 3 is inserted on the off-trace path after a = e * f.

VLIW

• Very Long Instruction Word.
• Multiple function units.
• Statically scheduled.
• Examples:
  – Itanium
  – TI C6x
• A single wide instruction has one slot per unit, e.g.: Memory | ALU | FPU | Branch | ALU.
• Scalability issues?
Why Clusters?

• Not all FUs are the same.
  – On the C6x, for example, adds are available on the L, S, and D units.
• Reduce the number of register-file ports.
  – There is some overlap in what the units can do.
• Reduce the length of the buses.
• Example: TI C6x.
  – Most operations have a delay of 1; loads have a delay of 4, multiplies 2.
• Not all sources are the same:
  – normally both sources come from the unit's own register file;
  – the L, S, and M units can take one operand from the other register file (over the cross path);
  – only L and S are used for copies between register files;
  – for S and M, only the right operand may come from the other register file.
• Not all destinations are the same:
  – if two D-unit operations issue in the same cycle, their sources and destinations must use different register files.

Some More Details

• The C6x has two data paths, A and B, each with its own register file (A0-A15, B0-B15) and four units: L1, S1, M1, D1 and D2, M2, S2, L2.
• [Figure 1: TMS320C62x CPU data paths, showing the two register files, the .L/.S/.M/.D units of each side, the 1X/2X cross paths, the LD1/ST1/LD2/ST2 load/store data paths, and the DA1/DA2 address buses.]

Phase-Ordering

• Valid assembly must be:
  – properly partitioned,
  – properly scheduled,
  – properly register allocated.
• In what order should these be performed? Each ordering has costs:
  – register allocation done early introduces false dependencies and can force unnecessary spilling (reschedule?);
  – partitioning done early can cause unnecessary intercluster communication and reduce ILP.

Partitioning/Scheduling Basics

• Objectives:
  – balance the workload per cluster;
  – minimize critical intercluster communication.
• [Figure: two clusters, each with a register file, functional units, and instruction memory, connected by an interconnection network; an expression DAG (+, >>, &, *, LW) is split across Cluster 1 and Cluster 2, requiring an intercluster move.]
Bottom-Up Greedy (BUG), 1985

• Assigns operations to clusters, then schedules.
• Recurses down the DFG:
  – assigns ops to clusters based on estimates of resource usage,
  – assigns ops on the critical path first,
  – tries to schedule each op as early as possible.
• [Figure: four snapshots of a 12-node DFG being assigned to clusters level by level.]

Integrated Approaches

• Leupers, 2000:
  – combines partitioning & scheduling,
  – iterative approach.
• B-init/B-iter, 2002:
  – initial binding/scheduling,
  – iterative improvement.
• RHOP, 2003:
  – region-based graph partitioning.
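A very rough Python sketch of the bottom-up-greedy idea, paraphrasing the bullets above (this is not Ellis's published algorithm; the earliest-start estimate and the 1-cycle intercluster move cost are my assumptions):

  # Sketch: critical-path-first assignment of ops to the cluster where each
  # can start earliest. Assumes positive latencies, so predecessors sort
  # before their successors when ordered by critical-path length.
  def bug_assign(ops, preds, latency, num_clusters=2):
      assignment, finish = {}, {}

      def cp_len(op):      # latency-weighted height, used as the priority
          return latency[op] + max(
              (cp_len(s) for s in ops if op in preds[s]), default=0)

      for op in sorted(ops, key=cp_len, reverse=True):   # critical path first
          best_cluster, best_start = None, None
          for c in range(num_clusters):
              # estimated start: predecessors must be done, plus a 1-cycle
              # move for each predecessor living on another cluster
              start = max((finish[p] + (1 if assignment[p] != c else 0)
                           for p in preds[op]), default=0)
              if best_start is None or start < best_start:
                  best_cluster, best_start = c, start
          assignment[op] = best_cluster
          finish[op] = best_start + latency[op]
      return assignment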

Leupers Approach

• Integrates partitioning and scheduling.
• Uses simulated annealing (SA) to determine the partition.
• The evaluation step inside the SA loop is the scheduler itself!
• Deals with the details of the architecture.

Example Result

• [Figure: example schedule produced by the integrated approach; not recoverable from the text.]
Approach: Simulated Annealing

• Generate a random partitioning and schedule it.
• Repeatedly pick one node, move it to the other cluster, and reschedule.
• Keep the move if the new cost is lower, or with probability exp(-delta/T) if it is higher; otherwise undo the swap.
• Reduce the temperature T and repeat.

Basic Algorithm

  algorithm Partition
    input:  DFG G with n nodes
    output: P : array [1..n] of {0,1}          -- cluster assignment
    var     i, r, cost, mincost : int;  T : float;
  begin
    T := 10;
    P := RandomPartitioning();
    mincost := LISTSCHEDULING(G, P);
    while T > 0.01 do
      for i := 1 to 50 do
        r := RANDOM(1, n);
        P[r] := 1 - P[r];                      -- move node r to the other cluster
        cost := LISTSCHEDULING(G, P);          -- the scheduler is the cost function
        delta := cost - mincost;
        if delta < 0 or RANDOM(0,1) < exp(-delta/T) then
          mincost := cost
        else
          P[r] := 1 - P[r]                     -- undo the move
        end if
      end for;
      T := 0.9 * T
    end while;
    return P
  end
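For a runnable flavor of the same loop, here is a compact Python transcription of the pseudocode above; list_schedule_length is a stand-in for the real list scheduler used as the cost function:

  import math
  import random

  def sa_partition(num_nodes, list_schedule_length, T=10.0, inner=50, alpha=0.9):
      # list_schedule_length(P) -> schedule length for partition P (the cost)
      P = [random.randint(0, 1) for _ in range(num_nodes)]   # random initial partition
      mincost = list_schedule_length(P)
      while T > 0.01:
          for _ in range(inner):
              r = random.randrange(num_nodes)
              P[r] = 1 - P[r]                                 # tentatively move node r
              cost = list_schedule_length(P)
              delta = cost - mincost
              if delta < 0 or random.random() < math.exp(-delta / T):
                  mincost = cost                              # accept the move
              else:
                  P[r] = 1 - P[r]                             # undo the move
          T *= alpha
      return P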

Scheduling

• Use a list scheduler.
• The tie-breaker for choosing the next ready node is minimum ALAP time.
• The heart of the routine is ScheduleNode.

  algorithm ListScheduling(G, P)
    input:  DFG G; partition P
    output: length of the schedule
    var m : DFG node;  S : schedule
  begin
    mark all nodes unscheduled;
    S := ∅;
    while not all nodes are scheduled do
      m := NextReadyNode(G);
      S := ScheduleNode(S, m, P);
      mark m as scheduled
    end;
    return Length(S)
  end

ScheduleNode

• Goal: insert node m as early as possible,
  – without violating resource constraints,
  – without violating dependence constraints.
• First try a slot based on m's ASAP time.
• Until m is scheduled:
  – see if there is an FU that can execute m;
  – check the source registers:
    • if both come from the same RF as the FU, we are done;
    • if not, we must decide what to do.
Dealing with Cross-RF Transfers

• Two ways to transfer a value across register files:
  – the source can be transferred over the cross path in this cycle, or
  – the source can be copied into the other RF in a previous cycle.
• If neither is possible:
  – maybe the operation is commutative?
  – otherwise, try to schedule m in the next cycle.
• CheckArgTransfer / TryScheduleTransfers:
  – reuse an existing copy first (if the value was already moved, reuse that move);
  – insert a move in an earlier cycle if there is room for one;
  – use the cross path (X-path) if it is available and valid for the arguments;
  – try commuting the operands;
  – this can fail, in which case we retry in a later cycle.

Basic Scheduling of a Node

  algorithm ScheduleNode(S, m, P)
    input:  schedule S, node m, partition P
    output: new schedule with m inserted
    var cs : control step
  begin
    cs := EarliestControlStep(m) - 1;
    repeat
      cs := cs + 1;
      f := GetNodeUnit(m, cs, P);            -- find a free FU for m in step cs
      if f = ∅ then continue;
      if m needs an argument from the other RF then
        CheckArgTransfer();
        if no transfer is possible then continue;
        else TryScheduleTransfers();
      end if
    until m is scheduled
  end

Handling Loads

• After scheduling an op to a cluster, check whether it is a load; if so, determine the partition of its result.
• The scheduling of a load uses the RF of its address.
• Scheduling the result:
  – check whether both load/store units are free;
  – if so, check where the result is used most and place it in that RF.

Benefits/Drawbacks

• Does not predetermine the partitioning.
• Handles many real-world details.
• Makes local decisions only.
• Time consuming.
• Very specific to the C6x.
• May not scale to multiple clusters?
RHOP Partitioning/Scheduling

• 2003; Chu, Fan, Mahlke.
• Global scheduling and partitioning.
• Based on graph partitioning.
• Avoids local scheduling pitfalls.

RHOP Approach

• Opposite approach to conventional clustering.
• Global view:
  – graph-partitioning strategy [Aletà '01, '02];
  – identify tightly coupled operations and treat them uniformly.
• Non scheduler-centric mindset:
  – a prescheduling technique;
  – avoids "scheduling" during partitioning, so it doesn't complicate the scheduler;
  – enables a global view of the code;
  – estimate-based approach [Lapinskii '01].

Region-based Hierarchical Operation Partitioning (RHOP)

• Code is considered a region at a time.
• Weight calculation creates guides for good partitions.
• Partitioning clusters the operations based on the given weights.
• [Figure: program -> region -> weight calculation -> weighted graph -> partitioning.]

Edge Weights

• Slack distribution allocates slack to certain edges:
  – edge slack = lstart(dest) - latency(edge) - estart(src);
  – a first-come, first-served method is used to distribute the slack.
• Edges are weighted by how critical they are: weight 10 for critical edges, 8 for edges with no slack left after distribution, and 1 for edges with slack.
• [Figure: a 14-node DAG annotated with (estart, lstart) for each node and with the resulting edge weights.]
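To make the slack formula concrete, here is a small Python sketch (mine, using an invented four-node DAG rather than the one in the figure) that computes estart, lstart, and the slack of every edge:

  # Sketch: slack(src -> dst) = lstart[dst] - latency(edge) - estart[src]
  edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 1}  # edge latencies (invented)
  nodes = ["a", "b", "c", "d"]                 # listed in topological order

  preds = {n: [s for (s, t) in edges if t == n] for n in nodes}
  succs = {n: [t for (s, t) in edges if s == n] for n in nodes}

  estart = {}
  for n in nodes:                              # earliest start times, forward pass
      estart[n] = max((estart[p] + edges[(p, n)] for p in preds[n]), default=0)

  length = max(estart.values())
  lstart = {}
  for n in reversed(nodes):                    # latest start times, backward pass
      lstart[n] = min((lstart[s] - edges[(n, s)] for s in succs[n]), default=length)

  for (s, t), lat in edges.items():
      print("%s->%s: slack = %d" % (s, t, lstart[t] - lat - estart[s]))

Edges on the critical path (a->b->d here) come out with slack 0; the others have positive slack that the first-come, first-served distribution can hand out.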
RHOP - Partitioning Phase

• Modified multilevel-KL algorithm [Kernighan '69].
• Multilevel graph partitioning consists of two stages:
  1. Coarsening stage: meld nodes together to reduce the graph's weight, which tends to keep edges on the critical path together.
  2. Refinement stage.

Cluster Refinement

• Three questions to answer:
  1. Which cluster should operations move from?
  2. How good is the current partition?
  3. How profitable is it to move X from cluster A to cluster B?

Node Weights

• Create a metric to determine resource usage, distinguishing dedicated resources (the FUs) from shared resources (buses, ports):

    op_wgt(c)     = 1 / (# of ops that can execute on cluster c in one cycle)
    shared_wgt(c) = (resource-limited schedule length on c) / (# of ops)

• [Figure: two cluster datapaths with I, F, M, B units and a register file; op_wgt accounts for the FUs, shared_wgt for the buses and ports.]

Where Should Operations Move From?

• Weight the dedicated-resource pressure on cluster c in cycle t by

    Iwgt(c, t) = max over op-groups o of  Σ over ops in o at t of  op_wgt(c) / (slack(op) + 1)
Where Should Operations Move From? (cont.)

• Weight the shared-resource pressure on cluster c in cycle t by

    Twgt(c, t) = (# of ops in c at t) / (slack_ave + 1) * shared_wgt(c)

• A cluster's overall weight sums the worse of the two pressures over the cycles of the estimated schedule:

    cluster_wgt(c) = Σ over t = 0 .. max estart of  max(Iwgt(c, t), Twgt(c, t))

How Good is this Partition?

• The schedule-length estimate takes the most loaded cluster in each cycle:

    SL = Σ over t = 0 .. max estart of  max over clusters i of  Cwgt(i, t)

• Example: with per-cycle weights 2.5, 2.0, 0.5, 0.0, 0.0 for cluster 1 (cluster_wgt1 = 5.0) and 0.0, 0.33, 0.33, 0.0, 0.0 for cluster 2 (cluster_wgt2 = 0.67), the estimate is SL = 2.5 + 2.0 + 0.5 + 0.0 + 0.0 = 5.0.
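A two-line check of the SL estimate against the example numbers above (my own sketch):

  # Per-cycle cluster weights from the slide's example.
  cwgt = {
      "cluster1": [2.5, 2.0, 0.5, 0.0, 0.0],
      "cluster2": [0.0, 0.33, 0.33, 0.0, 0.0],
  }
  # SL takes the most loaded cluster in each cycle and sums over cycles.
  SL = sum(max(w[t] for w in cwgt.values()) for t in range(5))
  print(SL)   # 5.0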

How Good is This Proposed Move?

• Gain metrics for a proposed move:

    Egain = Σ (edge_wgt of merged edges) - Σ (edge_wgt of cut edges)
    Lgain = SL(before) - SL(after)
    Mgain = Egain + Lgain * CRITICAL_EDGE_COST

• Example from the figure: SL(before) = 5.0 and SL(after) = 4.5, so Lgain = 0.5; Egain = -1.0; the combined metric is Mgain = 4.0.

Experimental Evaluation

• Trimaran toolset: a retargetable VLIW compiler.
• Evaluated DSP kernels and SPECint2000.
• 64 registers per cluster; latencies similar to Itanium; perfect caches.
• For more detailed results, see the paper.

  Name     Configuration
  2-1111   2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
  2-2111   2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
  4-1111   4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
  4-2111   4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
  4-H      4 heterogeneous clusters: IM, IF, IB and IMF clusters
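Plugging the example numbers into the Mgain formula (assuming CRITICAL_EDGE_COST = 10, which is my inference; it matches the critical-edge weight from the Edge Weights slide and the result shown):

  # Check of the Mgain example above.
  Egain = -1.0
  Lgain = 5.0 - 4.5                 # SL(before) - SL(after)
  CRITICAL_EDGE_COST = 10           # assumed value, consistent with the slide's numbers
  print(Egain + Lgain * CRITICAL_EDGE_COST)   # 4.0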
2 Cluster Results vs. 1 Cluster / 4 Cluster Results vs. 1 Cluster

• [Charts: performance of BUG and RHOP normalized to a single-cluster machine, for the 2-cluster and 4-cluster configurations, across DSP kernels (huffman, lyapunov, fir, channel, LU, dct, heat, rls, fsed, atmcell, sobel, halftone) and SPECint2000 benchmarks (164.gzip, 175.vpr, 181.mcf, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip, 300.twolf), plus the average.]

Conclusions

• RHOP is a new, region-scoped method for clustering operations:
  – a prescheduling technique;
  – estimates of schedule length are used instead of running the scheduler;
  – it combines slack distribution with multilevel-KL partitioning.
• It performs better as the number of resources increases.

  Average improvement of RHOP vs. BUG (relative to the scheduler):
    Machine    Improvement
    2-1111      -1.8%
    2-2111       3.7%
    4-1111      14.3%
    4-2111      15.3%
    4-H          8.0%

Previous Work

• [Table: prior clustering approaches (UAS, CARS, Convergent, Leupers, Capitanio, GP(B), B-ITER, BUG, RHOP) classified by when partitioning happens relative to scheduling (during vs. before), scope (local vs. region), the desirability metric used (scheduler, pseudo-schedule, estimates, counts), whether grouping is hierarchical or flat, and whether the approach is iterative.]
