Instruction Scheduling
15-745
Copyright © Seth Copen Goldstein 2000-5
(some slides borrowed from M. Voss)

Instruction Scheduling
• Most modern processors have the ability to execute several adjacent instructions simultaneously.
  – Pipelined machines.
  – Very-long-instruction-word machines (VLIW).
  – Superscalar machines.
  – Dynamic scheduling/out-of-order machines.
• ILP is limited by several kinds of execution constraints:
  – Data dependence constraints.
  – Resource constraints ("hazards").
  – Control hazards.
What we will cover
• Scheduling basic blocks
  – List scheduling
  – Long-latency operations
  – Delay slots
• Scheduling for clustered architectures
• Software pipelining (next week)
• What we need to know
  – pipeline structure
  – data dependencies
  – register renaming

Instruction Scheduling
• In the von Neumann model of execution, an instruction starts only after its predecessor completes.
    instr 1 → instr 2        (time)
• This is not a very efficient model of execution.
  – von Neumann bottleneck or the memory wall.
Instruction Pipelines
• Almost all processors today use instruction pipelines to allow overlap of instructions (the Pentium 4 has a 20-stage pipeline!).
• The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.
  – F: Fetch instruction from cache or memory.
  – D: Decode instruction.
  – E: Execute: ALU operation or address calculation.
  – M: Memory access.
  – W: Write back result into register.
• Each of these stages completes its operation in one cycle (shorter than the cycle in the von Neumann model).
• An instruction still takes the same time to execute.

Instruction Pipelines
• However, we overlap these stages in time to complete an instruction every cycle.
  [Pipeline diagram: instrs 1–7 each pass through F D E M W, offset by one cycle — filling the pipeline, steady state, then draining the pipeline.]
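To make the overlap concrete, here is a minimal sketch (assuming an ideal pipeline with no stalls and one issue per cycle) comparing total cycles with and without pipelining:

  def unpipelined_cycles(n_instrs, n_stages):
      # von Neumann-style execution: each instruction finishes all stages
      # before the next one starts.
      return n_instrs * n_stages

  def pipelined_cycles(n_instrs, n_stages):
      # Ideal pipeline, no hazards: fill the pipeline once, then one
      # instruction completes every cycle.
      return n_stages + (n_instrs - 1)

  # 7 instructions on the 5-stage F/D/E/M/W pipeline from the slide:
  print(unpipelined_cycles(7, 5))  # 35 cycles
  print(pipelined_cycles(7, 5))    # 11 cycles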
Pipeline Hazards
• Data hazards
    r1 = [r2]
    r4 = r1 + r1
  – solved by forwarding and/or stalling
  – cache miss?
• Control hazards
  – jump & branch address not known until later in pipeline
  – solved by delay slot and/or prediction
Jump/Branch Delay Slot(s)
• One option is to stall the pipeline (hardware solution).
• Another option is to insert no-op instructions (software).
• Both degrade performance!
  [Pipeline diagrams: a jump followed by instr 2, with the fetch of instr 2 delayed; the same sequence with a nop issued after the jump.]

Jump/Branch Delay Slot(s)
• Another option is for the branch to take effect only after the delay slots.
• I.e., some instructions always get executed after the branch but before the branching takes effect.
  [Pipeline diagram: bra, then instr x and instr y in the delay slots, then instr 2, instr 3 from the branch target.]
Jump/Branch Delay Slots
• In other words, the instruction(s) in the delay slots of the jump/branch instruction always get(s) executed when the branch is executed (regardless of the branch result).
• Fetching from the branch target begins only after these instructions complete.
      bgt r3, L1
      :            ← delay-slot instructions, always executed
      :
    L1:

Branch Prediction
• Current processors will speculatively execute at conditional branches
  – if a branch direction is correctly guessed, great!
  – if not, the pipeline is flushed before instructions commit (WB).
• Why not just let the compiler schedule?
  – The average number of instructions per basic block in typical C code is about 5 instructions.
  – branches are not statically predictable
  – What happens if you have a 20-stage pipeline?
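Returning to the delay-slot idea: filling the slot is itself a small scheduling problem. A minimal sketch (not from the slides) under simplifying assumptions — a straight-line block of register-only instructions, each register defined at most once — that moves one instruction the branch does not depend on into the slot, or falls back to a nop:

  def fill_delay_slot(block):
      # block: list of (dest, srcs) tuples; the last entry is the branch,
      # with dest None and srcs holding its condition registers.
      _branch_dest, branch_srcs = block[-1]
      for i in range(len(block) - 2, -1, -1):
          dest, _srcs = block[i]
          # Registers read by the branch or by anything between the candidate
          # and the branch; delaying a definition of one of these would break
          # a flow dependence.
          read_later = set(branch_srcs)
          for _d, s in block[i + 1:-1]:
              read_later |= set(s)
          if dest not in read_later:
              candidate = block.pop(i)
              return block + [candidate]      # candidate now fills the delay slot
      return block + [(None, [])]             # nothing safe to move: emit a nop

  # The add into r5 is independent of the branch condition (r3), so it moves:
  block = [("r3", ["r1", "r2"]),   # r3 = r1 + r2
           ("r5", ["r6", "r7"]),   # r5 = r6 + r7
           (None, ["r3"])]         # bgt r3, L1
  print(fill_delay_slot(block))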
Data Dependences
• Anti-dependence (R → W), δ^a — a "false" dependence
• Output dependence (W → W), δ^o — a "false" dependence
• Input dependence (R → R), δ^i — not generally defined as a constraint

  [Latency diagrams: for r1 = r2 + r3 followed by r4 = r1 + r1, the sum is available right after E;
   for r1 = [r2] followed by r4 = r1 + r1, the loaded value is available only after M.]

Example:
  S1) a = 0;
  S2) b = a;
  S3) c = a + d + e;
  S4) d = b;
  S5) b = 5 + e;
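To see the definitions at work on S1–S5, here is a small Python sketch (def/use sets written out by hand) that lists every flow, anti-, and output dependence in the example:

  from itertools import combinations

  # (name, defs, uses) for the straight-line example S1–S5
  stmts = [("S1", {"a"}, set()),            # a = 0
           ("S2", {"b"}, {"a"}),            # b = a
           ("S3", {"c"}, {"a", "d", "e"}),  # c = a + d + e
           ("S4", {"d"}, {"b"}),            # d = b
           ("S5", {"b"}, {"e"})]            # b = 5 + e

  for (n1, d1, u1), (n2, d2, u2) in combinations(stmts, 2):
      if d1 & u2: print(f"{n1} -> {n2}: flow dependence on {d1 & u2}")
      if u1 & d2: print(f"{n1} -> {n2}: anti-dependence on {u1 & d2}")
      if d1 & d2: print(f"{n1} -> {n2}: output dependence on {d1 & d2}")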
Register Renaming Example
    original             renamed              renamed + scheduled
    r1 ← r2 + 1          r7 ← r2 + 1          r7 ← r2 + 1
    [fp+8] ← r1          [fp+8] ← r7          r1 ← r3 + 2
    r1 ← r3 + 2          r1 ← r3 + 2          [fp+8] ← r7
    [fp+12] ← r1         [fp+12] ← r1         [fp+12] ← r1

Phase ordering problem
• Can perform register renaming after register allocation
• Constrained by available registers
• Constrained by live on entry/exit
• Instead, do scheduling before register allocation

Scheduling a BB
• x ← w * 2 * x * y * z
• What do we need to know?
  – Latency of operations
  – # of registers
• Assume latencies: load 5, store 5, mult 2, others 1
• Also assume operations are non-blocking

    r1 ← [fp+w]
    r2 ← 2
    r1 ← r1 * r2
    r2 ← [fp+x]
    r1 ← r1 * r2
    r2 ← [fp+y]
    r1 ← r1 * r2
    r2 ← [fp+z]
    r1 ← r1 * r2
    [fp+w] ← r1
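A minimal renaming sketch (a hypothetical helper, not the course's renamer): give every register definition a fresh virtual name and rewrite later uses, which removes the anti- and output dependences that block reordering:

  import itertools

  def rename_registers(block):
      # block: list of (dest, srcs) register-transfer instructions.
      # Returns an equivalent block where every register definition writes a
      # fresh virtual register, so only true (flow) dependences remain.
      fresh = (f"v{i}" for i in itertools.count())
      current = {}                                  # architectural reg -> current name
      renamed = []
      for dest, srcs in block:
          new_srcs = [current.get(s, s) for s in srcs]
          if dest.startswith("r"):                  # rename register defs only
              new_dest = next(fresh)
              current[dest] = new_dest
          else:                                     # e.g. a store to [fp+8]
              new_dest = dest
          renamed.append((new_dest, new_srcs))
      return renamed

  block = [("r1", ["r2"]),        # r1 ← r2 + 1
           ("[fp+8]", ["r1"]),    # [fp+8] ← r1
           ("r1", ["r3"]),        # r1 ← r3 + 2
           ("[fp+12]", ["r1"])]   # [fp+12] ← r1
  for instr in rename_registers(block):
      print(instr)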
Lots of Heuristics
• forward or backward
• choose instructions on critical path
• ASAP or ALAP
• balanced paths
• depth in schedule graph

DLS (1995)
• Aim: avoid pipeline hazards in the load/store unit
  – load followed by use of target reg
  – store followed by load
• Simplifies in two ways
  – 1-cycle latency for load/store
  – includes all dependencies (WaW included)
The algorithm
• Construct the scheduling DAG
• Make the srcs of the DAG the initial candidates
• Pick a candidate
  – Choose an instruction with an interlock
  – Choose an instruction with a large number of successors
  – Choose the one with the longest path to the root
• Add newly available instructions to the candidate list

Example:
   1) ld  r1 ← [a]
   2) ld  r2 ← [b]
   3) add r1 ← r1 + r2
   4) ld  r2 ← [c]
   5) ld  r3 ← [d]
   6) mul r4 ← r2 * r3
   7) add r1 ← r1 + r4
   8) add r2 ← r2 + r3
   9) mul r2 ← r2 * r3
  10) add r1 ← r1 + r2
  11) st  [a] ← r1
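A sketch (not the DLS implementation itself) of building the scheduling DAG for this example — including the false WaR/WaW dependences, as the slide requires — and of the "longest path to root" priority used to rank candidates:

  from functools import lru_cache

  # (id, opcode, defs, uses) for instructions 1–11 of the example;
  # memory locations a–d are modeled like registers so the store gets ordered.
  instrs = [
      (1,  "ld",  {"r1"}, {"a"}),
      (2,  "ld",  {"r2"}, {"b"}),
      (3,  "add", {"r1"}, {"r1", "r2"}),
      (4,  "ld",  {"r2"}, {"c"}),
      (5,  "ld",  {"r3"}, {"d"}),
      (6,  "mul", {"r4"}, {"r2", "r3"}),
      (7,  "add", {"r1"}, {"r1", "r4"}),
      (8,  "add", {"r2"}, {"r2", "r3"}),
      (9,  "mul", {"r2"}, {"r2", "r3"}),
      (10, "add", {"r1"}, {"r1", "r2"}),
      (11, "st",  {"a"},  {"r1"}),
  ]

  # Edge i -> j whenever j must stay after i: flow (def-use), anti (use-def),
  # or output (def-def) dependence on some register or memory location.
  succs = {i: set() for i, *_ in instrs}
  for a in range(len(instrs)):
      for b in range(a + 1, len(instrs)):
          ia, _, da, ua = instrs[a]
          ib, _, db, ub = instrs[b]
          if (da & ub) or (ua & db) or (da & db):
              succs[ia].add(ib)

  @lru_cache(maxsize=None)
  def height(i):
      # Longest path (in edges) from i down to a sink of the DAG.
      return max((1 + height(j) for j in succs[i]), default=0)

  # Higher height = more urgent: the "longest path to root" heuristic.
  for i, *_ in instrs:
      print(i, "height", height(i), "succs", sorted(succs[i]))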
Trace Scheduling
• Basic blocks typically contain a small number of instrs.
• With many FUs, we may not be able to keep all the units busy with just the instructions of a BB.
• Trace scheduling allows scheduling across BBs: a likely path of basic blocks, called a trace, is selected.
  [CFG figures: a CFG with blocks A, B, C, D, E, F, G and split/join points S and J; a highlighted path (here through A, D, E, F, G) forms the trace.]

Trace Scheduling
• The trace is then scheduled as a single BB.
  – e.g. on a wide machine (TI C6x: Memory, ALU, FPU, Branch, ALU units), the sequence
        … ; a = b + c; x = x + 1; a = e * f; d = a - 3; …
    can issue x = x + 1 and a = e * f in the same cycle.
• Scalability issues?
Why Clusters?
  [Figure: register-file / functional-unit organizations — a single register file shared by all units
   (L1 S1 M1 D1 D2 M2 S2 L2) versus separate clusters, each with its own register file and I-MEM —
   trading unnecessary spilling against unnecessary communication plus intercluster moves.]

Some more details
• Example: C6x
  – delay 1 mostly, some overlap
  – Add on L, S, D
  – both srcs from same RF
  – if 2 D ops, srcs and dests must be to different RFs
  [Figure: TI C6x datapath — Data path A (.L1, .S1, .M1, .D1) with Register File A (A0–A15) and LD1/ST1
   ports; Data path B (.L2, .S2, .M2, .D2) with Register File B (B0–B15) and LD2/ST2 ports; DA1/DA2
   address paths, a 1X cross path between the files, long src/dst buses, address and data buses, and a
   control register file.]
Bottom-Up Greedy (BUG), 1985
• Assigns operations to clusters, then schedules
• Recurses down the DFG
  – assigns ops to clusters based on estimates of resource usage
  – assigns ops on the critical path first
  – tries to schedule as early as possible

Integrated Approaches
• Leupers, 2000
  – combine partitioning & scheduling
  – iterative approach
• B-init/B-iter, 2002
  – Initial binding/scheduling
  – Iterative improvement
• RHOP, 2003
  [Figure: a small example DFG (nodes 1, 10, 11, 12) shown under several alternative cluster assignments.]
Approach: Simulated Annealing (SA)
  [Flowchart: generate a random partitioning & schedule → pick one node to swap → re-schedule →
   accept the swap if the cost improves (or probabilistically), otherwise undo it → reduce T → repeat.]

Basic Algorithm

algorithm Partition
  input: DFG G with n nodes
  output: P: array [1..n] of 0,1
  var int i, r, cost, mincost; float T;
begin
  T := 10;
  P := RandomPartitioning;
  mincost := LISTSCHEDULING(G, P);
  WHILE_LOOP;
  return P;
end.

WHILE_LOOP:
  while T > 0.01 do
    for i := 1 to 50 do
      r := RANDOM(1, n);
      P[r] := 1 - P[r];
      cost := LISTSCHEDULING(G, P);
      delta := cost - mincost;
      if delta < 0 or RANDOM(0,1) < exp(-delta/T)
        then mincost := cost
        else P[r] := 1 - P[r]
      end if;
    end for;
    T := 0.9 * T;
  end while;
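A direct Python transcription of this sketch, with the list scheduler passed in as a cost function (any function that returns an estimated schedule length for a given partition will do):

  import math
  import random

  def sa_partition(graph, cost_fn, t_init=10.0, t_min=0.01, moves_per_temp=50, alpha=0.9):
      # Binary cluster assignment by simulated annealing.
      # graph: list of nodes; cost_fn(graph, partition) -> schedule length.
      n = len(graph)
      part = [random.randint(0, 1) for _ in range(n)]    # random initial partitioning
      mincost = cost_fn(graph, part)
      t = t_init
      while t > t_min:
          for _ in range(moves_per_temp):
              r = random.randrange(n)
              part[r] = 1 - part[r]                      # flip one node to the other cluster
              cost = cost_fn(graph, part)
              delta = cost - mincost
              # accept improvements, and worse moves with probability exp(-delta/T)
              if delta < 0 or random.random() < math.exp(-delta / t):
                  mincost = cost
              else:
                  part[r] = 1 - part[r]                  # undo the swap
          t *= alpha                                     # cool down
      return part, mincost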
Scheduling
• Use a list scheduler
• Tie breaker for the next ready node is min ALAP
• Heart of the routine is ScheduleNode

algorithm ListScheduling(G, P)
  input: DFG G; partition P
  output: length of schedule
  var m: DFG node; S: schedule
begin
  mark all nodes unscheduled;
  S = ∅;
  while (not all scheduled) do
    m = NextReadyNode(G);
    S = ScheduleNode(S, m, P);
    mark m as scheduled;
  end
  return Length(S)
end

ScheduleNode
• Goal: insert node m as early as possible
  – don't violate resource constraints
  – don't violate dependence constraints
• First try based on ASAP
• Until it is scheduled
  – See if there is an FU that can execute m
  – check source registers
    • if both are from the same RF as the FU, done
    • if not: must decide what to do
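A minimal list-scheduling sketch along these lines (the ALAP priorities, latencies, and a flat per-cycle FU count are inputs assumed for the sketch, not the exact routine from the slides):

  def list_schedule(nodes, preds, latency, alap, fu_count):
      # Greedy cycle-by-cycle list scheduling.
      # nodes: node ids; preds[n]: set of predecessors of n;
      # latency[n]: cycles until n's result is ready; alap[n]: ALAP priority
      # (smaller = more urgent); fu_count: functional units available per cycle.
      start = {}                        # node -> issue cycle
      unscheduled = set(nodes)
      cycle = 0
      while unscheduled:
          # ready = all predecessors issued and their results available
          ready = [n for n in unscheduled
                   if all(p in start and start[p] + latency[p] <= cycle
                          for p in preds[n])]
          ready.sort(key=lambda n: alap[n])       # min-ALAP tie breaker
          for n in ready[:fu_count]:              # resource constraint
              start[n] = cycle
              unscheduled.remove(n)
          cycle += 1
      return start, max(start[n] + latency[n] for n in start)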
Dealing with x-RF transfers
• Two ways to XFER:
  – Source can be Xfered this cycle
  – Source can be copied in the previous cycle
• If neither is true
  – maybe commutative?
  – try to schedule in the next cycle

Basic Scheduling of a Node

algorithm ScheduleNode(S, m, P)
  input: schedule S, node m, partition P
  output: new schedule with m
  var cs: control step
begin
  cs = EarliestControlStep(m) - 1;
  repeat
    cs++;
    f = GetNodeUnit(m, cs, P);
    if (f == ∅) continue;
    if (m needs arg from other RF) then
      CheckArgTransfer();
      if (no transfer possible) then continue;
      else TryScheduleTransfers();
  until (m is scheduled);
  …

CheckArgTransfer / TryScheduleTransfers:
• reuse first — if CSE, reuse
• move if possible — if there is room for a move, do so
• use X-path if possible — if the X-path is available and valid for the args, use it
• try commuting
• Can fail
RHOP Partitioning/Scheduling
• Chu, Fan, Mahlke, 2003
• Global scheduling and partitioning
• Based on graph partitioning

RHOP Approach
• Opposite approach to conventional clustering
• Global view
  – Graph partitioning strategy [Aletà '01, '02]
  – Identify tightly coupled operations and treat them uniformly
Region-based Hierarchical Operation Partitioning (RHOP)
• Flow: program → region → weight calculation → partitioning
• Weight calculation creates guides for good partitions
• Partitioning clusters operations based on the given weights

Edge Weights
• Slack distribution allocates slack to certain edges
  – Edge slack = lstart(dest) − latency(edge) − estart(src)
  – First come, first served method used
  [Figure: a 14-node example DFG with nodes labeled (estart, lstart), from node 1 at (0,0) down to
   node 14 at (4,4); edge weights encode criticality — 1: edge with slack, 8: no slack left after
   distribution, 10: critical edge.]
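A small sketch of the estart/lstart labels and the per-edge slack from the formula above (the first-come-first-served distribution of that slack is left out):

  from collections import deque

  def edge_slacks(nodes, edges, latency):
      # estart/lstart longest-path labels and per-edge slack for a DFG.
      # nodes: node ids; edges: list of (src, dst); latency[(src, dst)]: edge latency.
      succs = {n: [] for n in nodes}
      indeg = {n: 0 for n in nodes}
      for s, d in edges:
          succs[s].append(d)
          indeg[d] += 1

      # topological order (Kahn's algorithm)
      order, work = [], deque(n for n in nodes if indeg[n] == 0)
      while work:
          n = work.popleft()
          order.append(n)
          for d in succs[n]:
              indeg[d] -= 1
              if indeg[d] == 0:
                  work.append(d)

      # earliest start: longest path from any source
      estart = {n: 0 for n in nodes}
      for s in order:
          for d in succs[s]:
              estart[d] = max(estart[d], estart[s] + latency[(s, d)])

      # latest start that preserves the critical-path length
      length = max(estart.values())
      lstart = {n: length for n in nodes}
      for s in reversed(order):
          for d in succs[s]:
              lstart[s] = min(lstart[s], lstart[d] - latency[(s, d)])

      slack = {(s, d): lstart[d] - latency[(s, d)] - estart[s] for s, d in edges}
      return estart, lstart, slack

  # Tiny example: 1 -> 3 -> 4 and 2 -> 4, unit latencies; edge 2 -> 4 has slack 1.
  nodes = [1, 2, 3, 4]
  edges = [(1, 3), (3, 4), (2, 4)]
  lat = {e: 1 for e in edges}
  print(edge_slacks(nodes, edges, lat))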
RHOP – Partitioning Phase / Cluster Refinement
  [Figures: the 14-node example DFG split across two clusters, each with its own register file and
   I, F, M, B functional units; refinement then moves operations between the clusters.]
How Good is this Partition?
• Each cluster c gets a per-cycle weight Cwgt(c, t) built from the operations placed in it:
  – Twgt(c, t): the number of ops in c around cycle t, scaled by 1/(slack_ave + 1) and by shared_wgt(c)
  – Iwgt(c, t): the largest per-opgroup sum of operation weights τ(op, t)/(slack + 1)
  (see the RHOP paper for the exact definitions)
• The estimated schedule length takes the worst cluster in each cycle:
    SL = Σ_{t=0}^{max estart} max_c Cwgt(c, t)

Where Should Operations Move From?
• Example (per-cycle cluster weights for the 14-node DFG):
    cycle        0      1      2      3      4
    Cluster 1:   2.5    2.0    0.5    0.0    0.0    → Cluster_wgt1 = 5.0
    Cluster 2:   0.0    0.33   0.33   0.0    0.0    → Cluster_wgt2 = 0.67
  Taking the per-cycle maximum gives SL = 5.0, so operations should move out of the overloaded Cluster 1.
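A sketch of just the schedule-length estimate above, with the per-cycle cluster weights taken as given (computing them from Twgt/Iwgt is beyond this sketch):

  def schedule_length_estimate(cluster_wgts):
      # cluster_wgts: {cluster: [weight at cycle 0, weight at cycle 1, ...]}
      # SL sums, over all cycles, the heaviest cluster's weight.
      n_cycles = max(len(w) for w in cluster_wgts.values())
      return sum(max(w[t] if t < len(w) else 0.0 for w in cluster_wgts.values())
                 for t in range(n_cycles))

  # Per-cycle weights from the slide's example:
  wgts = {1: [2.5, 2.0, 0.5, 0.0, 0.0],
          2: [0.0, 0.33, 0.33, 0.0, 0.0]}
  print(schedule_length_estimate(wgts))   # 5.0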
Cluster Refinement: Move Gain
• Mgain = Egain + (Lgain * CRITICAL_EDGE_COST)
  [Example: per-cycle weights before and after moving a group of operations between the clusters;
   SL(before) = 5.0, Egain = −1.0.]

Evaluation
• Evaluated DSP kernels and SPECint2000
• 64 registers per cluster
• Latencies similar to Itanium
    Name      Configuration
    2-2111    2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
    4-2111    4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
• For more detailed results, …
2 Cluster Results vs 1 Cluster / 4 Cluster Results vs 1 Cluster
  [Bar charts: performance of BUG and RHOP normalized to a single-cluster machine (roughly 0.4–1.0)
   on DSP kernels (fir, dct, LU, heat, rls, lyapunov, fsed, halftone, sobel, atmcell, channel, huffman)
   and SPECint2000 benchmarks (164.gzip, 175.vpr, 181.mcf, 197.parser, 253.perlbmk, 254.gap,
   255.vortex, 256.bzip, 300.twolf), plus the average.]
Related Work
  [Table comparing clustering algorithms — UAS, Leupers, Capitanio, GP(B), B-ITER, BUG, RHOP — along
   several dimensions: partitioning during vs. before scheduling, local vs. region scope, real vs.
   pseudo schedule estimate, operation-count machine model, hierarchical vs. flat grouping, iterative.]

RHOP vs. BUG
• RHOP is a prescheduling technique
• Combines slack distribution with multilevel-KL partitioning
• Performs better as the number of resources increases
    Machine    RHOP vs. BUG
    4-1111     14.3%
    4-2111     15.3%
    4-H        8.0%