Superscalar and VLIW Architectures
Parallel processing [2]
Processing instructions in parallel requires
three major tasks:
1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution (see the sketch after this list);
2. assigning instructions to the functional units on the hardware;
3. determining when instructions can be initiated.
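As a rough illustration of task 1, here is a minimal Python sketch (the (dest, sources) tuple format and the register names are invented for this example) that tests whether two instructions are free of RAW, WAR and WAW hazards and could therefore be grouped for parallel execution:

# Minimal sketch: an "instruction" is (dest_reg, [source_regs]).
# Two instructions may be grouped only if neither depends on the other.
def independent(i1, i2):
    d1, s1 = i1
    d2, s2 = i2
    raw = d1 in s2            # second reads what the first writes (true dependence)
    war = d2 in s1            # second overwrites a register the first still reads
    waw = d1 == d2            # both write the same register
    return not (raw or war or waw)

# 'add r3, r1, r2' and 'sub r6, r4, r5' are independent,
# but 'add r3, r1, r2' and 'mul r4, r3, r3' are not (RAW on r3).
print(independent(('r3', ['r1', 'r2']), ('r6', ['r4', 'r5'])))   # True
print(independent(('r3', ['r1', 'r2']), ('r4', ['r3', 'r3'])))   # False

Real issue logic performs this comparison in hardware across every instruction in the issue window, not one pair at a time.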
Major categories [2]
VLIW: Very Long Instruction Word
EPIC: Explicitly Parallel Instruction Computing
Superscalar Processors [1]
Superscalar processors are designed to exploit
more instruction-level parallelism in user
programs.
Only independent instructions can be executed
in parallel without causing a wait state.
The amount of instruction-level parallelism
varies widely depending on the type of code
being executed.
Pipelining in Superscalar
Processors [1]
In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This is not true in every clock cycle; in cycles where fewer independent instructions are available, some of the pipelines stall in a wait state. For example, if a degree-4 processor finds only two independent instructions in a cycle, two of its issue slots sit idle for that cycle.
In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.
Superscalar Execution
Superscalar
Implementation
Simultaneously fetch multiple instructions
Logic to determine true dependencies involving register values
Mechanisms to communicate these values
Mechanisms to initiate multiple instructions in parallel
Resources for parallel execution of multiple instructions
Mechanisms for committing process state in correct order (see the sketch below)
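As an illustration of the last requirement, the toy model below (Python; the class and method names are invented for this sketch, and a real reorder buffer also holds result values and has a fixed size) lets instructions complete out of order but commits them strictly in program order:

from collections import deque

class ReorderBuffer:
    # Toy model: entries stay in program order; an entry commits only when
    # it has completed and every older entry has already been committed.
    def __init__(self):
        self.entries = deque()                  # each entry: {'op': ..., 'done': ...}

    def issue(self, op):
        self.entries.append({'op': op, 'done': False})

    def complete(self, op):
        # Mark an instruction as finished executing (possibly out of order).
        for e in self.entries:
            if e['op'] == op and not e['done']:
                e['done'] = True
                break

    def commit(self):
        # Retire from the head only, so architectural state updates in order.
        committed = []
        while self.entries and self.entries[0]['done']:
            committed.append(self.entries.popleft()['op'])
        return committed

rob = ReorderBuffer()
for op in ['load r1', 'add r2', 'store r2']:
    rob.issue(op)
rob.complete('add r2')
print(rob.commit())                             # [] - 'load r1' is still pending
rob.complete('load r1')
print(rob.commit())                             # ['load r1', 'add r2']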
Some Architectures
PowerPC 604
six independent execution units:
Branch execution unit
Load/Store unit
3 Integer units
Floating-point unit
in-order issue
register renaming
PowerPC 620
provides, in addition to the 604's features, out-of-order issue
Pentium
three independent execution units:
2 Integer units
Floating-point unit
in-order issue
VLIW
Very Long Instruction Word (VLIW) architectures are used for executing more
than one basic instruction at a time.
These processors contain multiple functional units, which fetch from the
instruction cache a Very-Long Instruction Word containing several basic
instructions, and dispatch the entire VLIW for parallel execution. These
capabilities are exploited by compilers which generate code that has grouped
together independent primitive instructions executable in parallel.
VLIW has been described as a natural successor to RISC (Reduced Instruction
Set Computing), because it moves complexity from the hardware to the compiler,
allowing simpler, faster processors.
VLIW eliminates the complicated instruction scheduling and parallel dispatch
that occurs in most modern microprocessors.
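As a concrete, if very simplified, picture of what such a compiler does, the sketch below (Python; the 3-slot word width and the (dest, sources) instruction format are assumptions made for this example) greedily packs mutually independent instructions into fixed-width long words without reordering them; production VLIW compilers schedule far more aggressively:

# Toy VLIW packing: fill each long word with up to WIDTH mutually
# independent instructions, taken in program order.
WIDTH = 3                                   # issue slots per long word (assumed)

def independent(i1, i2):
    (d1, s1), (d2, s2) = i1, i2
    return d1 not in s2 and d2 not in s1 and d1 != d2

def pack(instrs):
    words, current = [], []
    for ins in instrs:
        if len(current) < WIDTH and all(independent(prev, ins) for prev in current):
            current.append(ins)
        else:
            words.append(current)
            current = [ins]
    if current:
        words.append(current)
    return words

prog = [('r1', ['b']), ('r2', ['c']), ('r3', ['r1', 'r2']),     # a = b + c
        ('r4', ['e']), ('r5', ['f']), ('r6', ['r4', 'r5'])]     # d = e - f
for word in pack(prog):
    print(word)

For this block the greedy pass produces three long words, the last holding a single instruction; a real compiler would also reorder instructions to fill the slots more evenly.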
WHY VLIW ?
The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism.
Some methods for exploiting fine-grain parallelism include:
Pipelining
Multiple processors
Superscalar implementation
Specifying multiple independent operations per instruction
Architecture Comparison:
CISC, RISC & VLIW

INSTRUCTION SIZE
  CISC: Varies
  RISC: One size, usually 32 bits
  VLIW: One size

INSTRUCTION FORMAT
  CISC: Field placement varies
  RISC: Regular, consistent placement of fields
  VLIW: Regular, consistent placement of fields

INSTRUCTION SEMANTICS
  CISC: Varies from simple to complex; possibly many dependent operations per instruction
  RISC: Almost always one simple operation
  VLIW: Many simple, independent operations

REGISTERS
  CISC: Few, sometimes special
  RISC: Many, general-purpose
  VLIW: Many, general-purpose
Architecture Comparison:
CISC, RISC & VLIW

MEMORY REFERENCES
  CISC: Bundled with operations in many different types of instructions
  RISC: Not bundled with operations, i.e., load/store architecture
  VLIW: Not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
  CISC: Exploit microcoded implementations
  RISC: Exploit implementations with one pipeline and no microcode
  VLIW: Exploit implementations with multiple pipelines, no microcode, and no complex dispatch logic

PICTURES OF FIVE TYPICAL INSTRUCTIONS
  (figures not reproduced)
Advantages of VLIW
VLIW processors rely on the compiler that generates the VLIW code to
explicitly specify parallelism. Relying on the compiler has advantages.
VLIW architecture reduces hardware complexity. VLIW simply moves
complexity from hardware into software.
What is ILP ?
Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously.
A system is said to embody ILP when multiple instructions can run on it at the same time.
ILP can have a significant effect on performance, which is critical to embedded systems.
ILP also provides a form of power saving by allowing the clock to be slowed.
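One common way to put a number on this (a textbook convention, not something defined on the slide) is to divide the operation count of a basic block by the length of its longest dependence chain. A small Python sketch, using the a = b + c / d = e - f block that appears later in these notes:

from functools import lru_cache

# deps[i] lists the operations that operation i depends on.
deps = {1: [], 2: [], 3: [1, 2], 4: [3],        # a = b + c  (ops 1-4)
        5: [], 6: [], 7: [5, 6], 8: [7]}        # d = e - f  (ops 5-8)

@lru_cache(maxsize=None)
def depth(op):
    # Length of the longest dependence chain ending at this operation.
    return 1 + max((depth(p) for p in deps[op]), default=0)

critical_path = max(depth(op) for op in deps)
print(len(deps) / critical_path)                # 8 ops / chain of 3 -> ILP ~ 2.7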
What we intend to do with ILP ?
We use micro-architectural techniques to exploit ILP. The various techniques include:
Instruction pipelining, which depends on CPU caches.
Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations (sketched below).
Speculative execution, which reduces pipeline stalls due to control dependencies.
Branch prediction, which is used to keep the pipeline full.
Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel.
Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
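Of these techniques, register renaming is the easiest to show in a few lines. The sketch below (Python; the physical register names p0, p1, ... and the shape of the renaming table are invented for illustration) gives every write a fresh physical register, so reuse of an architectural register no longer creates WAR or WAW dependences:

import itertools

def rename(instrs):
    # instruction = (dest, [sources]); every destination gets a fresh
    # physical register, and reads use the most recent mapping.
    fresh = (f'p{i}' for i in itertools.count())
    latest = {}                                 # architectural -> physical register
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [latest.get(s, s) for s in srcs]
        latest[dest] = next(fresh)
        renamed.append((latest[dest], new_srcs))
    return renamed

# r1 is reused for two unrelated computations; after renaming they can overlap.
prog = [('r1', ['a', 'b']), ('r5', ['r1']),
        ('r1', ['c', 'd']), ('r6', ['r1'])]
for ins in rename(prog):
    print(ins)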
Algorithms for scheduling
A few of the instruction scheduling algorithms used are:
List scheduling
Trace scheduling
Software pipelining (modulo scheduling)
List Scheduling
List scheduling, step by step:
1. Construct a dependence graph of the basic block. (The edges are weighted with the latency of the instruction.)
2. Use the dependence graph to determine which instructions can execute; insert them on a list, called the Ready list.
3. Use the dependence graph and the Ready list to schedule an instruction that causes the smallest possible stall; update the Ready list. Repeat until all instructions are scheduled. (A minimal sketch follows after this list.)
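A minimal Python sketch of these three steps follows; the node numbers, edge latencies and the single-instruction-per-cycle assumption are modeled on the a = b + c / d = e - f example on the next slide and are not taken from any particular machine:

# preds[i] = list of (predecessor, latency) edges into instruction i.
preds = {
    1: [], 2: [], 5: [], 6: [],                 # the four loads
    3: [(1, 2), (2, 2)],                        # add waits on both loads (latency 2)
    7: [(5, 2), (6, 2)],                        # sub waits on both loads
    4: [(3, 1)],                                # store a after the add
    8: [(7, 1)],                                # store d after the sub
}

def list_schedule(preds):
    done = {}                                   # instruction -> cycle it was issued
    schedule, cycle = [], 0
    while len(done) < len(preds):
        # Ready list: instructions whose predecessors have all been issued.
        ready = [n for n in preds
                 if n not in done and all(p in done for p, _ in preds[n])]
        # Earliest cycle each ready instruction could start without stalling.
        start = {n: max((done[p] + lat for p, lat in preds[n]), default=cycle)
                 for n in ready}
        # Schedule the ready instruction that causes the smallest stall.
        n = min(ready, key=lambda r: max(start[r], cycle))
        cycle = max(start[n], cycle)
        done[n] = cycle
        schedule.append((cycle, n))
        cycle += 1                              # single-issue model: one per cycle
    return schedule

for cyc, n in list_schedule(preds):
    print(f'cycle {cyc}: instruction {n}')

On this dependence graph the sketch issues one instruction every cycle with no idle cycles, like the stall-free schedule shown on the slides (the exact order may differ when several ready instructions tie).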
Code Representation for
List Scheduling
a = b + c
d = e - f
1. load R1, b
2. load R2, c
3. add R2,R1
4. store a, R2
5. load R3, e
6. load R4,f
7. sub R3,R4
8. store d,R3
(Dependence graph: loads 1 and 2 feed add 3, which feeds store 4; loads 5 and 6 feed sub 7, which feeds store 8.)
Code Representation for
List Scheduling
1. load R1, b
5. load R3, e
2. load R2, c
6. load R4, f
3. add R2,R1
7. sub R3,R4
4. store a, R2
8. store d, R3
Now we have a schedule that requires no stalls and no NOPs.
Problem and Solution
Register allocation conflict: use of the same register creates anti-dependencies that restrict scheduling.
Register allocation before scheduling: prevents good scheduling.
Scheduling before register allocation: spills destroy the schedule.
Solution: schedule abstract assembly, allocate registers, then schedule again.
Trace scheduling
Steps involved in Trace Scheduling :
Trace Selection
Find the most common trace of basic blocks (see the sketch after this list).
Trace Compaction
Combine the basic blocks in the trace and schedule them as one block
Create clean-up code if the execution goes off-trace
Parallelism across IF branches vs. LOOP branches
Can provide a speedup if static prediction is accurate
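Trace selection, the first step above, is simple to sketch: starting from the entry block, repeatedly follow the most frequently taken successor edge. The Python below uses invented block names and profile counts purely for illustration:

# succs[block] = {successor_block: profile_count}
succs = {
    'B1': {'B2': 90, 'B3': 10},     # the if-branch goes to B2 90% of the time
    'B2': {'B4': 90},
    'B3': {'B4': 10},
    'B4': {},
}

def select_trace(entry):
    trace, block = [entry], entry
    while succs[block]:
        block = max(succs[block], key=succs[block].get)   # most frequent successor
        if block in trace:                                # stop at a back edge
            break
        trace.append(block)
    return trace

print(select_trace('B1'))           # ['B1', 'B2', 'B4'] - the dominant path

The blocks on the selected trace are then compacted and scheduled as one unit, and clean-up (compensation) code is generated for executions that leave the trace through B3.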
How Trace Scheduling works
Look for the highest-priority path and trace the blocks along it.
How Trace Scheduling works
After tracing the priority blocks, schedule them first and schedule the rest in parallel with them.
How Trace Scheduling works
We can see the blocks being traced according to their priority.
How Trace Scheduling works
Creating large extended basic blocks by duplication
Schedule the larger blocks
Figure above shows how the extended basic blocks can be
created.
How Trace Scheduling works
In its final stage, the block diagram shows the parallelism across the branches.
Limitations of Trace Scheduling
Optimization depends on the traces being the dominant paths in the program's control flow.
Therefore, the following two things should be true:
Programs should demonstrate the behavior of being skewed in
the branches taken at run-time, for typical mixes of input data.
We should have access to this information at compile time.
Not so easy.
Software Pipelining
In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete, thus taking advantage of the parallelism in the data path.
It is also described as scheduling the operations within an iteration such that the iterations can be pipelined to yield optimal throughput.
The sequence of instructions before the steady state is called the PROLOG, and the sequence after the steady state is called the EPILOG.
Software Pipelining Example
Source code:
for (i = 0; i < n; i++) sum += a[i];
Loop body in assembly:
r1 = L r0
---;stall
r2 = Add r2,r1
r0 = add r0,4
Unroll loop & allocate registers
r1 = L r0
---;stall
r2 = Add r2,r1
r0 = Add r0,12
r4 = L r3
---;stall
r2 = Add r2,r4
r3 = add r3,12
r7 = L r6
---;stall
r2 = Add r2,r7
r6 = add r6,12
r10 = L r9
---;stall
r2 = Add r2,r10
r9 = add r9,12
Software Pipelining Example
Schedule the unrolled instructions, exploiting VLIW (or not).
(Figure: the repeating pattern in the schedule is the kernel; the instructions before it form the PROLOG and the instructions after it form the EPILOG.)
Constraints in Software pipelining
Recurrence constraints: determined by loop-carried data dependencies.
Resource constraints: determined by the total resource requirements.
(A small worked example follows below.)
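In modulo scheduling these two constraints become a lower bound on the initiation interval, MII = max(ResMII, RecMII); that terminology is standard in the software-pipelining literature, though not spelled out on the slide. A small worked sketch in Python, with made-up resource counts and latencies:

import math

def min_initiation_interval(uses, units, rec_latency, rec_distance):
    # Resource bound: each unit class must fit its per-iteration uses.
    res_mii = max(math.ceil(uses[u] / units[u]) for u in uses)
    # Recurrence bound: a loop-carried chain of total latency rec_latency
    # spanning rec_distance iterations limits how fast iterations can start.
    rec_mii = math.ceil(rec_latency / rec_distance)
    return max(res_mii, rec_mii)

# Example: 4 memory ops per iteration on 2 load/store units, 3 ALU ops on
# 2 ALUs, and a loop-carried dependence of latency 4 spanning 1 iteration.
print(min_initiation_interval({'mem': 4, 'alu': 3}, {'mem': 2, 'alu': 2},
                              rec_latency=4, rec_distance=1))     # -> 4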
Remarks on Software Pipelining
Innermost loops, loops with larger trip counts, and loops without conditionals can be software pipelined.
Code size increases due to the prolog and epilog.
Code size increases due to unrolling for MVE (Modulo Variable Expansion).
Register allocation strategies are needed for software-pipelined loops.
Loops with conditionals can be software pipelined if predicated execution is supported.
Higher resource requirement, but a more efficient schedule.