PARALLELISM
Introduction
• An instruction that does arithmetic on one or two
numbers at a time is called a scalar instruction.
• An instruction that operates on a larger number of
values at once (e.g. 32 or 64) is called a vector
instruction.
• A processor that contains no vector instructions is called a scalar processor,
• and one that contains vector instructions is called a vector processor (see the sketch below).
• If the machine has more than one processor of
either type, it is called a multiprocessor or a
parallel computer.
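As an illustration, the same array addition can be written with scalar instructions (one add per element) or, on a machine with vector instructions, a few wide operations. This sketch assumes an x86 processor with AVX and the standard immintrin.h intrinsics; the function names are only illustrative.

#include <immintrin.h>   /* AVX intrinsics; assumes an x86 CPU with AVX */

/* Scalar version: one addition per instruction. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vector version: each _mm256_add_ps adds 8 floats at once. */
void add_vector(const float *a, const float *b, float *c, int n) {
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)       /* scalar cleanup for the leftover elements */
        c[i] = a[i] + b[i];
}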
Vectorizing and Parallelizing Compilers
• Programs written for execution by a single processor are referred to as serial, or sequential, programs.
• When the quest for increased speed produced computers with vector instructions and multiprocessors, compilers were created to convert serial programs for use with these machines. Such compilers are called vectorizing and parallelizing compilers, respectively (see the example below).
• Vector processors provide instructions that load a series of numbers for each operand of a given operation, then perform the operation on the whole series.
• This can be done in a pipelined fashion.
• Parallel processors offer the opportunity to do multiple
operations at the same time on the different processors.
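A minimal sketch of handing a serial loop to such a compiler, assuming a C compiler with OpenMP support; the pragmas are standard OpenMP, but whether the loop is actually vectorized or parallelized depends on the compiler's analysis.

/* Vectorization hint: the compiler may turn this do-all loop into
   vector instructions (compile with e.g. -fopenmp). */
void saxpy_simd(float a, const float *x, float *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Parallelization hint: iterations are distributed across processors. */
void saxpy_parallel(float a, const float *x, float *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}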
Classification of Parallel Machines - based on the instruction and data streams
• This is Flynn's classification: machines are SISD, SIMD, MISD, or MIMD, according to whether the instruction stream and the data stream are each single or multiple.
Global Code Scheduling
• All loads (LD) are done in basic block B1 (the load operations from B2 and B3 are moved up into B1).
• The add operation of B3 is computed speculatively in B1 (moved from B3 to B1).
• Based on the conditional statement in B1, either B2 or B3 will then be executed (a C-level sketch follows).
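The bullets above describe a figure that is not reproduced here. Below is a hypothetical C-level reconstruction of the transformation they describe; the block labels B1/B2/B3 and all variable names are assumptions.

/* Before scheduling: each block does its own load and add. */
int before(int cond, const int *p, const int *q) {
    if (cond)                 /* B1: conditional branch             */
        return *p + 1;        /* B2: load from p, then add          */
    else
        return *q + 1;        /* B3: load from q, then add          */
}

/* After global scheduling: both loads are hoisted into B1, and B3's
   add is computed speculatively before the branch resolves. This is
   safe only if p and q are known to be dereferenceable. */
int after(int cond, const int *p, const int *q) {
    int t1 = *p;              /* B1: load moved up from B2          */
    int t2 = *q;              /* B1: load moved up from B3          */
    int s  = t2 + 1;          /* B1: speculative add from B3        */
    return cond ? (t1 + 1) : s;   /* branch picks B2's or B3's result */
}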
Dynamic Scheduling
Designing the hardware so that it can dynamically
rearrange instruction execution to reduce stalls while
maintaining data flow.
Challenges in Dynamic Scheduling
• Instructions are issued to the pipeline in order but executed and completed out of order.
• Out-of-order execution leads to the possibility of out-of-order completion.
• Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not occur in statically scheduled pipelines (see the example below).
• Out-of-order completion creates major complications in exception handling.
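A small illustrative sequence (the variables stand in for registers; names are hypothetical). In program order there is no problem, but once the long divide lets later instructions run ahead, I3 and I4 must not complete before I1 and I2.

double hazards(double a, double b, double c, double d, double e) {
    double x, y;
    x = a / b;    /* I1: long-latency divide, writes x                      */
    y = x + c;    /* I2: RAW on x - must wait for I1's result               */
    c = d * e;    /* I3: WAR on c - must not overwrite c before I2 reads it */
    x = d + e;    /* I4: WAW on x - must not finish before I1's write       */
    return x + y + c;
}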
Dynamic scheduling using Tomasulo’s Algorithm
Key Features
• Instructions are executed only when their operands are available.
• A waiting instruction is stored in a reservation station (RS).
• Reservation stations keep track of pending instructions (RAW hazards).
• WAW and WAR hazards are avoided by register renaming (see the sketch below).
• A common data bus (CDB) carries results past the reservation stations (where they are captured) and back to the register file.
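Continuing the hazard example above, a sketch of what renaming achieves: giving each new definition its own name removes the WAR and WAW hazards, leaving only the true RAW dependence (Tomasulo does this in hardware with reservation-station tags).

double renamed(double a, double b, double c, double d, double e) {
    double x1, y1, c1, x2;
    x1 = a / b;     /* I1                                             */
    y1 = x1 + c;    /* I2: only the true RAW dependence on x1 remains */
    c1 = d * e;     /* I3: WAR gone - c renamed to c1                 */
    x2 = d + e;     /* I4: WAW gone - x renamed to x2                 */
    return x2 + y1 + c1;
}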
Dynamic scheduling using Tomasulo’s Algorithm
Key Features contd.
• With speculation, instructions that are predicted to execute after a branch are executed without knowing the branch outcome. A speculated instruction is committed only if the branch outcome matches the prediction; otherwise it is not committed.
• The key to speculation is to allow out-of-order instruction execution but force instructions to commit in order. This is generally achieved with a reorder buffer (ROB), which holds completed instructions and retires them in order (a sketch follows).
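A minimal sketch of the commit logic, assuming a hypothetical ROB entry layout; the field and function names are not from the source.

#include <stdbool.h>

/* Hypothetical reorder-buffer entry. */
struct ROBEntry {
    bool completed;      /* result has been produced                */
    bool is_branch;      /* entry is a branch                       */
    bool mispredicted;   /* branch outcome differed from prediction */
    int  dest;           /* architectural register to update        */
    long value;          /* result to commit                        */
};

/* Retire at most one entry, strictly in program order from the head. */
void commit_step(struct ROBEntry rob[], int *head, int tail, long regs[]) {
    if (*head == tail || !rob[*head].completed)
        return;                      /* head not finished: commit stalls */
    if (rob[*head].is_branch && rob[*head].mispredicted) {
        /* Recovery: all younger (uncommitted) entries would be
           flushed and fetch redirected - not shown in this sketch. */
        return;
    }
    regs[rob[*head].dest] = rob[*head].value;   /* in-order commit */
    *head = *head + 1;
}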
• Instruction-level parallelism using dynamic scheduling is carried out in three phases.
Three Phases:
1. Issue
• Get the next instruction from the FIFO queue.
• Issue the instruction to a reservation station, together with any operand values that are already available.
• If no reservation station is free, the instruction stalls; operands that are not yet available are tracked by recording which unit will produce them.
2. Execute
• When an operand becomes available, store it in every reservation station waiting for it.
• When all operands are ready, execute the instruction.
• Loads and stores are kept in program order through their effective-address calculation.
• No instruction is allowed to initiate execution until all branches that precede it in program order have completed.
3. Write result
• Write the result on the common data bus to all awaiting units (reservation stations, store buffers); a sketch of this capture mechanism follows the list.
• Stores must wait until both the address and the value have been received.
(Figure: an instruction sequence with WAW and RAW hazards, their removal by register renaming, and execution order vs. commit order.)
Software Pipelining
• Do-all loops are particularly attractive from a parallelization perspective because their iterations can be executed in parallel to achieve a speed-up linear in the number of iterations in the loop.
• Software pipelining schedules an entire loop at a time, taking full advantage of the parallelism across iterations (a pipelined-loop sketch follows the example below).
• The following do-all loop body is used to explain software pipelining.
Normal IR       SSA form
x = y - z       x = y - z
s = x + s       s1 = x + s
x = s + p       x1 = s1 + p
s = z * q       s2 = z * q
s = x * s       s3 = x1 * s2
Variable s is redefined in statements 2, 4, and 5. In the normal IR, even though the values of s are different, every occurrence is represented simply as s, which may lead to conflicts. Static single assignment (SSA) addresses this by renaming the definitions s1, s2, and s3; the same is done for variable x.
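As promised above, a sketch of what software pipelining does to a do-all loop. It assumes a machine wide enough to issue the load for iteration i+1 alongside the multiply and store for iteration i; the loop is restructured into a prologue, a steady-state kernel, and an epilogue (n >= 1 assumed).

/* Original do-all loop: iterations are independent. */
void scale(const float *a, float *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] * 2.0f;
}

/* Software-pipelined form: the kernel overlaps iteration i's
   multiply/store with iteration i+1's load. */
void scale_pipelined(const float *a, float *b, int n) {
    float t = a[0];                  /* prologue: first load             */
    for (int i = 0; i < n - 1; i++) {
        float next = a[i + 1];       /* load for iteration i+1           */
        b[i] = t * 2.0f;             /* multiply + store for iteration i */
        t = next;
    }
    b[n - 1] = t * 2.0f;             /* epilogue: last multiply + store  */
}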
Program Analysis for Parallelism
Dependence Analysis
• For parallelization and vectorization, the compiler
typically takes as input the serial form of a program,
then determines which parts of the program can be
transformed into parallel or vector form.
• The key constraint is that the "results" of each section of code must be the same as those of the serial program.
• A data dependence between two sections of a
program indicates that during execution of the
optimized program, those two sections of code must
be run in the order indicated by the dependence.
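For instance (a hypothetical loop), S1 writes a[i] and S2 reads a[i-1], so iteration i's S2 depends on iteration i-1's S1; any parallel or vector version must preserve that order.

void dependent(float *a, float *b, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = b[i] + 1.0f;   /* S1: writes a[i]                           */
        b[i] = a[i - 1];      /* S2: reads a[i-1], written by S1 one
                                 iteration earlier - a loop-carried true
                                 dependence that constrains execution order */
    }
}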
Classification of Dependences
• Data dependences between two sections of code that access the same memory location are classified based on the type of access (read or write) and the order, so there are four classifications: true (flow) dependence (read after write), antidependence (write after read), output dependence (write after write), and input dependence (read after read).
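A sketch exhibiting all four access patterns on a single location (the statements and variable names are hypothetical):

double four_deps(double a, double b) {
    double x, y, z, w;
    x = a + b;      /* S1: writes x                                   */
    y = x * 2.0;    /* S2: reads x  -> read after write (RAW) on S1   */
    x = b - a;      /* S3: writes x -> write after read (WAR) w.r.t. S2,
                           and write after write (WAW) w.r.t. S1      */
    z = x + 1.0;    /* S4: reads x  -> RAW on S3                      */
    w = x - 1.0;    /* S5: reads x  -> read after read (RAR) with S4,
                           which imposes no ordering between S4 and S5 */
    return y + z + w;
}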