
Module 7

PARALLELISM
Introduction
• An instruction that does arithmetic on one or two
numbers at a time is called a scalar instruction.
• An instruction that operates on a larger number of
values at once (e.g. 32 or 64) is called a vector
instruction.
• A processor that contains no vector instructions is called a scalar
processor, and one that contains vector instructions is called a
vector processor.
• If the machine has more than one processor of
either type, it is called a multiprocessor or a
parallel computer.
Vectorizing
and Parallelizing compilers
• Programs written for execution by a single processor are
referred to as serial, or sequential programs.
• When the quest for increased speed produced computers with
vector instructions and multiprocessors, compilers were
created to convert serial programs for use with these
machines. Such compilers are called vectorizing and
parallelizing compilers.
• Vector processors provide instructions that load a series of
numbers for each operand of a given operation, then perform
the operation on the whole series.
• This can be done in a pipelined fashion.
• Parallel processors offer the opportunity to do multiple
operations at the same time on the different processors.
Classification of Parallel Machines - based on
the instruction and data streams

• Single instruction stream - single data stream (SISD) - these
are single-processor machines.
• Single instruction stream - multiple data streams (SIMD)
• These machines have two or more processors that all execute
the same instruction at the same time, but on separate data.
• Multiple instruction streams - single data stream (MISD) – these
have multiple processors, all operating on the same data. This
is similar to the idea of pipelining, where different pipeline
stages operate in sequence on a single data stream.
• Multiple instruction streams - multiple data streams (MIMD)
These machines have two or more processors that can all
execute different programs and operate on their own data.
Classification of Parallel Machines – Based on
memory system

• Shared-memory multiprocessors (SMPs)

• Distributed-memory multiprocessors (DMPs)


Classification of Parallel Machines – Based on
memory system
• Shared-memory multiprocessors (SMPs) are machines
in which any processor can access the contents of any
memory location by simply issuing its memory address
Classification of Parallel Machines – Based
on memory system
• Distributed shared
memory (DSM)
• Machines use a
combined model, in
which each processor
has a separate memory,
but special hardware
and/or software is used
to retrieve data from
the memory of another
processor.
Distributed shared memory (DSM)
Classification of Parallel Machines – Based on
memory system
• Distributed-memory
multiprocessors (DMPs) use
processors that each have
their own local memory,
inaccessible to other
processors.
• To move data from one
processor to another, a
message containing the data
must be sent between the
processors.
• Distributed-memory
machines have frequently
been called multicomputers.
Parallel Computer Architectures - SIMD
• The earliest parallel machines were SIMD machines.
• SIMD machines have a large number of very simple slave
processors controlled by a sequential host or master processor.
• The slave processors each contain a portion of the data for the
program.
• The master processor executes a user’s program until it
encounters a parallel instruction.
• At that time, the master processor broadcasts the instruction to
all the slave processors, which then execute the instruction on
their data.
• The master processor typically applies a bitmask to the slave
processors.
• If the bit-mask entry for a particular slave processor is 0, then
that processor does not execute the instruction on its data.
SIMD Machines
Parallel Computer Architectures-
Vector Machines
• A vector machine has a specialized instruction set with
vector operations and usually a set of vector registers, each
of which can contain a large number of floating point values
(up to 128).
• With a single instruction, it applies an operation to all the
floating point numbers in a vector register.
• The processor of a vector machine is typically pipelined, so
that the different stages of applying the operation to the
vector of values overlap.
• This also avoids the overheads associated with loop
constructs.
• A scalar processor would have to apply the operation to
each data value in a loop.
Parallel Computer Architectures--
Shared Memory Machines
• In a shared memory multi-processor, each processor
can access the value of any shared address by simply
issuing the address
• In centralized shared-memory machines, the processors are
connected to the shared memory via either a system
bus or an interconnection network.
• In distributed shared-memory machines, each processor has a
local memory, and whenever a processor issues the
address of a memory location not in its local
memory, special hardware is activated to fetch the
value from the remote memory that contains it.
Parallel Computer Architectures-
Distributed Memory Multiprocessors
• In a distributed memory multiprocessor, each
processor has access to its local memory only.
• It can only access values from a different
memory by receiving them in a message from
the processor that owns that memory.
• The other processor must be programmed to
send the value at the right time.
Parallel Computer Architectures-
Cache-only Memory Architecture (COMA).
• A machine that uses all of its memory as a cache is called
a cache-only memory architecture (COMA).
• Typically in these machines, each processor has a local
memory, and data is allowed to move from one
processor's memory to another during the run of a
program.
• The term attraction memory has been used to describe
the tendency of data to migrate toward the processor that
uses it the most.
• Theoretically, this can minimize the latency to access
data, since latency increases as the data get further from
the processor.
Parallel Computer Architectures-
Multi-threaded Machines
• A multi-threaded machine attempts to hide latency
to memory by overlapping it with computation.
• As soon as the processor is forced to wait for a data
access, it switches to another thread to do more
computation.
• If there are enough threads to keep the processor
busy until each piece of data arrives, then the
processor is never idle.
Parallel Computer Architectures-
Clusters of SMPs
• Another approach to building a multiprocessor is to
use a small number of commodity microprocessors
to make centralized shared-memory clusters, then
connect large numbers of these together.
• The number of microprocessors to use to make a
single cluster would be determined by the number of
processors that would saturate the bus (keep the bus
constantly busy).
• Such machines are called clusters of SMPs.
Programming Parallel Machines
Three different ways of creating a parallel program:
1.writing a serial program and compiling it with a
parallelizing compiler
– write the program in one of the languages for which a
parallelizing compiler is available (Fortran, C, and C++),
then employ the compiler.
2. composing a program from modules that have
already been implemented as parallel programs
3. writing a program that expresses parallel
activities explicitly
Composing a program from modules implemented
as parallel programs
• For many problems and computer systems there exist
libraries that perform common operations in parallel.
• Among them, mathematical libraries for manipulating
matrices are best known.
• One difficulty for users is that one must make sure that
a large fraction of the program execution is spent inside
such libraries.
• Otherwise, the serial part of the program may dominate
the execution time when running the application on
many processors.
Program that expresses parallel activities
• This is the most difficult approach for programmers, but it
gives them direct control over the performance of
the parallel execution.
• Many parallelizers act as source-to-source
restructurers, translating the original, serial
program into parallel form.
• The actual generation of parallel code is then
performed by a backend compiler from this
parallel language form.
Expressing Parallel Programs
1. A large number of languages offer parallel
programming constructs.
• Examples are Prolog, Haskell, Sisal, Multilisp,
Concurrent Pascal, Occam and many others.
• Compared to standard, sequential languages
they tend to be more complex, available on
fewer machines, and lack good debugging tools.
2. Parallelism can also be expressed in the form
of directives, which are pseudo comments
with semantics understood by the compiler.
Parallelism using Directives - OpenMP
• OpenMP describes a common set of directives for
implementing various types of parallel execution and
synchronization.
• One advantage of the OpenMP directives is that they
are designed to be added to a working serial code.
• If the compiler is told to ignore the directives, the
serial program will still execute correctly.
• Since the serial program is unchanged, such a parallel
program may be easier to debug.
OpenMP Directives
Directive – Description
parallel – Defines a parallel region: the part of the code that
will be executed by multiple threads in parallel.
for – Causes the work done in a for loop inside a parallel
region to be divided among threads.
sections – Identifies code sections to be divided among all
threads.
single – Specifies that a section of code should be executed
on a single thread, not necessarily the main thread.
Open MP - Example
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* create a team of 4 threads; each executes the block below */
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();   /* this thread's id: 0..3 */
        printf("Hello from thread %d\n", i);
    }
    return 0;
}
Output (the order of the lines may vary between runs)
Hello from thread 0
Hello from thread 1
Hello from thread 2
Hello from thread 3
Expressing Parallel Programs – contd..
3. Use library calls within a sequential program.
• The libraries perform the task of creating and
terminating parallel activities, scheduling them, and
supporting communication and synchronization.
• Examples of libraries that support this method are
the POSIX threads package, which is supported by
many operating systems,
• And the MPI libraries, which have become a standard
for expressing message passing parallel applications.
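As an illustration of the message-passing style these MPI libraries support, here is a minimal sketch (it assumes an MPI installation; the two ranks, the tag 0, and the integer payload are illustrative choices, not from the original slides):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */

    if (rank == 0) {
        value = 42;                          /* data owned by process 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}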
Loop Parallelism
• Parallelism is exploited by identifying loops that have
independent iterations.
• That is, all iterations access separate data.
• Loop parallelism is often expressed through directives,
which are placed before the first statement of the loop.
• OpenMP is an important example of a loop-oriented
directive language.
• Typically, a single processor executes code between loops,
but activates (forks) a set of processors to cooperate in
executing the parallel loop.
• Every processor will execute a share of the loop iterations.
• A synchronization point (or barrier) is typically placed after
the loop.
Loop Parallelism – contd..
• When all processors arrive at the barrier, only the master
processor continues. This is called a join point for the
loop.
• Thus the term fork/join parallelism is used for loop
parallelism.
• Determining which processor executes which iteration of
the loop is called scheduling.
• Loops may be scheduled statically, which means that the
assignment of processors to loop iterations is fully
determined prior to the execution of the loop.
• Loops may also be self-scheduled, which means that
whenever a given processor is ready to execute a loop
iteration, it takes the next available iteration.
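A minimal sketch of fork/join loop parallelism and the two scheduling styles using OpenMP directives (the loop bodies and the chunk size of 4 are illustrative):

#include <omp.h>

void scale(double *a, const double *b, int n) {
    /* fork: iterations are divided among the threads statically;
       an implicit barrier (the join point) follows the loop */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];

    /* self-scheduling: a free thread grabs the next chunk of 4
       iterations whenever it is ready */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        a[i] = a[i] * a[i];
}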
Parallel Threads Model
• If the parallel activities in a program can be packaged
well in the form of subroutines that can execute
independently of each other, then the threads model
is adequate.
• Threads are parallel activities that are created and
terminated explicitly by the program.
• The code executed by a thread is a specified
subroutine, and the data accessed can either be
private to a thread or shared with other threads.
• Various synchronization constructs are usually
supported for coordinating parallel threads
• The POSIX threads package is one example of a well
known library that supports this model.
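A minimal sketch of the threads model using the POSIX threads package (the worker routine and its argument are illustrative):

#include <stdio.h>
#include <pthread.h>

/* the code executed by each thread is a specified subroutine */
static void *worker(void *arg) {
    int id = *(int *)arg;                 /* data private to this thread */
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    int ids[4];
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);  /* create thread  */
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                      /* wait for exit  */
    return 0;
}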
Parallel Programming Models –
Programming Vector Machines
• Vector parallelism typically exploits operations that are
performed on array data structures.
• This can be expressed using vector constructs that have
been added to standard languages. For instance,
Fortran90 uses constructs such as
• A(1:n) = B(1:n) + C(1:n)
• For a vector machine, this could cause a vector loop to be
produced, which performs a vector add between chunks
of arrays B and C, then a vector copy of the result into a
chunk of array A.
• The size of a chunk would be determined by the number
of elements that fit into a vector register in the machine.
Vectorization: Exploiting Vector Architectures
• Vectorizing compilers exploit vector architectures by generating
code that performs operations on a number of data elements in
a row.
• Vector architectures can accommodate several short data items
in one word. For example, a 64-bit word can accommodate a
“vector" of four 16-bit words.
• Instructions that operate on vectors of this kind are sometimes
referred to as multi-media extensions (MMX).
• The objective of a vectorizing compiler is to identify and express
such vector operations in a form that can then be easily mapped
onto the vector instructions available in these architectures.
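For example, a vectorizing compiler can map an element-wise loop like the one below onto such vector instructions; this is only a sketch, and the restrict qualifiers (which tell the compiler the arrays do not overlap) are an added assumption:

/* Each iteration is independent, so the compiler may process several
   elements of a, b and c per vector (e.g. MMX-style) instruction. */
void vadd(short * restrict a, const short * restrict b,
          const short * restrict c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}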
Scalar Expansion

• Private variables need to be expanded into arrays in order to
allow vectorization. The figure shows the privatization
transformed into vector form (a sketch follows below).
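A minimal sketch of scalar expansion (the variable names are illustrative, not the ones from the original slide's figure):

/* Before: the scalar t is written and read in every iteration, so the
   two statements cannot be turned directly into vector operations. */
void before(const float *a, const float *b, float *c, int n) {
    float t;
    for (int i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }
}

/* After scalar expansion: t becomes the array tx, each iteration owns
   one element, and each statement can become a vector operation. */
void after(const float *a, const float *b, float *c, float *tx, int n) {
    for (int i = 0; i < n; i++) {
        tx[i] = a[i] + b[i];
        c[i] = tx[i] * tx[i];
    }
}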
Loop Distribution
• A loop containing several statements must first be
distributed into several loops before each one can be
turned into a vector operation.
• Loop distribution (also called loop splitting or loop
fission) is only possible if there is no dependence in a
lexically backward direction.
Loop Distribution
• Figure shows a loop that is distributed and
vectorized. The original loop contains a dependence
in a lexically forward direction. Such a dependence
does not prevent loop distribution.
• That is, the execution order of the two dependent
statements is maintained in the vectorized code.
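A minimal sketch of loop distribution on a loop whose only dependence is lexically forward (the arrays and statements are illustrative):

/* Original loop: S2 reads a[i], which S1 wrote in the same iteration,
   i.e. a lexically forward dependence. */
void original(float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 1.0f;      /* S1 */
        c[i] = a[i] * 2.0f;      /* S2 */
    }
}

/* Distributed (split/fissioned): each loop holds one statement and can
   be vectorized; S1 still executes before S2 for every element. */
void distributed(float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0f;      /* S1 */
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2.0f;      /* S2 */
}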
Handling Conditionals in a Loop
• Conditional execution is an issue for vectorization because
all elements in a vector are processed in the same way.
• Figure shows how a conditional execution can be
vectorized.
• The condition is first evaluated for all vector elements and a
vector of true/false values is formed, called the mask.
• The actual operation is then executed conditionally, based
on the value of the mask at each vector position.
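A minimal sketch of vectorizing a conditional with a mask, written here as two scalar loops that stand in for the masked vector operations (names are illustrative):

void masked(float *a, const float *b, unsigned char *mask, int n) {
    /* step 1: evaluate the condition for every element and build the
       vector of true/false values (the mask) */
    for (int i = 0; i < n; i++)
        mask[i] = (b[i] > 0.0f);

    /* step 2: perform the operation under control of the mask; on a
       vector machine both loops map onto masked vector instructions */
    for (int i = 0; i < n; i++)
        if (mask[i])
            a[i] = b[i] * b[i];
}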
Stripmining Vector Lengths
• Vector instructions usually take operands of a fixed length – the
size of the vector registers.
• The original loop must be divided into strips of this length.
This is called stripmining.
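A minimal sketch of stripmining, assuming a vector register that holds 64 elements (the strip length is an illustrative assumption):

#define VL 64   /* assumed vector register length */

void stripmined(float *a, const float *b, int n) {
    /* the outer loop walks the array in strips of VL elements; the
       inner loop over one strip maps onto a single vector instruction
       (the final strip may be shorter than VL) */
    for (int is = 0; is < n; is += VL) {
        int len = (n - is < VL) ? (n - is) : VL;
        for (int i = is; i < is + len; i++)
            a[i] = b[i] + 1.0f;
    }
}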
Vector Code Generation
• Finding vectorizable statements in a multiply-nested
loop that contains data dependences can be quite
difficult.
• Algorithms check dependency in a recursive manner
• They move from the outermost to the innermost loop
level and test at each level for code sections that can be
distributed (i.e., they do not contain dependence cycles)
and then vectorize the code
Instruction Scheduling and Software Pipelining
Instruction-level Parallelism
• High-performance processor can execute several
operations in a single clock cycle.
• Instruction-level parallelism depends on
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Ability to extract parallelism from the original
sequential program.
4. Ability to find the best parallel schedule given
scheduling constraints
Instruction Pipelines
• A new instruction can be fetched every clock while
preceding instructions are still going through the
pipeline.
A simple 5-stage instruction pipeline:
• Fetches the instruction (IF),
• Decodes it (ID),
• Executes the operation (EX),
• Accesses the memory (MEM), and
• Writes back the result (WB)
The figure shows how instructions i, i + 1, i + 2, i + 3, and
i + 4 can execute at the same time. Each row
corresponds to a clock tick, and each column in the
figure specifies the stage in each instruction
Very-Long-Instruction-Word
• Machines that rely on software to manage their instruction-level
parallelism are known as VLIW (Very-
Long-Instruction-Word) machines.
• VLIW machines, as their name implies, have wider-than-normal
instruction words that encode the
operations to be issued in a single clock.
Data Dependency/ Data Hazards
• True dependence: Read After Write (RAW Hazards)
If a write is followed by a read of the same location, the
read depends on the value written; such a dependence
is known as a true dependence.

• Output dependence: Write After Write (WAW Hazards)
Two writes to the same location share an output
dependence. If the dependence is violated, the memory
location will hold the wrong value after both operations
are performed.
Data Dependence - contd
• Antidependence: Write After Read (WAR Hazards)
If a read is followed by a write to the same location,
the write does not depend on the read, but if the
write happens before the read, then the read
operation will pick up the wrong value. It is not a
"true" dependence and can potentially be eliminated
by storing the values in different locations.
Remedial action
Antidependences and output dependences can be
eliminated by using different locations to store their
values, i.e. by register renaming (see the sketch below).
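A small illustrative sequence showing the three kinds of dependences (the variables stand in for registers or memory locations; the example is a sketch, not taken from the original slides):

void hazards(void) {
    int a, b, c, d;
    a = 10;        /* write a                                          */
    b = a + 1;     /* read a: RAW (true dependence) on the first write */
    a = 20;        /* write a after b read it: WAR (antidependence);
                      also a second write to a: WAW (output dep.)      */
    c = a * 2;     /* RAW on the new value of a                        */
    d = b + c;
    (void)d;       /* keep the compiler from warning about unused d    */
}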
Control Dependence
• Scheduling operations within a basic block is relatively easy because
Instructions in a basic block can be reordered arbitrarily, as long as all the
data dependences are satisfied and there is no branching within the basic
block
• Exploiting parallelism across basic blocks is a challenge since it involves
branching across the basic block
Example:
if (a > t)
    b = a * a;
d = a + c;
• The statement b = a*a depends on the comparison a > t. The statement
if (a > t) may be in basic block B1 and the statement b = a*a in another
basic block B2; the execution of b = a*a depends on the value of a > t.
This is a control dependence. The statement d = a+c, however, does not
depend on the comparison and can be executed at any time.
• Remedial Action: Assuming that the multiplication a * a does not cause
any side effects, it can be performed speculatively, as long as b is written
only after a is found to be greater than t (a sketch follows below).
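A minimal sketch of that remedial action: the multiply is executed speculatively before the branch is resolved, and its result is committed to b only when the condition holds (the variable names follow the example above):

int speculate(int a, int t, int c, int *b) {
    int d = a + c;        /* not control dependent: may run any time   */
    int tmp = a * a;      /* speculative: computed before the branch is
                             resolved; safe because it has no side
                             effects                                    */
    if (a > t)
        *b = tmp;         /* commit the speculated result only when the
                             condition is true                          */
    return d;
}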
Control Dependence
• In general, two constraints are imposed by control
dependencies:
– An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that the execution is
controlled by the branch.
Register Allocation and Code Scheduling
Challenges
• If registers are allocated before scheduling, the
resulting code tends to have many storage
dependences that limit code scheduling.
• If code is scheduled before register allocation, the
schedule created may require so many registers that
register spilling results (that is, storing the contents
of a register in a memory location so the register can
be used for some other purpose), which may negate
the advantages of instruction-level parallelism.
Register Allocation and Code Scheduling - contd
Remedial Action
• A hierarchical approach where code is optimized
inside out, starting with the innermost loops.
• Instructions are first scheduled assuming that every
pseudo-register (virtual register) will be allocated its
own physical register.
• Register allocation is applied after scheduling, spill
code is added where necessary, and the code is
then rescheduled.
• This process is repeated for the code at each loop
level, moving outward to the outer loops.
Instruction scheduling at Basic Block Level
• Construct a Data dependency graph on the
machine code of the basic block
• Apply the following prioritized topological ordering to the
data-dependence graph to generate a list schedule:
• If no resource constraints exist for scheduling, then the
shortest schedule is given by the critical path, which
is the longest path through the data-dependence
graph.
• If resource constraints exist, operations using more
critical resources may be given higher priority (the
critical resource is the one with the largest ratio of
uses to the number of units of that resource
available).
• Finally, the operation that shows up earlier in the
source program can be used to break ties between
operations of equal priority.
Global Code Scheduling
• Instruction-level parallelism at basic blocks tend to leave
many resources idle.
• In order to make better use of machine resources, it is
necessary to consider code-generation strategies that
move instructions from one basic block to another.
• Strategies that consider more than one basic block at a
time are referred to as Global Code scheduling
algorithms.
• To do Global Code scheduling correctly, the data
dependences and also control dependences should be
considered
Global Code Scheduling – An Example

Global Code Scheduling
• All loads (LD) are done in basic
block B1 (the load operations from
B2 & B3 are moved up to B1).
• The add operation of B3 is computed
speculatively in B1 (moved from
B3 to B1).
• Based on the conditional stmt.
in B1 either B2 or B3 will be
executed

Dynamic Scheduling
Designing the hardware so that it can dynamically
rearrange instruction execution to reduce stalls while
maintaining data flow.
Challenges in Dynamic Scheduling
• Instructions are issued to the pipeline in-order but
executed and completed out-of-order.
• Out-of-order execution leads to the possibility of out-
of-order completion.
• Out-of-order execution introduces the possibility of WAR
and WAW hazards, which do not occur in statically
scheduled pipelines.
• Out-of-order completion creates major complications in
exception handling.
Dynamic scheduling using Tomasulo’s Algorithm
Key Features
• Executing instructions only when operands are
available,
• Waiting instruction is stored in a Reservation
Station(RS)
• Reservation stations keep track of pending instructions
(RAW).
• WAW and WAR hazards can be avoided using Register
renaming.
• Common data bus carries results past the reservation
stations (where they are captured) and back to the
register file.
Dynamic scheduling using Tomasulo’s Algorithm
Key Features contd.
• In case of speculation, instructions that are predicted to
occur after a branch are executed without knowing the
branch outcome. The speculated instruction is committed
only if the outcome of the branch is favorable; otherwise
it is not committed.
• Key to speculation is to allow out-of-order instruction
execution, but force them to commit in order. Generally
achieved by a reorder buffer (ROB) which holds
completed instructions and retires them in order.
• Instruction-level Parallelism using Dynamic scheduling is
done in 3 phases
Three Phases:-
1. Issue
• Get the next instruction from the FIFO queue
• Issue the instruction to the RS with its operand values if they are
available
• If operand values are not available, stall the instruction.
2. Execute
• When an operand becomes available, store it in any reservation
stations waiting for it
• When all operands are ready, execute the instruction
• Loads and stores are maintained in program order through effective
addressing
• No instruction is allowed to initiate execution until all branches that
precede it in program order have completed
3. Write result
• Write the result on the Common Data Bus to all awaiting units
(reservation stations, store buffers)
• Stores must wait until both the address and the value are received
Figure: example illustrating WAW and RAW hazards removed by
register renaming, and execution order versus commit order.
Software Pipelining
• Do-all loops are particularly attractive from a
parallelization perspective because their iterations can be
executed in parallel to achieve a speed-up linear in the
number of iterations in the loop.
• Software pipelining schedules an entire loop at a time,
taking full advantage of the parallelism across iterations.
• A do-all loop of the kind sketched below is used to explain
software pipelining.
• Iterations of such a loop write to different memory locations;
therefore, there are no memory dependences between the
iterations, and all iterations can proceed in parallel.
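A representative do-all loop of this kind (a sketch; the exact loop appeared only in the original slide's figure, and the array names below are illustrative):

/* Every iteration writes a different element of D, so there are no
   memory dependences between iterations. */
void doall(float *D, const float *A, const float *B, float c, int n) {
    for (int i = 0; i < n; i++)
        D[i] = A[i] * B[i] + c;
}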
Software Pipelining
Assumptions for the machine
code
• one load, one store, and one
branch operation can take
place in one clock pulse
• Multiplication takes 3 clock
pulses and Addition takes 2
clock pulses
• In the following code,
BL R, L
decrements the register R and, unless the
result is 0, branches to location L.
Loop Unrolling
• Unrolling several iterations of a loop will have better
hardware utilization
• But it increases the code size, which in turn can have a
negative impact on overall performance.
• Thus, there is a need for compromise in picking a
number of times to unroll a loop that gets most of the
performance improvement, yet doesn't expand the
code size too much.
• Loop unrolling places several iterations of the loop in
one large basic block, and a simple list-scheduling
algorithm can be used to schedule the operations to
execute in parallel.
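A minimal sketch of unrolling a loop four times (for simplicity it assumes n is a multiple of 4):

/* original loop */
void add1(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* unrolled by 4: one larger basic block per iteration gives the list
   scheduler more independent operations to overlap */
void add4(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
}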
Software Pipelining of Loops
• Register usage is not considered in this example, loop is unrolled 5 times
• Shown in row i are all the operations issued at clock i; shown in column j
are all the operations from iteration j.
• Each row in the figure corresponds to one VLIW machine instruction
Software Pipelining of Loops
• Here loop is unrolled 4 times.
• The schedule executed by each
iteration in this example can be viewed
as an 8-stage pipeline.
• A new iteration can be started
on the pipeline every 2 clocks.
• At the beginning, there is only
one iteration in the pipeline.
• As the first iteration proceeds
to stage three, the second
iteration starts to execute in the
first pipeline stage.
• By clock 7, the pipeline is fully
filled with the first four
iterations
Software Pipelining of Loops
• In the steady state, four
consecutive iterations are
executing at the same time.
• A new iteration is started as
the oldest iteration in the
pipeline retires.
• When we run out of
iterations, the pipeline
drains,
• And all the iterations in the
pipeline run to completion.
• The sequence of instructions
used to fill the pipeline is called
the prolog.
Software Pipelining of Loops
• Lines 1 through 6 are called
the prolog;
• Lines 7 and 8 are the steady
state;
• And lines 9 through 14, used
to drain the pipeline, are called
the epilog.
• The software-pipelined loop
executes in 2n + 6 clocks, where
n is the number of iterations in
the original loop.
• The delay is placed strategically
after MUL and ADD operation,
so that the schedule can be
initiated every two clocks
without resource conflicts
Software Pipelining of Loops
• If the initiation of each schedule is
extended to 4 clocks to avoid
resource conflicts, then throughput
rate would be halved
• Hence, the schedule must be
chosen carefully in order to
optimize the throughput, otherwise
results in suboptimal throughput
when pipelined.
• When considering the register
usage , there might be a conflict
aroused in using the registers
between the overlapping loop
iterations.
• Register renaming techniques can
be used to resolve the conflict
Issues/challenges in code generation of a
Vectorizing/Parallelizing compilers
• Compiler developers have to resolve a number of issues
other than designing analysis and transformation
techniques. These issues become important when
creating a complete compiler implementation.
• An adequate compiler-internal representation of the
program must be chosen
• The large number of transformation passes need to be put
in the proper order
• Where to apply which transformation, so as to maximize
the benefits but keep the compile time within bounds.
• The user interface of the compiler is important

Internal Representation
• A large variety of compiler-internal program
representations(IRs) are in use.
• Several IRs may be used for several phases of the
compilation.
• The syntax tree IR represents the program at a level
that is close to the original program.
• Other representations are close to the generated
machine code.
• An example of an IR in between these extremes is
the register transfer language, which is used by the
widely-available GNU C compiler.
Internal Representation - Contd
• Source-level transformations, such as loop analysis
and transformations, are usually applied on an IR at
the level of the syntax-tree
• Whereas instruction-level transformations are
applied on an IR that is closer to the Generated
machine code.
• An example of a representation that includes analysis
information is the program dependence graph (PDG).
• The PDG includes information about both data
dependences and control dependences
• It facilitates transformations that need to deal with
both types of dependences at the same time.
Phase Ordering
• Many compiler techniques are applied in an obvious order.
• For Example, Data dependence analysis needs to come
before parallel loop recognition, and dependence removing
transformations.
• There are many situations where the order of
transformations is not easy to determine.
• One possible solution is for the compiler to generate
internally a large number of program variants and then
estimate their performance.
• Generating a large number of program variants may get
prohibitively expensive in terms of both compiler execution
time and space need.
Phase Ordering-contd

• Practical solutions to the phase ordering problem are
based on heuristics and ad-hoc strategies.
• In unimodular transformations, best combination of
iteration-reordering is determined based on data
dependence constraints.
Applying Transformations at the Right Place
• One of the most difficult problems for compilers is to
decide when and where to apply a specific technique.
• In addition to the phase ordering problem, there is the
issue that most transformations can have a negative
performance impact if applied to the wrong program
section.
• For example, a very small loop may run slower in parallel
than serially.
• Interchanging two loops may increase the parallel
granularity but reduce data locality.
• Stripmining for multi-level parallelism may introduce
more overhead than benefit if the loop has a small
number of iterations.
Applying Transformations at the Right
Place - contd
• This difficulty is increased by the fact that machine
architectures are getting more complex, requiring
specialized compiler transformations in many situations.
• Furthermore, an increasing number of compiler
techniques are being developed that apply to a specific
program pattern, but not in general.
• The compiler does not always have sufficient information
about the program input data and machine parameters
to make optimal decisions.
Speed versus Degree of Optimization
• Ordinary compilers transform medium-size programs in a
few seconds. This is not so for parallelizing compilers.
• Advanced program analysis methods, such as data
dependence analysis and symbolic range analysis, may take
significantly longer.
• Compilers may need to create several optimization variants
of a program and then pick the one with the best estimated
performance.
• This can further multiply the compilation time.
• It raises a new issue in that the compiler now needs to make
decisions about which program sections to optimize to the
fullest of its capabilities and where to save compilation time.
• One way of resolving this issue is to pass the decision on to
the user, in the form of command line flags.
Compiler Command Line Flags
• Ideally, a compiler would not need any
command line flags; it would make all decisions about
where to apply which optimization technique fully
automatically.
• But today's compilers are far from this goal.
• Compiler flags can be seen as one way for the compiler to
gather additional knowledge that is unavailable in the
program source code.
• They may supply information that otherwise would come
from program input data
Compiler Command Line Flags
• They provide information like
• Most frequently executed program sections,
• Machine environment (e.g., the cache size), or
• Application needs (e.g., degree of permitted round off
error).
• User preferences (e.g., compilation speed versus degree of
optimization).
• A parallelizing compiler can include several tens of
command line options.
• Reducing this number can be seen as an important goal for
the future generation of vectorizing and parallelizing
compilers.
Static Single Assignment
• Many of the annoying problems in implementing analyses
& optimizations stem from variable name conflicts.
• It would be nice if every assignment in a program used a
unique variable name, and every variable had exactly one
static assignment location.
• Static single assignment form is a property of an
intermediate representation (IR) that requires each
variable to be assigned exactly once and defined before it
is used.
Example
Normal Intermediate Static single assignment
Representation

x=y–z x=y-z
s=x+s s1 = x + s
x=s+p x1 = s1 + p
s=z*q s2 = z * q
s=x*s
s3 = x1 * s2
The variable s is redefined in statements 2, 4 and 5. In the normal IR,
even though the values of s differ, it is always represented as s, which
can lead to conflicts. Static single assignment addresses this by
renaming the definitions s1, s2 and s3; the same is done for the
variable x.
Program Analysis for Parallelism
Dependence Analysis
• For parallelization and vectorization, the compiler
typically takes as input the serial form of a program,
then determines which parts of the program can be
transformed into parallel or vector form.
• The key constraint is that the “results" of each
section of code must be the same as those of the
serial program.
• A data dependence between two sections of a
program indicates that during execution of the
optimized program, those two sections of code must
be run in the order indicated by the dependence.
Classification of dependence Analysis
• Data dependences between two sections of code that
access the same memory location are classified based on
the type of the access (read or write) and the order, so
there are four classifications:

• input dependence: READ before READ (parallelism possible)
• anti dependence: READ before WRITE (parallelism not possible)
• flow dependence: WRITE before READ (parallelism not possible)
• output dependence: WRITE before WRITE (parallelism not possible)
Classification of dependence Analysis
• Flow dependences are also referred to as true dependences.
• If an input dependence occurs between two sections of a
program, it allows the sections to run at the same time
• The existence of any of the other types of dependences
would prevent the sections from running in parallel,
• Techniques have been developed for changing the original
program in many situations where dependences exist, so that
the sections can run in parallel.
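A small sketch of these classifications on loop code (the arrays and statements are illustrative):

void deps(float *a, float *b, float *c, int n) {
    for (int i = 1; i < n - 1; i++) {
        a[i] = a[i - 1] + 1.0f;  /* flow dependence: iteration i reads
                                    the value iteration i-1 wrote       */
        b[i] = b[i + 1] * 2.0f;  /* anti dependence: iteration i reads
                                    b[i+1] before iteration i+1 writes
                                    it                                  */
        c[0] = a[i];             /* output dependence: every iteration
                                    writes the same location c[0]       */
    }
}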
Interprocedural Analysis
• A loop may contain one or more procedure calls.
• It presents a special challenge for parallelizing compilers
to compare the memory activity in different execution
contexts (subroutines), for the purpose of discovering
data dependences.
• A Method called subroutine inlining, is used to remove
all subroutine calls by directly replacing all subroutine
calls with the code from the called subroutine, then
parallelizing the whole program as one large routine
Challenge:
• Often causes an explosion in the amount of source code
that the compiler must compile.
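A minimal sketch of subroutine inlining (the routines are illustrative):

static float scale(float x) { return 2.0f * x; }

/* before inlining: the call hides the memory behavior of scale()
   from the loop's dependence analysis */
void before_inline(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = scale(a[i]);
}

/* after inlining: the body replaces the call, so the loop can be
   analyzed and parallelized as one routine */
void after_inline(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
}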
Abstract Interpretation
• When compilers need to know the result of executing a
section of code, they often traverse the program in
“execution order", keeping track of the effect of each
statement. This process is called Abstract interpretation.
• Since the compiler generally will not have access to the
runtime values of all the variables in the program, the
effect of each statement will have to be computed
symbolically.
• The effect of a loop is easily determined when there is a
fixed number of iterations
Abstract Interpretation – Contd &
Range Analysis
• The effect of the iteration whose limit is not known may be
determined by widening, in which the values changing due
to the loop are made to change as though the loop had an
infinite number of iterations, and then narrowing, in which
an attempt is made to factor in the loop exit conditions, to
limit the changes due to widening.
• Range Analysis Range analysis is an application of abstract
interpretation. It gathers the range of values that each
variable can assume at each point in the program.
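A small sketch of the kind of facts range analysis can derive; the ranges in the comments are what a compiler could infer symbolically (the code itself is illustrative):

void fill(float *a, int n) {
    /* assume n >= 1 on entry, so the range of n is [1, +inf)          */
    for (int i = 0; i < n; i++) {
        /* range of i inside the body: [0, n-1], so the access a[i]
           stays within an array of n elements                         */
        a[i] = (float)i;
    }
    /* range of i after the loop: exactly n                            */
}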
Data Flow Analysis
• A general framework for gathering the information is
called data flow analysis.
• To use data flow analysis, the compiler writer must
set up and solve systems of data flow equations that
relate information at various points in a program.
• The whole program is traversed and information is
gathered from each program node, then used in the
data flow equations.
• The traversal of the program can be either forward
or backward .
• At join points in the program's control flow graph,
the information coming from the paths that join
must be combined,
Data Flow Analysis - contd
• A typical data flow equation has the form
out[S] = gen[S] ∪ (in[S] − kill[S])
• This equation can be read as “the information at the end of
a statement is either generated within the statement, or
enters at the beginning and is not killed as control flows
through the statement.”
• It can be used for determining which variables are aliased
to which other variables,
• For determining which variables are potentially modified by
a given section of code,
• For determining which variables may be pointed to by a
given pointer, and many other purposes.
• Its use generally increases the precision of other compiler
analyses.
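A small worked instance of that equation for reaching definitions (a sketch; the statement and definition labels are illustrative):

Statement S:   x = y + 1
in[S]   = { d1: x = 0,  d2: y = 5 }   (definitions reaching S)
gen[S]  = { d3: x = y + 1 }           (definition created by S)
kill[S] = { d1 }                      (earlier definitions of x)
out[S]  = gen[S] ∪ (in[S] − kill[S]) = { d3 } ∪ { d2 } = { d2, d3 }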
