Coa-Unit 4 Handout
Loop-Level Parallelism
– Parallelism among iterations of a loop.
• Example: for(I=1; I<=100; I++)
X[I]=X[I]+Y[I];
– Each iteration of the loop can overlap with any other iteration in
this example.
– Techniques for converting loop-level parallelism into ILP
• Loop unrolling (see the sketch after this list)
• Use of vector instructions (Appendix G)
– LOAD X; LOAD Y; ADD X, Y; STORE X
– Originally used in mainframes and supercomputers.
– Died away due to the effective use of pipelining in desktop and
server processors.
– Seeing a renaissance for use in graphics, DSP, and multimedia
applications.
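As a rough illustration (not from the handout; the function names and the 0-based C indexing are my own), unrolling the loop by a factor of four exposes four independent additions per pass that a compiler or a multiple-issue pipeline can overlap:

#include <stdio.h>

#define N 100

/* Original loop: one add per iteration; the iterations are independent. */
void add_rolled(double X[], double Y[]) {
    for (int i = 0; i < N; i++)
        X[i] = X[i] + Y[i];
}

/* Unrolled by 4: four independent adds per iteration give the scheduler
   (or a superscalar pipeline) more instruction-level parallelism.
   N is assumed to be a multiple of 4 to keep the sketch short. */
void add_unrolled(double X[], double Y[]) {
    for (int i = 0; i < N; i += 4) {
        X[i]     = X[i]     + Y[i];
        X[i + 1] = X[i + 1] + Y[i + 1];
        X[i + 2] = X[i + 2] + Y[i + 2];
        X[i + 3] = X[i + 3] + Y[i + 3];
    }
}

int main(void) {
    double X[N], Y[N];
    for (int i = 0; i < N; i++) { X[i] = i; Y[i] = 1.0; }
    add_unrolled(X, Y);
    printf("X[99] = %g\n", X[99]);   /* expect 100 */
    return 0;
}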
Name Dependence
– Name dependences
• Occur when two instructions use the same register or memory
location, called a name, but there is no data flow between the
instructions associated with that name.
– Two types of name dependences:
• Antidependence: Occurs when instruction j writes a register or
memory location that instruction i reads, and instruction i is
executed first.
• Output dependence: Occurs when instruction i and instruction j
write the same register or memory location.
– Register renaming can be employed to eliminate name
dependences (a small sketch follows this list).
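A tiny C sketch (not from the handout; the variable names r1 to r7 simply stand in for registers) of how reusing one name creates hazards that fresh names remove:

#include <stdio.h>

int main(void) {
    int r2 = 2, r3 = 3, r4 = 4, r5 = 5, r6 = 6, r7 = 7;

    /* Name dependences: no data flows through "r2"; the instructions
       merely reuse the same name. */
    int r1      = r2 + r3;  /* i: reads r2                                           */
    int r2_new  = r4 * r5;  /* j: would have written r2 -> antidependence (WAR) with i */
    int r2_new2 = r6 - r7;  /* k: would have written r2 again -> output dependence (WAW) with j */

    /* Because j and k received fresh names, i, j, and k can now be reordered
       or overlapped freely; only true data flow constrains them. */
    printf("%d %d %d\n", r1, r2_new, r2_new2);
    return 0;
}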
Control Dependence
• A control dependence determines the ordering of an
instruction with respect to a branch instruction.
– Example: S1 is control dependent on p1, but not on p2; S2 is
control dependent on p2, but not on p1.
if p1 {
S1;
};
if p2 {
S2;
};
Speculation
• Check whether an instruction can be executed in violation of a
control dependence (e.g., moved above a branch) while still
preserving the exception behavior and the data flow.
• Example
DADDU R1, R2, R3
BEQZ R12, skipnext
DSUBU R4, R5, R6
DADDU R5, R4, R9
skipnext: OR R7, R8, R9
– If R4 is not used (dead) after skipnext and DSUBU cannot raise an
exception, the DSUBU can be moved above the branch without
changing the data flow or the exception behavior.
Basic Ideas
– A reservation station (RS) fetches and buffers an operand
as soon as it is available.
– Pending instructions designate the RS that will provide
their inputs.
– When successive writes to a register appear, only the last
one is actually used to update the register.
– As instructions are issued, the register specifiers for
pending operands are renamed to the names of the RS, i.e.,
register renaming.
• The functionality of register renaming is provided by
– The reservation stations (RS), which buffer the operands of
instructions waiting to issue (see the sketch after this list).
– The issue logic
• Since there can be more RSs than real registers, the technique can
eliminate hazards that could not be eliminated by a compiler.
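A minimal sketch of what one reservation station entry could record in a Tomasulo-style design; the field names follow the common textbook convention (Vj/Vk for operand values, Qj/Qk for producing-RS tags) and are not taken from this handout:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One reservation station entry.  Qj/Qk name the RS that will produce an
   operand (0 = value already present in Vj/Vk), which is exactly the renaming
   described above: pending instructions refer to reservation stations rather
   than to architectural registers. */
typedef struct {
    bool     busy;   /* entry in use                                    */
    uint8_t  op;     /* operation to perform on the operands            */
    int64_t  Vj, Vk; /* operand values, valid once fetched              */
    uint16_t Qj, Qk; /* producing RS tags; 0 means the value is ready   */
    uint32_t A;      /* address field for loads/stores                  */
} ReservationStation;

int main(void) {
    /* Second operand not ready yet: it will come from reservation station 3. */
    ReservationStation rs = { .busy = true, .op = 1, .Vj = 5, .Qj = 0, .Qk = 3 };
    printf("busy=%d waiting_on_RS=%u\n", rs.busy, (unsigned)rs.Qk);
    return 0;
}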
UNIT-IV
Part-A
UNIT-IV
Part-B
Static multiple-issue processors all use the compiler to assist with packaging instructions and
handling hazards. In a static issue processor, you can think of the set of instructions issued in a
given clock cycle, which is called an issue packet, as one large instruction with multiple operations.
Very Long Instruction Word (VLIW):
A style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields. A rough
sketch of such a packet follows.
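As an illustration only, with hypothetical field names in the spirit of a static two-issue design, an issue packet can be pictured as one wide instruction with a slot per operation:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical two-issue packet: the compiler fills one ALU/branch slot and
   one load/store slot per clock cycle, inserting a no-op (0 here) when it
   cannot find an independent operation for a slot.  A VLIW simply has more
   such slots, each with its own opcode field. */
typedef struct {
    uint32_t alu_or_branch_op;   /* slot 1: ALU or branch operation */
    uint32_t load_or_store_op;   /* slot 2: load or store operation */
} IssuePacket;

int main(void) {
    IssuePacket p = { .alu_or_branch_op = 0x12, .load_or_store_op = 0 }; /* ALU op + no-op */
    printf("%u %u\n", (unsigned)p.alu_or_branch_op, (unsigned)p.load_or_store_op);
    return 0;
}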
The commit unit is the unit in a dynamic or out-of-order execution pipeline that decides when
it is safe to release the result of an operation to programmer-visible registers and memory.
A reservation station is a buffer within a functional unit that holds the operands and the operation.
The reorder buffer is the buffer that holds results in a dynamically scheduled processor until it is
safe to store the results to memory or a register.
• Single instruction: All processing units execute the same instruction issued by the control unit
at any given clock cycle, as shown in figure 13.5, where there are multiple processors executing
the instruction given by one control unit.
• Multiple data: Each processing unit can operate on a different data element, as shown in the
figure below, where the processors are connected to shared memory or an interconnection
network providing multiple data to the processing units.
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal
network, and a very large array of very small-capacity instruction units.
• Thus a single instruction is executed by different processing units on different sets of data, as
shown in the figure.
• Best suited for specialized problems characterized by a high degree of regularity, such as
image processing and vector computation.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays (e.g., Connection Machine CM-2, MasPar MP-1 and MP-2) and
Vector Pipeline processors (e.g., IBM 9000, Cray C90, Hitachi S820)
Thus, in these computers the same data flows through a linear array of processors executing
different instruction streams, as shown in the figure.
• This architecture is also known as a systolic array, used for pipelined execution of specific
instructions.
• Few actual examples of this class of parallel computer have ever existed. One is the experimental
Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be:
1. multiple frequency filters operating on a single signal stream
2. multiple cryptography algorithms attempting to crack a single coded message.
For both the analogy and parallel programming, the challenges include scheduling, partitioning
the work into parallel pieces, balancing the load evenly between the workers, time to synchronize,
and overhead for communication between the parties. The challenge is stiffer with more reporters
for a newspaper story and with more processors for parallel programming.
Another obstacle is Amdahl's Law. It reminds us that even small parts of a program must be
parallelized if the program is to make good use of many cores.
Speed-up Challenge: Suppose you want to achieve a speed-up of 90 times faster with
100 processors. What percentage of the original computation can be sequential?
Amdahl's Law relates speed-up to the original execution time; the relation is restated below.
Thus, to achieve a speed-up of 90 from 100 processors, the sequential percentage can be at most
0.1%.
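The formula itself did not survive in this handout; the following restates Amdahl's Law for a fraction F of the original execution time that can be spread across 100 processors and shows the arithmetic behind the 0.1% figure:

\[
\text{Speed-up} = \frac{1}{(1 - F) + \dfrac{F}{100}}
\]

Requiring a speed-up of 90:

\[
90 = \frac{1}{(1 - F) + \dfrac{F}{100}}
\;\Rightarrow\; 90(1 - F) + 0.9F = 1
\;\Rightarrow\; 89.1\,F = 89
\;\Rightarrow\; F \approx 0.999
\]

So at most \(1 - F \approx 0.001\) of the original computation, i.e., 0.1%, can be sequential.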
Examples show that getting good speed-up on a multiprocessor while keeping the problem size
fixed is harder than getting good speed-up by increasing the size of the problem. This insight
allows us to introduce two terms that describe ways to scale up.
Strong scaling means measuring speed-up while keeping the problem size fixed. Weak scaling
means that the problem size grows proportionally to the increase in the number of processors.
Speed-up Challenge: Balancing Load
The example demonstrates the importance of balancing the load: just a single processor with
twice the load of the others cuts the speed-up by a third, and five times the load on just one
processor reduces the speed-up by almost a factor of three. The general relationship is sketched below.
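As a sketch of the underlying relationship (not part of the original text), the parallel portion finishes only when the most heavily loaded processor finishes, so

\[
\text{Speed-up} = \frac{T_{\text{sequential}}}{T_{\text{serial}} + \max_{i} T_i},
\]

where \(T_i\) is the time processor \(i\) spends on its share of the parallel work. Overloading a single processor increases \(\max_i T_i\) even though the total work is unchanged, which is why one overloaded processor drags down the whole speed-up.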
4. Explain in detail the shared memory multiprocessor, with a neat diagram. (16 marks)
A shared memory multiprocessor (SMP) is one that offers the programmer a single physical
address space across all processors (which is nearly always the case for multicore chips), although
a more accurate term would have been shared-address multiprocessor. Processors communicate
through shared variables in memory, with all processors capable of accessing any memory
location via loads and stores. Note that such systems can still run independent jobs in their own
virtual address spaces, even if they all share a physical address space. Single address space
multiprocessors come in two styles. In the first style, the latency to a word in memory does not
depend on which processor asks for it.
Such machines are called uniform memory access (UMA) multiprocessors. In the second style,
some memory accesses are much faster than others, depending on which processor asks for
which word, typically because main memory is divided and attached to different microprocessors
or to different memory controllers on the same chip. Such machines are called non-uniform memory
access (NUMA) multiprocessors. As you might expect, the programming challenges are harder for
a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines can scale to
larger sizes and NUMAs can have lower latency to nearby memory.
As processors operating in parallel will normally share data, they also need to coordinate when
operating on shared data; otherwise, one processor could start working on data before another is
finished with it. This coordination is called synchronization. When sharing is supported with a
single address space, there must be a separate mechanism for synchronization. One approach
uses a lock for a shared variable. Only one processor at a time can acquire the lock, and other
processors interested in the shared data must wait until the original processor unlocks the
variable. A minimal sketch of a lock follows.
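This is a minimal sketch of the lock idea using C11 atomics (not code from the handout; real multiprocessors typically build locks from hardware primitives such as atomic exchange or load-linked/store-conditional):

#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical shared data protected by a lock; the lock is a single flag
   that only one processor (thread) can hold at a time. */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static long shared_counter = 0;

void locked_increment(void) {
    /* Spin until we acquire the lock: test_and_set returns the old value,
       so a return of 0 (clear) means we just acquired it. */
    while (atomic_flag_test_and_set(&lock))
        ;                         /* other processors wait here            */
    shared_counter++;             /* critical section on the shared variable */
    atomic_flag_clear(&lock);     /* unlock so a waiting processor can proceed */
}

int main(void) {
    /* In a real program many threads would call locked_increment concurrently. */
    locked_increment();
    printf("%ld\n", shared_counter);
    return 0;
}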
The next step is to add these 64 partial sums. This step is called a reduction, where we divide
to conquer. Half of the processors add pairs of partial sums, then a quarter add pairs of the new
partial sums, and so on until we have the single, final sum.
Each processor needs to have its own version of the loop counter variable i, so we must indicate
that it is a private variable. The code is sketched below.
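The code itself is missing from this handout; the following is a minimal runnable sketch of such a tree reduction, assuming 64 "processors" are modelled as POSIX threads, Pn is the processor number, sum[Pn] is its partial sum, and a barrier plays the role of the synchronization step:

#include <pthread.h>
#include <stdio.h>

#define P 64                        /* number of "processors" (threads)   */
#define N 64000                     /* elements to sum, N/P per processor */

static double A[N];                 /* data to be summed                  */
static double sum[P];               /* one partial sum per processor      */
static pthread_barrier_t barrier;   /* stands in for the synch() step     */

static void *worker(void *arg) {
    int Pn = (int)(long)arg;        /* this processor's number            */

    /* Step 1: each processor sums its own slice; i is a local (private)
       variable, so every processor has its own copy of the loop counter. */
    sum[Pn] = 0.0;
    for (int i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
        sum[Pn] += A[i];

    /* Step 2: tree reduction.  Half the processors add pairs of partial
       sums, then a quarter, and so on; the total ends up in sum[0].       */
    int half = P;                   /* half is private to each processor   */
    do {
        pthread_barrier_wait(&barrier);     /* wait for this round's sums  */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];        /* odd count: P0 takes the extra */
        half = half / 2;                    /* dividing line on who sums     */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];      /* lower half adds upper half    */
    } while (half > 1);
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int i = 0; i < N; i++) A[i] = 1.0;            /* expect total = 64000 */
    pthread_barrier_init(&barrier, NULL, P);
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("total = %g\n", sum[0]);
    return 0;
}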
Some writers repurposed the acronym SMP to mean symmetric multiprocessor, indicating that the
latency from a processor to memory was about the same for all processors.
With coarse-grained multithreading, when a stall occurs the pipeline must be emptied and then
refilled before instructions from the new thread can complete. Due to this start-up overhead,
coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls,
where the pipeline refill time is negligible compared to the stall time.
The accompanying figure shows how four threads use the issue slots of a superscalar processor
under the different approaches.
Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the
resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level
parallelism at the same time it exploits instruction-level parallelism. The key insight that motivates
SMT is that multiple-issue processors often have more functional unit parallelism available than
most single threads can effectively use. Furthermore, with register renaming and dynamic
scheduling, multiple instructions from independent threads can be issued without regard to the
dependences among them; the resolution of dependences can be handled by the dynamic scheduling
capability. Since SMT relies on existing dynamic mechanisms, it does not switch resources every
cycle. Instead, SMT is always executing instructions from multiple threads, leaving it up to the
hardware to associate instruction slots and renamed registers with their proper threads.
The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support. The three examples at the bottom show how
they would execute running together under the three multithreading options. The horizontal
dimension represents the instruction issue capability in each clock cycle. The vertical dimension
represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue
slot is unused in that clock cycle. The shades of gray and color correspond to the four different
threads in the multithreading processors.
One approach is to provide a single physical address space that all processors can share, so that
programs need not concern themselves with where their data is, merely that they may be executed
in parallel. In this approach, all variables of a program can be made available at any time to any
processor. The alternative is to have a separate address space per processor, which requires that
sharing be explicit.
Introduction to Graphics Processing Units (GPU):
The original justification for adding SIMD instructions to existing architectures was that
many microprocessors were connected to graphics displays in PCs and workstations, so an
increasing fraction of processing time was used for graphics. As Moore's Law increased the number
of transistors available to microprocessors, it therefore made sense to improve graphics
processing.
A major driving force for improving graphics processing was the computer game industry,
both on PCs and in dedicated game consoles such as the Sony PlayStation. The rapidly growing
game market encouraged many companies to make increasing investments in developing faster
graphics hardware, and this positive feedback loop led graphics processing to improve at a faster
rate than general-purpose processing in mainstream microprocessors. Given that the graphics and
game community had different goals than the microprocessor development community, it evolved
its own style of processing and terminology. As graphics processors increased in power, they
earned the name Graphics Processing Units or GPUs to distinguish themselves from CPUs. For a
few hundred dollars, anyone can buy a GPU today with hundreds of parallel floating-point units,
which makes high-performance computing more accessible. The interest in GPU computing
blossomed when this potential was combined with a programming language that made GPUs
easier to program. Hence, many programmers of scientific and multimedia applications today are
pondering whether to use GPUs or CPUs.
Here are some of the key characteristics of how GPUs differ from CPUs:
■ GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the
tasks of a CPU. This role allows them to dedicate all their resources to graphics. It is fine for GPUs
to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the
CPU can do them if needed.
■ The GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of
gigabytes to terabytes. These differences led to different styles of architecture:
■ Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long
latency to memory, as CPUs do. Instead, GPUs rely on hardware multithreading (Section 6.4) to
hide the latency to memory. That is, between the time of a memory request and the time that the
data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.
The GPU memory is thus oriented toward bandwidth rather than latency. There are even
special graphics DRAM chips for GPUs that are wider and have higher bandwidth than the DRAM
chips for CPUs. In addition, GPUs have traditionally had smaller main memories than
conventional microprocessors. In 2013, GPUs typically had 4 to 6 GiB or less, while CPUs
had 32 to 256 GiB. Finally, keep in mind that for general-purpose computation, you must
include the time to transfer the data between CPU memory and GPU memory, since the GPU is a
coprocessor.
■ Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate
many parallel processors (MIMD) as well as many threads. Hence, each GPU processor is more
highly multithreaded than a typical CPU, plus they have more processors.
Similarities and differences between multicore with multimedia SIMD extensions and
recent GPUs:
At a high level, multicore computers with SIMD instruction extensions do share
similarities with GPUs. Both are MIMDs whose processors use multiple SIMD lanes, although
GPUs have more processors and many more lanes. Both use hardware multithreading to improve
processor utilization, although GPUs have hardware support for many more threads. Both use
caches, although GPUs use smaller streaming caches while multicore computers use large
multilevel caches that try to contain whole working sets completely. Both use a 64-bit address
space, although physical main memory is much smaller in GPUs. While GPUs support memory
protection at the page level, they do not yet support demand paging.
SIMD processors are also similar to vector processors. The multiple SIMD processors in
GPUs act as independent MIMD cores, just as many vector computers have multiple vector
processors.