COA Unit 4 Handout

This document discusses techniques for exploiting instruction-level parallelism (ILP) in processors. It describes how ILP can reduce pipeline stalls and mentions techniques like dynamic scheduling that allow out-of-order execution to find parallelism. It also discusses preserving data flow and exception behavior when violating control dependence through speculation. Loop-level parallelism and eliminating dependencies through renaming registers are also covered.


Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation (slides by Rung-Bin Lin)

Instruction-Level Parallelism: Concepts and Challenges

• Instruction-level parallelism (ILP)
– The potential of overlapping the execution of multiple instructions is called instruction-level parallelism.

Techniques to Reduce Pipeline CPI


• Recall:
– Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls.
– Exploiting instruction-level parallelism reduces the number of stalls.
– How to find ILP:
• Dynamically, located by hardware
• Statically, located by software
– Techniques that affect CPI (Fig. 3.1 on page 173).
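For instance, with purely illustrative stall contributions (these numbers are not from the text), the equation works out as

\text{Pipeline CPI} = 1.0 + 0.10_{\text{struct}} + 0.30_{\text{RAW}} + 0.05_{\text{WAR}} + 0.05_{\text{WAW}} + 0.20_{\text{control}} = 1.70,

so eliminating only the RAW and control stalls would already bring the CPI back down to 1.20.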

ILP Within and Across a Basic Block


• ILP within a basic block
– If the branch frequency is 15%~25%, there are only 4~7 instructions within a basic block. This implies that we must exploit ILP across basic blocks.
• Loop-level parallelism (ILP across basic blocks)
– Exploit parallelism among the iterations of a loop.

Loop-Level Parallelism
– Parallelism among iterations of a loop.
• Example: for(I=1; I<=100; I++)
X[I]=X[I]+Y[I];
– Each iteration of the loop can overlap with any other iteration in
this example.
– Techniques for converting loop-level parallelism into ILP
• Loop unrolling (see the sketch after this list)
• Use of vector instructions (Appendix G)
– LOAD X; LOAD Y; ADD X, Y; STORE X
– Originally used in mainframes and supercomputers.
– Died away due to the effective use of pipelining in desktop and server processors.
– Seeing a renaissance for use in graphics, DSP, and multimedia applications.
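As an illustration of loop unrolling for the loop above, here is a C sketch; the unroll factor of four is an arbitrary choice, and the trip count of 100 happens to be divisible by it.

void vec_add(double X[100], const double Y[100]) {
    /* Original loop: one add and one branch per element. */
    for (int i = 0; i < 100; i++)
        X[i] = X[i] + Y[i];
}

void vec_add_unrolled(double X[100], const double Y[100]) {
    /* Unrolled by four: four independent adds per iteration that the
       scheduler can overlap, and only a quarter of the loop branches. */
    for (int i = 0; i < 100; i += 4) {
        X[i]     = X[i]     + Y[i];
        X[i + 1] = X[i + 1] + Y[i + 1];
        X[i + 2] = X[i + 2] + Y[i + 2];
        X[i + 3] = X[i + 3] + Y[i + 3];
    }
}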

Data Dependence and Hazards


• To exploit ILP, instructions must be free of dependences
• A dependence indicates the possibility of a hazard,
– Determines the order in which results must be calculated, and
– Sets an upper bound on how much parallelism can possibly be exploited.
• Overcome the limitation that dependences place on ILP by
– Maintaining the dependence but avoiding a hazard,
– Eliminating a dependence by transforming the code.
• Dependence types
– Data dependences
• Create RAW hazards
– Name dependences
• Create WAR and WAW hazards
– Control dependences

Name Dependence
– Name dependences
• Occur when two instructions use the same register or memory location, called a name, but no data flows between the instructions through that name.
– Two types of name dependences:
• Antidependence: occurs when instruction j writes a register or memory location that instruction i reads and instruction i is executed first.
• Output dependence: occurs when instruction i and instruction j write the same register or memory location.
– Register renaming can be employed to eliminate name dependences.

Control Dependence
• A control dependence determines the ordering of an instruction with respect to a branch instruction.
– Example: S1 is control dependent on p1 but not on p2, and S2 is control dependent on p2 but not on p1.
if p1 {
S1;
};
if p2 {
S2;
};

Two Constraints Imposed by Control Dependences
– An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.

How the Simple Pipeline in Appendix A Preserves Control Dependence

– Instructions execute in order.


– Detection of control or branch hazards ensures that an
instruction that is control dependent on a branch is not
executed until the branch direction is known.

Can We Violate Control Dependence?


• Yes, we can
– If we can ensure that violating a control dependence will not make the program incorrect, the control dependence is not a critical property that must be preserved.
– What must be preserved are the two properties critical to program correctness: the exception behavior and the data flow, which are normally maintained by enforcing data and control dependences.

Preserving Exception Behavior


– Preserving the exception behavior means that any changes
in the ordering of instruction execution must not change
how exceptions are raised in the program.
• Often this is relaxed to mean that the reordering of instruction
execution must not cause any new exceptions in the program.
• Example
DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
L1: …
What if LW is moved before BEQZ and it raises a memory exception when the branch is taken?

Preserving Data Flow


– The actual flows of data among instructions that produce
results and those that consume them must be preserved.
– Branch makes data flow dynamic (i.e., coming from
multiple points).
– Example
DADDU R1, R2, R3
BEQZ R4, L
DSUBU R1, R5, R6
L: …
OR R7, R1, R8
– “Preserving data flow” means that if the branch is not taken, the value of R1 computed by DSUBU is used by OR; otherwise, the value of R1 computed by DADDU is used.

Speculation
• Ask whether an instruction can be executed in violation of a control dependence while still preserving the exception behavior and the data flow.
• Example
DADDU R1, R2, R3
BEQZ R12, skipnext
DSUBU R4, R5, R6
DADDU R5, R4, R9
skipnext: OR R7, R8, R9

– What about moving DSUBU before BEQZ if R4 is not used on the taken path?

Overcoming Data Hazards with Dynamic Scheduling

– Basic idea:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– SUB.D is stalled behind ADD.D (which waits for DIV.D), even though SUB.D is not data dependent on anything in the pipeline.
– The major limitation of the pipelines introduced so far is in-order instruction issue.
– Allowing SUB.D to execute by dynamically scheduling the instructions creates out-of-order execution, and thus out-of-order completion.

Advantages and Problems of Dynamic Scheduling

– Advantages
• Enables handling some cases in which dependences are unknown at compile time (e.g., when memory references are involved).
• Simplifies the compiler.
• Allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline.
– Problems
• It creates WAR and WAW hazards.
• It complicates exception handling due to out-of-order completion, creating imprecise exceptions.
– The processor state when an exception is raised does not look exactly as if the instructions were executed sequentially in strict program order.

Support Dynamic Scheduling for the Simple Five-Stage Pipeline
• Divide the ID stage into the following two stages:
– Issue: Decode instructions and check for structural
hazards.
– Read operands: Wait until no data hazards, then read
operands.

Dynamic Scheduling Algorithms


• Algorithms
– Scoreboarding, which originated in the CDC 6600 (Appendix A).
• Effective when there are sufficient resources and no data dependences.
– Tomasulo's algorithm, which originated in the IBM 360/91.
• Both algorithms can be applied to pipelined or multiple-functional-unit implementations.

Dynamic Scheduling Using Tomasulo's Approach

• Combine key elements of the scoreboarding scheme with register renaming.
– Track the availability of operands to minimize RAW hazards.
– Use register renaming to minimize WAR and WAW hazards.

Concept of Register Renaming


• Code before renaming
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
• Code after renaming (S and T are temporary registers)
DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D S, 0(R1)
SUB.D T, F10, F14
MUL.D F6, F10, T
– Renaming eliminates the WAR hazard on F8 (read by ADD.D, written by SUB.D), the WAR hazard on F6 (read by S.D, written by MUL.D), and the WAW hazard on F6 (written by both ADD.D and MUL.D).

Basic Architecture for Tomasulo's Approach

[Figure: the floating-point operation queue, reservation stations, load and store buffers, the FP registers, and the common data bus (CDB) that connects them.]

Basic Ideas
– A reservation station (RS) fetches and buffers an operand
as soon as it is available.
– Pending instructions designate the RS that will provide
their inputs.
– When successive writes to a register appear, only the last
one is actually used to update the register.
– As instructions are issued, the register specifiers for
pending operands are renamed to the names of the RS, i.e.,
register renaming
• The functionality of register renaming is provided by
– The reservation stations (RS), which buffer the operands of instructions waiting to issue.
– The issue logic
• Since there can be more RSs than real registers, the technique can eliminate hazards that a compiler could not.

What Does a Reservation Station Actually Hold?


– Instructions that have been issued and are awaiting
execution at a functional unit.
– The operands if available, otherwise, the source of the
operands.
– The information needed to control the execution of the
instruction at the unit.
– The load buffers and store buffers hold data or addresses
coming from and going to memory.

Steps in Tomasulo’s Approach


– Issue
• Get an instruction from the floating-point queue. If it is a floating-point operation, issue it if there is an empty RS, and send the operands to the RS if they are in the registers. If it is a load or store, it can be issued if there is an available buffer. If the required hardware resource is not available, the instruction stalls.
– Execute
• If one or more operands are not yet available, monitor the CDB to obtain the required operands. When both operands are available, the instruction is executed. This step checks for RAW hazards.
– Write result
• When the result is available, write it on the CDB and from there into the registers and any RS waiting for this result.
• The above steps differ from scoreboarding in the following three aspects:
• No checking for WAW and WAR hazards.
• The CDB is used to broadcast results.
• Loads and stores are treated as basic functional units.
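To make the bookkeeping concrete, here is a minimal C sketch of one reservation-station entry and the CDB broadcast step. The field names (op, Vj/Vk, Qj/Qk, A, busy) follow the usual textbook convention, but the struct layout and helper functions are illustrative assumptions, not any particular machine's design.

#include <stdbool.h>
#include <stdint.h>

/* One reservation-station entry in Tomasulo's scheme (illustrative sketch). */
typedef struct {
    bool     busy;   /* holds an issued, not-yet-completed instruction       */
    int      op;     /* operation to perform on the source operands          */
    double   Vj, Vk; /* operand values, valid only when Qj/Qk are 0          */
    int      Qj, Qk; /* numbers of the RSs that will produce the operands;
                        0 means the operand is already available in Vj/Vk    */
    uint32_t A;      /* immediate or effective address for loads and stores  */
} ReservationStation;

/* Execute step: an instruction may begin once both operands are available. */
static bool ready_to_execute(const ReservationStation *rs) {
    return rs->busy && rs->Qj == 0 && rs->Qk == 0;
}

/* Write-result step: the finishing RS broadcasts its result on the CDB, and
   every waiting RS that names it grabs the value and marks that operand ready. */
static void cdb_broadcast(ReservationStation *all, int n,
                          int finished_rs, double result) {
    for (int i = 0; i < n; i++) {
        if (!all[i].busy) continue;
        if (all[i].Qj == finished_rs) { all[i].Vj = result; all[i].Qj = 0; }
        if (all[i].Qk == finished_rs) { all[i].Vk = result; all[i].Qk = 0; }
    }
}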

UNIT-IV

Part-A

1. What is meant by ILP?


Pipelining exploits the potential parallelism among instructions. This parallelism
is called instruction-level parallelism (ILP). There are two primary methods for
increasing the potential amount of instruction-level parallelism.
1. Increasing the depth of the pipeline to overlap more instructions.
2. Multiple issue.
2. What is meant by multiple issue and its 2 approaches?
Multiple issue is a scheme whereby multiple instructions are launched in one
clock cycle. It is a method for increasing the potential amount of instruction-level
parallelism. It is done by replicating the internal components of the computer so
that it can launch multiple instructions in every pipeline stage. The two approaches are:
1. Static multiple issue (at compile time)
2. Dynamic multiple issue (at run time)
3. What is meant by speculation?
One of the most important methods for finding and exploiting more ILP is
speculation. It is an approach whereby the compiler or processor guesses the outcome of
an instruction to remove it as a dependence in executing other instructions.
For example, we might speculate on the outcome of a branch, so that instructions
after the branch could be executed earlier.

4. Define – Static Multiple Issue


Static multiple issue is an approach to implementing a multiple-issue processor
where many decisions are made by the compiler before execution.

5. Define – Issue Slots and Issue Packet


Issue slots are the positions from which instructions could issue in a given clock
cycle; by analogy, these correspond to positions at the starting blocks for a sprint.
Issue packet is the set of instructions that issues together in one clock cycle; the packet
may be determined statically by the compiler or dynamically by the processor.
6. Define – VLIW
Very Long Instruction Word (VLIW) is a style of instruction set architecture that launches many operations, defined to be independent, in a single wide instruction, typically with many separate opcode fields.

7. Define – Superscalar Processor


Superscalar is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution. Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars.

8. What is meant by loop unrolling?


An important compiler technique to get more performance from loops is loop
unrolling, where multiple copies of the loop body are made. After unrolling, there is more
ILP available by overlapping instructions from different iterations.

9. What is meant by anti-dependence and how is it removed?


Antidependence, also called name dependence, is an ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.
Register renaming is the renaming of registers by the compiler or hardware to
remove antidependences.

10. What is the use of reservation station and reorder buffer?


Reservation station is a buffer within a functional unit that holds the operands and
the operation.
Reorder buffer is the buffer that holds results in a dynamically scheduled
processor until it is safe to store the results to memory or a register.

11. Differentiate in-order execution from out-of-order execution.


Out-of-order execution is a situation in pipelined execution when an instruction
blocked from executing does not cause the following instructions to wait. It preserves the
data flow order of the program.
In-order execution requires the instruction fetch and decode unit to issue
instructions in order, which allows dependences to be tracked, and the commit unit is
required to write results to registers and memory in program fetch order. This
conservative mode is called in-order commit.

12. What is meant by hardware multithreading?


Hardware multithreading allows multiple threads to share the functional units of a
single processor in an overlapping fashion to try to utilize the hardware resources
efficiently. To permit this sharing, the processor must duplicate the independent state of
each thread. It increases the utilization of a processor.
13. What are the two main approaches to hardware multithreading?
There are two main approaches to hardware multithreading. Fine-grained
multithreading switches between threads on each instruction, resulting in interleaved
execution of multiple threads. This interleaving is often done in a round-robin fashion,
skipping any threads that are stalled at that clock cycle.
Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading. Coarse-grained multithreading switches threads only on costly stalls,
such as last-level cache misses.

14. What is meant by SMT?


Simultaneous multithreading (SMT) is a variation on hardware multithreading
that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to
exploit thread-level parallelism at the same time it exploits instruction level parallelism.

15. Differentiate SMT from hardware multithreading.


Since SMT relies on the existing dynamic mechanisms, it does not switch
resources every cycle. Instead, SMT is always executing instructions from multiple
threads, leaving it up to the hardware to associate instruction slots and renamed registers
with their proper threads.

16. What are the 3 multithreading options?


The three multithreading options are:
1. A superscalar with coarse-grained multithreading
2. A superscalar with fine-grained multithreading
3. A superscalar with simultaneous multithreading
17. Define – SMP
Shared memory multiprocessor (SMP) is one that offers the programmer a single
physical address space across all processors - which is nearly always the case for
multicore chips. Processors communicate through shared variables in memory, with all
processors capable of accessing any memory location via loads and stores.

18. Differentiate UMA from NUMA.


Uniform memory access (UMA) is a multiprocessor in which latency to any word
in main memory is about the same no matter which processor requests the access.
Nonuniform memory access (NUMA) is a type of single-address-space multiprocessor in which some memory accesses are much faster than others, depending on which processor asks for which word.

UNIT-IV

Part-B

1. Explain in detail, the instruction level parallelism. (16 marks)


Pipelining exploits potential parallelism among instructions. This parallelism is called instruction-level parallelism (ILP). There are two primary methods for increasing the potential amount of instruction-level parallelism. The first is increasing the depth of the pipeline to overlap more instructions. Another approach is to replicate internal components of the computer so that it can launch multiple instructions in every pipeline stage; the general name for this technique is multiple issue. Launching multiple instructions per stage allows the instruction execution rate to exceed the clock rate or, stated alternatively, the CPI to be less than 1. Assuming a four-issue, five-stage pipeline, such a processor would have 20 instructions in execution at any given time. Today's high-end microprocessors attempt to issue from three to six instructions in every clock cycle. Even moderate designs aim at a peak IPC of 2. There are typically, however, many constraints on what types of instructions may be executed simultaneously and on what happens when dependences arise.
There are two major ways to implement a multiple-issue processor: statically (that is, at compile time) or dynamically (that is, during execution); these approaches are sometimes called static multiple issue and dynamic multiple issue.
There are two primary and distinct responsibilities that must be dealt with in a multiple-issue pipeline:
 Packaging instructions into issue slots: how does the processor determine how many instructions, and which instructions, can be issued in a given clock cycle? In most static issue processors, this process is at least partially handled by the compiler; in dynamic issue designs, it is normally dealt with at runtime by the processor, although the compiler will often have already tried to help improve the issue rate by placing instructions in a beneficial order.
 Dealing with data and control hazards: in static issue processors, the compiler handles some or all of the consequences of data and control hazards statically. In contrast, most dynamic issue processors attempt to alleviate at least some classes of hazards using hardware techniques operating at execution time.
Concept of Speculation:
One of the most important methods for finding and exploiting more ILP is speculation. Based on the great idea of prediction, speculation is an approach that allows the compiler or processor to "guess" about the properties of an instruction, so as to enable execution to begin for other instructions that may depend on the speculated instruction. For example, we might speculate on the outcome of a branch, so that instructions after the branch could be executed earlier. Speculation may be done in the compiler or by the hardware. For example, the compiler can use speculation to reorder instructions, moving an instruction across a branch or a load across a store. The processor hardware can perform the same transformation at runtime using techniques discussed later in this section. The recovery mechanisms used for incorrect speculation are different. Speculation introduces one other possible problem: speculating on certain instructions may introduce exceptions that were formerly not present.
Static Multiple Issue:
Static multiple-issue processors all use the compiler to assist with packaging instructions and handling hazards. In a static issue processor, you can think of the set of instructions issued in a given clock cycle, which is called an issue packet, as one large instruction with multiple operations.
Very Long Instruction Word (VLIW):
A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.

[Figure: static two-issue pipeline in operation]


To issue an ALU operation and a data transfer operation in parallel, the first need for additional hardware, beyond the usual hazard detection and stall logic, is extra ports in the register file.
An important compiler technique to get more performance from loops is loop unrolling, where multiple copies of the loop body are made. After unrolling, there is more ILP available by overlapping instructions from different iterations.
Register renaming is the renaming of registers by the compiler or hardware to remove antidependences. Antidependence, also called name dependence, is an ordering forced by the reuse of a name, typically a register, rather than by a true dependence that carries a value between two instructions.

[Figure: a static two-issue datapath]


Dynamic Multiple-Issue Processors:
Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars. Superscalar execution is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.
Dynamic pipeline scheduling chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls. Let's start with a simple example of avoiding a data hazard. Consider the following code sequence:
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
Even though the sub instruction is ready to execute, it must wait for the lw and addu to complete first, which might take many clock cycles if memory is slow. Dynamic pipeline scheduling lets the sub proceed out of order while the lw waits.

[Figure: the three primary units of a dynamically scheduled pipeline: an instruction fetch and issue unit, multiple functional units with reservation stations, and a commit unit]

The commit unit is the unit in a dynamic or out-of-order execution pipeline that decides when it is safe to release the result of an operation to programmer-visible registers and memory.
A reservation station is a buffer within a functional unit that holds the operands and the operation.
The reorder buffer is the buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.
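As a concrete illustration of the commit unit and reorder buffer just defined, the following C sketch shows one ROB entry and an in-order commit step. The fields and the ROB size are illustrative assumptions, not the structures of any particular processor.

#include <stdbool.h>

#define ROB_SIZE 32

/* One reorder-buffer entry: a result waits here until it is safe to commit. */
typedef struct {
    bool busy;      /* entry is allocated to an in-flight instruction      */
    bool ready;     /* result has been produced but not yet committed      */
    int  dest_reg;  /* architectural register to update at commit          */
    long value;     /* the result itself                                   */
} ROBEntry;

static ROBEntry rob[ROB_SIZE];
static int  rob_head = 0;     /* oldest instruction in program order       */
static long arch_regs[32];    /* programmer-visible register file          */

/* The commit unit retires instructions strictly in program order: it only
   releases a result to the architectural registers when the oldest entry
   is ready, which is what keeps exceptions precise despite out-of-order
   completion.                                                             */
static void commit_step(void) {
    ROBEntry *oldest = &rob[rob_head];
    if (oldest->busy && oldest->ready) {
        arch_regs[oldest->dest_reg] = oldest->value;
        oldest->busy = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}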

2. Explain in detail, Flynn's classification. (8 marks)


Flynn's classification distinguishes multiprocessor computer architectures along two independent dimensions: the instruction stream and the data stream. An instruction stream is the sequence of instructions executed by the machine, and a data stream is a sequence of data, including input and partial or temporary results, used by the instruction stream. Each of these dimensions can have only one of two possible states: single or multiple. Flynn's classification depends on the distinction between the performance of the control unit and that of the data processing unit rather than on operational and structural interconnections.
The four categories of Flynn's classification are:

Single instruction stream, single data stream (SISD):

[Figures: SISD processor organization; execution of instructions in SISD processors]


 They are also called scalar processors, i.e., one instruction at a time, and each instruction has only one set of operands.
 Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
 Single data: only one data stream is being used as input during any one clock cycle.
 Deterministic execution.
 Instructions are executed sequentially.
 This is the oldest and, until recently, the most prevalent form of computer.
 Examples: most PCs, single-CPU workstations and mainframes.
Single instruction stream, multiple data stream (SIMD) processors:
• A type of parallel computer.

[Figure: SIMD processor organization]



• Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle; there are multiple processors executing the instruction given by one control unit.
 Multiple data: each processing unit can operate on a different data element; the processors are connected to shared memory or to an interconnection network that provides multiple data to the processing units.
 This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
• Thus a single instruction is executed by different processing units on different sets of data.
• Best suited for specialized problems characterized by a high degree of regularity, such as image processing and vector computation.
• Synchronous (lockstep) and deterministic execution.
• Two varieties: processor arrays, e.g., Connection Machine CM-2, MasPar MP-1, MP-2; and vector pipeline processors, e.g., IBM 9000, Cray C90, Hitachi S820.

Multiple instruction stream, single data stream (MISD):


• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instruction streams: a single data stream is forwarded to different processing units, each of which is connected to its own control unit and executes the instructions that control unit gives it.

 Thus in these computers the same data flows through a linear array of processors executing different instruction streams.
• This architecture is also known as a systolic array, used for pipelined execution of specific instructions.
• Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be:
1. multiple frequency filters operating on a single signal stream
2. multiple cryptography algorithms attempting to crack a single coded message.

[Figure: execution of instructions in MISD processors]


Multiple instruction stream, multiple data stream (MIMD):
• Multiple instruction: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream; the multiple data streams are provided by shared memory.
• Can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control.
• Execution can be synchronous or asynchronous, deterministic or non-deterministic.

[Figure: MIMD processor organization]

• There are different processors, each processing a different task.
• Examples: most current supercomputers, networked parallel computer "grids", and multiprocessor SMP computers, including some types of PCs.

[Figure: execution of instructions in MIMD processors]

3. Explain the challenges in parallel processing. (8 marks)


The tall challenge facing industry is to create hardware and software that will make it
easy to write correct parallel processing programs that will execute efficiently in performance
and energy as number of cores per chip scales.
Only challenge of parallel revolution is figuring out how to make naturally sequential
software have high performance on parallel hardware, but it is also to make concurrent programs
have high performance on multiprocessors as number of processors increases.
The difficulty with parallelism is not hardware; it is that too few important application
programs have been rewritten to complete tasks sooner on multiprocessors. It is difficult to write
software that uses multiple processors to complete one task faster, and problem gets worse as
number of processors increases.
The first reason is that you must get better performance or better energy efficiency from a
parallel processing program on a multiprocessor; why is it difficult to write parallel processing
programs that are fast, especially as number of processors increases.

For both the analogy and parallel programming, the challenges include scheduling, partitioning the work into parallel pieces, balancing the load evenly between the workers, time to synchronize, and overhead for communication between the parties. The challenge is stiffer with more reporters for a newspaper story and with more processors for parallel programming.

Another obstacle is Amdahl's Law. It reminds us that even small parts of a program must be parallelized if the program is to make good use of many cores.
Speed-up challenge: suppose you want to achieve a speed-up of 90 times with 100 processors. What percentage of the original computation can be sequential?
Amdahl's Law in terms of speed-up versus original execution time:

Speed-up = Execution time before / ((Execution time before − Execution time affected) + Execution time affected / 100)

Thus, to achieve a speed-up of 90 from 100 processors, the sequential percentage can only be 0.1%.
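Working the numbers through the equivalent fraction form of Amdahl's Law (writing F for the fraction of the original execution time that can be parallelized):

\text{Speed-up} = \frac{1}{(1 - F) + F/100} = 90
\;\Rightarrow\; (1 - F) + \frac{F}{100} = \frac{1}{90}
\;\Rightarrow\; 1 - 0.99\,F \approx 0.0111
\;\Rightarrow\; F \approx 0.9989,

so the sequential fraction 1 − F is about 0.0011, i.e., roughly 0.1% of the original computation.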
These examples show that getting good speed-up on a multiprocessor while keeping the problem size fixed is harder than getting good speed-up by increasing the size of the problem. This insight allows us to introduce two terms that describe ways to scale up.
Strong scaling means measuring speed-up while keeping the problem size fixed. Weak scaling means that the problem size grows proportionally to the increase in the number of processors.
Speed-up challenge: balancing load
This example demonstrates the importance of balancing load: just a single processor with twice the load of the others cuts the speed-up by a third, and five times the load on just one processor reduces the speed-up by almost a factor of three.

4. Explain in detail, the shared memory multiprocessor, with a neat diagram. (16 marks)
A shared memory multiprocessor (SMP) is one that offers the programmer a single physical address space across all processors, which is nearly always the case for multicore chips, although a more accurate term would have been shared-address multiprocessor. Processors communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores. Note that such systems can still run independent jobs in their own virtual address spaces, even if they all share a physical address space. Single address space multiprocessors come in two styles. In the first style, the latency to a word in memory does not depend on which processor asks for it.
Such machines are called uniform memory access (UMA) multiprocessors. In the second style, some memory accesses are much faster than others, depending on which processor asks for which word, typically because main memory is divided and attached to different microprocessors or to different memory controllers on the same chip. Such machines are called nonuniform memory access (NUMA) multiprocessors. As you might expect, the programming challenges are harder for a NUMA multiprocessor than for a UMA multiprocessor, but NUMA machines can scale to larger sizes, and NUMAs can have lower latency to nearby memory.

As processors operating in parallel will normally share data, you also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it. This coordination is called synchronization. When sharing is supported with a single address space, there must be a separate mechanism for synchronization. One approach uses a lock for a shared variable. Only one processor at a time can acquire the lock, and other processors interested in the shared data must wait until the original processor unlocks the variable.

[Figure: classic organization of a shared memory multiprocessor]


OpenMP is an API for shared memory multiprocessing in C, C++, or Fortran that runs on UNIX and Microsoft platforms. It includes compiler directives, a library, and runtime directives.
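As a small illustration of the directive style (a sketch of general API usage, not an example taken from the text), the 64,000-element sum discussed next could be written with a single OpenMP reduction clause; compare this with the explicit decomposition sketched further below.

#include <omp.h>
#include <stdio.h>

#define N 64000

int main(void) {
    static double A[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        A[i] = 1.0;                       /* fill with sample data */

    /* The directive splits the iterations across the available threads and
       combines the per-thread partial sums into 'sum' when the loop ends.  */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %f\n", sum);
    return 0;
}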
A Simple Parallel Processing Program for a Shared Address Space: suppose we want to sum 64,000 numbers on a shared memory multiprocessor computer with uniform memory access time. Let's assume we have 64 processors.
The first step is to ensure a balanced load per processor, so we split the set of numbers into subsets of the same size. We do not allocate the subsets to different memory spaces, since there is a single memory space for this machine; we just give a different starting address to each processor. Pn is the number that identifies the processor, between 0 and 63. All processors start the program by running a loop that sums their subset of numbers.
The next step is to add these 64 partial sums. This step is called a reduction, where we divide to conquer: half of the processors add pairs of partial sums, then a quarter add pairs of the new partial sums, and so on until we have the single, final sum.
Each processor must have its own version of the loop counter variable i, so we must indicate that it is a private variable. The code is sketched below.
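A minimal per-processor C sketch of the two steps, assuming 64 processors with 1000 numbers each, shared arrays A[64000] and sum[64], a per-processor id Pn, and a hypothetical barrier primitive synch():

extern void synch(void);   /* hypothetical barrier across all 64 processors */

void parallel_sum(int Pn, const double A[64000], double sum[64]) {
    int i, half;

    /* Step 1: each processor sums its own 1000-element subset. */
    sum[Pn] = 0;
    for (i = 1000 * Pn; i < 1000 * (Pn + 1); i += 1)
        sum[Pn] += A[i];

    /* Step 2: reduction. Half the processors add pairs of partial sums,
       then a quarter, and so on, until sum[0] holds the final total.    */
    half = 64;
    do {
        synch();                         /* wait for the previous level to finish */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];     /* odd count: processor 0 adds the extra */
        half = half / 2;                 /* dividing line on who sums this round  */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
}

Declaring i inside the function makes it private to each processor, which is exactly the role of the private loop counter mentioned above.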

Some writers repurposed the acronym SMP to mean symmetric multiprocessor, to indicate that the latency from processor to memory is about the same for all processors.

5. Explain in detail, hardware multithreading. (16 marks)


Hardware multithreading: a related concept to MIMD, especially from the programmer's perspective, is hardware multithreading. While MIMD relies on multiple processes or threads to try to keep multiple processors busy, hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion to try to utilize the hardware resources efficiently. To permit this sharing, the processor must duplicate the independent state of each thread. For example, each thread would have a separate copy of the register file and the program counter.
The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming. In addition, the hardware must support the ability to change to a different thread relatively quickly. In particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles, while a thread switch can be nearly instantaneous.
There are two main approaches to hardware multithreading:
Fine-grained multithreading switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle. To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle. One advantage of fine-grained multithreading is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage of fine-grained multithreading is that it slows down the execution of individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly stalls, such as last-level cache misses. This change relieves the need to have thread switching be extremely fast and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall. Coarse-grained multithreading suffers, however, from a major drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading. Because a processor with coarse-grained multithreading issues instructions from a single thread, when a stall occurs, the pipeline must be emptied or frozen. The new thread that begins executing after the stall must fill the pipeline before instructions will be able to complete. Due to this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible compared to the stall time.

[Figure: how four threads use the issue slots of a superscalar processor under the different multithreading approaches]
Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism. The key insight that motivates SMT is that multiple-issue processors often have more functional unit parallelism available than most single threads can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of dependences can be handled by the dynamic scheduling capability. Since SMT relies on the existing dynamic mechanisms, it does not switch resources every cycle. Instead, SMT is always executing instructions from multiple threads, leaving it up to the hardware to associate instruction slots and renamed registers with their proper threads.
The four threads at the top of the figure show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together under the three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to the four different threads in the multithreading processors.

6. Explain in detail, the multicore processors. (16 marks)


While hardware multithreading improved the efficiency of processors at modest cost, the big challenge of the last decade has been to deliver on the performance potential of Moore's Law by efficiently programming the increasing number of processors per chip. Given the difficulty of rewriting old programs to run well on parallel hardware, a natural question is: what can computer designers do to simplify the task? One answer was to provide a single physical address space that all processors can share, so that programs need not concern themselves with where their data is, merely that programs may be executed in parallel. In this approach, all the variables of a program can be made available at any time to any processor. The alternative is to have a separate address space per processor, which requires that sharing must be explicit.
Introduction to Graphics Processing Units (GPU):
The original justification for adding SIMD instructions to existing architectures was that many microprocessors were connected to graphics displays in PCs and workstations, so an increasing fraction of processing time was used for graphics. As Moore's Law increased the number of transistors available to microprocessors, it therefore made sense to improve graphics processing.
A major driving force for improving graphics processing was the computer game industry, both on PCs and in dedicated game consoles such as the Sony PlayStation. The rapidly growing game market encouraged many companies to make increasing investments in developing faster graphics hardware, and this positive feedback loop led graphics processing to improve at a faster rate than general-purpose processing in mainstream microprocessors. Given that the graphics and game community had different goals than the microprocessor development community, it evolved its own style of processing and terminology. As graphics processors increased in power, they earned the name Graphics Processing Units, or GPUs, to distinguish themselves from CPUs. For a few hundred dollars, anyone can buy a GPU today with hundreds of parallel floating-point units, which makes high-performance computing more accessible. The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program. Hence, many programmers of scientific and multimedia applications today are pondering whether to use GPUs or CPUs.
Here are some of the key characteristics of how GPUs vary from CPUs:
■ GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It is fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed.
■ The GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of gigabytes to terabytes. These differences led to different styles of architecture:
■ Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as CPUs do. Instead, GPUs rely on hardware multithreading (Section 6.4) to hide the latency to memory. That is, between the time of a memory request and the time that the data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.
The GPU memory is thus oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs. In addition, GPU memories have traditionally had smaller main memories than conventional microprocessors. In 2013, GPUs typically had 4 to 6 GiB or less, while CPUs had 32 to 256 GiB. Finally, keep in mind that for general-purpose computation, you must include the time to transfer data between CPU memory and GPU memory, since the GPU is a coprocessor.
■ Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel processors (MIMD) as well as many threads. Hence, each GPU processor is more highly multithreaded than a typical CPU, plus they have more processors.

Similarities and differences between multicore with multimedia SIMD extensions and recent GPUs:
At a high level, multicore computers with SIMD instruction extensions do share similarities with GPUs. Both are MIMDs whose processors use multiple SIMD lanes, although GPUs have more processors and many more lanes. Both use hardware multithreading to improve processor utilization, although GPUs have hardware support for many more threads. Both use caches, although GPUs use smaller streaming caches while multicore computers use large multilevel caches that try to contain whole working sets completely. Both use a 64-bit address space, although the physical main memory is much smaller in GPUs. While GPUs support memory protection at the page level, they do not yet support demand paging.
SIMD processors are also similar to vector processors. The multiple SIMD processors in GPUs act as independent MIMD cores, just as many vector computers have multiple vector processors.
