COA Module5 Notes

The document discusses the challenges of parallel processing, emphasizing the difficulty of writing efficient software for multiple processors. It outlines Flynn's classification of processor organization, including SISD, SIMD, MISD, and MIMD, and explains the advantages and disadvantages of SIMD architecture. Additionally, it covers pipelining techniques in both arithmetic and instruction processing, detailing the benefits and potential performance issues associated with pipeline execution.

PARALLEL PROCESSING CHALLENGES:

• It is difficult to write software that uses multiple processors to complete one task faster.
• Parallel processing increases the performance of the processor and reduces the time needed to
execute a task.
• Obtaining this benefit from parallel processing is, however, not an easy task.
• The difficulty lies not on the hardware side but on the software side.
We can understand that it is difficult to write parallel processing programs that are fast, especially as the
number of processors increases.
PROCESSOR ORGANIZATION [FLYNN’S CLASSIFICATION]
SISD
• Single Instruction stream, Single Data stream.
• An example of SISD is the uniprocessor.
• It has a single control unit producing a single stream of instructions.
• It has one processing unit; the processing unit may have more than one functional unit, all under the
supervision of the one control unit.
• It has one memory unit.

SIMD
• Single Instruction stream, Multiple Data streams.
• It has a single control unit producing a single stream of instructions and multiple streams of data.
• It has more than one processing unit, and each processing unit has its own associated data memory unit.
• In this organization, multiple processing elements work under the control of a single control unit.
• A single machine instruction controls the simultaneous execution of a number of processing elements.
• Each instruction is executed on a different set of data by a different processor.
• The same instruction is applied to many data streams, as in a vector processor.
• All the processing elements of this organization receive the same instruction broadcast from the CU.
• Main memory can also be divided into modules for generating multiple data streams, acting as a distributed
memory as shown in figure.
• Therefore, all the processing elements simultaneously execute the same instruction and are said to be
'lock-stepped' together.
• Each processor takes data from its own memory and hence operates on a distinct data stream.
• Every processor must be allowed to complete its instruction before the next instruction is taken for
execution. Thus, the execution of instructions is synchronous.
• Examples of SIMD are the Vector Processor and the Array Processor.
Advantage of SIMD:
• The original motivation behind SIMD was to amortize the cost of the control unit over dozens of execution
units.
• Another advantage is the reduced instruction bandwidth and space.
• SIMD needs only one copy of the code that is being simultaneously executed while message-passing
MIMDs may need a copy in every processor, and shared memory MIMD will need multiple instruction
caches.
• SIMD works best when dealing with arrays in for loops, because the parallelism is achieved by performing the
same operation on independent data (a sketch contrasting the two cases follows this list).
• SIMD is at its weakest in case or switch statements, where each execution unit must perform a different
operation on its data, depending on what data it has. Execution units with the wrong data must be disabled
so that units with proper data may continue.
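As an illustration (not from the original notes), the hedged C sketch below contrasts a data-parallel loop that maps well onto lock-stepped SIMD execution with a data-dependent switch that forces execution units to be disabled; the function names and loop bodies are assumptions chosen for the example.

#include <stddef.h>

/* Data-parallel loop: every iteration applies the same operation to
 * independent elements, so a SIMD machine can execute many iterations
 * in lock step. */
void scale_array(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * k;            /* same operation, independent data */
}

/* Divergent loop: the operation depends on the data itself, so on a SIMD
 * machine the execution units holding the "wrong" case must be disabled
 * while the others proceed, wasting lanes. */
void classify_array(int *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        switch (a[i] % 3) {
        case 0:  a[i] += 1; break;  /* only some lanes active per case */
        case 1:  a[i] -= 1; break;
        default: a[i]  = 0; break;
        }
    }
}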

MISD
• Multiple Instruction and Single Data stream (MISD)
• In this organization, multiple processing elements are organized under the control of multiple control
units.
• Each control unit is handling one instruction stream and processed through its corresponding processing
element.
• But each processing element is processing only a single data stream at a time.
• Therefore, for handling multiple instruction streams and single data stream, multiple control units and
multiple processing elements are organized in this classification.
• All processing elements interact with a common shared memory for the organization of a single
data stream, as shown in figure.
• The only known example of a computer capable of MISD operation is the
C.mmp built by Carnegie-Mellon University.

MIMD
• Multiple Instruction streams and Multiple Data streams (MIMD). In this organization, multiple processing
elements and multiple control units are organized.
• Compared to MISD the difference is that now in this organization multiple instruction streams operate on
multiple data streams.
• Therefore, for handling multiple instruction streams, multiple control units and multiple processing
elements are organized such that multiple processing elements are handling multiple data streams from the
main memory as shown in figure.
• The processors work on their own data with their own instructions. Tasks executed by different processors
can start or finish at different times.
• They are not lock-stepped, as in SIMD computers, but run asynchronously.
• Of Flynn's classifications, this one truly describes a parallel computer; in the real sense, the MIMD
organization is said to be a Parallel computer.

SIMD-VECTOR ARCHITECTURE [SPMD] (Single Program Multiple data)

• SIMD of this form is also referred to as vector architecture.


• It is also a great match to problems with lots of data-level parallelism, i.e., parallelism achieved by
performing the same operation on independent data.
• Rather than having 64 ALUs perform 64 additions simultaneously, like the old array processors,
vector architectures pipeline the ALU to get good performance at lower cost.
• The basic idea of vector architecture is to collect data elements from memory, put them in order into a
large set of registers, operate on them sequentially in registers using pipelined execution units, and then
write the results back to memory.
• A key feature of vector architectures is then a set of vector registers. Thus, a vector architecture might
have 32 vector registers, each with 64 elements of 64 bits (a brief sketch of one vector operation follows).
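To make the idea concrete, the hedged C sketch below models a single vector add operating on vector registers of 64 elements; the register length, element type, and names (VLEN, vreg, vadd) are assumptions for illustration, not a description of any particular machine.

#define VLEN 64                             /* assumed vector register length */

typedef struct { double e[VLEN]; } vreg;    /* one vector register */

/* One conceptual vector instruction: V3 <- V1 + V2.
 * A real vector unit streams the elements through a pipelined adder
 * rather than looping in software, but the effect is the same. */
void vadd(vreg *v3, const vreg *v1, const vreg *v2) {
    for (int i = 0; i < VLEN; i++)
        v3->e[i] = v1->e[i] + v2->e[i];
}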

Pipelining
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess
being executed in a special dedicated segment that operates concurrently with all other segments. The
name “pipeline” implies a flow of information analogous to an industrial assembly line. It is characteristic
of pipelines that several computations can be in progress in distinct segments at the same time.
Perhaps the simplest way of viewing the pipeline structure is to imagine that each segment consists of
an input register followed by a combinational circuit.
o The register holds the data.
o The combinational circuit performs the suboperation in the particular segment.
A clock is applied to all registers after enough time has elapsed to perform all segment activity.
Example
The pipeline organization will be demonstrated by means of a simple example.
o To perform the combined multiply and add operations with a stream of numbers

Ai * Bi + Ci    for i = 1, 2, 3, …, 7

Each suboperation is to be implemented in a segment within a pipeline.


o R1 <- Ai, R2 <- Bi          Input Ai and Bi
o R3 <- R1 * R2, R4 <- Ci     Multiply and input Ci
o R5 <- R3 + R4               Add Ci to product
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table
9-1. The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the
product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 and R2.
The third clock pulse operates on all three segments simultaneously. It places A3 and B3 into R1 and R2,
transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into
R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each
clock produces a new output and moves the data one step down the pipeline. This happens as long as new
input data flow into the system. When no more input data are available, the clock must continue until the
last output emerges out of the pipeline.
Each segment has one or two registers and a combinational circuit as shown in the figure.

The five registers are loaded with new data every clock pulse. The effect of each clock is shown in the
table below.
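As a hedged illustration of this clock-by-clock behaviour (not the original table; the input values, variable names, and print format are assumptions), the following C sketch simulates the three-segment pipeline with registers R1–R5 computing Ai*Bi + Ci for i = 1 to 7.

#include <stdio.h>

#define N 7

int main(void) {
    double A[N + 1], B[N + 1], C[N + 1];
    for (int i = 1; i <= N; i++) { A[i] = i; B[i] = 10 * i; C[i] = 100 * i; }

    /* Segment registers: R1,R2 feed the multiplier, R3,R4 feed the adder,
     * R5 holds the final result Ai*Bi + Ci. */
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* Three clocks fill the pipe; after that every clock delivers one
     * result, so N results take N + 2 clocks in total. */
    for (int clk = 1; clk <= N + 2; clk++) {
        /* Segment 3 adds the values latched by the previous clock. */
        R5 = R3 + R4;
        /* Segment 2 multiplies R1,R2 from the previous clock and latches Ci. */
        R3 = R1 * R2;
        R4 = (clk - 1 >= 1 && clk - 1 <= N) ? C[clk - 1] : 0;
        /* Segment 1 loads the next pair of inputs, if any remain. */
        R1 = (clk <= N) ? A[clk] : 0;
        R2 = (clk <= N) ? B[clk] : 0;

        if (clk >= 3)   /* first valid output appears at clock 3 */
            printf("clock %d: R5 = %.0f (A%d*B%d + C%d)\n",
                   clk, R5, clk - 2, clk - 2, clk - 2);
    }
    return 0;
}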
General considerations
Any operation that can be decomposed into a sequence of sub operations of about the same
complexity can be implemented by a pipeline processor.
The general structure of a four-segment pipeline is Fig.
We define a task as the total operation performed going through all the segments in the pipeline.
The behavior of a pipeline can be illustrated with a space-time diagram.
o It shows the segment utilization as a function of time.

The space-time diagram of a four-segment pipeline is demonstrated in Fig. 9-4.


Consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks.
o The first task T1 requires a time equal to ktp to complete its operation.
o The remaining n-1 tasks will be completed after a time equal to (n-1)tp
o Therefore, to complete n tasks using a k-segment pipeline requires k+(n-1) clock cycles.
Consider a nonpipeline unit that performs the same operation and takes a time equal to tn to
complete each task.
o The total time required for n tasks is ntn.

The speedup of a pipeline processing over an equivalent nonpipeline processing is defined by the
ratio S = n*tn / ((k + n - 1)*tp).
If n becomes much larger than k-1, the speedup becomes S = tn/tp.
If we assume that the time it takes to process a task is the same in the pipeline and nonpipeline
circuits, i.e., tn = k*tp, the speedup reduces to S = k*tp/tp = k.
This shows that the theoretical maximum speedup that a pipeline can provide is k, where k is the
number of segments in the pipeline.
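As a worked numeric illustration (the numbers are assumptions chosen for this example, not taken from the notes): let k = 4 segments, tp = 20 ns per clock, and n = 100 tasks, with tn = k*tp = 80 ns for the equivalent nonpipelined unit.

Pipeline time     = (k + n - 1)*tp = (4 + 99) * 20 ns = 2060 ns
Nonpipeline time  = n*tn           = 100 * 80 ns      = 8000 ns
Speedup S         = 8000 / 2060    ≈ 3.88

which approaches the theoretical maximum of k = 4 as n grows.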
To duplicate the theoretical speed advantage of a pipeline process by means of multiple functional
units, it is necessary to construct k identical units that will be operating in parallel.

This is illustrated in Fig below, where four identical circuits are connected in parallel.
Instead of operating with the input data in sequence as in a pipeline, the parallel circuits accept
four input data items simultaneously and perform four tasks at the same time.
There are various reasons why the pipeline cannot operate at its maximum theoretical rate.
o Different segments may take different times to complete their sub operation.
o It is not always correct to assume that a nonpipe circuit has the same time delay as that of an equivalent
pipeline circuit.
There are two areas of computer design where the pipeline organization is applicable.
o Arithmetic pipeline
o Instruction pipeline
Arithmetic Pipeline: Introduction
Pipeline arithmetic units are usually found in very high speed computers for
o Floating-point operations, multiplication of fixed-point numbers, and similar computations in scientific
problems.
Floating-point operations are easily decomposed into suboperations as demonstrated below.
An example of a pipeline unit for floating-point addition and subtraction is shown in the following:
o The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers,
X = A x 2^a and Y = B x 2^b, where
A and B are two fractions that represent the mantissas, and a and b are the exponents.
The floating-point addition and subtraction can be performed in four segments, as shown in Fig. below.
The suboperations that are performed in the four segments are:
o Compare the exponents
The larger exponent is chosen as the exponent of the result.
o Align the mantissas
The exponent difference determines how many times the mantissa associated with the smaller exponent
must be shifted to the right.
o Add or subtract the mantissas
o Normalize the result
When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent
incremented by one.
If an underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts
in the mantissa and the number that must be subtracted from the exponent.
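Purely as a hedged illustration of the four suboperations (the decimal toy format, variable names, and helper routine are assumptions, not the figure from the notes), the C sketch below walks one addition through compare, align, add, and normalize.

#include <stdio.h>
#include <math.h>

/* Toy decimal floating-point value: mantissa * 10^exponent, with the
 * mantissa kept in the range [0.1, 1.0) when normalized. */
typedef struct { double m; int e; } fpnum;

/* Four-segment floating-point addition, one comment per segment. */
fpnum fp_add(fpnum x, fpnum y) {
    /* Segment 1: compare exponents; the larger becomes the result exponent. */
    int e = (x.e > y.e) ? x.e : y.e;

    /* Segment 2: align the mantissa of the smaller exponent by shifting right. */
    double mx = x.m / pow(10, e - x.e);
    double my = y.m / pow(10, e - y.e);

    /* Segment 3: add (or subtract) the aligned mantissas. */
    double m = mx + my;

    /* Segment 4: normalize the result.
     * Overflow: shift the mantissa right and increment the exponent.
     * Underflow: shift left and decrement the exponent per leading zero. */
    while (fabs(m) >= 1.0)            { m /= 10; e++; }
    while (m != 0 && fabs(m) < 0.1)   { m *= 10; e--; }

    return (fpnum){ m, e };
}

int main(void) {
    fpnum x = { 0.9504, 3 };   /* 0.9504 * 10^3 */
    fpnum y = { 0.8200, 2 };   /* 0.8200 * 10^2 */
    fpnum z = fp_add(x, y);
    printf("%.5f * 10^%d\n", z.m, z.e);   /* 0.10324 * 10^4 */
    return 0;
}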

Instruction Pipeline
Introduction:
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
Consider a computer with an instruction fetch unit (implemented as a FIFO buffer) and an instruction
execution unit, designed to provide a two-segment pipeline.
Computers with complex instructions require other phases in addition to fetch and execute to process
an instruction completely.
In the most general case, the computer needs to process each instruction with the following
sequence of steps.
o Fetch the instruction from memory.
o Decode the instruction.
o Calculate the effective address.
o Fetch the operands from memory.
o Execute the instruction.
o Store the result in the proper place.
• There are certain difficulties that will prevent the instruction pipeline from operating at its
maximum rate.
o Different segments may take different times to operate on the incoming information.
o Some segments are skipped for certain operations.
o Two or more segments may require memory access at the same time, causing one segment to
wait until another is finished with the memory.

Example: four-segment instruction pipeline:


Assume that:
o The decoding of the instruction can be combined with the calculation of the effective address into
one segment.
o The instruction execution and storing of the result can be combined into one segment.
Fig below shows how the instruction cycle in the CPU can be processed with a four-segment
pipeline.
o Thus up to four suboperations in the instruction cycle can overlap and up to four different
instructions can be in progress of being processed at the same time.
An instruction in the sequence may cause a branch out of the normal sequence.
o In that case the pending operations in the last two segments are completed and all information
stored in the instruction buffer is deleted.
o Similarly, an interrupt request will cause the pipeline to empty and start again from a new
address value.

The time in the horizontal axis is divided into steps of equal duration. The four segments are
represented in the diagram with an abbreviated symbol.
1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.

It is assumed that the processor has separate instruction and data memories so that the operation in FI and
FO can proceed at the same time. In the absence of a branch instruction, each segment operates on
different instructions. Thus, in step 4, instruction 1 is being executed in segment EX; the operand for
instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment DA; and
instruction 4 is being fetched from memory in segment FI.
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in segment
DA in step 4, the transfer from FI to DA of the other instructions is halted until the branch instruction is
executed in step 6. If the branch is taken, a new instruction is fetched in step 7. If the branch is not taken,
the instruction fetched previously in step 4 can be used. The pipeline then continues until a new branch
instruction is encountered.
Another delay may occur in the pipeline if the EX segment needs to store the result of the operation in the
data memory while the FO segment needs to fetch an operand. In that case, segment FO must wait until
segment EX has finished its operation.
Data dependency:
o A difficulty that may cause a degradation of performance in an instruction pipeline is the possible
collision of data or addresses.
A data dependency occurs when an instruction needs data that are not yet available.
An address dependency may occur when an operand address cannot be calculated because the
information needed by the addressing mode is not available.
o Pipelined computers deal with such conflicts between data dependencies in a variety of ways.
o Hardware interlocks: an interlock is a circuit that detects instructions whose source operands are
destinations of instructions farther up in the pipeline.
This approach maintains the program sequence by using hardware to insert the required delays.
o Operand forwarding: uses special hardware to detect a conflict and then avoid it by routing the data
through special paths between pipeline segments.
This method requires additional hardware paths through multiplexers as well as the circuit that detects
the conflict.
o Delayed load: the compiler for such computers is designed to detect a data conflict and reorder the
instructions as necessary to delay the loading of the conflicting data by inserting no-operation
instructions.
Handling of branch instructions
One of the major problems in operating an instruction pipeline is the occurrence of branch instructions.
An unconditional branch always alters the sequential program flow by loading the program counter
with the target address.
In a conditional branch, the control selects the target instruction if the condition is satisfied or the next
sequential instruction if the condition is not satisfied.
Pipelined computers employ various hardware techniques to minimize the performance degradation
caused by instruction branching.

Prefetch target instruction: Prefetch the target instruction in addition to the instruction following
the branch. Both are saved until the branch is executed.

Branch target buffer (BTB): The BTB is an associative memory included in the fetch segment of the
pipeline.
Each entry in the BTB consists of the address of a previously executed branch instruction and the target
instruction for that branch.
It also stores the next few instructions after the branch target instruction.
Loop buffer: This is a small very high speed register file maintained by the instruction fetch segment of
the pipeline.
Branch prediction: A pipeline with branch prediction uses some additional logic to guess the outcome of
a conditional branch instruction before it is executed.
Delayed branch: In this procedure, the compiler detects the branch instructions and rearranges the
machine language code sequence by inserting useful instructions that keep the pipeline operating without
interruptions.
This procedure is employed in most RISC processors.
e.g., a no-operation instruction is inserted when no useful instruction can be found.
RISC Pipeline
Among the characteristics attributed to RISC is its ability to use an efficient
instruction pipeline. The simplicity of the instruction set can be utilized to implement an instruction pipeline
using a small number of sub operations, with each being executed in one clock cycle. Because of the fixed-
length instruction format, the decoding of the operation can occur at the same time as the register selection.
All data manipulation instructions have register-to-register operations. Since all operands are in registers,
there is no need for calculating an effective address or fetching of operands from memory. Therefore, the
instruction pipeline can be implemented with two or three segments. One segment fetches the instruction
from program memory, and the other segment executes the instruction in the ALU. A third segment may
be used to store the result of the ALU operation in a destination register.
The data transfer instructions in RISC are limited to load and store instructions.
o These instructions use register indirect addressing. They usually need three or four stages in the pipeline.
o To prevent conflicts between a memory access to fetch an instruction and to load or store an operand,
most RISC machines use two separate buses with two memories.

o Cache memory: operates at the same speed as the CPU clock.


One of the major advantages of RISC is its ability to execute instructions at the rate of one per clock
cycle.
o In effect, the goal is to start a new instruction every clock cycle and to pipeline the processor to achieve
single-cycle instruction execution.
o RISC can achieve this because each pipeline segment requires just one clock cycle.
This is supported by a compiler that translates the high-level language program into a machine language program.
o Instead of designing hardware to handle the difficulties associated with data conflicts and branch
penalties,

o RISC processors rely on the efficiency of the compiler to detect and minimize the delays encountered
with these problems.

Example: Three-Segment Instruction Pipeline


A typical set of instructions for a RISC processor was discussed earlier.
There are three types of instructions:
o The data manipulation instructions: operate on data in processor registers
o The data transfer instructions:
o The program control instructions:

Now consider the hardware operation for such a computer.


The control section fetches the instruction from program memory into an instruction register.
o The instruction is decoded at the same time that the registers needed for the execution of the instruction
are selected.

The processor unit consists of a number of registers and an arithmetic logic unit (ALU).
A data memory is used to load or store the data from a selected register in the register file.
The instruction cycle can be divided into three sub operations and implemented in three segments:
o I: Instruction fetch
Fetches the instruction from program memory
o A: ALU operation
The instruction is decoded and an ALU operation is performed.
It performs an operation for a data manipulation instruction.
It evaluates the effective address for a load or store instruction.
It calculates the branch address for a program control instruction.
o E: Execute instruction
Directs the output of the ALU to one of three destinations, depending on the decoded instruction.
It transfers the result of the ALU operation into a destination register in the register file.
It transfers the effective address to a data memory for loading or storing.
It transfers the branch address to the program counter.

Delayed Load
Consider the operation of the following four instructions:

1. LOAD: R1 <- M[address 1]


2. LOAD: R2 <- M[address 2]
3. ADD: R3 <- R1 +R2
4. STORE: M[address 3] <- R3

There will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment.
This can be seen from the timing of the pipeline shown in Fig. 9-9(a).

The E segment in clock cycle 4 is in the process of placing the memory data into R2. The A segment in clock
cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been
transferred from memory. It is up to the compiler to make sure that the instruction following the load
instruction does not use the data being fetched from memory. If the compiler cannot find a useful instruction to put after
the load, it inserts a no-op (no-operation) instruction. This is a type of instruction that is fetched from
memory but has no operation, thus wasting a clock cycle. This concept of delaying the use of the data
loaded from memory is referred to as delayed load.
Figure 9-9(b) shows the same program with a no-op instruction inserted after the load to R2 instruction.
The data is loaded into R2 in clock cycle 4. The add instruction uses the value of R2 in step 5. Thus the no-
op instruction is used to advance one clock cycle in order to compensate for the data conflict in the pipeline.
(Note that no operation is performed in segment A during clock cycle 4 or segment E during clock cycle
5.) The advantage of the delayed load approach is that the data dependency is taken care of by the compiler
rather than the hardware. This results in a simpler hardware segment since the segment does not
have to check whether the content of the register being accessed is currently valid or not.
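For reference, a sketch of the program with the inserted no-op, written in the same register-transfer style as the instruction list above (the placement follows the description of Fig. 9-9(b); the exact layout is assumed):

1. LOAD:  R1 <- M[address 1]
2. LOAD:  R2 <- M[address 2]
3. NOP                            (inserted by the compiler)
4. ADD:   R3 <- R1 + R2           (R2 is now valid when the A segment uses it)
5. STORE: M[address 3] <- R3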

Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine the branches so that
they take effect at the proper time in the pipeline. This method is referred to as delayed branch.
The compiler is designed to analyse the instructions before and after the branch and rearrange the
program sequence by inserting useful instructions in the delay steps.
It is up to the compiler to find useful instructions to put after the branch instruction. Failing that, the
compiler can insert no-op instructions.

An Example of Delayed Branch


The program for this example consists of five instructions.
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X

In Fig. 9-10(a) the compiler inserts two no-op instructions after the branch.
The branch address X is transferred to PC in clock cycle 7.
The program in Fig. 9-10(b) is rearranged by placing the add and subtract instructions after the branch
instruction.
PC is updated to the value of X in clock cycle 5.

In Fig. 9-10(a) the compiler inserts two no-op instructions after the branch. The branch address X is
transferred to PC in clock cycle 7. The fetching of the instruction at X is delayed by two clock cycles by
the no-op instructions. The instruction at X starts the fetch phase at clock cycle 8 after the program counter
PC has been updated.
The program in Fig. 9-10(b) is rearranged by placing the add and subtract instructions after the branch
instruction instead of before as in the original program. Inspection of the pipeline timing shows that PC is
updated to the value of X in clock cycle 5, but the add and subtract instructions are fetched from memory
and executed in the proper sequence. In other words, if the load instruction is at address 101 and X is equal
to 350, the branch instruction is fetched from address 103. The add instruction is fetched from address 104
and executed in clock cycle 6. The subtract instruction is fetched from address 105 and executed in clock
cycle 7. Since the value of X is transferred to PC with clock cycle 5 in the E segment, the instruction fetched
from memory at clock cycle 6 is from address 350, which is the instruction at the branch address.
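As a sketch in the notes' own style (addresses and clock cycles taken from the paragraph above; the layout is assumed), the rearranged program of Fig. 9-10(b) is:

101  LOAD       R1 <- M[address 1]
102  INCREMENT  R2
103  BRANCH     to X (X = 350)
104  ADD        R3 to R4          (delay slot; executed in clock cycle 6)
105  SUBTRACT   R5 from R6        (delay slot; executed in clock cycle 7)
350  ...                          (branch target, fetched in clock cycle 6)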
Data Hazards

A data hazard occurs when the current instruction requires the result of a preceding instruction, but there
are insufficient segments in the pipeline to compute the result and write it back to the register file in time
for the current instruction to read that result from the register file.

We typically remedy this problem in one of three ways:


• Forwarding: In order to resolve a dependency, one adds special circuitry to the pipeline that is
comprised of wires and switches with which one forwards or transmits the desired value to the
pipeline segment that needs that value for computation. Although this adds hardware and control
circuitry, the method works because it takes far less time for the required value(s) to travel
through a wire than it does for a pipeline segment to compute its result.
• Code Re-Ordering: Here, the compiler reorders statements in the source code, or the assembler
reorders object code, to place one or more statements between the current instruction and the
instruction in which the required operand was computed as a result. This requires an "intelligent"
compiler or assembler, which must have detailed information about the structure and timing of the
pipeline on which the data hazard would occur. We call this type of software a hardware-
dependent compiler.
• Stall Insertion: It is possible to insert one or more stalls (no-op instructions) into the pipeline,
which delays the execution of the current instruction until the required operand is written to the
register file. This decreases pipeline efficiency and throughput, which is contrary to the goals of
pipeline processor design. Stalls are an expedient method of last resort that can be used when
compiler action or forwarding fails or might not be supported in hardware or software design.

Vector Processors

There are two essentially different models of parallel computers: vector processors and multiprocessors. A
vector processor is simply a machine that has an instruction that can operate on a vector. A pipelined vector
processor is a vector processor that can issue a vector instruction that operates on all of the elements of the
vector in parallel by sending those elements through a highly pipelined functional unit with a fast clock. A
processor array is a vector processor that achieves the parallelism by having a collection of identical,
synchronized processing elements (PE), each of which executes the same instruction on different data,
which are controlled by a single control unit. Every PE has a unique identifier, its processor id, which can
be used during the computation. The control unit, which might be a full-fledged CPU, broadcasts the
instruction to be executed to the processing elements, which execute it on data from a memory that is
usually local to each, and can store the result in their local memories, or can return global results back to
the CPU. A global result line is usually a separate, parallel bus that allows each PE to transmit values back
to the CPU to be combined by a parallel, global operation, such as a logical-and or a logical-or, depending
upon the hardware support in the CPU.
Because all PEs execute the same instruction at the same time, this type of architecture is suited to problems
with data parallelism. Data parallelism is a type of parallelism that is characterized by the ability to perform
the same operation on different data simultaneously. For example, a loop of the form
for i = 0 to N-1
do a[i] = a[i] + 1;
has data parallelism because the updates to the distinct array elements a[i] are independent of each other
and may be performed in parallel, whereas the loop
for i = 1 to N-1
do a[i] = a[i-1] + 1;
has no data parallelism because the update to a[i] cannot be performed until the update to a[i-1] has completed. If the value
of N is smaller than the number of processing elements, the entire loop takes the same amount of time as a
single processor takes to perform the increment on a scalar variable. If the value of N is larger, then the
work has to be distributed to the PEs so that they each update the values of several array elements. This
may be handled by the hardware, by a runtime library, or by the programmer, depending on the particular
architecture and software.
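As a hedged sketch of that last point (the block-distribution scheme, PE count, and names are assumptions for illustration), each processing element can be given a contiguous chunk of the array based on its processor id:

#include <stddef.h>

#define NUM_PE 64                       /* assumed number of processing elements */

/* Work done by one processing element: update its own contiguous block
 * of the array.  pe_id is the unique processor id mentioned in the text;
 * n may be much larger than NUM_PE. */
void pe_update(int *a, size_t n, int pe_id) {
    size_t chunk = (n + NUM_PE - 1) / NUM_PE;   /* ceiling division */
    size_t lo = (size_t)pe_id * chunk;
    size_t hi = (lo + chunk < n) ? lo + chunk : n;
    for (size_t i = lo; i < hi; i++)
        a[i] = a[i] + 1;                /* the data-parallel body from the text */
}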

Vector processor classification


According to where the operands are retrieved from in a vector processor, pipelined vector computers are
classified into two architectural configurations:
1. Memory to memory architecture –
In memory to memory architecture, source operands, intermediate and final results are retrieved (read)
directly from the main memory. For memory to memory vector instructions, the information of the base
address, the offset, the increment, and the vector length must be specified in order to enable streams of
data transfers between the main memory and the pipelines.
The main points about memory to memory architecture are:
• There is no limitation of size.
• Speed is comparatively slow in this architecture.
2. Register to register architecture –
In register to register architecture, operands and results are retrieved indirectly from the main memory
through the use of a large number of vector registers or scalar registers. Processors like the Cray-1 and the
Fujitsu VP-200 use vector instructions in register to register formats. The main points about register to
register architecture are:
• Register to register architecture has limited size.
• Speed is very high as compared to the memory to memory architecture.
• The hardware cost is high in this architecture.
A block diagram of a modern multiple pipeline vector computer is shown below:
Vector instruction types
A vector operand contains an ordered set of n elements, where n is called the length of the vector. All
elements of a vector are scalar quantities of the same type, which may be floating-point numbers, integers,
logical values, or characters.

Four primitive types of vector instructions are:


f1 : V --> V
f2 : V --> S
f3 : V x V --> V
f4 : V x S --> V

Where V and S denote a vector operand and a scalar operand, respectively. The instructions f1 and f2
are unary operations and f3 and f4 are binary operations. The VCOM (vector complement), which
complements each component of the vector, is an f1 operation. The pipelined implementation of the f1
operation is shown in figure 1:

The VMAX (vector maximum), which finds the maximum scalar quantity from all the components in
the vector, is an f2 operation. The pipelined implementation of the f2 operation is shown in figure 2.

The VMPL (vector multiply), which multiplies the respective scalar components of two vector operands and
produces another product vector, is an f3 operation. The pipelined implementation of the f3 operation is shown
in figure 3:
The SVP (scalar vector product), which multiplies each component of the vector by one constant value, is an f4
operation. The pipelined implementation of the f4 operation is shown in figure 4.
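As a hedged C sketch of the four primitive types (function names follow the mnemonics above; the element type and the use of plain loops in place of pipelined hardware are assumptions):

#include <stddef.h>

/* f1 : V -> V   (VCOM: complement each component) */
void vcom(int *out, const int *v, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = ~v[i];
}

/* f2 : V -> S   (VMAX: maximum component of the vector) */
int vmax(const int *v, size_t n) {
    int m = v[0];
    for (size_t i = 1; i < n; i++) if (v[i] > m) m = v[i];
    return m;
}

/* f3 : V x V -> V   (VMPL: componentwise multiply of two vectors) */
void vmpl(int *out, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
}

/* f4 : V x S -> V   (SVP: multiply every component by a scalar) */
void svp(int *out, const int *v, int s, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = v[i] * s;
}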

Vector Instruction Format in Vector Processors


Different Instruction formats are used by different vector processors. Vector instructions are generally
specified by some fields. The main fields that are used in Vector Instruction Set are given below.
1. Operation Code (Opcode) –
The operation code must be specified to select the functional unit or to reconfigure a multifunctional unit
to perform the specified operation dictated by this field. Usually, microcode control is used to set up the
required resources.
For example:
Opcode – 0001 mnemonic – ADD operation
Opcode – 0010 mnemonic – SUB operation
Opcode – 1111 mnemonic – HLT operation
2. Base addresses –
For a memory reference instruction, the base addresses are needed for both source operands and result
vectors. If the operands and results are located in the vector register file (i.e., a collection of registers), the
designated vector registers must be specified in the instruction.
For example:
ADD R1, R2
Here, R1 and R2 are the addresses of the registers.
3. Offset (or Displacement) –
This field is required to get the effective memory address of an operand vector. The address offset relative to
the base address should be specified. Using the base address and the offset (positive or negative), the
effective address is calculated.
4. Address Increment –
The address increment between the scalar elements of a vector operand must be specified. In some computers
the increment is always 1.
For example: if the base address in R1 is 400 and the increment is 1, successive elements are accessed at
addresses 400, 401, 402, and so on.
5. Vector length –
The vector length (a positive integer) is needed to determine the termination of a vector
instruction.
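A hedged sketch of these fields as a C structure (the field widths and names are assumptions for illustration, not a real machine's encoding):

#include <stdint.h>

/* Illustrative encoding of a memory-reference vector instruction:
 * one field per item listed above. */
typedef struct {
    uint8_t  opcode;        /* 1. operation code, e.g. 0001 = ADD          */
    uint32_t base_src1;     /* 2. base address of first source vector      */
    uint32_t base_src2;     /*    base address of second source vector     */
    uint32_t base_dst;      /*    base address of the result vector        */
    int32_t  offset;        /* 3. displacement relative to the base (+/-)  */
    int32_t  increment;     /* 4. address increment between elements       */
    uint32_t length;        /* 5. vector length (number of elements)       */
} vector_instr;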

Array processor
An array processor is a computer/processor that has an architecture especially designed for processing arrays
(e.g. matrices) of numbers. The architecture includes a number of processors (say 64 by 64) working
simultaneously, each handling one element of the array, so that a single operation can apply to all elements
of the array in parallel. To obtain the same effect in a conventional processor, the operation must be applied
to each element of the array sequentially, and so consequently much more slowly. An array processor may
be built as a self-contained unit attached to a main computer via an I/O port or internal bus; alternatively,
it may be a distributed array processor where the processing elements are distributed throughout, and
closely linked to, a section of the computer's memory. Array processors are very powerful tools for handling
problems with a high degree of parallelism. They do however demand a modified approach to
programming. The conversion of conventional (sequential) programs to serve array processors is not a
trivial task, and it is sometimes necessary to select different (parallel) algorithms to suit the parallel
approach.

Types of Array Processors


There are basically two types of array processors:
1. Attached Array Processors
2. SIMD Array Processors

Attached Array Processors


An attached array processor is a processor which is attached to a general purpose computer and its
purpose is to enhance and improve the performance of that computer in numerical computational tasks. It
achieves high performance by means of parallel processing with multiple functional units.

SIMD Array Processors


SIMD is the organization of a single computer containing multiple processors operating in parallel. The
processing units are made to operate under the control of a common control unit, thus providing a single
instruction stream and multiple data streams. A general block diagram of an array processor is shown below.
It contains a set of identical processing elements (PEs), each of which has a local memory M. Each
processing element includes an ALU and registers. The master control unit controls all the operations of the
processing elements. It also decodes the instructions and determines how the instructions are to be executed.
The main memory is used for storing the program. The control unit is responsible for fetching the
instructions. Vector instructions are sent to all PEs simultaneously and the results are returned to memory.
The best known SIMD array processor is the ILLIAC IV computer developed by Burroughs Corporation.
SIMD processors are highly specialized computers. They are only suitable for numerical problems that can
be expressed in vector or matrix form and they are not suitable for other types of computations.

Summary about Array Processor

1) Array processors increase the overall instruction processing speed.


2) As most array processors operate asynchronously from the host CPU, they improve the
overall capacity of the system.

3) Array processors have their own local memory, hence providing extra memory for systems with low
memory.

4) The AP (array processor) is most efficient in doing repetitive operations such as FFTs and
multiplying large vectors. Its efficiency degrades for non-repetitive operations, or operations requiring a
great number of decisions based on the results of computations.

5) Since APs have their own program and data memory, the AP instructions and data must be transferred
to, and the results transferred from, the AP. These I/O operations may cost more CPU time than the amount
saved by using the array processor.

6) As a general rule, use of the AP is more efficient than the CPU when multiple or complex (such as FFT)
operations, which are highly repetitious, are going to be done on a relatively large amount of data (thousands
of words or more). In other cases, use of the AP will not help much and will keep other processes from using
a valuable resource.

Cache Coherence: In a single-CPU system, two copies of the same data, one in the cache and another in
main memory, can become different. This data inconsistency is called the cache coherence problem.

MESI protocol: The MESI protocol makes it possible to maintain coherence in cached systems. It is
based on the four states that a block in the cache memory can have. These four states are the abbreviations
for MESI: Modified, Exclusive, Shared and Invalid. The states are explained below:
- Invalid: A non-valid state. The data you are looking for are not in the cache, or the local copy of these
data is not correct because another processor has updated the corresponding memory position.
- Shared: Shared without having been modified. Another processor can have the data in its cache memory,
and both copies are in their current version.
- Exclusive: Exclusive without having been modified. That is, this cache is the only one that has the correct
value of the block, and the data block agrees with the one in main memory.
- Modified: Actually, it is an exclusive-modified state. It means that this cache has the only copy that is correct
in the whole system; the data in main memory are stale. The figure below explains this in a
more detailed way.

The state of each cache memory block can change depending on the actions taken by the CPU.
At the beginning, when the cache is empty and a block of memory is loaded into the cache by the processor,
this block has the exclusive state, because there are no copies of that block in any other cache. Then, if this block
is written, it changes to the modified state, because the block is only in one cache but it has been modified,
and the block that is in main memory now differs from it.
On the other hand, if a block is in the exclusive state in one cache and another CPU tries to read it but does not
find it in its own cache, it fetches it from main memory and loads it into its cache memory. That block is then in two
different caches, so its state becomes shared. If a CPU wants to write into a block that is in the modified state
in another cache, that block has to be cleared from the cache where it was and written back
to main memory, because it was the most current copy of that block in the system. The writing CPU then
writes the block and holds it in its cache memory in the modified state, because it now holds the most
current version. If a CPU wants to read a block and does not find it in its cache because
there is a more recent copy elsewhere, the system has to clear the block from the cache where it was and
write it back to main memory. From there, the block is read and the new state is shared, because there are
two current copies in the system. Finally, if a CPU writes into a shared block, the other copies are invalidated
and the block changes its state to modified.

1. Widely Used Cache Coherence Protocol: The MESI protocol is one of the most commonly used cache
coherence protocols. It extends the MSI protocol by introducing the Exclusive state, which allows a cache to
hold a copy of a block exclusively, indicating that no other cache has a copy.

2. Advantages of Exclusive and Modified States: The Exclusive state in the MESI protocol enables faster
read operations by allowing caches to read the block without checking for other copies. When a cache wants
to modify a block in the Exclusive state, it can transition to the Modified state directly without invalidating
other copies. This reduces the need for unnecessary bus transactions, enhancing performance.
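As a hedged sketch (not a full protocol implementation; the event names and simplified transition table are assumptions), the local-state transitions described above can be written as:

/* Simplified MESI state machine for one cache block, seen from a single
 * cache.  Bus events from other caches are reduced to "remote read" and
 * "remote write"; real protocols carry more detail. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (e) {
    case LOCAL_READ:
        /* On a miss (INVALID) the block is fetched; it becomes EXCLUSIVE
         * if no other cache holds it, SHARED otherwise (EXCLUSIVE assumed here). */
        return (s == INVALID) ? EXCLUSIVE : s;
    case LOCAL_WRITE:
        /* Writing leaves this cache with the only up-to-date copy. */
        return MODIFIED;
    case REMOTE_READ:
        /* Another cache reads the block: a MODIFIED copy is written back
         * and both copies become SHARED; EXCLUSIVE also degrades to SHARED. */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:
        /* Another cache writes the block: our copy is no longer valid. */
        return INVALID;
    }
    return s;
}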
Cluster
A cluster is a set of loosely or tightly connected computers working together as a unified computing resource that
can create the illusion of being one machine. Computer clusters have each node set to perform the same
task, controlled and scheduled by software. The components of a cluster are usually connected to each
other through fast local area networks, with each node running its own instance of an operating system. In
most circumstances, all the nodes use the same hardware and the same operating system, although in
some setups different hardware or different operating systems can be used.

NUMA (non-uniform memory access): NUMA is a method of configuring a cluster of microprocessors in a
multiprocessing system so that they can
share memory locally, improving performance and the ability of the system to be expanded. NUMA is used
in a symmetric multiprocessing (SMP) system. An SMP system is a "tightly-coupled," "share everything"
system in which multiple processors working under a single operating system access each other's memory
over a common bus or "interconnect" path. Ordinarily, a limitation of SMP is that as microprocessors are
added, the shared bus or data path gets overloaded and becomes a performance bottleneck. NUMA adds an
intermediate level of memory shared among a few microprocessors so that all data accesses don't have to
travel on the main bus. NUMA can be thought of as a "cluster in a box." The cluster typically consists of
four microprocessors (for example, four Pentium microprocessors) interconnected on a local bus (for
example, a Peripheral Component Interconnect bus) to a shared memory (called an "L3 cache") on a single
motherboard (it could also probably be referred to as a card).

This unit can be added to similar units to form a symmetric multiprocessing system in which a common
SMP bus interconnects all of the clusters. Such a system typically contains from 16 to 256 microprocessors.
To an application program running in an SMP system, all the individual processor memories look like a
single memory. When a processor looks for data at a certain memory address, it first looks in the L1 cache
on the microprocessor itself, then on a somewhat larger L1 and L2 cache chip nearby, and then on a third
level of cache that the NUMA configuration provides before seeking the data in the "remote memory"
located near the other microprocessors. Each of these clusters is viewed by NUMA as a "node" in the
interconnection network. NUMA maintains a hierarchical view of the data on all the nodes. Data is moved
on the bus between the clusters of a NUMA SMP system using scalable coherent interface (SCI)
technology. SCI coordinates what is called "cache coherence" or consistency across the nodes of the
multiple clusters. SMP and NUMA systems are typically used for applications such as data mining and
decision support systems in which processing can be parceled out to a number of processors that collectively
work on a common database. Sequent, Data General, and NCR are among companies that produce NUMA
SMP systems.
