CAO Unit 6

Chapter 9 discusses pipeline and vector processing: parallel processing techniques that increase computational speed, classified by instruction and data streams as SISD, SIMD, MISD, and MIMD. It covers the principles of pipelining, including arithmetic and instruction pipelines, and the difficulties they face, such as resource conflicts and data dependencies. It also explores vector processing applications and operations, emphasizing their efficiency on large data sets.


Chap. 9 Pipeline and Vector Processing 9-1

• 9-1 Parallel Processing


  - Simultaneous data processing in multiple tasks for the purpose of increasing the computational speed
    » Perform concurrent data processing to achieve faster execution time
  - Multiple Functional Units : Fig. a (parallel processing example)
    » Separate the execution unit into eight functional units operating in parallel
    [Fig. a: processor registers feed eight parallel functional units (adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, floating-point divide) whose results go to memory]
  - Computer Architectural Classification
    » Data-Instruction Stream : Flynn
    » Serial versus Parallel Processing : Feng
    » Parallelism and Pipelining : Händler

• Flynn's Classification
  - 1) SISD (Single Instruction stream - Single Data stream)
    » For practical purposes: only one processor is useful
    » Example systems : Amdahl 470V/6, IBM 360/91
    [Fig.: SISD organization; the CU sends the instruction stream IS to the PU, which exchanges the data stream DS with the memory module MM]

Chap. 9 Pipeline and Vector Processing 9-2

• 2) SIMD (Single Instruction stream - Multiple Data stream)
  » Vector or array operations
    One vector operation includes many operations on a data stream
  » Example systems : CRAY-1, ILLIAC-IV
  [Fig.: SIMD organization with shared memory; one CU broadcasts the IS to PU 1 .. PU n, each processing its own data stream DS 1 .. DS n against memory modules MM1 .. MMn]

• 3) MISD (Multiple Instruction stream - Single Data stream)
  » Bottleneck at the single data stream
  [Fig.: MISD organization; CU1 .. CUn issue IS 1 .. IS n to PU 1 .. PU n, all operating on one data stream DS through shared memory MM1 .. MMn]

Chap. 9 Pipeline and Vector Processing 9-3

• 4) MIMD (Multiple Instruction stream - Multiple Data stream)
  » Multiprocessor System
  [Fig.: MIMD organization with shared memory; CU1 .. CUn each issue their own IS to PU 1 .. PU n, which exchange separate data streams with MM1 .. MMn]

• Main topics in this Chapter
  - Pipeline processing : Sec. 9-2
    » Arithmetic pipeline : Sec. 9-3
    » Instruction pipeline : Sec. 9-4
  - Vector processing : adder/multiplier pipeline, Sec. 9-6 (large vectors, matrices, array data)
  - Array processing : array processor, Sec. 9-7
    » Attached array processor : Fig. 9-14
    » SIMD array processor : Fig. 9-15

Chap. 9 Pipeline and Vector Processing 9-4

• 9-2 Pipelining
  - Pipelining
    » Decompose a sequential process into sub-operations
    » Each sub-operation is executed in a special dedicated segment, concurrently with the others
  - Pipelining example : Fig. 9-2
    » Multiply-and-add operation : Ai * Bi + Ci ( for i = 1, 2, ..., 7 )
    » 3 sub-operation segments
      1) R1 <- Ai, R2 <- Bi : input Ai and Bi
      2) R3 <- R1 * R2, R4 <- Ci : multiply and input Ci
      3) R5 <- R3 + R4 : add Ci
    » Contents of the registers in the pipeline example : Tab. 9-1
  - General considerations
    » 4-segment pipeline : Fig. 9-3
      S : combinational circuit for a sub-operation
      R : register (intermediate results between the segments)
    » Space-time diagram : Fig. 9-4
      Shows segment utilization as a function of time (clock cycles)
    » Task : T1, T2, T3, ..., T6
      A task is the total operation performed going through all the segments
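
To make the register transfers concrete, here is a minimal simulation sketch (not from the textbook; the operand values are arbitrary) that clocks the three-segment multiply-and-add pipeline and prints the register contents each cycle, in the spirit of Tab. 9-1:

    #include <stdio.h>

    #define N 7   /* seven (Ai, Bi, Ci) tasks, as in Fig. 9-2 */

    int main(void) {
        double A[N] = {1, 2, 3, 4, 5, 6, 7};
        double B[N] = {2, 2, 2, 2, 2, 2, 2};
        double C[N] = {1, 1, 1, 1, 1, 1, 1};
        /* which task occupies each segment; -1 means the segment is empty */
        int s1 = -1, s2 = -1, s3 = -1;
        double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

        printf("clk    R1    R2    R3    R4    R5\n");
        for (int clk = 1; clk <= N + 2; clk++) {  /* k + n - 1 = 3 + 7 - 1 = 9 cycles */
            /* Segment 3: R5 <- R3 + R4 (uses values latched last cycle) */
            s3 = s2; if (s3 >= 0) R5 = R3 + R4;
            /* Segment 2: R3 <- R1 * R2, R4 <- Ci */
            s2 = s1; if (s2 >= 0) { R3 = R1 * R2; R4 = C[s2]; }
            /* Segment 1: R1 <- Ai, R2 <- Bi */
            s1 = (clk <= N) ? clk - 1 : -1;
            if (s1 >= 0) { R1 = A[s1]; R2 = B[s1]; }
            printf("%3d %5.0f %5.0f %5.0f %5.0f %5.0f\n", clk, R1, R2, R3, R4, R5);
        }
        return 0;
    }

Note the update order: later segments latch first, so each segment sees the values its predecessor produced on the previous clock, just as the registers between segments do in hardware.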
Chap. 9 Pipeline and Vector Processing 9-5

• Speedup S : nonpipeline time / pipeline time
  - S = n * tn / [ ( k + n - 1 ) * tp ] = 6 * 6 tp / [ ( 4 + 6 - 1 ) * tp ] = 36 tp / 9 tp = 4
    » n : number of tasks ( 6 )
    » tn : time to complete each task in the nonpipelined case ( 6 cycle times = 6 tp )
    » tp : clock cycle time ( 1 clock cycle )
    » k : number of segments ( 4 )
    » The pipeline completes the n tasks in k + n - 1 = 9 clock cycles
  - If n -> infinity, then S -> tn / tp
    » If a nonpipelined task takes as long as one pass through the pipeline, tn = k * tp, then S = tn / tp = k * tp / tp = k
  [Fig. 9-4: space-time diagram; tasks T1 .. T6 advance one segment per clock cycle through segments 1 .. 4, finishing in 9 clock cycles]
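
As a quick numeric check (a throwaway sketch; the task counts beyond the slide's n = 6 are arbitrary), the formula reproduces the slide's S = 4 and tends to k for large n:

    #include <stdio.h>

    /* Speedup of a k-segment pipeline: S = n*tn / ((k + n - 1) * tp),
       with all times expressed in units of tp. */
    static double speedup(int k, long n, double tn) {
        return (n * tn) / (k + n - 1.0);
    }

    int main(void) {
        printf("slide example: S = %.2f\n", speedup(4, 6, 6.0));   /* 4.00 */
        /* with tn = k * tp, S approaches k = 4 as n grows */
        long ns[] = {6, 100, 10000, 1000000};
        for (int i = 0; i < 4; i++)
            printf("n = %8ld  S = %.4f\n", ns[i], speedup(4, ns[i], 4.0));
        return 0;
    }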

• Pipeline types : Arithmetic Pipeline and Instruction Pipeline

• Sec. 9-3 Arithmetic Pipeline
  - Floating-point adder pipeline example : Fig. 9-6
    » Add / subtract two normalized floating-point binary numbers
      X = A x 2^a = 0.9504 x 10^3
      Y = B x 2^b = 0.8200 x 10^2
Chap. 9 Pipeline and Vector Processing 9-6

• 4-segment sub-operations
  - 1) Compare exponents by subtraction : 3 - 2 = 1
      X = 0.9504 x 10^3
      Y = 0.8200 x 10^2
  - 2) Align mantissas
      X = 0.9504 x 10^3
      Y = 0.08200 x 10^3
  - 3) Add mantissas
      Z = 1.0324 x 10^3
  - 4) Normalize result
      Z = 0.10324 x 10^4
  [Fig. 9-6: floating-point adder pipeline; exponents a, b and mantissas A, B enter through registers R; Segment 1 compares exponents by subtraction, Segment 2 chooses the exponent and aligns the mantissas, Segment 3 adds or subtracts the mantissas, Segment 4 adjusts the exponent and normalizes the result]
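
The four segments map naturally onto four functions. The sketch below (decimal mantissas for readability, a simplification of the binary hardware in Fig. 9-6) pushes the slide's operands through them:

    #include <stdio.h>

    typedef struct { double m; int e; } Fp;   /* value = m * 10^e */

    /* Segment 1: compare exponents by subtraction */
    static int compare_exponents(Fp x, Fp y) { return x.e - y.e; }

    /* Segment 2: align the mantissa of the smaller-exponent operand */
    static void align(Fp *x, Fp *y, int diff) {
        for (; diff > 0; diff--) { y->m /= 10.0; y->e++; }
        for (; diff < 0; diff++) { x->m /= 10.0; x->e++; }
    }

    /* Segment 3: add mantissas (exponents are now equal) */
    static Fp add_mantissas(Fp x, Fp y) { Fp z = { x.m + y.m, x.e }; return z; }

    /* Segment 4: normalize so that 0.1 <= m < 1 */
    static Fp normalize(Fp z) {
        while (z.m >= 1.0) { z.m /= 10.0; z.e++; }
        while (z.m != 0.0 && z.m < 0.1) { z.m *= 10.0; z.e--; }
        return z;
    }

    int main(void) {
        Fp x = {0.9504, 3}, y = {0.8200, 2};
        int diff = compare_exponents(x, y);   /* 3 - 2 = 1 */
        align(&x, &y, diff);                  /* Y becomes 0.08200 x 10^3 */
        Fp z = add_mantissas(x, y);           /* 1.0324 x 10^3 */
        z = normalize(z);                     /* 0.10324 x 10^4 */
        printf("Z = %.5f x 10^%d\n", z.m, z.e);
        return 0;
    }

In the real pipeline each function would be a combinational segment behind a latch, and four different operand pairs would occupy the four segments at once.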

Chap. 9 Pipeline and Vector Processing 9-7

• 9-4 Instruction Pipeline
  - Instruction Cycle
    1) Fetch the instruction from memory
    2) Decode the instruction
    3) Calculate the effective address
    4) Fetch the operands from memory
    5) Execute the instruction
    6) Store the result in the proper place
  [Fig. a: four-segment CPU pipeline; Segment 1 fetches the instruction from memory, Segment 2 decodes it and calculates the effective address (testing for a branch), Segment 3 fetches the operand from memory, Segment 4 executes the instruction; on an interrupt the PC is updated and the pipe is emptied]

• Example : Four-segment Instruction Pipeline
  - Four-segment CPU pipeline : Fig. a
    » 1) FI : instruction fetch
    » 2) DA : decode instruction and calculate effective address
    » 3) FO : operand fetch
    » 4) EX : execution
  - Timing of the instruction pipeline : Fig. b
    » Instruction 3 is a branch: the following instruction cannot proceed until the branch is resolved
  [Fig. b: timing over steps 1 .. 13; instructions 1 .. 7 flow through FI, DA, FO, EX one step apart, except that instruction 4's FI is held while the branch (instruction 3) completes, delaying instructions 4 .. 7]
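
The chart in Fig. b can be generated mechanically: fetch normally starts one step after the previous instruction, but after a branch it waits for the branch's EX stage. A sketch (the instruction count and branch position follow the figure):

    #include <stdio.h>

    #define NINSTR 7
    #define NSTAGE 4   /* FI, DA, FO, EX */

    int main(void) {
        const char *stage[NSTAGE] = {"FI", "DA", "FO", "EX"};
        int branch = 2;              /* instruction 3 (0-based index 2) is a branch */
        int start[NINSTR];
        start[0] = 0;
        for (int i = 1; i < NINSTR; i++) {
            start[i] = start[i - 1] + 1;         /* normal one-step stagger */
            if (i - 1 == branch)                 /* predecessor was the branch: */
                start[i] = start[i - 1] + NSTAGE; /* wait until its EX is done  */
        }
        for (int i = 0; i < NINSTR; i++) {
            printf("instr %d:", i + 1);
            for (int t = 0; t < start[i]; t++) printf("    ");
            for (int s = 0; s < NSTAGE; s++) printf("  %s", stage[s]);
            printf("\n");
        }
        return 0;
    }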
Chap. 9 Pipeline and Vector Processing 9-8

• Pipeline Conflicts : 3 major difficulties
  - 1) Resource conflicts
    » Memory access by two segments at the same time
  - 2) Data dependency
    » An instruction depends on the result of a previous instruction, but this result is not yet available
  - 3) Branch difficulties
    » Branch and other instructions (interrupt, return, ...) that change the value of the PC

• Data Dependency
  - Hardware solutions
    » Hardware interlock : delay the dependent instruction until the previous instruction's result is available
    » Operand forwarding : route the result of the previous instruction directly to the unit that needs it
  - Software solution
    » Delayed load : the compiler inserts no-operation instructions after the previous (loading) instruction

• Handling of Branch Instructions
  - Prefetch target instruction
    » For a conditional branch, prefetch the branch target instruction in addition to the next sequential instruction

Chap. 9 Pipeline and Vector Processing 9-9

• Branch Target Buffer : BTB
  » 1) Associative memory in the fetch segment; each entry holds the address of a previously executed branch together with its target instruction
  » 2) When a branch instruction hits in the BTB, the target instruction is available without a fetch delay

• Loop Buffer
  » 1) Small, very high speed register file (RAM) maintained by the fetch segment
  » 2) A program loop is loaded into the loop buffer and executed from it, without further memory fetches

• Branch Prediction
  » Additional hardware logic guesses the outcome of a conditional branch before it executes

• Delayed Branch
  - Fig. a : the compiler fills the delay slots after the branch with no-operation instructions
  [Fig. a: clock cycles 1 .. 10; Load, Increment, Add, Subtract, Branch to X, No-operation, No-operation, Instruction in X each pass through I, A, E one cycle apart]
  - Fig. b : normal pipeline operation restored by rearranging the instructions
    » 1) The no-operation instructions are eliminated
    » 2) Instruction rearranging is done by the compiler, which moves useful instructions (Add, Subtract) into the delay slots after the branch
  [Fig. b: clock cycles 1 .. 8; Load, Increment, Branch to X, Add, Subtract, Instruction in X; the rearranged sequence finishes two cycles earlier]

Chap. 9 Pipeline and Vector Processing 9-10

• 9-5 RISC Pipeline
  - RISC CPUs resolve pipeline conflicts with compiler support
    » Instruction pipeline
    » Single-cycle instruction execution
    » Compiler support
  - Example : Three-segment Instruction Pipeline
    » 3 sub-operations in the instruction cycle
      1) I : instruction fetch
      2) A : instruction decode and ALU operation
      3) E : transfer the output of the ALU to a register, memory, or the PC
  - Delayed Load : Fig. (a)
    » Load R2 immediately followed by Add R1 + R2 causes a data conflict: the Add needs R2 before the load has delivered it
    [Fig. (a): pipeline timing with data conflict; 1. Load R1, 2. Load R2, 3. Add R1+R2, 4. Store R3 pass through I, A, E one cycle apart over 6 clock cycles]
    » Fig. (b) : the compiler inserts a no-operation after Load R2, delaying the Add by one cycle
    [Fig. (b): pipeline timing with delayed load; 1. Load R1, 2. Load R2, 3. No-operation, 4. Add R1+R2, 5. Store R3 over 7 clock cycles]
  - Delayed Branch : see Sec. 9-4
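
What the compiler does here is mechanical enough to sketch (the instruction encoding and register numbering below are invented for illustration): insert a no-operation wherever an instruction reads the destination register of the immediately preceding load.

    #include <stdio.h>

    typedef struct {
        const char *text;   /* mnemonic, for printing */
        int is_load;        /* 1 if the instruction loads a register */
        int dest;           /* register written (-1 if none) */
        int src1, src2;     /* registers read (-1 if unused) */
    } Instr;

    int main(void) {
        /* The sequence from Fig. (a): Load R1, Load R2, Add, Store */
        Instr prog[] = {
            {"Load  R1 <- M[addr1]", 1,  1, -1, -1},
            {"Load  R2 <- M[addr2]", 1,  2, -1, -1},
            {"Add   R3 <- R1 + R2",  0,  3,  1,  2},
            {"Store M[addr3] <- R3", 0, -1,  3, -1},
        };
        int n = sizeof prog / sizeof prog[0];

        for (int i = 0; i < n; i++) {
            /* load-use hazard: previous instruction loads a register this one reads */
            if (i > 0 && prog[i - 1].is_load &&
                (prog[i].src1 == prog[i - 1].dest || prog[i].src2 == prog[i - 1].dest))
                printf("No-op                 (inserted by compiler)\n");
            printf("%s\n", prog[i].text);
        }
        return 0;
    }

Run on this sequence it reproduces Fig. (b): a single no-operation appears between Load R2 and the Add.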

Chap. 9 Pipeline and Vector Processing 9-11

• 9-6 Vector Processing
  - Science and engineering applications
    » Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing
  - Vector Operations
    » Arithmetic operations on large arrays of numbers
    » Conventional scalar processor
      Fortran language:
         DO 20 I = 1, 100
      20 C(I) = A(I) + B(I)
      Machine language:
         Initialize I = 0
      20 Read A(I)
         Read B(I)
         Store C(I) = A(I) + B(I)
         Increment I = I + 1
         If I <= 100 go to 20
         Continue
    » Vector processor
      Single vector instruction:
         C(1:100) = A(1:100) + B(1:100)
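
In C terms (a sketch; the Fortran array notation has no direct C equivalent), the scalar loop below is exactly what the machine-language column spells out, and it is the pattern a vector processor collapses into one instruction:

    #include <stdio.h>

    #define N 100

    int main(void) {
        double A[N], B[N], C[N];
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0 * i; }

        /* Scalar form: one read-read-add-store per iteration, plus the
           increment and compare shown in the machine-language column.
           A vector processor does the same work with a single
           instruction: C(1:100) = A(1:100) + B(1:100). */
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];

        printf("C[99] = %.1f\n", C[99]);   /* 99 + 198 = 297 */
        return 0;
    }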

Chap. 9 Pipeline and Vector Processing 9-12

• Vector Instruction Format :
    Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
    ADD            | A                     | B                     | C                        | 100

• Matrix Multiplication
  - 3 x 3 matrix multiplication : n^2 = 9 inner products
      [ a11 a12 a13 ]   [ b11 b12 b13 ]   [ c11 c12 c13 ]
      [ a21 a22 a23 ] x [ b21 b22 b23 ] = [ c21 c22 c23 ]
      [ a31 a32 a33 ]   [ b31 b32 b33 ]   [ c31 c32 c33 ]
    » e.g. c11 = a11 b11 + a12 b21 + a13 b31
  - Cumulative multiply-add operation : n^3 = 27 multiply-adds
      c <- c + a * b
    » c11 = c11 + a11 b11 + a12 b21 + a13 b31, starting from c11 = 0
    » 9 inner products x 3 multiply-adds each = 27
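
For concreteness, here is the cumulative multiply-add formulation in C (a plain scalar sketch with made-up operand values; a vector machine would pipeline these multiply-adds as described on the next slide):

    #include <stdio.h>

    #define N 3

    int main(void) {
        double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
        double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
        double c[N][N] = {0};   /* start from c = 0 */

        /* n^2 = 9 inner products, each of n = 3 multiply-adds: 27 in total */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];   /* c <- c + a * b */

        printf("c11 = %.0f\n", c[0][0]);   /* 1*9 + 2*6 + 3*3 = 30 */
        return 0;
    }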

Chap. 9 Pipeline and Vector Processing 9-13

• Pipeline for calculating an inner product : Fig.
  - Floating-point multiplier pipeline : 4 segments
  - Floating-point adder pipeline : 4 segments
      C = A1 B1 + A2 B2 + A3 B3 + ... + Ak Bk
  [Fig.: source A and source B feed a 4-segment multiplier pipeline whose products flow into a 4-segment adder pipeline]
    » After the 1st clock input : A1 B1 enters the multiplier pipeline
    » After the 4th clock input : A4 B4, A3 B3, A2 B2, A1 B1 fill the four multiplier segments
    » After the 8th clock input : the adder pipeline holds A4 B4 .. A1 B1 while the multiplier holds A8 B8 .. A5 B5
    » After the 9th, 10th, 11th, ... clock inputs : a new product leaves the multiplier each cycle and is added to one of the partial sums circulating in the adder
  - Four-section summation (the adder pipeline accumulates four interleaved partial sums):
      C = ( A1 B1 + A5 B5 + A9 B9  + A13 B13 + ... )
        + ( A2 B2 + A6 B6 + A10 B10 + A14 B14 + ... )
        + ( A3 B3 + A7 B7 + A11 B11 + A15 B15 + ... )
        + ( A4 B4 + A8 B8 + A12 B12 + A16 B16 + ... )
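
The four-section summation is easy to emulate in software (a sketch with arbitrary operand values; real hardware keeps the four partial sums inside the adder pipeline's stages): accumulate every fourth product into its own partial sum, then combine.

    #include <stdio.h>

    #define K 16   /* vector length, a multiple of 4 for simplicity */

    int main(void) {
        double A[K], B[K], S[4] = {0, 0, 0, 0};
        for (int i = 0; i < K; i++) { A[i] = 1.0; B[i] = i + 1.0; }

        /* Product i joins partial sum i mod 4, mirroring the four
           interleaved sums circulating in a 4-segment adder pipeline. */
        for (int i = 0; i < K; i++)
            S[i % 4] += A[i] * B[i];

        double C = S[0] + S[1] + S[2] + S[3];   /* final four-section merge */
        printf("C = %.1f\n", C);                /* 1 + 2 + ... + 16 = 136 */
        return 0;
    }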

Chap. 9 Pipeline and Vector Processing 9-14

• Memory Interleaving : Fig. a
  - Simultaneous access to memory from two or more sources using one memory bus system
  - Even / odd address memory access
  [Fig. a: four memory modules share the address bus and data bus; each module has its own address register AR, memory array, and data register DR]
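
A common interleaving scheme (an illustrative sketch; the figure itself does not fix the mapping) uses the low-order address bits to select the module, so consecutive addresses fall in different modules:

    #include <stdio.h>

    #define NMODULES 4   /* four-way interleaving, as in Fig. a */

    /* Low-order interleaving: module = addr mod 4, word within module = addr div 4 */
    static void decode(unsigned addr, unsigned *module, unsigned *word) {
        *module = addr % NMODULES;
        *word   = addr / NMODULES;
    }

    int main(void) {
        for (unsigned addr = 0; addr < 8; addr++) {
            unsigned m, w;
            decode(addr, &m, &w);
            printf("address %u -> module %u, word %u\n", addr, m, w);
        }
        return 0;   /* consecutive addresses hit different modules */
    }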
• Supercomputer
  - Supercomputer = vector instructions + pipelined floating-point arithmetic
  - Performance evaluation indexes
    » MIPS : Millions of Instructions Per Second
    » FLOPS : Floating-point Operations Per Second
      megaflops : 10^6, gigaflops : 10^9
  - Cray supercomputers : Cray Research
    » Cray-1 : 80 megaflops, 4 million 64-bit words of memory
    » Cray-2 : 12 times more powerful than the Cray-1
  - VP supercomputers : Fujitsu
    » VP-200 : 300 megaflops, 32 million words of memory, 83 vector instructions, 195 scalar instructions
    » VP-2600 : 5 gigaflops

Chap. 9 Pipeline and Vector Processing 9-15

• 9-7 Array Processors
  - Perform computations on large arrays of data
    » Vector processing : adder/multiplier pipeline
    » Array processing : array processor
  - Array Processing
    » Attached array processor : Fig. a
      Auxiliary processor attached to a general-purpose computer
      [Fig. a: a general-purpose computer and an attached array processor linked through an input-output interface; main memory and the processor's local memory are joined by a high-speed memory-to-memory bus]
    » SIMD array processor : Fig. b
      Computer with multiple processing units operating in parallel
      Vector addition C = A + B : the master control unit broadcasts ci = ai + bi, and each PE i computes one element
      [Fig. b: a master control unit and main memory drive processing elements PE 1 .. PE n, each with a local memory M1 .. Mn]
Chap. 13 Multiprocessors 9-16

 13-1 Characteristics of Multiprocessors


 Multiprocessors System = MIMD
 An interconnection of two or more CPUs with memory and I/O equipment
» a single CPU and one or more IOPs is usually not included in a multiprocessor system
 Unless the IOP has computational facilities comparable to a CPU
 Computation can proceed in parallel in one of two ways
 1) Multiple independent jobs can be made to operate in parallel
 2) A single job can be partitioned into multiple parallel tasks
 Classified by the memory Organization
 1) Shared memory or Tightly-coupled system
» Local memory + Shared memory
 higher degree of interaction between tasks
 2) Distribute memory or Loosely-coupled system
» Local memory + message passing scheme (packet or message)
 most efficient when the interaction between tasks is minimal
 13-2 Interconnection Structure
 Multiprocessor System Components
 1) Time-shared common bus
 2) Multi-port memory
 3) Crossbar switch CPU, IOP, Memory unit ,
 4) Multistage switching network
 5) Hypercube system Interconnection Components
Chap. 13 Multiprocessors 9-17

• Time-shared Common Bus
  - Time-shared single common bus system : Fig. a
    » Only one processor can communicate with the memory or with another processor at any given time
      When one processor is communicating with the memory, all other processors are either busy with internal operations or idle, waiting for the bus
  [Fig. a: CPU 1, CPU 2, CPU 3, IOP 1, IOP 2, and the memory unit all attached to a single common bus]
  - Dual common bus system : Fig. b
    » System bus + local buses
    » Shared memory
      The memory connected to the common system bus is shared by all processors
    » System bus controller
      Links each local bus to the common system bus
  [Fig. b: a common shared memory on the system bus; each node has a system bus controller, a CPU, an optional IOP, and local memory on its own local bus]
Chap. 13 Multiprocessors 9-18

• Multiport Memory : Fig. a
  - Multiple paths between processors and memory
    » Advantage : high transfer rate can be achieved
    » Disadvantage : expensive memory control logic / large number of cables and connectors
  [Fig. a: memory modules MM 1 .. MM 4, each with four ports; CPU 1 .. CPU 4 each have a dedicated path to every module]

• Crossbar Switch : Fig. b
  - A switch at every intersection between a CPU bus and a memory-module path
  [Fig. b: a grid of crosspoints connecting CPU 1 .. CPU 4 to memory modules MM 1 .. MM 4]
  - Block diagram of one crosspoint : Fig. c
  [Fig. c: a memory module served by multiplexers and arbitration logic that select among the data, address, and control lines from CPU 1 .. CPU 4; read/write and memory-enable signals complete the interface]

Chap. 13 Multiprocessors 9-19

• Multistage Switching Network
  - Controls the communication between a number of sources and destinations
    » Tightly coupled system : PU to MM
    » Loosely coupled system : PU to PU
  - Basic component : a two-input, two-output interchange switch : Fig. a
  [Fig. a: a 2 x 2 interchange switch; a control input of 0 connects the selected input to output 0, a control input of 1 connects it to output 1 (A connected to 0, A connected to 1, B connected to 0, B connected to 1)]
  - Binary tree of switches : 2 processors connected through switches to 8 memory modules (000 - 111) : Fig. b
  [Fig. b: three levels of 2 x 2 switches route P0 and P1 to modules 000 .. 111; each bit of the destination address selects one switch output per level]
  - Omega Network : Fig. c
    » An N-input, N-output network topology built from 2 x 2 interchange switches
  [Fig. c: an 8 x 8 omega network; inputs 000 .. 111 reach outputs 000 .. 111 through three stages of 2 x 2 switches, one destination-address bit steering each stage]
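
Routing in an omega network is driven by the destination address, one bit per stage: a 0 bit takes the upper output of the current switch and a 1 bit the lower. A sketch of that rule (an illustration; switch numbering conventions vary):

    #include <stdio.h>

    #define STAGES 3   /* log2(8) stages for an 8 x 8 omega network */

    /* Print the output taken at each stage when routing to 'dest':
       at stage s, bit (STAGES-1-s) of the destination selects the
       upper (0) or lower (1) output of the current 2 x 2 switch. */
    static void route(unsigned dest) {
        printf("to %u%u%u:", (dest >> 2) & 1, (dest >> 1) & 1, dest & 1);
        for (int s = 0; s < STAGES; s++) {
            unsigned bit = (dest >> (STAGES - 1 - s)) & 1;
            printf("  stage %d -> %s", s + 1, bit ? "lower" : "upper");
        }
        printf("\n");
    }

    int main(void) {
        route(3);   /* 011: upper, lower, lower */
        route(6);   /* 110: lower, lower, upper */
        return 0;
    }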

Chap. 13 Multiprocessors 9-20

• Hypercube Interconnection : Fig.
  - Loosely coupled system
  - 2^n nodes; each node is directly connected to the n nodes whose binary addresses differ from its own in exactly one bit
  - Hypercube architecture example : Intel iPSC ( n = 7, 128 nodes )
  [Fig.: a one-cube (nodes 0, 1), a two-cube (nodes 00, 01, 10, 11), and a three-cube (nodes 000 .. 111)]
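
The addressing makes routing simple: two nodes are neighbors exactly when their addresses differ in one bit, and a message travels by repeatedly flipping one of the bits in which the current node differs from the destination. A sketch (node count chosen for the three-cube in the figure):

    #include <stdio.h>

    #define DIM 3   /* three-cube: 2^3 = 8 nodes, as in the figure */

    int main(void) {
        unsigned src = 0, dst = 5;          /* route 000 -> 101 */
        unsigned cur = src;
        printf("%u", cur);
        while (cur != dst) {
            unsigned diff = cur ^ dst;      /* bits still to correct */
            unsigned bit = diff & -diff;    /* flip the lowest differing bit */
            cur ^= bit;
            printf(" -> %u", cur);
        }
        printf("\n");                        /* 0 -> 1 -> 5 */
        return 0;
    }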

 13-3 Interprocessor Arbitration : Bus Control


 Single Bus System : Address bus, Data bus, Control bus
 Multiple Bus System : Memory bus, I/O bus, System bus
 System bus : Bus that connects CPUs, IOPs, and Memory in multiprocessor
system
 Data transfer method over the system bus
 Synchronous bus : achieved by driving both units from a common clock source
 Asynchronous bus : accompanied by handshaking control signals

Chap. 13 Multiprocessors 9-21

• System Bus : IEEE Standard 796 Multibus
  - 86 signal lines : Tab.
    » Bus arbitration signals : BREQ, BUSY, ...
      Bus busy line : if this line is inactive, no other processor is using the bus
  - Bus arbitration algorithms : static / dynamic
    » Static : priorities are fixed
      Serial (daisy-chain) arbitration : Fig.
      [Fig.: bus arbiters 1 .. 4 chained PI -> PO from highest priority to lowest, all monitoring the common bus busy line]
      Parallel arbitration : Fig.
      [Fig.: bus arbiters 1 .. 4 raise Req lines into a 4 x 2 priority encoder; a 2 x 4 decoder returns the Ack line of the winning arbiter, gated by the bus busy line]
    » Dynamic : priorities are flexible
      Time slice (fixed-length time)
      Polling
      LRU (least recently used)
      FIFO (first-in, first-out)
      Rotating daisy-chain
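
The parallel scheme reduces to a priority encoder plus a decoder; a sketch of that logic in C (the request/grant bit conventions are illustrative):

    #include <stdio.h>

    /* Parallel arbitration: a 4 x 2 priority encoder picks the
       highest-priority requester, and a 2 x 4 decoder raises its Ack
       line. Bit 0 = arbiter 1 = highest priority (a chosen convention). */
    static unsigned arbitrate(unsigned req, int bus_busy) {
        if (bus_busy || req == 0)
            return 0;                   /* no grant while the bus is busy */
        for (int i = 0; i < 4; i++)     /* priority encoder */
            if (req & (1u << i))
                return 1u << i;         /* decoder output: one Ack line */
        return 0;
    }

    int main(void) {
        unsigned req = 0x6;             /* arbiters 2 and 3 request */
        unsigned ack = arbitrate(req, 0);
        printf("req = 0x%X, ack = 0x%X\n", req, ack);  /* arbiter 2 wins */
        return 0;
    }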

