
CS6303 COMPUTER ARCHITECTURE

FLYNN’S CLASSIFICATION
 In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn's Taxonomy).
 Flynn uses the stream concept for describing a machine's structure.
 A stream simply means a sequence of items (data or instructions).
Flynn's taxonomy:
The classification of computer architectures based on the number of instruction streams and data streams.

SISD:

 SISD (Single-Instruction stream, Single-Data stream)


 SISD corresponds to the traditional uniprocessor: a single data stream is processed by one instruction stream.
 A uniprocessor in which a single stream of instructions is generated from the program.

(Figure: CU = Control Unit, PE = Processing Element, M = Memory)


SIMD:

 SIMD (Single-Instruction stream, Multiple-Data streams)


 Each instruction is executed on a different set of data by different processors, i.e., multiple processing units of the same type operate on multiple data streams.

 This group is dedicated to array processing machines.


 Sometimes, vector processors can also be seen as a part of this group.

(Figure: CU = Control Unit, PE = Processing Element, M = Memory)


 SIMD computers operate on vectors of data. For example, a single SIMD instruction might add 64 numbers by sending 64 data streams to 64 ALUs to form 64 sums within a single clock cycle. Subword parallel instructions are another example of SIMD.
 Advantages:
o It amortizes the cost of the control unit over dozens of execution units.
o Another advantage is the reduced instruction bandwidth and space: SIMD needs only one copy of the code that is being simultaneously executed, while message-passing MIMDs may need a copy in every processor, and shared memory MIMDs will need multiple instruction caches.
 SIMD works best when dealing with arrays in for loops. Hence, for parallelism to work in SIMD, there must be structured data, which is called data-level parallelism (a sketch of such a loop follows).
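
For illustration, here is a minimal C sketch of such a loop; the array names and the 64-element size are hypothetical, chosen to match the 64-ALU example above. Every iteration applies the same operation to independent data, which is exactly the structure a SIMD machine exploits:

#include <stdio.h>

#define N 64 /* hypothetical vector length */

int main(void) {
    double a[N], b[N], c[N];

    /* initialize the inputs (values are arbitrary) */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Each iteration is independent of the others: data-level
       parallelism. A SIMD machine could compute all 64 sums in
       a single step by sending the element pairs to 64 ALUs. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %g, c[63] = %g\n", c[0], c[N - 1]);
    return 0;
}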
 Vector
 An older, more elegant interpretation of SIMD is called a vector architecture.
 The basic philosophy of vector architecture is to collect data elements from memory, put them in order into a large set of registers, operate on them sequentially in registers using pipelined execution units, and then write the results back to memory.


 A key feature of vector architectures is a set of vector registers. Thus, a vector architecture might have 32 vector registers, each with 64 64-bit elements.
 Vector versus Scalar
Vector instructions have several important properties compared to conventional instruction set architectures, which are called scalar architectures:
 A single vector instruction specifies a great deal of work; it is equivalent to executing an entire loop. The instruction fetch and decode bandwidth needed is dramatically reduced.
 By using a vector instruction, the compiler or programmer indicates that the computation
of each result in the vector is independent of the computation of other results in the same
vector, so hardware does not have to check for data hazards within a vector instruction.
 Vector architectures and compilers have a reputation for making it much easier than MIMD multiprocessors to write efficient applications that contain data-level parallelism.
 Hardware need only check for data hazards between two vector instructions once per
vector operand, not once for every element within the vectors. Reduced checking can
save energy as well as time.
 Vector instructions that access memory have a known access pattern. If the vector’s
elements are all adjacent, then fetching the vector from a set of heavily interleaved
memory banks works very well. Thus, the cost of the latency to main memory is seen
only once for the entire vector, rather than once for each word of the vector.
 Because an entire loop is replaced by a vector instruction whose behavior is
predetermined, control hazards that would normally arise from the loop branch are
nonexistent.
 The savings in instruction bandwidth and hazard checking plus the efficient use of
memory bandwidth give vector architectures advantages in power and energy versus
scalar architectures.
For these reasons, vector operations can be made faster than a sequence of scalar operations on
the same number of data items.
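
As a rough illustration of these savings, consider a DAXPY-style loop (y = a*x + y), a classic example in vector-architecture discussions; the function name and 64-element length here are illustrative assumptions. On a scalar machine this costs roughly 64 iterations of loads, a multiply, an add, a store, plus loop overhead; a vector machine could instead express the whole loop as a handful of vector instructions (vector loads, a vector multiply, a vector add, a vector store), with hazard checking done once per vector operand and no loop branch:

#include <stdio.h>

#define N 64 /* matches one 64-element vector register */

/* DAXPY: y = a*x + y. Written here as a scalar loop; a vector
   ISA or vectorizing compiler would replace the loop with a
   few vector instructions. */
void daxpy(double a, const double x[N], double y[N]) {
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = i;
    }
    daxpy(2.0, x, y);
    printf("y[0] = %g, y[63] = %g\n", y[0], y[N - 1]);
    return 0;
}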


MISD:

 MISD (Multiple-Instruction streams, Single-Data stream)


 Each processor executes a different sequence of instructions.
 In case of MISD computers, multiple processing units operate on one single-data
stream.
 In practice, this kind of organization has never been used.

MIMD:

 MIMD (Multiple-Instruction streams, Multiple-Data streams)


 Each processor has a separate program.
 An instruction stream is generated from each program.
 Each instruction operates on different data.
 This last machine type forms the group of traditional multiprocessors: several processing units operate on multiple data streams.


Computer Architecture Classifications:

Processor Organizations
 Single Instruction, Single Data Stream (SISD)
o Uniprocessor
 Single Instruction, Multiple Data Stream (SIMD)
o Vector Processor
o Array Processor
 Multiple Instruction, Single Data Stream (MISD)
o (no practical machines)
 Multiple Instruction, Multiple Data Stream (MIMD)
o Shared Memory (tightly coupled)
o Multicomputer (loosely coupled)

HARDWARE MULTITHREADING
 A related concept to MIMD, especially from the programmer’s perspective, is hardware
multithreading.

 While MIMD relies on multiple processes or threads to try to keep multiple processors
busy, hardware multithreading allows multiple threads to share the functional units of a
single processor in an overlapping fashion to try to utilize the hardware resources
efficiently.

 Multithreading is a higher-level form of parallelism called thread-level parallelism, because it is logically structured as separate threads of execution.

 A thread is a separate process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program on its own. In addition, the hardware must support the ability to change to a different thread relatively quickly.
 A thread switch should be much more efficient than a process switch: a process switch typically requires hundreds to thousands of processor cycles, while a thread switch can be instantaneous.
 Two Approaches to multithreading:
o Fine-grained multithreading
o Coarse-grained multithreading


 Fine-grained multithreading:
o It switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle.
o To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle.

o One advantage of fine-grained multithreading is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls.

o The primary disadvantage of fine-grained multithreading is that it slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads (a toy model of the round-robin policy follows).
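
The round-robin issue policy can be pictured with a toy C model; this is entirely hypothetical scaffolding (real hardware makes this choice in the issue logic each clock cycle, not in software), with a fixed stall pattern chosen only for illustration:

#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4

/* Hypothetical stall state: thread 1 is waiting on a miss. */
bool stalled[NTHREADS] = { false, true, false, false };

/* Pick the next ready thread after 'last' in round-robin order,
   skipping any stalled thread; -1 means every thread is stalled. */
int next_thread(int last) {
    for (int step = 1; step <= NTHREADS; step++) {
        int t = (last + step) % NTHREADS;
        if (!stalled[t])
            return t;
    }
    return -1;
}

int main(void) {
    int t = NTHREADS - 1;
    for (int cycle = 0; cycle < 8; cycle++) {
        t = next_thread(t);
        if (t >= 0)
            printf("cycle %d: issue from thread %d\n", cycle, t);
        else
            printf("cycle %d: idle (all threads stalled)\n", cycle);
    }
    return 0;
}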

 Coarse-grained multithreading:
o Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly stalls, such as last-level cache misses.

o This change relieves the need to have thread switching be extremely fast and is
much less likely to slow down the execution of an individual thread, since
instructions from other threads will only be issued when a thread encounters a
costly stall.

o Drawback:
 It is limited in its ability to overcome throughput losses, especially from shorter stalls, because the pipeline must be refilled after each thread switch; this start-up overhead makes it worthwhile mainly for long stalls.

 Simultaneous multithreading:

o Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism.

o The key insight that motivates SMT is that multiple-issue processors often have
more functional unit parallelism available than most single threads can effectively
use.

o Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.


o The following figure conceptually illustrates the differences in a processor's ability to exploit superscalar resources for the following processor configurations:

 A superscalar with no multithreading support
 A superscalar with coarse-grained multithreading
 A superscalar with fine-grained multithreading
 A superscalar with simultaneous multithreading

 The horizontal dimension represents the instruction issue capability in each clock cycle.
 The vertical dimension represents a sequence of clock cycles.
 Empty slots indicate that the corresponding issue slots are unused in that clock cycle.

 In the superscalar without hardware multithreading support, the use of issue slots is
limited by a lack of instruction-level parallelism. In addition, a major stall, such as an
instruction cache miss, can leave the entire processor idle.

 In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor.

 Although this reduces the number of completely idle clock cycles, the pipeline start-up overhead still leads to idle cycles, and limitations in ILP mean that not all issue slots will be used.


 In the fine-grained case, the interleaving of threads mostly eliminates idle clock cycles.
Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock cycles.

 In the SMT case, thread-level parallelism and instruction-level parallelism are both
exploited, with multiple threads using the issue slots in a single clock cycle.

 Ideally, the issue slot usage is limited by imbalances in the resource needs and resource
availability over multiple threads.

MULTICORE PROCESSOR
 While hardware multithreading improved the efficiency of processors at modest cost, the
big challenge of the last decade has been to deliver on the performance potential of Moore’s
Law by efficiently programming the increasing number of processors per chip.

 To simplify the task of rewriting old programs to run well on parallel hardware, the
solution was to provide a single physical address space that all processors can share, so that
programs need not concern themselves with where their data is, merely that programs may be
executed in parallel.

 In this approach, all variables of a program can be made available at any time to any
processor. The alternative is to have a separate address space per processor that requires that
sharing must be explicit.

 A shared memory multiprocessor (SMP) is one that offers the programmer a single physical address space across all processors, which is nearly always the case for multicore chips. It is also called a shared-address multiprocessor.

 Single address space multiprocessors come in two styles.


 In the first style, the latency to a word in memory does not depend on which
processor asks for it. Such machines are called uniform memory access (UMA)
multiprocessors.

 In the second style, some memory accesses are much faster than others, depending on
which processor asks for which word. Such machines are called nonuniform memory
access (NUMA) multiprocessors.

 The programming challenges are harder for a NUMA multiprocessor than for a UMA
multiprocessor, but NUMA machines can scale to larger sizes and NUMAs can have
lower latency to nearby memory.


 As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it. This coordination is called synchronization.

 When sharing is supported with a single address space, there must be a separate
mechanism for synchronization.

 One approach uses a lock for a shared variable. Only one processor at a time can acquire the lock, and other processors interested in shared data must wait until the original processor unlocks the variable. A minimal sketch of this discipline appears below.
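
The following is a minimal sketch of lock-based coordination using POSIX threads; the shared counter, thread count, and iteration count are arbitrary choices for illustration, and pthread_mutex_lock/unlock stand in for the processor-level lock described above. Compile with -pthread:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long shared_counter = 0; /* the shared variable the lock protects */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* only one thread at a time */
        shared_counter++;            /* safe: we hold the lock */
        pthread_mutex_unlock(&lock); /* let the next waiter in */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", shared_counter); /* expect 400000 */
    return 0;
}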

 Example
A Simple Parallel Processing Program for a Shared Address Space

 Suppose we want to sum 64,000 numbers on a shared memory multiprocessor computer with uniform memory access time. Let's assume we have 64 processors.

 The first step is to ensure a balanced load per processor, so we split the set of
numbers into subsets of the same size. We do not allocate the subsets to a
different memory space, since there is a single memory space for this machine;
we just give different starting addresses to each processor.
 Pn is the number that identifies the processor, between 0 and 63.
 All processors start the program by running a loop that sums their subset of
numbers:
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i]; /* sum the assigned areas */
 The next step is to add these 64 partial sums. This step is called a reduction, where we divide to conquer.
 Half of the processors add pairs of partial sums, then a quarter add pairs of the new partial sums, and so on until we have the single, final sum.
 The figure illustrates the hierarchical nature of this reduction.
 Reduction is a function that processes a data structure and returns a single value. A sketch of the reduction loop follows.
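
The reduction can be written in the same per-processor pseudocode style as the loop above. Here synch() is assumed to be a barrier primitive that makes every processor wait until all have arrived, and Pn is the processor number as before; the odd-half check lets processor 0 pick up the leftover element when the number of partial sums is odd:

half = 64; /* 64 processors, hence 64 partial sums */
do {
    synch(); /* barrier: wait for all partial sums to be ready */
    if (half % 2 != 0 && Pn == 0)
        sum[0] += sum[half - 1]; /* odd count: processor 0 takes
                                    the straggler element */
    half = half / 2; /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] += sum[Pn + half];
} while (half > 1); /* the final sum ends up in sum[0] */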


 Major MIMD styles:

o Centralized shared memory multiprocessor
o Physically distributed memory multiprocessor

 Centralized shared memory:

o A many-core processor is one in which the number of cores is large enough that traditional multiprocessor techniques are no longer efficient, largely due to congestion in supplying sufficient instructions and data to the many processors.

o The key architectural property is the uniform access time to all of the memory from all the processors.

o In a multichip version the shared cache would be omitted and the bus or
interconnection network connecting the processors to memory would run between
chips as opposed to within a single chip.


 Physically distributed memory multiprocessor

o Each processor shares the entire memory, although the access time to the local memory attached to the core's chip will be much faster than the access time to remote memories.

o Distributing the memory among the nodes has two major benefits:
 It is a cost effective way to scale the memory bandwidth if most of the
accesses are to the local memory in the node.
 It reduces the latency for accesses to the local memory.
