FLYNN’S CLASSIFICATION
In 1966, Michael Flynn proposed a classification for computer architectures based on the
number of instruction streams and data streams (Flynn's taxonomy).
Flynn uses the stream concept to describe a machine's structure.
A stream simply means a sequence of items (data or instructions).
Flynn’s taxonomy:
The classification of computer architectures based on the number of instruction streams and data
streams.
SISD:
o A single instruction stream operates on a single data stream: the conventional uniprocessor.
SIMD:
o A single instruction stream is applied to many data streams in parallel, as in vector and array processors.
MISD:
o Multiple instruction streams operate on a single data stream; few practical machines of this type have been built.
MIMD:
o Each processor fetches its own instruction stream and operates on its own data stream: the conventional multiprocessor.
HARDWARE MULTITHREADING
A related concept to MIMD, especially from the programmer’s perspective, is hardware
multithreading.
While MIMD relies on multiple processes or threads to try to keep multiple processors
busy, hardware multithreading allows multiple threads to share the functional units of a
single processor in an overlapping fashion to try to utilize the hardware resources
efficiently.
A thread is a separate process with its own instructions and data. It may represent a
process that is part of a parallel program consisting of multiple processes, or it may
be an independent program on its own. In addition, the hardware must support the
ability to switch to a different thread relatively quickly.
A thread switch should be much more efficient than a process switch: a process switch
typically requires hundreds to thousands of processor cycles, whereas a thread switch can
be essentially instantaneous.
Two approaches to hardware multithreading:
o Fine-grained multithreading
o Coarse-grained multithreading
CSE/AJS/CS6303/UNIT-IV Page 5
CS6303 COMPUTER ARCHITECTURE
Fine-grained multithreading:
o It switches between threads on each instruction, resulting in interleaved execution
of multiple threads. This interleaving is often done in a round-robin fashion,
skipping any threads that are stalled at that clock cycle.
o To make fine-grained multithreading practical, the processor must be able to
switch threads on every clock cycle.
Coarse-grained multithreading:
o Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading. It switches threads only on costly stalls, such as last-level
cache misses.
o This change relieves the need to have thread switching be extremely fast and is
much less likely to slow down the execution of an individual thread, since
instructions from other threads will only be issued when a thread encounters a
costly stall.
o Drawback: it is limited in its ability to overcome throughput losses from shorter
stalls. Since a switch empties or freezes the pipeline, the new thread must refill
the pipeline before instructions can complete; this start-up cost makes
coarse-grained multithreading useful mainly for high-cost stalls, where the refill
time is negligible compared to the stall time.
Simultaneous multithreading:
o Simultaneous multithreading (SMT) is a variation on hardware multithreading that
uses the resources of a multiple-issue, dynamically scheduled processor to exploit
thread-level parallelism at the same time it exploits instruction-level parallelism.
o The key insight that motivates SMT is that multiple-issue processors often have
more functional-unit parallelism available than most single threads can effectively
use.
In the figure comparing these approaches, the horizontal dimension represents the
instruction issue capability in each clock cycle, the vertical dimension represents a
sequence of clock cycles, and empty slots indicate that the corresponding issue slots are
unused in that clock cycle.
In the superscalar without hardware multithreading support, the use of issue slots is
limited by a lack of instruction-level parallelism. In addition, a major stall, such as an
instruction cache miss, can leave the entire processor idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor.
Although this reduces the number of completely idle clock cycles, the pipeline start-up
overhead still leads to idle cycles, and limitations in ILP mean that not all issue slots
will be used.
In the fine-grained case, the interleaving of threads mostly eliminates idle clock cycles.
Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock cycles.
In the SMT case, thread-level parallelism and instruction-level parallelism are both
exploited, with multiple threads using the issue slots in a single clock cycle.
Ideally, issue slot usage is limited only by imbalances in the resource needs and resource
availability across the multiple threads.
MULTICORE PROCESSOR
While hardware multithreading improved the efficiency of processors at modest cost, the
big challenge of the last decade has been to deliver on the performance potential of Moore’s
Law by efficiently programming the increasing number of processors per chip.
To simplify the task of rewriting old programs to run well on parallel hardware, the
solution was to provide a single physical address space that all processors can share, so that
programs need not concern themselves with where their data is, merely that programs may be
executed in parallel.
In this approach, all variables of a program can be made available at any time to any
processor. The alternative is to have a separate address space per processor, in which
case sharing must be explicit.
A shared memory multiprocessor (SMP) is one that offers the programmer a single
physical address space across all processors, which is nearly always the case for multicore
chips. It is also called a shared-address multiprocessor.
Single-address-space multiprocessors come in two styles: uniform memory access (UMA)
multiprocessors, in which the latency to a word in memory does not depend on which
processor asks for it, and nonuniform memory access (NUMA) multiprocessors, in which
some memory accesses are much faster than others depending on which processor asks for
which word. The programming challenges are harder for a NUMA multiprocessor than for a
UMA multiprocessor, but NUMA machines can scale to larger sizes and can have lower
latency to nearby memory.
As processors operating in parallel will normally share data, they also need to coordinate
when operating on shared data; otherwise, one processor could start working on data
before another is finished with it. This coordination is called synchronization.
When sharing is supported with a single address space, there must be a separate
mechanism for synchronization.
One approach uses a lock for a shared variable. Only one processor at a time can acquire
the lock, and other processors interested in shared data must wait until the original
processor unlocks the variable.
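As a minimal sketch of this idea (not from the original notes), a lock can be built from an
atomic test-and-set flag in C11; the names lock_acquire and lock_release are illustrative:

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear means unlocked */

    void lock_acquire(void) {
        /* Atomically set the flag and test its previous value: only the
           processor that saw it clear leaves the loop holding the lock. */
        while (atomic_flag_test_and_set(&lock))
            ;  /* other processors spin here until the variable is unlocked */
    }

    void lock_release(void) {
        atomic_flag_clear(&lock);           /* let one waiting processor acquire it */
    }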
Example
A Simple Parallel Processing Program for a Shared Address Space
The first step is to ensure a balanced load per processor, so we split the set of
numbers into subsets of the same size. We do not allocate the subsets to a
different memory space, since there is a single memory space for this machine;
we just give different starting addresses to each processor.
Pn is the number that identifies the processor, between 0 and 63.
All processors start the program by running a loop that sums their subset of
numbers:
sum[Pn] = 0;                            /* each processor clears its own partial sum */
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i];                    /* sum the assigned area of A */
The next step is to add these 64 partial sums. This step is called a reduction,
where we divide to conquer.
Half of the processors add pairs of partial sums, and then a quarter add pairs of the
new partial sums, and so on until we have the single, final sum.
The figure illustrates the hierarchical nature of this reduction.
Reduction is a function that processes a data structure and returns a single value.
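Following this pattern, a sketch of the reduction loop each processor could run (synch()
stands for an assumed barrier primitive that makes every processor wait until all partial
sums of the previous round have been written):

    half = 64;                  /* 64 processors in the multiprocessor */
    do {
        synch();                /* wait for this round's partial sums to complete */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half-1];  /* when half is odd, processor 0 picks up
                                       the element that would otherwise be lost */
        half = half / 2;        /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);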
o A many-core processor is one in which the number of cores is large enough that
traditional multiprocessor techniques are no longer efficient, largely because of
congestion in supplying sufficient instructions and data to the many processors.
o The key architectural property is the uniform access time to all of the memory
from all of the processors.
o In a multichip version the shared cache would be omitted and the bus or
interconnection network connecting the processors to memory would run between
chips as opposed to within a single chip.
o Each processor shares the entire memory, although the access time to the local
memory attached to the core's chip will be much faster than the access time to
remote memories.
o Distributing the memory among the nodes has two major benefits:
It is a cost-effective way to scale the memory bandwidth if most of the
accesses are to the local memory in the node.
It reduces the latency for accesses to the local memory.