
Ch 3: Multithreading

Prepared by: Priyanka More

This chapter discusses techniques for achieving parallelism at several levels: application, thread, data, instruction, and hardware. It covers data parallelism, task parallelism, Flynn's classification, latency hiding techniques (prefetching, coherent caches, and multithreading), cache coherence protocols for shared-memory systems, and how directory-based protocols improve on snoopy protocols.


Application Level: Data vs Task Parallelism



Thread Level Parallelism



Data Level Parallelism



Computer Hardware Level Parallelism



Flynn’s Classification



Instruction stream and Data stream

• During the cycle of instruction execution, a flow of instructions is established from main memory to the central processing unit. This flow of instructions is known as the instruction stream.
• There is a bidirectional flow of operands between the processor and memory. This flow of operands is known as the data stream.



Limitations of Memory System Performance

• The memory system, not processor speed, is often the bottleneck for many applications.
• Memory system performance is largely captured by two parameters: latency and bandwidth.
• Latency is the time from the issue of a memory request to the time the data is available at the processor.
• Bandwidth is the rate at which data can be pumped to the processor by the memory system.



Memory System Performance: Bandwidth and Latency

• It is very important to understand the difference between latency and bandwidth.
• Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds.
• Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second.
• If you want an immediate response from the hydrant, it is important to reduce latency.
• If you want to fight big fires, you want high bandwidth.
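
To make the distinction concrete, here is a quick worked example (the numbers are illustrative assumptions, not taken from these slides):

    processor clock    = 1 GHz  (1 ns per cycle)
    memory latency     = 100 ns per request
    words per request  = 1

    If every operation must wait for one word from memory, the processor
    completes at most 1 word / 100 ns = 10 million operations per second,
    regardless of its clock rate. Doing better requires either raising
    bandwidth (more words per request) or hiding the latency.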



Latency Hiding Techniques
• Parallel and scalable systems typically use distributed shared memory. Accessing remote memory significantly increases memory latency.
• Processor speed has been increasing at a much faster rate than memory speed.
• Three latency hiding mechanisms are used to increase scalability and programmability:
1. Prefetching techniques
2. Coherent caches
3. Multiple-context processors
• Prefetching brings instructions or data close to the processor before they are actually needed (a small sketch follows below).
• Coherent caches, supported by hardware, reduce cache misses.
• Multiple-context processors allow a processor to switch from one context to another when it would otherwise stall.
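
As an illustration of the first mechanism, here is a minimal software-prefetching sketch in C. It assumes a GCC or Clang toolchain (__builtin_prefetch is a compiler intrinsic, not standard C), and the prefetch distance of 8 elements is an arbitrary illustrative choice:

    /* Sum an array, prefetching ahead so data is already in cache
       by the time the loop reaches it. */
    #include <stddef.h>

    double sum(const double *a, size_t n) {
        double s = 0.0;
        const size_t DIST = 8;               /* illustrative prefetch distance */
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                /* hints: read-only access (0), moderate temporal locality (1) */
                __builtin_prefetch(&a[i + DIST], 0, 1);
            s += a[i];
        }
        return s;
    }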



2. Cache Coherence in Multiprocessor Systems

• Interconnects provide basic mechanisms for data transfer.
• In the case of shared-address-space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network.
• The underlying technique must provide some guarantees on the semantics.
• This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule.


Cache Coherence Protocols

1. Snoopy Bus Protocol
2. Directory-based Protocol



Performance of Snoopy Caches

• Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic.
• If a data item is read by a number of processors, it transitions to the shared state in the cache, and all subsequent read operations become local.
• If processors read and update data at the same time, they generate coherence requests on the bus, which is ultimately bandwidth-limited.
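
The dirty/shared transitions described above can be made concrete with a minimal MSI-style state machine in C. This is an illustrative sketch of one cache line's controller, not a complete protocol:

    /* Minimal MSI-style state machine for a single cache line (illustrative). */
    typedef enum { INVALID, SHARED, MODIFIED } line_state;

    /* Actions by the local processor. */
    line_state on_proc_read(line_state s) {
        if (s == INVALID) return SHARED;   /* read miss: fetch; others may share */
        return s;                          /* SHARED or MODIFIED: local hit */
    }

    line_state on_proc_write(line_state s) {
        (void)s;            /* every starting state ends up MODIFIED; an
                               invalidate goes on the bus so all other
                               copies of the line drop to INVALID */
        return MODIFIED;
    }

    /* Reactions to snooped bus traffic from other processors. */
    line_state on_bus_read(line_state s) {
        if (s == MODIFIED) return SHARED;  /* supply the dirty data, demote */
        return s;
    }

    line_state on_bus_write(line_state s) {
        (void)s;
        return INVALID;                    /* another writer: drop our copy */
    }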



Directory Based Systems

• In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation.
• Why not send coherence requests to only those processors that need to be notified?
• This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
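
A minimal sketch of such a directory entry in C (illustrative; the field layout and the 64-processor limit are assumptions):

    /* Illustrative full-map directory entry for one cache line, assuming
       at most 64 processors (one presence bit per processor). */
    #include <stdint.h>

    typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } dir_state;

    typedef struct {
        dir_state state;     /* global state of the line */
        uint64_t  presence;  /* bit i set => processor i holds a copy */
    } dir_entry;

    /* On a write by processor p, invalidations go only to processors whose
       presence bits are set, instead of being broadcast to everyone. */
    uint64_t sharers_to_invalidate(const dir_entry *e, int p) {
        return e->presence & ~(1ULL << p);
    }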



Directory Based Systems

Architecture of typical directory-based systems: (a) a centralized directory; (b) a distributed directory.


Performance of Directory Based Schemes

• The need for a broadcast medium is replaced by the directory.
• The additional bits to store the directory may add significant overhead.
• The underlying network must be able to carry all the coherence requests.
• The directory is a point of contention; therefore, distributed directory schemes must be used.
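
To see why the directory bits can be significant, consider an illustrative full-map configuration (the numbers are assumptions, not from the slides):

    cache line size        = 64 bytes = 512 bits
    processors             = 256, one presence bit each
    directory bits / line  = 256 (plus a few state bits)
    storage overhead       ≈ 256 / 512 = 50% extra per line

Overheads on this scale are why limited-pointer and sparse directory organizations are often preferred at large processor counts.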



3. Multiple context processors

• Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of three possible ways:
– we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
– we open multiple browsers and access different pages in each browser, so that while we are waiting for one page to load, we can be reading others; or
– we access a whole batch of pages in one go, amortizing the latency across the various accesses.
• The first approach is called prefetching, the second multithreading, and the third corresponds to spatial locality in accessing memory words.


3. Multithreading for Latency Hiding

A thread is a single stream of control in the flow of a program. We illustrate threads with a simple example:

    for (i = 0; i < n; i++)
        c[i] = dot_product(get_row(a, i), b);

Each dot product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

    for (i = 0; i < n; i++)
        c[i] = create_thread(dot_product, get_row(a, i), b);
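
create_thread above is pseudocode. A possible rendering of the same idea with POSIX threads might look as follows; the matrix layout, the sizes N and M, and the inline dot product are illustrative assumptions, not part of the original example:

    /* Illustrative POSIX threads version; compile with: cc -pthread file.c */
    #include <pthread.h>

    #define N 4              /* number of rows / threads (assumed size) */
    #define M 8              /* vector length (assumed size)            */

    double a[N][M], b[M], c[N];

    typedef struct { int i; } task;   /* which row this thread handles */

    /* Each worker computes one dot product: c[i] = a[i] . b */
    static void *dot_product(void *arg) {
        int i = ((task *)arg)->i;
        double s = 0.0;
        for (int j = 0; j < M; j++)
            s += a[i][j] * b[j];
        c[i] = s;
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        task t[N];
        for (int i = 0; i < N; i++) {   /* create one thread per row */
            t[i].i = i;
            pthread_create(&tid[i], NULL, dot_product, &t[i]);
        }
        for (int i = 0; i < N; i++)     /* wait for all dot products */
            pthread_join(tid[i], NULL);
        return 0;
    }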



Multithreading for Latency Hiding: Example

• In the code, the first instance of the function accesses a pair of vector elements and waits for them.
• In the meantime, the second instance of the function can access two other vector elements in the next cycle, and so on.
• After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation.
• In the next cycle, the data items for the next function instance arrive, and so on. In this way, we can perform a computation in every clock cycle.
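
A rough calculation (with illustrative numbers) shows how many ready threads this scheme needs:

    memory latency l  = 100 cycles (assumed)
    work per thread   = issue 1 request, then wait l cycles
    threads needed    ≈ l = 100

    With about 100 ready threads, by the time the first thread's data
    arrives, the processor has stayed busy switching through the others,
    so a computation can complete in every cycle.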



Communication Model of Parallel Platforms

• There are two primary forms of data exchange between parallel tasks: accessing a shared data space and exchanging messages.
• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
• Platforms that support messaging are called message-passing platforms or multicomputers.
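
For the message-passing style, a minimal sketch using MPI (assuming an MPI installation; the value sent and the ranks are arbitrary) contrasts with the implicit reads and writes of a shared address space:

    /* Process 0 sends one integer to process 1; run with: mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                       /* explicit receive */
            printf("rank 1 received %d\n", x);
        }
        MPI_Finalize();
        return 0;
    }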



Shared-Address-Space Platforms

• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared address space.
• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.



NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.



NUMA and UMA Shared-Address-Space Platforms

• The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from the underlying algorithms for performance.
• Programming these platforms is easier, since reads and writes are implicitly visible to other processors.
• However, read/write access to shared data must be coordinated (this is discussed in greater detail in the context of threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
• A weaker model of these machines provides an address map but not coordinated access. Such machines are called non-cache-coherent shared-address-space machines.
