
410A Week 4

The document discusses the concept of false sharing in cache memory, highlighting how simultaneous access to different variables within the same cache line can lead to inefficiencies. It also covers performance metrics in parallel programming, including linear speedup and Amdahl's Law, which illustrates the limitations of parallelization based on the fraction of code that can be parallelized. Additionally, it touches on the impact of overheads and the importance of memory architecture in optimizing performance.


Cache re-visited


False sharing

Core 0 has a variable in cache

Core 1 has a different variable in cache

Both variables belong to the same cache line

Only core 0 updates its variable

The line in core 1's cache gets invalidated even though core 1 did not
change its variable

This is called false sharing
Cache re-visited


False sharing

This happens with arrays as well

Two cores accessing different elements, but in the same line

Solution:

Keep simultaneously used variables far apart in memory

Doing so pushes them into different cache lines

It's a difficult trade-off

Large cache lines are good for locality but cause false sharing
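
As an illustration of the padding solution, here is a minimal sketch in C with OpenMP; the thread count, the 64-byte line size, and all names are assumptions for the example, not values from the slides.

/* false_sharing.c -- sketch assuming C with OpenMP and 64-byte cache lines */
#include <stdio.h>
#include <omp.h>

#define NTHREADS   4
#define LINE_SIZE 64                      /* assumed cache-line size in bytes */

/* Unpadded: all four counters fit in one cache line -> false sharing. */
long counts[NTHREADS];

/* Padded: each counter occupies its own cache line, so updates by
   different cores never invalidate each other's lines. */
struct padded_count { long count; char pad[LINE_SIZE - sizeof(long)]; };
struct padded_count padded[NTHREADS];

int main(void) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000L; i++)
            padded[id].count++;           /* no false sharing with padding */
    }
    printf("padded[0].count = %ld\n", padded[0].count);
    return 0;
}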
Cache re-visited


A few words about spinning

On SMP without cache, spinning is a bad idea

On NUMA without caches, spinning may be acceptable if the
memory is local to the core

On SMP and NUMA with caches, spinning consumes far fewer
resources

Once the value is loaded into the cache, spinning becomes local
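
A hedged sketch of what "local" spinning looks like, using C11 atomics (the flag and function names are illustrative): after the first miss brings the flag into the spinning core's cache, every further read is satisfied locally until the writer's store invalidates that line.

/* spin_wait.c -- sketch using C11 atomics; names are illustrative */
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool ready = false;

/* Waiter: after the first load misses, the flag sits in this core's cache,
   so the loop re-reads a local copy and generates no memory traffic. */
void wait_until_ready(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                         /* spin on the locally cached value */
}

/* Writer: this store invalidates the waiter's cached line exactly once. */
void publish(void) {
    atomic_store_explicit(&ready, true, memory_order_release);
}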
Performance


Objective of writing parallel programs: higher performance

Assumption: all cores have the same architecture (non-GPU cores)

Theoretical best: If run on p cores, program runs p times faster

Only if work can be equally divided with no overhead

If serial run time is Tserial, then best possible Tparallel is Tserial/p

Tparallel = Tserial/p is called linear speedup

Not possible in practice: why?
Performance


Some reasons why linear speedup is not obtained

Shared memory programs

Critical sections (only one thread or process in CS)

Mutex function overhead (to provide exclusive access to CS)

Distributed memory programs

Data transmission across network (messaging among nodes)

More threads means longer delay to access CS
Performance


Efficiency E = S/p, where speedup S = Tserial/Tparallel, so E = (Tserial/Tparallel)/p = Tserial/(p Tparallel)

If Tserial and Tparallel are on same core type then

E is the fraction of time spent by the cores on solving the problem

Example: Tserial = 24 ms, Tparallel = 4 ms, p = 8. Then E = 24/(8 x 4) = ¾

This means each core spends 75% of its time on the problem and 25% on overheads
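
The same arithmetic as a tiny C sketch (the times are the example values above, not measurements):

/* efficiency.c -- reproduces the worked example: Tserial = 24 ms, Tparallel = 4 ms, p = 8 */
#include <stdio.h>

int main(void) {
    double t_serial = 24.0, t_parallel = 4.0;   /* milliseconds */
    int    p = 8;
    double S = t_serial / t_parallel;           /* speedup    = 6.0  */
    double E = S / p;                           /* efficiency = 0.75 */
    printf("S = %.2f, E = %.2f\n", S, E);       /* 75% on the problem, 25% on overheads */
    return 0;
}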
Performance


Speedup and efficiency plots as functions of N
Performance


This is the expected behavior

With p fixed, increasing N increases overhead but

Increase in overhead < increase in time spent on problem

Hence E increases, as seen in the graph and table

Just a reminder

We will measure Tserial and Tparallel on the same core architecture

Some researchers measure them differently (e.g., taking Tserial from the fastest serial algorithm)
Performance


Amdahl’s Law

Let’s think about Tparallel in another way

We start with the serial algo and parallelize it

Assume we parallelize 90% of the serial algo, and do so “perfectly”

If we run the parallelized algo on a single core (p = 1), then

Tparallel(p = 1) = 0.9 Tserial + 0.1 Tserial = Tserial

If we run it on p > 1 cores, Tparallel(p) = (0.9 Tserial)/p + 0.1 Tserial
Performance


Amdahl’s Law

Let’s calculate Tparallel for two cores

Tparallel(p = 2) = 0.9 Tserial/2 + 0.1 Tserial = 0.55 Tserial

Speedup S = Tserial/Tparallel = 1/0.55 ≈ 1.8

With two cores, speedup will always be less than 2

If we have 10 cores, S = 1/((0.9/10) + 0.1) ≈ 5.26

With 10 cores, speedup will always be less than 6
Performance


Amdahl’s Law

Suppose a fraction r of an algo cannot be parallelized at all

Then Tparallel = (1 – r) Tserial / p + r Tserial

Speedup S = Tserial / ((1 – r) Tserial/p + r Tserial) = 1 / ((1 – r)/p + r)

This is called Amdahl’s Law

It provides an upper bound of 1/r on the speedup (the limit as p grows)
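
A small sketch that evaluates this bound in C for the earlier examples (r = 0.1 with p = 2 and p = 10) and for the limit 1/r:

/* amdahl.c -- speedup predicted by Amdahl's Law: S(p) = 1 / ((1 - r)/p + r) */
#include <stdio.h>

double amdahl_speedup(double r, int p) {
    return 1.0 / ((1.0 - r) / p + r);
}

int main(void) {
    double r = 0.1;                                       /* serial fraction */
    printf("p =  2: S = %.2f\n", amdahl_speedup(r, 2));   /* 1.82 */
    printf("p = 10: S = %.2f\n", amdahl_speedup(r, 10));  /* 5.26 */
    printf("bound : 1/r = %.1f\n", 1.0 / r);              /* 10.0 */
    return 0;
}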
Performance


Amdahl’s Law

This law doesn’t account for problem size N

Often as N increases r becomes smaller and 1/r increases

A related formulation that accounts for growing problem size is Gustafson's law

Not like the laws of physics
