410A Week 4
False sharing
Core 0 has a variable in cache
Core 1 has a different variable in cache
Both variables belong to same cache line
Only core 0 updates its variable
The coherence protocol invalidates the line in core 1's cache even
though core 1's variable did not change
This is called false sharing
Cache re-visited
False sharing
This happens with arrays as well
Two cores accessing different elements, but in the same line
Solution:
Keep simultaneously used variables far apart in memory
Doing so pushes them into different cache lines (see the sketch below)
It's a difficult trade-off
Large cache lines are good for locality but cause false sharing
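As a concrete illustration, here is a minimal pthreads sketch of the padding fix. It assumes 64-byte cache lines, and the struct name padded_counter and the iteration count are made up for this example. Each counter is aligned to its own line, so the two threads stop invalidating each other's cached copies:

/* False-sharing sketch: two threads update different counters.
   Without the padding, both counters sit in one cache line and the
   line ping-pongs between the cores; alignas(64) pushes them onto
   separate lines (a 64-byte line size is an assumption here). */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

struct padded_counter {
    alignas(64) long value;   /* each counter gets its own cache line */
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < 100000000; i++)
        counters[id].value++;            /* no sharing, true or false */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}

If the alignas(64) is removed, both counters typically land in the same cache line and the same program tends to run noticeably slower because of false sharing.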
Cache re-visited
A few words about spinning
On SMP without caches, spinning is a bad idea
On NUMA without caches, spinning may be acceptable if the
memory is local to the core
On SMP and NUMA with caches, spinning consumes far fewer
resources
Once a value is loaded into the cache, spinning becomes local (sketched below)
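As an illustration, here is a minimal test-and-test-and-set spinlock using C11 atomics; the names spinlock_t, spin_lock, and spin_unlock are invented for this sketch. The inner loop only reads the flag, so once the line is in the local cache the spinning generates no further coherence traffic until the lock holder releases it:

/* Test-and-test-and-set spinlock sketch using C11 atomics. */
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* spin locally: plain atomic load, no cache-line invalidations */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* lock looks free: try to grab it with one atomic write */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}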
Performance
Objective of writing parallel programs: higher performance
Assumption: all cores have the same architecture (non-GPU cores)
Theoretical best: If run on p cores, program runs p times faster
Only if work can be equally divided with no overhead
If serial run time is Tserial, then best possible Tparallel is Tserial/p
Tparallel = Tserial/p is called linear speedup
Not possible in practice. Why?
Performance
Some reasons why linear speedup is not obtained
Shared memory programs
Critical sections (only one thread or process in CS)
Mutex function overhead (to provide exclusive access to CS)
Distributed memory programs
Data transmission across network (messaging among nodes)
More threads mean a longer wait to enter the CS (see the sketch below)
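A minimal pthreads sketch of this overhead, using a hypothetical add_partial_result helper from a parallel sum: every update funnels through one mutex, so only one thread at a time makes progress in the critical section, and each call also pays the cost of the lock and unlock operations.

/* Sketch of critical-section overhead: all threads funnel their
   updates through one mutex, so only one thread is ever in the
   critical section and extra threads mostly add lock contention. */
#include <pthread.h>

static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
static double global_sum = 0.0;

void add_partial_result(double partial)
{
    pthread_mutex_lock(&sum_lock);   /* mutex overhead on every call */
    global_sum += partial;           /* critical section: one thread at a time */
    pthread_mutex_unlock(&sum_lock);
}

A common way to shrink this overhead is for each thread to accumulate a private partial sum and take the lock only once at the end.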
Performance
Speedup S = Tserial/Tparallel; efficiency E = S/p = Tserial/(p Tparallel)
If Tserial and Tparallel are measured on the same core type, then
E is the fraction of time the cores spend solving the problem
Example: Tserial = 24ms, Tparallel = 4ms
p = 8. With this, S = 24/4 = 6 and E = 24/(8 x 4) = 3/4
This means each core spends 75% of its time
on the problem and 25% on overhead (see the sketch below)
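A tiny C sketch that simply re-does this arithmetic using the definitions above; the variable names are mine, and the values are the example from this slide:

/* Speedup and efficiency as defined above:
   S = Tserial / Tparallel,  E = S / p = Tserial / (p * Tparallel). */
#include <stdio.h>

int main(void)
{
    double t_serial = 24.0, t_parallel = 4.0;   /* ms, example values */
    int p = 8;

    double S = t_serial / t_parallel;           /* 6.0  */
    double E = S / p;                           /* 0.75 */

    printf("S = %.2f, E = %.2f\n", S, E);
    return 0;
}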
Performance
Speedup and efficiency plots as functions of N
Performance
This is the expected behavior
With p fixed, increasing N increases the overhead, but
the overhead grows more slowly than the time spent on the problem
Hence E increases, as seen in the graph and table
Just a reminder
We will measure Tserial and Tparallel on same core architecture
Some researchers measure them differently
Performance
Amdahl’s Law
Let’s think about Tparallel in another way
We start with the serial algo and parallelize it
Assume we parallelize 90% of the serial algo, and do so “perfectly”
If we run the parallelized algo on a single core (p = 1), then
Tparallel(p = 1) = 0.9 Tserial + 0.1 Tserial = Tserial
If we run it on p > 1 cores, Tparallel(p) = (0.9 Tserial)/p + 0.1 Tserial
Performance
Amdahl’s Law
Let’s calculate Tparallel for two cores
Tparallel(p = 2) = 0.9 Tserial/2 + 0.1 Tserial = 0.55 Tserial
Speedup S = Tserial/Tparallel = 1/0.55 ≈ 1.8
With two cores, speedup will always be less than 2
If we have 10 cores, S = 1/((0.9/10) + 0.1) ≈ 5.26
With 10 cores, speedup will always be less than 6
Performance
Amdahl’s Law
Suppose a fraction r of an algo cannot be parallelized at all
Then Tparallel = (1 – r) Tserial / p + r Tserial
Speedup S = Tserial / ((1 – r) Tserial/p + r Tserial) = 1 / ((1 – r)/p + r)
This is called Amdahl’s Law
It provides an upper bound on speedup: no matter how large p gets, S can never exceed 1/r (see the sketch below)
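A small C sketch that evaluates this formula for r = 0.1 (the 90%-parallelized example above) at a few core counts; the printed values approach but never reach the 1/r = 10 ceiling:

/* Amdahl's Law sketch: speedup S(p) = 1 / ((1 - r)/p + r) for a
   serial fraction r.  As p grows, S approaches the 1/r ceiling. */
#include <stdio.h>

int main(void)
{
    double r = 0.1;                        /* 10% cannot be parallelized */
    int cores[] = {1, 2, 10, 100, 1000};

    for (int i = 0; i < 5; i++) {
        int p = cores[i];
        double S = 1.0 / ((1.0 - r) / p + r);
        printf("p = %4d  S = %6.2f\n", p, S);
    }
    printf("upper bound 1/r = %.1f\n", 1.0 / r);
    return 0;
}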
Performance
Amdahl’s Law
This law doesn’t account for problem size N
Often, as N increases, r becomes smaller and the 1/r bound increases
Gustafson's law formalizes this effect of scaling the problem size
Amdahl's Law is a model of program behavior, not a law of physics