410A Week 4
False sharing
Core 0 has a variable in cache
Core 1 has a different variable in cache
Both variables belong to same cache line
Only core 0 updates its variable
The coherence protocol invalidates the line in core 1's cache even
though core 1's variable did not change
This is called false sharing
Cache re-visited
False sharing
This happens with arrays as well
Two cores accessing different elements, but in the same line
Solution:
Keep simultaneously used variables far apart in memory
Doing so pushes them into different cache lines (see the sketch below)
It's a difficult trade-off
Large cache lines are good for locality but cause false sharing
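As a concrete illustration, here is a minimal pthreads sketch of the padding fix. It assumes 64-byte cache lines, and the struct name padded_counter and the iteration count are made up for this example. Each counter is aligned to its own line, so the two threads stop invalidating each other's cached copies:

/* False-sharing sketch: two threads update different counters.
   Without the padding, both counters sit in one cache line and the
   line ping-pongs between the cores; alignas(64) pushes them onto
   separate lines (a 64-byte line size is an assumption here). */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

struct padded_counter {
    alignas(64) long value;   /* each counter gets its own cache line */
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < 100000000; i++)
        counters[id].value++;            /* no sharing, true or false */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}

If the alignas(64) is removed, both counters typically land in the same cache line and the same program tends to run noticeably slower because of false sharing.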
Cache re-visited
A few words about spinning
On SMP without caches, spinning is a bad idea
On NUMA without caches, spinning may be acceptable if the
memory is local to the core
On SMP and NUMA with caches, spinning consumes far fewer
resources
Once a value is loaded into the cache, spinning becomes local (sketched below)
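As an illustration, here is a minimal test-and-test-and-set spinlock using C11 atomics; the names spinlock_t, spin_lock, and spin_unlock are invented for this sketch. The inner loop only reads the flag, so once the line is in the local cache the spinning generates no further coherence traffic until the lock holder releases it:

/* Test-and-test-and-set spinlock sketch using C11 atomics. */
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* spin locally: plain atomic load, no cache-line invalidations */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* lock looks free: try to grab it with one atomic write */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}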
Performance
Objective of writing parallel programs: higher performance
Assumption: all cores have the same architecture (non-GPU cores)
Theoretical best: If run on p cores, program runs p times faster
Only if work can be equally divided with no overhead
If serial run time is Tserial, then best possible Tparallel is Tserial/p
Tparallel = Tserial/p is called linear speedup
Not possible in practice. Why?
Performance
Some reasons why linear speedup is not obtained
Shared memory programs
Critical sections (only one thread or process in CS)
Mutex function overhead (to provide exclusive access to CS)
Distributed memory programs
Data transmission across network (messaging among nodes)
More threads mean a longer wait to enter the CS (see the sketch below)
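A minimal pthreads sketch of this overhead, using a hypothetical add_partial_result helper from a parallel sum: every update funnels through one mutex, so only one thread at a time makes progress in the critical section, and each call also pays the cost of the lock and unlock operations.

/* Sketch of critical-section overhead: all threads funnel their
   updates through one mutex, so only one thread is ever in the
   critical section and extra threads mostly add lock contention. */
#include <pthread.h>

static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
static double global_sum = 0.0;

void add_partial_result(double partial)
{
    pthread_mutex_lock(&sum_lock);   /* mutex overhead on every call */
    global_sum += partial;           /* critical section: one thread at a time */
    pthread_mutex_unlock(&sum_lock);
}

A common way to shrink this overhead is for each thread to accumulate a private partial sum and take the lock only once at the end.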
Performance
Speedup S = Tserial/Tparallel; efficiency E = S/p = Tserial/(p Tparallel)
If Tserial and Tparallel are measured on the same core type, then
E is the fraction of time the cores spend solving the problem
Example: Tserial = 24ms, Tparallel = 4ms
p = 8. With this, S = 24/4 = 6 and E = 24/(8 x 4) = 3/4
This means each core spends 75% of its time
on the problem and 25% on overhead (see the sketch below)
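A tiny C sketch that simply re-does this arithmetic using the definitions above; the variable names are mine, and the values are the example from this slide:

/* Speedup and efficiency as defined above:
   S = Tserial / Tparallel,  E = S / p = Tserial / (p * Tparallel). */
#include <stdio.h>

int main(void)
{
    double t_serial = 24.0, t_parallel = 4.0;   /* ms, example values */
    int p = 8;

    double S = t_serial / t_parallel;           /* 6.0  */
    double E = S / p;                           /* 0.75 */

    printf("S = %.2f, E = %.2f\n", S, E);
    return 0;
}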
Performance
Speedup and efficiency plots as functions of N
Performance
This is the expected behavior
With p fixed, increasing N increases the overhead, but
the overhead grows more slowly than the time spent on the problem
Hence E increases, as seen in the graph and table
Just a reminder
We will measure Tserial and Tparallel on same core architecture
Some researchers measure them differently
Performance
Amdahl’s Law
Let’s think about Tparallel in another way
We start with the serial algo and parallelize it
Assume we parallelize 90% of the serial algo, and do so “perfectly”
If we run the parallelized algo on a single core (p = 1), then
Tparallel(p = 1) = 0.9 Tserial + 0.1 Tserial = Tserial
If we run it on p > 1 cores, Tparallel(p) = (0.9 Tserial)/p + 0.1 Tserial
Performance
Amdahl’s Law
Let’s calculate Tparallel for two cores
Tparallel(p = 2) = 0.9 Tserial/2 + 0.1 Tserial = 0.55 Tserial
Speedup S = Tserial/Tparallel = 1/0.55 ≈ 1.8
With two cores, speedup will always be less than 2
If we have 10 cores, S = 1/((0.9/10) + 0.1) ≈ 5.26
With 10 cores, speedup will always be less than 6
Performance
Amdahl’s Law
Suppose a fraction r of an algo cannot be parallelized at all
Then Tparallel = (1 – r) Tserial / p + r Tserial
Speedup S = Tserial / ((1 – r) Tserial/p + r Tserial) = 1 / ((1 – r)/p + r)
This is called Amdahl’s Law
It provides an upper bound on speedup: no matter how large p gets, S can never exceed 1/r (see the sketch below)
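A small C sketch that evaluates this formula for r = 0.1 (the 90%-parallelized example above) at a few core counts; the printed values approach but never reach the 1/r = 10 ceiling:

/* Amdahl's Law sketch: speedup S(p) = 1 / ((1 - r)/p + r) for a
   serial fraction r.  As p grows, S approaches the 1/r ceiling. */
#include <stdio.h>

int main(void)
{
    double r = 0.1;                        /* 10% cannot be parallelized */
    int cores[] = {1, 2, 10, 100, 1000};

    for (int i = 0; i < 5; i++) {
        int p = cores[i];
        double S = 1.0 / ((1.0 - r) / p + r);
        printf("p = %4d  S = %6.2f\n", p, S);
    }
    printf("upper bound 1/r = %.1f\n", 1.0 / r);
    return 0;
}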
Performance
Amdahl’s Law
This law doesn’t account for problem size N
Often, as N increases, r becomes smaller and the 1/r bound increases
Gustafson's law formalizes this effect of scaling the problem size
Amdahl's Law is a model of program behavior, not a law of physics