Chapter 9: Multicore Systems
1. Parallel Programming
2. Theoretical Foundations
3. Cache Coherence
4. Memory Models
5. Data Races
6. Transactional Memory
[Figure: SPECint 2006 scores plotted against processor release date (1999-2012).]
[Figure: a big problem is decomposed into smaller sub-problems, and each sub-problem is mapped to a core.]

Approach: Shared memory
/* initialise arrays */
...

/* parallel section (OpenMP code) */
#pragma omp parallel
{
    /* get my processor id */
    int myId = omp_get_thread_num();

    /* add my portion of numbers (the arrays numbers[] and partialSums[] are assumed) */
    int startIdx = myId * SIZE/N;
    int endIdx = startIdx + SIZE/N;
    for (int idx = startIdx; idx < endIdx; idx++)
        partialSums[myId] += numbers[idx];
}

/* sequential section */
for (int idx = 0; idx < N; idx++)
    result += partialSums[idx];
[Figure: fork-join execution over time. A sequential section performs the initialization and spawns child threads; the child threads run in parallel; execution then returns to a sequential section.]
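The partial-sum pattern above can also be expressed with OpenMP's reduction clause, which gives every thread a private copy of the accumulator and combines the copies at the end of the parallel region. A minimal sketch, assuming an array numbers of SIZE elements (both names are assumptions carried over from the example above):

    #include <omp.h>

    /* Sum SIZE numbers in parallel; OpenMP keeps a private partial sum per
       thread and adds all of them into result at the end of the loop. */
    int parallelSum(int *numbers, int SIZE) {
        int result = 0;
        #pragma omp parallel for reduction(+:result)
        for (int idx = 0; idx < SIZE; idx++)
            result += numbers[idx];
        return result;
    }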
Function          Semantics
send(pid, val)    Send the integer val to the process with id pid.
receive(pid)      Receive an integer from process pid. This is a blocking call.
                  If pid is equal to ANYSOURCE, the receive function returns with
                  the value sent by any process.
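As an illustration of how these primitives might be used, here is a hedged sketch of a message-passing parallel sum: every process computes a partial sum, non-zero processes send theirs to process 0, and process 0 accumulates N-1 received values. The helper computePartialSum() and the constant N are assumptions; send, receive, and ANYSOURCE are the primitives defined above.

    #include <stdio.h>

    void parallelSum(int myId, int N) {
        int partial = computePartialSum(myId);   /* assumed helper: my portion of the sum */
        if (myId != 0) {
            send(0, partial);                    /* send my partial sum to process 0 */
        } else {
            int total = partial;
            for (int i = 1; i < N; i++)
                total += receive(ANYSOURCE);     /* blocking receive from any process */
            printf("sum = %d\n", total);
        }
    }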
Shared memory
• Easy to program.
• Issues with scalability
• The code is portable across machines.
Message passing
• Hard to program.
• Scalable
• The code may not be portable across machines.
For P processors,

    T_par = T_seq × ( f_seq + (1 − f_seq)/P )

    Speedup S = T_seq / T_par = 1 / ( f_seq + (1 − f_seq)/P )

As P → ∞, the speedup approaches 1/f_seq. A small numerical illustration follows.
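A minimal sketch that evaluates the speedup formula above; the numbers in the comment are just one worked example (f_seq = 0.1).

    #include <stdio.h>

    /* Speedup as per the formula above: S = 1 / (f_seq + (1 - f_seq)/P) */
    double speedup(double f_seq, int P) {
        return 1.0 / (f_seq + (1.0 - f_seq) / P);
    }

    int main(void) {
        /* with f_seq = 0.1: S(4) ~ 3.1, S(16) ~ 6.4, and S never exceeds 1/0.1 = 10 */
        printf("%.2f %.2f %.2f\n", speedup(0.1, 4), speedup(0.1, 16), speedup(0.1, 1024));
        return 0;
    }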
Flynn's classification:

    SISD    SIMD
    MISD    MIMD

(MISD: multiple instruction streams, single data stream; MIMD: multiple instruction streams, multiple data streams)
SISD Processors
• A regular single core processor
SIMD Processors
• Single instruction stream, multiple data streams
• Vector instruction set. Example: add v1, v2, v3
• v1, v2, and v3 are vector registers
• They pack multiple integers (let’s say 4)
• A pairwise addition is performed
• Summary: We can do 4 additions using just a single
instruction
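A concrete x86 analogue of the generic add v1, v2, v3 example: SSE2 packs four 32-bit integers into one 128-bit register and adds them pairwise with a single instruction. The choice of SSE2 here is an assumption for illustration, not the vector ISA used in the book.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* pack four 32-bit integers into each 128-bit vector register */
        __m128i v2 = _mm_set_epi32(4, 3, 2, 1);       /* lanes: 1, 2, 3, 4  */
        __m128i v3 = _mm_set_epi32(40, 30, 20, 10);   /* lanes: 10, 20, 30, 40 */
        __m128i v1 = _mm_add_epi32(v2, v3);           /* one instruction, four pairwise adds */

        int out[4];
        _mm_storeu_si128((__m128i *)out, v1);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 11 22 33 44 */
        return 0;
    }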
MISD Processors
• Used in airplanes: Run the same program on three
separate processors that have different instruction
sets. Compare the outputs and decide by voting.
MIMD Processors
• SPMD Processors: Single program, multiple data.
Run the same program on different cores with
different data streams (most common)
• MPMD Processors: Consider a processor with
different accelerators. Each core or accelerator
runs a different program.
Hardware Threads

Definition
• A hardware thread is a separate execution context (program counter and register state) supported directly by a core.
• Hardware threads are not the same as software threads, which share the virtual address space of a process.
Coarse-grained Multithreading
[Figure: four threads (1-4) sharing one core in a round-robin fashion.]
• Run instructions from thread 1 for k cycles.
• Then switch to thread 2, then to thread 3, thread 4, thread 1, ...
• Separate program counters, ROBs and retirement register files
per thread
• Each instruction packet, rename table entry, LSQ entry, physical
register is tagged with the thread id
• If there is a high-latency event like an L2 miss, the core can
switch to a new thread.
[Figure: issue slots over time; the core executes thread 1, then switches to threads 2, 3, and 4 in turn.]
2. Theoretical Foundations
Make a set of small caches act like one single, large cache.
Assumptions
    Thread 1        Thread 2
    x = 1           t1 = y
    y = 1           t2 = x
• Thread 1 sets x to 1
• It sends an update message on the NoC. The message gets caught
in congestion.
• Thread 1 sets y to 1. The corresponding message on the NoC is
swiftly delivered.
• Thread 2 reads y to be 1.
• Thread 2 reads x as 0 (its initial value)
    Thread 1        Thread 2
    x = 1           y = 1
    t1 = y          t2 = x
[Figure: an observer attached to the core watches the queue of read and write operations and records the resulting execution.]

Sequential Execution          All the operations are ordered.

Legal Sequential Execution    Every read operation returns the value of the latest
                              write operation to the same address.
• Even in this case, the observer on the core expects to see a legal
sequential execution
Term Meaning
Rx1 Read the value of x as 1
Wy2 Set y = 2
Multiple threads: one observer per thread. Each observer records the
local execution history of a thread.
Expression    Meaning
P|T           All the operations issued by thread T (in the same order).
              This is an ordered sequence.

There is a one-to-one, order-preserving mapping between the two sequences of
operations: for all T, operation 1 maps to 1', 2 to 2', ..., 7 to 7'.
    T1      T2      T3              T1      T2      T3
    Wx1     Rx0     Rx1             Wx1     Rx1     Rx2
    Wx2     Rx2     Rx2             Wx2     Rx2     Rx1
PLSC
If we restrict an execution to all the accesses to a single variable (memory location), the resulting execution must always be in SC. This is known as the PLSC (Per-Location Sequential Consistency) constraint. It is needed to provide the illusion of a single memory location, even if we have a distributed cache.
    T1      T2      T3
    Wx1     Rx1     Rx2
    Wx2     Rx2     Rx1
Order            Implication
Read → Read      Does not matter.
Write → Read     Given that a core can read another core's write early, while other
                 cores may not even see the write, this order may not be global all
                 the time. A core does not know when a write reaches the sister
                 caches present in other cores. Hence, it may not agree about the
                 values read by other cores; this order is local.
Write → Write    If this order is not global, then the same variable will end up
                 with multiple final states. This is not allowed. Hence, in all
                 systems this order is global.
Read → Write     Writes are globally ordered. Assume one core records Wi → Ri → Wj.
                 All the cores will record Ri → Wj, because the core that issued Ri
                 could not have seen Wj, else it would have read a different value.

Note: the Write → Read order is global only for machines with atomic writes.
Axioms of Coherence

Write Serialization    Writes to the same location are globally ordered.
Write Propagation      A write is eventually seen by all the threads.
SC Using Synchronisation Instructions

    T1                    T2
    value = 3             while (status != 1) {}
    status = 1            temp = value

With fences:

    T1                    T2
    value = 3             while (status != 1) {}
    fence                 fence
    status = 1            temp = value
• Regardless of the underlying memory model, this code will always work.
• Memory barriers
  • A fence is an example of a memory barrier.
  • A barrier specifies rules of completion for instructions before and after the barrier (in program order).
  • Store barrier: ensures an ordering between store instructions before and after the barrier instruction.
• Release instruction: the release instruction can only complete after all the instructions before it have fully completed. Note that the release instruction allows instructions after it to execute before it has completed.
A language-level sketch of the fence-based version follows.
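This is a minimal sketch of the value/status example using C11 atomics and explicit fences. It mirrors the table above; the specific memory orders are one possible realisation at the language level, not the book's architecture-level fence instruction.

    #include <stdatomic.h>

    int value;                      /* regular data */
    atomic_int status = 0;          /* flag used for synchronisation */

    void producer(void) {           /* corresponds to T1 above */
        value = 3;
        atomic_thread_fence(memory_order_release);         /* value completes before status is set */
        atomic_store_explicit(&status, 1, memory_order_relaxed);
    }

    int consumer(void) {            /* corresponds to T2 above */
        while (atomic_load_explicit(&status, memory_order_relaxed) != 1) { }
        atomic_thread_fence(memory_order_acquire);          /* status is read before value is read */
        return value;               /* guaranteed to see 3 */
    }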
Parallel execution, and an equivalent sequential execution (execution witness):

    T1          T2             Sequential execution:
    x = 1       y = 1          1: x = 1
    t1 = y      t2 = x         2: t1 = y
                               3: y = 1
                               4: t2 = x

    Outcome: <t1, t2> = <0, 1>

The execution witness is a graph:
• The nodes are the instructions
• We can have edges between the instructions based on
the orders that are guaranteed by the memory model.
• We can have two kinds of edges: global and local
• These are happens-before (hb) edges. An edge A → B means that event A happened first and event B happened after that; B could occur immediately later or after a very long time.
• A global hb edge (ghb) is agreed to by all threads.
Edge    Description
poRW    Read → Write program order edge      (not always global)
poRR    Read → Read program order edge       (not always global)
poWR    Write → Read program order edge      (global only in SC)
poWW    Write → Write program order edge
poIS    read/write → synch operation edge    (global)
poSI    synch operation → read/write edge    (global)
rf      Reads-from edge: from a write to a read that returns its value.
rfi     An rf edge where the read and write ops are in the same thread.
[Figures: example execution witnesses. One shows an rfe edge between Wx1 in T1 and Rx1 in T2 for the code (a) x = 1 and (b) t1 = x, with outcome t1 = 1. Others illustrate ws (write serialization), fr (from-read), so (synchronization order), rf, and po edges, together with assumptions of the form "let <edge> be global".]
An acyclic execution witness ⟹ there is a sequential execution that respects ghb.
• The sad part is that it may not be legal if writes are not atomic.
Instead of po edges, we have up edges: an up edge connects accesses to the same location in the same thread.

[Figure: example access graphs with up, rf, and fr edges over accesses such as Wx2, Wx4, Rx0, and Rx2, used to check outcomes such as <t1,t2,t3> = <2,1,0>? and t1 = 1, t2 = 2.]
[Figure: execution witnesses with rfe, po, and dep edges; dependences (dep) arise from constructs such as an if-statement, and a fence adds program-order edges.]
Three kinds of global edges: rf, gpo (all global program order edges),
dep (dependences)
Condition                              Test
Satisfies the memory model             The execution witness is acyclic.
PLSC holds for all memory locations    All access graphs are acyclic.
No thin-air reads                      The causal graph is acyclic.
3. Cache Coherence
Shared bus
State Meaning
M Modified
S Shared with other sister caches
I Invalid
• If the block is not present, send a read miss message on the bus.
• S state: if the block is already present, don't do anything; just read it.
• The S state allows seamless evictions.
[State diagram: write-update protocol; edges are labelled "event | action". Local events: I --Rd | RdX--> S; S: Rd | -, Evict | -; a write (Wr | WrX or Wr | Broadcast) moves the line towards M; M: Rd | -, Wr | Broadcast; M --Evict | Wb--> I. Remote (snooped) events: Broadcast | Update, RdX | Send, WrX | Send. The broadcast on every write is a power-hungry step.]
Given that only one sister cache can use the bus at any time, a
global order of writes is automatically enforced.
If the bus master disallows starvation, then all writes will ultimately
complete.
[State diagram: MSI write-invalidate protocol. I --Rd | RdX--> S; S: Rd | -, Evict | -; I --Wr | WrX--> M; S --Wr | WrX.u--> M; M: Rd | -; M --Evict | Wb--> I. A writeback is necessary when evicting from the M state; the S state has seamless evictions.] A small transition-function sketch follows.
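A minimal sketch of the local-event transitions in the write-invalidate MSI protocol described above. The enum and function names are assumptions, and bus actions are indicated only as comments; this is not the book's exact finite-state machine.

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

    msi_state_t on_local_read(msi_state_t st) {
        if (st == MSI_I) { /* send a read miss (RdX) on the bus */ return MSI_S; }
        return st;                              /* S and M read locally: Rd | - */
    }

    msi_state_t on_local_write(msi_state_t st) {
        if (st == MSI_I) { /* send a write miss (WrX) */ }
        if (st == MSI_S) { /* send an upgrade (WrX.u) to invalidate sister copies */ }
        return MSI_M;
    }

    msi_state_t on_local_evict(msi_state_t st) {
        if (st == MSI_M) { /* write the dirty line back (Wb) */ }
        return MSI_I;                           /* eviction from S is seamless */
    }

    msi_state_t on_snooped_write(msi_state_t st) {
        (void) st;
        return MSI_I;                           /* another cache is writing: invalidate */
    }

    msi_state_t on_snooped_read(msi_state_t st) {
        if (st == MSI_M) { /* supply the data and write it back */ return MSI_S; }
        return st;
    }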
• Insight: Consider a core that reads a block, and the block is never shared. The block will first be read in the S state; when the core later writes it, an additional message (WrX.u) is required to transition to the M state, even though no other cache has a copy.
• Can we avoid this?
• Add an additional Exclusive (E) state: the block is present in only one cache. If we are reading a block from the lower level (no sister cache has it), it enters the E state.
• The rest remains the same.
• MSI protocol → MESI protocol
[State diagram: MESI protocol. I --Rd | RdX (data supplied by a sister cache)--> S; I --Rd | RdX (data read from the lower level)--> E; I --Wr | WrX--> M; S: Rd | -, Evict | -; S --Wr | WrX.u--> M; E: Rd | -, Evict | -; E --Wr | - --> M (no bus message needed); M: Rd | -, Wr | -; M --Evict | Wb--> I.]
Answer: The caches that have a copy of the block arbitrate for the bus. The one that gets access to the bus first sends the data. The rest snoop this value and then cancel their requests. This is an overhead in terms of time and power.
[State diagram: MOESI protocol with states I, S, E, O, M and two temporary states (St and Se) used while waiting for the reply to a miss. Edges are labelled "event | action", e.g. Timeout | Probe, Timeout | Read from lower level, Evict | Wb, Wr | WrX.u, and Rd | - / Wr | - self-loops on M and O.]
• Ignore the two temporary states – St and Se – for the time being.
• Main problem: An eviction from the O state leaves us with a state where there is no owner.
• This makes the transitions from the I state tricky.
Directory
[Figure: the caches are connected to a directory; the directory maintains one entry per block of the address space.]

Directory Entry: two states, U and S.
• Evict | sharers = {}
• RdX | 1. Send an RdX to the sharer and ask it to forward a copy of the block. 2. sharers += {P}
Space-efficient Solutions
• Maintain a bit for a set of caches. Run a snoopy protocol inside the set.
• Store the ids of only k sharers. Have an overflow bit to indicate that there are more than k sharers. In that case, every message needs to be broadcast. [Partially mapped scheme] A sketch of such an entry follows.
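A hedged sketch of the partially mapped directory entry described above: the entry stores the ids of at most K sharers plus an overflow bit. All names and sizes here are assumptions.

    #define K 4

    typedef struct {
        unsigned char  state;        /* U (uncached) or S (shared) */
        unsigned char  numSharers;   /* number of valid entries in sharers[] */
        unsigned char  overflow;     /* set once more than K caches share the block */
        unsigned short sharers[K];   /* ids of the first K sharing caches */
    } dir_entry_t;

    /* Record a new sharer P; fall back to broadcast mode if the entry overflows. */
    void add_sharer(dir_entry_t *e, unsigned short P) {
        if (e->numSharers < K)
            e->sharers[e->numSharers++] = P;
        else
            e->overflow = 1;         /* from now on, messages must be broadcast */
    }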
Unsynchronized update of a bank balance:

    t1 = account.balance;
    t2 = t1 + 100;
    account.balance = t2;

The same update protected by a lock:

    lock();
    t1 = account.balance;
    t2 = t1 + 100;
    account.balance = t2;
    unlock();
Spin lock using an atomic exchange (r0 holds the address of the lock variable):

    .lock:
        mov r1, 1
    .test:
        /* test if the lock is free */
        ld r2, 0[r0]
        cmp r2, 0
        bne .test
        /* try to acquire the lock atomically */
        xchg r1, 0[r0]
        cmp r1, 0
        bne .test
        ret

    .unlock:
        /* we store 0 at the lock address */
        mov r1, 0
        xchg r1, 0[r0]
        ret

A C11 version of the same idea appears below.
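A minimal sketch of the test-and-test-and-set lock using C11 atomics; the variable and function names are assumptions.

    #include <stdatomic.h>

    atomic_int lockvar = 0;   /* 0: free, 1: taken (plays the role of 0[r0] above) */

    void lock(void) {
        while (1) {
            /* test: spin on a plain load until the lock looks free */
            while (atomic_load(&lockvar) != 0) { }
            /* atomic exchange, like the xchg instruction above */
            if (atomic_exchange(&lockvar, 1) == 0)
                return;           /* old value was 0: lock acquired */
        }
    }

    void unlock(void) {
        atomic_store(&lockvar, 0);    /* store 0 at the lock address */
    }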
Lock-free version using compare-and-swap (CAS):

    while (1) {
        t1 = account.balance;
        t2 = t1 + 100;
        if (CAS(account.balance, t1, t2))
            break;
    }
Answer: Wait-free algorithms.

Basic Idea
The consensus problem forms the theoretical basis of most concurrent algorithms.
4. Memory Models
Execution witness examples:

    T1: (a) x = 1; (b) t1 = x; (c) y = 1        T2: (d) t2 = y; (e) t3 = x
    [Witness: (a) Wx1 and (e) Rx0 joined by an fr edge, plus rfi and po edges.]

    T1: (a) x = 1; (b) y = 1                    T2: (c) t1 = y; (d) t2 = x
    [Witness: (a) Wx1 and (c) Ry1, with fr, po, and rfe edges.]
Memory models ordered from strict to relaxed ordering:

    SC → TSO (Intel) → Processor consistency → PSO → Weak ordering / RC → IBM PowerPC, ARM
5. Data Races
counter++ expands to:

    t1 = counter;
    t2 = t1 + 1;
    counter = t2;

Two ways to make the increment atomic (a C11 sketch of the first option appears below):

    /* option 1: a single atomic instruction */
    fetch_and_increment(counter);

    /* option 2: a lock */
    lock();
    t1 = counter;
    t2 = t1 + 1;
    counter = t2;
    unlock();
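A minimal sketch of fetch_and_increment expressed with C11 atomics; the counter name is carried over from the example above.

    #include <stdatomic.h>

    atomic_int counter = 0;

    /* a single atomic read-modify-write operation, so no lock is required */
    void increment(void) {
        atomic_fetch_add(&counter, 1);
    }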
[Figure: timeline showing threads T1 and T2 alternately acquiring the lock (successfully locked) and releasing it (unlocked).]
    Execution 1                   Execution 2 (y is a synch variable)
    T1          T2                T1          T2
    (a) Rx0     (c) Rx0           (a) Wx1     (c) Ry1
    (b) Wx1     (d) Wx1           (b) Wy1     (d) Rx1

Note the so edge between (b) Wy1 and (c) Ry1 in the second execution.
Data Races
Data Race
A pair of conflicting and concurrent accesses to the
same regular variable constitute a data race.
If a piece of code does not have data races, what does it mean?
[Figure: two executions containing accesses to x (Rx0, Wx1, Rx1) and synchronization operations S1 and S2, connected by po, so, and rf edges. In one execution the accesses are ordered through the synchronization operations; in the other, the write Wx1 and the read Rx1 are concurrent and constitute a data race.]
Salient Points

Theorem (for the proof, refer to the book)

Lockset algorithm: on every access to a variable v by a thread T, compute an intersection:

    L(v) ← L(v) ∩ L(T)

where L(v) is the candidate lockset of v and L(T) is the set of locks currently held by T. A small code sketch of this refinement appears after the state-machine description below.
1. The first access has to be a write. We then move to the Exclusive state.
2. If a different thread reads it, move to the Shared state.
3. The Shared state allows reads (irrespective of the thread).
4. After a write, move to the Modified state.
5. In the Modified state, run the regular lock set algorithm.
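A hedged sketch of the basic lockset refinement L(v) = L(v) ∩ L(T), with locksets represented as 64-bit masks (one bit per lock). The names, sizes, and reporting mechanism are assumptions, and the state machine above is not modelled here.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_THREADS 64
    #define MAX_VARS    1024

    uint64_t locksHeld[MAX_THREADS];   /* L(T): locks currently held by thread T          */
    uint64_t candidate[MAX_VARS];      /* L(v): candidate lockset, initialised to ~0ULL   */

    void on_access(int tid, int var) {
        candidate[var] &= locksHeld[tid];      /* compute the intersection */
        if (candidate[var] == 0)
            printf("possible data race on variable %d\n", var);
    }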
    Vi = Vj  ⟺  ∀k, Vi[k] = Vj[k]
    Vi ≺ Vj  ⟺  (Vi ≠ Vj) ∧ (∀k, Vi[k] ≤ Vj[k])
    Vi ≼ Vj  ⟺  (Vi = Vj) ∨ (Vi ≺ Vj)

• Two vector clocks may not always be comparable. A small sketch of these comparisons follows.
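A minimal sketch of the vector clock relations defined above; the size and the names are assumptions.

    #define NTHREADS 8

    typedef struct { int c[NTHREADS]; } vclock_t;

    /* Vi ≼ Vj: every component of Vi is <= the corresponding component of Vj */
    int vc_leq(const vclock_t *vi, const vclock_t *vj) {
        for (int k = 0; k < NTHREADS; k++)
            if (vi->c[k] > vj->c[k]) return 0;
        return 1;
    }

    /* Vi ≺ Vj: Vi ≼ Vj and the clocks are not equal */
    int vc_lt(const vclock_t *vi, const vclock_t *vj) {
        int strictly = 0;
        for (int k = 0; k < NTHREADS; k++) {
            if (vi->c[k] > vj->c[k]) return 0;
            if (vi->c[k] < vj->c[k]) strictly = 1;
        }
        return strictly;
    }

    /* If neither vc_lt(a,b) nor vc_lt(b,a) holds and a != b, the clocks are incomparable. */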
[Figure: a happens-before (hb) edge from Event 1 (x = 1) to Event 2 (t1 = x), two events on the same memory location.]
Event    Time
ei       Event in thread i at time Vi
ej       Event in thread j at time Vj

    Vi ≺ Vj  ⟺  ei →(hb) ej

This holds because the receiver has all the sender's updates, and without a chain of happens-before edges two vector clocks will not be comparable.
Lock acquire (thread T acquiring lock L):

    CT[tid] ← CT[tid] + 1        (increment the thread's component of its clock)
    CT ← CT ∪ CL                 (component-wise union with the lock's clock)
    CT.inLock ← True

Lock release:

    CL ← CT                      (copy the thread clock into the lock clock)
    CT.inLock ← False
Read Operation

    if (∼CT.inLock) then
        CT[tid] ← CT[tid] + 1        /* increment the local count of the thread's clock */
    end
    if (Wv ≼ CT) then
        Rv ← Rv ∪ CT                 /* if no thread has overwritten the value after the
                                        transaction started, the read is successful;
                                        update the read clock */
    else
        DeclareDataRace
    end
Write Operation

    if (∼CT.inLock) then
        CT[tid] ← CT[tid] + 1        /* increment the local count of the thread's clock */
    end
    if ((Wv ≼ CT) ∧ (Rv ≼ CT)) then
        Rv ← Rv ∪ CT                 /* if no thread has overwritten or read the value
        Wv ← Wv ∪ CT                    after the transaction started, the write is
                                        successful; update the read and write clocks */
    else
        DeclareDataRace
    end
6. Transactional Memory
[Figure: Thread A acquires lock A and Thread B acquires lock B; Thread A then waits for lock B while Thread B waits for lock A, resulting in a deadlock.]
Create a Transaction
Two Types
Hardware TM Software TM
Term    Meaning
Ri      Read set of transaction i
Wi      Write set of transaction i
Rj      Read set of transaction j
Wj      Write set of transaction j

Transactions i and j conflict if (a bitmask-based sketch of this check follows):

    (Wi ∩ Wj ≠ φ)  OR  (Wi ∩ Rj ≠ φ)  OR  (Ri ∩ Wj ≠ φ)
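If read and write sets are represented as bitmasks over a small set of memory locations, the conflict condition above becomes three intersection tests. The names and the bitmask representation are assumptions for illustration.

    #include <stdint.h>

    typedef struct { uint64_t readSet, writeSet; } tx_sets_t;

    /* non-zero if transactions i and j conflict */
    int conflict(const tx_sets_t *i, const tx_sets_t *j) {
        return (i->writeSet & j->writeSet) != 0 ||
               (i->writeSet & j->readSet)  != 0 ||
               (i->readSet  & j->writeSet) != 0;
    }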
Commit
• The transaction completed without any conflicts.
• It finished writing its data to main memory.

Abort
• The transaction could not complete due to conflicts.
• It did not make any of its writes visible.
    do {
        ...
        ...
    } while (!Tx.commit());
[Figure: timeline of a conflict. With pessimistic concurrency control, conflict occurrence, detection, and resolution all happen at the same point. With optimistic concurrency control, the conflict occurs first and is detected and resolved later.]
Eager
• Check for conflicts as soon as a
transaction accesses a memory
location
Lazy
• Check at the time of
committing a transaction
Serializable
• Sequential consistency at the level of transactions
Strictly Serializable
• The sequential ordering is consistent with the real time ordering
• This means that if Transaction A starts after Transaction B ends,
it should be ordered after it in the equivalent sequential ordering
• For concurrent transactions, their ordering does not matter
Opacity
• Even aborted transactions need to see a consistent state – one
produced by only committed transactions
    Thread 1                        Thread 2
    atomic {                        atomic {
        t1 = x;                         x = 5;
        t2 = y;                         y = 5;
        while (t1 != t2) {}         }
    }
• If one transaction executes after the other, x will always be equal to y
• Assume optimistic concurrency control: resolution at the end
• Assume Thread 1 reads x=0
• Then the transaction on thread 2 finishes
• The transaction on thread 1 needs to be aborted (will read y=5)
• However, it will be stuck in the while loop; it will never reach the end of the transaction
• Opacity will not allow y to be read as 5 after x has been read as 0
Mixed Mode Accesses: Transactional and Non-Transactional

Design choices
Concurrency Control
• Optimistic or Pessimistic
Version Management
• Lazy or Eager
Conflict Detection
• Lazy or Eager
[Figure: a transactional variable consists of its value and per-object metadata: a version and a pointer to the transaction that has locked the object (for a read or a write). On a read operation, the metadata is checked; depending on the outcome (Yes/No), the transaction proceeds or aborts.]

Pros: simple.
Cons: does not provide opacity, does not provide a strong semantics for transactions, and writes are slow.
Timestamp-based design (a code sketch follows):
• At the beginning of a transaction: globalClock++; Tx.rv = globalClock;
• On a read: if (e.timestamp > Tx.rv) abort.
• Writes: lock the object; on failure, abort.
• Provides opacity.
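A hedged sketch of the timestamp-based read validation outlined above. The structures, field names, and the locking details are assumptions, not the book's exact design.

    typedef struct { int value; int version; int locked; } tvar_t;

    int globalClock = 0;                 /* incremented atomically in a real system */

    typedef struct { int rv; int aborted; } tx_t;

    void tx_begin(tx_t *tx) {
        globalClock++;                   /* as above: globalClock++; Tx.rv = globalClock */
        tx->rv = globalClock;
        tx->aborted = 0;
    }

    int tx_read(tx_t *tx, tvar_t *v) {
        int val = v->value;
        /* if the variable is locked or was written after the transaction began, abort */
        if (v->locked || v->version > tx->rv)
            tx->aborted = 1;
        return val;
    }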
ISA Support
• Add three new instructions: begin, abort, and commit
Version Management
• HW schemes mostly use eager version management – undo log
• The log has its dedicated set of addresses in virtual memory
• Each cache line is augmented with R and W bits to track the transaction's read and write sets.