
Chapter 9:

Multicore Systems

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 1


Background Required to Understand this Chapter

Caches

On-Chip Network

Graph Algorithms

(Chapters 7 and 8)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 2


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 3


Era of Sequential Computing is Over

[Plot: SPECint 2006 scores of single-core processors plotted against their release dates (1999-2012)]

Single core performance from 2001 to 2010


It has clearly saturated after 2008.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 4


Solution

Have multiple cores that collaboratively solve a task.

Big Problem → Smaller problems → Map each sub-problem to a core

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 5


Problem

• Take an array numbers with SIZE elements.


• Compute the sum of all of its elements.
• Distribute the work among N threads.

Approach

• Divide the array into N parts.


• Assign each part to a core.
• Compute the partial sum (sum of each part)
• Add the partial sums

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 6


Shared Memory based Programming

Core 1 Core 2 Core N

Shared memory

• Each core runs a thread.


• Different threads share parts of the virtual memory space
• They communicate by reading and writing to shared variables

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 7


/* variable declaration */
int partialSums [N];
int numbers [SIZE];
int result = 0;

/* initialise arrays */
...
/* parallel section */
#pragma omp parallel
{
    /* get my thread id */
    int myId = omp_get_thread_num();
    /* add my portion of numbers */
    int startIdx = myId * SIZE / N;
    int endIdx = startIdx + SIZE / N;

    for (int jdx = startIdx; jdx < endIdx; jdx++)
        partialSums[myId] += numbers[jdx];
}

/* sequential section */
for (int idx = 0; idx < N; idx++)
    result += partialSums[idx];

(OpenMP code)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 8


A Typical Shared Memory Program
[Diagram: the parent thread performs initialization and spawns child threads;
the child threads run in parallel over time; a thread join operation is followed
by a sequential section]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 9


Message Passing Code

• Cores communicate by sending messages to each other


• They do not share memory.
• There are two basic functions.

Function Semantics
send(pid, val) Send the integer val to the process with id, pid
receive(pid) 1. Receive an integer from process pid
2. This is a blocking call
3. If the pid is equal to ANYSOURCE, then the receive
function returns with the value sent by any process

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 10


/* start all the parallel processes */
spawnAllParallelProcesses();

/* for each process execute the following code */
int myId = getMyProcessId();

/* compute the partial sums */
int startIdx = myId * SIZE / N;
int endIdx = startIdx + SIZE / N;
int partialSum = 0;
for (int jdx = startIdx; jdx < endIdx; jdx++)
    partialSum += numbers[jdx];

/* all the non-root nodes send their partial sums to the root */
if (myId != 0) {
    /* send the partial sum to the root */
    send(0, partialSum);
}

(Message passing code)
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 11
Continuation

else {
    /* for the root */
    int sum = partialSum;
    for (int pid = 1; pid < N; pid++) {
        sum += receive(ANYSOURCE);
    }

    /* shut down all the processes */
    shutDownAllProcesses();

    /* return the sum */
    return sum;
}

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 12


Shared Memory vs Message Passing

Shared memory

• Easy to program.
• Issues with scalability
• The code is portable across machines.

Message passing

• Hard to program.
• Scalable
• The code may not be portable across machines.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 13


Amdahl’s Law

How much can we parallelize?

We are limited by the fraction of the


sequential section: fseq

For P processors:

T_par = T_seq × ( f_seq + (1 − f_seq)/P )

Speedup: S = T_seq / T_par = 1 / ( f_seq + (1 − f_seq)/P )

As P → ∞, S → 1/f_seq
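A small C sketch (not from the book) that simply evaluates this formula and prints how the speedup saturates; the function name and the 10% sequential fraction are assumptions for illustration:

#include <stdio.h>

/* Amdahl's law: speedup on P processors for a sequential fraction f_seq */
double amdahl_speedup(double f_seq, int P) {
    return 1.0 / (f_seq + (1.0 - f_seq) / P);
}

int main(void) {
    double f_seq = 0.1;   /* assume 10% of the work is inherently sequential */
    for (int P = 1; P <= 1024; P *= 2)
        printf("P = %4d   speedup = %.2f\n", P, amdahl_speedup(f_seq, P));
    /* as P grows, the speedup approaches 1/f_seq = 10 and then saturates */
    return 0;
}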

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 14


Speedup vs #Processors
[Plot: speedup (S) vs. number of processors (P); note the saturation of the speedup]


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 15
Gustafson-Barsis’s Law
Old workload: W. New workload: scale the parallel part by the number of
processors (P), i.e., W_new = f_seq × W + P × (1 − f_seq) × W

Entity            Mathematical expression
Sequential time   T_seq = k × W_new = k × W × (f_seq + P × (1 − f_seq))
                  (k is a constant of proportionality)
Parallel time     T_par = k × W

Speedup: S = T_seq / T_par = f_seq + P × (1 − f_seq)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 16


Design Space of Multiprocessors: Flynn’s Classification

• SISD: Single instruction, single data
• SIMD: Single instruction, multiple data
• MISD: Multiple instructions, single data
• MIMD: Multiple instructions, multiple data

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 17


Explanation of the Flynn’s Classification

SISD Processors
• A regular single core processor

SIMD Processors
• Single instruction stream, multiple data streams
• Vector instruction set. Example: add v1, v2, v3
• v1, v2, and v3 are vector registers
• They pack multiple integers (let’s say 4)
• A pairwise addition is performed
• Summary: We can do 4 additions using just a single
instruction
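As a rough illustration (not the book's example), the scalar loop below is the kind of code a vectorizing compiler can map to SIMD instructions: each group of 4 independent additions becomes one packed add on 4-wide vector registers.

/* c[i] = a[i] + b[i]: with vectorization, 4 (or more) additions per instruction */
void vector_add(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* conceptually: add v1, v2, v3 on packed lanes */
}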

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 18


Explanation II

MISD Processors
• Used in airplanes: Run the same program on three
separate processors that have different instruction
sets. Compare the outputs and decide by voting.
MIMD Processors
• SPMD Processors: Single program, multiple data.
Run the same program on different cores with
different data streams (most common)
• MPMD Processors: Consider a processor with
different accelerators. Each core or accelerator
runs a different program.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 19
Hardware
Threads

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 20


Notion of Hardware Threads

• We traditionally have processes that run on different cores.


• Let’s say we have cores with large issue widths
• If processes have low ILP, then we waste issue slots

• Run multiple processes simultaneously on the same core

Definition

Such processes that share cores are known as hardware threads.

• They are not the same as software threads that share the
virtual address space.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 21
Coarse-grained Multithreading
[Diagram: four threads (1 → 2 → 3 → 4 → 1) take turns running on the core]
• Run instructions from thread 1 for k cycles.
• Then switch to thread 2, then to thread 3, thread 4, thread 1, ...
• Separate program counters, ROBs and retirement register files
per thread
• Each instruction packet, rename table entry, LSQ entry, physical
register is tagged with the thread id
• If there is a high-latency event like an L2 miss, the core can
switch to a new thread.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 22


Fine-grained Multithreading

• Coarse-grained multithreading does not let the processor idle, if there


is a high-latency event.
• Fine-grained multithreading reduces k to 1-5 cycles.
• We can tolerate low-latency events like L1 misses.
• There is an additional overhead of excessive thread switching.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 23


Simultaneous Multithreading (SMT)

• Run multiple hardware threads simultaneously


• Dynamically split the issue slots between the threads
• There are several heuristics for partitioning the issue slots
• Fairness
• Instruction criticality (higher priority to loads)
• Thread criticality: real time, non-real time
• Based on instruction type (expected ILP, etc.)
• Hyperthreading → SMT with static partitioning (typically 50:50)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 24


Comparison of Different Hardware Multithreading Schemes

[Diagram: issue slots over time for coarse-grained multithreading, fine-grained
multithreading, and simultaneous multithreading, with four threads (Thread 1-4)
sharing the pipeline]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 25


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 26


Issues with Large Caches

• Large caches’ access times can be between 10-50 cycles


• Requests need to traverse the NoC
• If we have many cores, they will make parallel accesses.

We are dealing with large, slow, multi-ported


caches.
This is not possible to build.

Create large distributed caches.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 27


Notion of a Distributed Cache

Make a set of small caches act like one single, large cache.

Each sister cache is small and fast.

Many parallel accesses.


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 28
Shared vs Distributed Caches

We want a set of small L1 caches (sister caches) to act like one,


single, large L1 cache. How is this possible?

Consider accesses to a single


memory address/variable, x

• If we have a single physical location for x, then there is no


advantage of a distributed design.
• If we have multiple locations for storing the variable x, then
the replicas have to contain the same value.
• This is the problem of coherence.

Coherence → Make a set of locations behave


like a single location.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 29


Consider Multiple Locations

Assumptions

• All global variables are initialized to 0. They start with u, v, x, y, or z.


• All local variables are mapped to registers. They start with a ‘t’.
• The delay between the execution of consecutive instructions can be
indefinitely large. Instructions across threads can execute in any order.

Thread 1 Thread 2
x=1 t1 = y
y=1 t2 = x

• Is the outcome <t1, t2> = <1,0> possible?


• This outcome bothers us because no sequential ordering of the
instruction executions can produce it.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 30


This outcome is indeed valid on many machines

Thread 1 Thread 2
x=1 t1 = y
y=1 t2 = x

• Thread 1 sets x to 1
• It sends an update message on the NoC. The message gets caught
in congestion.
• Thread 1 sets y to 1. The corresponding message on the NoC is
swiftly delivered.
• Thread 2 reads y to be 1.
• Thread 2 reads x as 0 (initialized value)

This is indeed possible!


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 31
One More Example

Thread 1 Thread 2
x=1 y=1
t1 = y t2 = x

Can we read <t1, t2> = <0,0> ?

The loads are issued early, and the writes happen at


commit time.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 32


Set of Valid Outcomes
Given a parallel program, what are the set of valid
outcomes?
This depends on the system on which the program is
running. It depends on ....
• The pipeline
• Memory system
• NoC
• Memory controller (for off-chip memory)
Every processor has a set of specifications that specify
the allowed outcomes/behaviors. If the behavior is
consistent with the specifications, the execution is said
to be consistent.
This specification is known as the memory
model or memory consistency model.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 33
Difference between Memory Consistency and Memory
Coherence

A memory consistency model or a memory model is a policy


that specifies the behavior of a parallel, multithreaded
program. In general, a multithreaded program can produce a
large number of outcomes depending on the relative order of
scheduling of the threads, and the behavior of memory
operations. A memory consistency model restricts the set of
allowed outcomes for a given multithreaded program. It is a
set of rules that defines the interaction of memory instructions
between each other.

Coherence is a subset of the overall memory model that


specifies the behavior with respect to a single location.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 34


Theoretical Foundations

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 35


Every Observer has a Point of View

tstart: start time, tend: end time, tcomp: completion time

An observer sits at a memory location (a cache bank with its request queue) and
records the memory requests it sees.

[Timeline of memory requests seen by the observer: requests R1, R2, R3, each with
a start time tstart, a completion time tcomp, and an end time tend]
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 36


Definition of an Execution
An operation is described by the tuple (tid, tstart, tend, type, addr, value),
where the type is either read or write.

An execution is a set of such operations.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 37


More about Executions

Sequential
All the operations are ordered.
Execution

Legal
Sequential Every read operation returns the value of the
Execution latest write operation to the same address.

The observer sitting on a memory location needs to


observe a legal sequential execution.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 38


Observer sitting on a Core

• In the case of a load, this relationship still holds: tstart ≤ tcomp ≤ tend.

• However, things change for a store.


• We don’t know when it will complete. The store will be put on the
NoC. It may reach the cache bank a long time later.
• The operation ends when the store leaves the ROB.
• We can very well have tcomp > tend.

• Even in this case, the observer on the core expects to see a legal
sequential execution

Otherwise, the execution will not make any sense ....

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 39


Parallel Executions
Consider a parallel execution.

Term Meaning
Rx1 Read the value of x as 1
Wy2 Set y = 2

Multiple threads: one observer per thread. Each observer records the
local execution history of a thread.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 40


A Parallel Execution

Note the time line: ordered by the completion time.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 41


Issues with Parallel Executions

• Atomicity → Every operation has a single completion time. It appears to


execute instantaneously at that point of time.
• Having multiple observers (one per thread), each with their own point of view,
does not help.
• It is better if we can think of parallel executions in terms of sequential
executions → it is very intuitive.
• For this, we need to introduce the notion of the equivalence of executions:
P and S.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 42


Equivalence of Two Executions

Equivalence of two executions

Expression   Meaning
P|T          All the operations issued by thread T (in the same
             order). This is an ordered sequence.
P ≡ S        There is a one-to-one mapping between the two
             sequences of operations: for all T, P|T = S|T.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 43


Example of Execution Equivalence

Mapping
1 ↔ 1’, 2 ↔ 2’, ... , 7 ↔ 7’

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 44


Sequential Consistency

When is a parallel execution equivalent to some sequential execution?

Sequential Consistency (SC)

When a parallel execution is equivalent to a legal


sequential execution and the order of operations in the
sequential execution is as per program order, we say that
the execution is sequentially consistent.

We can interleave the executions of different threads such


that they are arranged sequentially, every read receives
the value of the latest write, and for each thread the
operations are arranged in program order.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 45


Key Properties of SC

Writes are atomic

Program order is preserved

SC = Atomicity + Program Order

Reordering accesses to different addresses might


create havoc in multiprocessor systems.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 46


Examples of Single-variable Programs

Execution in SC:              Execution that is not in SC:
T1: Wx1, Wx2                  T1: Wx1, Wx2
T2: Rx0, Rx2                  T2: Rx1, Rx2
T3: Rx1, Rx2                  T3: Rx2, Rx1

The SC execution is equivalent to the legal sequential order:
Rx0 (T2) → Wx1 → Rx1 (T3) → Wx2 → Rx2 (T2) → Rx2 (T3)

The second behavior is non-intuitive: for x there appear to be
different storage locations.

PLSC
If we consider all the accesses to a single variable (memory location), let
such an execution be always in SC. This is known as the PLSC (Per
Location Sequential Consistency) constraint. It is needed to provide the
illusion of a single memory location, even if we have a distributed cache.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 47


SC for Multi-variable Programs
T1 T2
Wx1 Wy1
Ry0 Rx0

Can <x,y> = <0,0>?

This execution is not in SC

PLSC does not guarantee SC.

Accesses w.r.t. x:          Accesses w.r.t. y:
T1: Wx1                     T1: Ry0
T2: Rx0                     T2: Wy1

Both per-location executions are in PLSC.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 48


Why not make all machines guarantee SC?
T1 T2 Threads can reorder
x=1 y=1 accesses to different
addresses.
t1 = y t2 = x

• The outcome <x,y> = <0,0> is clearly not in SC.


• This is because
• The loads will be sent early to the cache
• The stores will only be sent at commit time
• This means that both the loads will read x and y to be 0
• If we still need SC then
• Loads will have to be issued at commit time
• All the benefits of OOO execution and the LSQ will go away
• For high performance we need to sacrifice SC
• It prohibits many performance-enhancing optimizations

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 49


Non-atomic Writes
Code:
T1: x = 1
T2: while (x != 1) {};  y = 1
T3: while (y != 1) {};  t1 = x   (reads 0)

Execution:
T1: Wx1
T2: Rx1, Wy1
T3: Ry1, Rx0

• This execution is clearly not in SC


• It is however in PLSC. Verify
• However, the write to x appears to have different completion times for
different threads. The write is non-atomic.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 50
Non-Atomic Writes - II

• This is possible in many architectures such as IBM and ARM


machines
• Thread 2 is basically reading the write to x early  before it is visible
to Thread 3
• This is possible if multiple locations store the variable x, and Core 2
is closer to Core 1 and Core 3 is much farther away.
• In a system with an NoC, this can easily happen.

Given that this execution is in PLSC, let us accept it.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 51


Behaviors not in PLSC

T1: Wx1, Wx2
T2: Rx1, Rx2
T3: Rx2, Rx1

• This behavior is not in PLSC


• SC implies PLSC, PLSC does not necessarily imply SC
• Writes atomic w.r.t. one location need not be atomic
w.r.t. accesses for multiple locations.
• Such behaviors break the notion of shared memory
completely

Let us thus allow non-atomic writes as long as


PLSC is preserved.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 52


Summary Till Now

SC is • SC = Atomicity + Program order


intuitive yet • In OOO machines, it is impractical
impractical
• SC implies PLSC (not the other way
round)
PLSC is • PLSC holds in systems that have non-
definitely atomic writes.
required • It is needed to provide the illusion of a
single memory location in a distributed
cache

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 53


From PLSC to Coherence
• Akin to the definition of SC, we can define PLSC as per-location
program order + atomicity (viewed per location)
• Program order basically means that processors or cache controllers
are not allowed to reorder accesses to the same address.
• Since caches have FIFO queues for requests, they already ensure
program order.
• Consider atomicity. Because of PLSC an observer at a memory
address always sees atomic writes, but an observer on a core may
see non-atomic writes because of intervening accesses to other
addresses.

• For a given address, let us look at the following orders of operations


• Global order → All observers record this order between operations
• Local order → A few observers do, and a few don't

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 54


Ordering between Accesses to the Same Variable
across Threads (Observer at the Core)

Order Implication
Read  Read Does not matter
Write  Read Given that a core can read another core’s write early,
while other cores may not even see the write, this
order may not be global all the time. A core does not
know when a write reaches the sister caches present
in other cores. Hence, it may not agree about the
values read by other cores; this order is local.
Write  Write If this order is not global, then the same variable will
end up with multiple final states. This is not allowed.
Hence, in all systems this order is global.

Read  Write Writes are globally ordered. Assume one core records
Wi  Ri  Wj . All the cores will record Ri  Wj because
the core that issued Ri could not have seen Wj else it
would have read a different value.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 55


Implications of PLSC (Observer at the Core)

Write  Write order is global

Read  Write order is global

Write  Read order is global only for machines with atomic writes.

Axioms of Coherence

Write Serialization: Writes to the same location are globally ordered.

Write Propagation: A write is eventually seen by all the threads.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 56
SC Using Synchronisation
Instructions

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 57


Communicating a Value between Threads

T1 T2
value = 3 while (status != 1) {}
status = 1 temp = value

• This code will work correctly on an SC machine. It will not work


correctly on other machines with other memory models.
• The ordering between the read and write in T2 may not hold.
• A store instruction fully completes when all the threads can read the
stored value

fence instruction → All the instructions fetched before the fence


need to fully complete before the fence instruction executes. All
the instructions after the fence instruction cannot execute until the
fence instruction has completed.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 58


Code with fence Instructions

T1 T2
value = 3 while (status != 1) {}
fence fence
status = 1 temp = value

• Regardless of the underlying memory model this code will always work
• Memory barriers
• A fence is an example of a memory barrier
• Specifies rules of completion for instructions before and after the
barrier (in program order)
• Store barrier → Ensures an ordering between store instructions before
and after the barrier instruction.
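A minimal C11 sketch of the value/status pattern from the previous slide, using atomic_thread_fence as the fence; this is one possible encoding, not the book's code, and the variable names mirror the slide.

#include <stdatomic.h>

int value;
atomic_int status;

void producer(void) {                               /* T1 */
    value = 3;
    atomic_thread_fence(memory_order_seq_cst);      /* fence */
    atomic_store(&status, 1);
}

void consumer(void) {                               /* T2 */
    while (atomic_load(&status) != 1) { }
    atomic_thread_fence(memory_order_seq_cst);      /* fence */
    int temp = value;                               /* sees value == 3 */
    (void)temp;
}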

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 59


Acquire and Release Instructions

Acquire instruction → No instruction after the acquire instruction in program


order can execute before it has completed.

Release instruction → The release instruction can only complete if all the
instructions before it have been fully completed. Note that the release
instruction allows instructions after it to execute before it has completed.
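The same hand-off can be sketched with acquire and release operations instead of full fences; a C11 illustration under the assumption that status is an atomic variable (again, not the book's code):

#include <stdatomic.h>

int value;
atomic_int status;

void producer(void) {
    value = 3;
    /* release: completes only after all earlier instructions have completed */
    atomic_store_explicit(&status, 1, memory_order_release);
}

void consumer(void) {
    /* acquire: no later instruction executes before this load completes */
    while (atomic_load_explicit(&status, memory_order_acquire) != 1) { }
    int temp = value;   /* guaranteed to read 3 */
    (void)temp;
}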

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 60


Theory of Memory Models

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 61


Where do we stand ...

• Processors reorder instructions that access different addresses


• Very problematic in multithreaded systems → non-intuitive behaviors
• Writes can be non-atomic

How do we reason about the correctness of these systems?

SC gave us a mechanism where we could map a parallel execution


to a legal sequential execution (LSE). We could then reason on the LSE.

What about non-SC executions?

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 62


Method of Execution Witnesses

Parallel Execution → Execution Witness → Sequential Execution

• A given piece of parallel code can have many different executions.


• The valid outcomes are dependent on the memory model of the
machine.
• For each execution (valid or invalid), we can create an execution witness
(a graph containing nodes and edges).
• If we can convert the execution witness to a sequential execution that
obeys the memory model, the execution is valid, otherwise it is invalid.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 63


Execution Witness
Program:
T1: x = 1; t1 = y
T2: y = 1; t2 = x

An SC execution: (1) x = 1, (2) t1 = y, (3) y = 1, (4) t2 = x
Outcome: <t1,t2> = <0,1>

An execution witness is a graph:
• The nodes are the instructions
• We can have edges between the instructions based on
the orders that are guaranteed by the memory model.
• We can have two edges: global and local
• These are happens-before edges → an edge from A to B means that event A
happened first and then event B happened after that. It could be
immediately later or after a very long time.
• A global hb edge (ghb) is agreed to by all threads.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 64


Program Order (po) Edge

All the edges are between memory operations issued


by the same thread regardless of their address.

Edge    Description
poRW    Read → Write edge               (not always global;
poRR    Read → Read edge                 global only in SC)
poWR    Write → Read edge
poWW    Write → Write edge
poIS    read/write → synch operation    (global)
poSI    synch → read/write operation    (global)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 65


Example of a po edge

Code (T1):          Execution witness:
(a) x = 1           (a) Wx1 --po--> (b) Wy1 --po--> (c) Rx1
(b) y = 1
(c) t1 = x

• We create a node for each memory operation


• We add edges between them (as per the memory model)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 66


rf (read from) edge

rfe edge: read and write ops in different threads
rfi edge: read and write ops in the same thread

• If writes are atomic, the rfe edge is global. Not otherwise.


• rfi is not global when we have optimizations like load-store forwarding
where a core can read its own value before other cores can.

Code: T1: (a) x = 1;  T2: (b) t1 = x.   Outcome: t1 = 1

Execution witness: (a) Wx1 --rfe--> (b) Rx1

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 67


Write Serialization (ws) Edge
Code:
T1: (a) x = 1
T2: (b) x = 2
T3: (c) t1 = x; (d) fence; (e) t2 = x
Outcome: t1 = 1, t2 = 2

Execution witness:
(a) Wx1 --ws--> (b) Wx2,  (a) Wx1 --rf--> (c) Rx1,  (b) Wx2 --rf--> (e) Rx2,
(c) Rx1 --po--> (d) fence --po--> (e) Rx2

• This is always global. This is a direct consequence of PLSC and the


requirements of coherence.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 68


From-read (fr) Edge
Code:
T1: (a) x = 1; (b) x = 2
T2: (c) t1 = x
Outcome: t1 = 1

Execution witness:
(a) Wx1 --rf--> (c) Rx1,  (a) Wx1 --ws--> (b) Wx2,  (c) Rx1 --fr--> (b) Wx2

• This edge is global because of PLSC.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 69


Synchronization Edge: so
Code:
T1: (a) x = 1; (b) y = 1
T2: (c) t1 = y; (d) t2 = x
Outcome: t1 = 1, t2 = 1

Execution witness:
(a) Wx1 --po--> (b) Wy1,  (c) Ry1 --po--> (d) Rx1,
(b) Wy1 --rf/so--> (c) Ry1,  (a) Wx1 --rf--> (d) Rx1
Assume y is a synchronization variable.


All updates to such synch variables are globally ordered.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 70


Relationships between memory accesses

po • program order edge (same or different addresses)

rf • write → read dependence (same address)

ws • write → write (same address)

fr • read → write (same address)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 71


Summary: Memory Models

ws, fr, and so are always global

Let grf ⊆ rf be the subset of rf edges that are global

Let gpo ⊆ po be the subset of po edges that are global

grf = rf and gpo = po only for SC

ghb = gpo ∪ ws ∪ fr ∪ grf

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 72


Cycles in an Execution Witness

If a graph does not have a cycle, we can arrange


all the nodes in a linear sequence such that

If there is a path from operation A to B


in the graph, then A appears before B in
the linear sequence.

This is a topological sort.

An acyclic execution witness → a sequential execution that respects ghb

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 73


Implications of a Sequential Execution

• For an arbitrary memory model


• We can now establish an equivalence between a parallel execution
and a sequential execution.

It may not be legal.

• The sad part is that it may not be legal if writes are not atomic.

At least the fact that we established an equivalence means that


the execution is feasible according to the memory model.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 74


Access to a Single Location
Data and Control Dependences

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 75


Access Graphs: All accesses to the same location
Code (T1):
(a) x = 1
(b) x = 2
(c) y = 3
(d) z = x + y
(e) x = 4

Access graph for x:
(a) Wx1 --up--> (b) Wx2 --up--> (d) Rx2 --up--> (e) Wx4

Instead of po edges, we have up edges → an edge between accesses
to the same location in the same thread.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 76


PLSC = up ∪ rf ∪ fr ∪ ws
• PLSC is always respected regardless of the memory model.
• Hence, the access graph never has cycles.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 77


Example
Code:
T1: (a) x = 1; (b) t1 = x
T2: (c) x = 2
T3: (d) t2 = x; (e) t3 = x
Outcome: <t1,t2,t3> = <2,1,0>?

Access graph for x:
(a) Wx1 --up--> (b) Rx2,  (a) Wx1 --ws--> (c) Wx2,  (c) Wx2 --rf--> (b) Rx2,
(a) Wx1 --rf--> (d) Rx1,  (d) Rx1 --up--> (e) Rx0,  (d) Rx1 --fr--> (c) Wx2,
(e) Rx0 --fr--> (a) Wx1

Note the up edges and the presence of the cycle.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 78


Data and Control Dependences
Code:
T1: (a) t1 = x; (b) if (t1 == 1) { (c) y = 2 }
T2: (d) t2 = y; (e) fence; (f) x = 1
Outcome: t1 = 1, t2 = 2

Execution witness:
(a) Rx1 --po--> (b) if-stmt --po--> (c) Wy2,
(d) Ry2 --po--> (e) fence --po--> (f) Wx1,
(c) Wy2 --rfe--> (d) Ry2,  (f) Wx1 --rfe--> (a) Rx1

There is a clear breakdown of causality. Rx1 seems to be


happening without a preceding write. This is a thin air read.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 79


Causal Graph

(a) Rx1 --dep--> (b) if-stmt --dep--> (c) Wy2,
(d) Ry2 --po--> (e) fence --po--> (f) Wx1,
(c) Wy2 --rfe--> (d) Ry2,  (f) Wx1 --rfe--> (a) Rx1

Three kinds of global edges: rf, gpo (all global program order edges),
dep (dependences)

To stop thin air reads: acyclic(rf ∪ gpo ∪ dep)


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 80
Putting it all Together

Condition Test
Satisfies the memory The execution witness is acyclic.
model
PLSC holds for all All access graphs are acyclic.
memory locations
No thin air reads The causal graph is acyclic.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 81


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 82


Write Update Protocol

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 83


Enforce PLSC in Hardware

All writes to the same location


are seen in the same order

A write ultimately completes

Bus based Model

Cache 1 Cache 2 Cache 3 Cache 4

Shared bus

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 84


Write-Update Protocol

• Every cache line has three states

State Meaning
M Modified
S Shared with other sister caches
I Invalid

Event | Message Notation


Event Type Meaning
Rd Read request
Wr Write request
Evict Evict the block
Wb Write back data to the lower level
Update Update the copy of the block

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 85


Message Types
Message Type Meaning
RdX Generate a read miss message. Send it
on the bus/NoC if required.
WrX Generate a write miss message. Send it
on the bus/NoC if required.
WrX.u Get permission to write to a block that is
already present in the cache.
Broadcast Broadcast a write on the bus.
Send Send a copy of the block to the
requesting cache.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 86


Snoopy Protocols

• Cache coherence protocols that use buses are known


as Snoopy protocols.
• All messages are essentially broadcast messages.
• All the caches can read them (snoop on them).
• These are easy to design

There are serious scalability issues. Such systems do not


scale beyond 4 to 8 cores.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 87


Reads

• If the block is not present, send a read miss message on the bus.
• S state → If it is already present, don’t do anything. Just read it.
• The S state allows seamless evictions

I --(Rd | RdX)--> S
S: Rd | -
S --(Evict | -)--> I

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 88


Writes
• If the block is not present, send a write miss message on the bus.
Transition to the M state.
• If it is already present, don’t do anything. Just read it.
• M state → If we need to write, the write is broadcast first to the rest of
the sister caches.
• The M state does not allow seamless evictions. We write back the data
to the lower level while evicting → otherwise, the updates will be lost.

I --(Wr | WrX)--> M
M: Rd | -, Wr | Broadcast
M --(Evict | Wb)--> I

Broadcasting after every write is the issue.


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 89
All the transitions due to read, write, and evict events

I --(Rd | RdX)--> S           S: Rd | -,  S --(Evict | -)--> I
I --(Wr | WrX)--> M           S --(Wr | Broadcast)--> M
M: Rd | -, Wr | Broadcast     M --(Evict | Wb)--> I

The broadcast on every write is the power-hungry step.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 90


Events Received from the Bus

M state: Broadcast | Update,  RdX | Send,  WrX | Send
S state: Broadcast | Update,  RdX | Send,  WrX | Send

Given that only one sister cache can use the bus at any time, a
global order of writes is automatically enforced.
If the bus master disallows starvation, then all writes will ultimately
complete.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 91


Write Invalidate
Protocol

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 92


Single Writer Multiple Reader Model for each Cache Line

Single Writer Multiple Readers

Ensures a global order of writes to the same memory


location

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 93


Write Invalidate Protocol (MSI Protocol)

I --(Rd | RdX)--> S            S: Rd | -,  S --(Evict | -)--> I
S --(Wr | WrX.u)--> M          (upgrade the status of the line)
I --(Wr | WrX)--> M
M: Rd | -, Wr | -              (seamless reads and writes)
M --(Evict | Wb)--> I

• Only one cache can contain a block in the M state. At that
point, no other cache has a valid copy.
• Multiple caches can have the block in the S state. They can
all read from it.
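A minimal sketch of the processor-side MSI transitions listed above, written as a C state machine; the state and message names mirror the slide, and this is an illustrative model rather than actual controller logic.

typedef enum { STATE_I, STATE_S, STATE_M } msi_state_t;
typedef enum { EV_RD, EV_WR, EV_EVICT } core_event_t;

/* Returns the next state; *msg is set to the bus message to send, if any. */
msi_state_t msi_core_event(msi_state_t s, core_event_t e, const char **msg) {
    *msg = 0;
    switch (s) {
    case STATE_I:
        if (e == EV_RD) { *msg = "RdX"; return STATE_S; }       /* read miss  */
        if (e == EV_WR) { *msg = "WrX"; return STATE_M; }       /* write miss */
        return STATE_I;
    case STATE_S:
        if (e == EV_WR)    { *msg = "WrX.u"; return STATE_M; }  /* upgrade */
        if (e == EV_EVICT) return STATE_I;                      /* seamless eviction */
        return STATE_S;                                         /* reads are local */
    case STATE_M:
        if (e == EV_EVICT) { *msg = "Wb"; return STATE_I; }     /* write back */
        return STATE_M;                                         /* seamless reads and writes */
    }
    return s;
}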
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 94
Transitions because of Messages Received on the Bus

[Diagram: bus-side MSI transitions. If a line is in the modified state and another
cache requests the block, a state transition is necessary; a writeback is needed
when leaving the M state. The S state has seamless evictions.]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 95


MESI Protocol

• Insight: Consider a core that reads a block, and the block is never
shared. It will be read first in the shared state, and then an additional
message (WrX.u) is required to transition to the M state.
• Can we avoid this?
• Add an additional Exclusive (E) state → The block is only present in
one cache. If we are reading a block from the lower level, it enters
the E state.
• The rest remains the same.
• MSI protocol → MESI protocol

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 96


MESI (Read, Write, and Evict events)

I --(Rd | RdX, data from a sister cache)--> S
I --(Rd | RdX, data from the lower level)--> E      (new state)
I --(Wr | WrX)--> M
S --(Wr | WrX.u)--> M          S: Rd | -,  S --(Evict | -)--> I
E --(Wr | -)--> M              (seamless transition)
E: Rd | -,  E --(Evict | -)--> I
M: Rd | -, Wr | -
M --(Evict | Wb)--> I


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 97
MESI (messages received from the bus)

The moment another cache requests the data (read-only), we
need to transition to the S state.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 98


Important Engineering Questions

Who supplies data if a sister cache sends a read miss or


write miss message?

Answer: The caches that have a copy of the block arbitrate for the
bus. The one that gets access to the bus first sends the
data. The rest snoop this value, and then cancel their
request. This is an overhead in terms of time and power.

Do we have to write data back on an M → S transition?

Answer: In the interest of performance and power, it is best
if we can avoid it.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 99


MOESI (messages received from the bus)

• Whenever another cache requests data for a line in the E or M state, the


line moves to the O state.
• In the O state, it sends data to other requesting caches.
• We will discuss the Probe message in the next slide.
• From the S state, no data is sent.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 100
MOESI
[MOESI state transition diagram: states I, S, E, O, M, plus two temporary states,
St and Se, used while servicing a read miss from the I state (the transitions are
explained in the bullets on this and the next two slides)]

• Ignore the two temporary states – St and Se – for the time being.
• Main problem: An eviction from the O state, leaves us with a
state where there is no owner.
• This makes the transitions from the I state tricky

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 101


MOESI
[MOESI state transition diagram (same as above)]

• We first transition to the St state upon encountering a read miss


• If a Reply is received, then the state transitions to the shared(S) state.
• Otherwise, there is a timeout. A Probe message is sent. We transition
to the Se state.
• Again, if we receive a Reply from a cache (line in the S state), we
transition to the S state.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 102


MOESI
[MOESI state transition diagram (same as above)]

• If there is a timeout in the Se state, we read from the lower level.


• Transition to the E state.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 103


Directory
Protocol

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 104


A Scalable Solution: Directory Protocol

Do not have a bus


• Have a dedicated structure called a directory
• The directory co-ordinates the actions of the coherence protocol
• It sends and receives messages to/from all the caches and the lower
level in the memory hierarchy
• Scalable

Directory
Cache Cache

Cache Cache

Lower Level in the Memory Hierarchy

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 105


Structure of the Directory

Directory
Directory Entry

State Block Address List of Sharers

• State of the entry


• Block address
• List of sharers
• List of caches that contain a copy of the block.
• Typically, stored as a bit vector. If the ith bit is set, it means that
the ith cache has a copy of the block.
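A possible C layout for such a directory entry, assuming a fully mapped bit vector with one bit per cache; all names are illustrative.

#include <stdint.h>
#include <stdbool.h>

typedef enum { DIR_U, DIR_S, DIR_E } dir_state_t;   /* uncached, shared, exclusive */

typedef struct {
    dir_state_t state;        /* state of the entry */
    uint64_t    block_addr;   /* block address */
    uint64_t    sharers;      /* bit vector: bit i set => cache i has a copy */
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, int id)      { e->sharers |=  (1ULL << id); }
static inline void remove_sharer(dir_entry_t *e, int id)   { e->sharers &= ~(1ULL << id); }
static inline bool is_sharer(const dir_entry_t *e, int id) { return (e->sharers >> id) & 1ULL; }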

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 106


Design of the Directory

What are the messages that the directory receives?


• RdX
• WrX and WrX.u (regular and upgrade)
• Evict
What should the directory do:
• RdX → Locate a cache that contains the block (sharer) and fetch the block
• WrX and WrX.u → Ask all sharers to invalidate their lines, give exclusive
rights to the cache that wants to write
• Evict → Delete the cache from the list of sharers
• The basic protocol at the level of the caches remains the same. The state
transitions of directory entries are as follows.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 107


State Transitions of a Directory Entry (from the U and S
states)
P sends a message. LL → lower level, U → uncached.

From the U state:
U --(RdX | 1. sharers = {P}, 2. Read from LL)--> S
U --(WrX | 1. sharers = {P}, 2. Read from LL)--> E

From the S state:
S --(RdX | 1. Send RdX to one of the sharers and ask it to forward a copy,
           2. sharers += {P})--> S
S --(Evict | sharers -= {P})--> S
S --(sharers = {} | -)--> U
S --(WrX.u | 1. Send WrX.u to all sharers (other than P), 2. sharers = {P})--> E
S --(WrX | 1. Send WrX to all sharers, 2. Ask one of the sharers to forward a
           copy, 3. sharers = {P})--> E
State Transitions of a Directory Entry (from the E state)

E --(Evict | sharers = {})--> U
E --(RdX | 1. Send RdX to the sharer and ask it to forward a copy of the block,
           2. sharers += {P})--> S
E --(WrX | 1. Send WrX to the sharer, 2. Ask the sharer to forward a copy,
           3. sharers = {P})--> E

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 109


Enhancements to the Directory Protocol

Let us list some of the common problems associated with directories


• We need an entry for each block in a program’s working set (lot of
storage)
• In each directory entry, we need an entry for each constituent cache
(storage overheads)
• The directory itself can become a point of contention
• Let us look at some solutions to these problems.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 110


Distributed Directories
[Diagram: the physical address space is split into parts; each part is handled
by its own directory]

• Split the physical address space


• A directory handles all the requests for the part of the address
space it owns
• Resolves the issue of the single point of contention.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 111


List of Sharers

• How to maintain the list of sharers?


• Solution 1 [Fully mapped scheme]:
• If there are N processors, have a bit vector of N processors.
• Each block is associated with a bit vector of sharers

block address 11000000 10001011

Space-efficient Solutions
• Maintain a bit for a set of caches. Run a snoopy protocol inside the set.
• Store the ids of only k sharers. Have an overflow bit to indicate that there are
more than k sharers. In this case, every message needs to be broadcasted.
[Partially mapped scheme]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 112


Size of the Directory

The directory should ideally be as large as the number of blocks in the


programs’ working sets
Having an entry for every block in the physical address space is impractical
Practical Solution:
• Design a directory as a cache
• Keep the state of a limited number of blocks
• If an entry is evicted from the directory, invalidate it in all the caches

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 113


A Few More Issues

• False sharing → Consider a 64-byte block. It is possible that different


threads access different memory words within the block.
• The block will keep bouncing between caches.
• These are false sharing misses.
• Use a smart compiler to place data more intelligently, or pad the data
(see the sketch after this list).
• Race conditions
• In real hardware, there are a lot of interactions. It is possible that
multiple messages of different types for the same block might
arrive at the same time.
• Such concurrent events (race conditions) need to be handled.
Hence, a practical cache coherence protocol has close to a 100
states.
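A small C sketch of the padding idea mentioned above: keep each thread's data in its own cache line so that two writers never share a block (the 64-byte line size and the names are assumptions).

#define CACHE_LINE 64

/* Without padding, counters[0] and counters[1] would sit in the same 64-byte
   block, and the block would bounce between the two writing cores. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* one counter per cache line */
};

struct padded_counter counters[2];   /* one per thread; no false sharing */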

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 114


Critical Sections and Atomic Operations

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 115


Consider this piece of code

t1 = account.balance;
t2 = t1 + 100;
account.balance = t2;

Can this code be executed in parallel


by multiple threads?

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 116


What is the problem?

Thread 1                                  Thread 2
t1 = account.balance;   /* t1 = 100 */    t1 = account.balance;   /* t1 = 100 */
t2 = t1 + 100;          /* t2 = 200 */    t2 = t1 + 100;          /* t2 = 200 */
account.balance = t2;   /* = 200 */       account.balance = t2;   /* = 200 */

• Each line corresponds to a line of assembly code


• Assume corresponding lines execute in parallel.
• The final answer is wrong. It should be 300, it is 200.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 117


Solution: Use Locks

Only one thread can execute this piece of code at any


single point in time.

lock();
t1 = account.balance;
t2 = t1 + 100;
account.balance = t2;
unlock();

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 118


Use Atomic Instructions to Implement Locks

For implementing lock and unlock functions.


• We need atomic instructions
• Either execute completely or not at all. Nobody observes a partial
state.
• Most atomic operations also act as a fence.

Atomic exchange operation

xchg reg, <mem> Atomically exchanges their contents

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 119


Assembly Level Implementation of the Lock and Unlock
Functions

.lock:
    mov r1, 1
    xchg r1, 0[r0]
    cmp r1, 0
    bne .lock
    ret

.unlock:
    mov r1, 0
    xchg r1, 0[r0]
    ret

• r0 contains the lock address
• The xchg instruction contains a fence
• The lock variable contains 0 if the lock is not acquired, and 1 if it is acquired
• unlock: we store 0 at the lock address

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 120


Implementation of Atomic Exchange
xchg r1, [r2]
• temp = r1, r1 = [r2], [r2] = temp
It involves 3 steps
• 1 memory read + 1 memory write + register move
• All the operations need to happen atomically
• This is called a read-modify-write instruction (RMW)
Method
• Get exclusive access (M state) with write permissions for the memory
address in r2
• Perform the read-modify-write operation
• Do not respond to any other requests from the local cache, or other caches,
or the directory when the operation is in progress
• Respond to the directory or other caches only when the operation is over

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 121


Spin Locks

• We need to repeatedly try to acquire the lock


• This is a spin lock
• There are a lot of overheads:
• Every time we access the lock, it needs to be read in the M state.
• Too many invalidation messages

• Test and Exchange Lock

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 122


Test and Exchange Lock

.lock:
    mov r1, 1
.test:
    /* test if the lock is free */
    ld r2, 0[r0]
    cmp r2, 0
    bne .test

    /* attempt an exchange only when the lock is free */
    xchg r1, 0[r0]
    cmp r1, 0
    bne .test
    ret

.unlock:
    mov r1, 0
    xchg r1, 0[r0]
    ret
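A C11 sketch of the same test-and-exchange (test-and-test-and-set) idea using standard atomics; it mirrors the assembly above but is not the book's code.

#include <stdatomic.h>

atomic_int lock_var;   /* 0 = free, 1 = taken */

void lock(void) {
    while (1) {
        /* test: spin on an ordinary load while the lock is busy */
        while (atomic_load(&lock_var) != 0) { }
        /* attempt the exchange only when the lock looks free */
        if (atomic_exchange(&lock_var, 1) == 0)
            return;            /* old value was 0: lock acquired */
    }
}

void unlock(void) {
    atomic_store(&lock_var, 0);   /* store 0 at the lock address */
}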

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 123


Atomic Operation Example Explanation
Test and Set tas r1, 8[r0] if (8[r0] == 0) {
8[r0] = 1;
r1 = 1;
}
else r1 = 0;
Fetch and Increment fai r1, 8[r0] r1 = 8[r0];
8[r0] = r1 + 1;
Fetch and Add faa r1, r2, 8[r0] r1 = 8[r0];
8[r0] = r1 + r2;
Compare and Set cas r1, r2, r3, 8[r0] if (8[r0] == r3) {
8[r0] = r2;
r1 = 1;
} else r1 = 0;
Load linked (ll) ll r1, 8[r0] r1 = 8[r0] /* Use the ll instruction */
Store conditional (sc) .... ...
mov r2, 1 /* sc */
sc r3, r2, 8[r0] if (8[r0] is not written to since the last ll){
8[r0] = r2;
r3 = 1;
} else r3 = 0;

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 124


Lock-free Algorithms

• Let us write the same code without locks.


• If the thread goes to sleep after acquiring the lock, all the
threads wait.

while (1) {
t1 = account.balance;
t2 = t1 + 100;
if (CAS (account.balance, t1, t2))
break;
}

It is possible for a thread to starve – its CAS may keep failing.
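The same loop can be written with C11 atomics, where atomic_compare_exchange_strong plays the role of CAS; a sketch assuming the balance is kept in an atomic integer.

#include <stdatomic.h>

atomic_int balance;

void deposit_100(void) {
    while (1) {
        int t1 = atomic_load(&balance);
        int t2 = t1 + 100;
        /* CAS: succeeds only if balance still equals t1 */
        if (atomic_compare_exchange_strong(&balance, &t1, t2))
            break;
        /* otherwise another thread intervened; retry */
    }
}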

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 125


How do we eliminate starvation?

Answer:
Wait free algorithms

Basic Idea

A request, T, first finds another request, R,


that is waiting for a long time
T decides to help R
This strategy ensures that no request is
left behind
Also known as an altruistic algorithm

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 126


Summary of Consensus Numbers

Are all atomic operations equally powerful?

Consensus number: In the consensus problem, each thread proposes a value and
one among the proposed values is chosen. The consensus number is the maximum
number of threads that can solve this problem using a wait-free algorithm.

Type of Operation        Consensus Number

Atomic exchange          2
Test and Set             2
Fetch and add            2
CAS (Compare and Set)    ∞
LL/SC                    ∞

The consensus problem forms the theoretical basis of most concurrent algorithms.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 127


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 128


Memory Models

The memory model is dependent on the processor


architecture. If there are very aggressive optimizations,
then the memory model has to be very weak.

PLSC requires the ws and fr orders to be global.


All popular architectures follow PLSC, and many
disallow thin air reads.

The only orders that can be local are rf and po.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 129


Write-to-Read Order
Relaxation caused by LSQ forwarding and write buffers: the rfi edge is not global.

Code:
T1: (a) x = 1; (b) t1 = x; (c) y = 1
T2: (d) t2 = y; (e) t3 = x
Outcome: t1 = 1, t2 = 1, t3 = 0 ?

Execution witness:
(a) Wx1 --rfi--> (b) Rx1 --po--> (c) Wy1 --rfe--> (d) Ry1 --po--> (e) Rx0 --fr--> (a) Wx1

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 130


Non-atomic Writes
Relaxation caused by creating local tiles of caches: the rfe edge is not global.

Code:
T1: (a) x = 1
T2: (b) t1 = x; (c) y = 1
T3: (d) t2 = y; (e) t3 = x
Outcome: t1 = 1, t2 = 1, t3 = 0 ?

Execution witness:
(a) Wx1 --rfe--> (b) Rx1 --po--> (c) Wy1 --rfe--> (d) Ry1 --po--> (e) Rx0 --fr--> (a) Wx1

If writes are atomic, this behavior is not allowed, otherwise it is.

We are seeing writes to different locations in different orders.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 131


Write-to-Write Order
Relaxation caused by non-blocking caches.

Code:
T1: (a) x = 1; (b) y = 1
T2: (c) t1 = y; (d) t2 = x
Outcome: t1 = 1, t2 = 0 ?

Execution witness:
(a) Wx1 --po--> (b) Wy1 --rfe--> (c) Ry1 --po--> (d) Rx0 --fr--> (a) Wx1

Even if writes are atomic, this behavior is not allowed.

If the W W ordering is relaxed, we will see writes to different


locations in different orders.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 132


Read-to-Read Order

Relaxation caused by issuing loads out of order in the LSQ.

Code:
T1: (a) t1 = x; (b) t2 = y
T2: (c) y = 1
T3: (d) t3 = y; (e) x = 1
Outcome: t1 = 1, t2 = 0, t3 = 1 ?

Execution witness:
(c) Wy1 --rf--> (d) Ry1 --po--> (e) Wx1 --rf--> (a) Rx1 --po--> (b) Ry0 --fr--> (c) Wy1

If the R → R ordering is relaxed, we will observe this behavior.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 133


Read-to-Write Order
Relaxation caused by speculative writes.

Code:
T1: (a) t1 = x; (b) y = 1
T2: (c) t2 = y; (d) x = 1
Outcome: t1 = 1, t2 = 1 ?

Execution witness:
(a) Rx1 --po--> (b) Wy1 --rf--> (c) Ry1 --po--> (d) Wx1 --rf--> (a) Rx1

If the R → W ordering is relaxed, we will observe this behavior.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 134


Special case of rfi in SC.

Can the rfi relation be relaxed in SC?

The proof is there in the book

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 135


Summary of Memory Models

Relaxation WR WW RR RW rfe rfi

SC

TSO (Intel)

Processor
consistency
PSO

Weak Ordering/ RC

IBM PowerPC

ARM

Ordering is
relaxed

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 136


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 137


Increment a Counter

Let us say that all we want to do is

counter++

t1 = counter;
t2 = t1 + 1;
counter = t2;

or, with a single instruction: fetch_and_increment(counter)

Uses an atomic instruction
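In C11, for example, this maps to a single atomic read-modify-write; a minimal sketch assuming the counter is declared as an atomic integer.

#include <stdatomic.h>

atomic_long counter;

void increment(void) {
    atomic_fetch_add(&counter, 1);   /* fetch_and_increment(counter) */
}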

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 138


Code with Locks

If we do not want to write code using atomic instructions, we need to


encapsulate this code within a critical section.

lock()

t1 = counter;
t2 = t1 + 1;
counter = t2;

unlock()

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 139


Critical Sections

[Timeline: threads T1 and T2 repeatedly attempt to enter the critical section;
only one holds the lock at a time, the other fails to acquire it and retries, and
each successful lock is eventually followed by an unlock]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 140


Why do we need to use locks?

If there are no shared variables, do we need to use locks?

Answer: No.

When do we need locks then?

Answer: Two blocks of code need to be making conflicting and


concurrent accesses to the same address.

Conflicting accesses → At least one access is a write

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 141


Concurrent Accesses
Common sense meaning → at the same time.

Example 1. Code for T1 and T2 (both threads): x++;
Execution: T1: (a) Rx0, (b) Wx1;  T2: (c) Rx0, (d) Wx1
Execution witness: (a) Rx0 --po--> (b) Wx1,  (c) Rx0 --po--> (d) Wx1,
(a) Rx0 --fr--> (d) Wx1,  (c) Rx0 --fr--> (b) Wx1,  (b) Wx1 --ws--> (d) Wx1

Example 2. Code for T1: x = 1; y = 1;  Code for T2: while (y != 1) {}; t1 = x;
(y is a synch variable)
Execution: T1: (a) Wx1, (b) Wy1;  T2: (c) Ry1, (d) Rx1
Execution witness: (a) Wx1 --po--> (b) Wy1,  (c) Ry1 --po--> (d) Rx1,
(b) Wy1 --so/rf--> (c) Ry1,  (a) Wx1 --rf--> (d) Rx1

Note the so edge in the second witness.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 142
Data Races

Two accesses are said to be concurrent when there


is no path between them in the execution witness
that contains an so edge.

Data Race
A pair of conflicting and concurrent accesses to the
same regular variable constitute a data race.

If a piece of code does not have data races, what does it mean?

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 143


Does SC Imply Data-Race-Freedom?

Program:
T1: Wx1; S1              T2: Rx (t1 = x); S2
(S1 and S2 are synchronization operations)

Execution 1 (in SC): the read returns 0 (Rx0); the synch operations are ordered
by an so edge.
Execution 2 (in SC): the read returns 1 (Wx1 --rf--> Rx1). There is no path
containing an so edge between these two conflicting accesses, so they are
concurrent: this SC execution contains a data race.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 144


Does Data-Race-Freedom imply SC?

Refer to the book for the detailed proof.

Salient Points

• If there are two conflicting accesses, there will be an so edge on


at least one path between them in the execution witness.
• They will thus be ordered by the so edge.
• Let us add all the SC edges to the execution witness.
• A cycle implies that there is a cycle between synch operations.
• This is not possible. Synch operations follow SC.
• Proof by contradiction.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 145


What does having data races imply?

Theorem

If we have a data race in a program, we can construct an


SC execution that also has a data race.

Proof  refer to
the book

What does it imply? If an automated tool cannot construct an SC execution
that has a data race, then it means that the program is data race free.

Method to detect data races


McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 146
Summary of All the Results
Properties
SC does not imply data-race-freedom.
Data-race-freedom implies SC
Non-SC execution implies data races
If an automated tool cannot construct an SC execution
that has a data race, then it means that the program
is data race free.

Moral of the story If there are no data races, the


memory model doesn’t matter

• Programming languages need to define data-race-free (DRF)

memory models that provide synchronization primitives.
• Programmers need to use them to write properly synchronized programs.
• This means that all accesses to shared variables are protected with
critical sections such that there are no concurrent, conflicting accesses.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 147


Methods to Detect Data
Races

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 148


Data Race Detection Algorithm

How do we detect data races?

• We need an algorithm that tests a piece of code for data


races
• Exercise all possible control paths
• Create as many interleavings as possible
• A data race must show up in an SC execution
• Try to find it …

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 149


Notion of Lock Sets

L(v) Set of locks held by this memory location, v.

L(T) Set of locks held by thread, T.

A lock is identified by the lock address.

After each access we modify the lock set.

L(v) = L(v) ∩ L(T)

Compute an intersection

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 150


Standard Approach

1. Add annotations to multithreaded code. These annotations modify


the lock set.
2. The locksets of threads are initially empty.
3. For each variable, its lockset initially has all the locks.
4. When a thread acquires a lock, we add the lock to its lock set.
5. When it releases a lock, we remove the lock.
6. As the program executes, L(v) keeps getting updated.
7. At the end, if there is a variable with an empty lock set, it is
probably involved in a data race.

If the lock set becomes empty, no single lock consistently protects all accesses to the variable, which indicates a possible data race.
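The lockset update can be written down compactly. Below is a minimal C++ sketch of the idea; the names (LockSetTracker, onAccess, etc.) are illustrative and not taken from any particular tool, and locks and variables are identified here by plain integers.

#include <map>
#include <set>
#include <cstdio>

using LockSet = std::set<int>;                    // a lock is identified by its address (an int here)

struct LockSetTracker {
    std::map<int, LockSet> varLocks;              // L(v): candidate locks for each variable
    std::map<int, LockSet> threadLocks;           // L(T): locks currently held by each thread

    void acquire(int tid, int lock) { threadLocks[tid].insert(lock); }
    void release(int tid, int lock) { threadLocks[tid].erase(lock); }

    // On every access by thread tid to variable var: L(v) = L(v) ∩ L(T).
    void onAccess(int tid, int var) {
        if (varLocks.find(var) == varLocks.end()) {
            // The slide initializes L(v) to the set of all locks; seeding it with the
            // locks held at the first access is the usual, equivalent implementation.
            varLocks[var] = threadLocks[tid];
            return;
        }
        LockSet inter;
        for (int l : varLocks[var])
            if (threadLocks[tid].count(l)) inter.insert(l);
        varLocks[var] = inter;
        if (inter.empty())
            std::printf("Possible data race on variable %d\n", var);
    }
};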

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 151


Notion of False Positives

Note that we will detect many scenarios that are actually


not data races. For example, the basic algorithm will flag a
data race when we consider read-only variables.

A few more examples


Example Description
Initialization We typically initialize variables without
using locks
Read-only Written once (during initialization) and
variables read many times
Reader-writer Multiple threads read a variable
pattern concurrently.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 152


Modified State Diagram for each Variable

1. The first access has to be a write. We then move to the Exclusive state.
2. If a different thread reads it, move to the Shared state.
3. The Shared state allows reads (irrespective of the thread).
4. After a write, move to the Modified state.
5. In the Modified state, run the regular lock set algorithm.
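A minimal sketch of such a per-variable state machine follows; the state names mirror the list above, while the function and field names (onAccess, firstThread, etc.) are purely illustrative.

enum class VarState { Virgin, Exclusive, Shared, Modified };

struct VarInfo {
    VarState state = VarState::Virgin;
    int firstThread = -1;                        // thread that performed the first (write) access
};

// Illustrative transition function; isWrite distinguishes reads from writes.
void onAccess(VarInfo &v, int tid, bool isWrite) {
    switch (v.state) {
    case VarState::Virgin:                       // first access (a write): go to Exclusive
        v.state = VarState::Exclusive;
        v.firstThread = tid;
        break;
    case VarState::Exclusive:
        if (tid == v.firstThread) break;         // same thread: no checks needed yet
        v.state = isWrite ? VarState::Modified : VarState::Shared;
        break;
    case VarState::Shared:                       // reads are allowed from any thread
        if (isWrite) v.state = VarState::Modified;
        break;
    case VarState::Modified:
        // run the regular lockset algorithm here (see the earlier sketch)
        break;
    }
}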

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 153


Notion of Vector Clocks

• We can think of a multithreaded execution environment as a


classical distributed system
• Here, there is no notion of global time.
• Every thread has a local clock.
• The local clocks are updated any time there is an interaction
between threads.
• Assume there are n threads. Every thread maintains an n-element vector clock. Thread i's vector clock is Vi.
• Vi[j] is the best estimate that thread i has of thread j's local clock.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 154


Comparability of Clocks

• Two vector clocks are equal when

  Vi = Vj ⟺ ∀k, Vi[k] = Vj[k]

• Vector clocks are ordered as follows:

  Vi ≺ Vj ⟺ (Vi ≠ Vj) ∧ (∀k, Vi[k] ≤ Vj[k])
  Vi ≼ Vj ⟺ (Vi = Vj) ∨ (Vi ≺ Vj)

• Two vector clocks may not always be comparable.
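These relations translate directly into code. The following is a small illustrative C++ sketch (the function names are assumptions, not standard API); incomparable clocks are exactly those for which neither ≼ direction holds.

#include <vector>
#include <cstddef>

using VClock = std::vector<unsigned>;

// Vi = Vj  ⟺  ∀k, Vi[k] = Vj[k]
bool equalVC(const VClock &a, const VClock &b) { return a == b; }

// Vi ≼ Vj  ⟺  ∀k, Vi[k] ≤ Vj[k]
bool leqVC(const VClock &a, const VClock &b) {
    for (std::size_t k = 0; k < a.size(); k++)
        if (a[k] > b[k]) return false;
    return true;
}

// Vi ≺ Vj  ⟺  (Vi ≠ Vj) ∧ (∀k, Vi[k] ≤ Vj[k])
bool ltVC(const VClock &a, const VClock &b) { return !equalVC(a, b) && leqVC(a, b); }

// Neither Vi ≼ Vj nor Vj ≼ Vi: the clocks are incomparable.
bool incomparable(const VClock &a, const VClock &b) { return !leqVC(a, b) && !leqVC(b, a); }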

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 155


Notion of a Message

A message represents an interaction between two threads.

Say a thread i accesses a variable in memory and updates its state, and then thread j accesses the same variable (and its state). This interaction falls under the theoretical definition of a message from i to j.

[Figure: Event 1 (x = 1 in one thread) is connected by an hb edge, through the memory location, to Event 2 (t1 = x in another thread).]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 156


More about Events

Increment the local clock (e.g., of thread i) before sending a message and after receiving a message.

Let's say thread i sends a message to thread j. When j receives the message, it updates its clock as follows:

∀k, Vj[k] = max(Vi[k], Vj[k])

The receiver is thus at least as up to date as the sender. It also records an additional receive event.
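A minimal sketch of the send/receive clock updates, with illustrative function names:

#include <vector>
#include <algorithm>
#include <cstddef>

using VClock = std::vector<unsigned>;

// Sender i: increment its own entry before sending; the message carries a copy of Vi.
VClock onSend(VClock &Vi, int i) {
    Vi[i]++;
    return Vi;
}

// Receiver j: take the component-wise maximum, then record the receive event itself.
void onReceive(VClock &Vj, const VClock &msg, int j) {
    for (std::size_t k = 0; k < Vj.size(); k++)
        Vj[k] = std::max(Vj[k], msg[k]);
    Vj[j]++;
}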

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 157


Vector Clocks and Causality

Let’s say that there is a happens-before relationship between


events ei and ej. We use vector clocks to track the interaction.

Event Time
ei Event in thread i at time Vi
ej Event in thread j at time Vj

We have the following relationship:

Vi ≺ Vj ⟺ ei →hb ej

This holds because the receiver has all the sender's updates, and without a chain of happens-before edges two vector clocks cannot become comparable.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 158


Vector Clock based Algorithm
Symbol   Meaning
CT       Vector clock of the current thread
CL       Vector clock of the current lock
Rv       Read clock of variable v
Wv       Write clock of variable v
tid      Thread id

On acquiring a lock:
    CT[tid] ← CT[tid] + 1
    CT ← CT ∪ CL          Increment the thread clock, then merge it with the lock clock
    CL ← CT               and copy the result back to CL
    CT.inLock ← True

On releasing the lock:
    CT.inLock ← False
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 159
Read Operation

if (¬ CT.inLock) then
    CT[tid] ← CT[tid] + 1        Increment the local count of the thread clock
end
if (Wv ≼ CT) then
    Rv ← Rv ∪ CT                 All earlier writes to v happen before this read, so the
else                             read is race-free. Update the read clock.
    DeclareDataRace
end

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 160


Write Operation

if (¬ CT.inLock) then
    CT[tid] ← CT[tid] + 1        Increment the local count of the thread clock
end
if ((Wv ≼ CT) ∧ (Rv ≼ CT)) then
    Rv ← Rv ∪ CT                 All earlier reads and writes of v happen before this
    Wv ← Wv ∪ CT                 write, so the write is race-free. Update the read and
else                             write clocks.
    DeclareDataRace
end
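Putting the read and write rules together, the checks can be sketched as follows. The structure and function names are illustrative, and the join (∪) is implemented here as a component-wise maximum, which is an assumption about the notation used above.

#include <vector>
#include <algorithm>
#include <stdexcept>
#include <cstddef>

using VClock = std::vector<unsigned>;

struct ThreadState { VClock C; int tid; bool inLock; };   // CT, tid, CT.inLock
struct VarState    { VClock R, W; };                      // Rv and Wv

static bool precedesOrEqual(const VClock &a, const VClock &b) {   // a ≼ b
    for (std::size_t k = 0; k < a.size(); k++)
        if (a[k] > b[k]) return false;
    return true;
}
static void join(VClock &a, const VClock &b) {                    // a ← a ∪ b (component-wise max)
    for (std::size_t k = 0; k < a.size(); k++)
        a[k] = std::max(a[k], b[k]);
}

void onRead(ThreadState &T, VarState &v) {
    if (!T.inLock) T.C[T.tid]++;                    // unsynchronized access: new local event
    if (precedesOrEqual(v.W, T.C)) join(v.R, T.C);  // Wv ≼ CT: all earlier writes happen before
    else throw std::runtime_error("data race (read)");
}

void onWrite(ThreadState &T, VarState &v) {
    if (!T.inLock) T.C[T.tid]++;
    if (precedesOrEqual(v.W, T.C) && precedesOrEqual(v.R, T.C)) {  // Wv ≼ CT and Rv ≼ CT
        join(v.R, T.C);
        join(v.W, T.C);
    } else {
        throw std::runtime_error("data race (write)");
    }
}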

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 161


Contents

1. Parallel Programming

2. Theoretical Foundations

3. Cache Coherence

4. Memory Models

5. Data Races

6. Transactional Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 162


Lock and Unlock functions have a problem
void updateBalance(int amount, Account account) {
    lock();

    int temp = account.balance;
    temp = temp + amount;
    account.balance = temp;

    unlock();
}

• If we have one lock, then there is no parallelism in the system.


• We can afford much more parallelism.
• The only constraint is:
• We should not have two parallel accesses to the same account.

Disjoint access parallelism

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 163


Code that Allows Disjoint Access Parallelism

void updateBalance(int amount, Account account) {
    account.lock();                 /* account-specific lock */

    int temp = account.balance;
    temp = temp + amount;
    account.balance = temp;

    account.unlock();
}

• This is a scalable solution.


• But, there is a problem. If we have code where we
acquire multiple locks, we may have a deadlock.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 164


Deadlock Situation
Thread A acquires lock A and then waits for lock B.
Thread B acquires lock B and then waits for lock A.
Neither thread can proceed: deadlock.
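A minimal C++ sketch of how such a deadlock arises with per-account locks; the transfer function and Account type are illustrative, not from the slides.

#include <mutex>

struct Account { std::mutex m; int balance = 0; };

// Thread A calls transfer(a, b, 100) while thread B calls transfer(b, a, 50).
// A holds a.m and waits for b.m; B holds b.m and waits for a.m: a deadlock.
void transfer(Account &from, Account &to, int amount) {
    std::lock_guard<std::mutex> first(from.m);
    std::lock_guard<std::mutex> second(to.m);   // second acquisition may block forever
    from.balance -= amount;
    to.balance   += amount;
}

Acquiring the locks in a fixed global order (or with std::scoped_lock, which locks both without deadlocking) breaks the cycle; transactional memory automates exactly this kind of reasoning for the programmer.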
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 165
Create a Transaction

Provides disjoint access parallelism

Automatically manages all the locks and avoids deadlocks.

void updateBalance(int amount, Account account) {
    atomic {            /* an atomic transaction */
        int temp = account.balance;
        temp = temp + amount;
        account.balance = temp;
    }
}

The atomic block implements a transaction

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 166


Properties of a Transaction

Atomicity: Either the entire transaction completes or fails.

Consistency: A consistent state is a valid state that is as per a


given set of specifications (consistency model). The property of
consistency says that if the full system was consistent before a
transaction started, it should be consistent after it ended.

Isolation: It appears that while the transaction was executing,


regular instructions or other transactions were not executing.

Durability: Once a transaction has finished, its results are written


to stable storage.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 167


Transactional Memory

A memory system that provides support for transactions


is known as a transactional memory (TM).

Two Types

Hardware TM Software TM

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 168


Basics of Transactional Memory

read set: the set of variables that are read by the transaction
write set: the set of variables that are written by the transaction

Term Meaning
Ri Read set of transaction i
Wi Write set of transaction i
Rj Read set of transaction j
Wj Write set of transaction j

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 169


When do transactions conflict?

There is a conflict if and only if

(Wi ∩ Wj ≠ φ) OR (Wi ∩ Rj ≠ φ) OR (Ri ∩ Wj ≠ φ)
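As a small illustrative sketch (not from the book), the conflict condition can be checked directly on the read and write sets:

#include <set>

using VarSet = std::set<const void*>;    // variables identified by their addresses

static bool intersects(const VarSet &a, const VarSet &b) {
    for (const void *p : a)
        if (b.count(p)) return true;
    return false;
}

// Transactions i and j conflict iff Wi ∩ Wj, Wi ∩ Rj, or Ri ∩ Wj is non-empty.
bool conflict(const VarSet &Ri, const VarSet &Wi,
              const VarSet &Rj, const VarSet &Wj) {
    return intersects(Wi, Wj) || intersects(Wi, Rj) || intersects(Ri, Wj);
}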

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 170


Abort and Commit

Commit
• A transaction completed without any conflicts
• Finished writing its data to main memory

 Abort
◦ A transaction could not complete
due to conflicts
◦ Did not make any of its writes visible

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 171


What happens after an abort?

• The transaction restarts and re-executes


• Might wait for a random duration of time to minimize future
conflicts

• do {
      /* transaction body */
  } while (!Tx.commit());

This is automatically handled by the


transactional memory system

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 172


Basics of Concurrency Control

A conflict occurs when the read-write sets overlap


A conflict is detected when the TM system
becomes aware of it
A conflict is resolved when the TM system either
• delays a transaction
• aborts it

A conflict thus passes through three stages: occurrence → detection → resolution.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 173


Pessimistic vs Optimistic Concurrency Control

Pessimistic concurrency control: occurrence, detection, and resolution happen together; the conflict is detected and resolved as soon as it occurs.

Optimistic concurrency control: detection and resolution are deferred; they happen some time after the conflict has occurred (typically at commit time).

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 174


Version Management

Eager version management
• Write directly to memory
• Maintain an undo log
• On a commit: discard (flush) the undo log
• On an abort: write back the undo log to restore the old values

Lazy version management
• Write to a buffer (redo log)
• On a commit: transfer (write back) the buffer to memory
• On an abort: discard (flush) the redo log
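The two policies can be contrasted with a minimal sketch; the EagerTx/LazyTx types and their members are assumptions made for the example only.

#include <unordered_map>

// Eager versioning: write in place, remember the old value in an undo log.
struct EagerTx {
    std::unordered_map<int*, int> undoLog;
    void write(int *addr, int val) {
        if (undoLog.find(addr) == undoLog.end()) undoLog[addr] = *addr; // save old value once
        *addr = val;                                                    // write directly to memory
    }
    void commit() { undoLog.clear(); }          // commit: just discard (flush) the log
    void abort() {                              // abort: write back the undo log
        for (auto &e : undoLog) *e.first = e.second;
        undoLog.clear();
    }
};

// Lazy versioning: buffer writes in a redo log, publish them only at commit time.
struct LazyTx {
    std::unordered_map<int*, int> redoLog;
    void write(int *addr, int val) { redoLog[addr] = val; }
    int read(int *addr) {                       // reads must check the redo log first
        auto it = redoLog.find(addr);
        return (it != redoLog.end()) ? it->second : *addr;
    }
    void commit() {                             // commit: write back the redo log
        for (auto &e : redoLog) *e.first = e.second;
        redoLog.clear();
    }
    void abort() { redoLog.clear(); }           // abort: discard (flush) the redo log
};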

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 175


Conflict Detection

Eager
• Check for conflicts as soon as a
transaction accesses a memory
location

Lazy
• Check at the time of
committing a transaction

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 176


Semantics of Transactions

Serializable
• Sequential consistency at the level of transactions
Strictly Serializable
• The sequential ordering is consistent with the real time ordering
• This means that if Transaction A starts after Transaction B ends,
it should be ordered after it in the equivalent sequential ordering
• For concurrent transactions, their ordering does not matter
Opacity
• Even aborted transactions need to see a consistent state – one
produced by only committed transactions

What about aborted transactions?

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 177


Opacity

Thread 1 Thread 2

atomic { atomic {
t1 = x; x = 5;
t2 = y; y = 5;
while (t1 != t2) {} }
}
• If one transaction executes after the other, x will always be equal to y.
• Assume optimistic concurrency control: conflicts are resolved at the end.
• Assume Thread 1 reads x = 0.
• Then the transaction on Thread 2 finishes (x = 5, y = 5).
• The transaction on Thread 1 must eventually be aborted (it would read y = 5).
• However, having read x = 0 and y = 5, it is stuck in the while loop and never reaches the end.
• Opacity will not allow y to be read as 5 after x has been read as 0.
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 178
Mixed Mode Accesses: Transactional and Non-Transactional

Single Lock Atomicity (SLA)


• Assume all the transactions are protected by a single lock
• Transactions first acquire a hypothetical (global) lock
• Same definition of data races
• Reduces concurrency
Disjoint Lock Atomicity (DLA)
• Uses more locks than SLA: (let’s say one per variable)
• We a priori need to know the locks that a transaction is going to use
Transactional Sequential Consistency (TSC)
• We can order all transactions (committed or aborted) and regular
instructions in a sequential order
• All the instructions are in program order (incl. within transactions)
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 179
Software Transactional Memory

choices

Concurrency Control
• Optimistic or Pessimistic
Version Management
• Lazy or Eager
Conflict Detection
• Lazy or Eager

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 180


Support Required

Augment every transactional object/variable with metadata that records:
1. The transaction that has locked the object
2. Whether the object has been accessed for a read or a write

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 181


Maintaining Read–Write Sets

Each transaction maintains a list of locations that it has


• read in the read-set
• written in the write-set
Every memory read or write operation is augmented
• readTX (read, and enter in the read set)
• writeTX (write, and enter in the write set, make changes
to the undo/redo log)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 182


Bartok STM
Eager version management, lazy conflict detection

Every variable has the following fields


• version
• value
• lock

[Figure: layout of a transactional variable with its value, version, and lock fields.]

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 183


Read Operation

Read Operation

Record the version of the


variable

Add the variable to the read set

Read the value

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 184


Write Operation

Lock the variable


Abort if it is already locked

Add the old value to the


undo log

Write the new value

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 185


Commit Operation

For each entry in the read set:
    Check if the version of the variable is still the same.
    If not, abort.

For each entry in the write set:
    Increment the version.
    Release the lock.
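The three operations fit together as in the following illustrative C++ sketch of a Bartok-style design (eager version management via an undo log, commit-time validation of versions). All type and member names here are assumptions made for the example, not the actual Bartok API.

#include <vector>
#include <utility>
#include <mutex>

struct TVar { int value = 0; unsigned version = 0; std::mutex lock; };

struct BartokTx {
    std::vector<std::pair<TVar*, unsigned>> readSet;   // (variable, version seen at read time)
    std::vector<std::pair<TVar*, int>>      undoLog;    // (variable, old value) -> also the write set

    int read(TVar &v) {                        // record the version, then read the value
        readSet.push_back({&v, v.version});
        return v.value;
    }

    bool write(TVar &v, int val) {             // lock the variable; abort if already locked
        bool ownedByUs = false;
        for (auto &e : undoLog) if (e.first == &v) { ownedByUs = true; break; }
        if (!ownedByUs) {
            if (!v.lock.try_lock()) { abort(); return false; }
            undoLog.push_back({&v, v.value});  // eager versioning: save the old value
        }
        v.value = val;                         // write the new value in place
        return true;
    }

    bool commit() {
        for (auto &e : readSet)                // lazy conflict detection: validate versions
            if (e.first->version != e.second) { abort(); return false; }
        for (auto &e : undoLog) {              // bump versions and release the locks
            e.first->version++;
            e.first->lock.unlock();
        }
        readSet.clear(); undoLog.clear();
        return true;
    }

    void abort() {                             // restore old values, release the locks
        for (auto it = undoLog.rbegin(); it != undoLog.rend(); ++it) {
            it->first->value = it->second;
            it->first->lock.unlock();
        }
        readSet.clear(); undoLog.clear();
    }
};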

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 186


Pros and Cons

Pros
• Simple
• Reads are simple
• Provides a strong semantics for transactions

Cons
• Does not provide opacity
• Uses locks
• Writes are slow

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 187


Subtle Points

• With an undo log, aborts are more expensive than commits.
• A transaction can read intermediate values written by other transactions.
• Opacity is thus not guaranteed.
• Locks are held for a long time: from the lock acquisition (first write) until the commit.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 188


TL2 STM

• Uses lazy version management  redo log


• Uses a global timestamp (globalClock)
• Locks variables only at commit time
• Every transaction does the following (atomically) when it starts:

globalClock++;
Tx.rv = globalClock;

This sets the transaction's read version (timestamp), Tx.rv.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 189


Read Operation
Read operation: read(Tx, obj)

Is obj in the redo log?
• Yes: return the value in the redo log.
• No: collect a snapshot; the value and timestamp must be gathered atomically:

    v1 = obj.timestamp;
    result = obj.value;
    v2 = obj.timestamp;
    if ((v1 != v2) || (v1 > Tx.rv) || obj.lock) abort();
    addToReadSet(obj);
    return result;

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 190


Write Operation

Add an entry to the redo log if required.

Perform the write (in the redo log).

Writes are therefore purely local until commit time.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 191


Commit Operation

For each entry in the write set:
    Lock the object (on failure → abort).

Tx.wv = ++globalClock (atomic)

For each entry e in the read set:
    if (e.timestamp > Tx.rv) abort.

Write back the redo log.

For each entry e in the write set:
    e.timestamp = Tx.wv
    Release the lock.
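An illustrative sketch of this commit sequence follows; the type and member names are assumptions made for the example, not the actual TL2 code.

#include <atomic>
#include <mutex>
#include <vector>
#include <unordered_map>

static std::atomic<unsigned> globalClock{0};

struct TObj { int value = 0; unsigned timestamp = 0; std::mutex lock; };

struct TL2Tx {
    unsigned rv = 0, wv = 0;
    std::vector<TObj*> readSet;                 // objects read (outside the redo log)
    std::unordered_map<TObj*, int> redoLog;     // buffered (lazy) writes

    void begin() { rv = ++globalClock; }        // Tx.rv, as on the earlier slide

    bool commit() {
        std::vector<TObj*> locked;
        for (auto &e : redoLog) {               // 1. lock every object in the write set
            if (!e.first->lock.try_lock()) { release(locked); return false; }
            locked.push_back(e.first);
        }
        wv = ++globalClock;                     // 2. Tx.wv = ++globalClock (atomic)
        for (TObj *o : readSet)                 // 3. validate the read set against rv
            if (o->timestamp > rv) { release(locked); return false; }
        for (auto &e : redoLog) {               // 4. write back the redo log
            e.first->value = e.second;
            e.first->timestamp = wv;            // 5. stamp with wv
        }
        release(locked);                        // 6. release the locks
        return true;
    }

    void release(std::vector<TObj*> &locked) {
        for (TObj *o : locked) o->lock.unlock();
    }
};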
McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 192
Pros and Cons

Pros
• Simple
• Provides opacity
• Holds locks for a shorter amount of time

Cons
• A redo log is slower
• Commits are more expensive than aborts
• Uses locks

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 193


Subtle Points

• We use two timestamps per transaction: Tx.rv and Tx.wv.
• We have Tx.rv < Tx.wv.
• We first write the variables to permanent state.
• Then, we update their timestamps.
• If another transaction sees an updated timestamp, it is sure that the variable has been written to.
• Finally, we release the locks. This allows later reads.

Provides opacity.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 194


Hardware Transactional
Memory

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 195


Case for Hardware

• STM systems do not handle non-transactional accesses


• Acquiring and releasing locks is expensive
• Maintaining undo and redo logs is difficult
• Hardware is much faster …

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 196


Hardware Support (mostly based on LogTM)

ISA Support
• Add three new instructions: begin, abort, and commit
Version Management
• HW schemes mostly use eager version management – undo log
• The log has its dedicated set of addresses in virtual memory

Each cache line is augmented with R and W bits:
• If R = 1, some word in the line has been read.
• If W = 1, some word in the line has been written.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 197


Conflict Detection

Use the coherence protocol to detect conflicts

1. Let us say core D has a miss. It sends a read-miss to the directory.


2. The directory forwards it to core C.
3. C detects a conflict.
4. It sends a nack message to D (via the directory).
5. The transaction at D aborts.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 198


Eviction

What if there is an eviction?

• If there is an eviction in the M state, the state at the directory


is set to M@C. C is the number of the core that evicted the block.
• Sets the overflow bit to 1.
• Let us assume silent evictions from the S state.

• Whenever the directory gets a request for a block in the M@C


state, it forwards it to core C.
• Core C may not have the block in the cache.
• It will however infer a conflict because of the state of the block in
the directory (the block is in its write set).
• If it gets a normal request (not @C), it will assume that the line
was in the S state.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 199


Example

1. Core D sends a read-miss message to the directory.


2. The directory forwards it to core C with the (M@C) directive
3. Core C infers a conflict
4. Sends a nack back to D (via the directory)

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 200


Subtle Issues

• Assume a block has been evicted.


• If there is a conflicting request by a non-transactional access, the
current transaction has to abort
• If a transaction aborts, then we need to clean the read/write sets.
Some blocks might have gone to lower levels of the memory
hierarchy.
• Since we restore the entire undo log, the correct state of the blocks
is restored (in the L1 cache). Wrong values at lower levels do not
matter.
• After a transaction finishes, all the R, W, M@C, and overflow bits
need to be cleared.

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 201


Conclusion

There are two paradigms in parallel programming: shared memory and message passing.

Per-location sequential consistency (PLSC) is followed by all systems today. It translates to the axioms of coherence.

A memory model is determined by two factors: write atomicity and program order. It is specified by the po, rf, fr, and ws relations.

Cache coherence protocols enforce the axioms of coherence.

If a program is data-race-free, its execution is in SC.

Transactional memory is a very easy-to-use paradigm for writing data-race-free code (wrapped in atomic blocks).

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 202


The End

McGraw-Hill | Advanced Computer Architecture. Smruti R. Sarangi 203
