Chapter 9: Multicore Systems
1. Parallel Programming
2. Theoretical Foundations
3. Cache Coherence
4. Memory Models
5. Data Races
6. Transactional Memory
[Figure: SPECint 2006 scores plotted against processor release date (1999-2012).]
[Figure: a big problem is decomposed into smaller sub-problems, and each sub-problem is mapped to a core.]

Approach: Shared memory
/* initialise arrays */
...

/* parallel section (OpenMP code) */
#pragma omp parallel
{
    /* get my processor id */
    int myId = omp_get_thread_num();

    /* add my portion of numbers (the arrays numbers[] and partialSums[] are assumed) */
    int startIdx = myId * SIZE/N;
    int endIdx = startIdx + SIZE/N;
    for (int idx = startIdx; idx < endIdx; idx++)
        partialSums[myId] += numbers[idx];
}

/* sequential section */
for (int idx = 0; idx < N; idx++)
    result += partialSums[idx];
[Figure: fork-join execution over time. A sequential section performs the initialization and spawns child threads; the child threads run in parallel; execution then returns to a sequential section.]
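The partial-sum pattern above can also be expressed with OpenMP's reduction clause, which gives every thread a private copy of the accumulator and combines the copies at the end of the parallel region. A minimal sketch, assuming an array numbers of SIZE elements (both names are assumptions carried over from the example above):

    #include <omp.h>

    /* Sum SIZE numbers in parallel; OpenMP keeps a private partial sum per
       thread and adds all of them into result at the end of the loop. */
    int parallelSum(int *numbers, int SIZE) {
        int result = 0;
        #pragma omp parallel for reduction(+:result)
        for (int idx = 0; idx < SIZE; idx++)
            result += numbers[idx];
        return result;
    }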
Function          Semantics
send(pid, val)    Send the integer val to the process with id pid.
receive(pid)      Receive an integer from process pid. This is a blocking call.
                  If pid is equal to ANYSOURCE, the receive function returns with
                  the value sent by any process.
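As an illustration of how these primitives might be used, here is a hedged sketch of a message-passing parallel sum: every process computes a partial sum, non-zero processes send theirs to process 0, and process 0 accumulates N-1 received values. The helper computePartialSum() and the constant N are assumptions; send, receive, and ANYSOURCE are the primitives defined above.

    #include <stdio.h>

    void parallelSum(int myId, int N) {
        int partial = computePartialSum(myId);   /* assumed helper: my portion of the sum */
        if (myId != 0) {
            send(0, partial);                    /* send my partial sum to process 0 */
        } else {
            int total = partial;
            for (int i = 1; i < N; i++)
                total += receive(ANYSOURCE);     /* blocking receive from any process */
            printf("sum = %d\n", total);
        }
    }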
Shared memory
• Easy to program.
• Issues with scalability
• The code is portable across machines.
Message passing
• Hard to program.
• Scalable
• The code may not be portable across machines.
For P processors,

    T_par = T_seq × ( f_seq + (1 − f_seq)/P )

    Speedup S = T_seq / T_par = 1 / ( f_seq + (1 − f_seq)/P )

As P → ∞, the speedup approaches 1/f_seq. A small numerical illustration follows.
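A minimal sketch that evaluates the speedup formula above; the numbers in the comment are just one worked example (f_seq = 0.1).

    #include <stdio.h>

    /* Speedup as per the formula above: S = 1 / (f_seq + (1 - f_seq)/P) */
    double speedup(double f_seq, int P) {
        return 1.0 / (f_seq + (1.0 - f_seq) / P);
    }

    int main(void) {
        /* with f_seq = 0.1: S(4) ~ 3.1, S(16) ~ 6.4, and S never exceeds 1/0.1 = 10 */
        printf("%.2f %.2f %.2f\n", speedup(0.1, 4), speedup(0.1, 16), speedup(0.1, 1024));
        return 0;
    }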
Flynn's classification:

    SISD    SIMD
    MISD    MIMD

(MISD: multiple instruction streams, single data stream; MIMD: multiple instruction streams, multiple data streams)
SISD Processors
• A regular single core processor
SIMD Processors
• Single instruction stream, multiple data streams
• Vector instruction set. Example: add v1, v2, v3
• v1, v2, and v3 are vector registers
• They pack multiple integers (let’s say 4)
• A pairwise addition is performed
• Summary: We can do 4 additions using just a single
instruction
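A concrete x86 analogue of the generic add v1, v2, v3 example: SSE2 packs four 32-bit integers into one 128-bit register and adds them pairwise with a single instruction. The choice of SSE2 here is an assumption for illustration, not the vector ISA used in the book.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* pack four 32-bit integers into each 128-bit vector register */
        __m128i v2 = _mm_set_epi32(4, 3, 2, 1);       /* lanes: 1, 2, 3, 4  */
        __m128i v3 = _mm_set_epi32(40, 30, 20, 10);   /* lanes: 10, 20, 30, 40 */
        __m128i v1 = _mm_add_epi32(v2, v3);           /* one instruction, four pairwise adds */

        int out[4];
        _mm_storeu_si128((__m128i *)out, v1);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 11 22 33 44 */
        return 0;
    }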
MISD Processors
• Used in airplanes: Run the same program on three
separate processors that have different instruction
sets. Compare the outputs and decide by voting.
MIMD Processors
• SPMD Processors: Single program, multiple data.
Run the same program on different cores with
different data streams (most common)
• MPMD Processors: Consider a processor with
different accelerators. Each core or accelerator
runs a different program.
Hardware Threads

Definition
• A hardware thread is a separate execution context (program counter and register state) supported directly by a core.
• Hardware threads are not the same as software threads, which share the virtual address space of a process.
Coarse-grained Multithreading
[Figure: four threads (1-4) sharing one core in a round-robin fashion.]
• Run instructions from thread 1 for k cycles.
• Then switch to thread 2, then to thread 3, thread 4, thread 1, ...
• Separate program counters, ROBs and retirement register files
per thread
• Each instruction packet, rename table entry, LSQ entry, physical
register is tagged with the thread id
• If there is a high-latency event like an L2 miss, the core can
switch to a new thread.
[Figure: issue slots over time; the core executes thread 1, then switches to threads 2, 3, and 4 in turn.]
2. Theoretical Foundations
Make a set of small caches act like one single, large cache.
Assumptions
    Thread 1        Thread 2
    x = 1           t1 = y
    y = 1           t2 = x
• Thread 1 sets x to 1
• It sends an update message on the NoC. The message gets caught
in congestion.
• Thread 1 sets y to 1. The corresponding message on the NoC is
swiftly delivered.
• Thread 2 reads y to be 1.
• Thread 2 reads x as 0 (its initial value)
    Thread 1        Thread 2
    x = 1           y = 1
    t1 = y          t2 = x
[Figure: an observer attached to the core watches the queue of read and write operations and records the resulting execution.]

Sequential Execution          All the operations are ordered.

Legal Sequential Execution    Every read operation returns the value of the latest
                              write operation to the same address.
• Even in this case, the observer on the core expects to see a legal
sequential execution
Term Meaning
Rx1 Read the value of x as 1
Wy2 Set y = 2
Multiple threads: one observer per thread. Each observer records the
local execution history of a thread.
Expression    Meaning
P|T           All the operations issued by thread T (in the same order).
              This is an ordered sequence.

There is a one-to-one, order-preserving mapping between the two sequences of
operations: for all T, operation 1 maps to 1', 2 to 2', ..., 7 to 7'.
    T1      T2      T3              T1      T2      T3
    Wx1     Rx0     Rx1             Wx1     Rx1     Rx2
    Wx2     Rx2     Rx2             Wx2     Rx2     Rx1
PLSC
If we restrict an execution to all the accesses to a single variable (memory location), the resulting execution must always be in SC. This is known as the PLSC (Per-Location Sequential Consistency) constraint. It is needed to provide the illusion of a single memory location, even if we have a distributed cache.
    T1      T2      T3
    Wx1     Rx1     Rx2
    Wx2     Rx2     Rx1
Order            Implication
Read → Read      Does not matter.
Write → Read     Given that a core can read another core's write early, while other
                 cores may not even see the write, this order may not be global all
                 the time. A core does not know when a write reaches the sister
                 caches present in other cores. Hence, it may not agree about the
                 values read by other cores; this order is local.
Write → Write    If this order is not global, then the same variable will end up
                 with multiple final states. This is not allowed. Hence, in all
                 systems this order is global.
Read → Write     Writes are globally ordered. Assume one core records Wi → Ri → Wj.
                 All the cores will record Ri → Wj, because the core that issued Ri
                 could not have seen Wj, else it would have read a different value.

Note: the Write → Read order is global only for machines with atomic writes.
Axioms of Coherence

Write Serialization    Writes to the same location are globally ordered.
Write Propagation      A write is eventually seen by all the threads.
SC Using Synchronisation Instructions

    T1                    T2
    value = 3             while (status != 1) {}
    status = 1            temp = value

With fences:

    T1                    T2
    value = 3             while (status != 1) {}
    fence                 fence
    status = 1            temp = value
• Regardless of the underlying memory model, this code will always work.
• Memory barriers
  • A fence is an example of a memory barrier.
  • A barrier specifies rules of completion for instructions before and after the barrier (in program order).
  • Store barrier: ensures an ordering between store instructions before and after the barrier instruction.
• Release instruction: the release instruction can only complete after all the instructions before it have fully completed. Note that the release instruction allows instructions after it to execute before it has completed.
A language-level sketch of the fence-based version follows.
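This is a minimal sketch of the value/status example using C11 atomics and explicit fences. It mirrors the table above; the specific memory orders are one possible realisation at the language level, not the book's architecture-level fence instruction.

    #include <stdatomic.h>

    int value;                      /* regular data */
    atomic_int status = 0;          /* flag used for synchronisation */

    void producer(void) {           /* corresponds to T1 above */
        value = 3;
        atomic_thread_fence(memory_order_release);         /* value completes before status is set */
        atomic_store_explicit(&status, 1, memory_order_relaxed);
    }

    int consumer(void) {            /* corresponds to T2 above */
        while (atomic_load_explicit(&status, memory_order_relaxed) != 1) { }
        atomic_thread_fence(memory_order_acquire);          /* status is read before value is read */
        return value;               /* guaranteed to see 3 */
    }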
Parallel execution, and an equivalent sequential execution (execution witness):

    T1          T2             Sequential execution:
    x = 1       y = 1          1: x = 1
    t1 = y      t2 = x         2: t1 = y
                               3: y = 1
                               4: t2 = x

    Outcome: <t1, t2> = <0, 1>

The execution witness is a graph:
• The nodes are the instructions
• We can have edges between the instructions based on
the orders that are guaranteed by the memory model.
• We can have two kinds of edges: global and local
• These are happens-before (hb) edges. An edge A → B means that event A happened first and event B happened after that; B could occur immediately later or after a very long time.
• A global hb edge (ghb) is agreed to by all threads.
Edge    Description
poRW    Read → Write program order edge      (not always global)
poRR    Read → Read program order edge       (not always global)
poWR    Write → Read program order edge      (global only in SC)
poWW    Write → Write program order edge
poIS    read/write → synch operation edge    (global)
poSI    synch operation → read/write edge    (global)
rf      Reads-from edge: from a write to a read that returns its value.
rfi     An rf edge where the read and write ops are in the same thread.
[Figures: example execution witnesses. One shows an rfe edge between Wx1 in T1 and Rx1 in T2 for the code (a) x = 1 and (b) t1 = x, with outcome t1 = 1. Others illustrate ws (write serialization), fr (from-read), so (synchronization order), rf, and po edges, together with assumptions of the form "let <edge> be global".]
An acyclic execution witness ⟹ there is a sequential execution that respects ghb.
• The sad part is that it may not be legal if writes are not atomic.
Instead of po edges, we have up edges: an up edge connects accesses to the same location in the same thread.

[Figure: example access graphs with up, rf, and fr edges over accesses such as Wx2, Wx4, Rx0, and Rx2, used to check outcomes such as <t1,t2,t3> = <2,1,0>? and t1 = 1, t2 = 2.]
[Figure: execution witnesses with rfe, po, and dep edges; dependences (dep) arise from constructs such as an if-statement, and a fence adds program-order edges.]
Three kinds of global edges: rf, gpo (all global program order edges),
dep (dependences)
Condition                              Test
Satisfies the memory model             The execution witness is acyclic.
PLSC holds for all memory locations    All access graphs are acyclic.
No thin-air reads                      The causal graph is acyclic.
3. Cache Coherence
Shared bus
State Meaning
M Modified
S Shared with other sister caches
I Invalid
• If the block is not present, send a read miss message on the bus.
• S state: if the block is already present, don't do anything; just read it.
• The S state allows seamless evictions.
[State diagram: write-update protocol; edges are labelled "event | action". Local events: I --Rd | RdX--> S; S: Rd | -, Evict | -; a write (Wr | WrX or Wr | Broadcast) moves the line towards M; M: Rd | -, Wr | Broadcast; M --Evict | Wb--> I. Remote (snooped) events: Broadcast | Update, RdX | Send, WrX | Send. The broadcast on every write is a power-hungry step.]
Given that only one sister cache can use the bus at any time, a
global order of writes is automatically enforced.
If the bus master disallows starvation, then all writes will ultimately
complete.
[State diagram: MSI write-invalidate protocol. I --Rd | RdX--> S; S: Rd | -, Evict | -; I --Wr | WrX--> M; S --Wr | WrX.u--> M; M: Rd | -; M --Evict | Wb--> I. A writeback is necessary when evicting from the M state; the S state has seamless evictions.] A small transition-function sketch follows.
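A minimal sketch of the local-event transitions in the write-invalidate MSI protocol described above. The enum and function names are assumptions, and bus actions are indicated only as comments; this is not the book's exact finite-state machine.

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;

    msi_state_t on_local_read(msi_state_t st) {
        if (st == MSI_I) { /* send a read miss (RdX) on the bus */ return MSI_S; }
        return st;                              /* S and M read locally: Rd | - */
    }

    msi_state_t on_local_write(msi_state_t st) {
        if (st == MSI_I) { /* send a write miss (WrX) */ }
        if (st == MSI_S) { /* send an upgrade (WrX.u) to invalidate sister copies */ }
        return MSI_M;
    }

    msi_state_t on_local_evict(msi_state_t st) {
        if (st == MSI_M) { /* write the dirty line back (Wb) */ }
        return MSI_I;                           /* eviction from S is seamless */
    }

    msi_state_t on_snooped_write(msi_state_t st) {
        (void) st;
        return MSI_I;                           /* another cache is writing: invalidate */
    }

    msi_state_t on_snooped_read(msi_state_t st) {
        if (st == MSI_M) { /* supply the data and write it back */ return MSI_S; }
        return st;
    }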
• Insight: Consider a core that reads a block, and the block is never shared. The block will first be read in the S state; when the core later writes it, an additional message (WrX.u) is required to transition to the M state, even though no other cache has a copy.
• Can we avoid this?
• Add an additional Exclusive (E) state: the block is present in only one cache. If we are reading a block from the lower level (no sister cache has it), it enters the E state.
• The rest remains the same.
• MSI protocol → MESI protocol
[State diagram: MESI protocol. I --Rd | RdX (data supplied by a sister cache)--> S; I --Rd | RdX (data read from the lower level)--> E; I --Wr | WrX--> M; S: Rd | -, Evict | -; S --Wr | WrX.u--> M; E: Rd | -, Evict | -; E --Wr | - --> M (no bus message needed); M: Rd | -, Wr | -; M --Evict | Wb--> I.]
Answer: The caches that have a copy of the block arbitrate for the bus. The one that gets access to the bus first sends the data. The rest snoop this value and then cancel their requests. This is an overhead in terms of time and power.
[State diagram: MOESI protocol with states I, S, E, O, M and two temporary states (St and Se) used while waiting for the reply to a miss. Edges are labelled "event | action", e.g. Timeout | Probe, Timeout | Read from lower level, Evict | Wb, Wr | WrX.u, and Rd | - / Wr | - self-loops on M and O.]
• Ignore the two temporary states – St and Se – for the time being.
• Main problem: An eviction from the O state leaves us with a state where there is no owner.
• This makes the transitions from the I state tricky.
Directory
[Figure: the caches are connected to a directory; the directory maintains one entry per block of the address space.]

Directory Entry: two states, U and S.
• Evict | sharers = {}
• RdX | 1. Send an RdX to the sharer and ask it to forward a copy of the block. 2. sharers += {P}
Space-efficient Solutions
• Maintain a bit for a set of caches. Run a snoopy protocol inside the set.
• Store the ids of only k sharers. Have an overflow bit to indicate that there are more than k sharers. In that case, every message needs to be broadcast. [Partially mapped scheme] A sketch of such an entry follows.
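A hedged sketch of the partially mapped directory entry described above: the entry stores the ids of at most K sharers plus an overflow bit. All names and sizes here are assumptions.

    #define K 4

    typedef struct {
        unsigned char  state;        /* U (uncached) or S (shared) */
        unsigned char  numSharers;   /* number of valid entries in sharers[] */
        unsigned char  overflow;     /* set once more than K caches share the block */
        unsigned short sharers[K];   /* ids of the first K sharing caches */
    } dir_entry_t;

    /* Record a new sharer P; fall back to broadcast mode if the entry overflows. */
    void add_sharer(dir_entry_t *e, unsigned short P) {
        if (e->numSharers < K)
            e->sharers[e->numSharers++] = P;
        else
            e->overflow = 1;         /* from now on, messages must be broadcast */
    }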
Unsynchronized update of a bank balance:

    t1 = account.balance;
    t2 = t1 + 100;
    account.balance = t2;

The same update protected by a lock:

    lock();
    t1 = account.balance;
    t2 = t1 + 100;
    account.balance = t2;
    unlock();
Spin lock using an atomic exchange (r0 holds the address of the lock variable):

    .lock:
        mov r1, 1
    .test:
        /* test if the lock is free */
        ld r2, 0[r0]
        cmp r2, 0
        bne .test
        /* try to acquire the lock atomically */
        xchg r1, 0[r0]
        cmp r1, 0
        bne .test
        ret

    .unlock:
        /* we store 0 at the lock address */
        mov r1, 0
        xchg r1, 0[r0]
        ret

A C11 version of the same idea appears below.
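A minimal sketch of the test-and-test-and-set lock using C11 atomics; the variable and function names are assumptions.

    #include <stdatomic.h>

    atomic_int lockvar = 0;   /* 0: free, 1: taken (plays the role of 0[r0] above) */

    void lock(void) {
        while (1) {
            /* test: spin on a plain load until the lock looks free */
            while (atomic_load(&lockvar) != 0) { }
            /* atomic exchange, like the xchg instruction above */
            if (atomic_exchange(&lockvar, 1) == 0)
                return;           /* old value was 0: lock acquired */
        }
    }

    void unlock(void) {
        atomic_store(&lockvar, 0);    /* store 0 at the lock address */
    }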
Lock-free version using compare-and-swap (CAS):

    while (1) {
        t1 = account.balance;
        t2 = t1 + 100;
        if (CAS(account.balance, t1, t2))
            break;
    }
Answer: Wait-free algorithms.

Basic Idea
The consensus problem forms the theoretical basis of most concurrent algorithms.
4. Memory Models
Execution witness examples:

    T1: (a) x = 1; (b) t1 = x; (c) y = 1        T2: (d) t2 = y; (e) t3 = x
    [Witness: (a) Wx1 and (e) Rx0 joined by an fr edge, plus rfi and po edges.]

    T1: (a) x = 1; (b) y = 1                    T2: (c) t1 = y; (d) t2 = x
    [Witness: (a) Wx1 and (c) Ry1, with fr, po, and rfe edges.]
Memory models ordered from strict to relaxed ordering:

    SC → TSO (Intel) → Processor consistency → PSO → Weak ordering / RC → IBM PowerPC, ARM
5. Data Races
counter++ expands to:

    t1 = counter;
    t2 = t1 + 1;
    counter = t2;

Two ways to make the increment atomic (a C11 sketch of the first option appears below):

    /* option 1: a single atomic instruction */
    fetch_and_increment(counter);

    /* option 2: a lock */
    lock();
    t1 = counter;
    t2 = t1 + 1;
    counter = t2;
    unlock();
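A minimal sketch of fetch_and_increment expressed with C11 atomics; the counter name is carried over from the example above.

    #include <stdatomic.h>

    atomic_int counter = 0;

    /* a single atomic read-modify-write operation, so no lock is required */
    void increment(void) {
        atomic_fetch_add(&counter, 1);
    }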
[Figure: timeline showing threads T1 and T2 alternately acquiring the lock (successfully locked) and releasing it (unlocked).]
    Execution 1                   Execution 2 (y is a synch variable)
    T1          T2                T1          T2
    (a) Rx0     (c) Rx0           (a) Wx1     (c) Ry1
    (b) Wx1     (d) Wx1           (b) Wy1     (d) Rx1

Note the so edge between (b) Wy1 and (c) Ry1 in the second execution.
Data Races
Data Race
A pair of conflicting and concurrent accesses to the
same regular variable constitute a data race.
If a piece of code does not have data races, what does it mean?
[Figure: two executions containing accesses to x (Rx0, Wx1, Rx1) and synchronization operations S1 and S2, connected by po, so, and rf edges. In one execution the accesses are ordered through the synchronization operations; in the other, the write Wx1 and the read Rx1 are concurrent and constitute a data race.]
Salient Points

Theorem (for the proof, refer to the book)

Lockset algorithm: on every access to a variable v by a thread T, compute an intersection:

    L(v) ← L(v) ∩ L(T)

where L(v) is the candidate lockset of v and L(T) is the set of locks currently held by T. A small code sketch of this refinement appears after the state-machine description below.
1. The first access has to be a write. We then move to the Exclusive state.
2. If a different thread reads it, move to the Shared state.
3. The Shared state allows reads (irrespective of the thread).
4. After a write, move to the Modified state.
5. In the Modified state, run the regular lock set algorithm.
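A hedged sketch of the basic lockset refinement L(v) = L(v) ∩ L(T), with locksets represented as 64-bit masks (one bit per lock). The names, sizes, and reporting mechanism are assumptions, and the state machine above is not modelled here.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_THREADS 64
    #define MAX_VARS    1024

    uint64_t locksHeld[MAX_THREADS];   /* L(T): locks currently held by thread T          */
    uint64_t candidate[MAX_VARS];      /* L(v): candidate lockset, initialised to ~0ULL   */

    void on_access(int tid, int var) {
        candidate[var] &= locksHeld[tid];      /* compute the intersection */
        if (candidate[var] == 0)
            printf("possible data race on variable %d\n", var);
    }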
    Vi = Vj  ⟺  ∀k, Vi[k] = Vj[k]
    Vi ≺ Vj  ⟺  (Vi ≠ Vj) ∧ (∀k, Vi[k] ≤ Vj[k])
    Vi ≼ Vj  ⟺  (Vi = Vj) ∨ (Vi ≺ Vj)

• Two vector clocks may not always be comparable. A small sketch of these comparisons follows.
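A minimal sketch of the vector clock relations defined above; the size and the names are assumptions.

    #define NTHREADS 8

    typedef struct { int c[NTHREADS]; } vclock_t;

    /* Vi ≼ Vj: every component of Vi is <= the corresponding component of Vj */
    int vc_leq(const vclock_t *vi, const vclock_t *vj) {
        for (int k = 0; k < NTHREADS; k++)
            if (vi->c[k] > vj->c[k]) return 0;
        return 1;
    }

    /* Vi ≺ Vj: Vi ≼ Vj and the clocks are not equal */
    int vc_lt(const vclock_t *vi, const vclock_t *vj) {
        int strictly = 0;
        for (int k = 0; k < NTHREADS; k++) {
            if (vi->c[k] > vj->c[k]) return 0;
            if (vi->c[k] < vj->c[k]) strictly = 1;
        }
        return strictly;
    }

    /* If neither vc_lt(a,b) nor vc_lt(b,a) holds and a != b, the clocks are incomparable. */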
[Figure: a happens-before (hb) edge from Event 1 (x = 1) to Event 2 (t1 = x), two events on the same memory location.]
Event    Time
ei       Event in thread i at time Vi
ej       Event in thread j at time Vj

    Vi ≺ Vj  ⟺  ei →(hb) ej

This holds because the receiver has all the sender's updates, and without a chain of happens-before edges two vector clocks will not be comparable.
Lock acquire (thread T acquiring lock L):

    CT[tid] ← CT[tid] + 1        (increment the thread's component of its clock)
    CT ← CT ∪ CL                 (component-wise union with the lock's clock)
    CT.inLock ← True

Lock release:

    CL ← CT                      (copy the thread clock into the lock clock)
    CT.inLock ← False
Read Operation

    if (∼CT.inLock) then
        CT[tid] ← CT[tid] + 1        /* increment the local count of the thread's clock */
    end
    if (Wv ≼ CT) then
        Rv ← Rv ∪ CT                 /* if no thread has overwritten the value after the
                                        transaction started, the read is successful;
                                        update the read clock */
    else
        DeclareDataRace
    end
Write Operation

    if (∼CT.inLock) then
        CT[tid] ← CT[tid] + 1        /* increment the local count of the thread's clock */
    end
    if ((Wv ≼ CT) ∧ (Rv ≼ CT)) then
        Rv ← Rv ∪ CT                 /* if no thread has overwritten or read the value
        Wv ← Wv ∪ CT                    after the transaction started, the write is
                                        successful; update the read and write clocks */
    else
        DeclareDataRace
    end
6. Transactional Memory
[Figure: Thread A acquires lock A and Thread B acquires lock B; Thread A then waits for lock B while Thread B waits for lock A, resulting in a deadlock.]
Create a Transaction
Two Types
Hardware TM Software TM
Term    Meaning
Ri      Read set of transaction i
Wi      Write set of transaction i
Rj      Read set of transaction j
Wj      Write set of transaction j

Transactions i and j conflict if (a bitmask-based sketch of this check follows):

    (Wi ∩ Wj ≠ φ)  OR  (Wi ∩ Rj ≠ φ)  OR  (Ri ∩ Wj ≠ φ)
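If read and write sets are represented as bitmasks over a small set of memory locations, the conflict condition above becomes three intersection tests. The names and the bitmask representation are assumptions for illustration.

    #include <stdint.h>

    typedef struct { uint64_t readSet, writeSet; } tx_sets_t;

    /* non-zero if transactions i and j conflict */
    int conflict(const tx_sets_t *i, const tx_sets_t *j) {
        return (i->writeSet & j->writeSet) != 0 ||
               (i->writeSet & j->readSet)  != 0 ||
               (i->readSet  & j->writeSet) != 0;
    }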
Commit
• The transaction completed without any conflicts.
• It finished writing its data to main memory.

Abort
• The transaction could not complete due to conflicts.
• It did not make any of its writes visible.
    do {
        ...
        ...
    } while (!Tx.commit());
[Figure: timeline of a conflict. With pessimistic concurrency control, conflict occurrence, detection, and resolution all happen at the same point. With optimistic concurrency control, the conflict occurs first and is detected and resolved later.]
Eager
• Check for conflicts as soon as a
transaction accesses a memory
location
Lazy
• Check at the time of
committing a transaction
Serializable
• Sequential consistency at the level of transactions
Strictly Serializable
• The sequential ordering is consistent with the real time ordering
• This means that if Transaction A starts after Transaction B ends,
it should be ordered after it in the equivalent sequential ordering
• For concurrent transactions, their ordering does not matter
Opacity
• Even aborted transactions need to see a consistent state – one
produced by only committed transactions
    Thread 1                        Thread 2
    atomic {                        atomic {
        t1 = x;                         x = 5;
        t2 = y;                         y = 5;
        while (t1 != t2) {}         }
    }
• If one transaction executes after the other, x will always be equal to y
• Assume optimistic concurrency control: resolution at the end
• Assume Thread 1 reads x=0
• Then the transaction on thread 2 finishes
• The transaction on thread 1 needs to be aborted (will read y=5)
• However, it will be stuck in the while loop; it will never reach the end of the transaction
• Opacity will not allow y to be read as 5 after x has been read as 0
Mixed Mode Accesses: Transactional and Non-Transactional

Design choices
Concurrency Control
• Optimistic or Pessimistic
Version Management
• Lazy or Eager
Conflict Detection
• Lazy or Eager
[Figure: a transactional variable consists of its value and per-object metadata: a version and a pointer to the transaction that has locked the object (for a read or a write). On a read operation, the metadata is checked; depending on the outcome (Yes/No), the transaction proceeds or aborts.]

Pros: simple.
Cons: does not provide opacity, does not provide a strong semantics for transactions, and writes are slow.
Timestamp-based design (a code sketch follows):
• At the beginning of a transaction: globalClock++; Tx.rv = globalClock;
• On a read: if (e.timestamp > Tx.rv) abort.
• Writes: lock the object; on failure, abort.
• Provides opacity.
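A hedged sketch of the timestamp-based read validation outlined above. The structures, field names, and the locking details are assumptions, not the book's exact design.

    typedef struct { int value; int version; int locked; } tvar_t;

    int globalClock = 0;                 /* incremented atomically in a real system */

    typedef struct { int rv; int aborted; } tx_t;

    void tx_begin(tx_t *tx) {
        globalClock++;                   /* as above: globalClock++; Tx.rv = globalClock */
        tx->rv = globalClock;
        tx->aborted = 0;
    }

    int tx_read(tx_t *tx, tvar_t *v) {
        int val = v->value;
        /* if the variable is locked or was written after the transaction began, abort */
        if (v->locked || v->version > tx->rv)
            tx->aborted = 1;
        return val;
    }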
ISA Support
• Add three new instructions: begin, abort, and commit
Version Management
• HW schemes mostly use eager version management – undo log
• The log has its dedicated set of addresses in virtual memory
• Each cache line is augmented with R and W bits to track the transaction's read and write sets.