05-Semaphores Monitors Barriers-S20
• Acknowledgements
• Thanks to Gadi Taubenfeld: I borrowed and modified some of his slides on barriers
• Image credits
• https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwjxi4uip8LdAhWFq1MKHbBeD4sQjRx6BAgBEAU&url=http%3A%2F%2Fpreshing.com%2F20150316%2Fsemaphores-are-surprisingly-versatile&psig=AOvVaw20Zw2eU9WAmbX8qxDSLSRd&ust=1537282884760655
• https://images-na.ssl-images-amazon.com/images/I/31EcIPmMniL.jpg
• https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwjBivLOp8LdAhWF0VMKHdMvAnwQjRx6BAgBEAU&url=https%3A%2F%2Fprocastproducts.com%2Falaska-barriers-10-tall&psig=AOvVaw24KBCgTpBd7ynNpqcwcaqO&ust=1537282983281741
Faux Quiz (answer any 2, 5 min)
• What is the difference between Mesa and Hoare monitors?
• Why recheck the condition on wakeup from a monitor wait?
• How can you build a barrier with spinlocks?
• How can you build a barrier with monitors?
• How can you build a barrier without spinlocks or monitors?
• What is the difference between a mutex and a semaphore?
• How are monitors and semaphores related?
• Why does pthread_cond_wait accept a pthread_mutex_t parameter? Could it use
a pthread_spinlock_t? Why [not]?
• Why do modern CPUs have both coherence and HW-supported RMW
instructions? Why not just one or the other?
• What is priority inheritance?
Lab 1: Baseline
Lab 1: Algorithm in Sequential Context
Upsweep
Downsweep
Lab 1: Parallel
Upsweep
Downsweep
Instrumentation
Instrumentation
Discussion: Could you make it scale?
Lab Tricks: Output CSV
Lab Tricks: scripting your experiments
Producer-Consumer (Bounded-Buffer) Problem
[Figure: circular buffer with slots 0, 1, …, N-1 between Producer and Consumer]
• Bounded buffer: size ‘N’
• Access entry 0… N-1, then “wrap around” to 0 again
• Producer writes data
• Consumer reads data
OK, let’s write some code for this (using locks only)
object array[N]
void enqueue(object x);
object dequeue();
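A minimal sketch of the locks-only version, assuming a pthread mutex, an int payload standing in for "object", and a capacity of 8 (all three are illustrative choices, not from the slide). With only a lock, a thread that finds the buffer full or empty has nothing to sleep on, so it must release the lock and retry. The helper locked_buffer_demo is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define N 8                      /* buffer capacity (assumed value) */

typedef int object;              /* stand-in for the slide's "object" type */

static object array[N];
static int head = 0;             /* next slot to read */
static int tail = 0;             /* next slot to write */
static int count = 0;            /* number of filled slots */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void enqueue(object x) {
    for (;;) {
        pthread_mutex_lock(&lock);
        if (count < N) {                 /* room available */
            array[tail] = x;
            tail = (tail + 1) % N;       /* wrap around to 0 */
            count++;
            pthread_mutex_unlock(&lock);
            return;
        }
        pthread_mutex_unlock(&lock);     /* full: drop the lock and retry */
        sched_yield();
    }
}

object dequeue(void) {
    for (;;) {
        pthread_mutex_lock(&lock);
        if (count > 0) {                 /* something to read */
            object x = array[head];
            head = (head + 1) % N;       /* wrap around to 0 */
            count--;
            pthread_mutex_unlock(&lock);
            return x;
        }
        pthread_mutex_unlock(&lock);     /* empty: drop the lock and retry */
        sched_yield();
    }
}

static void *producer(void *arg) {
    long n = (long)arg;
    for (long i = 1; i <= n; i++)
        enqueue((object)i);
    return NULL;
}

/* Hypothetical demo: one producer sends 1..n, the caller consumes and sums. */
long locked_buffer_demo(long n) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, producer, (void *)n);
    assert(rc == 0);
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += dequeue();
    pthread_join(t, NULL);
    return sum;
}
```

The retry loops are the inefficiency: the lock gives mutual exclusion, but a waiting thread burns CPU polling instead of being woken when the buffer changes.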
Semaphore Motivation
• Problem with locks: mutual exclusion, but no ordering
• Inefficient for producer-consumer (and lots of other things)
• Producer: creates a resource
• Consumer: uses a resource
• bounded buffer between them
• You need synchronization for correctness, and…
• Scheduling order:
• producer waits if buffer full, consumer waits if buffer empty
Semaphores
• Synchronization variable
• Integer value
• Can’t access value directly
• Must initialize to some value
• sem_init(sem_t *s, int pshared, unsigned int value)
• Two operations
• sem_wait, or down(), P()
• sem_post, or up(), V()
function V(semaphore S, integer I):
    [S ← S + I]
function P(semaphore S, integer I):
    repeat:
        [if S ≥ I:
            S ← S − I
            break ]
sem_init(&full, 0, 0);
sem_init(&empty, 0, N);
producer() {                 consumer() {
  sem_wait(&empty);            sem_wait(&full);
  … // fill a slot             … // empty a slot
  sem_post(&full);             sem_post(&empty);
}                            }
Producer-Consumer with semaphores
• Three semaphores
• sem_t full; // # of filled slots
• sem_t empty; // # of empty slots
• sem_t mutex; // mutual exclusion
sem_init(&full, 0, 0);
sem_init(&empty, 0, N);
sem_init(&mutex, 0, 1);
producer() {                 consumer() {
  sem_wait(&empty);            sem_wait(&full);
  sem_wait(&mutex);            sem_wait(&mutex);
  … // fill a slot             … // empty a slot
  sem_post(&mutex);            sem_post(&mutex);
  sem_post(&full);             sem_post(&empty);
}                            }
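The three-semaphore scheme can be sketched as a small C program fragment. This assumes unnamed POSIX semaphores are available (they are on Linux, not on macOS), an int payload, and a capacity of 4. The names put, get, slots, and sem_buffer_demo are illustrative, not from the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define SLOTS 4                  /* buffer capacity N (assumed value) */

static int slots[SLOTS];
static int in = 0, out = 0;      /* next slot to fill / to empty */
static sem_t full;               /* # of filled slots */
static sem_t empty;              /* # of empty slots */
static sem_t mutex;              /* mutual exclusion on slots, in, out */

static void put(int x) {
    sem_wait(&empty);            /* sleep while the buffer is full */
    sem_wait(&mutex);
    slots[in] = x;               /* fill a slot */
    in = (in + 1) % SLOTS;
    sem_post(&mutex);
    sem_post(&full);             /* one more filled slot */
}

static int get(void) {
    sem_wait(&full);             /* sleep while the buffer is empty */
    sem_wait(&mutex);
    int x = slots[out];          /* empty a slot */
    out = (out + 1) % SLOTS;
    sem_post(&mutex);
    sem_post(&empty);            /* one more empty slot */
    return x;
}

static void *producer(void *arg) {
    long n = (long)arg;
    for (long i = 1; i <= n; i++)
        put((int)i);
    return NULL;
}

/* Hypothetical demo: one producer sends 1..n, the caller consumes and sums. */
long sem_buffer_demo(long n) {
    pthread_t t;
    in = out = 0;
    int rc = sem_init(&full, 0, 0);
    assert(rc == 0);
    rc = sem_init(&empty, 0, SLOTS);
    assert(rc == 0);
    rc = sem_init(&mutex, 0, 1);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, producer, (void *)n);
    assert(rc == 0);
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += get();
    pthread_join(t, NULL);
    sem_destroy(&full);
    sem_destroy(&empty);
    sem_destroy(&mutex);
    return sum;
}
```

Note the order of the waits: taking mutex before empty (or full) could leave a thread sleeping while it holds the mutex, which would block the other side forever.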
Pthreads and Semaphores
• No pthread_semaphore_t!
• Pthreads defines no pthread_semaphore_t type and no pthread_semaphore_init or pthread_semaphore_destroy
• POSIX does define a standard semaphore API
• #include <semaphore.h>
• sem_t, with sem_init, sem_wait, sem_post, sem_destroy
What is a monitor?
❑ Monitor: one big lock for set of operations/methods
❑ Language-level implementation of mutex
Many variants…
Pthreads and conditions
• Why a pthread_mutex_t parameter for pthread_cond_wait?
• Why not in pthread_cond_init?
• Type pthread_cond_t
notify c:
    if c.q.any()
        t ← c.q.pop_front() // t is "notified"
        e.push_back(t)
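A minimal sketch of the usual pthread_cond_wait pattern, which may help with both questions above. The mutex is passed to pthread_cond_wait because the wait must release it and go to sleep atomically, then reacquire it before returning. Because a notified thread is only moved back to the entry queue (Mesa semantics), the condition may no longer hold by the time it runs, so the wait sits in a while loop. The names ready, payload, and cond_demo are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static bool ready = false;       /* the condition being waited for */
static int  payload = 0;         /* data protected by m */

static void *waiter(void *arg) {
    int *out = arg;
    pthread_mutex_lock(&m);
    while (!ready)                   /* recheck after every wakeup */
        pthread_cond_wait(&c, &m);   /* unlocks m while asleep, relocks before returning */
    *out = payload;
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Hypothetical demo: hand one value to a waiting thread and return what it saw. */
int cond_demo(int value) {
    int seen = -1;
    pthread_t t;
    ready = false;
    int rc = pthread_create(&t, NULL, waiter, &seen);
    assert(rc == 0);
    pthread_mutex_lock(&m);
    payload = value;
    ready = true;
    pthread_cond_signal(&c);         /* notify: wake one waiter, if any */
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return seen;
}
```

The sketch works whether the waiter or the signaller runs first: if ready is already true, the waiter never sleeps, so a signal sent to an empty queue is harmless.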
Prefix Sum
begin:  a  b    c      d        e          f
        a  a+b  c      d        e          f
        a  a+b  a+b+c  d        e          f
        … (time)
end:    a  a+b  a+b+c  a+b+c+d  a+b+c+d+e  a+b+c+d+e+f
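The sequential computation in the figure can be sketched in a few lines of C. Each step adds the running total on the left into the next element, so step i depends on step i-1. The function name prefix_sum and the long element type are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

/* In-place inclusive prefix sum: x[i] becomes x[0] + ... + x[i]. */
void prefix_sum(long *x, size_t n) {
    assert(x != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];        /* uses the already-updated left neighbour */
}
```

This takes n-1 additions but forms one long dependency chain, which is what the parallel versions try to break up.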
Parallel Prefix Sum
begin:  a  b    c      d        e          f
end:    a  a+b  a+b+c  a+b+c+d  a+b+c+d+e  a+b+c+d+e+f
Synchronization Algorithms and Concurrent Programming, Chapter 5
Gadi Taubenfeld © 2014
Pthreads Parallel Prefix Sum
Will this work?
Pthreads Parallel Prefix Sum
fixed?
Parallel Prefix Sum
begin:  a  b    c      d        e          f
barrier
end:    a  a+b  a+b+c  a+b+c+d  a+b+c+d+e  a+b+c+d+e+f
What is a Barrier ?
➢ Coordination mechanism (algorithm)
➢ Forces processes/threads to wait until all have reached a specified point.
➢ Once all reach the barrier, all can pass.
[Figure: threads P1–P4 crossing three successive barriers over time]
Pthreads and barriers
Type pthread_barrier_t
fixed?
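A hedged sketch of a parallel prefix sum using pthread_barrier_t. It assumes one thread per element and a step-doubling (Hillis-Steele style) scheme, which may differ from the code shown in lecture. Each step is split into a read phase and a write phase, with a barrier after each, so no thread overwrites a value a neighbour still needs. NT, worker, and barrier_scan_demo are illustrative names, and pthread_barrier_t is assumed to be available (it is on Linux, not on macOS).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define NT 6                     /* one thread per element (assumed value) */

static long x[NT];
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long i = (long)arg;
    for (long d = 1; d < NT; d *= 2) {
        long left = (i >= d) ? x[i - d] : 0;   /* read phase */
        pthread_barrier_wait(&bar);            /* everyone has read */
        x[i] += left;                          /* write phase */
        pthread_barrier_wait(&bar);            /* everyone has written */
    }
    return NULL;
}

/* Hypothetical demo: scan the input 1..NT and return the last prefix sum. */
long barrier_scan_demo(void) {
    pthread_t t[NT];
    for (long i = 0; i < NT; i++)
        x[i] = i + 1;
    int rc = pthread_barrier_init(&bar, NULL, NT);
    assert(rc == 0);
    for (long i = 0; i < NT; i++) {
        rc = pthread_create(&t[i], NULL, worker, (void *)i);
        assert(rc == 0);
    }
    for (long i = 0; i < NT; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return x[NT - 1];
}
```

Without the first barrier a thread could add a neighbour's value that has already been updated in the same step. Without the second, a fast thread could start reading for the next step too early.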
Barrier Goals
Barrier Building Blocks
• Conditions
• Semaphores
• Atomic Bit
• Atomic Register
• Fetch-and-increment register
• Test and set bits
• Read-Modify-Write register
Barrier with Semaphores
Barrier using Semaphores
Algorithm for N threads
shared sem_t arrival = 1;      // sem_init(&arrival, 0, 1)
       sem_t departure = 0;    // sem_init(&departure, 0, 0)
       atomic int counter = 0; // (gcc intrinsics are verbose)
// Phase I
sem_wait(&arrival);
if(++counter < N)
  sem_post(&arrival);
else
  sem_post(&departure);
// Phase II
sem_wait(&departure);
if(--counter > 0)
  sem_post(&departure);
else
  sem_post(&arrival);
• Phase I: first N-1 threads post on arrival, wait on departure
• Nth thread posts on departure, releasing threads into phase II (what is value of arrival?)
• Phase II: first N-1 threads post on departure, last posts arrival
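The two-phase algorithm can be sketched as compilable C, assuming POSIX unnamed semaphores and a C11 atomic_int for the counter. The test harness around it (hits, errors, worker, sem_barrier_demo, and the values of N and ROUNDS) is hypothetical: each thread marks its arrival in a round, crosses the barrier, and checks that all N marks are visible.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define N 4          /* number of threads (assumed value) */
#define ROUNDS 200   /* barrier episodes in the demo (assumed value) */

static sem_t arrival, departure;
static atomic_int counter;

void sem_barrier_wait(void) {
    /* Phase I: threads pass the arrival token along until the Nth arrives. */
    sem_wait(&arrival);
    if (++counter < N)
        sem_post(&arrival);
    else
        sem_post(&departure);   /* Nth thread opens phase II; arrival stays at 0 */
    /* Phase II: threads pass the departure token along until the last leaves. */
    sem_wait(&departure);
    if (--counter > 0)
        sem_post(&departure);
    else
        sem_post(&arrival);     /* last thread out re-arms the barrier */
}

static atomic_int hits[ROUNDS];
static atomic_int errors;

static void *worker(void *arg) {
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        atomic_fetch_add(&hits[r], 1);      /* mark arrival in round r */
        sem_barrier_wait();
        if (atomic_load(&hits[r]) != N)     /* all N must have arrived */
            atomic_fetch_add(&errors, 1);
    }
    return NULL;
}

/* Hypothetical demo: returns the number of barrier violations observed. */
int sem_barrier_demo(void) {
    pthread_t t[N];
    int rc = sem_init(&arrival, 0, 1);
    assert(rc == 0);
    rc = sem_init(&departure, 0, 0);
    assert(rc == 0);
    atomic_store(&counter, 0);
    atomic_store(&errors, 0);
    for (int r = 0; r < ROUNDS; r++)
        atomic_store(&hits[r], 0);
    for (int i = 0; i < N; i++) {
        rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&arrival);
    sem_destroy(&departure);
    return atomic_load(&errors);
}
```

Because arrival stays at 0 throughout phase II, no thread can start the next episode until the last thread has left the current one, which is what makes the barrier reusable across rounds.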
Semaphore Barrier Action Zone
N == 3
[Animation: CPU 0, CPU 1, and CPU 2 each step through the two-phase semaphore barrier while the values of arrival, departure, and counter change]
• Do we need two phases?
• Still correct if counter is not atomic?
Barrier using Semaphores
Properties
• Pros:
• Very Simple
• Space complexity O(1)
• Symmetric
• Cons:
• Requires a strong object
• Requires some central manager
• High contention on the semaphores
• Propagation delay O(n)
Barriers based on counters
Counter Barrier Ingredients
Simple Barrier Using an Atomic Counter
local.go := go
local.counter := fetch-and-increment (counter)
if local.counter + 1 = n then
    counter := 0
    go := 1 - go
else await(local.go ≠ go)
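A minimal C11 sketch of the pseudocode above, assuming atomic_fetch_add as the fetch-and-increment and a sched_yield inside the spin loop so the demo behaves on machines with few cores (the pseudocode itself just spins). The harness names (hits, errors, worker, counter_barrier_demo) and the values of N and ROUNDS are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define N 4          /* number of threads n (assumed value) */
#define ROUNDS 200   /* barrier episodes in the demo (assumed value) */

static atomic_int counter;   /* arrivals in the current episode */
static atomic_int go;        /* shared go bit, flipped once per episode */

void counter_barrier_wait(void) {
    int local_go = atomic_load(&go);
    int local_counter = atomic_fetch_add(&counter, 1);  /* returns the old value */
    if (local_counter + 1 == N) {
        atomic_store(&counter, 0);          /* reset before releasing anyone */
        atomic_store(&go, 1 - local_go);    /* flip the bit: everyone may pass */
    } else {
        while (atomic_load(&go) == local_go)
            sched_yield();                  /* await(local.go != go) */
    }
}

static atomic_int hits[ROUNDS];
static atomic_int errors;

static void *worker(void *arg) {
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        atomic_fetch_add(&hits[r], 1);      /* mark arrival in round r */
        counter_barrier_wait();
        if (atomic_load(&hits[r]) != N)     /* all N must have arrived */
            atomic_fetch_add(&errors, 1);
    }
    return NULL;
}

/* Hypothetical demo: returns the number of barrier violations observed. */
int counter_barrier_demo(void) {
    pthread_t t[N];
    atomic_store(&counter, 0);
    atomic_store(&go, 0);
    atomic_store(&errors, 0);
    for (int r = 0; r < ROUNDS; r++)
        atomic_store(&hits[r], 0);
    for (int i = 0; i < N; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&errors);
}
```

The order of the two stores in the last thread matters: counter is reset before go flips, so a released thread that re-enters the barrier immediately starts counting from 0.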
Simple Barrier Using an Atomic Counter
Run for n=2 Threads
[Figure: initial state. Shared memory (SM) holds counter and go; P1 and P2 each hold local.go and local.counter; all values are still unknown (?)]
Simple Barrier Using an Atomic Counter
Run for n=2 Threads
[Animation: P1 reads local.go = 0 and gets local.counter = 0; since 0+1≠2, P1 busy-waits. P2 reads local.go = 0 and gets local.counter = 1; since 1+1=2, P2 resets counter to 0 and flips go from 0 to 1, which releases P1]
Pros/Cons?
• There is high memory contention on go bit
• Reducing the contention:
• Replace the go bit with n bits: go[1],…,go[n]
• Process pi may spin only on the bit go[i]
A Local Spinning Counter Barrier
Program of a Thread i
local.go := go[i]
local.counter := fetch-and-increment (counter)
if local.counter + 1 = n then
    counter := 0
    for j=1 to n { go[j] := 1 - go[j] }
else await(local.go ≠ go[i])
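A hedged C11 sketch of the local-spinning variant. Threads are indexed 0..N-1 here rather than 1..n, and each go bit is placed on its own cache line with _Alignas(64); the 64-byte line size is an assumption about the hardware. As before, sched_yield in the spin loop and the test harness names are illustrative additions.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define N 4          /* number of threads n (assumed value) */
#define ROUNDS 200   /* barrier episodes in the demo (assumed value) */

static atomic_int counter;
/* One go bit per thread, each on its own (assumed 64-byte) cache line. */
static struct { _Alignas(64) atomic_int bit; } go[N];

void local_spin_barrier_wait(int i) {
    int local_go = atomic_load(&go[i].bit);
    int local_counter = atomic_fetch_add(&counter, 1);
    if (local_counter + 1 == N) {
        atomic_store(&counter, 0);
        for (int j = 0; j < N; j++)         /* flip every thread's bit in turn */
            atomic_store(&go[j].bit, 1 - atomic_load(&go[j].bit));
    } else {
        while (atomic_load(&go[i].bit) == local_go)
            sched_yield();                  /* spin only on our own bit */
    }
}

static atomic_int hits[ROUNDS];
static atomic_int errors;

static void *worker(void *arg) {
    int i = (int)(long)arg;
    for (int r = 0; r < ROUNDS; r++) {
        atomic_fetch_add(&hits[r], 1);      /* mark arrival in round r */
        local_spin_barrier_wait(i);
        if (atomic_load(&hits[r]) != N)     /* all N must have arrived */
            atomic_fetch_add(&errors, 1);
    }
    return NULL;
}

/* Hypothetical demo: returns the number of barrier violations observed. */
int local_spin_demo(void) {
    pthread_t t[N];
    atomic_store(&counter, 0);
    atomic_store(&errors, 0);
    for (int j = 0; j < N; j++)
        atomic_store(&go[j].bit, 0);
    for (int r = 0; r < ROUNDS; r++)
        atomic_store(&hits[r], 0);
    for (long i = 0; i < N; i++) {
        int rc = pthread_create(&t[i], NULL, worker, (void *)i);
        assert(rc == 0);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&errors);
}
```

Spinning is now spread over N locations, but the last thread does O(n) writes to release everyone, and all threads still hit the single counter.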
A Local Spinning Counter Barrier
Example Run for n=3 Threads
[Animation: P1, P2, and P3 each read local.go = 0. P1 gets local.counter = 0 (0+1≠3) and P2 gets local.counter = 1 (1+1≠3), so both busy-wait on their own go bits. P3 gets local.counter = 2 (2+1=3), resets counter to 0, and flips go[1..3] from 0 to 1]
Pros/Cons?
Does this actually reduce contention?
Comparison of counter-based Barriers
Simple Barrier
• Pros:
• Very Simple
• Shared memory: O(log n) bits
• Takes O(1) until last waiting p is awakened
• Cons:
• High contention on the go bit
• Contention on the counter register (*)
Simple Barrier with go array
• Pros:
• Low contention on the go array
• In some models:
• spinning is done on local memory
• remote mem. ref.: O(1)
• Cons:
• Shared memory: O(n)
• Still contention on the counter register (*)
• Takes O(n) until last waiting p is awakened
Tree Barriers
A Tree-based Barrier
• Root learns that its 2 children have arrived→tells children they can go
• The signal propagates down the tree until all the threads get the message
[Figure: binary tree with root 1, children 2 and 3, and leaves 4, 5, 6, 7]
A Tree-based Barrier: indexing
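• Step 1: label nodes numerically in breadth-first (level) order
• Assume $n = 2^k - 1$
• Node $i$ has children $2i$ and $2i+1$
• arrive and go arrays: one entry per node 2..n, so indexing starts from 2
• Root → 1, doesn’t need wait objects
[Figure: complete binary tree labeled 1; 2, 3; 4 to 7; 8 to 15, above arrive and go arrays indexed 2 to 15]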
A Tree-based Barrier
program of thread i
shared arrive[2..n]: array of atomic bits, initial values = 0
go[2..n]: array of atomic bits, initial values = 0
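One way a thread program over these two arrays could look, sketched in C11 for n = 7. This is a reconstruction under assumptions, not necessarily the exact program from lecture: a non-leaf waits for and zeros the arrive bits of children 2i and 2i+1, a non-root then sets arrive[i] and waits on go[i], and a non-leaf finally sets go for its children. The await helper, sched_yield in the spin loop, and the harness names are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define N 7          /* number of threads; assumed to be 2^k - 1 */
#define ROUNDS 200   /* barrier episodes in the demo (assumed value) */

/* Indexed 2..N as on the slide; entries 0 and 1 are unused. */
static atomic_int arrive[N + 1];
static atomic_int go[N + 1];

static void await(atomic_int *bit) {
    while (atomic_load(bit) != 1)
        sched_yield();
}

void tree_barrier_wait(int i) {              /* i is the node label, 1..N */
    int is_leaf = (2 * i > N);
    if (!is_leaf) {                          /* wait for both children */
        await(&arrive[2 * i]);
        atomic_store(&arrive[2 * i], 0);
        await(&arrive[2 * i + 1]);
        atomic_store(&arrive[2 * i + 1], 0);
    }
    if (i != 1) {                            /* report upward, wait for go */
        atomic_store(&arrive[i], 1);
        await(&go[i]);
        atomic_store(&go[i], 0);
    }
    if (!is_leaf) {                          /* release both children */
        atomic_store(&go[2 * i], 1);
        atomic_store(&go[2 * i + 1], 1);
    }
}

static atomic_int hits[ROUNDS];
static atomic_int errors;

static void *worker(void *arg) {
    int i = (int)(long)arg;
    for (int r = 0; r < ROUNDS; r++) {
        atomic_fetch_add(&hits[r], 1);       /* mark arrival in round r */
        tree_barrier_wait(i);
        if (atomic_load(&hits[r]) != N)      /* all N must have arrived */
            atomic_fetch_add(&errors, 1);
    }
    return NULL;
}

/* Hypothetical demo: returns the number of barrier violations observed. */
int tree_barrier_demo(void) {
    pthread_t t[N + 1];
    atomic_store(&errors, 0);
    for (int j = 0; j <= N; j++) {
        atomic_store(&arrive[j], 0);
        atomic_store(&go[j], 0);
    }
    for (int r = 0; r < ROUNDS; r++)
        atomic_store(&hits[r], 0);
    for (long i = 1; i <= N; i++) {
        int rc = pthread_create(&t[i], NULL, worker, (void *)i);
        assert(rc == 0);
    }
    for (int i = 1; i <= N; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&errors);
}
```

Each bit has exactly one writer of 1 and one writer of 0, so plain atomic loads and stores suffice and no fetch-and-increment is needed.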
A Tree-based Barrier
Example Run for n=7 threads
[Animation: P2 zeros arrive[4,5] and sets arrive[2]=1; P3 zeros arrive[6,7]; P1 waits for its children, then zeros arrive[2] and arrive[3]; P2 and P3 wait for go[2] and go[3]; the go bits for 2 to 7 are then set to 1 and the run finishes]
• At this point all non-root threads are in some await(go) case
Tree Barrier Tradeoffs
• Pros:
• Low shared memory contention
• No wait object is shared by more than 2 processes
• Good for larger n
• Fast – information from the root propagates after log(n) steps
• Can use only atomic primitives (no special objects)
• On some models:
• each process spins on a locally accessible bit
• # (remote memory ref.) = O(1) per process
• Cons:
• Shared memory space complexity – O(n)
• Asymmetric: not all processes do the same amount of work
Butterfly Barrier
GPU
CPU
Barriers Summary
Seen:
• Semaphore-based barrier
• Simple barrier
• Based on atomic fetch-and-increment counter
• Local spinning barrier
• Based on atomic fetch-and-increment counter and go array
• Tree-based barrier
Not seen:
• Test-and-Set barriers
• Based on test-and-test-and-set objects
• One version without memory initialization
• See-Saw barrier
Questions?