05-Semaphores Monitors Barriers-S20

The document discusses synchronization mechanisms in concurrent programming, focusing on semaphores, monitors, and barriers. It covers concepts such as the producer-consumer problem, semaphore operations, and the differences between Mesa and Hoare monitors. Additionally, it explores the implementation of barriers using semaphores and the properties of various synchronization techniques.


Synchronization:

Semaphores, Monitors, Barriers


Chris Rossbach
CS378H
Today
• Questions?
• Administrivia
  • Start looking at Lab 2!
• Material for the day
  • Lab 1 discussion
  • Semaphores
  • Monitors
  • Barriers

• Acknowledgements
  • Thanks to Gadi Taubenfeld: I borrowed and modified some of his slides on barriers
• Image credits
  • http://preshing.com/20150316/semaphores-are-surprisingly-versatile
  • https://images-na.ssl-images-amazon.com/images/I/31EcIPmMniL.jpg
  • https://procastproducts.com/alaska-barriers-10-tall
Faux Quiz (answer any 2, 5 min)
• What is the difference between Mesa and Hoare monitors?
• Why recheck the condition on wakeup from a monitor wait?
• How can you build a barrier with spinlocks?
• How can you build a barrier with monitors?
• How can you build a barrier without spinlocks or monitors?
• What is the difference between a mutex and a semaphore?
• How are monitors and semaphores related?
• Why does pthread_cond_wait accept a pthread_mutex_t parameter? Could it use a pthread_spinlock_t? Why [not]?
• Why do modern CPUs have both coherence and HW-supported RMW
instructions? Why not just one or the other?
• What is priority inheritance?
Lab 1: Baseline
Lab 1: Algorithm in Sequential Context

Upsweep

Downsweep
Lab 1: Parallel

Upsweep

Downsweep
Instrumentation
Instrumentation
Discussion: could you make it scale?
Lab Tricks: Output CSV
Lab Tricks: scripting your experiments
Producer-Consumer (Bounded-Buffer) Problem

• Bounded buffer: size ‘N’
• Access entry 0 … N-1, then “wrap around” to 0 again
• Producer process writes data to buffer
  • Must not write more than ‘N’ items more than consumer “consumes”
• Consumer process reads data from buffer
  • Should not try to consume if there is no data

(figure: ring buffer with slots 0, 1, …, N-1; producer writes in, consumer reads out)
Producer-Consumer: OK, let’s write some code for this (using locks only)

• Bounded buffer: size ‘N’
• Access entry 0 … N-1, then “wrap around” to 0 again
• Producer writes data
• Consumer reads data

object array[N]
void enqueue(object x);
object dequeue();
Semaphore Motivation
• Problem with locks: mutual exclusion, but no ordering
• Inefficient for producer-consumer (and lots of other things)
• Producer: creates a resource
• Consumer: uses a resource
• bounded buffer between them
• You need synchronization for correctness, and…
• Scheduling order:
• producer waits if buffer full, consumer waits if buffer empty
Semaphores

• Synchronization variable with an integer value
• Can’t access value directly
• Must initialize to some value
  • sem_init(sem_t *s, int pshared, unsigned int value)
• Two operations
  • sem_wait(), or down(), P()
  • sem_post(), or up(), V()

Generalized definition:

function V(semaphore S, integer I):
    [S ← S + I]
function P(semaphore S, integer I):
    repeat:
        [if S ≥ I:
            S ← S − I
            break]

POSIX semantics:

int sem_wait(sem_t *s) {
    // wait until value of semaphore s is greater than 0,
    // then decrement the value of semaphore s by 1
}
int sem_post(sem_t *s) {
    // increment the value of semaphore s by 1;
    // if there are 1 or more threads waiting, wake 1
}
Semaphore Uses

• Mutual exclusion: semaphore as mutex
  • What should the initial value be?
  • Binary semaphore: X=1 ( counting semaphore: X>1 )

sem_init(s, 0, X);  // initialize to X
sem_wait(s);
// critical section
sem_post(s);

• Scheduling order: one thread waits for another
  • What should the initial value be?

// thread 0:
…             // 1st half of computation
sem_post(s);

// thread 1:
sem_wait(s);
…             // 2nd half of computation
Producer-Consumer with semaphores

• Two semaphores
  • sem_t full;  // # of filled slots
  • sem_t empty; // # of empty slots
• Is this correct? Problem: mutual exclusion?

sem_init(&full, 0, 0);
sem_init(&empty, 0, N);

producer() {
    sem_wait(&empty);
    …  // fill a slot
    sem_post(&full);
}

consumer() {
    sem_wait(&full);
    …  // empty a slot
    sem_post(&empty);
}
Producer-Consumer with semaphores

• Three semaphores
  • sem_t full;  // # of filled slots
  • sem_t empty; // # of empty slots
  • sem_t mutex; // mutual exclusion

sem_init(&full, 0, 0);
sem_init(&empty, 0, N);
sem_init(&mutex, 0, 1);

producer() {
    sem_wait(&empty);
    sem_wait(&mutex);
    …  // fill a slot
    sem_post(&mutex);
    sem_post(&full);
}

consumer() {
    sem_wait(&full);
    sem_wait(&mutex);
    …  // empty a slot
    sem_post(&mutex);
    sem_post(&empty);
}
Pthreads and Semaphores

• No pthread_semaphore_t!
  • You might expect:
    int pthread_semaphore_init(pthread_spinlock_t *lock);
    int pthread_semaphore_destroy(pthread_spinlock_t *lock);
  • ?????
• POSIX does define a standard
  • #include <semaphore.h>
What is a monitor?
❑ Monitor: one big lock for a set of operations/methods
❑ Language-level implementation of mutex
• Entry procedure: called from outside
• Internal procedure: called within monitor
• Wait within monitor releases lock

Many variants…
Pthreads and conditions

• Why a mutex_t parameter for pthread_cond_wait?
• Why not in pthread_cond_init?
• Type pthread_cond_t

int pthread_cond_init(pthread_cond_t *cond,
                      const pthread_condattr_t *attr);
int pthread_cond_destroy(pthread_cond_t *cond);
int pthread_cond_wait(pthread_cond_t *cond,
                      pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);

Java: synchronized keyword; wait()/notify()/notifyAll()
C#: Monitor class; Enter()/Exit()/Pulse()/PulseAll()
Does this code work?

• Uses “if” to check invariants.
• Why doesn’t “if” work?
• How could we MAKE it work?
Hoare-style Monitors
(aka blocking condition variables)
Given entrance queue ‘e’, signal queue ‘s’, condition var ‘c’:

enter:
    if locked:
        e.push_back(thread)
        block
    else:
        lock

schedule:
    if s.any():
        t ← s.pop_front()
        t.run
    else if e.any():
        t ← e.pop_front()
        t.run
    else:
        unlock  // monitor unoccupied

wait c:
    c.q.push_back(thread)
    schedule  // block this thread

leave:
    schedule

signal c:
    if c.q.any():
        t ← c.q.pop_front()  // t → "the signaled thread"
        s.push_back(thread)  // signaler goes on the signal queue
        t.run                // must run signaled thread immediately

• Signaler must wait, but gets priority over threads on the entrance queue
• Lock only released by schedule (if no waiters)
• Options for signaler:
  • Switch out (go on s queue)
  • Exit (Hansen monitors)
  • Continue executing?
• Pros/Cons?
Mesa-style monitors
(aka non-blocking condition variables)

enter:
    if locked:
        e.push_back(thread)
        block
    else:
        lock

schedule:
    if e.any():
        t ← e.pop_front()
        t.run
    else:
        unlock

wait c:
    c.q.push_back(thread)
    schedule
    block

notify c:
    if c.q.any():
        t ← c.q.pop_front()  // t is "notified"
        e.push_back(t)

• Leave still calls schedule
• No signal queue
• Extendable with more queues for priority
• What are the differences/pros/cons?
Example: anyone see a bug?
Solutions?
• Timeouts
• notifyAll
• Can Hoare monitors support notifyAll?
Barriers
Prefix Sum

begin: a, b, c, d, e, f
end:   a, a+b, a+b+c, a+b+c+d, a+b+c+d+e, a+b+c+d+e+f
Prefix Sum (sequential, one step at a time)

begin: a, b, c, d, e, f
       a, a+b, c, d, e, f
       a, a+b, a+b+c, d, e, f
       a, a+b, a+b+c, a+b+c+d, e, f
       a, a+b, a+b+c, a+b+c+d, a+b+c+d+e, f
end:   a, a+b, a+b+c, a+b+c+d, a+b+c+d+e, a+b+c+d+e+f
Parallel Prefix Sum

begin:  a, b, c, d, e, f
step 1: a, a+b, b+c, c+d, d+e, e+f
step 2: a, a+b, a+b+c, a+b+c+d, b+c+d+e, c+d+e+f
end:    a, a+b, a+b+c, a+b+c+d, a+b+c+d+e, a+b+c+d+e+f

Synchronization Algorithms and Concurrent Programming, Chapter 5, Gadi Taubenfeld © 2014
Pthreads Parallel Prefix Sum

Will this work?
Pthreads Parallel Prefix Sum

fixed?
Parallel Prefix Sum (with barriers between steps)

begin:  a, b, c, d, e, f
        [barrier]
step 1: a, a+b, b+c, c+d, d+e, e+f
        [barrier]
step 2: a, a+b, a+b+c, a+b+c+d, b+c+d+e, c+d+e+f
end:    a, a+b, a+b+c, a+b+c+d, a+b+c+d+e, a+b+c+d+e+f
What is a Barrier?

➢ Coordination mechanism (algorithm) forcing processes/threads to wait until all have reached a specified point
➢ Once all reach the barrier, all can pass

(figure: four processes approach the barrier; all except P4 arrive; once all arrive, they continue)
Pthreads and barriers

Type pthread_barrier_t

int pthread_barrier_init(pthread_barrier_t *barrier,
                         const pthread_barrierattr_t *attr,
                         unsigned count);
int pthread_barrier_destroy(pthread_barrier_t *barrier);
int pthread_barrier_wait(pthread_barrier_t *barrier);
Pthreads Parallel Prefix Sum

fixed?
Barrier Goals

Desirable barrier properties:
• Low shared memory space complexity
• Low contention on shared objects
• Low shared memory references per process
• No need for shared memory initialization
• Symmetric: same amount of work for all processes
• Algorithm simplicity
• Simple basic primitive
• Minimal propagation time
• Reusability of the barrier (must!)
Barrier Building Blocks
• Conditions
• Semaphores
• Atomic Bit
• Atomic Register
• Fetch-and-increment register
• Test and set bits
• Read-Modify-Write register

Barrier with Semaphores

Barrier using Semaphores: Algorithm for N threads

shared sem_t arrival = 1;      // sem_init(&arrival, 0, 1)
       sem_t departure = 0;    // sem_init(&departure, 0, 0)
       atomic int counter = 0; // (gcc intrinsics are verbose)

Phase I:
1  sem_wait(arrival);
2  if (++counter < N)
3      sem_post(arrival);
4  else
5      sem_post(departure);

Phase II:
6  sem_wait(departure);
7  if (--counter > 0)
8      sem_post(departure);
9  else
10     sem_post(arrival);

• Phase I: first N-1 threads post on arrival, wait on departure; the Nth thread posts on departure, releasing threads into phase II (what is the value of arrival?)
• Phase II: first N-1 threads post on departure; the last posts arrival
Semaphore Barrier Action Zone (N == 3)

shared sem_t arrival = 1; sem_t departure = 0; atomic int counter = 0;

CPU 0, CPU 1, and CPU 2 each run the barrier code above, interleaving arbitrarily.

• Do we need two phases?
• Still correct if counter is not atomic?
Barrier using Semaphores: Properties

• Pros:
  • Very simple
  • Space complexity O(1)
  • Symmetric
• Cons:
  • Requires a strong object
  • Requires some central manager
  • High contention on the semaphores
  • Propagation delay O(n)
Barriers based on counters
Counter Barrier Ingredients

Fetch-and-Increment register
• A shared register that supports a F&I operation
  • Input: register r
  • Atomic operation: r is incremented by 1; the old value of r is returned

function fetch-and-increment (r : register)
    orig_r := r
    r := r + 1
    return (orig_r)
end-function

Await
• For brevity, we use the await macro
• Not an operation of an object
• This is also called: “spinning”

macro await (condition : boolean condition)
    repeat
        cond = eval(condition)
    until (cond)
end-macro
Simple Barrier Using an Atomic Counter

shared counter: fetch-and-increment register – {0,…,n}, initially = 0
       go: atomic bit, initial value is immaterial
local  local.go: a bit, initial value is immaterial
       local.counter: register

1 local.go := go
2 local.counter := fetch-and-increment (counter)
3 if local.counter + 1 = n then
4     counter := 0
5     go := 1 - go
6 else await(local.go ≠ go)
Simple Barrier Using an Atomic Counter: Run for n=2 Threads

P1 arrives first: local.go := go (= 0), local.counter := fetch-and-increment(counter) = 0. Since 0 + 1 ≠ 2, P1 busy-waits on await(local.go ≠ go). P2 arrives: local.counter = 1; since 1 + 1 = 2, P2 resets counter := 0 and flips go := 1, releasing P1.

Pros/Cons?
• There is high memory contention on the go bit
• Reducing the contention:
  • Replace the go bit with n bits: go[1],…,go[n]
  • Process pi may spin only on the bit go[i]
A Local Spinning Counter Barrier: Program of a Thread i

shared counter: fetch-and-increment register – {0,…,n}, initially = 0
       go[1..n]: array of atomic bits, initial values are immaterial
local  local.go: a bit, initial value is immaterial
       local.counter: register

1 local.go := go[i]
2 local.counter := fetch-and-increment (counter)
3 if local.counter + 1 = n then
4     counter := 0
5     for j=1 to n { go[j] := 1 - go[j] }
6 else await(local.go ≠ go[i])
A Local Spinning Counter Barrier: Example Run for n=3 Threads

P1 arrives first (local.counter = 0; 0 + 1 ≠ 3) and spins on go[1]; P2 arrives next (1 + 1 ≠ 3) and spins on go[2]. P3 arrives last (2 + 1 = 3), resets counter := 0 and flips every go[j], releasing P1 and P2.

Pros/Cons? Does this actually reduce contention?
Comparison of counter-based Barriers

Simple Barrier
• Pros:
  • Very simple
  • Shared memory: O(log n) bits
  • Takes O(1) until last waiting process is awakened
• Cons:
  • High contention on the go bit
  • Contention on the counter register (*)

Simple Barrier with go array
• Pros:
  • Low contention on the go array
  • In some models: spinning is done on local memory; remote memory references: O(1)
• Cons:
  • Shared memory: O(n)
  • Still contention on the counter register (*)
  • Takes O(n) until last waiting process is awakened
Tree Barriers
A Tree-based Barrier

• Threads are organized in a binary tree
• Each node is owned by a predetermined thread
• Each thread waits until its 2 children arrive
  • combines results
  • passes them on to its parent
• Root learns that its 2 children have arrived → tells children they can go
• The signal propagates down the tree until all the threads get the message

(figure: binary tree with root 1, children 2 and 3, leaves 4-7)
A Tree-based Barrier: indexing

• Step 1: label nodes numerically, level by level (heap indexing): node i has children 2i and 2i+1
• Assume n = 2^k − 1
• Indexing starts from 2; root → 1 doesn’t need wait objects
• Wait objects: arrive[2..n] and go[2..n], one pair per non-root node

(figure: tree with root 1; children 2, 3; internal nodes 4-7; leaves 8-15; arrive/go bits for nodes 2..15)
A Tree-based Barrier: program of thread i

shared arrive[2..n]: array of atomic bits, initial values = 0
       go[2..n]: array of atomic bits, initial values = 0

1  if i = 1 then                                  // root
2      await(arrive[2] = 1); arrive[2] := 0
3      await(arrive[3] = 1); arrive[3] := 0
4      go[2] := 1; go[3] := 1
5  else if i ≤ (n-1)/2 then                       // internal node
6      await(arrive[2i] = 1); arrive[2i] := 0
7      await(arrive[2i+1] = 1); arrive[2i+1] := 0
8      arrive[i] := 1
9      await(go[i] = 1); go[i] := 0
10     go[2i] := 1; go[2i+1] := 1
11 else                                           // leaf
12     arrive[i] := 1
13     await(go[i] = 1); go[i] := 0
14 fi

Root: wait for arriving children; tell children to go
Internal: wait for arriving children; signal own arrival; wait for parent’s go signal; tell children to go
Leaf: signal arrival; wait for parent’s go signal
A Tree-based Barrier: Example Run for n=7 threads

Leaves P4-P7 set arrive[i] and wait on go[i]. Internal node P2 waits for arrive[4] and arrive[5], zeros them, sets arrive[2], and waits on go[2]; P3 does the same with arrive[6] and arrive[7], then sets arrive[3] and waits on go[3]. The root P1 waits for arrive[2] and arrive[3], zeros them, and sets go[2] := 1 and go[3] := 1. P2 and P3 wake, zero their go bits, and set go[4..7], so the go signal propagates down until every thread is released. Note that at this point all non-root threads were, at some moment, blocked in await(go). Finished!
Tree Barrier Tradeoffs

• Pros:
  • Low shared memory contention: no wait object is shared by more than 2 processes
  • Good for larger n
  • Fast: information from the root propagates after log(n) steps
  • Can use only atomic primitives (no special objects)
  • On some models:
    • each process spins on a locally accessible bit
    • # (remote memory refs) = O(1) per process
• Cons:
  • Shared memory space complexity: O(n)
  • Asymmetric: not all processes do the same amount of work
Butterfly Barrier

• When would this be preferable?

Hardware Supported Barriers

(figures: GPU and CPU hardware barrier support)
Barriers Summary

Seen:
• Semaphore-based barrier
• Simple barrier
  • Based on atomic fetch-and-increment counter
• Local spinning barrier
  • Based on atomic fetch-and-increment counter and go array
• Tree-based barrier

Not seen:
• Test-and-Set barriers
  • Based on test-and-test-and-set objects
  • One version without memory initialization
• See-Saw barrier
Questions?
