This document discusses barrier synchronization in parallel and distributed computing. It motivates barriers with a simple two-phase video game renderer, in which two threads alternate between preparing and displaying frames, and with a parallel prefix computation. It then presents several barrier implementations: an atomic counter barrier, in which a shared counter tracks how many threads have reached the barrier and every thread except the last spins until the last arrival flips a shared bit; combining tree barriers, which split a large barrier into a tree of smaller barriers to reduce contention; and the dissemination barrier, in which threads notify one another over ⌈log n⌉ rounds.

License: Attribution Non-Commercial (BY-NC)

BARRIER SYNCHRONIZATION

By Samir Najar

Seminar in Concurrent and Distributed Computing


26/05/2010

SIMPLE VIDEO GAME


while (true) {
  frame.prepare();
  frame.display();
}

What about overlapping work?
1st thread displays a frame.
2nd thread prepares the next frame.

SIMPLE VIDEO GAME

Prepare frame for display, by graphics coprocessor.

Soft real-time application:
need at least 35 frames/second;
OK to mess up rarely.

TWO-PHASE RENDERING
Display thread:
while (true) {
  if (phase) { frame[0].display(); } else { frame[1].display(); }
  phase = !phase;
}

Prepare thread:
while (true) {
  if (phase) { frame[1].prepare(); } else { frame[0].prepare(); }
  phase = !phase;
}

Even and odd phases alternate: in one phase frame[0] is displayed while frame[1] is prepared, and in the next phase the two frames swap roles.

SYNCHRONIZATION PROBLEMS
How do threads stay in phase?

Too early? "We render no frame before its time."

Too late? We recycle memory before the frame is displayed.

IDEAL PARALLEL COMPUTATION

[Figure: three threads complete phase 0, then phase 1, then phase 2, in lockstep.]

REAL-LIFE PARALLEL COMPUTATION

[Figure: one thread stalls (zzz) while the others run ahead into later phases. Uh, oh.]

BARRIER SYNCHRONIZATION

[Figure: threads finish phase 0, wait at a barrier, and only then start phase 1.]

No thread enters phase 1 until every thread has left phase 0.

BARRIER DEF.

A barrier is a coordination mechanism (an algorithm) that forces the processes participating in a concurrent (or distributed) algorithm to wait until each of them has reached a certain point in its program. The collection of these coordination points is called the barrier. Once all the processes have reached the barrier, they are all permitted to continue past it.
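As a minimal sketch, the two-phase renderer can be kept in phase with the JDK's java.util.concurrent.CyclicBarrier standing in for a barrier. The class TwoPhaseRendering, its Frame type, the preparedInPhase field and the run method are hypothetical, introduced only to make the example checkable: in phase p the prepare thread fills the frame that will be shown in phase p + 1, the display thread checks that the frame it shows was prepared in phase p - 1, and both threads wait at the barrier before switching phases.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

final class TwoPhaseRendering {
    /** Hypothetical frame: records the phase in which it was last prepared. */
    static final class Frame {
        volatile int preparedInPhase = -1;
    }

    /** Runs the given number of phases; true if every displayed frame was prepared one phase earlier. */
    static boolean run(int phases) {
        Frame[] frame = { new Frame(), new Frame() };
        CyclicBarrier barrier = new CyclicBarrier(2);   // two parties: display and prepare threads
        boolean[] ok = { true };                        // written only by the display thread

        Thread prepare = new Thread(() -> {
            for (int p = 0; p < phases; p++) {
                frame[(p + 1) % 2].preparedInPhase = p; // prepare the frame shown in phase p + 1
                await(barrier);                         // stay in phase with the display thread
            }
        });
        Thread display = new Thread(() -> {
            for (int p = 0; p < phases; p++) {
                // "display" frame[p % 2]: it must have been prepared in phase p - 1
                if (p > 0 && frame[p % 2].preparedInPhase != p - 1) ok[0] = false;
                await(barrier);
            }
        });
        prepare.start();
        display.start();
        try {
            prepare.join();
            display.join();
        } catch (InterruptedException e) {
            return false;
        }
        return ok[0];
    }

    /** CyclicBarrier.await throws checked exceptions; rethrow them unchecked for the sketch. */
    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1000) ? "frames stayed in phase" : "frames out of phase");
    }
}
```

Without the two await calls, the prepare thread could run ahead and overwrite a frame before it is displayed, which is exactly the "too late" problem.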

WHY DO WE CARE?

Mostly of interest to scientific and numeric computation.

Elsewhere: garbage collection; less common in systems programming, but still an important topic.

EXAMPLE: PARALLEL PREFIX

before: a, b, c, d

after: a, a+b, a+b+c, a+b+c+d

PARALLEL PREFIX

One thread per entry.

PARALLEL PREFIX: PHASE 1

a, a+b, b+c, c+d

PARALLEL PREFIX: PHASE 2

a, a+b, a+b+c, a+b+c+d

PARALLEL PREFIX

N threads can compute the parallel prefix of N entries in log2 N rounds.

What if the system is asynchronous? That is why we need barriers.
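A short derivation of why ⌈log2 N⌉ rounds suffice. Write x_0 through x_{N-1} for the original entries (notation introduced here). Before the round with distance d, entry i holds the sum of the (at most) d inputs ending at position i; adding the old value of a[i-d] doubles that window:

```latex
\begin{aligned}
\text{before the round with distance } d:&\quad a[i] = \sum_{j=\max(0,\,i-d+1)}^{i} x_j \\
\text{update for } i \ge d:&\quad a[i] + a[i-d]
  = \sum_{j=\max(0,\,i-d+1)}^{i} x_j + \sum_{j=\max(0,\,i-2d+1)}^{i-d} x_j
  = \sum_{j=\max(0,\,i-2d+1)}^{i} x_j
\end{aligned}
```

Since d takes the values 1, 2, 4, and so on while d < N, the window after the last round has length 2d ≥ N, so every a[i] ends up as x_0 + x_1 + ... + x_i. The update must use the old value of a[i-d], which is why every thread must read before any thread writes in the same round.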

PREFIX
class Prefix extends Thread {
  private int[] a;    // array of input values
  private int i;      // thread index
  private Barrier b;  // shared barrier
  public Prefix(int[] a, Barrier b, int i) {
    // initialize fields
    this.a = a;
    this.b = b;
    this.i = i;
  }

WHERE DO THE BARRIERS GO?

Without barriers:
  public void run() {
    int d = 1, sum = 0;
    while (d < N) {
      if (i >= d) sum = a[i-d];
      if (i >= d) a[i] += sum;
      d = d * 2;
    }
  }
}

With barriers:
  public void run() {
    int d = 1, sum = 0;
    while (d < N) {
      if (i >= d) sum = a[i-d];
      b.await();  // make sure everyone reads before anyone writes
      if (i >= d) a[i] += sum;
      b.await();  // make sure everyone writes before anyone reads
      d = d * 2;
    }
  }
}
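A runnable sketch of the Prefix thread, assuming java.util.concurrent.CyclicBarrier in place of the slides' Barrier type and N = a.length. CyclicBarrier.await throws checked exceptions, so run() wraps them; CyclicBarrier also guarantees that writes made before await are visible to the other parties after they return from it, which is what the two barriers rely on. The prefixSums helper, which starts one thread per entry, is hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class Prefix extends Thread {
    private final int[] a;          // array of input values, updated in place
    private final int i;            // thread index
    private final CyclicBarrier b;  // shared barrier

    Prefix(int[] a, CyclicBarrier b, int i) {
        this.a = a;
        this.b = b;
        this.i = i;
    }

    @Override
    public void run() {
        int n = a.length, d = 1, sum = 0;
        try {
            while (d < n) {
                if (i >= d) sum = a[i - d];
                b.await();                   // everyone reads before anyone writes
                if (i >= d) a[i] += sum;
                b.await();                   // everyone writes before anyone reads
                d = d * 2;
            }
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: prefix sums of input, computed with one thread per entry. */
    static int[] prefixSums(int[] input) {
        int[] a = input.clone();
        if (a.length == 0) return a;
        CyclicBarrier barrier = new CyclicBarrier(a.length);
        Prefix[] threads = new Prefix[a.length];
        for (int k = 0; k < a.length; k++) {
            threads[k] = new Prefix(a, barrier, k);
            threads[k].start();
        }
        try {
            for (Prefix t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(prefixSums(new int[] {1, 2, 3, 4, 5})));
        // prints [1, 3, 6, 10, 15]
    }
}
```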

BARRIER IMPLEMENTATIONS

Cache coherence: spin on locally cached locations? Spin on statically defined locations?

Latency: how many steps?

Symmetry: do all threads do the same thing?

ATOMIC COUNTER
Shared
  counter: atomic counter ranging over {0..n}, initially 0
  go: atomic bit, initial value is immaterial
Local
  local.go: a bit, initial value is immaterial

local.go := go              /* remember the current value */
counter := counter + 1      /* atomically increment the counter */
if counter = n then         /* last to arrive at the barrier */
    counter := 0            /* reset the barrier */
    go := 1 - go            /* notify all */
else await(local.go ≠ go)   /* not the last to arrive: spin until go changes */
fi
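A minimal Java sketch of the atomic counter barrier, assuming all threads share one CounterBarrier object (a hypothetical name). AtomicInteger.incrementAndGet plays the role of the atomic counter, and its return value is used for the "last to arrive" test so that the increment and the test form a single atomic step; a volatile int stands in for the go bit. The spin loop calls Thread.yield() so the demo also makes progress when threads outnumber cores, and the check method is an illustrative harness, not part of the algorithm.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CounterBarrier {
    private final int n;                                        // number of participating threads
    private final AtomicInteger counter = new AtomicInteger(0); // ranges over 0..n, initially 0
    private volatile int go = 0;                                // shared go bit; initial value immaterial

    CounterBarrier(int n) { this.n = n; }

    void await() {
        int localGo = go;                                 // remember the current value
        if (counter.incrementAndGet() == n) {             // atomic increment: last to arrive?
            counter.set(0);                               // reset the barrier
            go = 1 - localGo;                             // notify all
        } else {
            while (go == localGo) Thread.yield();         // spin until the last arrival flips go
        }
    }

    /** Demo check: no thread may leave round r before all n threads have entered it. */
    static boolean check(int n, int rounds) {
        CounterBarrier barrier = new CounterBarrier(n);
        AtomicInteger arrived = new AtomicInteger();
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            threads[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    arrived.incrementAndGet();
                    barrier.await();
                    if (arrived.get() < r * n) ok.set(false);   // someone had not arrived yet
                }
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            return false;
        }
        return ok.get() && arrived.get() == rounds * n;
    }
}
```

Because the go bit is flipped rather than reset, the same object can be reused for the next episode without reinitializing anything except the counter.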

REMARKS
Only the atomic counter needs to be initialized.

Small number of steps: the number of remote memory references per process in the CC model is O(1).

High memory contention: every process increments the shared counter and spins on the shared go bit.

The number of remote memory references in the DSM model is unbounded.

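ATOMIC COUNTER
Shared
  counter: atomic counter ranging over {0..n}, initially 0
  go[1..n]: array of atomic bits, initial values are immaterial
Local
  local.go: a bit, initial value is immaterial

local.go := go[i]                           /* remember the current value of my own bit */
counter := counter + 1                      /* atomically increment the counter */
if counter = n then                         /* last to arrive */
    counter := 0                            /* reset the barrier */
    for j = 1 to n do go[j] := 1 - go[j] od /* notify every process through its own bit */
else await(local.go ≠ go[i])                /* spin only on go[i] */
fi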
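The same sketch adapted to per-process go bits, assuming threads numbered 0..n-1 (the slide numbers them 1..n) that pass their own id to await. An AtomicIntegerArray holds the go bits. Java gives no control over which memory is locally accessible to which processor, so this only mirrors the structure of the algorithm; LocalSpinCounterBarrier and check are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class LocalSpinCounterBarrier {
    private final int n;
    private final AtomicInteger counter = new AtomicInteger(0);  // shared by all threads
    private final AtomicIntegerArray go;                         // go[i]: the bit thread i spins on

    LocalSpinCounterBarrier(int n) {
        this.n = n;
        this.go = new AtomicIntegerArray(n);
    }

    void await(int i) {                                      // i in 0..n-1
        int localGo = go.get(i);                             // remember my bit
        if (counter.incrementAndGet() == n) {                // last to arrive
            counter.set(0);                                  // reset the barrier
            for (int j = 0; j < n; j++) go.set(j, 1 - go.get(j));  // flip every thread's bit
        } else {
            while (go.get(i) == localGo) Thread.yield();     // spin only on my own bit
        }
    }

    /** Demo check: no thread may leave round r before all n threads have entered it. */
    static boolean check(int n, int rounds) {
        LocalSpinCounterBarrier barrier = new LocalSpinCounterBarrier(n);
        AtomicInteger arrived = new AtomicInteger();
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    arrived.incrementAndGet();
                    barrier.await(id);
                    if (arrived.get() < r * n) ok.set(false);
                }
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            return false;
        }
        return ok.get() && arrived.get() == rounds * n;
    }
}
```

Only the last arrival writes the go bits, so the read-then-set flip in the loop does not race with other writers.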

REMARKS
Memory contention is reduced by letting each process spin only on a locally accessible variable. There is still memory contention when accessing the counter, which is shared by all processes.

COMBINING TREE BARRIER


Split a huge barrier into many small barriers organized in a tree structure. Each small barrier corresponds to a node of the tree, identified by its node number and level. Each node holds a counter and a sense bit. The number of subtrees of each node (the degree) equals the number of processes that may participate in each small barrier. The number of processes n is a power of the degree, so the depth of the tree is log_degree n. The last process to arrive at a small barrier wins and continues to the next level.

COMBINING TREE BARRIERS

[Figure: a tree of 2-barriers, in which two leaf 2-barriers feed a root 2-barrier.]

COMBINING TREE BARRIER
Program of process i (processes are numbered 1..n):

node := i; level := -1
repeat
    node := ⌈node / degree⌉                        /* access barrier number node; each level has 1/degree as many nodes (half, for degree 2) */
    level := level + 1                             /* advance a level, begin that barrier */
    local.go := go[level,node]                     /* remember the sense value; every barrier is identified by level and node */
    counter[level,node] := counter[level,node] + 1 /* atomic counter per barrier */
    if counter[level,node] = degree then           /* last to arrive: the number of processes in each barrier equals degree */
        counter[level,node] := 0                   /* reset the barrier for the next episode */
        if level = log_degree(n) - 1 then          /* root */
            go[level,node] := 1 - go[level,node]   /* notify at the root */
        fi
    else await(local.go ≠ go[level,node])          /* not the last: wait to be notified */
    fi
until (local.go ≠ go[level,node])                 /* end barrier */
while (level ≠ 0) do                               /* notify the processes on your path back to level 0 */
    level := level - 1
    node := ⌈i / degree^(level+1)⌉
    go[level,node] := 1 - go[level,node]
od
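A Java sketch of the combining tree barrier under the slide's assumptions (processes numbered 1..n, n a power of degree). Each level's counters and go bits are AtomicIntegerArrays indexed by node number, with index 0 unused. Two liberties, both assumptions of this sketch: it records the node visited at each level in a path array instead of recomputing ⌈i / degree^(level+1)⌉ on the way down, and it replaces the until test with an equivalent released flag. CombiningTreeBarrier and check are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class CombiningTreeBarrier {
    private final int degree, depth;                // depth = log_degree(n)
    private final AtomicIntegerArray[] counter;     // counter[level].get(node)
    private final AtomicIntegerArray[] go;          // go[level].get(node): sense bit of that barrier

    CombiningTreeBarrier(int n, int degree) {
        if (degree < 2 || n < degree) throw new IllegalArgumentException("need n >= degree >= 2");
        int levels = 0;
        long size = 1;
        while (size < n) { size *= degree; levels++; }
        if (size != n) throw new IllegalArgumentException("n must be a power of degree");
        this.degree = degree;
        this.depth = levels;
        counter = new AtomicIntegerArray[levels];
        go = new AtomicIntegerArray[levels];
        int nodes = n;
        for (int level = 0; level < levels; level++) {
            nodes /= degree;                        // each level has 1/degree as many nodes
            counter[level] = new AtomicIntegerArray(nodes + 1);   // nodes numbered from 1
            go[level] = new AtomicIntegerArray(nodes + 1);
        }
    }

    void await(int i) {                             // i in 1..n
        int[] path = new int[depth];                // node visited at each level on the way up
        int node = i, level = -1;
        boolean released;
        do {
            node = (node + degree - 1) / degree;    // node := ceil(node / degree)
            level++;                                // advance a level, begin that barrier
            path[level] = node;
            int localGo = go[level].get(node);      // remember the sense value
            if (counter[level].incrementAndGet(node) == degree) {   // last to arrive
                counter[level].set(node, 0);        // reset for the next episode
                released = (level == depth - 1);    // only the winner at the root stops here
                if (released) go[level].set(node, 1 - localGo);     // root: notify
            } else {
                while (go[level].get(node) == localGo) Thread.yield();  // not the last: spin
                released = true;
            }
        } while (!released);
        while (level != 0) {                        // notify the nodes on the path back to level 0
            level--;
            go[level].set(path[level], 1 - go[level].get(path[level]));
        }
    }

    /** Demo check: no thread may leave round r before all n threads have entered it. */
    static boolean check(int n, int degree, int rounds) {
        CombiningTreeBarrier barrier = new CombiningTreeBarrier(n, degree);
        AtomicInteger arrived = new AtomicInteger();
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t + 1;                   // processes are numbered 1..n
            threads[t] = new Thread(() -> {
                for (int r = 1; r <= rounds; r++) {
                    arrived.incrementAndGet();
                    barrier.await(id);
                    if (arrived.get() < r * n) ok.set(false);
                }
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException e) {
            return false;
        }
        return ok.get() && arrived.get() == rounds * n;
    }
}
```

Only the winner of a node ever flips that node's go bit, either at the root or on its way back down, so waiters at a node are released exactly once per episode.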

REMARKS

Low memory contention: small barriers.

Cache behavior: local spinning on a bus-based architecture, but not on statically assigned locations.

The number of remote memory references per process is at most O(log n) in the CC model and is unbounded in the DSM model.

DISSEMINATION BARRIER
n processes, numbered {0..n-1}.

At round r:
thread i notifies thread i + 2^r (mod n);
thread i waits for notification from thread j, where j + 2^r (mod n) = i.

Requires ⌈log n⌉ rounds.

DISSEMINATION BARRIER
[Figure: in successive rounds each thread notifies the thread at distance +1, +2, +4.]

DISSEMINATION DETAILS

Use two copies of flags: avoids interference between consecutive barrier episodes.

Use sense reversing: avoids reinitializing the flags.

Per-thread state:
parity of the current episode;
sense (flips whenever the parity wraps back to 0).

DISSEMINATION BARRIER
Type
  flags = two-dimensional array[0..1, 0..⌈log n⌉-1] of atomic bits  /* one such flags structure is used per process */
Shared
  allflags: array[0..n-1] of flags, initial values are all 0         /* allflags[i] is locally accessible to process i */
Local
  parity: bit ranging over {0..1}, initially 0
  sense: bit ranging over {0..1}, initially 1   /* differs from the initial flags, so the first episode really waits */
  round: register ranging over {0..⌈log n⌉-1}

for round := 0 to ⌈log n⌉ - 1 do                      /* each barrier episode consists of exactly ⌈log n⌉ rounds */
    allflags[(i + 2^round) mod n][parity,round] := sense  /* process i notifies process (i + 2^round) mod n that it has arrived */
    await(allflags[i][parity,round] = sense)              /* process i waits for notification from process (i - 2^round) mod n by spinning on allflags[i][parity,round] */
od
if parity = 1 then sense := 1 - sense fi              /* prepare for the next barrier */
parity := 1 - parity

In all even barrier episodes process i spins on the bits allflags[i][0,*], and in odd barrier episodes it spins on the bits allflags[i][1,*].
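A Java sketch of the dissemination barrier, assuming processes numbered 0..n-1 that each pass their own id to await. Each process's flags structure is flattened into an AtomicIntegerArray indexed by parity * rounds + round; parity and sense live in plain arrays whose entry i is read and written only by process i. The sense bit starts at 1 while the flags start at 0, so that the first episode really waits. DisseminationBarrier and check are hypothetical names.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class DisseminationBarrier {
    private final int n, rounds;                  // rounds = ceil(log2 n)
    private final AtomicIntegerArray[] allflags;  // allflags[p]: flags of process p, all 0 initially
    private final int[] parity, sense;            // entry i is private state of process i

    DisseminationBarrier(int n) {
        this.n = n;
        int r = 0;
        while ((1 << r) < n) r++;
        this.rounds = r;
        allflags = new AtomicIntegerArray[n];
        for (int p = 0; p < n; p++) allflags[p] = new AtomicIntegerArray(2 * rounds);
        parity = new int[n];                      // parity starts at 0
        sense = new int[n];
        Arrays.fill(sense, 1);                    // sense starts at 1
    }

    private int slot(int par, int round) { return par * rounds + round; }  // flags[parity, round]

    void await(int i) {                           // i in 0..n-1
        int par = parity[i], s = sense[i];
        for (int round = 0; round < rounds; round++) {
            int partner = (i + (1 << round)) % n;
            allflags[partner].set(slot(par, round), s);                     // notify (i + 2^round) mod n
            while (allflags[i].get(slot(par, round)) != s) Thread.yield();  // wait for (i - 2^round) mod n
        }
        if (par == 1) sense[i] = 1 - s;           // prepare for the next barrier
        parity[i] = 1 - par;
    }

    /** Demo check: no thread may leave episode e before all n threads have entered it. */
    static boolean check(int n, int episodes) {
        DisseminationBarrier barrier = new DisseminationBarrier(n);
        AtomicInteger arrived = new AtomicInteger();
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int e = 1; e <= episodes; e++) {
                    arrived.incrementAndGet();
                    barrier.await(id);
                    if (arrived.get() < e * n) ok.set(false);
                }
            });
            threads[t].start();
        }
        try {
            for (Thread th : threads) th.join();
        } catch (InterruptedException ex) {
            return false;
        }
        return ok.get() && arrived.get() == episodes * n;
    }
}
```

The check with n = 5 exercises the case where n is not a power of 2.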

REMARKS

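Every thread always spins on the same, statically assigned locations (its own flags).

Good for NUMA implementations.

Works even if n is not a power of 2.

Not very space efficient: 2⌈log n⌉ flag bits per process, O(n log n) in total.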

IDEAS

Sense reversing: reuse without reinitializing.

Combining tree: like counters, locks.

Dissemination barrier: simple, not space-efficient.

