Decomposition and Orchestration
In exploratory decomposition, the serial algorithm may explore different alternatives one after the
other, because the branch that may lead to the solution is not known beforehand
⇒ the parallel program may perform more, less, or the same amount of
aggregate work compared to the serial algorithm depending on the
location of the solution in the search space.
In speculative decomposition
the input at a branch leading to multiple parallel tasks is unknown
the serial algorithm would strictly perform only one of the tasks at a
speculative stage because when it reaches the beginning of that stage,
it knows exactly which branch to take.
⇒ a parallel program employing speculative decomposition performs more
aggregate work than its serial counterpart.
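A minimal sketch of this idea (illustrative only; slow_condition, branch_a, and branch_b are hypothetical functions, and OpenMP sections are assumed): both possible continuations are evaluated in parallel with the branch condition, and the result of the branch that turns out not to be taken is discarded, which is exactly the extra aggregate work noted above.

/* Speculative decomposition sketch: the branch condition and both possible
   branch bodies run concurrently; one branch's result is thrown away once
   the condition is known. (Hypothetical helper functions.) */
extern int    slow_condition(void);   /* decides which branch to take */
extern double branch_a(void);         /* work done if the condition holds */
extern double branch_b(void);         /* work done otherwise */

double speculative_branch(void)
{
    int cond = 0;
    double a = 0.0, b = 0.0;

    #pragma omp parallel sections
    {
        #pragma omp section
        cond = slow_condition();
        #pragma omp section
        a = branch_a();               /* speculative */
        #pragma omp section
        b = branch_b();               /* speculative */
    }
    return cond ? a : b;              /* discard the untaken branch's result */
}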
Hybrid Decompositions
Decomposition techniques are not mutually exclusive and can often be combined.
Often, a computation is structured into multiple stages and it is sometimes necessary
to apply different types of decomposition in different stages.
Example 1: while finding the minimum of a large set of n numbers,
a purely recursive decomposition may result in far more tasks than the number of
processes, P, available.
An efficient decomposition would partition the input into P roughly equal parts and
have each task compute the minimum of the sequence assigned to it.
The final result can be obtained by finding the minimum of the P intermediate results using
recursive decomposition.
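A minimal sketch of this hybrid scheme (illustrative; assumes OpenMP, and parallel_min is a hypothetical helper name): phase 1 is the data decomposition into P blocks, phase 2 combines the P partial minima with a recursive, tree-style reduction.

#include <float.h>

float parallel_min(const float *a, int n, int P)
{
    float partial[P];                          /* one intermediate result per task */

    /* Phase 1: data decomposition - each task scans one contiguous block */
    #pragma omp parallel for num_threads(P)
    for (int p = 0; p < P; p++) {
        int lo = (int)((long)n * p / P);
        int hi = (int)((long)n * (p + 1) / P);
        float m = FLT_MAX;
        for (int i = lo; i < hi; i++)
            if (a[i] < m) m = a[i];
        partial[p] = m;
    }

    /* Phase 2: recursive decomposition - combine the P partials pairwise,
       level by level; the pairs within one level are independent */
    for (int stride = 1; stride < P; stride *= 2) {
        #pragma omp parallel for num_threads(P)
        for (int p = 0; p + stride < P; p += 2 * stride) {
            if (partial[p + stride] < partial[p])
                partial[p] = partial[p + stride];
        }
    }
    return partial[0];
}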
Example 2: quicksort in parallel.
Earlier, we used recursive decomposition to derive a concurrent formulation of quicksort.
This formulation results in O(n) tasks for the problem of sorting a sequence of size n.
But due to the dependencies among these tasks and due to uneven sizes of the tasks, the
effective concurrency is quite limited.
For example, the first task for splitting the input list into two parts takes O(n) time, which puts
an upper limit on the performance gain possible via parallelization.
The step of splitting lists performed by tasks in parallel quicksort can also be
decomposed using the input data-decomposition technique.
The resulting hybrid decomposition that combines recursive decomposition and the input
data-decomposition leads to a highly concurrent formulation of quicksort.
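A sketch of the recursive-decomposition part only (illustrative, using OpenMP tasks; not the formulation from the slides): each recursive call on an independent sub-list becomes a task, while the partition step is left sequential here, which is precisely the O(n) bottleneck noted above that the hybrid scheme removes by data-decomposing the splitting step as well.

static void swap_f(float *x, float *y) { float t = *x; *x = *y; *y = t; }

static void qsort_rec(float *a, int lo, int hi)   /* sorts a[lo..hi] */
{
    if (lo >= hi) return;

    /* sequential partition around pivot a[hi]: the O(n) step that limits
       speedup unless it is itself data-decomposed (the hybrid scheme) */
    float pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) swap_f(&a[i++], &a[j]);
    swap_f(&a[i], &a[hi]);

    /* recursive decomposition: the two sub-lists are independent tasks */
    #pragma omp task
    qsort_rec(a, lo, i - 1);
    qsort_rec(a, i + 1, hi);          /* current task handles the other half */
    #pragma omp taskwait
}

void parallel_quicksort(float *a, int n)
{
    #pragma omp parallel
    #pragma omp single
    qsort_rec(a, 0, n - 1);
}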
Orchestration by an example
A simplified version of a kernel of the Ocean application: its equation solver.
We use the equation solver to dig deeper and illustrate how to implement a parallel
program using the three programming models.
The equation solver kernel solves a simple partial differential equation on a grid, using
what is referred to as a finite differencing method.
It operates on a regular, 2-d grid or array of (n+2)-by-(n+2) elements, such as a single
horizontal cross-section of the ocean basin in Ocean.
The border rows and columns of the grid contain boundary values that do not change,
while the interior n-by-n points are updated by the solver starting from their initial values.
The computation proceeds over a number of sweeps.
In each sweep, it operates on all the elements of the grid, for each element replacing its
value with a weighted average of itself and its four nearest neighbor elements (above,
below, left and right).
The updates are done in-place in the grid, so a point sees the new values of the points
above and to the left of it, and the old values of the points below it and to its right.
This form of update is called the Gauss-Seidel method.
During each sweep the kernel also computes the average difference of an updated
element from its previous value.
If this average difference over all elements is smaller than a predefined “tolerance”
parameter, the solution is said to have converged & solver exits at the end of the sweep.
Otherwise, it performs another sweep and tests for convergence again.
Example - decomposition
Sequential pseudocode:

float diff = 0, temp;
while (!done) do                         /* outermost loop over sweeps */
    diff = 0;                            /* initialize max. diff. to 0 */
    for i <- 1 to n do                   /* sweep over non-border points of grid */
        for j <- 1 to n do
            temp = A[i,j];               /* save old value of element */
            A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
                             A[i,j+1] + A[i+1,j]);   /* compute average */
            diff += abs(A[i,j] - temp);
        end for
    end for
    if (diff/(n*n) < TOL) then done = 1;
end while

Decomposition:
For programs that are structured in successive loops or loop nests, a simple way to identify concurrency is to start from the loop structure itself.
Examine the individual loops or loop nests in the program one at a time, see if their iterations can be performed in parallel, and determine whether this exposes enough concurrency.
Each iteration of the outermost loop sweeps through the entire grid. These iterations are clearly not independent, since data that are modified in one iteration are accessed in the next.
Look at the inner loop first (the j loop). Each iteration of this loop reads the grid point (A[i,j-1]) that was written in the previous iteration. The iterations are therefore sequentially dependent, and we call this a sequential loop.
The outer loop of this nest is also sequential, since the elements in row i-1 were written in the previous (i-1th) iteration of this loop.
So this simple analysis of existing loops and their dependences uncovers no concurrency in this case.
Example - decomposition approaches
An alternative to relying on program structure to find concurrency is to go back to the fundamental
dependences in the underlying algorithms used, regardless of program or loop structure.
Look at the fundamental data dependences at the granularity of individual grid points.
Computing a particular grid point in the sequential program uses the updated values of the grid points directly
above and to the left.
Elements along a given anti-diagonal (south-west to north-east) have no dependences among them and can
be computed in parallel, while the points in the next anti-diagonal depend on some points in the previous one.
From this dependence structure, we can observe that, of the work involved in each sweep, there is a sequential
dependence proportional to n along the diagonal and inherent concurrency proportional to n.
Decompose the work into individual grid points, so updating a single grid point is a task.
Approach 1:
Leave the loop structure of the program as it is
Insert point-to-point synchronization to ensure that a grid point has been produced in the current sweep
before it is used by the points to the right of or below it.
Different loop nests and even different sweeps might be in progress simultaneously on different elements, as
long as the element-level dependences are not violated.
The overhead of this synchronization at grid-point level may be too high.
Approach 2:
Change the loops: make the outer loop iterate over anti-diagonals and the inner loop over the elements within an anti-diagonal.
The inner loop can now be executed completely in parallel, with global synchronization between iterations of
the outer for loop to preserve dependences conservatively across antidiagonals.
Global synchronization is still very frequent: once per antidiagonal.
Also, the number of iterations in the parallel (inner) loop changes with successive outer loop iterations,
causing load imbalances among processors especially in the shorter antidiagonals.
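A minimal sketch of Approach 2 (illustrative, assuming OpenMP; the diff accumulation is omitted for brevity): the outer loop walks the anti-diagonals sequentially, the inner loop updates the independent elements of one anti-diagonal in parallel, and the implicit barrier at the end of each parallel loop provides the per-anti-diagonal global synchronization.

void sweep_antidiagonals(float **A, int n)
{
    for (int d = 2; d <= 2 * n; d++) {          /* sequential over anti-diagonals */
        int ilo = (d - n > 1) ? d - n : 1;
        int ihi = (d - 1 < n) ? d - 1 : n;      /* note: length varies with d     */
        #pragma omp parallel for                 /* elements of one anti-diagonal  */
        for (int i = ilo; i <= ihi; i++) {       /* are independent of each other  */
            int j = d - i;
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                              + A[i][j+1] + A[i+1][j]);
        }
    }
}

The varying trip count of the inner loop (short anti-diagonals near the corners) is the load-imbalance problem noted above.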
Because of the frequency of synchronization, the load imbalances, and the programming
complexity, neither of these approaches is used much on modern architectures.
Example – red-black ordering
Approach 3: exploiting knowledge of the problem beyond the sequential program itself.
Since the Gauss-Seidel solver iterates until convergence, we can update the grid points in a
different order as long as we use updated values for grid points frequently enough.
One such ordering that is used often for parallel versions is called red-black ordering.
The idea here is to separate the grid points into alternating red points and black points as on a
checkerboard, so that no red point is adjacent to another red point and no black point is adjacent to another black point.
To compute a red point we do not need the updated value of any other red point, but only the
updated values of the above and left black points (in a standard sweep), and vice versa.
We can therefore divide a grid sweep into two phases: first computing all red points and then
computing all black points.
Within each phase there are no dependences among grid points, so we can compute all red
points in parallel, then synchronize globally, and then compute all black points in parallel.
Global synchronization is conservative and can be replaced by point-to-point synchronization at
the level of grid points—since not all black points need to wait for all red points to be
computed—but it is convenient.
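A minimal sketch of one red-black sweep (illustrative, assuming OpenMP; red points are taken to be those with i+j even): all points of one color are updated in parallel, and the join at the end of each parallel loop acts as the global synchronization between the red and black phases.

#include <math.h>

float redblack_sweep(float **A, int n)
{
    float diff = 0.0f;
    for (int color = 0; color <= 1; color++) {            /* 0 = red, 1 = black */
        #pragma omp parallel for reduction(+:diff)
        for (int i = 1; i <= n; i++) {
            for (int j = 1 + ((i + 1 + color) % 2); j <= n; j += 2) {
                float old = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                  + A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - old);
            }
        }   /* implicit join here = global sync between the two phases */
    }
    return diff;       /* caller compares diff/(n*n) against the tolerance */
}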
The red-black ordering is different from our original sequential ordering, and may
therefore converge in fewer or more sweeps, as well as produce different final
values for the grid points (though still within the convergence tolerance).
Even if we don’t use updated values from the current while loop iteration for any grid
points, and we always use the values as they were at the end of the previous while loop
iteration, the system will still converge, though more slowly.
This is called Jacobi rather than Gauss-Seidel iteration.
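A minimal Jacobi sweep sketch for contrast (illustrative, assuming OpenMP and that the border rows and columns of next already hold the fixed boundary values): every update reads only values from the previous sweep, so all grid points of a sweep can be computed in parallel with no ordering constraints, at the cost of slower convergence and a second grid.

#include <math.h>

float jacobi_sweep(float **prev, float **next, int n)
{
    float diff = 0.0f;
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            next[i][j] = 0.2f * (prev[i][j] + prev[i][j-1] + prev[i-1][j]
                                 + prev[i][j+1] + prev[i+1][j]);
            diff += fabsf(next[i][j] - prev[i][j]);
        }
    }
    return diff;   /* caller swaps prev and next, then tests diff/(n*n) < TOL */
}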
Example – assignment
Static assignment:
The simplest option is a static (predetermined) assignment in
which each processor is responsible for a contiguous block of
rows: block assignment
Alternative: cyclic assignment in which rows are interleaved
among processes.
Dynamic assignment:
each process repeatedly grabs the next available (not yet computed) row after it finishes
with its current row;
it is not predetermined which process computes which rows.
Static block assignment exhibits good load balance across processes as long as the
number of rows is divisible by the number of processes, since the work per row is uniform.
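For concreteness, the row-to-process mappings can be written as simple index arithmetic (a sketch with hypothetical helper names; rows are numbered 1..n, and the block case assumes n is divisible by nprocs):

/* Static block assignment: process pid owns a contiguous band of rows. */
int block_first_row(int pid, int n, int nprocs) { return pid * (n / nprocs) + 1; }
int block_last_row (int pid, int n, int nprocs) { return (pid + 1) * (n / nprocs); }

/* Static cyclic (interleaved) assignment: row i belongs to process (i-1) mod nprocs. */
int cyclic_owner(int i, int nprocs) { return (i - 1) % nprocs; }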
Orchestration under the Data Parallel Model
Data-parallel pseudocode:

int n, nprocs;                      /* grid size (n+2-by-n+2) and number of processes */
float **A, diff = 0;

main()
begin
    read(n); read(nprocs);          /* read input grid size and number of processes */
    A <- G_MALLOC (a 2-d array of size n+2 by n+2 doubles);
    initialize(A);                  /* initialize the matrix A somehow */
    Solve(A);                       /* call the routine to solve equation */
end main

procedure Solve(A)                  /* solve the equation system */
    float **A;                      /* A is an (n+2)-by-(n+2) array */
begin
    int i, j, done = 0;
    float mydiff = 0, temp;
    DECOMP A[BLOCK,*,nprocs];
    while (!done) do                /* outermost loop over sweeps */
        mydiff = 0;                 /* initialize maximum difference to 0 */
        for_all i <- 1 to n do      /* sweep over non-border points of grid */
            for_all j <- 1 to n do
                temp = A[i,j];      /* save old value of element */
                A[i,j] <- 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
                                 A[i,j+1] + A[i+1,j]);   /* compute average */
                mydiff += abs(A[i,j] - temp);
            end for_all
        end for_all
        REDUCE (mydiff, diff, ADD);
        if (diff/(n*n) < TOL) then done = 1;
    end while
end procedure

Differences from the sequential code:
Dynamically allocated shared data are allocated with a G_MALLOC (global malloc) call rather than a regular malloc.
Use of a DECOMP statement.
Use of for_all loops instead of for loops.
Use of a private mydiff variable per process.
Use of a REDUCE statement.

for_all specifies that the iterations of the loop may be performed in parallel.

The DECOMP statement has a two-fold purpose:
It specifies the assignment of the iterations to processes. The [BLOCK,*,nprocs] assignment means that the first dimension (rows) is partitioned into contiguous pieces among the nprocs processes and the second dimension is not partitioned at all. [CYCLIC,*,nprocs] would have implied a cyclic or interleaved partitioning of rows among nprocs processes, [BLOCK,BLOCK,nprocs] a subblock decomposition, and [*,CYCLIC,nprocs] an interleaved partitioning of columns.
It also specifies how the grid data should be distributed among memories on a distributed-memory machine.

The mydiff variable allows each process to first independently compute the sum of the difference values for its assigned grid points.

REDUCE directs the system to add all the partial mydiff values together into the shared diff variable. The reduction operation may be implemented in a library in a manner best suited to the underlying architecture.
Orchestration under the Shared Address Space Model
Shared-address-space pseudocode:

1.   int n, nprocs;       /* matrix dimension and number of processors to be used */
2a.  float **A, diff;     /* A is global (shared) array representing the grid */
                          /* diff is global (shared) maximum difference in current sweep */
2b.  LOCKDEC(diff_lock);  /* declaration of lock to enforce mutual exclusion */
2c.  BARDEC(bar1);        /* barrier declaration for global sync between sweeps */
3.   main()
4.   begin
5.       read(n); read(nprocs);   /* read input matrix size and number of processes */
6.       A <- G_MALLOC (a two-dimensional array of size n+2 by n+2 doubles);
7.       initialize(A);           /* initialize A in an unspecified way */
8a.      CREATE (nprocs-1, Solve, A);
8.       Solve(A);                /* main process becomes a worker too */
8b.      WAIT_FOR_END;            /* wait for all child processes created to terminate */
9.   end main
10.  procedure Solve(A)
11.      float **A;               /* A is an (n+2)-by-(n+2) shared array, as in the sequential program */
12.  begin
13.      int i, j, pid, done = 0;
14.      float temp, mydiff = 0;  /* private variables */
14a.     int mymin <- 1 + (pid * n/nprocs);   /* assume that n is divisible by */
14b.     int mymax <- mymin + n/nprocs - 1;   /* nprocs for simplicity here */
15.      while (!done) do         /* outermost loop over sweeps */
16.          mydiff = diff = 0;   /* set global diff to 0 (okay for all to do it) */
17.          for i <- mymin to mymax do   /* for each of my rows */
18.              for j <- 1 to n do       /* for all elements in that row */
19.                  temp = A[i,j];
20.                  A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                                  A[i,j+1] + A[i+1,j]);
22.                  mydiff += abs(A[i,j] - temp);
23.              endfor
24.          endfor
25a.         LOCK(diff_lock);     /* update global diff if necessary */
25b.         diff += mydiff;
25c.         UNLOCK(diff_lock);
25d.         BARRIER(bar1, nprocs);   /* ensure all have got here before checking if done */
25e.         if (diff/(n*n) < TOL) then done = 1;   /* check convergence; all get same answer */
25f.         BARRIER(bar1, nprocs);   /* see Exercise c */
26.      endwhile
27.  end procedure

Notes:
Declare the matrix A as a single shared array. Processes can reference the parts of it they need using loads and stores, with exactly the same array indices as in the sequential program. Communication will be generated implicitly as necessary.
With explicit parallel processes we now need mechanisms to:
create the processes,
coordinate them through synchronization, and
control the assignment.
Differences from the sequential code are shown in italicized bold font in the textbook. Comments: in the textbook.
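For concreteness, the generic LOCK/UNLOCK and BARRIER primitives used in statements 25a-25f map naturally onto POSIX threads; a minimal sketch under that assumption (diff, n, nprocs and TOL as in the pseudocode, done being each thread's private flag, and bar1 initialized once with pthread_barrier_init(&bar1, NULL, nprocs)):

#include <pthread.h>

extern int   nprocs, n;
extern float TOL;

pthread_mutex_t   diff_lock = PTHREAD_MUTEX_INITIALIZER;  /* LOCKDEC(diff_lock) */
pthread_barrier_t bar1;                                    /* BARDEC(bar1)       */
float diff;                                                /* shared             */

void convergence_check(float mydiff, int *done)
{
    pthread_mutex_lock(&diff_lock);     /* 25a LOCK(diff_lock)                    */
    diff += mydiff;                     /* 25b accumulate my partial sum          */
    pthread_mutex_unlock(&diff_lock);   /* 25c UNLOCK(diff_lock)                  */

    pthread_barrier_wait(&bar1);        /* 25d wait until all partial sums are in */
    if (diff / (float)(n * n) < TOL)    /* 25e every thread gets the same answer  */
        *done = 1;
    pthread_barrier_wait(&bar1);        /* 25f before diff is reset for next sweep*/
}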
Orchestration under the Message Passing Model
Message-passing pseudocode:

1.   int pid, n, nprocs;   /* process id, matrix dimension and number of processors to be used */
2.   float **myA;
3.   main()
4.   begin
5.       read(n); read(nprocs);   /* read input matrix size and number of processes */
8a.      CREATE (nprocs-1 processes that start at procedure Solve);
8b.      Solve();                 /* main process becomes a worker too */
8c.      WAIT_FOR_END;            /* wait for all child processes created to terminate */
9.   end main
10.  procedure Solve()
11.  begin
13.      int i, j, pid, n' = n/nprocs, done = 0;
14.      float temp, tempdiff, mydiff = 0;   /* private variables */
6.       myA <- malloc(a 2-d array of size [n/nprocs + 2] by n+2);   /* my assigned rows of A */
7.       initialize(myA);         /* initialize my rows of A, in an unspecified way */
15.      while (!done) do
16.          mydiff = 0;          /* set local diff to 0 */
16a.         if (pid != 0) then SEND(&myA[1,0], n*sizeof(float), pid-1, ROW);
16b.         if (pid != nprocs-1) then SEND(&myA[n',0], n*sizeof(float), pid+1, ROW);
16c.         if (pid != 0) then RECEIVE(&myA[0,0], n*sizeof(float), pid-1, ROW);
16d.         if (pid != nprocs-1) then RECEIVE(&myA[n'+1,0], n*sizeof(float), pid+1, ROW);
             /* border rows of neighbors have now been copied into myA[0,*] and myA[n'+1,*] */
17.          for i <- 1 to n' do        /* for each of my rows */
18.              for j <- 1 to n do     /* for all elements in that row */
19.                  temp = myA[i,j];
20.                  myA[i,j] <- 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21.                                     myA[i,j+1] + myA[i+1,j]);
22.                  mydiff += abs(myA[i,j] - temp);
23.              endfor
24.          endfor
             /* communicate local diff values and determine if done; can be replaced by reduction and broadcast */
25a.         if (pid != 0) then         /* process 0 holds global total diff */
25b.             SEND(mydiff, sizeof(float), 0, DIFF);
25c.             RECEIVE(done, sizeof(int), 0, DONE);
25d.         else                       /* process 0 accumulates the total and tests convergence */
25e.             for i <- 1 to nprocs-1 do   /* for each other process */
25f.                 RECEIVE(tempdiff, sizeof(float), *, DIFF);
25g.                 mydiff += tempdiff;     /* accumulate into total */
25h.             endfor
25i.             if (mydiff/(n*n) < TOL) then done = 1;   /* check convergence */
25j.             for i <- 1 to nprocs-1 do   /* tell every other process whether to stop */
25k.                 SEND(done, sizeof(int), i, DONE);
25l.             endfor
25m.         endif
26.      endwhile
27.  end procedure

Comments: in the textbook.
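As the comment above statements 25a-25l notes, the explicit send/receive exchange of the partial diffs can be replaced by a reduction and broadcast. A minimal MPI sketch of just that step (illustrative; assumes each process already holds its local mydiff):

#include <mpi.h>

int convergence_check(float mydiff, int n, float tol)
{
    float diff;
    /* sum the per-process partial diffs; the result is delivered to all processes */
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return (diff / (float)(n * n) < tol);    /* 1 = converged (done) */
}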