Chapter 8 - Advanced Parallel Algorithms

Advanced Parallel Algorithms
References
• Michael J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill.
• Albert Y. Zomaya. Parallel and Distributed Computing Handbook. McGraw-Hill.
• Ian Foster. Designing and Building Parallel Programs. Addison-Wesley.
• Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley.
• Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley.
• Nguyễn Đức Nghĩa. Tính toán song song (Parallel Computing). Hanoi, 2003.
8.1 Parallel Recursion
Divide and Conquer
• The principle of "divide and conquer" is as follows:
• Step 1: Divide the original problem into smaller sub-problems.
• Step 2: Solve the sub-problems recursively.
• Step 3: Combine the sub-problems' results to obtain the result of the original problem.
• The sub-problems are independent of each other, so they can be solved in parallel.
• The question is how to perform steps 1 and 3 most effectively.
Divide and Conquer
(figure omitted: divide-and-conquer recursion tree)
Complexity
• Consider a problem P of size n, divided into q = n/k sub-problems of k elements each (k > 1), executed in parallel with p processors:

t_divide_conquer(n,p) =
• t_trivial(n), if n is small enough;
• t_serial(n), if p = 1;
• t_divide(n,p) + t_combine(n,p) + (q/p)·t_divide_conquer(k,1), if 1 < p < q;
• t_divide(n,p) + t_combine(n,p) + t_divide_conquer(k,p/q), if p ≥ q.
For Example (1)
• Sum of n numbers A[1..n] with p processors.
• Idea:
• If n = 1, return the value A[1].
• If p = 1, run in serial mode.
• Divide array A into 2 parts A1 and A2, each containing n/2 elements, and execute in parallel:
• Recursively calculate S1, the sum of A1's elements, with p/2 processors.
• Recursively calculate S2, the sum of A2's elements, with p/2 processors.
• Return the total S = S1 + S2.
Sum of n numbers A[1..n] with p processors
INPUT : A[1..n], p processors;
OUTPUT : SUM = ∑ A[i];
FUNCTION S = SUM(A,n,m,p) // n, m are the first and last indices
BEGIN
IF p = 1 THEN
S = SEQUENCE_SUM(A,n,m);
RETURN;
END IF;
DO IN PARALLEL
S1 = SUM(A,n,(n+m)/2,p/2);
S2 = SUM(A,(n+m)/2+1,m,p/2);
END DO
S = S1 + S2;
END;
• Recursive equation: T(n) = T(n/2) + O(1) (considering p ≈ n).
• Complexity: O(log2n).
• Machine: PRAM EREW.
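The recursive scheme above can be sketched in Python. This is a minimal sketch, not the slides' PRAM model: `parallel_sum` is a hypothetical name and a thread pool stands in for the `DO IN PARALLEL` step.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(a, lo, hi, p):
    """Divide-and-conquer sum of a[lo:hi] with p workers (sketch)."""
    if p <= 1 or hi - lo <= 1:
        return sum(a[lo:hi])                       # serial base case
    mid = (lo + hi) // 2
    with ThreadPoolExecutor(max_workers=2) as ex:  # DO IN PARALLEL
        s1 = ex.submit(parallel_sum, a, lo, mid, p // 2)
        s2 = ex.submit(parallel_sum, a, mid, hi, p - p // 2)
        return s1.result() + s2.result()           # S = S1 + S2
```

With p ≈ n the recursion halves the range at every level, matching the T(n) = T(n/2) + O(1) equation above.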
Example: convex hull
• The problem of determining the convex hull of a set of points in the plane.
• Input: n points (xk, yk) in the plane.
• Output: the set of points that form the smallest convex polygon containing all the remaining points.
Parallel QuickHull
• Idea:
• Initial: define u, v as the 2 points with the smallest and largest x coordinates; hence u and v are on the convex hull.
• The segment (u,v) divides the initial set S into an upper and a lower region: S_upper and S_lower.
• Treat S_upper and S_lower in parallel.
Parallel QuickHull
• Both the upper hull and the lower hull can be treated in the same way.
• Division step:
• Select the pivot p as the point with the longest distance from the line (p1, p2).
• The pivot p lies on the convex hull; the points inside the triangle (p, p1, p2) are eliminated.
• The remaining points are divided into 2 parts, outside the edges (p1, p) and (p, p2).
• Apply the steps above recursively to these two parts.
Parallel QuickHull

• Tests:
• Pivot point p: max |(p1 − p) × (p2 − p)|;
• A point lies in the triangle if the angles it subtends to the three vertices sum to 2π.
• Angle between 2 vectors: cos(a,b) = (a · b)/(|a||b|).
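The pivot test can be written directly from the 2D cross product; `cross` and `pivot` are hypothetical helper names used only for illustration.

```python
def cross(o, a, b):
    """2D cross product (a - o) x (b - o); its magnitude is proportional
    to the distance of b from the line through o and a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def pivot(p1, p2, pts):
    """Pivot selection: the point farthest from the line (p1, p2)."""
    return max(pts, key=lambda q: abs(cross(p1, p2, q)))
```

The sign of `cross` also tells on which side of the directed line (p1, p2) a point lies, which is what the division step needs.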
Illustration steps
(figures omitted: upper and lower hulls; the set of points in S)
Procedure QUICKHULL
(figure omitted)
Recursion in PRAM
• If we represent the sub-problem levels of a recursive algorithm as a tree, we get a k-ary tree (with k = 2, a binary tree).
• The recursive idea in PRAM is to divide the set of processors into groups, each of which corresponds to a sub-tree of the tree.
• Executing in parallel with all processors, each group of processors does the work corresponding to its sub-tree.
Recursion with UpperHull
• Variables used:
• Each point i corresponds to a variable F[i]:
• Initial: F[i] = 1;
• Eliminated (inside a triangle): F[i] = 0;
• Marked as a point of the convex hull: F[i] = 2.
• Since each subset of points is delimited by two endpoints, each point tracks its current 2 endpoints through the variables P[i] and Q[i].
• Recursive steps:
• All processors work in parallel to identify the pivots T[i].
• Update the endpoints P[i] and Q[i]:
• Points to the left of (P[i], T[i]) are assigned Q[i] = T[i].
• Points to the left of (T[i], Q[i]) are assigned P[i] = T[i].
• Update the values F[i].
• Repeat the above until all F[i] ≠ 1.
8.2 Accelerated Cascading
Concepts of complexity
• In serial computation there is only one notion of complexity: the number of steps to execute the algorithm (≈ running time), called S(n).
• In parallel computation there is an additional notion, the number of operations performed across all processors, called W(n).
• If Wi(n) is the number of operations performed simultaneously at step i, we have the formula:

W(n) = ∑ (i = 1 to S(n)) Wi(n)
Example of S(n) and W(n)
• The combination problem over n = 2^k values using an associative operation ⊕. The balanced-tree algorithm works as follows:
(figure omitted: balanced-tree combination)
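The balanced-tree combination can be sketched as follows, where each `while` iteration corresponds to one parallel step; `tree_reduce` is a hypothetical name.

```python
def tree_reduce(xs, op):
    """Balanced-tree combination of n values with an associative op.
    Returns the result and the number of parallel steps S(n)."""
    xs, steps = list(xs), 0
    while len(xs) > 1:
        # one parallel step: combine disjoint neighbouring pairs
        pairs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:                 # odd element carries over
            pairs.append(xs[-1])
        xs, steps = pairs, steps + 1
    return xs[0], steps
```

For n = 2^k values this performs n/2 + n/4 + … + 1 = n − 1 operations over log2 n steps, i.e. W(n) = O(n) and S(n) = O(log2 n).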
Examples of S(n) and W(n)
• Determine the S(n) and W(n) values for each segment of the algorithm's code:
(figure omitted; for the balanced tree above, S(n) = log2n and W(n) = n − 1 = O(n))
Accelerated Cascading Technique
• The cost of an algorithm is the number of operations that the system must perform.
• An algorithm is called optimal if W(n) = Θ(Ts(n)), where:
• W(n): cost of the parallel algorithm.
• Ts(n): the best execution time of a serial algorithm.
• The accelerated cascading technique combines a non-optimal but fast algorithm with an optimal but slower one.
Example (1)
• An array L[1..n] holds integer values from 1..k with k = O(log2n). Determine the number of occurrences of each integer in array L.
• Let R[i] be the number of occurrences of value i.
• Optimal serial algorithm: Ts(n) = Θ(n).
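The optimal serial algorithm (its listing was lost in extraction) is a single counting pass; this sketch assumes 1-based values as on the slides, and `serial_count` is a hypothetical name.

```python
def serial_count(L, k):
    """One pass over L: R[i] = number of occurrences of value i.
    Ts(n) = Theta(n) for k = O(log2 n)."""
    R = [0] * (k + 1)        # index 0 unused; values range over 1..k
    for x in L:
        R[x] += 1
    return R
```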
First parallel approach
• Use a two-dimensional array C[1..n, 1..k], setting C[i,j] = 1 if L[i] = j and C[i,j] = 0 otherwise.
• The number of occurrences of integer j then equals the sum of column C[1:n, j].

Example with n = 8 & k = 4
(figure omitted)
Comments …
• The number of operations performed by the first two parallel loops is Θ(nk), in Θ(1) steps.
• The execution time of part 3, using the binary-tree model, is Θ(log2n).
• The real number of operations in part 3 is Θ(nk).
• With cost Θ(nk) against Ts(n) = Θ(n), this algorithm is not cost optimal.
Using Accelerated Cascading
• With m = n/k, set up an array Ĉ[1..m, 1..k] corresponding to array L; that is, divide L into m sub-arrays of k elements each.
• Use m processors, one per sub-array; each performs the optimal serial algorithm to count the occurrences of each j ∈ 1:k in its sub-array.
• The number of steps executed is Θ(k) = Θ(log2n).
• The cost of execution is Θ(mk) = Θ(n).
• R[j] is determined by summing each column Ĉ[1:m, j]:
• Using the balanced-tree algorithm: cost Θ(m) per column.
• The total cost is Θ(mk) = Θ(n) ⇒ cost optimal.
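A sequential simulation of the two phases (hypothetical function name; each block count and each pairwise combination would be one processor's work on a PRAM):

```python
def cascaded_count(L, k):
    """Accelerated-cascading count (sketch): split L into blocks of k
    elements, count each block serially, then combine the per-block
    count vectors pairwise as in a balanced tree."""
    blocks = [L[i:i + k] for i in range(0, len(L), k)]
    # phase 1: optimal serial count inside each block
    counts = []
    for b in blocks:
        R = [0] * (k + 1)
        for x in b:
            R[x] += 1
        counts.append(R)
    # phase 2: balanced-tree summation of the count vectors
    while len(counts) > 1:
        paired = [[a + b for a, b in zip(counts[i], counts[i + 1])]
                  for i in range(0, len(counts) - 1, 2)]
        if len(counts) % 2:
            paired.append(counts[-1])
        counts = paired
    return counts[0][1:]          # occurrences of 1..k
```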
Using Accelerated Cascading
(figures omitted: step-by-step illustration of the two-phase counting)

Optimal algorithm
(figure omitted)
Example (2): find Max
• Determine Xi = max { X1, X2, ..., Xn }: Xi ≥ Xj ∀ j ∈ 1:n.
• Algorithm on PRAM EREW: O(log2n) steps at a cost of O(n), using O(n) processors and the balanced-tree model.
• Consider the following algorithm on PRAM CRCW:
Find Max with PRAM CRCW
(figure omitted: the algorithm's listing)
• The algorithm above has 2 parts:
• Part 1 can be done in parallel using n² processors in O(1) steps ⇒ cost O(n²).
• Part 2 can be done with PRAM CRCW: determining each value M[i] needs O(1) steps at cost O(n) ⇒ total cost O(n).
• With PRAM CRCW the problem is thus quickly resolved in O(1) steps with O(n²) processors, hence the total cost is O(n²).
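Part 1's all-pairs comparison and Part 2's concurrent writes can be simulated sequentially; in this sketch the doubly nested loop stands in for the n² processors acting in a single O(1) step, and `crcw_find_max` is a hypothetical name.

```python
def crcw_find_max(X):
    """Simulation of the O(1)-time CRCW PRAM max: every pair is compared,
    and M[i] stays 1 only if X[i] loses no comparison."""
    n = len(X)
    M = [1] * n
    for i in range(n):          # all n^2 comparisons are one parallel step
        for j in range(n):
            if X[i] < X[j]:
                M[i] = 0        # concurrent writes of the same value: CRCW-safe
    return next(X[i] for i in range(n) if M[i])
```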
Find Max: Accelerated Cascading
• Let (W(n), T(n)) be the number of operations and the time of an algorithm.
• For the Find-Max problem:
• (1) Balanced tree with EREW: (O(n), O(log2n)): optimal but slow.
• (2) CRCW with n² processors: (O(n²), O(1)): not optimal but fast.
• (3) Build a new algorithm on the DLDT tree with (O(n·log2log2n), O(log2log2n)).
• Applying the accelerated cascading technique to (1) and (3) ⇒ new algorithm: (O(n), O(log2log2n)).
Tree DLDT
• DLDT = Doubly Logarithmic Depth Tree.
• This is a recursive tree.
• DLDT(n) is a tree with n leaves (n = 2^(2^k)).
• With k = 0, n = 2^(2^0) = 2: the tree has 1 root and 2 leaves.
• With k > 0, the tree is constructed recursively as follows:
• The root has √n = 2^(2^(k−1)) sub-trees.
• Each sub-tree has √n leaves: DLDT(√n).
• Comment: the number of leaves of a sub-tree rooted at level i equals the square of the number of leaves of a sub-tree rooted at level i+1.
Tree DLDT
(figure omitted: DLDT with n = 16 leaves)
• The degree of a node u is the number of child nodes of u.
• Since n = 2^(2^k), the root has degree 2^(2^(k−1)) = √n.
• A node at level i has degree 2^(2^(k−i−1)), with 0 ≤ i < k.
• A node at level k−1 has 2 child nodes.
• A node at level k has 2 leaves.

Tree DLDT
• The depth of the tree is k + 1 = log2log2n + 1, where n is the number of leaves of the DLDT tree.
Find-Max: DLDT Tree
• Comments:
• At level 0 (the root) we have n processors, used to determine the maximum of the results returned from the √n child nodes at level 1. The CRCW PRAM algorithm can be applied with (O(n), O(1)).
• At level 1, divide the n processors into m = √n groups, one per node; each determines the maximum of the √m results from its child nodes at level 2. The CRCW PRAM algorithm can be applied to each node with (O(m), O(1)).
• At level i, each node corresponds to a group of 2^(2^(k−i)) processors and is the parent of 2^(2^(k−i−1)) child nodes. The CRCW PRAM algorithm can be applied to each node with (O(2^(2^(k−i))), O(1)).
Find-Max: DLDT Tree
• Algorithm's idea:
• Done in k repeating steps (from level k−1 = log2log2n − 1 up to level 0, the root).
• The algorithm executes from the bottom up.
• At the i-th iteration step:
• Perform in parallel with n processors divided into 2^(2^k − 2^(k−i)) groups, each group containing 2^(2^(k−i)) processors (each child node is assigned one processor, so the number of processor groups equals the number of nodes being processed).
• Using the CRCW algorithm for each node, the largest value among the child nodes is stored at the parent node.
Find-Max: DLDT tree
• Performance evaluation:
• Time: O(k) = O(log2log2n) repeating steps.
• Cost evaluation:
• Each node at the i-th iteration step performs O(2^(2^(k−i))) = O(n^(2^−i)) operations.
• At the i-th step there are 2^(2^k − 2^(k−i)) = n^(1 − 2^−i) nodes ⇒ the total cost of the i-th iteration step is Wi(n) = O(n^(1 − 2^−i) × n^(2^−i)) = O(n).
• The cost of the entire algorithm is: W(n) = k·Wi(n) = O(n·k) = O(n·log2log2n).
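A sequential sketch of the bottom-up schedule: the group size squares at every round, so the number of rounds grows doubly logarithmically. `dldt_find_max` is a hypothetical name, and this simulation mimics only the schedule, not the CRCW processor counts.

```python
def dldt_find_max(xs):
    """Take maxima over groups whose size squares each round; returns
    the maximum and the number of rounds (O(log log n))."""
    vals, group, rounds = list(xs), 2, 0
    while len(vals) > 1:
        vals = [max(vals[i:i + group]) for i in range(0, len(vals), group)]
        group, rounds = group * group, rounds + 1   # node degree squares per level
    return vals[0], rounds
```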
Using Accelerated Cascading
• Step 1: use the balanced-tree technique for ⌈log2log2log2n⌉ levels from the bottom.
Using Accelerated Cascading
• Step 2: perform the algorithm on the DLDT tree with the number of nodes m = n/log2log2n.
Using Accelerated Cascading
• Step 1: using the balanced tree:
• After each step of the balanced tree, the number of nodes decreases by half (done in log2log2log2n serial steps ⇒ T(n) = log2log2log2n).
• With k = log2log2log2n, after this step the remaining number of nodes is m = n/2^k = n/log2log2n.
• Cost of step 1: W(n) = O(n).
• Step 2: using the DLDT algorithm on the m remaining nodes:
• T(n) = O(log2log2m) = O(log2log2n).
• W(n) = O(m × log2log2m) = O(m × log2log2n) = O(n).
• Conclusion: new algorithm (O(n), O(log2log2n)).
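The two steps combine as follows; a sketch with a hypothetical name, where the first loop plays the role of the balanced-tree rounds and the second the DLDT rounds.

```python
import math

def fast_optimal_max(xs):
    """Accelerated cascading for find-max: shrink to ~n/(log2 log2 n)
    values with pairwise-max rounds, then finish with the squaring
    DLDT-style schedule."""
    vals = list(xs)
    n = max(4, len(vals))
    # Step 1: ceil(log2 log2 log2 n) balanced-tree (pairwise max) rounds
    r = math.ceil(math.log2(max(1.0, math.log2(max(2.0, math.log2(n))))))
    for _ in range(max(1, r)):
        vals = [max(vals[i:i + 2]) for i in range(0, len(vals), 2)]
    # Step 2: DLDT-style rounds, group size squaring each time
    group = 2
    while len(vals) > 1:
        vals = [max(vals[i:i + group]) for i in range(0, len(vals), group)]
        group *= group
    return vals[0]
```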
8.3 Pipeline
Pipeline Technique
• Widely used to accelerate the execution of problems in which:
• Each main problem can be divided into several sub-problems.
• These sub-problems depend on each other in an order of execution.
• At each moment, the processors execute, in parallel, one algorithm step on the sub-problems of different main problems (keeping the execution order intact).
Pipeline Mechanism
• Consider n problems t1, t2, …, tn to be done.
• Each ti can be divided into a sequence of sub-problems {ti,1, ti,2, …, ti,m} such that ti,k must finish before ti,k+1 starts.
• Assume that at each step j = 1..m, the algorithm step Aj is applied to the sub-problems t1,j, t2,j, …, tn,j.
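The mechanism can be simulated directly: at global step s, stage Aj works on task t(s−j), so m stages finish n tasks in n + m − 1 steps instead of the n·m steps of serial execution (`pipeline` is a hypothetical name):

```python
def pipeline(tasks, stages):
    """Run each task through all stages on a pipeline schedule; returns
    the finished tasks and the number of global steps used."""
    n, m = len(tasks), len(stages)
    total_steps = n + m - 1
    for step in range(total_steps):
        for j in range(m):          # all stages act concurrently
            i = step - j            # index of the task currently at stage j
            if 0 <= i < n:
                tasks[i] = stages[j](tasks[i])
    return tasks, total_steps
```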
Pipeline Mechanism
(figure omitted: pipeline timing diagram)
Example: matrix by vector
(figure omitted)
Example: multiply the matrix by the vector
• Illustration with a 4×4 matrix and a 4×1 vector.
• Step 1:
• P0 receives (X0 and A00);
• calculates the product of X0 and A00 and saves it to Y0;
• pushes X0 down to P1 and moves X1 to the top of the array.
• Step 2:
• P0 receives (X1 and A01); P1 receives (X0 and A10);
• calculate the corresponding products and accumulate them into Y0 and Y1;
• push X0, X1 down, move X2 to the top of the array, … repeat until step 8.
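The schedule above amounts to: at step s, processor Pi multiplies A[i][s−i] by X[s−i] and accumulates into Y[i]. A sequential simulation (hypothetical name):

```python
def pipelined_matvec(A, X):
    """Systolic sketch of the slides' scheme for Y = A * X."""
    n = len(A)
    Y = [0] * n
    for step in range(2 * n - 1):
        for i in range(n):          # processors act concurrently
            j = step - i            # X[j] has been pushed down to P_i
            if 0 <= j < n:
                Y[i] += A[i][j] * X[j]
    return Y
```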
Parallel Insertion Sort
• Algorithm idea:
• The values to be sorted enter the array of processors one by one.
• At each processor:
• read the value just received;
• compare it with the currently held value;
• keep one value and pass the other to the next processor.
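One way to read the three bullet points, simulated sequentially: each list cell stands for a processor that keeps the smaller of the two values it sees and forwards the larger (`parallel_insertion_sort` is a hypothetical name).

```python
def parallel_insertion_sort(values):
    """Systolic insertion sort sketch: values enter one per step and
    ripple through the processor array."""
    cells = []                              # cells[i] = value held by processor i
    for v in values:
        for i in range(len(cells)):
            if v < cells[i]:
                cells[i], v = v, cells[i]   # keep the smaller, pass the larger on
        cells.append(v)                     # value settles in a new processor
    return cells
```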
Thank you for your attention!
