Parallel Computing - Unit III
By
Niranjan Lal
Analytical Modeling of Parallel Programs
• A sequential algorithm is usually evaluated in terms of its
execution time, expressed as a function of the size of its
input.
• The execution time of a parallel algorithm depends not only
on input size but also on the number of processing elements
used, and their relative computation and inter-process
communication speeds.
• Hence, a parallel algorithm cannot be evaluated in isolation
from a parallel architecture without some loss in accuracy.
• A parallel system is the combination of an algorithm and
the parallel architecture on which it is implemented.
In this chapter, we study various metrics for evaluating the
performance of parallel systems.
Performance
1. Execution Time
• The serial runtime of a program is the time
elapsed between the beginning and the end of its
execution on a sequential computer.
• The parallel runtime is the time that elapses from
the moment a parallel computation starts to the
moment the last processing element finishes
execution.
We denote the serial runtime by TS and the parallel
runtime by TP.
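As a concrete (and simplified) illustration of these two quantities, consider summing n numbers: a single processing element needs n − 1 additions, while n processing elements arranged in a binary reduction tree need about log₂ n combining steps. The sketch below counts steps under an assumed unit cost per step; it models step counts, not measured wall-clock times:

```python
import math

def serial_steps(n):
    """Additions performed by one processing element: n - 1."""
    return n - 1

def parallel_steps(n):
    """Combining steps for n PEs in a binary reduction tree: log2 n
    (n assumed to be a power of 2)."""
    return int(math.log2(n))

# With unit time per step, T_S = n - 1 and T_P = log2 n.
print(serial_steps(1024))    # 1023
print(parallel_steps(1024))  # 10
```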
Performance Metrics for Parallel Systems
4. Efficiency
•Only an ideal parallel system containing p(Number of Cores)
processing elements can deliver a speedup equal to p.
•In practice, ideal behavior is not achieved because while
executing a parallel algorithm, the processing elements
cannot devote 100% of their time to the computations of the
algorithm.
•As we saw in Example 1, part of the time required by the
processing elements to compute the sum of n numbers is
spent idling (and communicating in real systems).
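For the n-number sum, under a unit-step model (an assumption; Example 1's actual cost model also includes communication terms), T_S = n and T_P = log₂ n on p = n processing elements, so speedup is S = T_S/T_P ≈ n/log n and efficiency is E = S/p ≈ 1/log n:

```python
import math

def speedup(n):
    # S = T_S / T_P, with T_S = n and T_P = log2 n (unit step costs assumed)
    return n / math.log2(n)

def efficiency(n):
    # E = S / p, with p = n processing elements
    return speedup(n) / n

print(round(speedup(1024), 1))     # 102.4
print(round(efficiency(1024), 3))  # 0.1
```

Note how efficiency decays as 1/log n: adding more numbers with proportionally more processing elements drives efficiency toward zero, which is exactly the non-ideal behavior described above.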
5. Cost
•We define the cost of solving a problem on a parallel system
as the product of parallel runtime and the number of
processing elements used.
•Cost reflects the sum of the time that each processing
element spends solving the problem.
•Efficiency can also be expressed as the ratio of the execution
time of the fastest known sequential algorithm for solving a
problem to the cost of solving the same problem on p
processing elements.
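These two definitions can be checked against the n-number sum, again under the assumed unit-step model with the asymptotic T_S = Θ(n) taken as n:

```python
import math

def parallel_cost(p, t_p):
    """Cost (processor-time product): p * T_P."""
    return p * t_p

n = 1024
t_s = n              # T_S = Theta(n), unit step costs assumed
t_p = math.log2(n)   # T_P = Theta(log n) on p = n processing elements
cost = parallel_cost(n, t_p)

print(cost)          # 10240.0
print(t_s / cost)    # efficiency expressed as T_S / (p * T_P) -> 0.1
```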
• The cost of solving a problem on a single processing element is the
execution time of the fastest known sequential algorithm.
• A parallel system is said to be cost-optimal if the cost of solving a
problem on a parallel computer has the same asymptotic growth (in Θ
terms) as a function of the input size as the fastest-known sequential
algorithm on a single processing element.
• Since efficiency is the ratio of sequential cost to parallel cost, a cost-
optimal parallel system has an efficiency of Θ(1). Cost is sometimes
referred to as work or processor-time product, and a cost-optimal
system is also known as a pTP-optimal system.
Example 6 Cost of adding n numbers on n processing elements
• The algorithm given in Example 1 for adding n
numbers on n processing elements has a
processor-time product of Θ(n log n).
• Since the serial runtime of this operation is Θ(n),
the algorithm is not cost optimal.
• Cost optimality is a very important practical
concept although it is defined in terms of
asymptotics.
• We illustrate this using the following example.
Example 7 Performance of non-cost-optimal algorithms
• Consider a sorting algorithm that uses n processing elements to sort
a list of n elements in time (log n)². Since the serial runtime of a
(comparison-based) sort is n log n, the speedup and efficiency of this
algorithm are given by n/log n and 1/log n, respectively.
• The pTP product of this algorithm is n(log n)². Therefore, this
algorithm is not cost optimal, but only by a factor of log n. Let us
consider a realistic scenario in which the number of processing
elements p is much less than n.
• An assignment of these n tasks to p < n processing elements gives us a
parallel time less than n(log n)²/p.
Example 7 Performance of non-cost-optimal algorithms (continued)
• This follows from the fact that if n processing elements take time
(log n)², then one processing element would take time n(log n)²; and p
processing elements would take time n(log n)²/p.
• The corresponding speedup of this formulation is p/log n. Consider the
problem of sorting 1024 numbers (n = 1024, log n = 10) on 32
processing elements.
• The speedup expected is only p/log n or 3.2. This number gets worse
as n increases. For n = 10⁶, log n = 20 and the speedup is only 1.6.
Clearly, there is a significant cost associated with not being cost-
optimal even by a very small factor (note that a factor of log p is
smaller than even √p). This emphasizes the practical importance of
cost-optimality.
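The arithmetic in this example can be checked directly (log is taken base 2 here, as in the rest of the chapter):

```python
import math

def sort_speedup(n, p):
    # Speedup p / log n of the emulated (log n)^2 sorting algorithm
    # on p < n processing elements.
    return p / math.log2(n)

print(sort_speedup(1024, 32))   # 3.2  (n = 1024, log n = 10)
print(sort_speedup(2**20, 32))  # 1.6  (n ~ 10^6, log n = 20)
```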
The Effect of Granularity on Performance
• If p physical processing elements simulate n virtual
processing elements (p < n), then the computation at each
processing element increases by a factor of n/p.
[Figure: computing the sum of 16 numbers (labeled 0–15) on four
processing elements. Each communication step of the original 16-PE
algorithm is simulated in four substeps (Substeps 1–4), after which
the four partial sums are combined.]
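The scaled-down scheme the figure illustrates can be sketched as follows: each of p processing elements first sums its block of n/p values locally, and the p partial sums are then combined in a binary tree, for roughly n/p + log₂ p steps in all. This is a serial simulation of the idea (the helper name `blocked_sum` is ours, not from the text), not a real parallel implementation:

```python
def blocked_sum(values, p):
    """Sum len(values) numbers as p PEs would: local block sums,
    then a binary-tree combine. Assumes p divides len(values) and
    p is a power of 2. Returns (total, parallel step count)."""
    n = len(values)
    block = n // p
    # Each PE sums a contiguous block of n/p values locally.
    partials = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
    steps = block - 1  # local additions, done concurrently on all PEs
    # Tree reduction over the p partial sums: log2 p combining steps.
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1]
                    for i in range(0, len(partials), 2)]
        steps += 1
    return partials[0], steps

total, steps = blocked_sum(list(range(16)), 4)
print(total)  # 120
print(steps)  # 5 = 3 local additions + 2 combining steps
```

With p fixed and n growing, the step count n/p + log₂ p gives a cost of p·T_P = Θ(n + p log p), which matches the serial Θ(n) whenever n grows at least as fast as p log p; this is why the coarser-grained formulation can be cost-optimal while the n-PE version is not.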