
Unit 3

Parallel Computing (CS 14.403)

Performance and Scalability of Parallel Systems

By
Niranjan Lal
Analytical Modeling of Parallel
Programs
• A sequential algorithm is usually evaluated in terms of its
execution time, expressed as a function of the size of its
input.
• The execution time of a parallel algorithm depends not only
on input size but also on the number of processing elements
used, and their relative computation and inter-process
communication speeds.
• Hence, a parallel algorithm cannot be evaluated in isolation
from a parallel architecture without some loss in accuracy.
• A parallel system is the combination of an algorithm and
the parallel architecture on which it is implemented.
In this chapter, we study various metrics for evaluating the
performance of parallel systems.
Performance

• Why do we care about performance evaluation?
  – Purchasing perspective: given a collection of machines, which has the
    • best performance?
    • least cost?
    • best performance / cost?
  – Design perspective: faced with design options, which has the
    • best performance improvement?
    • least cost?
    • best performance / cost?
How to measure, report, and summarize performance?
• Performance metric
• Benchmark
Which of these airplanes has the best performance?

• What metric defines performance?
  – Capacity, cruising range, or speed?
• Speed itself can be defined in two ways:
  – Taking one passenger from one point to another in the least time
  – Transporting 450 passengers from one point to another in the least time
Performance Metrics for Parallel Systems

• Note that an algorithm may have different performance on
  different parallel architectures.
• For example, an algorithm may perform differently on a
  linear array of processors and on a hypercube of processors.
• It is important to study the performance of parallel
programs with a view to determining the best
algorithm, evaluating hardware platforms, and
examining the benefits from parallelism.
• A number of metrics have been used based on the
desired outcome of performance analysis.
Performance Metrics for Parallel Systems

1. Execution Time
• The serial runtime of a program is the time
elapsed between the beginning and the end of its
execution on a sequential computer.
• The parallel runtime is the time that elapses from
the moment a parallel computation starts to the
moment the last processing element finishes
execution.
We denote the serial runtime by TS and the parallel
runtime by TP.

An Interesting Question

If two machines have the same instruction set architecture (ISA),
which of our quantities (e.g., clock rate, CPI, execution time,
# of instructions, MIPS) will always be identical?
Performance Metrics for Parallel Systems
2. Total Parallel Overhead
• The overheads incurred by a parallel program are encapsulated into a
  single expression referred to as the overhead function.
• We define the overhead function, or total overhead, of a parallel system as
  the total time collectively spent by all the processing elements over and
  above that required by the fastest known sequential algorithm for solving
  the same problem on a single processing element.
• We denote the overhead function of a parallel system by the symbol To.
• The total time spent in solving a problem, summed over all processing
  elements, is pTP. TS units of this time are spent performing useful work,
  and the remainder is overhead.
• Therefore, the overhead function To is given by

  To = pTP - TS
Performance Metrics for Parallel Systems
3. Speedup
• When evaluating a parallel system, we are often interested in knowing
  how much performance gain is achieved by parallelizing a given
  application over a sequential implementation.
• Speedup is a measure that captures the relative benefit of solving a
  problem in parallel.
• It is defined as the ratio of the time taken to solve a problem on a single
  processing element to the time required to solve the same problem on a
  parallel computer with p identical processing elements (p = number of cores).
• We denote speedup by the symbol S:

  S = TS / TP
3. Speedup (Example-1 Adding n numbers using n processing
elements)
• Consider the problem of adding n numbers by using n processing
  elements. Initially, each processing element is assigned one of the
  numbers to be added and, at the end of the computation, one of the
  processing elements stores the sum of all the numbers.
• Assuming that n is a power of two, we can perform this operation in log n
  steps by propagating partial sums up a logical binary tree of processing
  elements.
• The next figure illustrates the procedure for n = 16. The processing
  elements are labeled from 0 to 15.
• Similarly, the 16 numbers to be added are labeled from 0 to 15. The sum
  of the numbers with consecutive labels from i to j is denoted by Σ(i..j).

[Figure: Computing the global sum of 16 partial sums using 16 processing
elements. Σ(i..j) denotes the sum of the numbers with consecutive labels
from i to j.]
Example-1 Adding n numbers using n
processing elements cont…
• Each step shown in the previous figure consists of one addition and the
  communication of a single word. The addition can be performed in
  some constant time, say tc, and the communication of a single word
  can be performed in time ts + tw.
• Therefore, each of the log n steps takes a constant amount of time. Thus,

  TP = Θ(log n)

• Since the problem can be solved in Θ(n) time on a single processing
  element, its speedup is

  S = Θ(n / log n)
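A minimal Python sketch of this analysis (the helper tree_sum_steps and the constants tc, ts, tw are illustrative assumptions, not part of the slides): it simulates the pairwise tree reduction for n = 16, confirms that log2 n = 4 parallel steps suffice, and then evaluates the constant-time-per-step model.

```python
import math

def tree_sum_steps(values):
    """Simulate the pairwise (binary-tree) reduction of Example 1 and
    count how many parallel steps it takes."""
    steps = 0
    while len(values) > 1:
        # In one parallel step every odd-indexed element is sent to its
        # even-indexed neighbour and added there.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_sum_steps(list(range(16)))
print(total, steps)                  # 120 4  -> log2(16) parallel steps

# Constant-time-per-step model with illustrative constants (assumptions):
tc, ts, tw = 1.0, 2.0, 0.5           # add time, message startup, per-word transfer
n = 1 << 20
Tp = (tc + ts + tw) * math.log2(n)   # Theta(log n)
Ts = tc * n                          # Theta(n) on one processing element
print(Ts / Tp)                       # speedup, Theta(n / log n)
```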

• For a given problem, more than one sequential algorithm may be
  available, but all of these may not be equally suitable for
  parallelization. When a serial computer is used, it is natural to use the
  sequential algorithm that solves the problem in the least amount of
  time.
Example-1 Adding n numbers using n
processing elements cont…
• Given a parallel algorithm, it is fair to judge its performance with respect to
the fastest sequential algorithm for solving the same problem on a single
processing element. Sometimes, the asymptotically fastest sequential
algorithm to solve a problem is not known, or its runtime has a large constant
that makes it impractical to implement.
• In such cases, we take the fastest known algorithm that would be a practical
choice for a serial computer to be the best sequential algorithm.
• We compare the performance of a parallel algorithm to solve a problem with
that of the best sequential algorithm to solve the same problem. We formally
define the speedup S as the ratio of the serial runtime of the best sequential
algorithm for solving a problem to the time taken by the parallel algorithm to
solve the same problem on p processing elements.
• The p processing elements used by the parallel algorithm are assumed to be
identical to the one used by the sequential algorithm.
Example-2 Computing speedups of parallel
programs
• Consider the example of parallelizing bubble sort.
• Assume that a serial version of bubble sort of 10^5 records takes 150
  seconds and a serial quicksort can sort the same list in 30 seconds.
• If a parallel version of bubble sort, also called odd-even sort, takes 40
  seconds on four processing elements, it would appear that the parallel
  odd-even sort algorithm results in a speedup of 150/40, or 3.75.
• However, this conclusion is misleading: in reality the parallel
  algorithm achieves a speedup of only 30/40, or 0.75, with respect to the
  best serial algorithm.
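The two speedup figures can be checked with a few lines of Python (the variable names are ours; the runtimes are those quoted in Example 2):

```python
# Runtimes from Example 2 (seconds)
t_serial_bubble    = 150.0   # serial bubble sort of 1e5 records
t_serial_quick     = 30.0    # best known serial algorithm (quicksort)
t_parallel_oddeven = 40.0    # odd-even sort on 4 processing elements

naive_speedup = t_serial_bubble / t_parallel_oddeven   # 3.75 (misleading)
true_speedup  = t_serial_quick  / t_parallel_oddeven   # 0.75 (vs. best serial)
print(naive_speedup, true_speedup)
```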
Performance Metrics for Parallel Systems

4. Efficiency
• Only an ideal parallel system containing p processing elements
  (p = number of cores) can deliver a speedup equal to p.
• In practice, ideal behavior is not achieved because, while
  executing a parallel algorithm, the processing elements
  cannot devote 100% of their time to the computations of the
  algorithm.
• As we saw in Example 1, part of the time required by the
  processing elements to compute the sum of n numbers is
  spent idling (and communicating in real systems).
Performance Metrics for Parallel Systems

• Efficiency is a measure of the fraction of time for which a
  processing element is usefully employed; it is defined as
  the ratio of speedup to the number of processing elements.
• In an ideal parallel system, speedup is equal to p and
  efficiency is equal to one. In practice, speedup is less than
  p and efficiency is between zero and one, depending on the
  effectiveness with which the processing elements are
  utilized.
• We denote efficiency by the symbol E. Mathematically, it
  is given by

  E = S / p

Example 3: Efficiency of adding n numbers on n processing elements

• From the speedup equation and the preceding definition, the
  efficiency of the algorithm for adding n numbers on n
  processing elements is

  E = Θ(n / log n) / n = Θ(1 / log n)

• We also illustrate the process of deriving the parallel
  runtime, speedup, and efficiency while preserving the various
  constants associated with the parallel platform.
Example 5 . Edge detection on images

• Given an n x n pixel image, the problem of detecting edges
  corresponds to applying a 3 x 3 template to each pixel.
• The process of applying the template corresponds to
  multiplying pixel values with the corresponding template
  values and summing across the template (a convolution
  operation).
• This process is illustrated in Figure (a) along with typical
  templates (Figure (b)). Since we have nine multiply-add
  operations for each pixel, if each multiply-add takes time
  tc, the entire operation takes time 9 tc n^2 on a serial
  computer.
Example 5. Edge detection on images
• A simple parallel algorithm for this problem partitions the image
equally across the processing elements and each processing element
applies the template to its own subimage.
• Note that for applying the template to the boundary pixels, a
processing element must get data that is assigned to the adjoining
processing element.
• Specifically, if a processing element is assigned a vertically sliced
subimage of dimension n x (n/p), it must access a single layer of n
pixels from the processing element to the left and a single layer of n
pixels from the processing element to the right (note that one of these
accesses is redundant for the two processing elements assigned the
subimages at the extremities).
Example 5 . Edge detection on images
• This is illustrated in Figure 5.4(c). On a message passing machine, the
algorithm executes in two steps:
• (i) exchange a layer of n pixels with each of the two adjoining
processing elements; and
• (ii) apply template on local subimage.
• The first step involves two n-word messages (assuming each pixel
  takes one word to communicate its RGB data). This takes time 2(ts + tw n).
• The second step takes time 9 tc n^2 / p.
• The total time for the algorithm is therefore given by:

  TP = 9 tc n^2 / p + 2(ts + tw n)
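A small sketch of this timing model; the formulas come from the example, while the numeric values chosen for tc, ts, tw, n and p below are illustrative assumptions:

```python
# Timing model for the parallel edge-detection example.
tc, ts, tw = 1e-9, 1e-6, 1e-8   # multiply-add time, message startup, per-word transfer (s)

def edge_detect_times(n, p):
    """Serial and parallel runtimes of the n x n edge-detection example."""
    t_serial   = 9 * tc * n**2
    t_parallel = 9 * tc * n**2 / p + 2 * (ts + tw * n)
    return t_serial, t_parallel

n, p = 4096, 16
Ts, Tp = edge_detect_times(n, p)
S = Ts / Tp        # speedup
E = S / p          # efficiency
print(f"S = {S:.2f}, E = {E:.3f}")
```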
Performance Metrics for Parallel Systems

5. Cost
• We define the cost of solving a problem on a parallel system
  as the product of parallel runtime and the number of
  processing elements used:  Cost = p TP.
• Cost reflects the sum of the time that each processing
  element spends solving the problem.
• Efficiency can also be expressed as the ratio of the execution
  time of the fastest known sequential algorithm for solving a
  problem to the cost of solving the same problem on p
  processing elements:  E = TS / (p TP).
Performance Metrics for Parallel Systems
• The cost of solving a problem on a single processing element is the
  execution time of the fastest known sequential algorithm.
• A parallel system is said to be cost-optimal if the cost of solving a
  problem on a parallel computer has the same asymptotic growth (in Θ
  terms), as a function of the input size, as the fastest known sequential
  algorithm on a single processing element.
• Since efficiency is the ratio of sequential cost to parallel cost, a cost-
  optimal parallel system has an efficiency of Θ(1). Cost is sometimes
  referred to as work or processor-time product, and a cost-optimal
  system is also known as a pTP-optimal system.
Example 6 Cost of adding n numbers on n
processing elements
• The algorithm given in Example 1 for adding n
  numbers on n processing elements has a
  processor-time product of Θ(n log n).
• Since the serial runtime of this operation is Θ(n),
  the algorithm is not cost-optimal.
• Cost-optimality is a very important practical
  concept, although it is defined in terms of
  asymptotics.
• We illustrate this using the following example.
Example 7 Performance of non-cost
optimal algorithms
• Consider a sorting algorithm that uses n processing elements to sort
  a list of n numbers in time (log n)^2. Since the serial runtime of a
  (comparison-based) sort is n log n, the speedup and efficiency of this
  algorithm are n/log n and 1/log n, respectively.
• The pTP product of this algorithm is n(log n)^2. Therefore, this
  algorithm is not cost-optimal, but only by a factor of log n. Let us
  consider a realistic scenario in which the number of processing
  elements p is much less than n.
• An assignment of these n tasks to p < n processing elements gives us a
  parallel time less than n(log n)^2 / p.
Example 7 Performance of non-cost
optimal algorithms
• This follows from the fact that if n processing elements take time
  (log n)^2, then one processing element would take time n(log n)^2, and p
  processing elements would take time n(log n)^2 / p.
• The corresponding speedup of this formulation is p/log n. Consider the
  problem of sorting 1024 numbers (n = 1024, log n = 10) on 32
  processing elements.
• The speedup expected is only p/log n, or 3.2. This number gets worse
  as n increases. For n = 10^6, log n = 20, and the speedup is only 1.6.
  Clearly, there is a significant cost associated with not being cost-optimal
  even by a small factor. This emphasizes the practical importance of
  cost-optimality.
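These numbers are easy to reproduce; a quick Python check (the function name is ours), assuming a comparison-based serial sort taking n log n time:

```python
import math

def simulated_sort_speedup(n, p):
    """Speedup of the non-cost-optimal (log n)^2 sort when its n tasks
    are folded onto p < n processing elements (Example 7)."""
    t_serial   = n * math.log2(n)              # comparison-based serial sort
    t_parallel = n * (math.log2(n) ** 2) / p   # p PEs simulating n PEs
    return t_serial / t_parallel               # = p / log2(n)

print(simulated_sort_speedup(1024, 32))    # 3.2
print(simulated_sort_speedup(10**6, 32))   # ~1.6, since log2(1e6) ~ 20
```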
The Effect of Granularity on Performance

• In parallel computing, the granularity (or grain size)
  of a task is a measure of the amount of work (or
  computation) performed by that task.
• Granularity can also be defined as the ratio of computation
  time to communication time, where computation time is the
  time required to perform the computation of a task and
  communication time is the time required to exchange data
  between processors.
Effects of Granularity on Cost-Optimality

• Assume an algorithm designed for n (virtual) processing elements.
  – If only p physical PEs are available (p < n), then each physical PE
    simulates n/p virtual PEs.
  – The computation at each physical PE therefore increases by a
    factor of n/p.
• Note: even with p < n, this simulation does not necessarily yield a
  cost-optimal algorithm.
Example 5.9 Adding n numbers on p
processing elements
• Consider the problem of adding n numbers on p processing
elements such that p < n and both n and p are powers of 2.
We use the same algorithm as in Example 5.1 and simulate
n processing elements on p processing elements.
• The steps leading to the solution are shown in Next Figure
5.5 for n = 16 and p = 4. Virtual processing element i is
simulated by the physical processing element labeled i
mod p; the numbers to be added are distributed similarly.
• The first log p of the log n steps of the original algorithm
are simulated in (n/p) log p steps on p processing elements.
Example 5.9 Adding n numbers on p
processing elements
• In the remaining steps, no communication is required because the
processing elements that communicate in the original algorithm are
simulated by the same processing element; hence, the remaining
numbers are added locally.
• The algorithm takes Θ((n/p) log p) time in the steps that require
  communication, after which a single processing element is left with
  n/p numbers to add, taking time Θ(n/p).
• Thus, the overall parallel execution time of this parallel system is
  Θ((n/p) log p). Consequently, its cost is Θ(n log p), which is
  asymptotically higher than the Θ(n) cost of adding n numbers
  sequentially.
• Therefore, the parallel system is not cost-optimal.
Example 5.9 Adding n numbers on p
processing elements
• Figure 5.5. Four processing elements simulating 16 processing
  elements to compute the sum of 16 numbers. Σ(i..j) denotes the sum of
  the numbers with consecutive labels from i to j. The first two steps are
  shown in panels (a) and (b); the last three steps in panels (c) through (e).
[Figure 5.5 panels:
 (a) Four processors simulating the first communication step of 16 processors
     (substeps 1-4)
 (b) Four processors simulating the second communication step of 16 processors
     (substeps 1-4)
 (c) Simulation of the third step in two substeps
 (d) Simulation of the fourth step
 (e) Final result]


• Example 5.1 showed that n numbers can be added on an n-processor machine
  in time Θ(log n). When using p processing elements to simulate n virtual
  processing elements (p < n), the expected parallel runtime is Θ((n/p) log n).
• However, in Example 5.9 this task was performed in time Θ((n/p) log p)
  instead. The reason is that not every communication step of the original
  algorithm has to be simulated; at times, communication takes place between
  virtual processing elements that are simulated by the same physical processing
  element.
• For these operations, there is no associated overhead. For example, the
  simulation of the third and fourth steps (Figure 5.5(c) and (d)) did not require
  any communication.
• However, this reduction in communication was not enough to make the
  algorithm cost-optimal. Example 5.10 illustrates that the same problem
  (adding n numbers on p processing elements) can be performed cost-optimally
  with a smarter assignment of data to processing elements.
Example 5.10 Adding n numbers cost-
optimally
• An alternate method for adding n numbers using p
processing elements is illustrated in Next Figure 5.6 for n
= 16 and p = 4.
Example 5.10 Adding n numbers cost-
optimally
• In the first step of this algorithm, each processing element
  locally adds its n/p numbers in time Θ(n/p).
• Now the problem is reduced to adding the p partial sums
  on p processing elements, which can be done in time Θ(log p)
  by the method described in Example 5.1. The parallel
  runtime of this algorithm is Θ(n/p + log p), and its cost
  is Θ(n + p log p). As long as n = Ω(p log p), the cost is
  Θ(n), which is the same as the serial runtime. Hence, this
  parallel system is cost-optimal.
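A rough numeric comparison of the two formulations, under the same unit-time assumptions used elsewhere in this unit (the helper functions and the chosen n and p values are illustrative, not from the slides): Example 5.9's cost grows like n log p, while Example 5.10's stays Θ(n) as long as n = Ω(p log p).

```python
import math

def cost_naive_simulation(n, p):
    """Cost (p * Tp) of Example 5.9: p PEs simulate n virtual PEs."""
    tp = (n / p) * math.log2(p) + n / p   # (n/p) log p communication steps + local adds
    return p * tp

def cost_local_first(n, p):
    """Cost of Example 5.10: add n/p numbers locally, then tree-sum p partial sums."""
    tp = n / p + 2 * math.log2(p)         # local additions + log p combine steps
    return p * tp

n = 1 << 20
for p in (4, 16, 64, 256):
    print(p, round(cost_naive_simulation(n, p)), round(cost_local_first(n, p)))
# The first cost grows like n log p; the second stays close to n (cost-optimal).
```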
Scalability of Parallel Systems

• Very often, programs are designed and tested for smaller
  problems on fewer processing elements. However, the real
  problems these programs are intended to solve are much
  larger, and the machines contain a larger number of
  processing elements. Whereas code development is
  simplified by using scaled-down versions of the machine
  and the problem, the performance and correctness of the
  programs are much more difficult to establish based on
  scaled-down systems.
• In this section, we will investigate techniques for
evaluating the scalability of parallel programs using
analytical tools.
5.4.1 Scaling Characteristics of Parallel
Programs
• The efficiency of a parallel program can be written as:

  E = S / p = TS / (p TP)
Example 5.11 Why is performance
extrapolation so difficult?
• Consider three parallel algorithms for computing an n-
point Fast Fourier Transform (FFT) on 64 processing
elements. Next Figure 5.7 illustrates speedup as the value
of n is increased to 18 K.
• Keeping the number of processing elements constant, at
smaller values of n, one would infer from observed
speedups that binary exchange and 3-D transpose
algorithms are the best.
• However, as the problem is scaled up to 18 K points or
more, it is evident from Next Figure 5.7 that the 2-D
transpose algorithm yields best speedup.
Example 5.11 Why is performance extrapolation so difficult?

[Figure 5.7: Speedup of the binary exchange, 2-D transpose, and 3-D transpose
FFT algorithms on 64 processing elements as a function of problem size n.]
Example 5.12 Speedup and efficiency as functions
of the number of processing elements
• Consider the problem of adding n numbers on p processing elements.
We use the same algorithm as in Example 5.10(Example 5.10 Adding
n numbers cost-optimally).
• However, to illustrate actual speedups, we work with constants here
instead of asymptotics.
• Assuming unit time for adding two numbers, the first phase (local
summations) of the algorithm takes roughly n/p time.
• The second phase involves log p steps with a communication and an
addition at each step.
Example 5.12 Speedup and efficiency as functions
of the number of processing elements

• If a single communication takes unit time as well, the time
  for this phase is 2 log p. Therefore, we can derive the parallel
  time, speedup, and efficiency as:

  TP = n/p + 2 log p
  S  = n / (n/p + 2 log p)
  E  = 1 / (1 + (2 p log p) / n)
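The following sketch evaluates these three expressions for a fixed problem size and a growing number of processing elements (the specific n and p values are ours); the efficiencies it prints for n = 64 match the table shown later in this unit.

```python
import math

def add_n_on_p(n, p):
    """Parallel time, speedup and efficiency of Example 5.12
    (unit-time additions and communications)."""
    tp = n / p + 2 * math.log2(p)
    s  = n / tp
    e  = s / p
    return tp, s, e

for p in (1, 4, 8, 16, 32):
    tp, s, e = add_n_on_p(64, p)
    print(f"p={p:2d}  Tp={tp:6.1f}  S={s:5.2f}  E={e:.2f}")
# Efficiency drops as p grows for a fixed problem size n = 64.
```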
Amdahl's Law (1967)

• The achievable speedup is limited by the fraction of the program that
  must run serially. If α is the serial fraction of the work, then on p
  processors

  S = 1 / (α + (1 - α)/p)  ≤  1/α

• Even with an unlimited number of processors, the speedup can never
  exceed 1/α.
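A one-function sketch of the law (the example value α = 0.05 is an assumption chosen for illustration):

```python
def amdahl_speedup(alpha, p):
    """Amdahl's law: speedup on p processors when a fraction alpha
    of the work is inherently serial."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
# With alpha = 0.05 the speedup saturates below 1/alpha = 20, however large p is.
```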
Gustafson’s Law (1988)

• Also known as the Gustafson‐Barsis’s Law


• Any sufficiently large problem can be efficiently
  parallelized, with a scaled speedup

  S = p - α(p - 1)

  where p is the number of processors and α is the serial
  portion of the problem.
• Gustafson proposed a fixed-time concept, which leads to
  scaled speedup for larger problem sizes.
• Basically, we use larger systems with more processors to
  solve larger problems.
Gustafson’s Law (1988)
• The execution time of the program on a parallel computer is
  (a + b), where a is the sequential time and b is the parallel time.
• The total amount of work to be done in parallel varies
  linearly with the number of processors, so b is fixed as p
  is varied. The total serial run time is therefore (a + p*b).
• The speedup is (a + p*b)/(a + b).
• Define α = a/(a + b), the sequential fraction of the
  execution time; then

  S = α + p(1 - α) = p - α(p - 1)
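The corresponding scaled-speedup sketch, using the same illustrative α = 0.05 as in the Amdahl example above:

```python
def gustafson_speedup(alpha, p):
    """Gustafson-Barsis scaled speedup with serial fraction alpha."""
    return p - alpha * (p - 1)

for p in (4, 16, 64, 1024):
    print(p, round(gustafson_speedup(0.05, p), 1))
# Unlike Amdahl's fixed-size bound, the scaled speedup keeps growing ~ (1 - alpha) * p.
```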
Scalability (cont.)

• Increasing the number of processors -> decreases efficiency.
• Increasing the problem size -> increases efficiency.
• Can a parallel system keep its efficiency constant by increasing the
  number of processors and the problem size simultaneously?
  – Yes -> scalable parallel system
  – No  -> non-scalable parallel system
• A scalable parallel system can always be made cost-optimal
  by adjusting the number of processors and the problem size.
ISO-efficiency Metric of Scalability

• We summarize the discussion above with the following two observations:
1. For a given problem size, as we increase the
number of processing elements, the overall
efficiency of the parallel system goes down. This
phenomenon is common to all parallel systems
2. In many cases, the efficiency of a parallel system
increases if the problem size is increased while
keeping the number of processing elements
constant.
ISO-efficiency Metric of Scalability
• These two phenomena are illustrated in Figure 5.9(a) and (b),
respectively. Following from these two observations, we define a
scalable parallel system as one in which the efficiency can be kept
constant as the number of processing elements is increased, provided
that the problem size is also increased.
• It is useful to determine the rate at which the problem size must
increase with respect to the number of processing elements to keep the
efficiency fixed. For different parallel systems, the problem size must
increase at different rates in order to maintain a fixed efficiency as the
number of processing elements is increased. This rate determines the
degree of scalability of the parallel system.
• As we shall show, a lower rate is more desirable than a higher growth
rate in problem size. Let us now investigate metrics for quantitatively
determining the degree of scalability of a parallel system. However,
before we do that, we must define the notion of problem size precisely.
ISO-efficiency Metric of Scalability

• Degree of scalability: the rate at which the problem size must increase,
  as the number of processors changes, in order to maintain efficiency.
• Problem size (W): the number of basic computation steps in the best
  sequential algorithm that solves the problem on a single processor.
• Overhead function (To): the part of the parallel system cost
  (processor-time product) that is not incurred by the fastest
  known serial algorithm on a serial computer.
Iso-efficiency Function
• Parallel execution time can be expressed as a function of the
  problem size, the overhead function, and the number of
  processing elements. We can write the parallel runtime as:

  TP = (W + To(W, p)) / p

• The resulting expression for speedup is

  S = W / TP = W p / (W + To(W, p))

• Finally, we write the expression for efficiency as

  E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p)/W)     (Equation 5.12)
Iso-efficiency Function
• In the efficiency equation (5.12) above, if the problem size W is
  kept constant and p is increased, the efficiency decreases
  because the total overhead To increases with p.
• If W is increased while keeping the number of
  processing elements fixed, then for scalable parallel systems
  the efficiency increases.
• This is because To grows slower than Θ(W) for a fixed p.
  For these parallel systems, efficiency can be maintained at a
  desired value (between 0 and 1) for increasing p, provided W
  is also increased.
Iso-efficiency Function
• Solving the efficiency equation for W gives

  W = (E / (1 - E)) To(W, p)                                   (Equation 5.13)

• Let K = E/(1 - E) be a constant that depends on the efficiency to be
  maintained. Since To is a function of W and p, Equation 5.13 can be
  rewritten as

  W = K To(W, p)                                               (Equation 5.14)

• The isoefficiency function determines the growth rate of W
  required to keep the efficiency fixed as p increases.
• Highly scalable systems have a small isoefficiency function.
• From Equation 5.14, the problem size W can usually be
  obtained as a function of p by algebraic manipulation. This
  function dictates the growth rate of W required to keep the efficiency
  fixed as p increases. We call this function the isoefficiency function
  of the parallel system.
Example 5.14 Isoefficiency function of
adding numbers
• The overhead function for the problem of adding n
  numbers on p processing elements is
  approximately 2 p log p, as given by Equations 5.9
  and 5.1. Substituting To = 2 p log p in Equation
  5.14, we get

  W = 2 K p log p
Example 5.14 Isoefficiency function of
adding numbers
• Thus, the asymptotic isoefficiency function for
  this parallel system is Θ(p log p).
• This means that, if the number of processing
elements is increased from p to p', the problem
size (in this case, n) must be increased by a factor
of (p' log p')/(p log p) to get the same efficiency as
on p processing elements. In other words,
increasing the number of processing elements by a
factor of p'/p requires that n be increased by a
factor of (p' log p')/(p log p) to increase the
speedup by a factor of p'/p.
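A small sketch of this scaling rule (the helper names are illustrative): the constant K = E/(1 - E) cancels in the ratio, so the required growth factor depends only on p and p'.

```python
import math

def required_problem_size(p, K=1.0):
    """Isoefficiency relation for adding n numbers: W = 2 K p log p."""
    return 2 * K * p * math.log2(p)

def growth_factor(p_old, p_new):
    """Factor by which the problem size must grow when the number of
    PEs increases from p_old to p_new at fixed efficiency."""
    return required_problem_size(p_new) / required_problem_size(p_old)

print(growth_factor(4, 16))   # (16 * log2 16) / (4 * log2 4) = 8.0
```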
• Efficiency of adding n numbers on a p-processor hypercube
  – for the cost-optimal algorithm:

  S = n / (n/p + 2 log p) = n p / (n + 2 p log p)
  E = S / p = n / (n + 2 p log p) = E(n, p)

    n     p=1    p=4    p=8    p=16   p=32
    64    1.0    .80    .57    .33    .17
   192    1.0    .92    .80    .60    .38
   320    1.0    .95    .87    .71    .50
   512    1.0    .97    .91    .80    .62

• n = Θ(p log p) keeps E constant at 0.80:
  – for n = 64  and p = 4:  n = 8 p log p
  – for n = 192 and p = 8:  n = 8 p log p
  – for n = 512 and p = 16: n = 8 p log p
• Conclusion, for adding n numbers on p processors with the cost-optimal
  algorithm:
  – the algorithm is cost-optimal if n = Ω(p log p)
  – the algorithm is scalable if n increases in proportion to p log p as p
    is increased
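The table and the constant-efficiency diagonal above can be reproduced with a few lines of Python (the formatting choices are ours):

```python
import math

def efficiency(n, p):
    """E(n, p) = n / (n + 2 p log p) for the cost-optimal addition algorithm."""
    return n / (n + 2 * p * math.log2(p))

for n in (64, 192, 320, 512):
    row = [f"{efficiency(n, p):.2f}" for p in (1, 4, 8, 16, 32)]
    print(n, *row)
# Along the diagonal n = 8 p log p (n=64,p=4; n=192,p=8; n=512,p=16) E stays at 0.80.
```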
• Problem size
  – for matrix multiplication:
    • input n  => W = O(n^3);  n' = 2n  => W = O(n'^3) = O(8 n^3)
  – for matrix addition:
    • input n  => W = O(n^2);  n' = 2n  => W = O(n'^2) = O(4 n^2)