TO PARALLELIZE OR NOT TO
PARALLELIZE, SPEED UP ISSUE
Alaa Ismail El-Nashar
Abstract
Running parallel applications requires special and expensive processing resources to obtain the required
results within a reasonable time. Before parallelizing a serial application, some analysis should be
carried out to decide whether the application will benefit from parallelization or not. In this paper we
discuss the issue of speed up gained from parallelization using the Message Passing Interface (MPI),
weighing the overhead and cost of parallelization against the parallel speed up gained. We also propose
an experimental method to predict the speed up of MPI applications.
Key words
Parallel programming, Message Passing Interface, Speed up
1. INTRODUCTION
Execution time reduction is one of the most challenging goals of parallel programming.
Theoretically, adding extra processors to a processing system leads to a smaller execution time
of a program compared with its execution time on a system with fewer processors or on a single
machine [9]. In practice, when a program is executed in parallel, the hypothesis that the parallel
program will run faster is not always satisfied. If the main goal of parallelizing a serial program
is to obtain a faster run, then the main criterion to be considered is the speed up gained from
parallelization.
Speed up is defined as the ratio of serial execution time to parallel execution time [2]; it expresses
how many times a parallel program runs faster than the serial version that solves the same problem.
Many conflicting parameters, such as parallel overhead, hardware architecture, programming
paradigm, and programming style, may negatively affect the execution time of a parallel program,
making it larger than that of the serial version, so that any parallelization gain is lost. In order to
obtain a faster parallel program, these conflicting parameters need to be well optimized.
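Written out in the notation used for the performance metrics later in the paper, where ψ(n, p) denotes the speed up of solving an n-size problem on p processors, Ts the serial execution time, and Tp the parallel execution time (the symbol Tp is introduced here only for illustration), the definition reads

ψ(n, p) = Ts / Tp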
Various parallel programming paradigms can be used to write parallel programs, such as
OpenMP [7], Parallel Virtual Machine (PVM) [21], and Message Passing Interface (MPI) [23].
MPI is the most commonly used paradigm for writing parallel programs since it can be
employed not only within a single processing node but also across several connected ones. MPI
enables the programmer to control both data distribution and process synchronization. MPICH2
[22] is an MPI implementation that works well on a wide range of hardware platforms and
also supports the C/C++ and Fortran programming languages.
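As a brief illustration of this programming model, the following minimal Fortran 90 sketch (not taken from the paper's test programs; the message value and printed text are illustrative) shows how each process learns its own identifier, how an explicit message is exchanged, and how the processes are explicitly synchronized:

    program mpi_hello
       use mpi
       implicit none
       integer :: ierr, rank, nprocs, token

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! identifier of this process
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! total number of processes started

       print *, 'hello from process', rank, 'of', nprocs

       ! A single explicit message: process 0 sends a value to the last process.
       if (nprocs > 1) then
          if (rank == 0) then
             token = 42
             call MPI_Send(token, 1, MPI_INTEGER, nprocs - 1, 0, MPI_COMM_WORLD, ierr)
          else if (rank == nprocs - 1) then
             call MPI_Recv(token, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
          end if
       end if

       call MPI_Barrier(MPI_COMM_WORLD, ierr)            ! explicit synchronization point
       call MPI_Finalize(ierr)
    end program mpi_hello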
In this paper we discuss some of the parameters that affect parallel program performance
as a parallelization gain issue, and we also propose an experimental method to predict the speed up
of MPI applications. We focus on parallel programs written with the MPI paradigm using the
MPICH2 implementation.
2. RELATED WORK
Reducing program execution time is one of the advantages that application programmers hope
to achieve. Converting sequential programs into parallel ones is a costly task; it requires special
hardware and software equipment. It is therefore preferable to anticipate, without real parallel
hardware, the speed up gained from parallelism before executing the application on a real
parallel environment.
Several systems have been developed for analyzing the performance of parallel programs.
These systems are either model based or trace based.
Petrini et al. [8] introduced a model-based system to predict the performance of programs on
machines prior to their construction, and to identify the causes of deviations of actual performance
from the predictions. These methods pick up the slight variations in a program's execution that arise
at run time and cannot be modeled by examining the static code.
Vampir [10] and Dimemas [18] are two trace-based analysis tools that predict parallel program
performance. These tools use a trace file, together with the user's selection of network parameters,
in a communication model that simulates the program execution.
The MPE (Multi-Processing Environment) library and Jumpshot [1], which are distributed with the
MPICH implementation [22], provide graphical performance analysis for message passing programs.
In this paper we introduce an experimental approach to predict the speed up of message passing
programs. Our approach is based on executing the parallel program several times on a single
physical processor with different numbers of virtual MPI processes.
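To make the idea concrete, the following Fortran 90 sketch shows one way such a timing run could be structured; the dummy summation loop merely stands in for the real application kernel and is not part of the paper's test programs:

    program predict_speedup
       use mpi
       implicit none
       integer :: ierr, rank, nprocs, i
       double precision :: t_start, t_end, s

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize all processes before timing
       t_start = MPI_Wtime()

       ! Stand-in for the application kernel; the real program would perform its
       ! share of the decomposed work here.
       s = 0.0d0
       do i = 1, 10000000
          s = s + 1.0d0 / dble(i)
       end do

       call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! wait for the slowest process
       t_end = MPI_Wtime()

       if (rank == 0) print *, nprocs, 'processes:', t_end - t_start, 'seconds'
       call MPI_Finalize(ierr)
    end program predict_speedup

Launching the same executable repeatedly, for example with mpiexec -n 1, mpiexec -n 2, and so on, all on one physical processor, and comparing the reported times indicates how strongly the virtual MPI processes interfere with one another; this is the data on which the prediction is based.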
The first challenge in writing MPI programs is how to divide the problem of interest into
smaller sub-problems. Problem decomposition has two types: data parallelism and task
parallelism.
The data partitioning challenge concerns the manner in which the data can be divided among
the available processors. The data are divided into pieces of approximately the same size and then
mapped to different processors or MPI processes depending on the process ID. Each
processor/process then operates only on the portion of the data assigned to it. This
strategy can be used efficiently for solving iterative problems in which processors can
operate independently on large portions of data, communicating only much smaller data
pieces at each iteration. The processes may need to communicate periodically in order to
exchange data. This approach implies that the program needs to keep track of the data pieces
required by each process at any time instant.
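A common way to realize this ID-based mapping is to let each process compute the index range of its own block from its rank. The Fortran 90 sketch below assumes a one-dimensional collection of n elements; the variable names and the value of n are illustrative:

    program block_partition
       use mpi
       implicit none
       integer, parameter :: n = 1003            ! total number of data elements (illustrative)
       integer :: ierr, rank, nprocs, base, rem, lo, hi

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       ! Blocks of approximately equal size: the first 'rem' processes get one
       ! extra element so that all n elements are covered.
       base = n / nprocs
       rem  = mod(n, nprocs)
       if (rank < rem) then
          lo = rank * (base + 1) + 1
          hi = lo + base
       else
          lo = rank * base + rem + 1
          hi = lo + base - 1
       end if

       print *, 'process', rank, 'operates on elements', lo, 'to', hi
       call MPI_Finalize(ierr)
    end program block_partition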
Task parallelism focuses on the computation that is to be performed rather than on the data
manipulated by the computation. The problem is decomposed according to the work that must
be done. Each task then performs a portion of the overall work.
4. PERFORMANCE METRICS
Three metrics are commonly used to measure the performance of MPI programs: execution
time, speed up, and efficiency. Several factors, such as the number of processors used, the size of
the data being processed, and inter-processor communication, influence a parallel program's
performance.
Amdahl's law treats problem size as a constant and hence the execution time decreases as the
number of processors increases. Gustafson's law [12] gives another formula for predicting the
maximum achievable speed up, which is described by

ψ(n, p) ≤ p + (1 − p)s    (5)

where s is the fraction of total execution time spent in serial code. Both laws ignore the
communication cost; they overestimate the speed up value [3].
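As a worked illustration, taking p = 8 and s = 0.1 in (5) gives ψ(n, 8) ≤ 8 + (1 − 8)(0.1) = 7.3; this bound is optimistic precisely because the communication cost is ignored.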
Efficiency is the ratio of the speed up obtained to the number of processors used [2]; it measures
processor utilization. The efficiency of a parallel system solving an n-size problem on p processors is
given by
0 ≤ ε(n, p) = ψ(n, p) / p ≤ 1    (6)
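For instance, a program that achieves a speed up of ψ(n, 8) = 6 on p = 8 processors has an efficiency of ε(n, 8) = 6/8 = 0.75, i.e. on average each processor does useful work for 75% of the run.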
1. In most cases, increasing the message size will yield better performance. For
communication-intensive applications, a smaller message size reduces MPI application
performance because latency badly affects short messages.
2. For a smaller message size with fewer processors, it is better to implement
broadcasting in terms of non-blocking point-to-point communication (as sketched below),
whereas in other cases broadcasting using MPI_Bcast saves time significantly.
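The following Fortran 90 sketch contrasts the two options for a small buffer; the buffer contents, count, and tag are illustrative and not taken from the paper's programs:

    program manual_bcast
       use mpi
       implicit none
       integer, parameter :: count = 16, tag = 1
       integer :: buf(count), ierr, rank, nprocs, dest
       integer, allocatable :: reqs(:)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       if (rank == 0) buf = 7        ! data known only at the root

       ! Broadcasting expressed as non-blocking point-to-point communication.
       if (rank == 0) then
          allocate(reqs(nprocs - 1))
          do dest = 1, nprocs - 1
             call MPI_Isend(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, reqs(dest), ierr)
          end do
          if (nprocs > 1) call MPI_Waitall(nprocs - 1, reqs, MPI_STATUSES_IGNORE, ierr)
       else
          call MPI_Recv(buf, count, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
       end if

       ! The collective alternative is a single call:
       ! call MPI_Bcast(buf, count, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

       call MPI_Finalize(ierr)
    end program manual_bcast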
Larger testing runs can be conducted on actual clusters to check for scalability and performance
bottlenecks.
The number of processes per processor affects the application performance, so the application
programmer has to be aware of the following considerations:
1. In general, maximum performance is achieved when each process has its own processor.
When the number of processes is less than or equal to the number of processors, the
application runs at its peak performance. Since the total system is either underutilized (there
are unused processors) or fully utilized (all processors are being used), the application is
not hindered by effects such as context switching, cache misses, or virtual memory
thrashing caused by other local processes [14].
2. If too many processes are run, the processors will thrash, continually trying to give each
process its fair share of run time.
3. Running too few processes may not enable the programmer to run meaningful data through
the application, or may not trigger error conditions that occur only with larger numbers of
processes.
We applied the proposed method to two MPI applications. The first one solves the concurrent
wave equation and the second finds the number of primes, as well as the largest prime number,
within an interval of integers. The two applications were also executed in parallel on multiple
physical processors. The recorded serial execution time Ts for both applications is used to compute
their experimental speed up, which is compared with the predicted one.
solving small and medium size parallel problems. The experiment programs were written in
Fortran 90, using MPICH2 version 1.0.6p1 as the message passing implementation.
where i is the position index along the x axis at time t. Equation 8 implies that the amplitude
at each position index i and time t+1 depends on the previous time steps (t, t−1) and the neighboring
points (i−1, i+1). This means that the parallel solution requires inter-process communication. The
parallel solution is based on dividing the vibrating string into points. Each processor is
responsible for repeatedly updating the amplitudes of a number of points over time. At each
iteration, each processor exchanges boundary points with its nearest neighbors. The parallel
algorithm that solves this equation is summarized as follows:
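The structure just described, a loop of local updates plus a boundary (halo) exchange at every iteration, can be sketched in Fortran 90 as below; each process is assumed to hold npts interior points with one ghost point on each side, and all names, sizes, and the simplified update rule are illustrative rather than the paper's original code:

    program wave_halo
       use mpi
       implicit none
       integer, parameter :: npts = 100, nsteps = 50
       integer :: ierr, rank, nprocs, left, right, step, i
       double precision :: u_old(0:npts+1), u(0:npts+1), u_new(0:npts+1)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       left  = rank - 1
       right = rank + 1
       if (left < 0)        left  = MPI_PROC_NULL   ! no neighbour beyond the string ends
       if (right >= nprocs) right = MPI_PROC_NULL

       u_old = 0.0d0
       u     = 0.0d0
       u_new = 0.0d0
       u(npts/2) = 1.0d0                            ! an arbitrary initial displacement

       do step = 1, nsteps
          ! Exchange boundary points with the nearest neighbours.
          call MPI_Sendrecv(u(1),      1, MPI_DOUBLE_PRECISION, left,  0, &
                            u(npts+1), 1, MPI_DOUBLE_PRECISION, right, 0, &
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
          call MPI_Sendrecv(u(npts),   1, MPI_DOUBLE_PRECISION, right, 1, &
                            u(0),      1, MPI_DOUBLE_PRECISION, left,  1, &
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

          ! Local update: new amplitude from the two previous time levels and the
          ! neighbouring points (a simplified finite-difference rule).
          do i = 1, npts
             u_new(i) = 2.0d0*u(i) - u_old(i) + 0.1d0*(u(i-1) - 2.0d0*u(i) + u(i+1))
          end do
          u_old = u
          u     = u_new
       end do

       if (rank == 0) print *, 'finished', nsteps, 'time steps'
       call MPI_Finalize(ierr)
    end program wave_halo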
Applying the proposed speed up prediction method to the wave equation problem using 10 MPI
processes on a single physical processor, we predicted that the application would exhibit a poor
speed up if executed in parallel on multiple physical processors.
Our prediction is based on the observation that the execution time increases rapidly with the number
of MPI processes, as shown in Figure 5. To verify the prediction, we executed the same
MPI code on 8 physical processors. Knowing the execution time of the serial code version, the
experimental speed up was calculated. Figure 6 shows that the maximum speed up achieved by
8 physical processors was only 0.66228534, and hence our prediction was correct.
Figure 5. Execution time versus number of processes for problem 1 on a single physical processor.
Figure 6. Speed up versus number of processors for problem 1.
To be unbiased, we also re-executed the same parallel code using different numbers of processes
on the same 8 physical processors. Figure 7 shows that the execution time was negatively
affected as the number of MPI processes increased, except when running a small number of
MPI processes on the 8 physical processors. The experimental results show no significant speed up
improvement, as shown in Figure 8. This also confirms that our prediction was correct.
Applying the proposed method to the prime number generator problem using 20 MPI processes on
a single physical processor, we predicted that the application would exhibit a linear speed up if
executed in parallel on multiple physical processors.
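To indicate why this problem parallelizes so well, the following Fortran 90 sketch (not the paper's original program; the interval bound and the cyclic distribution scheme are illustrative) counts the primes and finds the largest prime in an interval, combining the per-process results with two reductions:

    program prime_count
       use mpi
       implicit none
       integer, parameter :: n_max = 100000       ! upper end of the interval (illustrative)
       integer :: ierr, rank, nprocs, i, j
       integer :: local_count, local_largest, total_count, largest
       logical :: is_prime

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       local_count = 0
       local_largest = 0
       ! Cyclic distribution: process 'rank' tests candidates 2+rank, 2+rank+nprocs, ...
       do i = 2 + rank, n_max, nprocs
          is_prime = .true.
          do j = 2, int(sqrt(real(i)))
             if (mod(i, j) == 0) then
                is_prime = .false.
                exit
             end if
          end do
          if (is_prime) then
             local_count = local_count + 1
             local_largest = i
          end if
       end do

       ! Combine the partial results on the root process.
       call MPI_Reduce(local_count, total_count, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
       call MPI_Reduce(local_largest, largest, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)

       if (rank == 0) print *, 'primes found:', total_count, '  largest:', largest
       call MPI_Finalize(ierr)
    end program prime_count

Apart from these two reductions, the processes never communicate, which is why an almost linear speed up can be expected.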
Figure 7. Effect of the number of processes on execution time using 8 CPUs for problem 1.
Figure 8. Speed up versus number of processes for problem 1 (ideal and 8, 4, 2, and 1 CPUs).
Our prediction is based on the observation that the execution time increases slowly, or remains nearly
constant, as the number of MPI processes grows, as shown in Figure 9. Running the same MPI code on
8 physical processors achieved a linear speed up, as shown in Figure 10, and hence our prediction was
also correct.
Figure 9. Execution time (seconds) versus number of processes for problem 2 on a single physical processor.
Figure 10. Experimental and ideal speed up versus number of processors for problem 2.
7. CONCLUSION
Concerning the issue of speed up gained from parallelization, deciding whether or not to parallelize
a serial application is not a trivial task.
In this paper we studied the conflicting parameters that affect parallel program performance,
especially for MPI applications, and gave some recommendations to be followed to achieve a
reasonable performance. The nature of the problem is one of the most important factors affecting
the parallel program speed up. If the problem can be divided into independent subparts and no
communication is required, except to split up the problem and combine the final results, then there
is a great parallelization opportunity, and the resulting parallel program will exhibit a linear speed
up. If the same instruction set is applied to all data and process communication is synchronous,
the speed up will be directly proportional to the computation-to-communication ratio. If different
instruction sets must be applied to the data to solve a specific problem and the inter-process
communication is asynchronous, the parallelization opportunity is reduced, and the speed up of the
resulting parallel application will be negatively affected by the extra communication overhead.
We also proposed an experimental method that aids in speed up prediction. The proposed
method is based on running the MPI application with several MPI processes on a single-processor
machine. It gives an indication of the speed up behavior of MPI applications without using extra
parallel hardware facilities, so it is recommended to be applied to MPI applications before running
them on real, powerful cluster machines or expensive parallel systems. The proposed method was
applied to predict the speed up of the MPI applications that solve the wave equation and prime
number generator problems. For both applications, the predicted speed up matched the experimental
speed up achieved when using multiple physical processors.
REFERENCES
[1] A. Chan, D. Ashton, R. Lusk, and W. Gropp, Jumpshot-4 Users Guide, Mathematics
and Computer Science Division, Argonne National Laboratory, July 11, 2007.
[2] A. Grama, A. Gupta, and V. Kumar, "Isoefficiency Function: A Scalability Metric for
Parallel Algorithms and Architectures", IEEE Parallel and Distributed Technology,
Special Issue on Parallel and Distributed Systems: From Theory to Practice, Volume 1,
Number 3, pp. 12-21, August 1993.
[3] A. H. Karp and H. Flatt, "Measuring Parallel Processor Performance", Communications
of the ACM, Volume 33, Number 5, May 1990.
[22] W. Gropp, "MPICH2: A New Start for MPI Implementations", in Recent Advances in
PVM and MPI: 9th European PVM/MPI Users' Group Meeting, Linz, Austria, October
2002.
[23] Y. Aoyama and J. Nakano, "Practical MPI Programming", International Technical Support
Organization, IBM Corporation, SG24-5380-00, August 1999.
[24] Y. Yan, X. Zhang, and Q. Ma, "Software Support for Multiprocessor Latency
Measurement and Evaluation", IEEE Transactions on Software Engineering, Volume
23, Number 1, pp. 4-16, January 1997.
Author