TO PARALLELIZE OR NOT TO
PARALLELIZE, SPEED UP ISSUE
Alaa Ismail El-Nashar
Abstract
Running parallel applications requires special and expensive processing resources to obtain the required
results within a reasonable time. Before parallelizing a serial application, some analysis should be
carried out to decide whether the application will benefit from parallelization or not. In this paper we
discuss the issue of speed up gained from parallelization using the Message Passing Interface (MPI),
weighing the overhead and cost of parallelization against the parallel speed up gained. We also propose
an experimental method to predict the speed up of MPI applications.
Key words
Parallel programming, Message Passing Interface, Speed up
1. INTRODUCTION
Execution time reduction is one of the most challenging goals of parallel programming.
Theoretically, adding extra processors to a processing system leads to a smaller execution time
of a program compared with its execution time on a system with fewer processors or on a single
machine [9]. In practice, when a program is executed in parallel, the hypothesis that the parallel
program will run faster is not always satisfied. If the main goal of parallelizing a serial program
is to obtain a faster run, then the main criterion to be considered is the speed up gained from
parallelization.
Speed up is defined as the ratio of serial execution time to parallel execution time [2]; it expresses
how many times a parallel program runs faster than the serial version that solves the same problem.
Many conflicting parameters, such as parallel overhead, hardware architecture, programming
paradigm, and programming style, may negatively affect the execution time of a parallel program,
making it larger than that of the serial version, so that any parallelization gain is lost. In order to
obtain a faster parallel program, these conflicting parameters need to be well optimized.
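Written out in the notation used for the performance metrics later in the paper, where ψ(n, p) denotes the speed up of solving an n-size problem on p processors, Ts the serial execution time, and Tp the parallel execution time (the symbol Tp is introduced here only for illustration), the definition reads

ψ(n, p) = Ts / Tp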
Various parallel programming paradigms can be used to write parallel programs, such as
OpenMP [7], Parallel Virtual Machine (PVM) [21], and Message Passing Interface (MPI) [23].
MPI is the most commonly used paradigm for writing parallel programs since it can be
employed not only within a single processing node but also across several connected ones. MPI
enables the programmer to control both data distribution and process synchronization. MPICH2
[22] is an MPI implementation that works well on a wide range of hardware platforms and
also supports the C/C++ and Fortran programming languages.
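As a brief illustration of this programming model, the following minimal Fortran 90 sketch (not taken from the paper's test programs; the message value and printed text are illustrative) shows how each process learns its own identifier, how an explicit message is exchanged, and how the processes are explicitly synchronized:

    program mpi_hello
       use mpi
       implicit none
       integer :: ierr, rank, nprocs, token

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! identifier of this process
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! total number of processes started

       print *, 'hello from process', rank, 'of', nprocs

       ! A single explicit message: process 0 sends a value to the last process.
       if (nprocs > 1) then
          if (rank == 0) then
             token = 42
             call MPI_Send(token, 1, MPI_INTEGER, nprocs - 1, 0, MPI_COMM_WORLD, ierr)
          else if (rank == nprocs - 1) then
             call MPI_Recv(token, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
          end if
       end if

       call MPI_Barrier(MPI_COMM_WORLD, ierr)            ! explicit synchronization point
       call MPI_Finalize(ierr)
    end program mpi_hello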
In this paper we discuss some of the parameters that affect parallel program performance
as a parallelization gain issue, and we also propose an experimental method to predict the speed up
of MPI applications. We focus on parallel programs written with the MPI paradigm using the
MPICH2 implementation.
2. RELATED WORK
Reducing program execution time is one of the advantages that application programmers hope
to achieve. Converting sequential programs into parallel ones is a costly task; it requires special
hardware and software equipment. It is therefore preferable to anticipate, without real parallel
hardware, the speed up gained from parallelism before executing the application on a real
parallel environment.
Several systems have been developed for analyzing the performance of parallel programs.
These systems are either model based or trace based.
Petrini et al. [8] introduced a model-based system to predict the performance of programs on
machines prior to their construction, and to identify the causes of deviations of actual performance
from the predictions. These methods pick up the slight variations in a program's execution that arise
at run time and cannot be modeled by examining the static code.
Vampir [10] and Dimemas [18] are two trace-based analysis tools that predict parallel program
performance. These tools use a trace file, together with the user's selection of network parameters,
in a communication model that simulates the program execution.
The MPE (Multi-Processing Environment) library and Jumpshot [1], which are distributed with the
MPICH implementation [22], provide graphical performance analysis for message passing programs.
In this paper we introduce an experimental approach to predict the speed up of message passing
programs. Our approach is based on executing the parallel program several times on a single
physical processor with different numbers of virtual MPI processes.
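To make the idea concrete, the following Fortran 90 sketch shows one way such a timing run could be structured; the dummy summation loop merely stands in for the real application kernel and is not part of the paper's test programs:

    program predict_speedup
       use mpi
       implicit none
       integer :: ierr, rank, nprocs, i
       double precision :: t_start, t_end, s

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! synchronize all processes before timing
       t_start = MPI_Wtime()

       ! Stand-in for the application kernel; the real program would perform its
       ! share of the decomposed work here.
       s = 0.0d0
       do i = 1, 10000000
          s = s + 1.0d0 / dble(i)
       end do

       call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! wait for the slowest process
       t_end = MPI_Wtime()

       if (rank == 0) print *, nprocs, 'processes:', t_end - t_start, 'seconds'
       call MPI_Finalize(ierr)
    end program predict_speedup

Launching the same executable repeatedly, for example with mpiexec -n 1, mpiexec -n 2, and so on, all on one physical processor, and comparing the reported times indicates how strongly the virtual MPI processes interfere with one another; this is the data on which the prediction is based.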
The first challenge in writing MPI programs is how to divide the problem of interest into
smaller sub-problems. Problem decomposition has two types: data parallelism and task
parallelism.
The data partitioning challenge concerns the manner in which the data can be divided among
the available processors. The data are divided into pieces of approximately the same size and then
mapped to different processors or MPI processes depending on the process ID. Each
processor/process then operates only on the portion of the data assigned to it. This
strategy can be used efficiently for solving iterative problems in which processors can
operate independently on large portions of data, communicating only much smaller data
pieces at each iteration. The processes may need to communicate periodically in order to
exchange data. This approach implies that the program needs to keep track of the data pieces
required by each process at any time instant.
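A common way to realize this ID-based mapping is to let each process compute the index range of its own block from its rank. The Fortran 90 sketch below assumes a one-dimensional collection of n elements; the variable names and the value of n are illustrative:

    program block_partition
       use mpi
       implicit none
       integer, parameter :: n = 1003            ! total number of data elements (illustrative)
       integer :: ierr, rank, nprocs, base, rem, lo, hi

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       ! Blocks of approximately equal size: the first 'rem' processes get one
       ! extra element so that all n elements are covered.
       base = n / nprocs
       rem  = mod(n, nprocs)
       if (rank < rem) then
          lo = rank * (base + 1) + 1
          hi = lo + base
       else
          lo = rank * base + rem + 1
          hi = lo + base - 1
       end if

       print *, 'process', rank, 'operates on elements', lo, 'to', hi
       call MPI_Finalize(ierr)
    end program block_partition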
Task parallelism focuses on the computation that is to be performed rather than on the data
manipulated by the computation. The problem is decomposed according to the work that must
be done. Each task then performs a portion of the overall work.
4. PERFORMANCE METRICS
Three metrics are commonly used to measure the performance of MPI programs: execution
time, speed up, and efficiency. Several factors, such as the number of processors used, the size of
the data being processed, and inter-processor communication, influence a parallel program's
performance.
Amdahl's law treats problem size as a constant and hence the execution time decreases as the
number of processors increases. Gustafson's law [12] gives another formula for predicting the
maximum achievable speed up, which is described by

ψ(n, p) ≤ p + (1 − p)s    (5)

where s is the fraction of total execution time spent in serial code. Both laws ignore the
communication cost; they overestimate the speed up value [3].
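As a worked illustration, taking p = 8 and s = 0.1 in (5) gives ψ(n, 8) ≤ 8 + (1 − 8)(0.1) = 7.3; this bound is optimistic precisely because the communication cost is ignored.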
Efficiency is the ratio of the speed up obtained to the number of processors used [2]; it measures
processor utilization. The efficiency of a parallel system solving an n-size problem on p processors is
given by
0 ≤ ε(n, p) = ψ(n, p) / p ≤ 1    (6)
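For instance, a program that achieves a speed up of ψ(n, 8) = 6 on p = 8 processors has an efficiency of ε(n, 8) = 6/8 = 0.75, i.e. on average each processor does useful work for 75% of the run.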
1. In most cases, increasing the message size will yield better performance. For
communication-intensive applications, a smaller message size reduces MPI application
performance because latency badly affects short messages.
2. For a smaller message size with fewer processors, it is better to implement
broadcasting in terms of non-blocking point-to-point communication (as sketched below),
whereas in other cases broadcasting using MPI_Bcast saves time significantly.
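The following Fortran 90 sketch contrasts the two options for a small buffer; the buffer contents, count, and tag are illustrative and not taken from the paper's programs:

    program manual_bcast
       use mpi
       implicit none
       integer, parameter :: count = 16, tag = 1
       integer :: buf(count), ierr, rank, nprocs, dest
       integer, allocatable :: reqs(:)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       if (rank == 0) buf = 7        ! data known only at the root

       ! Broadcasting expressed as non-blocking point-to-point communication.
       if (rank == 0) then
          allocate(reqs(nprocs - 1))
          do dest = 1, nprocs - 1
             call MPI_Isend(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, reqs(dest), ierr)
          end do
          if (nprocs > 1) call MPI_Waitall(nprocs - 1, reqs, MPI_STATUSES_IGNORE, ierr)
       else
          call MPI_Recv(buf, count, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
       end if

       ! The collective alternative is a single call:
       ! call MPI_Bcast(buf, count, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

       call MPI_Finalize(ierr)
    end program manual_bcast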
Larger testing runs can be conducted on actual clusters to check for scalability and performance
bottlenecks.
The number of processes per processor affects the application performance, so the application
programmer has to be aware of the following considerations:
1. In general, maximum performance is achieved when each process has its own processor.
When the number of processes is less than or equal to the number of processors, the
application runs at its peak performance. Since the total system is either underutilized (there
are unused processors) or fully utilized (all processors are being used), the application is
not hindered by effects such as context switching, cache misses, or virtual memory
thrashing caused by other local processes [14].
2. If too many processes are run, the processors will thrash, continually trying to give each
process its fair share of run time.
3. Running too few processes may not enable the programmer to run meaningful data through
the application, or may not trigger error conditions that occur only with larger numbers of
processes.
We applied the proposed method to two MPI applications. The first one solves the concurrent
wave equation and the second finds the number of primes, as well as the largest prime number,
within an interval of integers. The two applications were also executed in parallel on multiple
physical processors. The recorded serial execution time Ts for both applications is used to compute
their experimental speed up, which is compared with the predicted one.
solving small and medium size parallel problems. The experiment programs were written in
Fortran 90, using MPICH2 version 1.0.6p1 as the message passing implementation.
where i is the position index along the x axis at time t. Equation 8 implies that the amplitude
at each position index i and time t+1 depends on the previous time steps (t, t−1) and the neighboring
points (i−1, i+1). This means that the parallel solution requires inter-process communication. The
parallel solution is based on dividing the vibrating string into points. Each processor is
responsible for repeatedly updating the amplitudes of a number of points over time. At each
iteration, each processor exchanges boundary points with its nearest neighbors. The parallel
algorithm that solves this equation is summarized as follows:
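The structure just described, a loop of local updates plus a boundary (halo) exchange at every iteration, can be sketched in Fortran 90 as below; each process is assumed to hold npts interior points with one ghost point on each side, and all names, sizes, and the simplified update rule are illustrative rather than the paper's original code:

    program wave_halo
       use mpi
       implicit none
       integer, parameter :: npts = 100, nsteps = 50
       integer :: ierr, rank, nprocs, left, right, step, i
       double precision :: u_old(0:npts+1), u(0:npts+1), u_new(0:npts+1)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       left  = rank - 1
       right = rank + 1
       if (left < 0)        left  = MPI_PROC_NULL   ! no neighbour beyond the string ends
       if (right >= nprocs) right = MPI_PROC_NULL

       u_old = 0.0d0
       u     = 0.0d0
       u_new = 0.0d0
       u(npts/2) = 1.0d0                            ! an arbitrary initial displacement

       do step = 1, nsteps
          ! Exchange boundary points with the nearest neighbours.
          call MPI_Sendrecv(u(1),      1, MPI_DOUBLE_PRECISION, left,  0, &
                            u(npts+1), 1, MPI_DOUBLE_PRECISION, right, 0, &
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
          call MPI_Sendrecv(u(npts),   1, MPI_DOUBLE_PRECISION, right, 1, &
                            u(0),      1, MPI_DOUBLE_PRECISION, left,  1, &
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

          ! Local update: new amplitude from the two previous time levels and the
          ! neighbouring points (a simplified finite-difference rule).
          do i = 1, npts
             u_new(i) = 2.0d0*u(i) - u_old(i) + 0.1d0*(u(i-1) - 2.0d0*u(i) + u(i+1))
          end do
          u_old = u
          u     = u_new
       end do

       if (rank == 0) print *, 'finished', nsteps, 'time steps'
       call MPI_Finalize(ierr)
    end program wave_halo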
Applying the proposed speed up prediction method to the wave equation problem using 10 MPI
processes on a single physical processor, we predicted that the application would exhibit a poor
speed up if executed in parallel on multiple physical processors.
Our prediction is based on the observation that the execution time increases rapidly with the number
of MPI processes, as shown in Figure 5. To verify the prediction, we executed the same
MPI code on 8 physical processors. Knowing the execution time of the serial code version, the
experimental speed up was calculated. Figure 6 shows that the maximum speed up achieved by
8 physical processors was only 0.66228534, and hence our prediction was correct.
Figure 5. Execution time versus number of processes for problem 1 on a single physical processor.
Figure 6. Speed up versus number of processors for problem 1.
To be unbiased, we also re-executed the same parallel code using different numbers of processes
on the same 8 physical processors. Figure 7 shows that the execution time was negatively
affected as the number of MPI processes increased, except when running a small number of
MPI processes on the 8 physical processors. The experimental results show no significant speed up
improvement, as shown in Figure 8. This also confirms that our prediction was correct.
Applying the proposed method to the prime number generator problem using 20 MPI processes on
a single physical processor, we predicted that the application would exhibit a linear speed up if
executed in parallel on multiple physical processors.
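To indicate why this problem parallelizes so well, the following Fortran 90 sketch (not the paper's original program; the interval bound and the cyclic distribution scheme are illustrative) counts the primes and finds the largest prime in an interval, combining the per-process results with two reductions:

    program prime_count
       use mpi
       implicit none
       integer, parameter :: n_max = 100000       ! upper end of the interval (illustrative)
       integer :: ierr, rank, nprocs, i, j
       integer :: local_count, local_largest, total_count, largest
       logical :: is_prime

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       local_count = 0
       local_largest = 0
       ! Cyclic distribution: process 'rank' tests candidates 2+rank, 2+rank+nprocs, ...
       do i = 2 + rank, n_max, nprocs
          is_prime = .true.
          do j = 2, int(sqrt(real(i)))
             if (mod(i, j) == 0) then
                is_prime = .false.
                exit
             end if
          end do
          if (is_prime) then
             local_count = local_count + 1
             local_largest = i
          end if
       end do

       ! Combine the partial results on the root process.
       call MPI_Reduce(local_count, total_count, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
       call MPI_Reduce(local_largest, largest, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)

       if (rank == 0) print *, 'primes found:', total_count, '  largest:', largest
       call MPI_Finalize(ierr)
    end program prime_count

Apart from these two reductions, the processes never communicate, which is why an almost linear speed up can be expected.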
Figure 7. Effect of the number of processes on execution time using 8 CPUs for problem 1.
Figure 8. Speed up versus number of processes for problem 1 (ideal and 8, 4, 2, and 1 CPUs).
Our prediction is based on the observation that the execution time increases slowly, or remains nearly
constant, as the number of MPI processes grows, as shown in Figure 9. Running the same MPI code on
8 physical processors achieved a linear speed up, as shown in Figure 10, and hence our prediction was
also correct.
Figure 9. Execution time (seconds) versus number of processes for problem 2 on a single physical processor.
Figure 10. Experimental and ideal speed up versus number of processors for problem 2.
7. CONCLUSION
Concerning the issue of speed up gained from parallelization, deciding whether or not to parallelize
a serial application is not a trivial task.
In this paper we studied the conflicting parameters that affect parallel program performance,
especially for MPI applications, and gave some recommendations to be followed to achieve a
reasonable performance. The nature of the problem is one of the most important factors affecting
the parallel program speed up. If the problem can be divided into independent subparts and no
communication is required, except to split up the problem and combine the final results, then there
is a great parallelization opportunity, and the resulting parallel program will exhibit a linear speed
up. If the same instruction set is applied to all data and process communication is synchronous,
the speed up will be directly proportional to the computation-to-communication ratio. If different
instruction sets must be applied to the data to solve a specific problem and the inter-process
communication is asynchronous, the parallelization opportunity is reduced, and the speed up of the
resulting parallel application will be negatively affected by the extra communication overhead.
We also proposed an experimental method that aids in speed up prediction. The proposed
method is based on running the MPI application with several MPI processes on a single-processor
machine. It gives an indication of the speed up behavior of MPI applications without using extra
parallel hardware facilities, so it is recommended to be applied to MPI applications before running
them on real, powerful cluster machines or expensive parallel systems. The proposed method was
applied to predict the speed up of the MPI applications that solve the wave equation and prime
number generator problems. For both applications, the predicted speed up matched the experimental
speed up achieved when using multiple physical processors.
REFERENCES
[1] A. Chan, D. Ashton, R. Lusk, and W. Gropp, Jumpshot-4 Users Guide, Mathematics
and Computer Science Division, Argonne National Laboratory, July 11, 2007.
[2] A. Grama, A. Gupta, and V. Kumar, "Isoefficiency Function: A Scalability Metric for
Parallel Algorithms and Architectures", IEEE Parallel and Distributed Technology,
Special Issue on Parallel and Distributed Systems: From Theory to Practice, Volume 1,
Number 3, pp. 12-21, August 1993.
[3] A. H. Karp and H. Flatt, "Measuring Parallel Processor Performance", Communications
of the ACM, Volume 33, Number 5, May 1990.
[22] W. Gropp, "MPICH2: A New Start for MPI Implementations", in Recent Advances in
PVM and MPI: 9th European PVM/MPI Users' Group Meeting, Linz, Austria, October
2002.
[23] Y. Aoyama and J. Nakano, "Practical MPI Programming", International Technical Support
Organization, IBM Corporation, SG24-5380-00, August 1999.
[24] Y. Yan, X. Zhang, and Q. Ma, "Software Support for Multiprocessor Latency
Measurement and Evaluation", IEEE Transactions on Software Engineering, Volume
23, Number 1, pp. 4-16, January 1997.
Author