
Practical file

Parallel computing

Submitted by:
Harmanjeet Singh
B. Tech CSE (7)
1803448

Submitted to:
Ms. Dhanwant Kaur

INDEX

Sr. No.  Practical                                                               Page
1        Introduction to parallel computing                                      3 – 4
2        Various parallel computing environments                                 5 – 6
3        Different levels of parallelism                                         7
4        Shared memory and distributed memory                                    8 – 9
5        Basics of MPI (Message Passing Interface)                               10 – 13
6        To learn communication between MPI processes                            14 – 17
7        To get familiarized with advanced communication between MPI processes   18 – 20
8        Basics of OpenMP API (Open Multi-Processing API)                        21 – 25
9        To get familiarized with OpenMP directives                              26 – 29
10       Sharing of work among threads using the Loop Construct in OpenMP        30 – 33

PRACTICAL 1
Introduction To Parallel Computing

It is the use of multiple processing elements simultaneously for solving any problem.
Problems are broken down into instructions and are solved concurrently as each resource
that has been applied to the work is operating at the same time. There are generally four types of
parallelism, available from both proprietary and open-source parallel computing
vendors: bit-level parallelism, instruction-level parallelism, task parallelism, and superword-level
parallelism:

1. Bit-level parallelism: increases processor word size, which reduces the quantity of
instructions the processor must execute in order to perform an operation on variables
greater than the length of the word.
2. Instruction-level parallelism: the hardware approach works upon dynamic parallelism, in
which the processor decides at run-time which instructions to execute in parallel; the
software approach works upon static parallelism, in which the compiler decides which
instructions to execute in parallel
3. Task parallelism: a form of parallelization of computer code across multiple processors
that runs several different tasks at the same time on the same data
4. Superword-level parallelism: a vectorization technique that can exploit parallelism of
inline code

1.1 Advantages:-
Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money, as many resources working together reduce the time and cut
potential costs.
2. Larger problems can be solved that would be impractical to solve with Serial Computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial Computing 'wastes' the potential computing power, whereas Parallel Computing makes
better use of the hardware.

1.2 Why parallel computing?

1. The real world runs in a dynamic manner, i.e. many things happen at the same time but
at different places concurrently. This data is extremely large and hard to manage.
2. Real-world data needs more dynamic simulation and modeling, and parallel computing is
the key to achieving this.
3. Parallel computing provides concurrency and saves time and money.
4. Complex, large datasets and their management can be organized only by using
parallel computing's approach.
5. It ensures the effective utilization of resources. The hardware is guaranteed to be used
effectively, whereas in serial computation only some part of the hardware is used and the
rest is rendered idle.
6. Also, it is impractical to implement real-time systems using serial computing.

1.3 Applications of Parallel Computing:

1. Databases and Data mining.


2. Real-time simulation of systems.
3. Science and Engineering.

4. Advanced graphics, augmented reality, and virtual reality.

1.4 Limitations of Parallel Computing:

1. It involves issues such as communication and synchronization between multiple sub-tasks and
processes, which are difficult to achieve.
2. The algorithms must be managed in such a way that they can be handled in a parallel
mechanism.
3. The algorithms or programs must have low coupling and high cohesion, but it is difficult to
create such programs.
4. Only more technically skilled and expert programmers can code a parallelism-based program
well.

1.5 Future of Parallel Computing:-

The computational landscape has undergone a great transition from serial computing to parallel
computing. Tech giants such as Intel have already taken a step towards parallel computing by
employing multicore processors. Parallel computation will revolutionize the way computers
work in the future, for the better. With the whole world connecting to each other even
more than before, Parallel Computing plays a key role in helping us stay that way. With
faster networks, distributed systems, and multi-processor computers, it becomes even more
necessary.

PRACTICAL 2
Various Parallel Computing Environments

A computer system uses many devices, arranged in different ways to solve many problems.
This constitutes a computing environment where many computers are used to process and
exchange information to handle multiple issues.

The different types of Computing Environments are −

2.1 Personal Computing Environment


In the personal computing environment, there is a single computer system. All the system
processes are available on the computer and executed there. The different devices that
constitute a personal computing environment are laptops, mobiles, printers, computer
systems, scanners etc.

2.2 Time Sharing Computing Environment


The time sharing computing environment allows multiple users to share the system
simultaneously. Each user is provided a time slice and the processor switches rapidly among
the users according to it. Because of this, each user believes that they are the only ones using
the system.

2.3 Client Server Computing Environment

In client server computing, the client requests a resource and the server provides that
resource. A server may serve multiple clients at the same time while a client is in contact with
only one server. Both the client and server usually communicate via a computer network but
sometimes they may reside in the same system.

2.4 Distributed Computing Environment


A distributed computing environment contains multiple nodes that are physically separate but
linked together using the network. All the nodes in this system communicate with each other
and handle processes in tandem. Each of these nodes contains a small part of the distributed
operating system software.

2.5 Cloud Computing Environment


In the cloud computing environment, computing is moved away from individual computer
systems to a cloud of computers. The cloud users only see the service being provided and not
the internal details of how the service is provided. This is done by pooling all the computer
resources and then managing them using software.

2.6 Cluster Computing Environment


The clustered computing environment is similar to the parallel computing environment in that
they both have multiple CPUs. However, a major difference is that clustered systems are created
from two or more individual computer systems merged together, which then work in parallel
with each other.

PRACTICAL 3
Different Levels Of Parallelism

1. Bit-level parallelism:- It is a form of parallel computing based on increasing the
processor's word size. It reduces the number of instructions that the system must
execute in order to perform a task on large-sized data. Example: Consider a scenario where
an 8-bit processor must compute the sum of two 16-bit integers. It must first sum up the 8
lower-order bits, then add the 8 higher-order bits, thus requiring two instructions to perform
the operation. A 16-bit processor can perform the operation with just one instruction. From
the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the
1970s until about 1986, speed-up in computer architecture was driven by doubling computer
word size—the amount of information the processor can manipulate per cycle. Increasing the
word size reduces the number of instructions the processor must execute to perform an
operation on variables whose sizes are greater than the length of the word. For example,
where an 8-bit processor must add two 16-bit integers, the processor must first add the
8 lower-order bits from each integer using the standard addition instruction, then add the
8 higher-order bits using an add-with-carry instruction and the carry bit from the lower order
addition; thus, an 8-bit processor requires two instructions to complete a single operation,
where a 16-bit processor would be able to complete the operation with a single instruction.
2. Instruction-level parallelism:- Without instruction-level parallelism, a processor can only issue
less than one instruction per clock cycle (IPC < 1); such processors are known as subscalar
processors. A computer program is, in essence, a stream of instructions executed by a
processor. These instructions can be re-ordered and combined into groups which are then
executed concurrently without affecting the result of the program. This is called instruction-
level parallelism. Advances in instruction-level parallelism dominated computer architecture
from the mid-1980s until the mid-1990s.

3. Task Parallelism:- Task parallelism employs the decomposition of a task into subtasks
and then allocates each of the subtasks for execution. The processors perform the
execution of sub-tasks concurrently. Task parallelism is the characteristic of a parallel
program that "entirely different calculations can be performed on either the same or different
sets of data". This contrasts with data parallelism, where the same calculation is performed on
the same or different sets of data. Task parallelism involves the decomposition of a task into
sub-tasks and then allocating each sub-task to a processor for execution. The processors would
then execute these sub-tasks concurrently and often cooperatively. Task parallelism does not
usually scale with the size of a problem.

4. Data-level parallelism (DLP):- Instructions from a single stream operate concurrently on
several data elements. It is limited by non-regular data manipulation patterns and by memory
bandwidth.
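
For illustration (a sketch added to this write-up, not from the original file), the loop below applies the same addition to every element of two arrays; because every iteration performs the same operation on different data, a compiler can map it to SIMD/vector instructions, which is exactly data-level parallelism.

/* Data-level parallelism: the same operation is applied to many data
   elements, so iterations can be executed on several elements at once
   (for example with SIMD/vector instructions). */
void add_arrays(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* identical work for every i */
}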

PRACTICAL 4
Shared Memory And Distributed Memory

Shared Memory:- The shared memory in the shared memory model is the memory that
can be simultaneously accessed by multiple processes. This is done so that the processes can
communicate with each other. All POSIX systems, as well as Windows operating systems
use shared memory. A diagram that illustrates the shared memory model of process
communication is given as follows

In the above diagram, the shared memory can be accessed by Process 1 and Process 2.

Advantage of Shared Memory Model

Memory communication is faster on the shared memory model as compared to the message
passing model on the same machine.

Disadvantage of Shared Memory Model

Some of the disadvantages of the shared memory model are as follows:

 All the processes that use the shared memory model need to make sure that they are not
writing to the same memory location.
 Shared memory model may create problems such as synchronization and memory protection
that need to be addressed.
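
To make the shared memory model above concrete, here is a minimal sketch (added for illustration, not part of the original file; it assumes a POSIX-like system that supports MAP_ANONYMOUS): a parent and a child process communicate through one shared integer created with mmap before fork.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One integer that both parent and child can read and write. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0;

    if (fork() == 0) {      /* child process: writes into the shared memory */
        *shared = 42;
        return 0;
    }

    wait(NULL);             /* parent process: waits, then reads the value */
    printf("Value written by the child: %d\n", *shared);
    munmap(shared, sizeof(int));
    return 0;
}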

Distributed Memory:- Distributed computing is a field of computer science that studies


distributed systems. A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate their actions by passing
messages to one another. The components interact with one another in order
to achieve a common goal. Three significant characteristics of distributed systems are:
concurrency of components, lack of a global clock, and independent failure of components. A
central challenge is that, when a component of a system fails, it does not imply that the
entire system fails. Examples of distributed systems vary from SOA-based
systems to massively multiplayer online games to peer-to-peer applications. A computer
program that runs within a distributed system is called a distributed program (and distributed
programming is the process of writing such programs). There are many different types of
implementations for the message passing mechanism, including pure HTTP, RPC-
like connectors and message queues. Distributed computing also refers to the use of
distributed systems to solve computational problems. In distributed computing, a problem is
divided into many tasks, each of which is solved by one or more computers which
communicate with each other via message passing.

PRACTICAL 5

Basics of MPI (Message Passing Interface)

THEORY
MPI - Message Passing Interface

The Message Passing Interface or MPI is a standard for message passing that has been
developed by a consortium consisting of representatives from research laboratories,
universities, and industry. The first version, MPI-1, was standardized in 1994, and the second
version, MPI-2, was developed in 1997. MPI is an explicit message passing paradigm where
tasks communicate with each other by sending messages.

The two main objectives of MPI are portability and high performance. The MPI environment
consists of an MPI library that provides a rich set of functions numbering in the hundreds.
MPI defines the concept of communicators which combine message context and task group
to provide message security. Intra-communicators allow safe message passing within a group
of tasks, and inter-communicators allow safe message passing between two groups of tasks.
MPI provides many different flavors of both blocking and non-blocking point to point
communication primitives, and has support for structured buffers and derived data types. It
also provides many different types of collective communication routines for communication
between tasks belonging to a group. Other functions include those for application-oriented
task topologies, profiling, and environmental query and control functions. MPI-2 also adds
dynamic spawning of MPI tasks to this impressive list of functions.

Key Points:

 MPI is a library, not a language.


 MPI is a specification, not a particular implementation.
 MPI addresses the message passing model.

Implementation of MPI: MPICH

MPICH is a complete implementation of the MPI specification, designed to be both
portable and efficient. The ``CH'' in MPICH stands for ``Chameleon,'' symbol of adaptability
to one's environment and thus of portability. Chameleons are fast, and from the beginning a
secondary goal was to give up as little efficiency as possible for the portability.
MPICH is a unified source distribution, supporting most flavors of Unix and recent versions
of Windows. In addition, binary distributions are available for Windows platforms.

Structure of MPI Program:

#include <mpi.h>

int main(int argc, char ** argv)
{
    // Serial code

    MPI_Init(&argc, &argv);

    // Parallel code

    MPI_Finalize();

    // Serial code
}

A simple MPI program contains a main program in which the parallel code is placed
between MPI_Init and MPI_Finalize.

 MPI_Init

It is used to initialize the parallel code segment and is always used to declare the start of the
parallel code segment.

int MPI_Init(int* argc_ptr /* in/out */, char** argv_ptr[] /* in/out */)
or simply

MPI_Init(&argc,&argv)

 MPI_Finalize

It is used to declare the end of the parallel code segment. It is important to note that it takes
no arguments.

int MPI_Finalize(void)

or simply

MPI_Finalize()

Key Points:

 Must include mpi.h by introducing its header #include<mpi.h>. This provides us with the
function declarations for all MPI functions.
 A program must have a beginning and an ending. The beginning is in the form of an
MPI_Init() call, which indicates to the operating system that this is an MPI program and
allows the OS to do any necessary initialization. The ending is in the form of an
MPI_Finalize() call, which indicates to the OS that “clean-up” with respect to MPI can
commence.

 If the program is embarrassingly parallel, then the operations done between the MPI
initialization and finalization involve no communication.

Predefined Variable Types in MPI

MPI DATA TYPE            C DATA TYPE
MPI_CHAR                 signed char
MPI_SHORT                signed short int
MPI_INT                  signed int
MPI_LONG                 signed long int
MPI_UNSIGNED_CHAR        unsigned char
MPI_UNSIGNED_SHORT       unsigned short int
MPI_UNSIGNED             unsigned int
MPI_UNSIGNED_LONG        unsigned long int
MPI_FLOAT                float
MPI_DOUBLE               double
MPI_LONG_DOUBLE          long double
MPI_BYTE                 -------------
MPI_PACKED               -------------

Our First MPI Program:

#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    MPI_Init(&argc, &argv);
    cout << "Hello World!" << endl;
    MPI_Finalize();
}

On compiling and running the above program, a collection of "Hello World!" messages will
be printed to your screen, equal to the number of processes on which you ran the program,
even though there is only one print statement.

Compilation and Execution of a Program:

For compilation on a Linux terminal: mpicc -o {executable name} {file name with .c extension}

For execution on a Linux terminal: mpirun -np {number of processes} {program name}
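
For instance, assuming the Hello World program above is saved as hello.cpp (a hypothetical file name), it can be compiled with the C++ MPI wrapper mpicxx (mpicc is the corresponding wrapper for C sources) and launched on four processes:

mpicxx -o hello hello.cpp
mpirun -np 4 ./hello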

Determining the Number of Processors and their IDs

There are two important commands very commonly used in MPI:

 MPI_Comm_rank: It provides you with your process identification or rank
(which is an integer ranging from 0 to P − 1, where P is the number of processes
that are running).

int MPI_Comm_rank(MPI_Comm comm /* in */, int* result /* out */)

or simply

MPI_Comm_rank(MPI_COMM_WORLD,&myrank)

 MPI_Comm_size: It provides you with the total number of processes that have been
allocated.

int MPI_Comm_size(MPI_Comm comm /* in */, int* size /* out */)

or simply

MPI_Comm_size(MPI_COMM_WORLD,&mysize)

The argument comm is called the communicator, and it essentially is a designation for a
collection of processes which can communicate with each other. MPI has functionality to
allow you to specify various communicators (differing collections of processes); however,
MPI_COMM_WORLD, which is predefined within MPI and consists of all the processes
initiated when a parallel program is launched, is generally used.
An Example Program:
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    cout << "Hello world from process " << mynode;
    cout << " of " << totalnodes << endl;

    MPI_Finalize();
}

PRACTICAL 6
To learn Communication between MPI processes

THEORY
It is important to observe that when a program is run with MPI, all processes use the same
compiled binary, and hence all processes are running the exact same code. What in an MPI
program distinguishes a parallel program running on P processors from the serial version of the
code running on P processors? Two things distinguish the parallel program:

 Each process uses its process rank to determine what part of the algorithm instructions are
meant for it.

 Processes communicate with each other in order to accomplish the final task.

Even though each process receives an identical copy of the instructions to be executed, this
does not imply that all processes will execute the same instructions. Because each process is
able to obtain its process rank (using MPI_Comm_rank), it can determine which part of the
code it is supposed to run. This is accomplished through the use of IF statements. Code that is
meant to be run by one particular process should be enclosed within an IF statement, which
verifies the process identification number of the process. If the code is not placed within IF
statements specific to a particular id, then the code will be executed by all processes.

The second point is communication between processes; MPI communication can be summed
up in the concept of sending and receiving messages. Sending and receiving are done with the
following two functions: MPI_Send and MPI_Recv.

 MPI_Send

int MPI_Send(void* message /* in */, int count /* in */, MPI_Datatype datatype /* in */,
             int dest /* in */, int tag /* in */, MPI_Comm comm /* in */)

 MPI_Recv

int MPI_Recv(void* message /* out */, int count /* in */, MPI_Datatype datatype /* in */,
             int source /* in */, int tag /* in */, MPI_Comm comm /* in */,
             MPI_Status* status /* out */)

Understanding the Argument Lists

 message - starting address of the send/recv buffer.

 count - number of elements in the send/recv buffer.


 datatype - data type of the elements in the send buffer.
 source - rank of the process sending the data.
 dest - rank of the process receiving the data.
 tag - message tag.
 comm - communicator.
 status - status object.

An Example Program:

The following program demonstrates the use of the send/receive functions, in which the sender
is initialized as node two (2) whereas the receiver is assigned as node four (4). The program
therefore requires at least five (5) nodes; otherwise the sender and receiver should be
initialized to suitable ranks.

#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes;
    int datasize;      // number of data units to be sent/recv
    int sender = 2;    // process number of the sending process
    int receiver = 4;  // process number of the receiving process
    int tag;           // integer message tag
    MPI_Status status; // variable to contain status information

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    // Determine datasize
    double * databuffer = new double[datasize];

    // Fill in sender, receiver, tag on sender/receiver processes,
    // and fill in databuffer on the sender process.

    if(mynode==sender)
        MPI_Send(databuffer, datasize, MPI_DOUBLE, receiver, tag, MPI_COMM_WORLD);

    if(mynode==receiver)
        MPI_Recv(databuffer, datasize, MPI_DOUBLE, sender, tag,
                 MPI_COMM_WORLD, &status);

    // Send/Recv complete

    MPI_Finalize();
}

Key Points:

 In general, the message array for both the sender and receiver should be of the same type and
both of size at least datasize.
 In most cases the sendtype and recvtype are identical.
 The tag can be any integer between 0 and 32767.
 MPI_Recv may use the wildcard MPI_ANY_TAG for the tag. This allows an MPI_Recv to
receive from a send using any tag.
 MPI_Send cannot use the wildcard MPI_ANY_TAG. A specific tag must be specified.
 MPI_Recv may use the wildcard MPI_ANY_SOURCE for the source. This allows an MPI_Recv
to receive from a send from any source.
 MPI_Send must specify the process rank of the destination. No wildcard exists. (A short
sketch illustrating these wildcard rules follows.)
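
To illustrate the wildcard rules above, the following sketch (added here, not part of the original sheet) has rank 0 collect one integer from every other rank in whatever order the messages arrive, using MPI_ANY_SOURCE and MPI_ANY_TAG and then reading the actual sender and tag from the status object:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
    int myrank, totalnodes, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);

    if (myrank != 0) {
        value = 10 * myrank;                       /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 0, myrank, MPI_COMM_WORLD);
    } else {
        for (int k = 1; k < totalnodes; k++) {
            /* Accept whichever message arrives next, from any rank, any tag. */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Received %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}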

An Example Program: To calculate the sum of given numbers in parallel:

The following program calculates the sum of numbers from 1 to 1000 in a parallel fashion
while executing on all the cluster nodes and providing the result at the end on only one node.
It should be noted that the print statement for the sum is only executed on the node that is
ranked zero (0); otherwise the statement would be printed as many times as the number of
nodes in the cluster.

#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char ** argv)
{
    int mynode, totalnodes;
    int sum, startval, endval, accum;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    sum = 0;
    startval = 1000*mynode/totalnodes + 1;
    endval = 1000*(mynode+1)/totalnodes;

    for(int i=startval; i<=endval; i=i+1)
        sum = sum + i;

    if(mynode!=0)
        MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    else
        for(int j=1; j<totalnodes; j=j+1)
        {
            MPI_Recv(&accum, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
            sum = sum + accum;
        }

    if(mynode == 0)
        cout << "The sum from 1 to 1000 is: " << sum << endl;

    MPI_Finalize();
}

PRACTICAL 7
To get familiarized with advanced communication between MPI processes

THEORY
This lab session will focus on more information about sending and receiving in MPI, such as
sending of arrays and simultaneous send and receive.

Key Points
 Whenever you send and receive data, MPI assumes that you have provided non-overlapping
positions in memory. As discussed in the previous lab session, MPI_COMM_WORLD is
referred to as a communicator. In general, a communicator is a collection of processes that
can send messages to each other. MPI_COMM_WORLD is pre-defined in all
implementations of MPI, and it consists of all MPI processes running after the initial
execution of the program.

 In the send/receive, we are required to use a tag. The tag variable is used to distinguish upon
receipt between two messages sent by the same process.

 The order of sending does not necessarily guarantee the order of receiving. Tags are used to
distinguish between messages. MPI allows the tag MPI_ANY_TAG which can be used by
MPI_Recv to accept any valid tag from a sender but you cannot use MPI_ANY_TAG in the
MPI_Send command.

 Similar to the MPI_ANY_TAG wildcard for tags, there is also an MPI_ANY_SOURCE
wildcard that can also be used by MPI_Recv. By using it in an MPI_Recv, a process is ready
to receive from any sending process. Again, you cannot use MPI_ANY_SOURCE in the
MPI_Send command. There is no wildcard for sender destinations.

 When you pass an array to MPI_Send/MPI_Recv, it need not have exactly the number of
items to be sent – it must have greater than or equal to the number of items to be sent.
Suppose, for example, that you had an array of 100 items, but you only wanted to send the
first ten items, you can do so by passing the array to MPI_Send and only stating that ten items
are to be sent.

An Example Program:

In the following MPI code, an array is created on each process and initialized on process 0.
Once the array has been initialized on process 0, it is sent out to each process.

#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char * argv[])
{
    int i;
    int nitems = 10;
    int mynode, totalnodes;
    MPI_Status status;
    double * array;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    array = new double[nitems];

    if(mynode == 0)
    {
        for(i=0; i<nitems; i++)
            array[i] = (double) i;
    }

    if(mynode==0)
        for(i=1; i<totalnodes; i++)
            MPI_Send(array, nitems, MPI_DOUBLE, i, 1, MPI_COMM_WORLD);
    else
        MPI_Recv(array, nitems, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);

    for(i=0; i<nitems; i++)
    {
        cout << "Processor " << mynode;
        cout << ": array[" << i << "] = " << array[i] << endl;
    }

    MPI_Finalize();
}

Key Points:

 An array is created, on each process, using dynamic memory allocation.


 On process 0 only (i.e., mynode == 0), an array is initialized to contain the ascending index
values.
 On process 0, the program proceeds with (totalnodes-1) calls to MPI_Send.
 On all other processes other than 0, MPI_Recv is called to receive the sent message.
 On each individual process, the results of the sending/receiving pair are printed.
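
As an added note (not part of the original lab sheet): the collective communication routines mentioned in Practical 5 can replace this loop of point-to-point calls. A minimal sketch using MPI_Bcast, which copies a buffer from a chosen root rank to every other rank in the communicator, might look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
    int i, nitems = 10, mynode;
    double array[10];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    if (mynode == 0)                      /* only the root fills the array */
        for (i = 0; i < nitems; i++)
            array[i] = (double) i;

    /* One collective call distributes the contents of array on rank 0
       to every other rank, replacing the MPI_Send/MPI_Recv loop above. */
    MPI_Bcast(array, nitems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("Processor %d: array[%d] = %f\n", mynode, nitems - 1, array[nitems - 1]);

    MPI_Finalize();
    return 0;
}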

Simultaneous Send and Receive, MPI_Sendrecv:

The subroutine MPI_Sendrecv exchanges messages with another process. A send-receive


operation is useful for avoiding some kinds of unsafe interaction patterns and for
implementing remote procedure calls. A message sent by a send-receive operation can
be received by MPI_Recv, and a send-receive operation can receive a message sent by an
MPI_Send.

MPI_Sendrecv(&data_to_send, send_count, send_type, destination_ID, send_tag,
             &received_data, receive_count, receive_type, sender_ID, receive_tag,
             comm, &status)

Understanding the Argument Lists

 data_to_send: variable of a C type that corresponds to the MPI send_type supplied below
 send_count: number of data elements to send (int)
 send_type: datatype of elements to send (one of the MPI datatype handles)
 destination_ID: process ID of the destination (int)
 send_tag: send tag (int)
 received_data: variable of a C type that corresponds to the MPI receive_type supplied below
 receive_count: number of data elements to receive (int)
 receive_type: datatype of elements to receive (one of the MPI datatype handles)
 sender_ID: process ID of the sender (int)
 receive_tag: receive tag (int)
 comm: communicator (handle)
 status: status object (MPI_Status)

It should be noted that the arguments listed above combine all the arguments that were
declared separately in the send and receive functions in the previous lab session.
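
As an illustration (a sketch added here, not part of the original sheet), the following minimal program lets ranks 0 and 1 exchange one integer each with a single MPI_Sendrecv call; the variable names are chosen for this example only.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
    int myrank, size, partner, sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && myrank < 2) {      /* only ranks 0 and 1 take part */
        partner = 1 - myrank;           /* 0 pairs with 1 and vice versa */
        sendval = 100 + myrank;         /* value this rank contributes */

        /* Send to the partner and receive from the partner in one call,
           which avoids the deadlock risk of two blocking MPI_Send calls. */
        MPI_Sendrecv(&sendval, 1, MPI_INT, partner, 0,
                     &recvval, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, &status);

        printf("Process %d received %d from process %d\n",
               myrank, recvval, partner);
    }

    MPI_Finalize();
    return 0;
}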

PRACTICAL 8
Basics of OpenMP API (Open Multi-Processing API)

THEORY
OpenMP

OpenMP is a portable and standard Application Program Interface (API) that may be used to
explicitly direct multi-threaded, shared memory parallelism.

OpenMP attempts to standardize existing practices from several different vendor-specific


shared memory environments. OpenMP provides a portable shared memory API across
different platforms including DEC, HP, IBM, Intel, Silicon Graphics/Cray, and Sun. The
languages supported by OpenMP are FORTRAN, C and C++. Its main emphasis is on
performance and scalability.

Goals of OpenMP

 Standardization: Provide a standard among a variety of shared memory


architectures/platforms
 Lean and Mean: Establish a simple and limited set of directives for programming shared
memory machines. Significant parallelism can be implemented by using just 3 or 4 directives.
 Ease of Use: Provide the capability to incrementally parallelize a serial program, unlike
message-passing libraries which typically require an all-or-nothing approach, and also provide
the capability to implement both coarse-grain and fine-grain parallelism.
 Portability: Supports Fortran (77, 90, and 95), C, and C++.

OpenMP Programming Model


Shared Memory, Thread Based Parallelism:
 OpenMP is based upon the existence of multiple threads in the shared memory programming
paradigm. A shared memory process consists of multiple threads.

Explicit Parallelism:
 OpenMP is an explicit (not automatic) programming model, offering the programmer full
control over parallelization.

Fork - Join Model:

 OpenMP uses the fork-join model of parallel execution:

Figure 8.1 Fork and Join Model

 All OpenMP programs begin as a single process: the master thread. The master thread
executes sequentially until the first parallel region construct is encountered.
 FORK: the master thread then creates a team of parallel threads
 The statements in the program that are enclosed by the parallel region construct are then
executed in parallel among the various team threads
 JOIN: When the team threads complete the statements in the parallel region construct, they
synchronize and terminate, leaving only the master thread

Compiler Directive Based:


 Most OpenMP parallelism is specified through the use of compiler directives which are
embedded in C/C++ or Fortran source code.

Nested Parallelism Support:


 The API provides for the placement of parallel constructs inside of other parallel constructs.
 Implementations may or may not support this feature.

Dynamic Threads:
 The API provides for dynamically altering the number of threads which may be used to execute
different parallel regions.
 Implementations may or may not support this feature.

I/O:
 OpenMP specifies nothing about parallel I/O. This is particularly important if multiple threads
attempt to write/read from the same file.
 If every thread conducts I/O to a different file, the issues are not as significant.
 It is entirely up to the programmer to ensure that I/O is conducted correctly within the context
of a multi-threaded program.

Components of OpenMP API

 Comprised of three primary API components (illustrated together in a short sketch after this list):


o Compiler Directives
o Runtime Library Routines
o Environment Variables
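
The three components can be seen together in a brief sketch (added for illustration, not part of the original sheet): a compiler directive (#pragma omp parallel), runtime library routines (omp_get_thread_num() and omp_get_num_threads()), and an environment variable (OMP_NUM_THREADS, which the runtime reads at program start to choose the default team size).

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Run with e.g. OMP_NUM_THREADS=4 set in the environment to pick
       the default number of threads for the parallel region below. */
    #pragma omp parallel                       /* compiler directive */
    {
        /* runtime library routines */
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}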

C / C++ - General Code Structure

#include <omp.h>

main ()
{
    int var1, var2, var3;

    /* Serial code */

    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */
        .
        .
        .
        /* All threads join master thread and disband */
    }
}

Important terms for an OpenMP environment

Construct: A construct is a statement. It consists of a directive and the subsequent structured


block. Note that some directives are not part of a construct.

directive: A C or C++ #pragma followed by the omp identifier, other text, and a new line.
The directive specifies program behavior.

Region: A dynamic extent.

dynamic extent: All statements in the lexical extent, plus any statement inside a function that
is executed as a result of the execution of statements within the lexical extent. A dynamic
extent is also referred to as a region.

lexical extent: Statements lexically contained within a structured block.

structured block: A structured block is a statement (single or compound) that has a single
entry and a single exit. No statement is a structured block if there is a jump into or out of that
statement. A compound statement is a structured block if its execution always begins at the
opening { and always ends at the closing }. An expression statement, selection statement, or
iteration statement is a structured block if the corresponding compound statement obtained by
enclosing it in { and }would be a structured block. A jump statement, labeled statement, or
declaration statement is not a structured block.

Thread: An execution entity having a serial flow of control, a set of private variables, and
access to shared variables.

master thread: The thread that creates a team when a parallel region is entered.

serial region: Statements executed only by the master thread outside of the dynamic extent of
any parallel region.

parallel region: Statements that bind to an OpenMP parallel construct and may be executed
by multiple threads.

Variable: An identifier, optionally qualified by namespace names, that names an object.

Private: A private variable names a block of storage that is unique to the thread making the
reference. Note that there are several ways to specify that a variable is private: a definition
within a parallel region, a threadprivate directive, a private, firstprivate, lastprivate, or
reduction clause, or use of the variable as a for-loop control variable in a for loop immediately
following a for or parallel for directive.

Shared: A shared variable names a single block of storage. All threads in a team that access
this variable will access this single block of storage.

Team: One or more threads cooperating in the execution of a construct.

Serialize: To execute a parallel construct with a team of threads consisting of only a single
thread (which is the master thread for that parallel construct), with serial order of execution
for the statements within the structured block (the same order as if the block were not part of a
parallel construct), and with no effect on the value returned by omp_in_parallel() (apart from
the effects of any nested parallel constructs).


Barrier: A synchronization point that must be reached by all threads in a team. Each
thread waits until all threads in the team arrive at this point. There are explicit barriers
identified by directives and implicit barriers created by the implementation.


PRACTICAL 9
To get familiarized with OpenMP Directives

THEORY
OpenMP Directives Format

OpenMP directives for C/C++ are specified with the pragma preprocessing directive.
#pragma omp directive-name [clause[[,] clause]. . . ] new-line

Where:

 #pragma omp: Required for all OpenMP C/C++ directives.

 directive-name: A valid OpenMP directive. Must appear after the pragma and before any
clauses.

 [clause, ...]: Optional. Clauses can be in any order, and repeated as necessary unless otherwise
restricted.

 Newline: Required. Precedes the structured block which is enclosed by this directive.

General Rules:

 Case sensitive
 Directives follow conventions of the C/C++ standards for compiler directives
 Only one directive-name may be specified per directive
 Each directive applies to at most one succeeding statement, which must be a structured block.
 Long directive lines can be "continued" on succeeding lines by escaping the newline character
with a backslash ("\") at the end of a directive line.

OpenMP Directives or Constructs

• Parallel Construct

• Work-Sharing Constructs

 Loop Construct
 Sections Construct
 Single Construct

• Data-Sharing, No Wait, and Schedule Clauses


• Barrier Construct

• Critical Construct

• Atomic Construct

• Locks

• Master Construct

Directive Scoping

Static (Lexical) Extent:

 The code textually enclosed between the beginning and the end of a structured block following a
directive.
 The static extent of a directive does not span multiple routines or code files.

Orphaned Directive:

 An OpenMP directive that appears independently from another enclosing directive is said to be
an orphaned directive. It exists outside of another directive's static (lexical) extent.
 It can span routines and possibly code files.

Dynamic Extent:

 The dynamic extent of a directive includes both its static (lexical) extent and the extents of its
orphaned directives.

Parallel Construct
This construct is used to specify the computations that should be executed in parallel. Parts of the
program that are not enclosed by a parallel construct will be executed serially. When a thread
encounters this construct, a team of threads is created to execute the associated parallel region,
which is the code dynamically contained within the parallel construct. But although this
construct ensures that computations are performed in parallel, it does not distribute the work of
the region among the threads in a team. In fact, if the programmer does not use the appropriate
syntax to specify this action, the work will be replicated. At the end of a parallel region, there is
an implied barrier that forces all threads to wait until the work inside the region has been
completed. Only the initial thread continues execution after the end of the parallel region.
The thread that encounters the parallel construct becomes the master of the new team. Each
thread in the team is assigned a unique thread number (also referred to as the “thread id”) to
identify it. They range from zero (for the master thread) up to one less than the number of
threads within the team, and they can be accessed by the programmer. Although the parallel
region is executed by all threads in the team, each thread is allowed to follow a different path of
execution.


Format

#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
{
    structured_block
}

Example of a parallel region

#include <omp.h>
#include <stdio.h>

main()
{
    #pragma omp parallel
    {
        printf("The parallel region is executed by thread %d\n",
               omp_get_thread_num());
    } /*-- End of parallel region --*/

} /*-- End of Main Program --*/
Here, the OpenMP library function omp_get_thread_num() is used to obtain the number of
each thread executing the parallel region. Each thread will execute all code in the parallel region,
so that we should expect each to perform the print statement. Note that one cannot make any
assumptions about the order in which the threads will execute the printf statement. When the
code is run again, the order of execution could be different.

Possible output of the code with four threads.

The parallel region is executed by thread 0
The parallel region is executed by thread 3
The parallel region is executed by thread 2
The parallel region is executed by thread 1

Clauses supported by the parallel construct

 if(scalar-expression)
 num_threads(integer-expression)
 private(list)
 firstprivate(list)
 shared(list)
 default(none|shared)


 copyin(list)
 reduction(operator:list)
Details and usage of clauses are discussed in lab session B.4.

Key Points:

 A program that branches into or out of a parallel region is nonconforming. In other words, if a
program does so, then it is illegal, and the behavior is undefined.
 A program must not depend on any ordering of the evaluations of the clauses of the parallel
directive or on any side effects of the evaluations of the clauses.
 At most one if clause can appear on the directive.
 At most one num_threads clause can appear on the directive. The expression for the clause must
evaluate to a positive integer value.

Determining the Number of Threads for a parallel Region

When execution encounters a parallel directive, the value of the if clause or num_threads clause
(if any) on the directive, the current parallel context, and the values of the nthreads-var, dyn-var,
thread-limit-var, max-active-level-var, and nest-var ICVs are used to determine the number of
threads to use in the region.

Note that using a variable in an if or num_threads clause expression of a parallel construct causes
an implicit reference to the variable in all enclosing constructs. The if clause expression and the
num_threads clause expression are evaluated in the context outside of the parallel construct, and
no ordering of those evaluations is specified. It is also unspecified whether, in what order, or
how many times any side-effects of the evaluation of the num_threads or if clause expressions
occur.

Example: use of num_threads Clause

The following example demonstrates the num_threads clause. The parallel region is executed
with a maximum of 10 threads.

#include <omp.h>

main()
{
    #pragma omp parallel num_threads(10)
    {
        /* ... parallel region ... */
    }
}

Specifying a Fixed Number of Threads

Some programs rely on a fixed, pre-specified number of threads to execute correctly. Because
the default setting for the dynamic adjustment of the number of threads is implementation-

defined, such programs can choose to turn off the dynamic threads capability and set the number
of threads explicitly to ensure portability. The following example shows how to do this using
omp_set_dynamic and omp_set_num_threads.

Example:

#include <omp.h>

main()
{
    omp_set_dynamic(0);
    omp_set_num_threads(16);

    #pragma omp parallel num_threads(10)
    {
        /* Parallel region ... */
    }
}


PRACTICAL 10
Sharing of work among threads using the Loop Construct in OpenMP

THEORY
Introduction

OpenMP’s work-sharing constructs are the most important feature of OpenMP. They are used to
distribute computation among the threads in a team. C/C++ has three work-sharing constructs. A
work-sharing construct, along with its terminating construct where appropriate, specifies a region
of code whose work is to be distributed among the executing threads; it also specifies the manner
in which the work in the region is to be parceled out. A work-sharing region must bind to an
active parallel region in order to have an effect. If a work-sharing directive is encountered in an
inactive parallel region or in the sequential part of the program, it is simply ignored. Since work-
sharing directives may occur in procedures that are invoked both from within a parallel region as
well as outside of any parallel regions, they may be exploited during some calls and ignored
during others.

The work-sharing constructs are listed below.

 #pragma omp for: Distribute iterations over the threads

 #pragma omp sections: Distribute independent work units

 #pragma omp single: Only one thread executes the code block

The two main rules regarding work-sharing constructs are as follows:

 Each work-sharing region must be encountered by all threads in a team or by none at all.
 The sequence of work-sharing regions and barrier regions encountered must be the same for
every thread in a team.

A work-sharing construct does not launch new threads and does not have a barrier on entry. By
default, threads wait at a barrier at the end of a work-sharing region until the last thread has
completed its share of the work. However, the programmer can suppress this by using the nowait
clause.
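
Before turning to the loop construct, here is a brief sketch (added for illustration, not part of the original sheet) of the other two work-sharing constructs listed above, sections and single, inside one parallel region:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp sections          /* independent work units, one per section */
        {
            #pragma omp section
            printf("Section A done by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("Section B done by thread %d\n", omp_get_thread_num());
        }                             /* implicit barrier at the end of sections */

        #pragma omp single            /* exactly one thread executes this block */
        printf("Single block done by thread %d\n", omp_get_thread_num());
    }
    return 0;
}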

The Loop Construct

The loop construct causes the iterations of the loop immediately following it to be executed in
parallel. At run time, the loop iterations are distributed across the threads. This is probably the
most widely used of the work-sharing features.

Format:
#pragma omp for [clause ...] newline
    schedule (type [,chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    nowait

    for_loop

Example of a work-sharing loop


Each thread executes a subset of the total iteration space i = 0, . . . , n – 1

#include <omp.h>
#include <stdio.h>

main()
{
    int i, n = 9;   /* n = 9 matches the sample output shown below */

    #pragma omp parallel shared(n) private(i)
    {
        #pragma omp for
        for (i=0; i<n; i++)
            printf("Thread %d executes loop iteration %d\n",
                   omp_get_thread_num(), i);
    }
}

Here we use a parallel directive to define a parallel region and then share its work among threads
via the for work-sharing directive: the #pragma omp for directive states that iterations of the
loop following it will be distributed. Within the loop, we use the OpenMP function
omp_get_thread_num(), this time to obtain and print the number of the executing thread in each
iteration. Clauses on the parallel construct state which data in the region is shared and which is
private. Although not strictly needed since this is enforced by the compiler, loop variable i is
explicitly declared to be a private variable, which means that each thread will have its own copy
of i. Its value is also undefined after the loop has finished. Variable n is made shared.
Output from the example, which is executed for n = 9 and uses four threads, is shown below.

Thread 0 executes loop iteration 0


Thread 0 executes loop iteration 1
Thread 0 executes loop iteration 2
Thread 3 executes loop iteration 7
Thread 3 executes loop iteration 8
Thread 2 executes loop iteration 5
Thread 2 executes loop iteration 6
Thread 1 executes loop iteration 3
Thread 1 executes loop iteration 4

Combined Parallel Work-Sharing Constructs

Combined parallel work-sharing constructs are shortcuts that can be used when a parallel region
comprises precisely one work-sharing construct, that is, the work-sharing region includes all the
code in the parallel region. The semantics of the shortcut directives are identical to explicitly
specifying the parallel construct immediately followed by the work- sharing construct.

Full version:

#pragma omp parallel
{
    #pragma omp for
    for-loop
}

Combined construct:

#pragma omp parallel for
for-loop
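
As a concrete illustration of the combined construct (a sketch added here, not part of the original sheet), the following program uses #pragma omp parallel for together with the reduction clause from the clause list above to add the numbers from 1 to 1000, the same computation performed with MPI in Practical 6:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, n = 1000;
    long sum = 0;

    /* The combined construct creates the team and distributes the loop
       iterations in one directive; reduction(+:sum) gives each thread a
       private partial sum and combines them at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += i;

    printf("The sum from 1 to %d is %ld\n", n, sum);
    return 0;
}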

