Lec5 MPI
Parallel Computing
Fall 2022
The Message Passing Interface
Introduction
The Message-Passing Interface (MPI) is an attempt to create a standard that
allows tasks executing on multiple processors to communicate through
standardized communication primitives.
It defines a standard library for message passing that one can use to
develop message-passing programs in C or Fortran.
The MPI standard defines both the syntax and the semantics of this
functional interface to message passing.
MPI comes in a variety of flavors: freely available implementations such as LAM/MPI
and MPICH, and also commercial versions such as Critical Software's WMPI.
It supports message passing on a variety of platforms, from Linux-based or
Windows-based PCs to supercomputers and multiprocessor systems.
After the introduction of MPI, whose functionality comprises a set of 125
functions, a revision of the standard took place that added C++ support,
external memory access, and also Remote Memory Access (similar to
BSP's put and get capability). The resulting standard is
known as MPI-2 and has grown to roughly 241 functions.
The Message Passing Interface
A minimum set
A minimum set of MPI functions is described below. MPI function names use the prefix
MPI_, and the first letter of the keyword that follows the prefix is capitalized.
A brief explanation of the primitives can be found in the textbook (beginning on
page 242). A more elaborate presentation is available in the optional book.
MPI and MPI-2
Initialization and Termination
#include <mpi.h>
int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);
Multiple processes running the same program are started by the MPI
environment; each of them must call MPI_Init before any other MPI
function, and MPI_Finalize safely terminates the MPI environment before exit.
MPI_Init is passed pointers to main's parameters argc and argv; the
implementation may remove the command-line arguments that it
used/processed itself, leaving only the program's own arguments.
Thus, command-line processing should only be performed in the program
after this function call has executed. On success the call returns
MPI_SUCCESS; otherwise an implementation-dependent error code is
returned.
Definitions are available in <mpi.h>.
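A minimal sketch (not from the slides) of the intended usage: initialize first, inspect the remaining command-line arguments afterwards, and finalize before exiting.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Initialize MPI; the implementation may strip its own arguments from argv. */
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed\n");
        return 1;
    }
    /* From here on, argc/argv contain only the program's own arguments. */
    if (argc > 1)
        printf("First user argument: %s\n", argv[1]);
    MPI_Finalize();
    return 0;
}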
MPI and MPI-2
Abort
int MPI_Abort(MPI_Comm comm, int errcode);
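MPI_Abort makes a best-effort attempt to terminate all processes associated with comm, returning errcode to the invoking environment. A hedged usage sketch (the file name and error condition are illustrative only, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* "input.dat" is a hypothetical file name used only for illustration. */
    FILE *f = fopen("input.dat", "r");
    if (f == NULL) {
        fprintf(stderr, "cannot open input file, aborting\n");
        MPI_Abort(MPI_COMM_WORLD, 1);   /* best-effort termination of all processes */
    }
    fclose(f);
    MPI_Finalize();
    return 0;
}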
MPI and MPI-2
Communicators; Process Control
Under MPI, a communication domain is a set of processes that are
allowed to communicate with each other. Information about such a
domain is stored in a communicator, which uniquely identifies the processes
that participate in a communication operation.
The default communication domain contains all the processes of a parallel
execution; it is called MPI_COMM_WORLD. By using multiple
communicators between possibly overlapping groups of processes, we
make sure that messages do not interfere with each other.
#include <mpi.h>
int MPI_Comm_size ( MPI_Comm comm, int *size);
int MPI_Comm_rank ( MPI_Comm comm, int *rank);
Thus
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
return the number of processes nprocs and the rank (process id) pid of the calling
process.
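One standard way to obtain the additional communicators mentioned above is MPI_Comm_split, which places processes with the same "color" into the same new communicator. The sketch below is not from the slides; it splits MPI_COMM_WORLD into one communicator for even-ranked and one for odd-ranked processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int pid, nprocs, sub_pid;
    MPI_Comm half_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    /* Processes with the same color (pid % 2) end up in the same new communicator;
       the key (pid) determines their ordering within it. */
    MPI_Comm_split(MPI_COMM_WORLD, pid % 2, pid, &half_comm);
    MPI_Comm_rank(half_comm, &sub_pid);
    printf("World rank %d has rank %d in its half\n", pid, sub_pid);

    MPI_Comm_free(&half_comm);
    MPI_Finalize();
    return 0;
}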
MPI
Message-Passing primitives
#include <mpi.h>
/* Blocking send and receive */
int MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Status *stat);
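The parameter meanings are standard: buf, count and dtype describe the message data; dest/src is the rank of the peer process; tag labels the message; and stat reports the actual source, tag and length of the received message. The two-process sketch below is not from the slides; run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int pid, value, received, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (pid == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);        /* to rank 1, tag 7 */
    } else if (pid == 1) {
        MPI_Recv(&received, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);                    /* how many ints arrived */
        printf("Got %d (%d int) from rank %d, tag %d\n",
               received, count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}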
MPI
Message-Passing primitives
The MPI_Send and MPI_Recv functions are blocking; that is, they
do not return until it is safe to modify or use the contents of
the send/receive buffer.
MPI also provides non-blocking send and receive primitives.
These are MPI_Isend and MPI_Irecv, where the I stands for
Immediate.
These functions allow a process to post that it wants to send to
or receive from another process, and then call a function
(e.g. MPI_Wait) to complete the send-receive pair.
Non-blocking sends and receives allow computation and
communication to overlap. Thus MPI_Wait plays the role of a
synchronizer: the posted send/receive are only advisories, and the
communication is only guaranteed to have completed at the MPI_Wait.
MPI
Message-Passing primitives: tag
A tag is simply an integer argument that is passed to a communication function and that
can be used to identify a message. For example, in MPI, if process A sends a
message to process B, then in order for B to receive the message, the tag used in A's call
to MPI_Send must be the same as the tag used in B's call to MPI_Recv. Thus, if the
characteristics of two messages sent by A to B are identical (i.e., same count and
datatype), then A and B can distinguish between the two by using different tags.
For example, suppose A is sending two floats, x and y, to B. Then the processes can be
sure that the values are received correctly, regardless of the order in which A sends and B
receives, provided different tags are used, as in the sketch below:
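A sketch of what this would look like (my_rank, A and B are assumed to be defined as in the example on the following slide, and, as there, we assume the system provides some buffering):

float x, y;
MPI_Status status;

if (my_rank == A) {
    MPI_Send(&x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);   /* tag 0 identifies x */
    MPI_Send(&y, 1, MPI_FLOAT, B, 1, MPI_COMM_WORLD);   /* tag 1 identifies y */
} else if (my_rank == B) {
    /* Even if B posts the receives in the opposite order, the tags ensure
       that x and y end up in the right variables. */
    MPI_Recv(&y, 1, MPI_FLOAT, A, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(&x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
}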
MPI Message-Passing primitives:
Communicators
Tags alone break down when library code also communicates. If one message
from process A to process B is being sent by a library, and another, with
identical characteristics, is being sent by the user's code, then unless the
library developer insists that user programs refrain from using certain tag
values, the tag-based approach cannot be made to work. Clearly, partitioning
the set of possible tags is at best an inconvenience: if one wishes to modify
an existing user code so that it can use a library that partitions tags, each
message-passing function call in the entire user code must be checked.
The solution that was ultimately decided on was the communicator.
Formally, a communicator is a pair of objects: the first is a group or
ordered collection of processes, and the second is a context, which can
be viewed as a unique, system-defined tag. Every communication
function in MPI takes a communicator argument, and a communication
can succeed only if all the processes participating in the communication
use the same communicator argument. Thus, a library can either
require that its functions be passed a unique library-specific
communicator, or its functions can create their own unique
communicator. In either case, it is straightforward for the library
designer and the user to make certain that their messages are not
confused.
For example, suppose now that the user's code is sending a float, x, from process A to process B, while the library is sending a float, y, from A to B:

/* Assume system provides some buffering */
void User_function(int my_rank, float* x) {
    MPI_Status status;
    if (my_rank == A) {
        /* MPI_COMM_WORLD is pre-defined in MPI */
        MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
    } else if (my_rank == B) {
        MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
    }
    ...
}

void Library_function(float* y) {
    MPI_Comm library_comm;
    MPI_Status status;
    int my_rank;
    /* Create a communicator with the same group */
    /* as MPI_COMM_WORLD, but a different context */
    MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
    /* Get process rank in new communicator */
    MPI_Comm_rank(library_comm, &my_rank);
    if (my_rank == A) {
        MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm);
    } else if (my_rank == B) {
        MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status);
    }
    ...
}

int main(int argc, char* argv[]) {
    ...
    if (my_rank == A) {
        User_function(A, &x);
        ...
        Library_function(&y);
    } else if (my_rank == B) {
        Library_function(&y);
        ...
        User_function(B, &x);
    }
    ...
}
MPI Message-Passing primitives:
User-defined datatypes
The second main innovation in MPI, user-defined datatypes, allows programmers
to create messages consisting of logically unified sets of
data rather than only physically contiguous blocks of data.
Loosely, an MPI datatype is a sequence of displacements in memory together with a
collection of basic datatypes (e.g., int, float, double, and char). Thus, an MPI datatype
specifies the layout in memory of data to be collected into a single message or data to be
distributed from a single message.
For example, suppose we specify a sparse matrix entry with the following definition.
typedef struct {
double entry;
int row, col;
} mat_entry_t;
MPI provides functions for creating a variable that stores the layout in memory of a variable
of type mat_entry_t. One does this by first defining an MPI datatype
MPI_Datatype mat_entry_mpi_t;
to be used in communication functions, and then calling various MPI functions to initialize
mat_entry_mpi_t so that it contains the required layout. Then, if we define
mat_entry_t x;
we can send x by simply calling
MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);
and we can receive x with a similar call to MPI_Recv.
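One plausible way to initialize mat_entry_mpi_t, sketched below, uses MPI_Get_address, MPI_Type_create_struct and MPI_Type_commit; the slides do not show the actual calls, so treat this fragment as an assumption about how it would be done.

/* Sketch: build an MPI datatype matching mat_entry_t. */
mat_entry_t  sample;
int          blocklens[2] = { 1, 2 };              /* one double, then two ints */
MPI_Aint     displs[2], base;
MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };
MPI_Datatype mat_entry_mpi_t;

MPI_Get_address(&sample,       &base);
MPI_Get_address(&sample.entry, &displs[0]);
MPI_Get_address(&sample.row,   &displs[1]);
displs[0] -= base;                                 /* displacements relative to the struct start */
displs[1] -= base;

MPI_Type_create_struct(2, blocklens, displs, types, &mat_entry_mpi_t);
MPI_Type_commit(&mat_entry_mpi_t);                 /* must commit before use in communication */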
MPI: An example with the
blocking operations
#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.

// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv) {
    int pid, nprocs, i, j;
    int start, end;
    long long sum, total;   // long long: the sum 1+2+...+N exceeds the range of int
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    sum = 0; total = 0;
    start = (N/nprocs)*pid + 1;    // Each processor sums its own block
    end   = (N/nprocs)*(pid+1);
    for (i = start; i <= end; i++) sum += i;

    if (pid != 0) {
        MPI_Send(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD);
    }
    else {
        for (j = 1; j < nprocs; j++) {
            MPI_Recv(&total, 1, MPI_LONG_LONG, j, 1, MPI_COMM_WORLD, &status);
            sum = sum + total;
        }
    }
    if (pid == 0) {
        printf("The sum from 1 to %d is %lld\n", N, sum);
    }
    MPI_Finalize();
    return 0;
}
// Note: Program neither compiled nor run!
Non-Blocking send and Receive
/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag,
MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag,
MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);
MPI_Wait waits for a non-blocking MPI send or receive to complete.
Input parameter: request - the request handle returned by MPI_Isend/MPI_Irecv.
Output parameter: status - a status object; may be MPI_STATUS_IGNORE.
MPI: An example with the non-blocking operations
#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be a multiple of nprocs to avoid problems.

// Parallel sum of 1, 2, 3, ..., N
int main(int argc, char **argv) {
    int pid, nprocs, i, j;
    int start, end;
    long long sum, total;   // long long: the sum 1+2+...+N exceeds the range of int
    MPI_Status status;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    sum = 0; total = 0;
    start = (N/nprocs)*pid + 1;    // Each processor sums its own block
    end   = (N/nprocs)*(pid+1);
    for (i = start; i <= end; i++) sum += i;

    if (pid != 0) {
        // MPI_Send(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD);
        MPI_Isend(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, &status);
    }
    else {
        for (j = 1; j < nprocs; j++) {
            MPI_Recv(&total, 1, MPI_LONG_LONG, j, 1, MPI_COMM_WORLD, &status);
            sum = sum + total;
        }
    }
    if (pid == 0) {
        printf("The sum from 1 to %d is %lld\n", N, sum);
    }
    MPI_Finalize();
    return 0;
} // Note: Program neither compiled nor run!
MPI Basic Collective Operations
One simple collective operation:
MPI_Bcast
[Figure: MPI_Bcast - process 0 (the root) has the data; the other processes' buffers are initially empty, and they block until the root's data has been written into every process.]
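For reference, the prototype of the call (not shown on the slide) is:

#include <mpi.h>
/* Broadcasts count elements of type dtype from process root to every
   process in comm; all processes must make the same call. */
int MPI_Bcast(void *buf, int count, MPI_Datatype dtype, int root, MPI_Comm comm);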
Simple Program that Demonstrates MPI_Bcast:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int k, id, p, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (id == 0)
        k = 20;
    else
        k = 10;
    for (p = 0; p < size; p++) {
        if (id == p)
            printf("Process %d: k= %d before\n", id, k);
    }
    // Note: MPI_Bcast must be called by every process in the communicator,
    // so it cannot be placed inside a branch taken only by the root.
    MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
    for (p = 0; p < size; p++) {
        if (id == p)
            printf("Process %d: k= %d after\n", id, k);
    }
    MPI_Finalize();
    return 0;
}
Simple Program that Demonstrates
MPI_Bcast:
With 4 processes, the output would look like this (the order of the lines may vary):
Process 0: k= 20 before
Process 0: k= 20 after
Process 3: k= 10 before
Process 3: k= 20 after
Process 2: k= 10 before
Process 2: k= 20 after
Process 1: k= 10 before
Process 1: k= 20 after
MPI
Timing Control
#include <mpi.h>
double MPI_Wtime(void);
It returns the wall-clock time in seconds since an arbitrary point
in the past on the calling processor.
If you want to measure time relative to the beginning of the
MPI program, you may need to do something
like the fragment below.

double t1, mpi_basetime;
MPI_Init(&argc, &argv);
mpi_basetime = MPI_Wtime();       /* record the base time right after initialization */
/* ... the computation being timed ... */
t1 = MPI_Wtime() - mpi_basetime;  /* elapsed time since the base point */
MPI
Barrier Synchronization
#include <mpi.h>
int MPI_Barrier(MPI_Comm comm);
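MPI_Barrier blocks until every process in comm has entered the call. A common use, sketched below with a hypothetical do_work() routine, is to line processes up before and after a phase timed with MPI_Wtime:

/* Sketch: time a computation phase consistently across processes. */
double t_start, t_elapsed;

MPI_Barrier(MPI_COMM_WORLD);        /* everyone starts the clock together */
t_start = MPI_Wtime();

do_work();                          /* placeholder for the phase being timed */

MPI_Barrier(MPI_COMM_WORLD);        /* wait for the slowest process */
t_elapsed = MPI_Wtime() - t_start;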
MPI-2 in LAM MPI
Remote Memory Access: MPI Put
The RMA (Remote Memory Access) facility provided by LAM MPI according to the MPI-2
specification is effected by MPI_Put and MPI_Get. There are multiple ways one can
use such operations under LAM MPI. We mention only one.
#include <mpi.h>
int MPI_Put(void *orgaddr, int orgcnt, MPI_Datatype orgtype, int rank,
            MPI_Aint targdisp, int targcnt, MPI_Datatype targtype, MPI_Win win);
Memory is copied from the origin buffer orgaddr to the window of process rank,
starting at displacement targdisp.
MPI-2 in LAM MPI
Remote Memory Access: MPI Get
A get has a similar syntax.
#include <mpi.h>
int MPI_Get(void *orgaddr, int orgcnt, MPI_Datatype orgtype, int rank,
            MPI_Aint targdisp, int targcnt, MPI_Datatype targtype, MPI_Win win);
Memory is copied from target memory to the origin. The origin datatype may
not specify overlapping entries in the origin buffer. The target buffer must be
contained within the target window, and the copied data must fit, without
truncation, in the origin buffer.
MPI-2 in LAM MPI
Communication Windows
MPI-2 requires that RMA requests be registered through window creation and termination operations.
#include <mpi.h>
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
MPI_Info info, MPI_Comm comm, MPI_Win *newwin)
int MPI_Win_free(MPI_Win *newwin)
int MPI_Win_fence(int assertion, MPI_Win newwin)
Operation MPI_Win_create creates a window for remote memory access operation.
base - initial address of window (choice)
size - size of window in bytes (nonnegative integer)
disp_unit - local unit size for displacements, in bytes (positive integer)
info - info argument (handle)
comm - communicator (handle)
newwin - window object returned by the call (handle)
assertion - a hint to the communication algorithm; use 0 if you don't know what values are
acceptable.
Operation MPI_Win_free frees the window object newwin and returns a null handle (equal to
MPI_WIN_NULL).
The MPI call MPI_Win_fence(assertion, newwin) synchronizes RMA calls on newwin. The call is collective
on the group of newwin. All RMA operations on newwin originating at a given process and started before
the fence call will complete at that process before the fence call returns. They will be completed at their
target before the fence call returns at the target. RMA operations on newwin started by a process after the
fence call returns will access their target window only after MPI_Win_fence has been called by the target
process. Calls to MPI_Win_fence should both precede and follow calls to put, get or accumulate that are
synchronized with fence calls.
In other words, MPI_Win_fence calls must surround an MPI-2 RMA put or get operation.
MPI-2 in LAM MPI
Communication Windows: Example
#include <mpi.h>

int x, y;
int pid, nprocs;
MPI_Win win;

/* ... we assume pid has been retrieved using an appropriate function call ... */

/* Every process exposes its variable y as a window of sizeof(int) bytes. */
MPI_Win_create(&y, sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
if (pid == 0) {
    /* Process 0 writes its x into y on the process with rank 5. */
    MPI_Put(&x, sizeof(int), MPI_CHAR, 5, 0, sizeof(int), MPI_CHAR, win);
}
MPI_Win_fence(0, win);
MPI_Win_free(&win);
MPI-2 in LAM MPI:
Put and Get examples
MPI_Put example:

char *des, *src;          /* destination (window) buffer and source buffer */
int dp, off, size, K;     /* target rank, displacement, transfer size, window size */
MPI_Win win;

/* Every process exposes K bytes starting at des as a window. */
MPI_Win_create(des, K, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
if (pid == 0) {
    /* Process 0 writes size bytes from src into rank dp's window at displacement off. */
    MPI_Put(src, size, MPI_CHAR, dp, off, size, MPI_CHAR, win);
}
MPI_Win_fence(0, win);
MPI_Win_free(&win);
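The slide shows only the put fragment; the corresponding get, sketched below under the same variable names, would instead expose src through the window and have process 0 read from rank dp into its local buffer des:

MPI_Get example (sketch):

/* The window now exposes K bytes starting at src on every process; process 0
   copies size bytes from rank dp's window, at displacement off, into des. */
MPI_Win win;

MPI_Win_create(src, K, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
if (pid == 0) {
    MPI_Get(des, size, MPI_CHAR, dp, off, size, MPI_CHAR, win);
}
MPI_Win_fence(0, win);
MPI_Win_free(&win);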