Ch3 L1 PDC CS4172 Fall 2024

Chapter 3 of 'An Introduction to Parallel Programming' focuses on distributed memory programming using the Message-Passing Interface (MPI). It covers the basics of writing MPI programs, including communication functions, data types, and performance evaluation, while introducing concepts like Single-Program Multiple-Data (SPMD) and collective communication. The chapter also discusses the structure and execution of MPI programs, emphasizing the importance of message matching and the use of MPI_Status for receiving messages.

An Introduction to Parallel Programming

Peter Pacheco

Chapter 3
Distributed Memory
Programming with
MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 1


# Chapter Subtitle
Roadmap
■Writing your first MPI program.
■Using the common MPI functions.
■The Trapezoidal Rule in MPI.
■Collective communication.
■MPI derived datatypes.
■Performance evaluation of MPI programs.
■Parallel sorting.
■Safety in MPI programs.

Copyright © 2010, Elsevier Inc. All rights Reserved 2


■ From a programmer’s point of view, a distributed-memory
system consists of a collection of core-memory pairs connected
by a network, and the memory associated with a core is directly
accessible only to that core.
■ On the other hand, from a programmer’s point of view, a shared-
memory system consists of a collection of cores connected to a
globally accessible memory, in which each core can have access
to any memory location.
■ In this chapter, we’re going to start looking at how to program
distributed-memory systems using message passing.

Copyright © 2010, Elsevier Inc. All rights Reserved 3


A distributed memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 4


A shared memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 5


■ Recall that in message-passing programs, a program running on
one core-memory pair is usually called a process, and two
processes can communicate by calling functions:
■ one process calls a send function and the other calls a receive function.
■ The implementation of message-passing that we’ll be using is
called MPI, which is an abbreviation of Message-Passing
Interface.
■ It defines a library of functions that can be called from C and
Fortran programs.
■ We’ll learn about some of MPI’s different send and receive
functions.

Copyright © 2010, Elsevier Inc. All rights Reserved 6


■ We will also learn about some “global” communication
functions that can involve more than two processes.
■ These functions are called collective communications.
■ In the process of learning about all of these MPI functions,
we’ll also learn about some of the fundamental issues involved
in writing message-passing programs—issues such as
■ Data partitioning and
■ I/O in distributed-memory systems.
■ We’ll also revisit the issue of parallel program performance.

Copyright © 2010, Elsevier Inc. All rights Reserved 7


Hello World!

(a classic)

Copyright © 2010, Elsevier Inc. All rights Reserved 8


■ Let’s write a program similar to “hello, world” that makes
some use of MPI.
■ Instead of having each process simply print a message, we’ll
designate one process to do the output, and the other processes
will send it messages, which it will print.
■ In parallel programming, it’s common (one might say
standard) for the processes to be identified by nonnegative
integer ranks.

Copyright © 2010, Elsevier Inc. All rights Reserved 9


Identifying MPI processes
■Common practice to identify processes by
nonnegative integer ranks.

■p processes are numbered 0, 1, 2, .. p-1


■ For our parallel “hello, world,” let’s make process 0 the
designated process, and the other processes will send it
messages

Copyright © 2010, Elsevier Inc. All rights Reserved 10


Our first MPI program

Copyright © 2010, Elsevier Inc. All rights Reserved 11
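
The program itself appears only as an image in this deck. Below is a minimal sketch of a "greetings" program along the lines described on the following slides (process 0 prints; every other process sends it a message); the exact message text is illustrative.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
    const int MAX_STRING = 100;
    char greeting[MAX_STRING];   /* buffer for the message */
    int comm_sz;                 /* number of processes    */
    int my_rank;                 /* my process rank        */

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* Every process other than 0 builds a greeting and sends it to 0. */
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Process 0 prints its own greeting, then receives and prints the rest. */
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}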


execution
■ The details of compiling and running the program depend on
your system, so you may need to check with a local expert.
■ However, recall that when we need to be explicit, we’ll assume
that we’re using a text editor to write the program source, and
the command line to compile and run.
■ Many systems use a command called mpicc for compilation

Copyright © 2010, Elsevier Inc. All rights Reserved 12


Compilation

mpicc -g -Wall -o mpi_hello mpi_hello.c

■ mpicc: wrapper script used to compile the source file mpi_hello.c
■ -g: produce debugging information
■ -Wall: turn on all warnings
■ -o mpi_hello: create this executable file name (as opposed to the default a.out)

Copyright © 2010, Elsevier Inc. All rights Reserved 13


■ Typically mpicc is a script that’s a wrapper for the C
compiler.
■ A wrapper script is a script whose main purpose is to run
some program.
■ In this case, the program is the C compiler.
■ However, the wrapper simplifies the running of the compiler
by telling it where to find the necessary header files, and which
libraries to link with the object file.

Copyright © 2010, Elsevier Inc. All rights Reserved 14


Execution

mpiexec -n <number of processes> <executable>

mpiexec -n 1 ./mpi_hello

run with 1 process

mpiexec -n 4 ./mpi_hello

run with 4 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 15


Execution
mpiexec -n 1 ./mpi_hello

Greetings from process 0 of 1 !

mpiexec -n 4 ./mpi_hello

Greetings from process 0 of 4 !


Greetings from process 1 of 4 !
Greetings from process 2 of 4 !
Greetings from process 3 of 4 !

Copyright © 2010, Elsevier Inc. All rights Reserved 16


3.1.2: MPI Programs
■Written in C.
■Has main.
■Uses stdio.h, string.h, etc.
■Need to add mpi.h header file.
■Identifiers defined by MPI start with
“MPI_”.
■First letter following underscore is
uppercase.
■For function names and MPI-defined types.
■Helps to avoid confusion.
Copyright © 2010, Elsevier Inc. All rights Reserved 17
MPI Components
■MPI_Init
■Tells MPI to do all the necessary setup.

■MPI_Finalize
■Tells MPI we’re done, so clean up anything
allocated for this program.

Copyright © 2010, Elsevier Inc. All rights Reserved 18


Basic Outline

Copyright © 2010, Elsevier Inc. All rights Reserved 19
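
The outline on this slide is an image; a minimal sketch of the shape of an MPI program, assuming nothing beyond MPI_Init and MPI_Finalize:

#include <mpi.h>

int main(int argc, char* argv[]) {
    /* No MPI calls before this point */
    MPI_Init(&argc, &argv);

    /* Work and communication go here */

    MPI_Finalize();
    /* No MPI calls after this point */
    return 0;
}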


Communicators
■A collection of processes that can send
messages to each other.
■MPI_Init defines a communicator that
consists of all the processes created when
the program is started.
■Called MPI_COMM_WORLD.

Copyright © 2010, Elsevier Inc. All rights Reserved 20


Communicators

■ MPI_Comm_size: the number of processes in the communicator.
■ MPI_Comm_rank: my rank (the rank of the process making this call).

Copyright © 2010, Elsevier Inc. All rights Reserved 21
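
The prototypes shown on the slide are an image; the standard ones are reproduced below (the parameter names are just labels):

int MPI_Comm_size(MPI_Comm comm, int* comm_sz_p /* out: number of processes */);
int MPI_Comm_rank(MPI_Comm comm, int* my_rank_p /* out: rank of calling process */);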


■For both functions, the first argument is a
communicator and has the special type defined by
MPI for communicators, MPI_Comm.
■MPI_Comm_size returns in its second argument the
number of processes in the communicator,
■MPI_Comm_rank returns in its second argument the
calling process’s rank in the communicator.
■We’ll often use the variable comm_sz for the number
of processes in MPI_COMM_WORLD, and the
variable my_rank for the process rank.

Copyright © 2010, Elsevier Inc. All rights Reserved 22


SPMD
■Single-Program Multiple-Data
■We compile one program.
■Process 0 does something different.
■Receives messages and prints them while the
other processes do the work.

■The if-else construct makes our program


SPMD.

Copyright © 2010, Elsevier Inc. All rights Reserved 23


Communication

Copyright © 2010, Elsevier Inc. All rights Reserved 24
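
The MPI_Send prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Send(
    void*         msg_buf_p,    /* in: pointer to the message contents */
    int           msg_size,     /* in: number of elements to send      */
    MPI_Datatype  msg_type,     /* in: type of each element            */
    int           dest,         /* in: rank of the receiving process   */
    int           tag,          /* in: nonnegative message tag         */
    MPI_Comm      communicator  /* in: communicator                    */
);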


■The first three arguments,
■msg_buf_p,
■msg_size, and
■msg_type,
■determine the contents of the message.
■The remaining arguments,
■dest,
■tag, and
■communicator,
■determine the destination of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved 25


■ The first argument, msg_buf_p, is a pointer to the block of
memory containing the contents of the message.
■ In our program, this is just the string containing the message,
greeting. (Remember that in C an array, such as a string, is a
pointer.)
■ The second and third arguments, msg_size and msg_type,
determine the amount of data to be sent.
■ In our program, the msg_size argument is the number of characters in
the message, plus one character for the ’\0’ character that terminates C
strings.
■ The msg_type argument is MPI_CHAR.
■ These two arguments together tell the system that the message contains
strlen(greeting)+1 chars.

Copyright © 2010, Elsevier Inc. All rights Reserved 26


Data types
■ Since C types (int, char, etc.) can’t be passed as arguments to
functions, MPI defines a special type, MPI_Datatype, that is
used for the msg_type argument.

■ MPI also defines a number of constant values for this type.

■ The ones we’ll use (and a few others) are listed in Table 3.1.

Copyright © 2010, Elsevier Inc. All rights Reserved 27


Data types

Copyright © 2010, Elsevier Inc. All rights Reserved 28


Communication

Copyright © 2010, Elsevier Inc. All rights Reserved 29
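
The MPI_Recv prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Recv(
    void*         msg_buf_p,    /* out: memory block for the message              */
    int           buf_size,     /* in:  number of objects the block can hold      */
    MPI_Datatype  buf_type,     /* in:  type of the objects                       */
    int           source,       /* in:  rank of the sending process               */
    int           tag,          /* in:  message tag to match                      */
    MPI_Comm      communicator, /* in:  communicator                              */
    MPI_Status*   status_p      /* out: status information (or MPI_STATUS_IGNORE) */
);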


Communication ..
■ Thus the first three arguments specify the memory
available for receiving the message:
■ msg_buf_p points to the block of memory,
■ buf_size determines the number of objects that can be stored in
the block, and
■ buf_type indicates the type of the objects.
■ The next three arguments identify the message.
■ The source argument specifies the process from which the
message should be received.
■ The tag argument should match the tag argument of the
message being sent, and
■ the communicator argument must match the communicator used
by the sending process.

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Message matching

■ Suppose process q calls MPI_Send with dest = r.
■ Suppose process r calls MPI_Recv with src = q.

Copyright © 2010, Elsevier Inc. All rights Reserved 31


■Then the message sent by q with the above call to
MPI_Send can be received by r with the call to
MPI_Recv if
■recv_comm = send_comm,
■recv_tag = send_tag,
■dest = r, and
■src = q.

Copyright © 2010, Elsevier Inc. All rights Reserved 32


■These conditions aren’t quite enough for the message to
be successfully received, however.
■The parameters specified by the first three pairs of
arguments,
■send_buf_p/recv_buf_p,
■send_buf_sz/recv_buf_sz, and
■send_type/recv_type, must specify compatible buffers.
■For detailed rules, see the MPI-3 specification [40].
■Most of the time, the following rule will suffice:
■If recv_type = send_type and
■recv_buf_sz ≥ send_buf_sz,
■then the message sent by q can be successfully received by r.

Copyright © 2010, Elsevier Inc. All rights Reserved 33


■Of course, it can happen that
■one process is receiving messages from multiple processes,
■and the receiving process doesn’t know the order in which
the other processes will send the messages.
For example, suppose
■process 0 is doling out work to processes 1, 2, . . . ,
comm_sz − 1, and
■processes 1, 2, . . . , comm_sz − 1, send their results back
to process 0 when they finish the work.
■If the work assigned to each process takes an
unpredictable amount of time, then 0 has no way of
knowing the order in which the processes will finish.

Copyright © 2010, Elsevier Inc. All rights Reserved 34


■ If process 0 simply receives the results in process rank order
■ first the results from process 1, then the results from process 2, and so on
■ and if (say) process comm_sz−1 finishes first, it could happen that process
comm_sz−1 could sit and wait for the other processes to finish.
■ To avoid this problem MPI provides a special constant
MPI_ANY_SOURCE that can be passed to MPI_Recv.
■ Then, if process 0 executes the following code, it can receive
the results in the order in which the processes finish:

Copyright © 2010, Elsevier Inc. All rights Reserved 35
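
The code referred to above appears only as an image; a sketch of the idea follows, where result, result_sz, result_type, result_tag, comm, and Process_result are placeholders for application-specific items.

/* Process 0 collects one result from each worker, in whatever order
   the workers happen to finish. */
for (int i = 1; i < comm_sz; i++) {
    MPI_Recv(result, result_sz, result_type, MPI_ANY_SOURCE, result_tag,
             comm, MPI_STATUS_IGNORE);
    Process_result(result);
}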


Copyright © 2010, Elsevier Inc. All rights Reserved 36
■Similarly, it’s possible that one process can be
receiving multiple messages with different tags
from another process, and
■the receiving process doesn’t know the order in
which the messages will be sent.
■For this circumstance, MPI provides the
special constant MPI_ANY_TAG that can
be passed to the tag argument of
MPI_Recv.
Copyright © 2010, Elsevier Inc. All rights Reserved 37
status_p argument
■If you think about these rules for a minute,
you’ll notice that a receiver can receive a
message without knowing:
■the amount of data in the message,
■the sender of the message,
■or the tag of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved 38


status_p argument…
■So how can the receiver find out these values?
■Recall that the last argument to MPI_Recv has type
MPI_Status∗.
■The MPI type MPI_Status is a struct with at least the
three members
■ MPI_SOURCE,
■ MPI_TAG, and
■ MPI_ERROR.

Copyright © 2010, Elsevier Inc. All rights Reserved 39


status_p argument…
■ Suppose our program contains the definition
MPI_Status status ;
■ Then after a call to MPI_Recv, in which &status is passed as
the last argument, we can determine the sender and tags by
examining the two members:
■ status . MPI_SOURCE
■ status . MPI_TAG

Copyright © 2010, Elsevier Inc. All rights Reserved 40


status_p argument …

■ The last parameter of MPI_Recv has type MPI_Status*.
■ Given the definition MPI_Status status; the members
status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR
can be examined after the receive.

Copyright © 2010, Elsevier Inc. All rights Reserved 41


How much data am I receiving?
■ The amount of data that’s been received isn’t stored in a field
that’s directly accessible to the application program.
■ However, it can be retrieved with a call to MPI_Get_count.
■ For example, suppose that in our call to MPI_Recv, the type
of the receive buffer is recv_type and, once again, we passed
in &status. Then the call to MPI_Get_count sketched below will
return the number of elements received in its count argument.

Copyright © 2010, Elsevier Inc. All rights Reserved 42
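
The call itself is shown as an image on the slide; it has this form (status and recv_type come from the preceding MPI_Recv):

int count;
MPI_Get_count(&status, recv_type, &count);  /* count now holds the number of elements received */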


How much data am I receiving?
■Note that the count isn’t directly accessible as a
member of the MPI_Status variable, simply because
■it depends on the type of the received data, and,
consequently,
■determining it would probably require a calculation (e.g.,
(number of bytes received)/(bytes per object)).
■And if this information isn’t needed, we shouldn’t waste a
calculation determining it.

Copyright © 2010, Elsevier Inc. All rights Reserved 43


3.1.11 Semantics of MPI_Send and
MPI_Recv
■ What exactly happens when we send a message from one
process to another?
■ Many of the details depend on the particular system, but we
can make a few generalizations.
■ Once the message has been assembled there are essentially two
possibilities:
■ the sending process can buffer the message
■ or it can block.
■ If it buffers the message, the MPI system will place the
message (data and envelope) into its own internal storage, and
the call to MPI_Send will return.

Copyright © 2010, Elsevier Inc. All rights Reserved 44


3.1.11 Semantics of MPI_Send and
MPI_Recv
■ Alternatively, if the system blocks, it will wait until it can
begin transmitting the message, and the call to MPI_Send
may not return immediately.
■ Thus if we use MPI_Send, when the function returns, we
don’t actually know whether the message has been transmitted.
■ We only know that the storage we used for the message, the
send buffer, is available for reuse by our program.
■ If we need to know that the message has been transmitted, or if
we need our call to MPI_Send to return immediately,
regardless of whether the message has been sent,
■ then MPI provides alternative functions for sending.

Copyright © 2010, Elsevier Inc. All rights Reserved 45


Issues with send and receive
■Exact behavior is determined by the MPI
implementation.
■MPI_Send may behave differently with
regard to buffer size, cutoffs and blocking.
■MPI_Recv always blocks until a matching
message is received.
■Know your implementation;
don’t make assumptions!

Copyright © 2010, Elsevier Inc. All rights Reserved 46


■Unlike MPI_Send, MPI_Recv always blocks until a
matching message has been received.
■So when a call to MPI_Recv returns, we know that
there is a message stored in the receive buffer (unless
there’s been an error).

Copyright © 2010, Elsevier Inc. All rights Reserved 47


TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 48


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 49


One trapezoid

Copyright © 2010, Elsevier Inc. All rights Reserved 50


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 51
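
The formulas on these slides are images; the rule itself, with h = (b - a)/n and x_i = a + i*h, is:

Area of one trapezoid = (h/2) * [ f(x_i) + f(x_{i+1}) ]

Integral from a to b of f(x) ≈ h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]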


Pseudo-code for a serial program

Copyright © 2010, Elsevier Inc. All rights Reserved 52
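
The pseudo-code on this slide is an image; a sketch of the serial algorithm, assuming the integrand is available as a function f:

/* Serial trapezoidal rule: integrate f over [a, b] using n trapezoids. */
double Serial_trap(double a, double b, int n) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;
    for (int i = 1; i <= n - 1; i++) {
        double x_i = a + i * h;
        approx += f(x_i);
    }
    return h * approx;
}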


Parallelizing the Trapezoidal Rule
1. Partition problem solution into tasks.
2. Identify communication channels between
tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 53


■In the partitioning phase, we usually try to identify as
many tasks as possible.
■For the trapezoidal rule, we might identify two types
of tasks:
■one type is finding the area of a single trapezoid, and
■ the other is computing the sum of these areas.
■Then the communication channels will join each of
the tasks of the first type to the single task of the
second type.

Copyright © 2010, Elsevier Inc. All rights Reserved 54


Tasks and communications for
Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 55


■ So how can we aggregate the tasks and map them to the cores?
■ Our intuition tells us that the more trapezoids we use, the more
accurate our estimate will be.
■ That is, we should use many trapezoids, and we will use many
more trapezoids than cores.
■ Thus we need to aggregate the computation of the areas of the
trapezoids into groups.
■ A natural way to do this is to split the interval [a, b] up into
comm_sz subintervals. If comm_sz evenly divides n, the
number of trapezoids, we can simply apply the trapezoidal rule
with n/comm_sz trapezoids to each of the comm_sz
subintervals.
■ To finish, we can have one of the processes, say process 0, add
the estimates.
Copyright © 2010, Elsevier Inc. All rights Reserved 56
Parallel pseudo-code
Let’s make the simplifying assumption that comm_sz evenly divides n.

Copyright © 2010, Elsevier Inc. All rights Reserved 57
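
The pseudo-code itself is an image; a sketch of the parallel algorithm it describes:

Get a, b, n;
h = (b - a) / n;
local_n = n / comm_sz;
local_a = a + my_rank * local_n * h;
local_b = local_a + local_n * h;
local_integral = Trap(local_a, local_b, local_n, h);
if (my_rank != 0)
    Send local_integral to process 0;
else {   /* my_rank == 0 */
    total_integral = local_integral;
    for (proc = 1; proc < comm_sz; proc++) {
        Receive local_integral from proc;
        total_integral += local_integral;
    }
    Print total_integral;
}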


■ Notice that in our choice of identifiers, we try to differentiate
between local and global variables.
■ Local variables are variables whose contents are significant only
on the process that’s using them.
■ Some examples from the trapezoidal rule program are local_a,
local_b, and local_n.
■Variables whose contents are significant to all the
processes are sometimes called global variables.
■ Some examples from the trapezoidal rule are a, b, and n.
■Note that this usage is different from the usage you
learned in your introductory programming class, where local
variables are private to a single function and global variables are
accessible to all the functions.

Copyright © 2010, Elsevier Inc. All rights Reserved 58


First version
Let’s defer, for the moment, the issue of input and just “hardwire”
the values for a, b, and n.

Copyright © 2010, Elsevier Inc. All rights Reserved 59


First version (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 60


First version (3)
The Trap function is just an implementation of the serial trapezoidal
rule.

Copyright © 2010, Elsevier Inc. All rights Reserved 61
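
The Trap function is shown as an image; a sketch along those lines, again assuming the integrand f is defined elsewhere:

/* Trap: serial trapezoidal rule on [left_endpt, right_endpt] with
   trap_count trapezoids, each of width base_len. */
double Trap(double left_endpt, double right_endpt, int trap_count, double base_len) {
    double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;
    for (int i = 1; i <= trap_count - 1; i++) {
        double x = left_endpt + i * base_len;
        estimate += f(x);
    }
    return estimate * base_len;
}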


Dealing with I/O
■ Of course, the current version of the parallel trapezoidal rule
has a serious deficiency:
■ it will only compute the integral over the interval [0, 3] using
1024 trapezoids.
■ We can edit the code and recompile, but this is quite a bit of
work compared to simply typing in three new numbers.
■ We need to address the problem of getting input from the user.

■ While we’re talking about input to parallel programs, it might
be a good idea to also take a look at output.

Copyright © 2010, Elsevier Inc. All rights Reserved 62


3.3.1 Output

■ In both the “greetings” program and the trapezoidal rule
program, we’ve assumed that process 0 can write to stdout,
i.e., its calls to printf behave as we might expect.

■ Although the MPI standard doesn’t specify which processes
have access to which I/O devices, virtually all MPI
implementations allow all the processes in
MPI_COMM_WORLD full access to stdout and stderr.

Copyright © 2010, Elsevier Inc. All rights Reserved 63


3.3.1 Output
■So most MPI implementations allow all processes
to execute printf and fprintf(stderr, ...) .
■However, most MPI implementations don’t
provide any automatic scheduling of access to
these devices.
■That is, if multiple processes are attempting to write
to, say, stdout, the order in which the processes’
output appears will be unpredictable.
■ Indeed, it can even happen that the output of one
process will be interrupted by the output of another
process.
Copyright © 2010, Elsevier Inc. All rights Reserved 64
Running with 6 processes
• For example, suppose we try to run an MPI program in which each process
simply prints a message. (See Program 3.4.)
• On our cluster, if we run the program with five processes, it often produces the
“expected” output, with the lines in process-rank order.
• However, when we run it with six processes, the order of the output lines is
unpredictable.

Copyright © 2010, Elsevier Inc. All rights Reserved 65


Program 3.4: Each process just prints a message.

Copyright © 2010, Elsevier Inc. All rights Reserved 66
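
Program 3.4 itself does not survive the slide export; a sketch along those lines (the message text is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank, comm_sz;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Every process writes one line to stdout; with several processes
       the order of the lines is unpredictable. */
    printf("Proc %d of %d > Hello from the output test.\n", my_rank, comm_sz);

    MPI_Finalize();
    return 0;
}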


3.3.2 Input
■Unlike output, most MPI implementations
only allow process 0 in
MPI_COMM_WORLD access to stdin.

■This makes sense: If multiple processes have
access to stdin, which process should get
which parts of the input data?
■Should process 0 get the first line?
■Should process 1 get the second? Or should process 0 get the first
character?

Copyright © 2010, Elsevier Inc. All rights Reserved 67


3.3.2 Input
■So, to write MPI programs that can use scanf, we need
to branch on process rank, with process 0 reading in the
data, and then sending it to the other processes.
■For example, we might write the Get_input function shown
in Program 3.5 for our parallel trapezoidal rule program.
■In this function, process 0 simply reads in the values for a, b,
and n, and sends all three values to each process.
■So this function uses the same basic communication
structure as the “greetings” program, except that now
process 0 is sending to each process, while the other
processes are receiving.

Copyright © 2010, Elsevier Inc. All rights Reserved 68


Input
■ To use this function, we can simply insert a call to it inside our
main function,
■ being careful to put it after we’ve initialized my_rank and
comm_sz: Most MPI implementations only allow process
0 in MPI_COMM_WORLD access to stdin.
■ Process 0 must read the data (scanf) and send to the
other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 69


Program 3.5: Function for reading user input

Copyright © 2010, Elsevier Inc. All rights Reserved 70
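
Program 3.5 appears only as an image; the sketch below follows the description in the text (process 0 reads a, b, and n with scanf and sends all three values to every other process), assuming <stdio.h> and <mpi.h> are included.

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
        for (int dest = 1; dest < comm_sz; dest++) {
            MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(n_p, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(n_p, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}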


COLLECTIVE
COMMUNICATION

Copyright © 2010, Elsevier Inc. All rights Reserved 71


Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and
7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to
processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their
new values.

2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest
value.

Copyright © 2010, Elsevier Inc. All rights Reserved 72


A tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 73


An alternative tree-structured
global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 74


MPI_Reduce

Copyright © 2010, Elsevier Inc. All rights Reserved 75
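
The MPI_Reduce prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Reduce(
    void*         input_data_p,  /* in:  each process's contribution      */
    void*         output_data_p, /* out: result, significant only on dest */
    int           count,         /* in:  number of elements               */
    MPI_Datatype  datatype,      /* in:  type of the elements             */
    MPI_Op        operator,      /* in:  e.g. MPI_SUM                     */
    int           dest_process,  /* in:  rank that receives the result    */
    MPI_Comm      comm           /* in:  communicator                     */
);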


Predefined reduction operators
in MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 76
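
The table is an image; the standard predefined reduction operators include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, and MPI_MINLOC.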


Collective vs. Point-to-Point
Communications
■All the processes in the communicator
must call the same collective function.

■For example, a program that attempts to


match a call to MPI_Reduce on one
process with a call to MPI_Recv on
another process is erroneous, and, in all
likelihood, the program will hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 77


MPI Reduce

A gather operation combined with a specified arithmetic/logical operation.

Example: values could be gathered and then added together by the root
(every process calls MPI_Reduce).

As usual, the same routine is called by each process, with the same parameters.
Collective vs. Point-to-Point
Communications
■The arguments passed by each process to
an MPI collective communication must be
“compatible.”

■For example, if one process passes in 0
as the dest_process and another passes
in 1, then the outcome of a call to
MPI_Reduce is erroneous, and, once
again, the program is likely to hang or
crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 79
Implementation of reduction using a tree
construction

■ With P processes, a tree-structured reduction takes O(log2 P) communication steps.
■ (Figure: the values 14, 39, 53, 120, 66, 29 on P0–P5 are summed pairwise up a tree,
giving partial sums 53, 173, and 95, then 226, and finally the total 321.)
Collective vs. Point-to-Point
Communications
■The output_data_p argument is only used
on dest_process.

■However, all of the processes still need to


pass in an actual argument corresponding
to output_data_p, even if it’s just NULL.

Copyright © 2010, Elsevier Inc. All rights Reserved 81


Collective vs. Point-to-Point
Communications
■Point-to-point communications are
matched on the basis of tags and
communicators.

■Collective communications don’t use tags.


■They’re matched solely on the basis of the
communicator and the order in which
they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved 82


Example (1)

Multiple calls to MPI_Reduce

Copyright © 2010, Elsevier Inc. All rights Reserved 83
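
The table on this slide is an image; a reconstruction of the scenario that the next two slides analyze, assuming every process starts with a = 1 and c = 2, and all calls use MPI_SUM with destination process 0:

Time 0: Process 0 calls MPI_Reduce(&a, &b, ...); Process 1 calls MPI_Reduce(&c, &d, ...); Process 2 calls MPI_Reduce(&a, &b, ...)
Time 1: Process 0 calls MPI_Reduce(&c, &d, ...); Process 1 calls MPI_Reduce(&a, &b, ...); Process 2 calls MPI_Reduce(&c, &d, ...)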


Example (2)
■Suppose that each process calls
MPI_Reduce with operator MPI_SUM, and
destination process 0.

■At first glance, it might seem that after the


two calls to MPI_Reduce, the value of b
will be 3, and the value of d will be 6.

Copyright © 2010, Elsevier Inc. All rights Reserved 84


Example (3)
■However, the names of the memory
locations are irrelevant to the matching
of the calls to MPI_Reduce.
■The order of the calls will determine the
matching, so the value stored in b will be
1+2+1 = 4, and the value stored in d will be
2+1+2 = 5.

Copyright © 2010, Elsevier Inc. All rights Reserved 85


MPI_Allreduce
■Useful in a situation in which
all of the processes need the
result of a global sum in
order to complete some
larger computation.

Copyright © 2010, Elsevier Inc. All rights Reserved 86


Reduction to All
■ int MPI_Allreduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
■ All the processes contribute data, the specified operation is applied, and every
process in the communicator receives the result
■ MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and
a few more
■ MPI_Op_create(): User defined operator

■ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Allreduce: every process has A+B+C+D.
A global sum followed
by distribution of the
result.

Copyright © 2010, Elsevier Inc. All rights Reserved 88


A butterfly-structured global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 89


Broadcast pattern
Sends the same data to each of a group of processes.

A common pattern for getting the same data to all processes, especially at the
beginning of a computation: the same data is sent from one source to all destinations.

Note:
•Patterns given do not mean the implementation does them as shown. Only the
final result is the same in any parallel implementation.
•Patterns do not describe the implementation.
Broadcast
■Data belonging to a
single process is sent
to all of the
processes in the
communicator.

Copyright © 2010, Elsevier Inc. All rights Reserved 91


MPI broadcast operation
Sending the same message to all processes in the communicator.

Notice the same routine is called by each process, with the same parameters.


MPI processes usually execute the same program so this is a handy construction.
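
The MPI_Bcast prototype is shown as an image; the standard one is:

int MPI_Bcast(
    void*         data_p,      /* in/out: data to broadcast (on the source) or buffer to fill */
    int           count,       /* in: number of elements                                      */
    MPI_Datatype  datatype,    /* in: type of the elements                                    */
    int           source_proc, /* in: rank of the broadcasting process                        */
    MPI_Comm      comm         /* in: communicator                                            */
);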
A tree-structured broadcast.

Copyright © 2010, Elsevier Inc. All rights Reserved 93


A version of Get_input that uses
MPI_Bcast

Copyright © 2010, Elsevier Inc. All rights Reserved 94
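
The code on this slide is an image; a sketch of the broadcast-based version, assuming <stdio.h> and <mpi.h>:

/* Get_input using MPI_Bcast: process 0 reads the data, then every process
   (including 0) takes part in the broadcasts. comm_sz is kept for symmetry
   with the earlier point-to-point version. */
void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}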


3.4.6: Data distributions

Compute a vector sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 95


Serial implementation of vector addition

• The work consists of adding the individual components of the
vectors, so we might specify that the tasks are just the
additions of corresponding components.
• Then there is no communication between the tasks, and the
problem of parallelizing vector addition boils down to
aggregating the tasks and assigning them to the cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 96
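
The serial routine is shown as an image; a sketch of it:

/* Serial vector addition: z = x + y, all vectors of length n. */
void Vector_sum(double x[], double y[], double z[], int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}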


Different partitions of a 12-component
vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 97


Partitioning options
■Block partitioning
■Assign blocks of consecutive components to
each process.
■Cyclic partitioning
■Assign components in a round robin fashion.
■Block-cyclic partitioning
■Use a cyclic distribution of blocks of
components.
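
For a 12-component vector and 3 processes, for example, process 0 would get
components 0–3 under a block partition, components 0, 3, 6, 9 under a cyclic
partition, and components 0, 1, 6, 7 under a block-cyclic partition with blocksize 2
(this reconstructs the idea behind the figure two slides back).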

Copyright © 2010, Elsevier Inc. All rights Reserved 98


Parallel implementation of
vector addition

Copyright © 2010, Elsevier Inc. All rights Reserved 99
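
The parallel routine is shown as an image; a sketch of it:

/* Parallel vector addition: each process adds its own block of local_n
   components; no communication is needed. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}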


3.4.7: Scatter
■ Now suppose we want to test our vector addition function.
■ It would be convenient to be able to read the dimension of the
vectors and then read in the vectors x and y.
■ We already know how to read in the dimension of the vectors:
process 0 can prompt the user, read in the value, and broadcast
the value to the other processes.
■ We might try something similar with the vectors: process 0
could read them in and broadcast them to the other processes
■ However, this could be very wasteful. If there are 10 processes
and the vectors have 10,000 components, then each process
will need to allocate storage for vectors with 10,000
components, when it is only operating on subvectors with 1000
components

Copyright © 2010, Elsevier Inc. All rights Reserved 100


3.4.7: Scatter
■MPI_Scatter can be used in a function that
reads in an entire vector on process 0 but
only sends the needed components to
each of the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 101
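
The MPI_Scatter prototype is shown as an image; the standard one is (parameter names are just labels):

int MPI_Scatter(
    void*         send_buf_p,  /* in:  the whole array (significant only at src_proc) */
    int           send_count,  /* in:  number of elements sent to EACH process        */
    MPI_Datatype  send_type,   /* in:  type of the elements sent                      */
    void*         recv_buf_p,  /* out: this process's block                           */
    int           recv_count,  /* in:  number of elements received by each process    */
    MPI_Datatype  recv_type,   /* in:  type of the elements received                  */
    int           src_proc,    /* in:  rank that holds the full array                 */
    MPI_Comm      comm         /* in:  communicator                                   */
);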


Scatter Pattern
Distributes a collection of data items to a group of processes.

A common pattern for getting data to all processes: different data is sent from
the source to each of the destinations.

Usually the data sent are parts of an array.
Scatter Pattern

MPI_Bcast takes a single data element at the root process (the red box)
and copies it to all other processes.

MPI_Scatter takes an array of elements and distributes the elements in the
order of process rank: the first element (in red) goes to process zero, the
second element (in green) goes to process one, and so on.
Basic MPI scatter operation
Sending one or more contiguous elements of an array in the root process to a
separate process.

Notice the same routine is called by each process, with the same parameters.
MPI processes usually execute the same program, so this is a handy construction.
Reading and distributing a vector

Copyright © 2010, Elsevier Inc. All rights Reserved 105
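
The Read_vector function appears only as an image; a sketch following the description in the text, assuming <stdio.h>, <stdlib.h>, and <mpi.h>, and that comm_sz evenly divides n:

void Read_vector(double local_a[], int local_n, int n, char vec_name[],
                 int my_rank, MPI_Comm comm) {
    double* a = NULL;

    if (my_rank == 0) {
        a = malloc(n * sizeof(double));
        printf("Enter the vector %s\n", vec_name);
        for (int i = 0; i < n; i++)
            scanf("%lf", &a[i]);
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
        free(a);
    } else {
        /* The send buffer argument is ignored on non-root processes. */
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
    }
}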


3.4.8: Gather
■ Of course, our test program will be useless unless we can see
the result of our vector addition.
■ So we need to write a function for printing out a distributed
vector.
■ Our function can collect all of the components of the vector
onto process 0, and then process 0 can print all of the
components.
■ The communication in this function can be carried out by
MPI_Gather:

Copyright © 2010, Elsevier Inc. All rights Reserved 106


3.4.8: Gather
■ Collect all of the components of the vector onto process
0, and then process 0 can process all of the components.
■ Note that recv_count is the number of data items received
from each process, not the total number of data items received.

Copyright © 2010, Elsevier Inc. All rights Reserved 107


Gather Pattern

Essentially the reverse of a scatter. It receives data items from a group of
processes.

A common pattern, especially at the end of a computation, to collect results:
data from each of the sources is collected in an array at the destination.
MPI Gather

Having one process collect individual values from a set of processes
(including itself).

As usual, the same routine is called by each process, with the same parameters.


Gather Pattern
• int MPI_Gather(void *sendbuf, int sendcnt,
MPI_Datatype sendtype, void *recvbuf,
int recvcnt, MPI_Datatype recvtype,
int root, MPI_Comm comm)
▪ One process (root) collects data from all the other processes in the same
communicator
▪ Must be called by all the processes with the same arguments

▪ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Gather with root P1: P1 has A B C D; the receive buffers of the
other processes are not used.
Print a distributed vector (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 111


Print a distributed vector (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 112
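
The Print_vector function appears only as an image; a sketch following the description in the text, assuming <stdio.h>, <stdlib.h>, and <mpi.h>, and that comm_sz * local_n == n:

void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
    double* b = NULL;

    if (my_rank == 0) {
        b = malloc(n * sizeof(double));
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
        printf("%s\n", title);
        for (int i = 0; i < n; i++)
            printf("%f ", b[i]);
        printf("\n");
        free(b);
    } else {
        /* The receive buffer argument is ignored on non-root processes. */
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
    }
}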


3.4.9: Allgather

■As a final example, let’s look at how we
might write an MPI function that multiplies
a matrix by a vector.

Copyright © 2010, Elsevier Inc. All rights Reserved 113


Matrix-vector multiplication

■ The i-th component of y is the dot product of the i-th row of A with x.

Copyright © 2010, Elsevier Inc. All rights Reserved 114
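
In symbols, for an m x n matrix A and an n-component vector x:

y_i = a_{i0} x_0 + a_{i1} x_1 + ... + a_{i,n-1} x_{n-1},   for i = 0, 1, ..., m-1.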


Matrix-vector multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved 115


Multiply a matrix by a vector

Serial pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 116


C style arrays

■ However, there are some peculiarities in the way that C programs deal with
two-dimensional arrays.
■ A 2D array would be stored as a one-dimensional array, one row after another
(row-major order).

Copyright © 2010, Elsevier Inc. All rights Reserved 117


■More generally, if our array has n
columns,
■and we use this scheme,
■we see that the element stored in row i and
column j is located in position i × n + j in the
one-dimensional array.

Copyright © 2010, Elsevier Inc. All rights Reserved 118


Program 3.11: Serial matrix-vector multiplication.

Copyright © 2010, Elsevier Inc. All rights Reserved 119
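
Program 3.11 appears only as an image; a sketch of the serial routine it describes:

/* Serial matrix-vector multiplication: A is m x n, stored row-major in a
   one-dimensional array; x has n components; y gets m components. */
void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
    }
}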


■ Now let’s see how we might parallelize this function.
■ An individual task can be the multiplication of an element of A by a
component of x and the addition of this product into a component of y.

■ So we see that if y[i] is assigned to process q, then it would be convenient to


also assign row i of A to process q.
■ This suggests that we partition A by rows.
■ We could partition the rows using a block distribution, a cyclic distribution, or
a block-cyclic distribution.
■ In MPI it’s easiest to use a block distribution.
■ So let’s use a block distribution of the rows of A, and, as usual, assume that
comm_sz evenly divides m, the number of rows.

Copyright © 2010, Elsevier Inc. All rights Reserved 120


■ We are distributing A by rows so that the computation of y[i] will have
all of the needed elements of A, so we should distribute y by blocks.
■ That is, if the ith row of A is assigned to process q, then the ith
component of y should also be assigned to process q.
■ Now the computation of y[i] involves all the elements in the ith row of
A and all the components of x.
■ So we could minimize the amount of communication by simply
assigning all of x to each process.
■ However, in actual applications, especially when the matrix is square,
it’s often the case that a program using matrix-vector multiplication
will execute the multiplication many times, and the result vector y
from one multiplication will be the input vector x for the next iteration.
■ In practice, then, we usually assume that the distribution for x is the
same as the distribution for y.

Copyright © 2010, Elsevier Inc. All rights Reserved 121


■ So if x has a block distribution, how can we arrange that each
process has access to all the components of x before we execute
the matrix-vector multiplication loop?
■ Using the collective communications we’re already familiar with,
we could execute a call to MPI_Gather, followed by a call to
MPI_Bcast.
■ This would, in all likelihood, involve two tree-structured
communications, and we may be able to do better by using a
butterfly.
■ So, once again, MPI provides a single function:

Copyright © 2010, Elsevier Inc. All rights Reserved 122


Allgather
■Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
■As usual, recv_count is the amount of data
being received from each process.

Copyright © 2010, Elsevier Inc. All rights Reserved 123


Gather to All
• int MPI_Allgather( void *sendbuf, int sendcnt,
MPI_Datatype sendtype,
void *recvbuf, int recvcnt,
MPI_Datatype recvtype,
MPI_Comm comm )
▪ All the processes collect data from all the other processes in the same
communicator
▪ Must be called by all the processes with the same arguments

▪ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Allgather: every process has A B C D.
An MPI matrix-vector multiplication function (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 125


An MPI matrix-vector multiplication function (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 126
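
The parallel function appears only as an image; a sketch following the description in the text, assuming <stdlib.h> and <mpi.h>. A and y are block-distributed by rows (local_m rows per process) and x is block-distributed (local_n components per process); MPI_Allgather gives every process a full copy of x.

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
    double* x = malloc(n * sizeof(double));

    /* Collect all the blocks of x onto every process. */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    for (int local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[local_i] += local_A[local_i*n + j] * x[j];
    }

    free(x);
}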


Concluding Remarks (1)
■MPI or the Message-Passing Interface is a
library of functions that can be called from
C, C++, or Fortran programs.
■A communicator is a collection of
processes that can send messages to
each other.
■Many parallel programs use the single-
program multiple data or SPMD approach.

Copyright © 2010, Elsevier Inc. All rights Reserved 127


Concluding Remarks (2)
■Most serial programs are deterministic: if
we run the same program with the same
input we’ll get the same output.
■Parallel programs often don’t possess this
property.
■Collective communications involve all the
processes in a communicator.

Copyright © 2010, Elsevier Inc. All rights Reserved 128
