Ch3 L1 PDC CS4172 Fall 2024

Chapter 3 of 'An Introduction to Parallel Programming' focuses on distributed memory programming using the Message-Passing Interface (MPI). It covers the basics of writing MPI programs, including communication functions, data types, and performance evaluation, while introducing concepts like Single-Program Multiple-Data (SPMD) and collective communication. The chapter also discusses the structure and execution of MPI programs, emphasizing the importance of message matching and the use of MPI_Status for receiving messages.

An Introduction to Parallel Programming

Peter Pacheco

Chapter 3
Distributed Memory
Programming with
MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 1


# Chapter Subtitle
Roadmap
■Writing your first MPI program.
■Using the common MPI functions.
■The Trapezoidal Rule in MPI.
■Collective communication.
■MPI derived datatypes.
■Performance evaluation of MPI programs.
■Parallel sorting.
■Safety in MPI programs.

Copyright © 2010, Elsevier Inc. All rights Reserved 2


■ From a programmer’s point of view, a distributed-memory
system consists of a collection of core-memory pairs connected
by a network, and the memory associated with a core is directly
accessible only to that core.
■ On the other hand, from a programmer’s point of view, a shared-
memory system consists of a collection of cores connected to a
globally accessible memory, in which each core can have access
to any memory location.
■ In this chapter, we’re going to start looking at how to program
distributed-memory systems using message passing.

Copyright © 2010, Elsevier Inc. All rights Reserved 3


A distributed memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 4


A shared memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 5


■ Recall that in message-passing programs, a program running on
one core-memory pair is usually called a process, and two
processes can communicate by calling functions:
■ one process calls a send function and the other calls a receive function.
■ The implementation of message-passing that we’ll be using is
called MPI, which is an abbreviation of Message-Passing
Interface.
■ It defines a library of functions that can be called from C and
Fortran programs.
■ We’ll learn about some of MPI’s different send and receive
functions.

Copyright © 2010, Elsevier Inc. All rights Reserved 6


■ We will also learn about some “global” communication
functions that can involve more than two processes.
■ These functions are called collective communications.
■ In the process of learning about all of these MPI functions,
we’ll also learn about some of the fundamental issues involved
in writing message-passing programs—issues such as
■ Data partitioning and
■ I/O in distributed-memory systems.
■ We’ll also revisit the issue of parallel program performance.

Copyright © 2010, Elsevier Inc. All rights Reserved 7


Hello World!

(a classic)

Copyright © 2010, Elsevier Inc. All rights Reserved 8


■ Let’s write a program similar to “hello, world” that makes
some use of MPI.
■ Instead of having each process simply print a message, we’ll
designate one process to do the output, and the other processes
will send it messages, which it will print.
■ In parallel programming, it’s common (one might say
standard) for the processes to be identified by nonnegative
integer ranks.

Copyright © 2010, Elsevier Inc. All rights Reserved 9


Identifying MPI processes
■Common practice to identify processes by
nonnegative integer ranks.

■p processes are numbered 0, 1, 2, .. p-1


■ For our parallel “hello, world,” let’s make process 0 the
designated process, and the other processes will send it
messages

Copyright © 2010, Elsevier Inc. All rights Reserved 10


Our first MPI program

Copyright © 2010, Elsevier Inc. All rights Reserved 11
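
The program itself appears only as an image in this deck. Below is a minimal sketch of a "greetings" program along the lines described on the following slides (process 0 prints; every other process sends it a message); the exact message text is illustrative.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void) {
    const int MAX_STRING = 100;
    char greeting[MAX_STRING];   /* buffer for the message */
    int comm_sz;                 /* number of processes    */
    int my_rank;                 /* my process rank        */

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* Every process other than 0 builds a greeting and sends it to 0. */
        sprintf(greeting, "Greetings from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Process 0 prints its own greeting, then receives and prints the rest. */
        printf("Greetings from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}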


execution
■ The details of compiling and running the program depend on
your system, so you may need to check with a local expert.
■ However, recall that when we need to be explicit, we’ll assume
that we’re using a text editor to write the program source, and
the command line to compile and run.
■ Many systems use a command called mpicc for compilation

Copyright © 2010, Elsevier Inc. All rights Reserved 12


Compilation

mpicc -g -Wall -o mpi_hello mpi_hello.c

■ mpicc: wrapper script used to compile the source file mpi_hello.c
■ -g: produce debugging information
■ -Wall: turn on all warnings
■ -o mpi_hello: create this executable file name (as opposed to the default a.out)

Copyright © 2010, Elsevier Inc. All rights Reserved 13


■ Typically mpicc is a script that’s a wrapper for the C
compiler.
■ A wrapper script is a script whose main purpose is to run
some program.
■ In this case, the program is the C compiler.
■ However, the wrapper simplifies the running of the compiler
by telling it where to find the necessary header files, and which
libraries to link with the object file.

Copyright © 2010, Elsevier Inc. All rights Reserved 14


Execution

mpiexec -n <number of processes> <executable>

mpiexec -n 1 ./mpi_hello

run with 1 process

mpiexec -n 4 ./mpi_hello

run with 4 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 15


Execution
mpiexec -n 1 ./mpi_hello

Greetings from process 0 of 1 !

mpiexec -n 4 ./mpi_hello

Greetings from process 0 of 4 !


Greetings from process 1 of 4 !
Greetings from process 2 of 4 !
Greetings from process 3 of 4 !

Copyright © 2010, Elsevier Inc. All rights Reserved 16


3.1.2: MPI Programs
■Written in C.
■Has main.
■Uses stdio.h, string.h, etc.
■Need to add mpi.h header file.
■Identifiers defined by MPI start with
“MPI_”.
■First letter following underscore is
uppercase.
■For function names and MPI-defined types.
■Helps to avoid confusion.
Copyright © 2010, Elsevier Inc. All rights Reserved 17
MPI Components
■MPI_Init
■Tells MPI to do all the necessary setup.

■MPI_Finalize
■Tells MPI we’re done, so clean up anything
allocated for this program.

Copyright © 2010, Elsevier Inc. All rights Reserved 18


Basic Outline

Copyright © 2010, Elsevier Inc. All rights Reserved 19
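
The outline on this slide is an image; a minimal sketch of the shape of an MPI program, assuming nothing beyond MPI_Init and MPI_Finalize:

#include <mpi.h>

int main(int argc, char* argv[]) {
    /* No MPI calls before this point */
    MPI_Init(&argc, &argv);

    /* Work and communication go here */

    MPI_Finalize();
    /* No MPI calls after this point */
    return 0;
}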


Communicators
■A collection of processes that can send
messages to each other.
■MPI_Init defines a communicator that
consists of all the processes created when
the program is started.
■Called MPI_COMM_WORLD.

Copyright © 2010, Elsevier Inc. All rights Reserved 20


Communicators

■ MPI_Comm_size: the number of processes in the communicator.
■ MPI_Comm_rank: my rank (the rank of the process making this call).

Copyright © 2010, Elsevier Inc. All rights Reserved 21
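
The prototypes shown on the slide are an image; the standard ones are reproduced below (the parameter names are just labels):

int MPI_Comm_size(MPI_Comm comm, int* comm_sz_p /* out: number of processes */);
int MPI_Comm_rank(MPI_Comm comm, int* my_rank_p /* out: rank of calling process */);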


■For both functions, the first argument is a
communicator and has the special type defined by
MPI for communicators, MPI_Comm.
■MPI_Comm_size returns in its second argument the
number of processes in the communicator,
■MPI_Comm_rank returns in its second argument the
calling process’s rank in the communicator.
■We’ll often use the variable comm_sz for the number
of processes in MPI_COMM_WORLD, and the
variable my_rank for the process rank.

Copyright © 2010, Elsevier Inc. All rights Reserved 22


SPMD
■Single-Program Multiple-Data
■We compile one program.
■Process 0 does something different.
■Receives messages and prints them while the
other processes do the work.

■The if-else construct makes our program


SPMD.

Copyright © 2010, Elsevier Inc. All rights Reserved 23


Communication

Copyright © 2010, Elsevier Inc. All rights Reserved 24
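
The MPI_Send prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Send(
    void*         msg_buf_p,    /* in: pointer to the message contents */
    int           msg_size,     /* in: number of elements to send      */
    MPI_Datatype  msg_type,     /* in: type of each element            */
    int           dest,         /* in: rank of the receiving process   */
    int           tag,          /* in: nonnegative message tag         */
    MPI_Comm      communicator  /* in: communicator                    */
);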


■The first three arguments,
■msg_buf_p,
■msg_size, and
■msg_type,
■determine the contents of the message.
■The remaining arguments,
■dest,
■tag, and
■communicator,
■determine the destination of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved 25


■ The first argument, msg_buf_p, is a pointer to the block of
memory containing the contents of the message.
■ In our program, this is just the string containing the message,
greeting. (Remember that in C an array, such as a string, is a
pointer.)
■ The second and third arguments, msg_size and msg_type,
determine the amount of data to be sent.
■ In our program, the msg_size argument is the number of characters in
the message, plus one character for the ’\0’ character that terminates C
strings.
■ The msg_type argument is MPI_CHAR.
■ These two arguments together tell the system that the message contains
strlen(greeting)+1 chars.

Copyright © 2010, Elsevier Inc. All rights Reserved 26


Data types
■ Since C types (int, char, etc.) can’t be passed as arguments to
functions, MPI defines a special type, MPI_Datatype, that is
used for the msg_type argument.

■ MPI also defines a number of constant values for this type.

■ The ones we’ll use (and a few others) are listed in Table 3.1.

Copyright © 2010, Elsevier Inc. All rights Reserved 27


Data types

Copyright © 2010, Elsevier Inc. All rights Reserved 28


Communication

Copyright © 2010, Elsevier Inc. All rights Reserved 29
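
The MPI_Recv prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Recv(
    void*         msg_buf_p,    /* out: memory block for the message              */
    int           buf_size,     /* in:  number of objects the block can hold      */
    MPI_Datatype  buf_type,     /* in:  type of the objects                       */
    int           source,       /* in:  rank of the sending process               */
    int           tag,          /* in:  message tag to match                      */
    MPI_Comm      communicator, /* in:  communicator                              */
    MPI_Status*   status_p      /* out: status information (or MPI_STATUS_IGNORE) */
);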


Communication ..
■ Thus the first three arguments specify the memory
available for receiving the message:
■ msg_buf_p points to the block of memory,
■ buf_size determines the number of objects that can be stored in
the block, and
■ buf_type indicates the type of the objects.
■ The next three arguments identify the message.
■ The source argument specifies the process from which the
message should be received.
■ The tag argument should match the tag argument of the
message being sent, and
■ the communicator argument must match the communicator used
by the sending process.

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Message matching

■ Suppose process q calls MPI_Send with dest = r.
■ Suppose process r calls MPI_Recv with src = q.

Copyright © 2010, Elsevier Inc. All rights Reserved 31


■Then the message sent by q with the above call to
MPI_Send can be received by r with the call to
MPI_Recv if
■recv_comm = send_comm,
■recv_tag = send_tag,
■dest = r, and
■src = q.

Copyright © 2010, Elsevier Inc. All rights Reserved 32


■These conditions aren’t quite enough for the message to
be successfully received, however.
■The parameters specified by the first three pairs of
arguments,
■send_buf_p/recv_buf_p,
■send_buf_sz/recv_buf_sz, and
■send_type/recv_type, must specify compatible buffers.
■For detailed rules, see the MPI-3 specification [40].
■Most of the time, the following rule will suffice:
■If recv_type = send_type and
■recv_buf_sz ≥ send_buf_sz,
■then the message sent by q can be successfully received by r.

Copyright © 2010, Elsevier Inc. All rights Reserved 33


■Of course, it can happen that
■one process is receiving messages from multiple processes,
■and the receiving process doesn’t know the order in which
the other processes will send the messages.
For example, suppose
■process 0 is doling out work to processes 1, 2, . . . ,
comm_sz − 1, and
■processes 1, 2, . . . , comm_sz − 1, send their results back
to process 0 when they finish the work.
■If the work assigned to each process takes an
unpredictable amount of time, then 0 has no way of
knowing the order in which the processes will finish.

Copyright © 2010, Elsevier Inc. All rights Reserved 34


■ If process 0 simply receives the results in process rank order
■ first the results from process 1, then the results from process 2, and so on
■ and if (say) process comm_sz−1 finishes first, it could happen that process
comm_sz−1 could sit and wait for the other processes to finish.
■ To avoid this problem MPI provides a special constant
MPI_ANY_SOURCE that can be passed to MPI_Recv.
■ Then, if process 0 executes the following code, it can receive
the results in the order in which the processes finish:

Copyright © 2010, Elsevier Inc. All rights Reserved 35
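
The code referred to above appears only as an image; a sketch of the idea follows, where result, result_sz, result_type, result_tag, comm, and Process_result are placeholders for application-specific items.

/* Process 0 collects one result from each worker, in whatever order
   the workers happen to finish. */
for (int i = 1; i < comm_sz; i++) {
    MPI_Recv(result, result_sz, result_type, MPI_ANY_SOURCE, result_tag,
             comm, MPI_STATUS_IGNORE);
    Process_result(result);
}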


Copyright © 2010, Elsevier Inc. All rights Reserved 36
■Similarly, it’s possible that one process can be
receiving multiple messages with different tags
from another process, and
■the receiving process doesn’t know the order in
which the messages will be sent.
■For this circumstance, MPI provides the
special constant MPI_ANY_TAG that can
be passed to the tag argument of
MPI_Recv.
Copyright © 2010, Elsevier Inc. All rights Reserved 37
status_p argument
■If you think about these rules for a minute,
you’ll notice that a receiver can receive a
message without knowing:
■the amount of data in the message,
■the sender of the message,
■or the tag of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved 38


status_p argument…
■So how can the receiver find out these values?
■Recall that the last argument to MPI_Recv has type
MPI_Status∗.
■The MPI type MPI_Status is a struct with at least the
three members
■ MPI_SOURCE,
■ MPI_TAG, and
■ MPI_ERROR.

Copyright © 2010, Elsevier Inc. All rights Reserved 39


status_p argument…
■ Suppose our program contains the definition
MPI_Status status ;
■ Then after a call to MPI_Recv, in which &status is passed as
the last argument, we can determine the sender and tags by
examining the two members:
■ status . MPI_SOURCE
■ status . MPI_TAG

Copyright © 2010, Elsevier Inc. All rights Reserved 40


status_p argument …

■ The last parameter of MPI_Recv has type MPI_Status*.
■ Given the definition MPI_Status status; the members
status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR
can be examined after the receive.

Copyright © 2010, Elsevier Inc. All rights Reserved 41


How much data am I receiving?
■ The amount of data that’s been received isn’t stored in a field
that’s directly accessible to the application program.
■ However, it can be retrieved with a call to MPI_Get_count.
■ For example, suppose that in our call to MPI_Recv, the type
of the receive buffer is recv_type and, once again, we passed
in &status. Then the call to MPI_Get_count sketched below will
return the number of elements received in its count argument.

Copyright © 2010, Elsevier Inc. All rights Reserved 42
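
The call itself is shown as an image on the slide; it has this form (status and recv_type come from the preceding MPI_Recv):

int count;
MPI_Get_count(&status, recv_type, &count);  /* count now holds the number of elements received */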


How much data am I receiving?
■Note that the count isn’t directly accessible as a
member of the MPI_Status variable, simply because
■it depends on the type of the received data, and,
consequently,
■determining it would probably require a calculation (e.g.,
(number of bytes received)/(bytes per object)).
■And if this information isn’t needed, we shouldn’t waste a
calculation determining it.

Copyright © 2010, Elsevier Inc. All rights Reserved 43


3.1.11 Semantics of MPI_Send and
MPI_Recv
■ What exactly happens when we send a message from one
process to another?
■ Many of the details depend on the particular system, but we
can make a few generalizations.
■ Once the message has been assembled there are essentially two
possibilities:
■ the sending process can buffer the message
■ or it can block.
■ If it buffers the message, the MPI system will place the
message (data and envelope) into its own internal storage, and
the call to MPI_Send will return.

Copyright © 2010, Elsevier Inc. All rights Reserved 44


3.1.11 Semantics of MPI_Send and
MPI_Recv
■ Alternatively, if the system blocks, it will wait until it can
begin transmitting the message, and the call to MPI_Send
may not return immediately.
■ Thus if we use MPI_Send, when the function returns, we
don’t actually know whether the message has been transmitted.
■ We only know that the storage we used for the message, the
send buffer, is available for reuse by our program.
■ If we need to know that the message has been transmitted, or if
we need our call to MPI_Send to return immediately,
regardless of whether the message has been sent,
■ then MPI provides alternative functions for sending.

Copyright © 2010, Elsevier Inc. All rights Reserved 45


Issues with send and receive
■Exact behavior is determined by the MPI
implementation.
■MPI_Send may behave differently with
regard to buffer size, cutoffs and blocking.
■MPI_Recv always blocks until a matching
message is received.
■Know your implementation;
don’t make assumptions!

Copyright © 2010, Elsevier Inc. All rights Reserved 46


■Unlike MPI_Send, MPI_Recv always blocks until a
matching message has been received.
■So when a call to MPI_Recv returns, we know that
there is a message stored in the receive buffer (unless
there’s been an error).

Copyright © 2010, Elsevier Inc. All rights Reserved 47


TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 48


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 49


One trapezoid

Copyright © 2010, Elsevier Inc. All rights Reserved 50


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 51
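
The formulas on these slides are images; the rule itself, with h = (b - a)/n and x_i = a + i*h, is:

Area of one trapezoid = (h/2) * [ f(x_i) + f(x_{i+1}) ]

Integral from a to b of f(x) ≈ h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2 ]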


Pseudo-code for a serial program

Copyright © 2010, Elsevier Inc. All rights Reserved 52
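
The pseudo-code on this slide is an image; a sketch of the serial algorithm, assuming the integrand is available as a function f:

/* Serial trapezoidal rule: integrate f over [a, b] using n trapezoids. */
double Serial_trap(double a, double b, int n) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;
    for (int i = 1; i <= n - 1; i++) {
        double x_i = a + i * h;
        approx += f(x_i);
    }
    return h * approx;
}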


Parallelizing the Trapezoidal Rule
1. Partition problem solution into tasks.
2. Identify communication channels between
tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 53


■In the partitioning phase, we usually try to identify as
many tasks as possible.
■For the trapezoidal rule, we might identify two types
of tasks:
■one type is finding the area of a single trapezoid, and
■ the other is computing the sum of these areas.
■Then the communication channels will join each of
the tasks of the first type to the single task of the
second type.

Copyright © 2010, Elsevier Inc. All rights Reserved 54


Tasks and communications for
Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 55


■ So how can we aggregate the tasks and map them to the cores?
■ Our intuition tells us that the more trapezoids we use, the more
accurate our estimate will be.
■ That is, we should use many trapezoids, and we will use many
more trapezoids than cores.
■ Thus we need to aggregate the computation of the areas of the
trapezoids into groups.
■ A natural way to do this is to split the interval [a, b] up into
comm_sz subintervals. If comm_sz evenly divides n, the
number of trapezoids, we can simply apply the trapezoidal rule
with n/comm_sz trapezoids to each of the comm_sz
subintervals.
■ To finish, we can have one of the processes, say process 0, add
the estimates.
Copyright © 2010, Elsevier Inc. All rights Reserved 56
Parallel pseudo-code
Let’s make the simplifying assumption that comm_sz evenly divides n.

Copyright © 2010, Elsevier Inc. All rights Reserved 57
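
The pseudo-code itself is an image; a sketch of the parallel algorithm it describes:

Get a, b, n;
h = (b - a) / n;
local_n = n / comm_sz;
local_a = a + my_rank * local_n * h;
local_b = local_a + local_n * h;
local_integral = Trap(local_a, local_b, local_n, h);
if (my_rank != 0)
    Send local_integral to process 0;
else {   /* my_rank == 0 */
    total_integral = local_integral;
    for (proc = 1; proc < comm_sz; proc++) {
        Receive local_integral from proc;
        total_integral += local_integral;
    }
    Print total_integral;
}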


■ Notice that in our choice of identifiers, we try to differentiate
between local and global variables.
■ Local variables are variables whose contents are significant only
on the process that’s using them.
■ Some examples from the trapezoidal rule program are local_a,
local_b, and local_n.
■Variables whose contents are significant to all the
processes are sometimes called global variables.
■ Some examples from the trapezoidal rule are a, b, and n.
■Note that this usage is different from the usage you
learned in your introductory programming class, where local
variables are private to a single function and global variables are
accessible to all the functions.

Copyright © 2010, Elsevier Inc. All rights Reserved 58


First version
Let’s defer, for the moment, the issue of input and just “hardwire”
the values for a, b, and n.

Copyright © 2010, Elsevier Inc. All rights Reserved 59


First version (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 60


First version (3)
The Trap function is just an implementation of the serial trapezoidal
rule.

Copyright © 2010, Elsevier Inc. All rights Reserved 61
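
The Trap function is shown as an image; a sketch along those lines, again assuming the integrand f is defined elsewhere:

/* Trap: serial trapezoidal rule on [left_endpt, right_endpt] with
   trap_count trapezoids, each of width base_len. */
double Trap(double left_endpt, double right_endpt, int trap_count, double base_len) {
    double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;
    for (int i = 1; i <= trap_count - 1; i++) {
        double x = left_endpt + i * base_len;
        estimate += f(x);
    }
    return estimate * base_len;
}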


Dealing with I/O
■ Of course, the current version of the parallel trapezoidal rule
has a serious deficiency:
■ it will only compute the integral over the interval [0, 3] using
1024 trapezoids.
■ We can edit the code and recompile, but this is quite a bit of
work compared to simply typing in three new numbers.
■ We need to address the problem of getting input from the user.

■ While we’re talking about input to parallel programs, it might
be a good idea to also take a look at output.

Copyright © 2010, Elsevier Inc. All rights Reserved 62


3.3.1 Output

■ In both the “greetings” program and the trapezoidal rule
program, we’ve assumed that process 0 can write to stdout,
i.e., its calls to printf behave as we might expect.

■ Although the MPI standard doesn’t specify which processes
have access to which I/O devices, virtually all MPI
implementations allow all the processes in
MPI_COMM_WORLD full access to stdout and stderr.

Copyright © 2010, Elsevier Inc. All rights Reserved 63


3.3.1 Output
■So most MPI implementations allow all processes
to execute printf and fprintf(stderr, ...) .
■However, most MPI implementations don’t
provide any automatic scheduling of access to
these devices.
■That is, if multiple processes are attempting to write
to, say, stdout, the order in which the processes’
output appears will be unpredictable.
■ Indeed, it can even happen that the output of one
process will be interrupted by the output of another
process.
Copyright © 2010, Elsevier Inc. All rights Reserved 64
Running with 6 processes
• For example, suppose we try to run an MPI program in which each process
simply prints a message. (See Program 3.4.)
• On our cluster, if we run the program with five processes, it often produces the
“expected” output, with the lines in process-rank order.
• However, when we run it with six processes, the order of the output lines is
unpredictable.

Copyright © 2010, Elsevier Inc. All rights Reserved 65


Program 3.4: Each process just prints a message.

Copyright © 2010, Elsevier Inc. All rights Reserved 66
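
Program 3.4 itself does not survive the slide export; a sketch along those lines (the message text is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int my_rank, comm_sz;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Every process writes one line to stdout; with several processes
       the order of the lines is unpredictable. */
    printf("Proc %d of %d > Hello from the output test.\n", my_rank, comm_sz);

    MPI_Finalize();
    return 0;
}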


3.3.2 Input
■Unlike output, most MPI implementations
only allow process 0 in
MPI_COMM_WORLD access to stdin.

■This makes sense: If multiple processes have
access to stdin, which process should get
which parts of the input data?
■Should process 0 get the first line?
■Should process 1 get the second? Or should process 0 get the first
character?

Copyright © 2010, Elsevier Inc. All rights Reserved 67


3.3.2 Input
■So, to write MPI programs that can use scanf, we need
to branch on process rank, with process 0 reading in the
data, and then sending it to the other processes.
■For example, we might write the Get_input function shown
in Program 3.5 for our parallel trapezoidal rule program.
■In this function, process 0 simply reads in the values for a, b,
and n, and sends all three values to each process.
■So this function uses the same basic communication
structure as the “greetings” program, except that now
process 0 is sending to each process, while the other
processes are receiving.

Copyright © 2010, Elsevier Inc. All rights Reserved 68


Input
■ To use this function, we can simply insert a call to it inside our
main function,
■ being careful to put it after we’ve initialized my_rank and
comm_sz: Most MPI implementations only allow process
0 in MPI_COMM_WORLD access to stdin.
■ Process 0 must read the data (scanf) and send to the
other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 69


Program 3.5: Function for reading user input

Copyright © 2010, Elsevier Inc. All rights Reserved 70
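
Program 3.5 appears only as an image; the sketch below follows the description in the text (process 0 reads a, b, and n with scanf and sends all three values to every other process), assuming <stdio.h> and <mpi.h> are included.

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
        for (int dest = 1; dest < comm_sz; dest++) {
            MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            MPI_Send(n_p, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(n_p, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}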


COLLECTIVE
COMMUNICATION

Copyright © 2010, Elsevier Inc. All rights Reserved 71


Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and
7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to
processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their
new values.

2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest
value.

Copyright © 2010, Elsevier Inc. All rights Reserved 72


A tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 73


An alternative tree-structured
global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 74


MPI_Reduce

Copyright © 2010, Elsevier Inc. All rights Reserved 75
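
The MPI_Reduce prototype on this slide is an image; it is reproduced here with the argument names used in the discussion that follows:

int MPI_Reduce(
    void*         input_data_p,  /* in:  each process's contribution      */
    void*         output_data_p, /* out: result, significant only on dest */
    int           count,         /* in:  number of elements               */
    MPI_Datatype  datatype,      /* in:  type of the elements             */
    MPI_Op        operator,      /* in:  e.g. MPI_SUM                     */
    int           dest_process,  /* in:  rank that receives the result    */
    MPI_Comm      comm           /* in:  communicator                     */
);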


Predefined reduction operators
in MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 76
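
The table is an image; the standard predefined reduction operators include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, and MPI_MINLOC.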


Collective vs. Point-to-Point
Communications
■All the processes in the communicator
must call the same collective function.

■For example, a program that attempts to


match a call to MPI_Reduce on one
process with a call to MPI_Recv on
another process is erroneous, and, in all
likelihood, the program will hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 77


MPI Reduce

A gather operation combined with a specified arithmetic/logical operation.

Example: values could be gathered and then added together by the root
(every process calls MPI_Reduce).

As usual, the same routine is called by each process, with the same parameters.
Collective vs. Point-to-Point
Communications
■The arguments passed by each process to
an MPI collective communication must be
“compatible.”

■For example, if one process passes in 0
as the dest_process and another passes
in 1, then the outcome of a call to
MPI_Reduce is erroneous, and, once
again, the program is likely to hang or
crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 79
Implementation of reduction using a tree
construction

■ With P processes, a tree-structured reduction takes O(log2 P) communication steps.
■ (Figure: the values 14, 39, 53, 120, 66, 29 on P0–P5 are summed pairwise up a tree,
giving partial sums 53, 173, and 95, then 226, and finally the total 321.)
Collective vs. Point-to-Point
Communications
■The output_data_p argument is only used
on dest_process.

■However, all of the processes still need to


pass in an actual argument corresponding
to output_data_p, even if it’s just NULL.

Copyright © 2010, Elsevier Inc. All rights Reserved 81


Collective vs. Point-to-Point
Communications
■Point-to-point communications are
matched on the basis of tags and
communicators.

■Collective communications don’t use tags.


■They’re matched solely on the basis of the
communicator and the order in which
they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved 82


Example (1)

Multiple calls to MPI_Reduce

Copyright © 2010, Elsevier Inc. All rights Reserved 83
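
The table on this slide is an image; a reconstruction of the scenario that the next two slides analyze, assuming every process starts with a = 1 and c = 2, and all calls use MPI_SUM with destination process 0:

Time 0: Process 0 calls MPI_Reduce(&a, &b, ...); Process 1 calls MPI_Reduce(&c, &d, ...); Process 2 calls MPI_Reduce(&a, &b, ...)
Time 1: Process 0 calls MPI_Reduce(&c, &d, ...); Process 1 calls MPI_Reduce(&a, &b, ...); Process 2 calls MPI_Reduce(&c, &d, ...)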


Example (2)
■Suppose that each process calls
MPI_Reduce with operator MPI_SUM, and
destination process 0.

■At first glance, it might seem that after the


two calls to MPI_Reduce, the value of b
will be 3, and the value of d will be 6.

Copyright © 2010, Elsevier Inc. All rights Reserved 84


Example (3)
■However, the names of the memory
locations are irrelevant to the matching
of the calls to MPI_Reduce.
■The order of the calls will determine the
matching, so the value stored in b will be
1+2+1 = 4, and the value stored in d will be
2+1+2 = 5.

Copyright © 2010, Elsevier Inc. All rights Reserved 85


MPI_Allreduce
■Useful in a situation in which
all of the processes need the
result of a global sum in
order to complete some
larger computation.

Copyright © 2010, Elsevier Inc. All rights Reserved 86


Reduction to All
■ int MPI_Allreduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
■ All the processes contribute data, the specified operation is applied, and every
process in the communicator receives the result
■ MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and
a few more
■ MPI_Op_create(): User defined operator

■ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Allreduce: every process has A+B+C+D.
A global sum followed
by distribution of the
result.

Copyright © 2010, Elsevier Inc. All rights Reserved 88


A butterfly-structured global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 89


Broadcast pattern
Sends the same data to each of a group of processes.

A common pattern for getting the same data to all processes, especially at the
beginning of a computation: the same data is sent from one source to all destinations.

Note:
•Patterns given do not mean the implementation does them as shown. Only the
final result is the same in any parallel implementation.
•Patterns do not describe the implementation.
Broadcast
■Data belonging to a
single process is sent
to all of the
processes in the
communicator.

Copyright © 2010, Elsevier Inc. All rights Reserved 91


MPI broadcast operation
Sending the same message to all processes in the communicator.

Notice the same routine is called by each process, with the same parameters.


MPI processes usually execute the same program so this is a handy construction.
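
The MPI_Bcast prototype is shown as an image; the standard one is:

int MPI_Bcast(
    void*         data_p,      /* in/out: data to broadcast (on the source) or buffer to fill */
    int           count,       /* in: number of elements                                      */
    MPI_Datatype  datatype,    /* in: type of the elements                                    */
    int           source_proc, /* in: rank of the broadcasting process                        */
    MPI_Comm      comm         /* in: communicator                                            */
);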
A tree-structured broadcast.

Copyright © 2010, Elsevier Inc. All rights Reserved 93


A version of Get_input that uses
MPI_Bcast

Copyright © 2010, Elsevier Inc. All rights Reserved 94
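
The code on this slide is an image; a sketch of the broadcast-based version, assuming <stdio.h> and <mpi.h>:

/* Get_input using MPI_Bcast: process 0 reads the data, then every process
   (including 0) takes part in the broadcasts. comm_sz is kept for symmetry
   with the earlier point-to-point version. */
void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
    if (my_rank == 0) {
        printf("Enter a, b, and n\n");
        scanf("%lf %lf %d", a_p, b_p, n_p);
    }
    MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(n_p, 1, MPI_INT, 0, MPI_COMM_WORLD);
}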


3.4.6: Data distributions

Compute a vector sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 95


Serial implementation of vector addition

• The work consists of adding the individual components of the
vectors, so we might specify that the tasks are just the
additions of corresponding components.
• Then there is no communication between the tasks, and the
problem of parallelizing vector addition boils down to
aggregating the tasks and assigning them to the cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 96
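
The serial routine is shown as an image; a sketch of it:

/* Serial vector addition: z = x + y, all vectors of length n. */
void Vector_sum(double x[], double y[], double z[], int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}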


Different partitions of a 12-component
vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 97


Partitioning options
■Block partitioning
■Assign blocks of consecutive components to
each process.
■Cyclic partitioning
■Assign components in a round robin fashion.
■Block-cyclic partitioning
■Use a cyclic distribution of blocks of
components.
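
For a 12-component vector and 3 processes, for example, process 0 would get
components 0–3 under a block partition, components 0, 3, 6, 9 under a cyclic
partition, and components 0, 1, 6, 7 under a block-cyclic partition with blocksize 2
(this reconstructs the idea behind the figure two slides back).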

Copyright © 2010, Elsevier Inc. All rights Reserved 98


Parallel implementation of
vector addition

Copyright © 2010, Elsevier Inc. All rights Reserved 99
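
The parallel routine is shown as an image; a sketch of it:

/* Parallel vector addition: each process adds its own block of local_n
   components; no communication is needed. */
void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
    for (int local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}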


3.4.7: Scatter
■ Now suppose we want to test our vector addition function.
■ It would be convenient to be able to read the dimension of the
vectors and then read in the vectors x and y.
■ We already know how to read in the dimension of the vectors:
process 0 can prompt the user, read in the value, and broadcast
the value to the other processes.
■ We might try something similar with the vectors: process 0
could read them in and broadcast them to the other processes
■ However, this could be very wasteful. If there are 10 processes
and the vectors have 10,000 components, then each process
will need to allocate storage for vectors with 10,000
components, when it is only operating on subvectors with 1000
components

Copyright © 2010, Elsevier Inc. All rights Reserved 100


3.4.7: Scatter
■MPI_Scatter can be used in a function that
reads in an entire vector on process 0 but
only sends the needed components to
each of the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 101
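
The MPI_Scatter prototype is shown as an image; the standard one is (parameter names are just labels):

int MPI_Scatter(
    void*         send_buf_p,  /* in:  the whole array (significant only at src_proc) */
    int           send_count,  /* in:  number of elements sent to EACH process        */
    MPI_Datatype  send_type,   /* in:  type of the elements sent                      */
    void*         recv_buf_p,  /* out: this process's block                           */
    int           recv_count,  /* in:  number of elements received by each process    */
    MPI_Datatype  recv_type,   /* in:  type of the elements received                  */
    int           src_proc,    /* in:  rank that holds the full array                 */
    MPI_Comm      comm         /* in:  communicator                                   */
);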


Scatter Pattern
Distributes a collection of data items to a group of processes.

A common pattern for getting data to all processes: different data is sent from
the source to each of the destinations.

Usually the data sent are parts of an array.
Scatter Pattern

MPI_Bcast takes a single data element at the root process (the red box)
and copies it to all other processes.

MPI_Scatter takes an array of elements and distributes the elements in the
order of process rank: the first element (in red) goes to process zero, the
second element (in green) goes to process one, and so on.
Basic MPI scatter operation
Sending one or more contiguous elements of an array in the root process to a
separate process.

Notice the same routine is called by each process, with the same parameters.
MPI processes usually execute the same program, so this is a handy construction.
Reading and distributing a vector

Copyright © 2010, Elsevier Inc. All rights Reserved 105
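
The Read_vector function appears only as an image; a sketch following the description in the text, assuming <stdio.h>, <stdlib.h>, and <mpi.h>, and that comm_sz evenly divides n:

void Read_vector(double local_a[], int local_n, int n, char vec_name[],
                 int my_rank, MPI_Comm comm) {
    double* a = NULL;

    if (my_rank == 0) {
        a = malloc(n * sizeof(double));
        printf("Enter the vector %s\n", vec_name);
        for (int i = 0; i < n; i++)
            scanf("%lf", &a[i]);
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
        free(a);
    } else {
        /* The send buffer argument is ignored on non-root processes. */
        MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
    }
}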


3.4.8: Gather
■ Of course, our test program will be useless unless we can see
the result of our vector addition.
■ So we need to write a function for printing out a distributed
vector.
■ Our function can collect all of the components of the vector
onto process 0, and then process 0 can print all of the
components.
■ The communication in this function can be carried out by
MPI_Gather:

Copyright © 2010, Elsevier Inc. All rights Reserved 106


3.4.8: Gather
■ Collect all of the components of the vector onto process
0, and then process 0 can process all of the components.
■ Note that recv_count is the number of data items received
from each process, not the total number of data items received.

Copyright © 2010, Elsevier Inc. All rights Reserved 107


Gather Pattern

Essentially the reverse of a scatter. It receives data items from a group of
processes.

A common pattern, especially at the end of a computation, to collect results:
data from each of the sources is collected in an array at the destination.
MPI Gather

Having one process collect individual values from a set of processes
(including itself).

As usual, the same routine is called by each process, with the same parameters.


Gather Pattern
• int MPI_Gather(void *sendbuf, int sendcnt,
MPI_Datatype sendtype, void *recvbuf,
int recvcnt, MPI_Datatype recvtype,
int root, MPI_Comm comm)
▪ One process (root) collects data from all the other processes in the same
communicator
▪ Must be called by all the processes with the same arguments

▪ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Gather with root P1: P1 has A B C D; the receive buffers of the
other processes are not used.
Print a distributed vector (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 111


Print a distributed vector (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 112
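
The Print_vector function appears only as an image; a sketch following the description in the text, assuming <stdio.h>, <stdlib.h>, and <mpi.h>, and that comm_sz * local_n == n:

void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
    double* b = NULL;

    if (my_rank == 0) {
        b = malloc(n * sizeof(double));
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
        printf("%s\n", title);
        for (int i = 0; i < n; i++)
            printf("%f ", b[i]);
        printf("\n");
        free(b);
    } else {
        /* The receive buffer argument is ignored on non-root processes. */
        MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
    }
}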


3.4.9: Allgather

■As a final example, let’s look at how we
might write an MPI function that multiplies
a matrix by a vector.

Copyright © 2010, Elsevier Inc. All rights Reserved 113


Matrix-vector multiplication

■ The i-th component of y is the dot product of the i-th row of A with x.

Copyright © 2010, Elsevier Inc. All rights Reserved 114
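
In symbols, for an m x n matrix A and an n-component vector x:

y_i = a_{i0} x_0 + a_{i1} x_1 + ... + a_{i,n-1} x_{n-1},   for i = 0, 1, ..., m-1.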


Matrix-vector multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved 115


Multiply a matrix by a vector

Serial pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 116


C style arrays

■ However, there are some peculiarities in the way that C programs deal with
two-dimensional arrays.
■ A 2D array would be stored as a one-dimensional array, one row after another
(row-major order).

Copyright © 2010, Elsevier Inc. All rights Reserved 117


■More generally, if our array has n
columns,
■and we use this scheme,
■we see that the element stored in row i and
column j is located in position i × n + j in the
one-dimensional array.

Copyright © 2010, Elsevier Inc. All rights Reserved 118


Program 3.11: Serial matrix-vector multiplication.

Copyright © 2010, Elsevier Inc. All rights Reserved 119
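
Program 3.11 appears only as an image; a sketch of the serial routine it describes:

/* Serial matrix-vector multiplication: A is m x n, stored row-major in a
   one-dimensional array; x has n components; y gets m components. */
void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
    for (int i = 0; i < m; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
    }
}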


■ Now let’s see how we might parallelize this function.
■ An individual task can be the multiplication of an element of A by a
component of x and the addition of this product into a component of y.

■ So we see that if y[i] is assigned to process q, then it would be convenient to


also assign row i of A to process q.
■ This suggests that we partition A by rows.
■ We could partition the rows using a block distribution, a cyclic distribution, or
a block-cyclic distribution.
■ In MPI it’s easiest to use a block distribution.
■ So let’s use a block distribution of the rows of A, and, as usual, assume that
comm_sz evenly divides m, the number of rows.

Copyright © 2010, Elsevier Inc. All rights Reserved 120


■ We are distributing A by rows so that the computation of y[i] will have
all of the needed elements of A, so we should distribute y by blocks.
■ That is, if the ith row of A is assigned to process q, then the ith
component of y should also be assigned to process q.
■ Now the computation of y[i] involves all the elements in the ith row of
A and all the components of x.
■ So we could minimize the amount of communication by simply
assigning all of x to each process.
■ However, in actual applications, especially when the matrix is square,
it’s often the case that a program using matrix-vector multiplication
will execute the multiplication many times, and the result vector y
from one multiplication will be the input vector x for the next iteration.
■ In practice, then, we usually assume that the distribution for x is the
same as the distribution for y.

Copyright © 2010, Elsevier Inc. All rights Reserved 121


■ So if x has a block distribution, how can we arrange that each
process has access to all the components of x before we execute
the matrix-vector multiplication loop?
■ Using the collective communications we’re already familiar with,
we could execute a call to MPI_Gather, followed by a call to
MPI_Bcast.
■ This would, in all likelihood, involve two tree-structured
communications, and we may be able to do better by using a
butterfly.
■ So, once again, MPI provides a single function:

Copyright © 2010, Elsevier Inc. All rights Reserved 122


Allgather
■Concatenates the contents of each
process’ send_buf_p and stores this in
each process’ recv_buf_p.
■As usual, recv_count is the amount of data
being received from each process.

Copyright © 2010, Elsevier Inc. All rights Reserved 123


Gather to All
• int MPI_Allgather( void *sendbuf, int sendcnt,
MPI_Datatype sendtype,
void *recvbuf, int recvcnt,
MPI_Datatype recvtype,
MPI_Comm comm )
▪ All the processes collect data from all the other processes in the same
communicator
▪ Must be called by all the processes with the same arguments

▪ Before: P1 has A, P2 has B, P3 has C, P4 has D.
After MPI_Allgather: every process has A B C D.
An MPI matrix-vector multiplication function (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 125


An MPI matrix-vector multiplication function (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 126
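
The parallel function appears only as an image; a sketch following the description in the text, assuming <stdlib.h> and <mpi.h>. A and y are block-distributed by rows (local_m rows per process) and x is block-distributed (local_n components per process); MPI_Allgather gives every process a full copy of x.

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
    double* x = malloc(n * sizeof(double));

    /* Collect all the blocks of x onto every process. */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    for (int local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[local_i] += local_A[local_i*n + j] * x[j];
    }

    free(x);
}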


Concluding Remarks (1)
■MPI or the Message-Passing Interface is a
library of functions that can be called from
C, C++, or Fortran programs.
■A communicator is a collection of
processes that can send messages to
each other.
■Many parallel programs use the single-
program multiple data or SPMD approach.

Copyright © 2010, Elsevier Inc. All rights Reserved 127


Concluding Remarks (2)
■Most serial programs are deterministic: if
we run the same program with the same
input we’ll get the same output.
■Parallel programs often don’t possess this
property.
■Collective communications involve all the
processes in a communicator.

Copyright © 2010, Elsevier Inc. All rights Reserved 128
