
Today many developers are harnessing the power of multiple computers
to solve CPU-intensive problems in scientific research, animation, and
rendering. In particular, Apple's Xgrid makes it easier than
ever to collect ad hoc resources into a personal supercomputer. Xgrid
is especially well suited for "embarrassingly parallel" problems, which
run the exact same code many times using different inputs (e.g.,
different random numbers for a Monte Carlo simulation, or different
input files for finite element analysis).
But, what if your problem
can't easily be split up into numerous independent chunks? What if, in
fact, the various processors need to be able to coordinate with each
other and exchange data during a calculation?
What is MPI?
The solution is MPI, the Message Passing Interface. Developed over the
last several decades as an alternative to shared memory architectures
(such as OpenMP), MPI enables developers to efficiently program "tightly
coupled" algorithms which require nodes to communicate during the course
of a computation.
MPI consists of a standard set of API calls, usually implemented as a
platform-independent communications library, that manage all aspects of
inter-node communication and data transfer. MPI provides an abstraction
that works with any networking infrastructure (e.g., Ethernet, Myrinet,
InfiniBand), while being flexible and consistent enough for developers
to port and run their code on any parallel platform with only a
recompile. Different implementations can be run either standalone, or
embedded within scheduling solutions like Xgrid. And while Xgrid
supports MPI applications, implementing such applications takes some careful planning, so this article helps you get familiar with your options and what
each can provide.
The Message Passing Interface on Mac OS X
Why MPI on Mac OS X? Mac OS X is rapidly becoming the platform of
choice for technical computing, because it combines industry-standard
open source software with cutting-edge productivity applications. Not
only is Mac OS X the best UNIX-based solution for laptops and desktops,
but Mac OS X Server makes it easy for anyone to manage a complex array
of open source and Java-based services. Even better, every Mac that Apple ships
today—from the tiny Mac mini to the high-density dual-processor
Xserve G5—contains optimized math libraries for FFTs, BLAS, LAPACK
and other codes scientists use every day. This means that, out of the
box, your Mac is already pre-loaded with most of the tools you need to
get your work done.
This article is designed to help Mac OS X developers understand how
MPI can help them port and write tightly coupled, distributed algorithms
to run on multiple computers, as well as help MPI developers from other
platforms understand what's unique about Mac OS X. We will start with a
review the various implementations available on Mac OS X, and the hardware interconnects available, then cover the
MPI API in some detail, and also discuss issues you might face when
working with MPI on Mac OS X. We'll end with next steps to get you started.
If you are new to distributed computing, you might want to first
familiarize yourself with the basic concepts by learning about Xgrid. To
start with, read the article, Xgrid:
High-performance Computing for the Rest of Us, which explains how
Apple's Advanced Computational Group developed Xgrid to make it easy to
run a set of calculations on many machines using machine-dependent
parameters. With Xgrid, developers can stay focused on the science and
mathematics and not be distracted by having to set up a network of
computers. Xgrid is included with Mac OS X 10.4 Tiger; you can download the Xgrid Technology
Preview 2 for the Panther version of Mac OS X.
Implementations of MPI on Mac OS X
Here are some of the most popular implementations of MPI on Mac OS X.
MPICH (commonly pronounced "emm-pitch") is perhaps the oldest
implementation of MPI. In fact, portions of its code were created in
the earliest days of UNIX, 30 years ago. MPICH, hosted at Argonne
National Laboratory, was used as a reference standard for MPI, resulting
in its being ported to almost every parallel computing platform
imaginable.
Its default setting on Mac OS X uses TCP/IP over UNIX
sockets for communications, as is commonly done on Ethernet networks.
Being open source, it has numerous offshoots, often ported by hardware
vendors—such as Myrinet (MPICH-GM) or makers of InfiniBand interfaces
(MVAPICH)—to utilize their interconnect technology (which may not
require or even support TCP/IP).
Perhaps due to its long history of adaptation, it is not always the
best performer or particularly reliable. Its makers recognize these
issues, resulting in MPICH2 (currently in version 1.0), a rewritten and
revamped MPI implementation. MPICH2 is just beginning its portability
career, with MVAPICH2 for InfiniBand being its most prominent offshoot
so far.
LAM/MPI ("lamb-em-pee-eye" or simply "lamb") is the next most
prominent open-source implementation of MPI ported to Mac OS X. Hosted
by the Pervasive Technology Labs at Indiana University, it has a unique
implementation that provides its own run-time environment, called LAM
daemons, for services such as process control, I/O redirection, and
optionally asynchronous message passing.
The parallel application relies on these LAM daemons for execution and
communication. While LAM does not enjoy the broad base of contributors
that MPICH does, it is arguably the best performing MPI implementation
in open source. Its default distribution for Mac OS X is even provided
as a native installer package, rather than a UNIX tar archive like most
open source.
MacMPI ("Mac-em-pee-eye"), the first MPI implementation on the
Macintosh platform, was created in 1998 at UCLA's Plasma Physics Group.
Today's Mac OS X incarnations use TCP/IP and are launched via the Pooch application
from Dauger Research, a commercial spinoff of the UCLA physics group.
Supporting the most commonly used subset of MPI, MacMPI is unique in
that it displays visual diagnostics about the ongoing MPI
job—useful both for debugging and optimizing MPI code.
An offshoot of MacMPI, LnxMPI_S, which uses BSD sockets and forgoes
the visual diagnostics, can be run as a Darwin background process; LnxMPI_S is
the first implementation of MPI supported by Apple's Xgrid. While most other MPI implementations
require installation into the OS, MacMPI is meant to be simpler in that
one integrates this source code library into the user's parallel code's
source code base. Such executables can then be easily launched onto a
cluster using, e.g., Pooch.
Since Mac OS X was introduced in 2001, other MPI implementations have
appeared on the platform. MPI/Pro from MPI Software Technology is a
commercial implementation whose performance rivals the best of open
source. OpenMPI is an emerging open source collaboration (including
LAM/MPI and several hardware vendors) intended to enable software
developers, hardware vendors, system administrators, and scientific
researchers to contribute best-of-breed functionality in their
respective areas into a common environment. Other MPI implementations
continuing to emerge, such as ProMPI and W-MPI.
Interconnects
MPI is designed to provide an abstract layer that hardware vendors and
software implementors support and that parallel computing users
use. It therefore is possible for the same parallel
application to remain unchanged, at least at a source-code level, and
execute correctly on a wide range of parallel computing hardware and
software. These choices carry forth to Mac OS X.
The use of TCP/IP in clusters is so widespread that it's easy to think
of IP-based networking and MPI as synonymous, but they are not. In
some cases, there are good reasons not to use TCP/IP. A prime limitation
that constrains the size of a cluster is network latency; that is, the time
it takes for a one-byte message from one MPI task in a node to reach
another. The software overhead of the TCP/IP stack is often a large
fraction of that latency, so vendors of hardware that dramatically
improve on that latency usually write custom interfaces called directly
by the MPI implementation.
Two prominent examples of such hardware are
Myrinet and InfiniBand. Myricom has shipped Myrinet hardware since 1994
for supercomputer and cluster vendors. Using fiber or copper,
Myrinet hardware can achieve microsecond latency and multiple Gigabit
bandwidth. The first hardware using InfiniBand, a non-proprietary
interconnect standard, emerged in 2001. Capable of using either
copper or fiber, this interconnect is
derived from motherboard bus technology. However, the switches in
particular are quite expensive, so they are practical primarily for very
large clusters with at least several hundred nodes.
Calling the MPI API
The fundamental purpose of the MPI library is to provide a means to pass
messages between executables running on different processors, typically
on different machines. These executables need not be identical or exit
simultaneously, though they typically are. Each of these executables
use MPI calls to share information with each other. The three main
types of MPI calls are:
- "boilerplate" calls
- message passing calls, and
- collective calls.
Here are the calls and how to use each, by category.
"Boilerplate" Calls
Virtually every MPI code calls these routines because they set up
the MPI environment and retrieve sufficient information for each process
of a parallel executable to identify what part of the problem
to work on.
- MPI_Init & MPI_Finalize. All instances of the code must
call MPI_Init before communicating, allowing the MPI implementation to
prepare the communications environment, and MPI_Finalize, for the
corresponding teardown, before exiting.
- MPI_Comm_size & MPI_Comm_rank. After MPI_Init, the code
may learn about its environment by calling MPI_Comm_size and
MPI_Comm_rank. The former reports the number (N) of processes that are
communicating via MPI and the latter identifies the calling process with
an integer (0 through N-1).
Message Passing Calls
These are the fundamental calls of MPI that perform the most
primitive message passing. Mastering these is often sufficient for many
complex scientific codes.
- MPI_Send & MPI_Recv. Passing a message between processes
consists of calls to MPI_Send and MPI_Recv. The former call sends a
message to another process, while the latter receives that message. The
origin and destination of the message are memory-based. These MPI calls
require a reference to a memory buffer and its size and are blocking,
returning only when the memory buffer is available for the parallel
executable's use again.
- MPI_Isend & MPI_Irecv. These are the non-blocking, or
asynchronous, calls for passing messages, making it possible to perform
work while message passing is occurring or communicate with multiple
tasks simultaneously.
- MPI_Wait & MPI_Test. These balance the above asynchronous
calls. MPI_Wait blocks until completion, while MPI_Test returns true
the first time after its corresponding operation completes.
Collective Calls
MPI provides useful collective calls to coordinate data between many processes.
- MPI_Bcast. Performs a simple broadcast of data from any one process
to all the others.
- MPI_Reduce. Computes the sum, maximum, or minimum on values spread
across the parallel system, hence "reducing" the data.
- MPI_Gather & MPI_Scatter. The former collects values from an array
spread across processes into one large array on one process, while the
latter performs the opposite operation.
- MPI_Allreduce & MPI_Allgather. Cousins of MPI_Gather and MPI_Scatter
that spread its results to all processes.
- MPI_Alltoall. This call essentially performs a transpose on a matrix
spread across the processes.
Many of these have vector variants (indicated with a v suffix), which
allows a variable amount of data sent to or received by each process.
Example: Pascal's Triangle
To see how these work together, let us consider a simple,
tightly coupled problem known as Pascal's Triangle. Numbers generated
by this algorithm play an important role in
probability and network theory.
For an excellent tutorial on converting
single processor code to single-processor propagation-style code into a
parallel code, see
Parallelization: Parallel Pascal's Triangle by Dauger Research, Inc.
In the Pascal's Triangle example, calculating the next number requires knowing two previous numbers.
On a single processor, this is simple to do, but can end up taking a
very long time; see Figure 1.
Figure 1: Generating Numbers on a Single Processor.
To speed things up, we can partition this problem to run
on three different processors; since the computation is "volume-like"
(O(n)) while the communication is "area-like" (O(1)) this should be a
net win. But how do we implement this?
The answer is that we replace array lookups with MPI calls.
Whenever a new number is generated that would be needed by a
different process, the relevant CPU issues an MPI_Send message, which is
caught by the MPI_Recv message issued by the CPU that needs that value, as shown in Figure 2.
Figure 2: Generating Numbers on Multiple Processors.
Once the necessary values are available locally, the triangular
calculation proceeds as before. Similarly, if you want to get the total
value of a given row of the Triangle, you can use the collective calls
discussed above (which may be more efficient, depending on your
interconnect) rather than MPI_Send/Recv.
Of course, for a long-running calculation you'd want a more efficient
algorithm that grew all three regions proportionately, rather than
always having three elements in the middle tier, but this illustrates
the basic ideas.
Issues Using MPI on Mac OS X
In general, using MPI on Mac OS X is identical to using MPI on other
platforms, and similar to running other UNIX programs on Mac OS X.
However, there are a few areas requiring special consideration:
Launching MPI on OS X
For both the CLI- and GUI-based mechanisms, an MPI job is usually
launched by copying the executable to all nodes that will be involved in
the system and executing them there (via Xgrid, Apple Remote Desktop, a
specialized tool [e.g., Pooch], or even just ssh) Once they begin, the
launch mechanisms provide the executables data about the environment.
For TCP/IP-based MPI implementations, this data consists of IP network
addresses. For specialized hardware such as Myrinet and InfiniBand,
they provide analogous data that identifies the interconnects. With
this data, the MPI implementation can set up the N*(N-1)/2 connections
between the N processes on the cluster, usually triggered when the code
calls MPI_Init. From here, the parallel job takes control.
Networking
One of the most critical steps in launching an MPI job is setting up
those interprocess connections. Any web browser user recognizes the
frustration of a failed network connection. In an MPI job, that issue
is multiplied by dozens or hundreds of network connections. A problem
could occur because of an incorrectly specified address, to which many
command-line based MPI implementations are susceptible because they
often rely on static files containing address lists. Sometimes a simple
refused network query, even due to a temporary issue, could prevent a
connection from being made. Some implementations are more tolerant of
such faults than others. Naturally, a firewall or other built-in
network restriction will also prevent any unanticipated communication.
Deadlocks
With the MPI job's successful start, the parallel code can take
over, but as with any programming project, debugging becomes an
important skill.
The first bug that the typical student of MPI
encounters is a deadlock. A deadlock can occur when processes expect
messages from each other in a way that is impossible to complete. For
example, a beginning MPI code may have process 0 call MPI_Send send a
message and have process 1 call MPI_Recv to receive that message, and
all is well.
But the novice might try to have these two processes
exchange messages simultaneously by having both call MPI_Send first,
then MPI_Recv. They will deadlock because they will both wait forever
on their sends because neither has called a receive to accept the
messages.
The fix is for each to call MPI_Irecv, then MPI_Send, then
MPI_Wait on the MPI_Request value returned by the MPI_Irecv. This way,
both processes post a receive, ready for the other process' MPI_Send.
The MPI_Wait completes the receive. In general, deadlocks can be
substantially more complicated, but the principle is the same.
Optimization
Because of the nature of computation and communications technology,
optimization of parallel code using MPI often involves designing your
code to make most efficient use of the communications hardware. From
the outset of the code, it is best to organize the problem to be solved
so that it reduces or consolidates the interprocess communication as
much as possible.
For example, a code may have many small amounts of data for transmission
to another node. Sending this data using many small messages is a less
efficient use of the network because there is a constant overhead,
consisting of handshaking, error correction, and the like, for each data
packet sent over the network.
Buffering these small data structures together into a large message,
when possible, allows the network to communicate the data most
efficiently and approach the peak bandwidth of the hardware. On typical
Ethernet hardware, messages greater than 16 kB in size approach this
optimal scenario. Knowing this, a parallel code writer can make more
efficient use of the hardware, and the techniques one learns on small
clusters carry to large supercomputers as well.
Next Steps
Now that you understand the basics, it is time to dig in and get started. Two easy projects to get your feet wet are:
Updated: 2005-03-09
|