Apple Developer Connection
Advanced Search
Member Login Log In | Not a Member? Contact ADC

Introduction to MPI Distributed Programming With Mac OS X

Today many developers are harnessing the power of multiple computers to solve CPU-intensive problems in scientific research, animation, and rendering. In particular, Apple's Xgrid makes it easier than ever to collect ad hoc resources into a personal supercomputer. Xgrid is especially well suited for "embarrassingly parallel" problems, which run the exact same code many times using different inputs (e.g., different random numbers for a Monte Carlo simulation, or different input files for finite element analysis).

But, what if your problem can't easily be split up into numerous independent chunks? What if, in fact, the various processors need to be able to coordinate with each other and exchange data during a calculation?

What is MPI?

The solution is MPI, the Message Passing Interface. Developed over the last several decades as an alternative to shared memory architectures (such as OpenMP), MPI enables developers to efficiently program "tightly coupled" algorithms which require nodes to communicate during the course of a computation.

MPI consists of a standard set of API calls, usually implemented as a platform-independent communications library, that manage all aspects of inter-node communication and data transfer. MPI provides an abstraction that works with any networking infrastructure (e.g., Ethernet, Myrinet, InfiniBand), while being flexible and consistent enough for developers to port and run their code on any parallel platform with only a recompile. Different implementations can be run either standalone, or embedded within scheduling solutions like Xgrid. And while Xgrid supports MPI applications, implementing such applications takes some careful planning, so this article helps you get familiar with your options and what each can provide.

The Message Passing Interface on Mac OS X

Why MPI on Mac OS X? Mac OS X is rapidly becoming the platform of choice for technical computing, because it combines industry-standard open source software with cutting-edge productivity applications. Not only is Mac OS X the best UNIX-based solution for laptops and desktops, but Mac OS X Server makes it easy for anyone to manage a complex array of open source and Java-based services. Even better, every Mac that Apple ships today—from the tiny Mac mini to the high-density dual-processor Xserve G5—contains optimized math libraries for FFTs, BLAS, LAPACK and other codes scientists use every day. This means that, out of the box, your Mac is already pre-loaded with most of the tools you need to get your work done.

This article is designed to help Mac OS X developers understand how MPI can help them port and write tightly coupled, distributed algorithms to run on multiple computers, as well as help MPI developers from other platforms understand what's unique about Mac OS X. We will start with a review the various implementations available on Mac OS X, and the hardware interconnects available, then cover the MPI API in some detail, and also discuss issues you might face when working with MPI on Mac OS X. We'll end with next steps to get you started.

If you are new to distributed computing, you might want to first familiarize yourself with the basic concepts by learning about Xgrid. To start with, read the article, Xgrid: High-performance Computing for the Rest of Us, which explains how Apple's Advanced Computational Group developed Xgrid to make it easy to run a set of calculations on many machines using machine-dependent parameters. With Xgrid, developers can stay focused on the science and mathematics and not be distracted by having to set up a network of computers. Xgrid is included with Mac OS X 10.4 Tiger; you can download the Xgrid Technology Preview 2 for the Panther version of Mac OS X.

Implementations of MPI on Mac OS X

Here are some of the most popular implementations of MPI on Mac OS X.

MPICH (commonly pronounced "emm-pitch") is perhaps the oldest implementation of MPI. In fact, portions of its code were created in the earliest days of UNIX, 30 years ago. MPICH, hosted at Argonne National Laboratory, was used as a reference standard for MPI, resulting in its being ported to almost every parallel computing platform imaginable.

Its default setting on Mac OS X uses TCP/IP over UNIX sockets for communications, as is commonly done on Ethernet networks. Being open source, it has numerous offshoots, often ported by hardware vendors—such as Myrinet (MPICH-GM) or makers of InfiniBand interfaces (MVAPICH)—to utilize their interconnect technology (which may not require or even support TCP/IP).

Perhaps due to its long history of adaptation, it is not always the best performer or particularly reliable. Its makers recognize these issues, resulting in MPICH2 (currently in version 1.0), a rewritten and revamped MPI implementation. MPICH2 is just beginning its portability career, with MVAPICH2 for InfiniBand being its most prominent offshoot so far.

LAM/MPI ("lamb-em-pee-eye" or simply "lamb") is the next most prominent open-source implementation of MPI ported to Mac OS X. Hosted by the Pervasive Technology Labs at Indiana University, it has a  unique implementation that provides its own run-time environment, called LAM daemons, for services such as process control, I/O redirection, and optionally asynchronous message passing.

The parallel application relies on these LAM daemons for execution and communication. While LAM does not enjoy the broad base of contributors that MPICH does, it is arguably the best performing MPI implementation in open source. Its default distribution for Mac OS X is even provided as a native installer package, rather than a UNIX tar archive like most open source.

MacMPI ("Mac-em-pee-eye"), the first MPI implementation on the Macintosh platform, was created in 1998 at UCLA's Plasma Physics Group. Today's Mac OS X incarnations use TCP/IP and are launched via the Pooch application from Dauger Research, a commercial spinoff of the UCLA physics group. Supporting the most commonly used subset of MPI, MacMPI is unique in that it displays visual diagnostics about the ongoing MPI job—useful both for debugging and optimizing MPI code.

An offshoot of MacMPI, LnxMPI_S, which uses BSD sockets and forgoes the visual diagnostics, can be run as a Darwin background process; LnxMPI_S is the first implementation of MPI supported by Apple's Xgrid. While most other MPI implementations require installation into the OS, MacMPI is meant to be simpler in that one integrates this source code library into the user's parallel code's source code base. Such executables can then be easily launched onto a cluster using, e.g., Pooch.

Since Mac OS X was introduced in 2001, other MPI implementations have appeared on the platform. MPI/Pro from MPI Software Technology is a commercial implementation whose performance rivals the best of open source. OpenMPI is an emerging open source collaboration (including LAM/MPI and several hardware vendors) intended to enable software developers, hardware vendors, system administrators, and scientific researchers to contribute best-of-breed functionality in their respective areas into a common environment. Other MPI implementations continuing to emerge, such as ProMPI and W-MPI.

Interconnects

MPI is designed to provide an abstract layer that hardware vendors and software implementors support and that parallel computing users use. It therefore is possible for the same parallel application to remain unchanged, at least at a source-code level, and execute correctly on a wide range of parallel computing hardware and software. These choices carry forth to Mac OS X.

The use of TCP/IP in clusters is so widespread that it's easy to think of IP-based networking and MPI as synonymous, but they are not. In some cases, there are good reasons not to use TCP/IP. A prime limitation that constrains the size of a cluster is network latency; that is, the time it takes for a one-byte message from one MPI task in a node to reach another. The software overhead of the TCP/IP stack is often a large fraction of that latency, so vendors of hardware that dramatically improve on that latency usually write custom interfaces called directly by the MPI implementation.

Two prominent examples of such hardware are Myrinet and InfiniBand. Myricom has shipped Myrinet hardware since 1994 for supercomputer and cluster vendors. Using fiber or copper, Myrinet hardware can achieve microsecond latency and multiple Gigabit bandwidth. The first hardware using InfiniBand, a non-proprietary interconnect standard, emerged in 2001. Capable of using either copper or fiber, this interconnect is derived from motherboard bus technology. However, the switches in particular are quite expensive, so they are practical primarily for very large clusters with at least several hundred nodes.

Calling the MPI API

The fundamental purpose of the MPI library is to provide a means to pass messages between executables running on different processors, typically on different machines. These executables need not be identical or exit simultaneously, though they typically are. Each of these executables use MPI calls to share information with each other. The three main types of MPI calls are:

  • "boilerplate" calls
  • message passing calls, and
  • collective calls.

Here are the calls and how to use each, by category.

"Boilerplate" Calls

Virtually every MPI code calls these routines because they set up the MPI environment and retrieve sufficient information for each process of a parallel executable to identify what part of the problem to work on.

  • MPI_Init & MPI_Finalize. All instances of the code must call MPI_Init before communicating, allowing the MPI implementation to prepare the communications environment, and MPI_Finalize, for the corresponding teardown, before exiting.
  • MPI_Comm_size & MPI_Comm_rank. After MPI_Init, the code may learn about its environment by calling MPI_Comm_size and MPI_Comm_rank. The former reports the number (N) of processes that are communicating via MPI and the latter identifies the calling process with an integer (0 through N-1).

Message Passing Calls

These are the fundamental calls of MPI that perform the most primitive message passing. Mastering these is often sufficient for many complex scientific codes.

  • MPI_Send & MPI_Recv. Passing a message between processes consists of calls to MPI_Send and MPI_Recv. The former call sends a message to another process, while the latter receives that message. The origin and destination of the message are memory-based. These MPI calls require a reference to a memory buffer and its size and are blocking, returning only when the memory buffer is available for the parallel executable's use again.
  • MPI_Isend & MPI_Irecv. These are the non-blocking, or asynchronous, calls for passing messages, making it possible to perform work while message passing is occurring or communicate with multiple tasks simultaneously.
  • MPI_Wait & MPI_Test. These balance the above asynchronous calls. MPI_Wait blocks until completion, while MPI_Test returns true the first time after its corresponding operation completes.

Collective Calls

MPI provides useful collective calls to coordinate data between many processes.

  • MPI_Bcast. Performs a simple broadcast of data from any one process to all the others.
  • MPI_Reduce. Computes the sum, maximum, or minimum on values spread across the parallel system, hence "reducing" the data.
  • MPI_Gather & MPI_Scatter. The former collects values from an array spread across processes into one large array on one process, while the latter performs the opposite operation.
  • MPI_Allreduce & MPI_Allgather. Cousins of MPI_Gather and MPI_Scatter that spread its results to all processes.
  • MPI_Alltoall. This call essentially performs a transpose on a matrix spread across the processes.

Many of these have vector variants (indicated with a v suffix), which allows a variable amount of data sent to or received by each process.

Example: Pascal's Triangle

To see how these work together, let us consider a simple, tightly coupled problem known as Pascal's Triangle. Numbers generated by this algorithm play an important role in probability and network theory.

For an excellent tutorial on converting single processor code to single-processor propagation-style code into a parallel code, see Parallelization: Parallel Pascal's Triangle by Dauger Research, Inc.

In the Pascal's Triangle example, calculating the next number requires knowing two previous numbers. On a single processor, this is simple to do, but can end up taking a very long time; see Figure 1.

Single Processor

Figure 1: Generating Numbers on a Single Processor.

To speed things up, we can partition this problem to run on three different processors; since the computation is "volume-like" (O(n)) while the communication is "area-like" (O(1)) this should be a net win. But how do we implement this?

The answer is that we replace array lookups with MPI calls. Whenever a new number is generated that would be needed by a different process, the relevant CPU issues an MPI_Send message, which is caught by the MPI_Recv message issued by the CPU that needs that value, as shown in Figure 2.

Multiple Processors

Figure 2: Generating Numbers on Multiple Processors.

Once the necessary values are available locally, the triangular calculation proceeds as before. Similarly, if you want to get the total value of a given row of the Triangle, you can use the collective calls discussed above (which may be more efficient, depending on your interconnect) rather than MPI_Send/Recv.

Of course, for a long-running calculation you'd want a more efficient algorithm that grew all three regions proportionately, rather than always having three elements in the middle tier, but this illustrates the basic ideas.

Issues Using MPI on Mac OS X

In general, using MPI on Mac OS X is identical to using MPI on other platforms, and similar to running other UNIX programs on Mac OS X. However, there are a few areas requiring special consideration:

Launching MPI on OS X

For both the CLI- and GUI-based mechanisms, an MPI job is usually launched by copying the executable to all nodes that will be involved in the system and executing them there (via Xgrid, Apple Remote Desktop, a specialized tool [e.g., Pooch], or even just ssh) Once they begin, the launch mechanisms provide the executables data about the environment.

For TCP/IP-based MPI implementations, this data consists of IP network addresses. For specialized hardware such as Myrinet and InfiniBand, they provide analogous data that identifies the interconnects. With this data, the MPI implementation can set up the N*(N-1)/2 connections between the N processes on the cluster, usually triggered when the code calls MPI_Init. From here, the parallel job takes control.

Networking

One of the most critical steps in launching an MPI job is setting up those interprocess connections. Any web browser user recognizes the frustration of a failed network connection. In an MPI job, that issue is multiplied by dozens or hundreds of network connections. A problem could occur because of an incorrectly specified address, to which many command-line based MPI implementations are susceptible because they often rely on static files containing address lists. Sometimes a simple refused network query, even due to a temporary issue, could prevent a connection from being made. Some implementations are more tolerant of such faults than others. Naturally, a firewall or other built-in network restriction will also prevent any unanticipated communication.

Deadlocks

With the MPI job's successful start, the parallel code can take over, but as with any programming project, debugging becomes an important skill.

The first bug that the typical student of MPI encounters is a deadlock. A deadlock can occur when processes expect messages from each other in a way that is impossible to complete. For example, a beginning MPI code may have process 0 call MPI_Send send a message and have process 1 call MPI_Recv to receive that message, and all is well.

But the novice might try to have these two processes exchange messages simultaneously by having both call MPI_Send first, then MPI_Recv. They will deadlock because they will both wait forever on their sends because neither has called a receive to accept the messages.

The fix is for each to call MPI_Irecv, then MPI_Send, then MPI_Wait on the MPI_Request value returned by the MPI_Irecv. This way, both processes post a receive, ready for the other process' MPI_Send. The MPI_Wait completes the receive. In general, deadlocks can be substantially more complicated, but the principle is the same.

Optimization

Because of the nature of computation and communications technology, optimization of parallel code using MPI often involves designing your code to make most efficient use of the communications hardware. From the outset of the code, it is best to organize the problem to be solved so that it reduces or consolidates the interprocess communication as much as possible.

For example, a code may have many small amounts of data for transmission to another node. Sending this data using many small messages is a less efficient use of the network because there is a constant overhead, consisting of handshaking, error correction, and the like, for each data packet sent over the network.

Buffering these small data structures together into a large message, when possible, allows the network to communicate the data most efficiently and approach the peak bandwidth of the hardware. On typical Ethernet hardware, messages greater than 16 kB in size approach this optimal scenario. Knowing this, a parallel code writer can make more efficient use of the hardware, and the techniques one learns on small clusters carry to large supercomputers as well.

Next Steps

Now that you understand the basics, it is time to dig in and get started. Two easy projects to get your feet wet are:

Updated: 2005-03-09