Add low-level and high-performance kernels to your Metal app. Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family.


The Metal Performance Shaders framework contains a collection of highly optimized compute and graphics shaders that are designed to integrate easily and efficiently into your Metal app. These data-parallel primitives are specially tuned to take advantage of the unique hardware characteristics of each GPU family to ensure optimal performance. Apps adopting the Metal Performance Shaders framework can be sure of achieving optimal performance without needing to update their own hand-written shaders for each new GPU family. Metal Performance Shaders can be used along with your app’s existing Metal resources (such as the MTLCommandBuffer, MTLTexture, and MTLBuffer objects) and shaders.

In iOS 9 and tvOS 9, the Metal Performance Shaders framework introduced a series of commonly used image processing kernels for performing image effects on Metal textures.

In iOS 10 and tvOS 10, the Metal Performance Shaders framework introduced additional support for the following kernels:

  • Convolutional Neural Networks (CNN) to implement and run deep learning using previously obtained training data. CNN is a machine learning technique that attempts to model the visual cortex as a sequence of convolution, rectification, pooling, and normalization steps.

  • Image processing to perform color-conversion.

  • Matrix multiplication.

Data Containers

Most data operated on by the Metal Performance Shaders framework must be in a portable data container appropriate for use on the GPU, such as a MTLBuffer, MTLTexture, or MPSImage object.

The MPSImage Class

MTLBuffer and MTLTexture objects are commonly used in Metal apps and are used directly by the Metal Performance Shaders framework when possible. In apps that use CNN, kernels may need more than the four data channels that a MTLTexture object can provide. In these cases, an MPSImage object is used instead as an abstraction layer on top of a MTLTexture object. When more than 4 channels are needed, additional textures in the 2D texture array are added to hold additional channels in sets of four. An MPSImage object tracks this information as the number of feature channels in an image.

The MPSTemporaryImage Class

The MPSTemporaryImage class extends the MPSImage class to provide advanced caching of unused memory, in order to increase performance and reduce memory footprint. MPSTemporaryImage objects are intended as fast GPU-only storage for intermediate image data needed only transiently within a single MTLCommandBuffer object. They accelerate the common case of image data which is created only to be consumed and destroyed immediately by the next operation(s) encoded in a command buffer. MPSTemporaryImage objects provide a convenient and simple way to save memory by automatically aliasing other MPSTemporaryImage objects in the same command buffer. Because they alias (i.e., share texel storage with) other textures in the same command buffer, the valid lifetime of the data in an MPSTemporaryImage object is extremely short, limited to a portion of the command buffer itself.

You cannot read data from or write data to an MPSTemporaryImage object using the CPU, or use the data in other MTLCommandBuffer objects. Use regular MPSImage objects for more persistent storage.

CNN Images

MPSCNNKernel objects operate on MPSImage objects. MPSImage objects are at their core MTLTexture objects; however, whereas MTLTexture objects commonly represent image or texel data, an MPSImage object is a more abstract representation of image features. The channels within an MPSImage do not necessarily correspond to colors in a color space (although they can, if necessary). As a result, there can be many more than four of them. Having 32 or 64 channels per pixel is not uncommon in CNN. This is achieved on the MTLTexture object abstraction by inserting extra RGBA pixels to handle the additional feature channels (if any) beyond 4. These extra pixels are stored as multiple slices of a 2D image array. Thus, each CNN pixel in a 32-channel image is represented as 8 array slices, with 4 channels stored per pixel in each slice. The width and height of the MTLTexture object are the same as the width and height of the MPSImage object. The number of slices in the MTLTexture object is given by the number of feature channels rounded up to a multiple of 4, then divided by 4.
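The slice arithmetic described above can be written out as a short sketch. The function name textureSliceCount is hypothetical and is not part of the framework; it only restates the rule that each array slice holds four feature channels.

```swift
/// Hypothetical helper (not a framework API): the number of 2D array slices
/// needed to hold `featureChannels` channels when each slice stores four.
func textureSliceCount(featureChannels: Int) -> Int {
    precondition(featureChannels > 0, "an image needs at least one feature channel")
    // Round up to a multiple of 4, then divide by 4.
    return (featureChannels + 3) / 4
}

let slicesFor32Channels = textureSliceCount(featureChannels: 32) // 8
let slicesFor5Channels = textureSliceCount(featureChannels: 5)   // 2
```

A 5-channel image therefore occupies two slices, with three channels of the second slice left unused.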

MPSImage objects can be created from existing MTLTexture objects. They may also be created anew from an MPSImageDescriptor and backed with either standard texture memory, or as MPSTemporaryImage objects using memory drawn from the framework’s internal cached texture backing store. MPSTemporaryImage objects can provide great memory usage and CPU time savings, but come with significant restrictions that should be understood before using them. For example, their contents are only valid during the GPU-side execution of a single MTLCommandBuffer object and cannot be read from or written to by the CPU. They are provided as an efficient way to hold CNN computations that are used immediately within the scope of the same MTLCommandBuffer object and then discarded. Concatenation is also supported by allowing you to define from which destination feature channel to start writing the output of the current layer. In this way, your app can make a large MPSImage or MPSTemporaryImage object and fill in parts of it with multiple layers (as long as the destination feature channel offset is a multiple of 4).
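As a rough sketch of the two backing options described above, the following creates one descriptor and uses it for both a persistent MPSImage and an MPSTemporaryImage. The helper names, the 16-bit float channel format, and the readCount value are assumptions chosen for illustration.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: describes an image with the given size and channel count.
/// The .float16 channel format is an assumption; pick the format your network needs.
func makeFeatureDescriptor(width: Int, height: Int, featureChannels: Int) -> MPSImageDescriptor {
    return MPSImageDescriptor(channelFormat: .float16,
                              width: width,
                              height: height,
                              featureChannels: featureChannels)
}

/// Persistent storage backed by standard texture memory.
func makePersistentImage(device: MTLDevice, descriptor: MPSImageDescriptor) -> MPSImage {
    return MPSImage(device: device, imageDescriptor: descriptor)
}

/// Transient storage that is only valid within `commandBuffer`.
func makeScratchImage(commandBuffer: MTLCommandBuffer,
                      descriptor: MPSImageDescriptor) -> MPSTemporaryImage {
    let image = MPSTemporaryImage(commandBuffer: commandBuffer, imageDescriptor: descriptor)
    // Assumption for this sketch: exactly one later kernel reads the image.
    image.readCount = 1
    return image
}
```

For concatenation, a layer's destinationFeatureChannelOffset property selects the first feature channel it writes in the destination image; as noted above, that offset must be a multiple of 4.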

The MPSKernel Class

The MPSKernel is the base class for all Metal Performance Shaders kernels. It defines the baseline behavior for all kernels, declaring the device to run the kernel on, some debugging options, and a user-friendly label, should one be required. Derived from this class are the MPSUnaryImageKernel and MPSBinaryImageKernel subclasses, which define shared behavior for most image processing kernels (filters) such as edging modes, clipping, and tiling support for image operations that consume one or two source textures. Neither these nor the MPSKernel class are meant to be used directly. They just provide API abstraction and in some cases may allow some level of polymorphic manipulation of image kernel objects.

Subclasses of the MPSUnaryImageKernel and MPSBinaryImageKernel classes provide specialized initialization and encoding methods to encode various image processing primitives into a command buffer, and may also provide additional configurable properties on their own. Many such image filters are available, such as:

  • Convolution filters (Sobel, Gaussian)

  • Morphological operators (dilate, erode)

  • Histogram operators (equalization, specification)

All of these run on the GPU directly on texture and buffer objects.

As the MPSKernel, MPSUnaryImageKernel, and MPSBinaryImageKernel classes serve to unify a diversity of image operations into a simple consistent interface and calling sequence to apply image filters, subclasses implement details that diverge from the norm. For example, some filters may take a small set of parameters (e.g. a convolution kernel) to govern how they function. However, the overall sequence for using kernel subclasses remains the same:

  1. Determine whether the Metal Performance Shaders framework supports your device by querying the MPSSupportsMTLDevice(_:) function.

  2. Allocate the usual Metal objects to drive a Metal compute pipeline: MTLDevice, MTLCommandQueue, and MTLCommandBuffer. If your app has already written to any command buffers, Metal Performance Shaders can encode onto them inline with your own workload.

  3. Create an appropriate kernel—for example, a MPSImageGaussianBlur object if you want to do a Gaussian blur. Kernels are generally lightweight, but can be reused to save some setup time. They cannot be used by multiple threads concurrently, so if your app uses Metal from many threads concurrently, make extra kernels. MPSKernel objects conform to the NSCopying protocol.

  4. Call the kernel’s encoding method. Parameters for the encoding call vary by kernel type, but operate similarly. They create a command encoder, write commands to run the kernel into the command buffer, and then end the command encoder. This means you must call the endEncoding() method on your current command encoder before calling a kernel’s encode method. At this point, you can either release the kernel or keep it for later use to save some setup cost.

  5. If you wish to encode further commands of your own on the command buffer, you must create a new command encoder to do so.

  6. When you are done with the command buffer, submit it to the device using the commit() method. The kernel will then begin running on the GPU. You can use either the waitUntilCompleted() method to block until the work is done or the addCompletedHandler(_:) method to be notified when it is done.
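The sequence above can be sketched for an out-of-place Gaussian blur as follows. The function name blurredCopy(of:sigma:queue:) is hypothetical, and the blocking wait at the end is only there to keep the sketch short; it assumes the source texture was created with shader read usage.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: blurs `source` into a newly allocated texture.
func blurredCopy(of source: MTLTexture, sigma: Float, queue: MTLCommandQueue) -> MTLTexture? {
    let device = queue.device
    // Step 1: check that the framework supports this device.
    guard MPSSupportsMTLDevice(device) else { return nil }
    // Step 2: obtain a command buffer (reuse an existing one if you have it).
    guard let commandBuffer = queue.makeCommandBuffer() else { return nil }

    // The destination must be writable by compute kernels.
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: source.pixelFormat,
                                                              width: source.width,
                                                              height: source.height,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead, .shaderWrite]
    guard let destination = device.makeTexture(descriptor: descriptor) else { return nil }

    // Step 3: create the kernel. It is tied to `device` and can be reused.
    let blur = MPSImageGaussianBlur(device: device, sigma: sigma)
    // Step 4: encode. No other command encoder may be open on the buffer here.
    blur.encode(commandBuffer: commandBuffer,
                sourceTexture: source,
                destinationTexture: destination)
    // Step 5: nothing further to encode in this sketch.
    // Step 6: submit and wait for the GPU to finish.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    return destination
}
```

In a real app, addCompletedHandler(_:) is usually the better choice than blocking the calling thread.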

Each kernel is allocated against a particular device; a single kernel may not be used with multiple devices. This is necessary because the init(device:) methods sometimes allocate buffers and textures to hold data passed in as parameters to the initialization method, and a device is required to allocate them. Kernels provide a copy(with:device:) method that allows them to be copied for a new device.

Common Kernel Properties

The MPSUnaryImageKernel and MPSBinaryImageKernel base classes define several properties common to all image kernels:

clipRect

A clip rectangle is available to all image kernels that write to a destination texture. It describes the sub-rectangle of the destination texture overwritten by the filter. If the clip rectangle is larger than the destination texture, then the intersection between the clip rectangle and the destination texture bounds is used instead. A clip rectangle may be used to avoid doing work to obscured regions of the destination image, or to manage tiling and limit operations to parts of an image—for example, if a user draws a rectangle on the screen and asks your app to just apply the filter there.
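The intersection rule described above can be illustrated with a small sketch. The helper effectiveClip is hypothetical; the framework applies this rule itself, so the function only shows which region ends up being written.

```swift
import Metal

/// Hypothetical helper (not a framework API): the part of `clip` that lies
/// inside a destination texture of the given size.
func effectiveClip(_ clip: MTLRegion, textureWidth: Int, textureHeight: Int) -> MTLRegion {
    // Clamp the origin into the texture bounds.
    let x = min(max(clip.origin.x, 0), textureWidth)
    let y = min(max(clip.origin.y, 0), textureHeight)
    // Limit the size to what remains of the texture past the origin.
    let width = max(0, min(clip.size.width, textureWidth - x))
    let height = max(0, min(clip.size.height, textureHeight - y))
    return MTLRegionMake2D(x, y, width, height)
}

// A 500 x 500 request at (100, 50) on a 256 x 256 texture covers 156 x 206 texels.
let written = effectiveClip(MTLRegionMake2D(100, 50, 500, 500),
                            textureWidth: 256, textureHeight: 256)
```

To restrict a filter to a user-drawn rectangle, assign an MTLRegion such as this one to the kernel's clipRect property before encoding.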

offset, primaryOffset, and secondaryOffset

An offset is available to all image kernels that use a source texture from which pixel data is read. It describes the positioning of the source image relative to the result texture. An offset of {0, 0, 0} indicates that the top left pixel of the source texture is the center pixel used to create the top left corner of the destination texture clip rectangle (as a further example, an offset of {1, 2, 0} positions the top left corner of the clip rectangle at position x=1, y=2, and z=0 of the source image). The offset is the position of the top left corner of the clip rectangle in the source coordinate frame. It can be used for tiling and for translating an image up, down, left, or right by pixel increments. If there is no clip rectangle, then the offset is the top left corner of the region read by the filter. If there are multiple source textures, then the primary offset describes the top left corner of the region read in the primary source texture and the secondary offset describes the top left corner of the region read in the secondary source texture.

edgeMode, primaryEdgeMode, and secondaryEdgeMode

An edge mode describes the behavior of texture reads that stray off the edge of the source image. This can happen if the offset is negative, meaning a read off the top or left edge of the image. This can also happen if the sum of the clip rectangle size and the offset is larger than the source image, meaning a read off the bottom or right edge of the image. Furthermore, it is also possible for image filters to have a kernel window that stretches to examine neighboring pixels beyond the image bounds (such as convolution, morphology, and resampling filters). If there are multiple source textures, then the primary edge mode describes the mode to use with the primary source texture and the secondary edge mode describes the mode to use with the secondary source texture.
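A minimal sketch of how the offset and edge mode interact, assuming a unary kernel such as MPSImageGaussianBlur. The helper name makeShiftedBlur is hypothetical. A negative offset places the top left corner of the clip rectangle above and to the left of the source image, so some reads fall outside the source and the edge mode decides what they return.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: a Gaussian blur whose source is read starting at
/// (shiftX, shiftY) in source coordinates, clamping reads at the image edge.
func makeShiftedBlur(device: MTLDevice, sigma: Float,
                     shiftX: Int, shiftY: Int) -> MPSImageGaussianBlur {
    let blur = MPSImageGaussianBlur(device: device, sigma: sigma)
    // Position of the clip rectangle's top left corner in the source image.
    blur.offset = MPSOffset(x: shiftX, y: shiftY, z: 0)
    // Out-of-bounds reads repeat the nearest edge pixel instead of returning zero.
    blur.edgeMode = .clamp
    return blur
}
```

For example, makeShiftedBlur(device: device, sigma: 2, shiftX: -8, shiftY: -8) reads off the top and left edges of the source, which translates the result 8 pixels down and to the right.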

In-Place Operation

Some kernels can operate in place. This means that the same texture is used to hold both the input image and the result image. Operating in place is a great way to save memory, time, and energy. You can perform an in-place operation by using the encode(commandBuffer:inPlaceTexture:fallbackCopyAllocator:) method.

Unfortunately, it is not always possible for kernels to run in place. Whether a particular kernel can operate in place can vary according to the hardware it is running on, the OS version, and the parameters and properties passed to it. You may not assume that because a kernel works in place today on a particular device that it will do so in the future.

To simplify error handling with failed in-place operation, the encode(commandBuffer:inPlaceTexture:fallbackCopyAllocator:) method takes an optional MPSCopyAllocator object. It is used to create a new texture when in-place operation is not possible so as to allow the operation to proceed out of place in a reliable fashion instead. When this happens, the input texture is released and replaced with a new texture. To make use of the feature, you will need to write a copy allocator block.

Listing 1 shows a minimal copy allocator implementation. For more information, see the MPSCopyAllocator reference.

Listing 1

Minimal MPSCopyAllocator Implementation

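let myAllocator: MPSCopyAllocator = {
    (kernel: MPSKernel, buffer: MTLCommandBuffer, texture: MTLTexture) -> MTLTexture in
    // Describe a texture with the same format and size as the source texture.
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: texture.pixelFormat,
                                                              width: texture.width,
                                                              height: texture.height,
                                                              mipmapped: false)
    // The replacement texture is written by the kernel and read by later passes.
    descriptor.usage = [.shaderRead, .shaderWrite]
    // makeTexture(descriptor:) returns an optional in current SDKs; this minimal
    // example force-unwraps it instead of handling allocation failure.
    return buffer.device.makeTexture(descriptor: descriptor)!
}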

Sample Code

Listing 2

Metal Performance Shaders Sample Code

// Blur the input texture (in place if possible) on the given MTLCommandQueue, and return the
// resulting texture.
// This is a trivial example. It is not necessary or necessarily advised to enqueue a MPSKernel on
// its own MTLCommandBuffer or using its own MTLComputeCommandEncoder. Group work together.
// Here we assume that you have already gotten a MTLDevice using MTLCreateSystemDefaultDevice() or
// MTLCopyAllDevices(), used it to create a MTLCommandQueue with makeCommandQueue(), and
// similarly made textures with the device as needed.
func myBlurTextureInPlace(inTexture: MTLTexture, blurRadius: Float, queue: MTLCommandQueue) -> MTLTexture? {
    // Create the usual Metal objects.
    // MPS does not need a dedicated MTLCommandBuffer or MTLComputeCommandEncoder.
    // This is a trivial example. You should reuse the MTL objects you already have, if you have them.
    let device = queue.device
    guard let buffer = queue.makeCommandBuffer() else { return nil }
    // Create a MPS filter.
    let blur = MPSImageGaussianBlur(device: device, sigma: blurRadius)
    // Defaults are okay here for other MPSKernel properties (clipRect, offset, edgeMode).
    // Attempt to do the work in place.  Since we provided a copyAllocator as an out-of-place
    // fallback, we don’t need to check to see if it succeeded or not.
    // See the "Minimal MPSCopyAllocator Implementation" code listing for a sample myAllocator.
    var texture = inTexture
    blur.encode(commandBuffer: buffer,
                inPlaceTexture: &texture,
                fallbackCopyAllocator: myAllocator)
    // The usual Metal enqueue process.
    buffer.commit()
    buffer.waitUntilCompleted()
    // texture is now either the original texture or the replacement made by the copy allocator.
    return texture
}

Tuning Hints

The Metal Performance Shaders framework has been tuned for excellent performance across a diversity of devices and kernel parameters. The tuning process focuses on minimizing both CPU and GPU latency for back to back calls on the same command buffer. It is possible, however, to inadvertently undo this optimization effort by introducing costly operations into the pipeline around the kernel, leading to disappointing overall results.

Here are some elements of good practice to avoid common pitfalls:

  1. Don’t wait for results to complete before enqueuing more work. There can be a significant delay (up to 2.5 ms) just to get an empty command buffer through the pipeline to where the waitUntilCompleted() method returns. Instead, start encoding the next command buffer(s) while you wait for the first one to complete. Enqueue them too, so they can start immediately after the previous one exits the GPU. Don’t wait for the CPU kernel to notice the first command buffer is done, start taking it apart, and eventually make a callback to the app before beginning work on encoding the next one. By allowing the CPU and GPU to work concurrently in this way, throughput can be enhanced by up to a factor of ten.

  2. There is a large cost to allocating buffers and textures. The cost can swamp the CPU, preventing you from keeping the GPU busy. Try to preallocate and reuse the MTLResource objects as much as possible.

  3. There is a cost to switching between render and compute encoders. Each time a new render encoder is used, there can be a substantial GPU mode switch cost that may undermine your throughput. To avoid the cost, try to batch compute work together. Since making a new command buffer forces you to make a new command encoder too, try to do more work with fewer command buffers.

  4. For some image operations, particularly those involving multiple passes (e.g. chaining multiple image filters together), performance can be improved by up to a factor of two by breaking the work into tiles of ~512 KB in size. Use the sourceRegion(destinationSize:) method to find the region needed for each tile.
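One possible way to pick tiles for the last hint is sketched below. It splits an image into full-width horizontal strips of at most roughly 512 KB each. The Tile type and the horizontalStrips function are hypothetical, and full-width strips are only one of several reasonable tile shapes.

```swift
/// Hypothetical tile description in destination pixel coordinates.
struct Tile: Equatable {
    let x, y, width, height: Int
}

/// Hypothetical helper: full-width strips whose size stays near `targetTileBytes`.
func horizontalStrips(imageWidth: Int, imageHeight: Int, bytesPerPixel: Int,
                      targetTileBytes: Int = 512 * 1024) -> [Tile] {
    precondition(imageWidth > 0 && imageHeight > 0 && bytesPerPixel > 0 && targetTileBytes > 0)
    let bytesPerRow = imageWidth * bytesPerPixel
    // Use at least one row per tile, even if a single row exceeds the target.
    let rowsPerTile = max(1, targetTileBytes / bytesPerRow)
    var tiles: [Tile] = []
    var y = 0
    while y < imageHeight {
        let height = min(rowsPerTile, imageHeight - y)
        tiles.append(Tile(x: 0, y: y, width: imageWidth, height: height))
        y += height
    }
    return tiles
}

// A 1024 x 1024 RGBA8 image (4 bytes per pixel) yields 8 strips of 128 rows.
let strips = horizontalStrips(imageWidth: 1024, imageHeight: 1024, bytesPerPixel: 4)
```

Each tile can then be used as the clip rectangle for one pass over the filter chain, with sourceRegion(destinationSize:) reporting how much of the source each filter needs for a tile of that size.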

CNN Tips

  • Think carefully about the edge mode requested for pooling layers. The default value is zero, but there are times when a clamp value may be better.

  • To avoid reading off the edge of an image for filters that have a filter area (convolution, pooling), set the kernel's offset property to MPSOffset(x: kernelWidth/2, y: kernelHeight/2, z: 0) and reduce the size of the output image by {kernelWidth-1, kernelHeight-1, 0}. The filter area stretches up and to the left of the kernel offset by {kernelWidth/2, kernelHeight/2}.

  • Always remember the following distinction:

    • The MPSCNNConvolution class takes weights in the order weight[outputChannels][kernelHeight][kernelWidth][inputChannels/groups].

    • The MPSCNNFullyConnected class takes weights in the order weight[outputChannels][sourceWidth][sourceHeight][inputChannels].

  • Initialize MPSCNNKernel objects once and reuse them.

  • You can use MPSCNNNeuron objects and similar to perform pre-processing of images, such as scaling and resizing.

  • Specify a neuron filter with an MPSCNNConvolutionDescriptor object to combine the convolution and neuron operations.

  • Use MPSTemporaryImage objects for intermediate images that live for a short period of time (one MTLCommandBuffer object).

    MPSTemporaryImage objects can reduce the amount of memory used by the CNN severalfold, and similarly reduce both the CPU time spent allocating storage and the latency between the time a command buffer is committed and when it is actually executed on the GPU.

    You cannot read or write to a MPSTemporaryImage object using the CPU. Generally, MPSTemporaryImage objects should be created as needed and thrown away promptly. Persistent objects should not retain them.

    Be sure to understand the purpose of the readCount property: it is the number of times a temporary image may be read by a kernel before its contents become undefined.

  • Because the Metal Performance Shaders framework encodes its work in place in your command buffer, you always have the option to insert your own code in between MPSCNNKernel encodings as a Metal function for tasks not covered by the framework. You do not need to use the Metal Performance Shaders framework for everything.
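Two of the tips above involve arithmetic that is easy to get wrong, so here is a small sketch of both. The type and function names are hypothetical. The first helper computes the offset and output size that keep a filter window inside the source image; the second computes the flat array index for the MPSCNNConvolution weight order given above.

```swift
/// Hypothetical result type: kernel offset and output size for a filter
/// whose window never reads outside the source image.
struct ValidFilterGeometry: Equatable {
    let offsetX, offsetY, outputWidth, outputHeight: Int
}

/// Offset is {kernelWidth/2, kernelHeight/2}; the output shrinks by
/// {kernelWidth-1, kernelHeight-1}, as described in the tip above.
func validFilterGeometry(inputWidth: Int, inputHeight: Int,
                         kernelWidth: Int, kernelHeight: Int) -> ValidFilterGeometry {
    precondition(kernelWidth > 0 && kernelHeight > 0)
    precondition(inputWidth >= kernelWidth && inputHeight >= kernelHeight)
    return ValidFilterGeometry(offsetX: kernelWidth / 2,
                               offsetY: kernelHeight / 2,
                               outputWidth: inputWidth - (kernelWidth - 1),
                               outputHeight: inputHeight - (kernelHeight - 1))
}

/// Flat index into weight[outputChannels][kernelHeight][kernelWidth][inputChannels/groups].
func convolutionWeightIndex(outputChannel: Int, kernelY: Int, kernelX: Int, inputChannel: Int,
                            kernelHeight: Int, kernelWidth: Int,
                            inputChannelsPerGroup: Int) -> Int {
    // The innermost (fastest varying) dimension is the input channel.
    return ((outputChannel * kernelHeight + kernelY) * kernelWidth + kernelX)
        * inputChannelsPerGroup + inputChannel
}

// A 5 x 5 filter over a 224 x 224 image: offset (2, 2), output 220 x 220.
let geometry = validFilterGeometry(inputWidth: 224, inputHeight: 224,
                                   kernelWidth: 5, kernelHeight: 5)
```

The offset values would be assigned to the kernel as MPSOffset(x: geometry.offsetX, y: geometry.offsetY, z: 0), and the destination image created with the reduced output size.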

Supported Pixel Formats for Image Kernels

All Metal Performance Shaders image kernels support source and destination textures with the following ordinary and packed pixel formats:

r8Unorm, r8Unorm_srgb

Ordinary formats with one 8-bit normalized unsigned integer component.

rg8Unorm, rg8Unorm_srgb

Ordinary formats with two 8-bit normalized unsigned integer components.

rgba8Unorm, rgba8Unorm_srgb, bgra8Unorm, bgra8Unorm_srgb

Ordinary formats with four 8-bit normalized unsigned integer components.

r16Float, rg16Float, rgba16Float

Ordinary format with 16-bit floating-point components.

r32Float, rg32Float, rgba32Float

Ordinary format with 32-bit floating-point components.

r16Unorm, rg16Unorm, rgba16Unorm

Ordinary format with 16-bit normalized unsigned integer components.

b5g6r5Unorm, a1bgr5Unorm, abgr4Unorm, bgr5A1Unorm

Packed 16-bit format with normalized unsigned integer color components.


Packed 32-bit format with normalized unsigned integer color components.

rg11b10Float, rgb9e5Float

Packed 32-bit format with floating-point color components.

Some compressed pixel formats can be used as source textures. They cannot be used as destination textures because they cannot be written to. Metal Performance Shaders image kernels support the following compression families:



  • ASTC

The following Metal Performance Shaders image kernels also support source and destination textures with ordinary signed and unsigned integer pixel formats:

The ordinary signed and unsigned integer pixel formats supported by these image kernels:

r8Sint, rg8Sint, rgba8Sint

Ordinary format with 8-bit signed integer components.

r8Uint, rg8Uint, rgba8Uint

Ordinary format with 8-bit unsigned integer components.

r16Sint, rg16Sint, rgba16Sint

Ordinary format with 16-bit signed integer components.

r16Uint, rg16Uint, rgba16Uint

Ordinary format with 16-bit unsigned integer components.

r32Sint, rg32Sint, rgba32Sint

Ordinary format with 32-bit signed integer components.

r32Uint, rg32Uint, rgba32Uint

Ordinary format with 32-bit unsigned integer components.

For more information on pixel formats, see MTLPixelFormat and Pixel Format Capabilities.


Working with Convolutional Neural Networks


Specifies a ReLU (Rectified Linear Unit) neuron filter.


Specifies a sigmoid neuron filter.


Specifies a hyperbolic tangent neuron filter.


Specifies a pooling kernel.


Specifies an average pooling filter.


Specifies a max pooling filter.


The softmax filter is a neural transfer function that is useful for classification tasks.


Specifies a spatial normalization kernel.


A description of the attributes of a convolution kernel.


Specifies a normalization kernel across feature channels.


A fully connected convolution layer, also known as an inner product layer.


Describes a convolution neural network kernel.


Specifies a local contrast normalization kernel.


The logarithmic softmax filter is a neural transfer function that is useful for constructing a loss function to be minimized when training neural networks.


This filter applies a neuron activation function.


Specifies an absolute neuron filter.


Specifies a linear neuron filter.


A convolution kernel that convolves the input image with a set of filters, with each producing one feature map in the output image.

Image Filter Base Classes


A standard interface for Metal Performance Shaders kernels.


A kernel that consumes one texture and produces one texture.


A kernel that consumes two textures and produces one texture.

Morphological Image Filters


A filter that finds the maximum pixel value in a rectangular region centered around each pixel in the source image.


A filter that finds the minimum pixel value in a rectangular region centered around each pixel in the source image.


A filter that finds the maximum pixel value in a rectangular region centered around each pixel in the source image.


A filter that finds the minimum pixel value in a rectangular region centered around each pixel in the source image.

Convolution Image Filters


A filter that convolves an image with a given kernel of odd width and height.


A filter that applies a median filter in a square region centered around each pixel in the source image.


A filter that convolves an image with a given kernel of odd width and height.


A filter that convolves an image with a tent filter.


A filter that convolves an image with a Gaussian blur of a given sigma in both the x and y directions.


A filter that convolves an image with a Gaussian pyramid.


A filter that convolves an image with the Sobel operator.


An optimized Laplacian filter, provided for ease of use.


A base class for creating different kinds of pyramid images.

Histogram Image Filters


A filter that computes the histogram of an image.


A filter that equalizes the histogram of an image.


A filter that performs a histogram specification operation on an image.

Image Threshold Filters


A filter that applies a fixed-level threshold to each pixel in the image.


A filter that applies a fixed-level threshold to each pixel in the image.


A filter that applies a fixed-level threshold to each pixel in the image.


A filter that applies a fixed-level threshold to each pixel in the image.


A filter that applies a fixed-level threshold to each pixel in the image.

Image Integral Filters


A filter that calculates the sum of pixels over a specified region in the image.


A filter that calculates the sum of squared pixels over a specified region in the image.

Converting, Transforming and Transposing Images


A filter that can perform a conversion of color space, alpha, or pixel format.


A filter that can be used to resample an existing image by using a different sampling frequency in each dimension.


A filter that transposes an image.

Working with Matrices


Describes a 2D array of data and provides storage for its values.


A matrix descriptor describes the attributes of an MPSMatrix object and is used to create one.


A matrix multiplication kernel.

Working with Images


An image describes a texture that may have more than 4 channels.


A description of the attributes used to create an MPSImage.


Temporary images are provided as a fast way to store transient data that will be used and discarded promptly.


Metal Performance Shaders Constants

The Metal Performance Shaders framework provides a single constant that is used as a default clipping rectangle for Metal Performance Shaders kernels.

Metal Performance Shaders Data Types

Metal Performance Shaders Functions

MetalPerformanceShaders Enumerations

See Also

Related Symbols


Render advanced 3D graphics and perform data-parallel computations. Get fine-grained access to the GPU while minimizing CPU overhead.

Related Documentation