Add low-level and high-performance kernels to your Metal app. Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family.
- iOS 9.0+
- macOS 10.13+ Beta
- tvOS 9.0+
The Metal Performance Shaders framework contains a collection of highly optimized compute and graphics shaders that are designed to integrate easily and efficiently into your Metal app. These data-parallel primitives are specially tuned to take advantage of the unique hardware characteristics of each GPU family to ensure optimal performance. Apps adopting the Metal Performance Shaders framework can be sure of achieving optimal performance without needing to update their own hand-written shaders for each new GPU family. Metal Performance Shaders can be used along with your app’s existing Metal resources (such as the MTLBuffer objects) and shaders.
In iOS 9 and tvOS 9, the Metal Performance Shaders framework introduced a series of commonly used image processing kernels for performing image effects on Metal textures.
In iOS 10 and tvOS 10, the Metal Performance Shaders framework introduced additional support for the following kernels:
- Convolutional Neural Networks (CNN) to implement and run deep learning using previously obtained training data. CNN is a machine learning technique that attempts to model the visual cortex as a sequence of convolution, rectification, pooling, and normalization steps.
- Image processing to perform color conversion.
The MPSKernel Class
MPSKernel is the base class for all Metal Performance Shaders kernels. It defines the baseline behavior for all kernels, declaring the device to run the kernel on, some debugging options, and a user-friendly label, should one be required. Derived from this class are the MPSBinary subclasses, which define shared behavior for most image processing kernels (filters) such as edging modes, clipping, and tiling support for image operations that consume one or two source textures. Neither these subclasses nor the MPSKernel class is meant to be used directly. They only provide API abstraction and, in some cases, allow some level of polymorphic manipulation of image kernel objects.
Subclasses of the MPSBinary classes provide specialized initialization and encoding methods to encode various image processing primitives into a command buffer, and may also provide additional configurable properties of their own. Many such image filters are available, such as:
- Convolution filters (Sobel, Gaussian)
- Morphological operators (dilate, erode)
- Histogram operators (equalization, specification)
All of these run on the GPU directly on texture and buffer objects.
MPSBinary classes serve to unify a diversity of image operations into a simple, consistent interface and calling sequence for applying image filters; subclasses implement the details that diverge from the norm. For example, some filters may take a small set of parameters (for example, a convolution kernel) to govern how they function. However, the overall sequence for using kernel subclasses remains the same:
Determine whether the Metal Performance Shaders framework supports your device by querying the
Allocate the usual Metal objects to drive a Metal compute pipeline:
MTLCommand. If your app has already written to any command buffers, Metal Performance Shaders can encode onto them inline with your own workload.
Create an appropriate kernel, for example, a MPSImage object if you want to do a Gaussian blur. Kernels are generally lightweight, but can be reused to save some setup time. They cannot be used by multiple threads concurrently, so if your app uses Metal from many threads concurrently, make extra kernels. MPSKernel objects conform to the
Call the kernel’s encoding method. Parameters for the encoding call vary by kernel type, but operate similarly. They create a command encoder, write commands to run the kernel into the command buffer, and then end the command encoder. This means you must call the end method on your current command encoder before calling a kernel’s encode method. At this point, you can either release the kernel or keep it for later use to save some setup cost.
If you wish to encode further commands of your own on the command buffer, you must create a new command encoder to do so.
When you are done with the command buffer, submit it to the device using the commit() method. The kernel will then begin running on the GPU. You can either use the add methods to be notified when the work is done.
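The sequence above can be sketched in Swift roughly as follows. This is a minimal illustration rather than a definitive implementation: the function name `encodeGaussianBlur` and its parameters are hypothetical, and the framework calls it relies on (`MPSSupportsMTLDevice`, `MPSImageGaussianBlur`, `encode(commandBuffer:sourceTexture:destinationTexture:)`, `addCompletedHandler`, and `commit()`) are the ones this sketch assumes correspond to the steps listed. It also assumes the destination texture was created with shader-write usage.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: blurs `source` into `destination` and calls
/// `completion` once the GPU has finished. Returns false when the device is
/// not supported or a command buffer cannot be created.
func encodeGaussianBlur(device: MTLDevice,
                        queue: MTLCommandQueue,
                        source: MTLTexture,
                        destination: MTLTexture,
                        sigma: Float,
                        completion: @escaping () -> Void) -> Bool {
    // Check that the framework supports this GPU before creating kernels.
    guard MPSSupportsMTLDevice(device) else { return false }

    // Reuse the app's existing command queue for the command buffer.
    guard let commandBuffer = queue.makeCommandBuffer() else { return false }

    // Create the kernel against the same device that owns the textures.
    let blur = MPSImageGaussianBlur(device: device, sigma: sigma)

    // The kernel opens and closes its own command encoder, so no other
    // encoder may still be open on this command buffer at this point.
    blur.encode(commandBuffer: commandBuffer,
                sourceTexture: source,
                destinationTexture: destination)

    // Ask for an asynchronous notification, then submit the work.
    commandBuffer.addCompletedHandler { _ in completion() }
    commandBuffer.commit()
    return true
}
```

Registering a completion handler instead of blocking keeps the calling thread free to encode the next command buffer while this one runs.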
Each kernel is allocated against a particular device; a single kernel may not be used with multiple devices. This is necessary because the init(device:) methods sometimes allocate buffers and textures to hold data passed in as parameters to the initialization method, and a device is required to allocate them. Kernels provide a copy(with: method that allows them to be copied for a new device.
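As a brief illustration of copying a kernel for another device, the sketch below assumes the copy method referred to above is `copy(with:device:)` on `MPSKernel`. The generic helper `duplicate(_:for:)` is a hypothetical name introduced only for this example.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: returns a copy of `kernel` that is allocated against
/// `device`, leaving the original kernel untouched.
func duplicate<K: MPSKernel>(_ kernel: K, for device: MTLDevice) -> K {
    // Passing nil for the zone requests the default allocation zone.
    return kernel.copy(with: nil, device: device)
}
```

The same helper could also be used to make one kernel per thread, since a kernel cannot be used by multiple threads concurrently.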
The Metal Performance Shaders framework has been tuned for excellent performance across a diversity of devices and kernel parameters. The tuning process focuses on minimizing both CPU and GPU latency for back-to-back calls on the same command buffer. It is possible, however, to inadvertently undo this optimization effort by introducing costly operations into the pipeline around the kernel, leading to disappointing overall results.
Here are some elements of good practice to avoid common pitfalls:
Don’t wait for results to complete before enqueuing more work. There can be a significant delay (up to 2.5 ms) just to get an empty command buffer through the pipeline to where the wait method returns. Instead, start encoding the next command buffer(s) while you wait for the first one to complete. Enqueue them too, so they can start immediately after the previous one exits the GPU. Don’t wait for the CPU kernel to notice the first command buffer is done, start taking it apart, and eventually make a callback to the app before beginning work on encoding the next one. By allowing the CPU and GPU to work concurrently in this way, throughput can be enhanced by up to a factor of ten.
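One common way to keep several command buffers in flight without blocking on each is a counting semaphore that is signaled from the completion handler. The sketch below is an assumption about how an app might structure this, not something the framework prescribes; the class name `FramePipeline` and the default limit of three in-flight buffers are hypothetical.

```swift
import Metal
import Dispatch

/// Hypothetical helper that lets the CPU encode ahead of the GPU while
/// bounding the number of command buffers in flight.
final class FramePipeline {
    private let queue: MTLCommandQueue
    private let inFlight: DispatchSemaphore

    init(queue: MTLCommandQueue, maxInFlight: Int = 3) {
        self.queue = queue
        self.inFlight = DispatchSemaphore(value: maxInFlight)
    }

    /// Encodes and commits one command buffer. Blocks only when
    /// `maxInFlight` buffers are already outstanding.
    func submit(_ encode: (MTLCommandBuffer) -> Void) {
        inFlight.wait()
        guard let commandBuffer = queue.makeCommandBuffer() else {
            inFlight.signal()   // Give the slot back if nothing was created.
            return
        }
        encode(commandBuffer)

        // Free the slot when the GPU finishes; do not wait here.
        let slots = inFlight
        commandBuffer.addCompletedHandler { _ in slots.signal() }
        commandBuffer.commit()
    }
}
```

With this shape, the caller returns from `submit` as soon as the buffer is committed and can immediately start encoding the next one.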
There is a large cost to allocating buffers and textures. The cost can swamp the CPU, preventing you from keeping the GPU busy. Try to preallocate and reuse the MTLResource objects as much as possible.
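A simple way to reuse textures is a small pool that hands back previously allocated textures before creating new ones. The following is a minimal sketch under the assumption that all pooled textures share one descriptor; `TexturePool`, `acquire()`, and `recycle(_:)` are hypothetical names, not framework API.

```swift
import Foundation
import Metal

/// Hypothetical pool of textures that all share a single descriptor.
final class TexturePool {
    private let device: MTLDevice
    private let descriptor: MTLTextureDescriptor
    private var free: [MTLTexture] = []
    private let lock = NSLock()

    init(device: MTLDevice, descriptor: MTLTextureDescriptor) {
        self.device = device
        // Copy the descriptor so later changes by the caller do not affect the pool.
        self.descriptor = descriptor.copy() as! MTLTextureDescriptor
    }

    /// Returns a recycled texture if one is available, otherwise allocates one.
    func acquire() -> MTLTexture? {
        lock.lock()
        defer { lock.unlock() }
        if let texture = free.popLast() { return texture }
        return device.makeTexture(descriptor: descriptor)
    }

    /// Returns a texture to the pool once the GPU no longer uses it.
    func recycle(_ texture: MTLTexture) {
        lock.lock()
        free.append(texture)
        lock.unlock()
    }
}
```

A texture should only be recycled after the command buffer that uses it has completed, for example from that buffer's completion handler.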
There is a cost to switching between render and compute encoders. Each time a new render encoder is used, there can be a substantial GPU mode switch cost that may undermine your throughput. To avoid the cost, try to batch compute work together. Since making a new command buffer forces you to make a new command encoder too, try to do more work with fewer command buffers.
For some image operations, particularly those involving multiple passes (e.g. chaining multiple image filters together), performance can be improved by up to a factor of two by breaking the work into tiles of ~512 KB in size. Use the source method to find the region needed for each tile.
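To make the tile-size arithmetic concrete, the sketch below splits a destination image into horizontal strips of roughly 512 KB each and encodes a kernel once per strip. It is an illustration under stated assumptions: `tileRegions` and `encodeTiled` are hypothetical helpers, full-width strips are just one possible tile shape, and the example assumes the source method mentioned above is `sourceRegion(destinationSize:)` and that the `clipRect` property restricts the destination area written. A real multi-pass chain would encode every filter for a tile before moving to the next tile.

```swift
import Metal
import MetalPerformanceShaders

/// Hypothetical helper: full-width strips whose size is close to `targetTileBytes`.
func tileRegions(width: Int, height: Int, bytesPerPixel: Int,
                 targetTileBytes: Int = 512 * 1024) -> [MTLRegion] {
    precondition(width > 0 && height > 0 && bytesPerPixel > 0)
    let bytesPerRow = width * bytesPerPixel
    // At least one row per tile, even if a single row exceeds the target.
    let rowsPerTile = max(1, targetTileBytes / bytesPerRow)
    var regions: [MTLRegion] = []
    var y = 0
    while y < height {
        let rows = min(rowsPerTile, height - y)
        regions.append(MTLRegionMake2D(0, y, width, rows))
        y += rows
    }
    return regions
}

/// Hypothetical helper: encodes `kernel` one tile at a time and returns the
/// source region the kernel reports it reads for each tile's size.
func encodeTiled(kernel: MPSUnaryImageKernel,
                 commandBuffer: MTLCommandBuffer,
                 source: MTLTexture,
                 destination: MTLTexture,
                 bytesPerPixel: Int) -> [MPSRegion] {
    let savedClip = kernel.clipRect
    var sourceRegions: [MPSRegion] = []
    for tile in tileRegions(width: destination.width,
                            height: destination.height,
                            bytesPerPixel: bytesPerPixel) {
        sourceRegions.append(kernel.sourceRegion(destinationSize: tile.size))
        kernel.clipRect = tile   // Limit writes to this tile.
        kernel.encode(commandBuffer: commandBuffer,
                      sourceTexture: source,
                      destinationTexture: destination)
    }
    kernel.clipRect = savedClip  // Restore the caller's clip rectangle.
    return sourceRegions
}
```

For a 1024 x 1024 texture with 4 bytes per pixel, each row is 4096 bytes, so a 512 KB tile holds 128 rows and the image splits into 8 strips.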