Accelerate machine learning with Metal
Learn how to accelerate your machine learning transformer models with new features in Metal Performance Shaders Graph. We'll also cover how to improve your model's compute bandwidth and quality, and visualize it in the all new MPSGraph viewer.
Chapters
- 0:00 - Introduction
- 2:07 - Transformer support
- 13:41 - Fast Fourier transforms
- 16:42 - MPS Graph viewer
Resources
- Filtering Images with MPSGraph FFT Operations
- Forum: Machine Learning and AI
- Metal Performance Shaders Graph
- MPSGraph
Hello! My name is Kamal Ramamoorthy and I'm a software engineer from the GPU, Graphics and Display team. In this video, my colleague Sam and I will present how you can accelerate your machine learning models using Metal.
Training is the first step of deploying models on Apple's platforms. The second step is to prepare the model for deployment on device.
Finally, the model is ready to be integrated into your application, which is what I will focus on in this video.
If you are using Core ML to deploy your models, MPSGraph provides GPU acceleration using Metal.
To learn more about Core ML, watch the “Explore Machine Learning on Apple Platforms” video. You may also want to watch the “Deploy Machine Learning Models On-Device With Core ML” video.
You can also train your models using frameworks like PyTorch, TensorFlow and JAX. Check out the “Train Your ML Models With Apple Silicon GPUs” video to learn more.
All these frameworks build on top of Metal Performance Shaders Graph, which is a framework for constructing and running general purpose compute graphs using Metal. MPSGraph provides low level control over GPU synchronization and memory, so in some cases you may want to use MPSGraph directly.
For example, if your application uses Metal, you can use MPSGraph to sequence ML tasks with other GPU work. You can also share low level Metal resources such as buffers. If you’re new to accelerating machine learning using Metal, check out our videos from previous years’ WWDCs.
Sam and I will talk about three things in this video. First, improvements to MPS and MPSGraph. Many of these are focused on improving the efficiency of transformer models, so I will use those models as an example. Second, new features which accelerate FFT-based ML models. Finally, Sam will introduce MPSGraph Viewer, which allows you to visualize your ML models.
Let’s start with the new features focused on improving the performance of transformer models. Transformers are commonly used in language models to translate, predict and generate text. The input is a sequence of tokens. For example, a simple sentence like “The quick brown” is a 3-token input.
The language model responds by predicting the next token, “fox”.
This new sentence is repeatedly fed back into the model to generate new tokens.
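To make that feedback loop concrete, here is a minimal Python sketch. It is not MPSGraph code: the "model" is a hypothetical lookup table standing in for a real transformer, and the token "jumps" is invented for illustration.

```python
# Toy autoregressive loop. NEXT_TOKEN is a hypothetical stand-in for a
# transformer: it maps a token sequence to the next token.
NEXT_TOKEN = {
    ("The", "quick", "brown"): "fox",
    ("The", "quick", "brown", "fox"): "jumps",  # invented continuation
}

def predict_next(tokens):
    """Return the next token, or None when the toy model has no answer."""
    return NEXT_TOKEN.get(tuple(tokens))

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token is None:
            break
        tokens.append(next_token)  # feed the extended sequence back in
    return tokens

generated = generate(["The", "quick", "brown"], max_new_tokens=5)
```

Each pass sees the whole sequence so far, which is why the cost of generation grows with sequence length.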
MPS and MPSGraph have new features which enable you to improve your transformer models. These improvements fall into three categories. First, improved compute performance. Next, memory bandwidth savings. And finally, quality improvements for transformer models. I’ll start with compute performance.
Transformer-based models are made of layers of transformer blocks. Focusing further on the internals of a transformer block, it consists of multihead attention, normalization and feed forward blocks.
The multihead attention block is one of the most compute intensive blocks. This block computes large multidimensional matrix multiplications, which are compute heavy operations. The input matrix is projected through a matrix multiplication layer to produce a query matrix called Q, a key matrix K and a value matrix V, which are then fed to the Scaled Dot-Product attention block. This is the heart of the transformer model.
There are two ways you can optimize the performance of this attention block. If you look inside the attention block, it is made of several operations.
MPSGraph now has an operation which combines this sequence of operations into a single kernel which executes more efficiently.
To use this operation, call the scaledDotProductAttention method on an MPSGraph object. This method takes query, key, and value tensors as arguments.
Using the fused SDPA operation should enable you to improve the performance of your transformer models.
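For reference, the math that a scaled dot-product attention operation computes can be sketched in plain Python. This is only an illustration of softmax(Q·Kᵀ / √d)·V on nested lists; it leaves out the optional mask and does not use or imitate the MPSGraph API.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """q: [n_q][d], k: [n_k][d], v: [n_k][d_v], all nested lists of floats."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(q, k, v)  # equal scores, so values are averaged
```

A fused kernel produces the same result without writing the intermediate score and weight matrices out to memory between steps.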
Let’s revisit the multi head attention block so I can show you another opportunity to improve compute performance using these tensors.
Here you can see how the Query, Key and Value projections work for the first token. A matrix multiplication operation projects the embedding vector for query, key and value.
To generate the next output token, we have to feed all the previously generated tokens into the matrix multiplication. This results in recomputing the K and V projections that were already calculated in previous iterations.
This cost adds up the longer the sequence length gets. To mitigate this problem, you can cache these projections as they are generated so they can be reused in future iterations. The first thing you need to do is create K and V tensors which will store the cached K and V values. These tensors are called the KV cache. In the first iteration, the K and V values are computed for the first token and inserted into the KV cache.
You can now reuse the cached values, so that in the second iteration, you only need to compute the K and V values for the second token. This simplifies the matrix-matrix multiplication into a matrix-vector multiplication.
You could append the K and V projections to the end of the KV cache by creating a new tensor for every iteration, but this would use a lot of memory. Instead, you can update the existing tensor in-place using the slice update operation.
You can then use the slice operation to extract just the portion of the KV cache which has been computed.
Let’s look at how to do this in code.
First, create a placeholder representing the cache. The dimensions of this tensor depend on the details of your model. For this example, I will focus on just the key portion of the cache, but the value portion works the same way.
To be able to update the KV cache in-place, you will need to create a variable which represents the current state of the cache. Unlike the results of a normal graph operation, you can update this variable to refer to a different value later.
For every token, you’ll need to insert the key projection into the cache. You can do this using the sliceUpdateDataTensor method on the MPSGraph object. The start and end arrays indicate where to put the new value. This example appends it to the end of the valid portion of the cache. In this case, the stride is uniform.
You can now assign the updated cache back to the original variable. MPSGraph will optimize this to update the cache allocation in-place.
Finally, you can use the slice operation to extract just the portion of the KV cache which has been computed. This is the portion from the beginning of the cache up to the most recently inserted key projection.
You can then pass the updated key cache to the SDPA operation.
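The caching pattern itself, independent of MPSGraph, might look like the following Python sketch. The `KeyCache` class and `project` helper are hypothetical names for illustration; they mirror the slice update (write one row in place) and the slice (read only the valid rows) described above.

```python
class KeyCache:
    """Preallocated key cache updated in place (hypothetical helper, not an MPSGraph type)."""

    def __init__(self, max_tokens, dim):
        self.storage = [[0.0] * dim for _ in range(max_tokens)]
        self.length = 0  # number of rows that hold computed projections

    def append(self, key_row):
        # Slice update: overwrite row [length, length + 1) of the existing storage.
        if self.length >= len(self.storage):
            raise ValueError("cache is full")
        self.storage[self.length][:] = key_row
        self.length += 1

    def valid(self):
        # Slice: only rows [0, length) have been computed so far.
        return self.storage[: self.length]

def project(embedding, weight):
    """Matrix-vector product: one token's embedding times a [d_in][d_out] matrix."""
    return [sum(x * w for x, w in zip(embedding, column)) for column in zip(*weight)]

w_k = [[1.0, 0.0], [0.0, 2.0]]
cache = KeyCache(max_tokens=4, dim=2)
for embedding in ([1.0, 1.0], [2.0, 3.0]):
    cache.append(project(embedding, w_k))  # only the new token is projected
keys = cache.valid()
```

The storage is allocated once, so each iteration costs one matrix-vector product and one row write instead of reprojecting every earlier token.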
Once you’ve made these compute improvements, the memory bandwidth becomes the new bottleneck.
The memory required to store the weights for large language models can be on the order of tens of gigabytes. These weights are usually represented using 16-bit floating point. However, MPS supports quantizing these weights to 8-bit integers to reduce the memory footprint by half.
New this year, MPS also supports a 4-bit integer format. This allows you to reduce the size of the weights even further. MPS supports several techniques to map your weights to these quantized formats.
Here is an example tensor where the elements are distributed on a number line. For 8-bit quantization, there are 256 possible quantization points linearly distributed along the number line. For 4-bit quantization there are 16 points.
During quantization, each element is adjusted to the closest quantized value. The quantization scale factor can be determined by using the formula on the right. This introduces a slight error, but it cuts the memory space and bandwidth by 2x or 4x. Use the dequantize method on the MPSGraph object to dequantize the values.
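The formula from the slide is not reproduced in this transcript. As an assumption, the Python sketch below uses one common min/max affine scheme, where scale = (max − min) / (2^bits − 1) and the zero point is the integer that represents 0.0. It illustrates the idea only and is not the MPSGraph implementation.

```python
def affine_quantize(values, bits):
    """Map floats to integers in [0, 2**bits - 1] using one scale and zero point."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # assumed min/max scheme
    zero_point = round(-lo / scale)
    quantized = [
        min(levels, max(0, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point

def affine_dequantize(quantized, scale, zero_point):
    """Reconstruct approximate floats from the integer codes."""
    return [scale * (q - zero_point) for q in quantized]

weights = [-0.9, -0.2, 0.1, 0.4, 1.3]
q4, scale4, zp4 = affine_quantize(weights, bits=4)
restored = affine_dequantize(q4, scale4, zp4)
```

With 4 bits there are only 16 codes, so each reconstructed weight lands within about half a quantization step of the original.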
Another quantization technique uses a lookup table. This technique is useful when your weights are clustered around different areas on the number line. With affine quantization, the quantized values are uniformly distributed, but the input values are not. This causes most of the quantized bits to go unused as most input values cluster around only a few quantized points. You can use the quantized bits better by using a lookup table.
In this technique, you choose your own quantized points based on the distribution of your data. You store these quantized values in a lookup table. Then, you assign each weight a 4 or 8-bit index into this table. This way, you get a lot more flexibility while sacrificing only a small amount of performance to look up the values in the table.
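Here is a small Python sketch of lookup-table quantization, under the assumption that each weight is simply mapped to its nearest table entry. The table below has 4 entries (2-bit indices) to keep the example short; a 4-bit or 8-bit index would allow 16 or 256 entries. The function names are illustrative, not MPSGraph API.

```python
def lut_quantize(values, table):
    """Replace each value with the index of the nearest table entry."""
    return [
        min(range(len(table)), key=lambda i: abs(table[i] - v)) for v in values
    ]

def lut_dequantize(indices, table):
    """Look each index up in the table to recover an approximate value."""
    return [table[i] for i in indices]

# Table entries chosen to sit where the weights cluster.
table = [-2.0, -1.9, 0.0, 3.0]
weights = [-2.02, -1.88, 0.05, 2.9, -1.97]
indices = lut_quantize(weights, table)
restored = lut_dequantize(indices, table)
```

Two of the four entries sit close together near -2.0, which a uniformly spaced grid with the same number of points could not do.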
Use the dequantize method to convert these quantized values back into 32-bit floating point values. Simply pass in your quantized weights and the 32-bit lookup table. You can then use the dequantized tensor as usual, for example, as an input to a matrix multiplication. In fact, in cases like this, MPSGraph goes one step further.
If your graph contains a dequantize operation on the weights preceding a matrix multiplication, MPSGraph will replace the two operations with a single quantized matrix multiplication operation. This operation will dequantize weights on the fly as needed rather than storing a temporary copy of the dequantized weights.
Quantization can save memory and memory bandwidth, but it can also introduce numerical inaccuracies. Now, let me show you 2 ways to improve the quality of your transformer models.
When you quantize your weights, each weight is mapped to a lower precision value. You also choose a scale and, optionally, an offset to apply to the quantized values when dequantizing. However, applying a single scale and offset value to all of the weights will limit how accurate the reconstructed values can be.
Instead, you can quantize blocks of elements individually, each with their own scale and offset values. This allows you to match the scale and offset values more precisely for each block.
The code to do this is similar to the earlier example, except, instead of passing in a single scale and zero point value, you pass in a tensor containing the scale and zero point for each block.
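To see why per-block parameters help, here is a Python sketch that quantizes the same weights twice: once with a scale and offset per block of 4 elements, and once with a single scale and offset for all 8. As a simplifying assumption, the offset is stored as the block minimum in floating point rather than as an integer zero point, and the helper names are hypothetical.

```python
def quantize_blockwise(values, bits, block_size):
    """Quantize each block of values with its own scale and offset (block minimum)."""
    levels = (1 << bits) - 1
    quantized, scales, offsets = [], [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / levels if hi > lo else 1.0
        scales.append(scale)
        offsets.append(lo)
        quantized.extend(round((v - lo) / scale) for v in block)
    return quantized, scales, offsets

def dequantize_blockwise(quantized, scales, offsets, block_size):
    """Apply each block's scale and offset to its integer codes."""
    return [
        q * scales[i // block_size] + offsets[i // block_size]
        for i, q in enumerate(quantized)
    ]

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Two blocks with very different ranges.
weights = [0.01, 0.02, 0.03, 0.04, 10.0, 20.0, 30.0, 40.0]
per_block = dequantize_blockwise(*quantize_blockwise(weights, 4, 4), 4)
per_tensor = dequantize_blockwise(*quantize_blockwise(weights, 4, 8), 8)
```

With a single scale, the four small weights all collapse to the same code. With one scale per block, each range gets its own 16 levels.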
So, that’s it for quantization. Next, I’ll show you a different way you can improve the quality of your transformer models using adapters.
Adapters are small layers that you can insert into your model consisting of just a few operations and parameters.
When you fine-tune the model, only the parameters inside the adapter are updated. This can be used to adapt a pre-trained base model to new tasks, but it can also be used to compensate for error introduced by quantization. You can add adapters to your model using MPSGraph callables.
The way this works is each adapter is a separate MPSGraph that can be called from the main graph.
First, insert calls to your adapters from your base graph by specifying a unique name for each adapter.
To do this in code, you will need to define the shape and type of the output the call to your adapter will produce. Then, use the call method on your main MPSGraph object to add the call to your adapter. This is where you provide the name, inputs, and output types for your callable.
Next, create the MPSGraph for each adapter. In this example, I’ll create a placeholder representing the "input" as an unranked tensor. Next, I’ll create the "output" tensor by multiplying the "input" by 2.
Finally, compile the graphs for each adapter into graph executables. These are compiled like any other MPSGraph. First, define the input types to the graph by providing the exact shape. Then, call the compile method on the graph object, providing the Metal device, input types, and output tensor.
Now that you’ve added the calls to your adapters from the main graph and compiled the graph executables for each adapter, the last thing you need to do is map the adapter names to the actual graph executables in the main graph. This is done when compiling your main MPSGraph for your network using a GraphCompilationDescriptor. First, create a dictionary mapping the name of each adapter to its graph executable and set it on the descriptor. Then simply provide the compilation descriptor when compiling the main graph.
And that’s all you need to do to set up adapters! To summarize, adapters and callables let you customize your models to perform new tasks and improve the quality because you can use them to fine-tune your model output.
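The name-based wiring can be pictured with a plain Python analogy. This is not MPSGraph code: `build_main` is a hypothetical stand-in for compiling the main graph with a dictionary that maps adapter names to separately built callables, and the adapter multiplies its input by 2 as in the example above.

```python
def build_main(adapters):
    """'Compile' the main function, resolving adapter names to callables up front."""
    adapter = adapters["adapter_0"]  # raises KeyError early if the name is unmapped

    def main(x):
        hidden = [v + 1.0 for v in x]  # stand-in for the base model's work
        return adapter(hidden)         # call into the separately built adapter

    return main

def times_two(x):
    """The adapter from the example: multiply the input by 2."""
    return [2.0 * v for v in x]

def identity(x):
    """A second adapter, to show that swapping needs no change to the main function."""
    return list(x)

main = build_main({"adapter_0": times_two})
result = main([1.0, 2.0])
plain = build_main({"adapter_0": identity})([1.0, 2.0])
```

Because the main function only knows the adapter by name, a different adapter can be supplied by changing the dictionary at build time.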
Next I’ll tell you about what’s new for Fourier Transforms in MPS and MPSGraph this year. Fast Fourier transforms, or FFTs for short, convert data like signals or images from the temporal or spatial domains to the frequency domain. They are a common pre-processing step in machine learning models that process audio, such as speech-to-text models, and models that separate different audio sources from a single track. They can also be used to accelerate certain convolution layers, and they’re used in many image processing and scientific computing applications. For example, to extract text from an audio signal, the input waveform is first run through a Short-Time Fourier Transform, or STFT. The frequency spectrum is then analyzed by a Transformer model to extract the text.
I’ve already described how you can use MPSGraph to execute ML models efficiently on the GPU. But you can also use MPSGraph’s support for Fast Fourier Transforms to move this entire pipeline to the GPU. The first step is to implement the Short-Time Fourier Transform.
This works by dividing the input waveform into multiple shorter views, or windows, which may overlap each other. Each window is effectively an independent signal. In order to reduce spectral leakage the individual windows are multiplied by a window function. Finally, you can use a normal batched one-dimensional FFT operation to compute the STFT for each window.
In order to divide the waveform into shorter windows, you can create a strided view.
First, set up the shape of the windowed view. In this case the width of the window will be 512 elements. Next, set up the stride for each dimension. This example uses 256, meaning we skip 256 elements in the underlying 1D array for each step in the second dimension. The final batch dimension is set to 1, but you can use larger batch sizes.
Finally you can create the strided view by calling the arrayView method on the input tensor. The best part is that the view operation works without performing any copies by aliasing the memory of the input array, saving memory and GPU time.
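The shape and stride arithmetic behind such a view can be checked with a short Python sketch. It only computes offsets into a flat list and does not use the MPS NDArray API. The batch dimension is left out, and the 2048-sample input length is an assumption for illustration.

```python
def window_count(num_samples, width, hop):
    """Number of full windows of `width` samples taken every `hop` samples."""
    if num_samples < width:
        return 0
    return 1 + (num_samples - width) // hop

def view_offset(index, strides):
    """Flat offset of a multi-dimensional index in the underlying 1D buffer."""
    return sum(i * s for i, s in zip(index, strides))

samples = list(range(2048))  # stand-in waveform; each value equals its position
width, hop = 512, 256
strides = (hop, 1)           # (window, element), measured in elements
shape = (window_count(len(samples), width, hop), width)

second_window_start = samples[view_offset((1, 0), strides)]
last_element = samples[view_offset((shape[0] - 1, width - 1), strides)]
```

Consecutive windows overlap by half, and every element of the view is just an offset into the original buffer, which is why no copy is needed.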
Now you can compute the FFTs over all the windows. First, create a placeholder for the strided view data. You will need to load the data out of the strided view NDArray and provide it when running the graph later. Next, multiply by the window function. This is typically the Hann window or a Gaussian window. You can use an MPSGraphConstant tensor for this, for example. Finally, you can create the FFTTensor operation.
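Putting the pieces together, a CPU-only Python sketch of a Short-Time Fourier Transform might look like this. It uses a periodic Hann window and a simple recursive radix-2 FFT as assumptions for illustration. It copies each window rather than using a strided view, and it does not call MPSGraph.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def stft(signal, width, hop):
    """Split into overlapping windows, apply the window function, FFT each one."""
    window = hann(width)
    frames = []
    for start in range(0, len(signal) - width + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + width], window)]
        frames.append(fft(frame))
    return frames

# A pure tone with 4 cycles per 32 samples should peak at frequency bin 4.
signal = [math.cos(2 * math.pi * 4 * i / 32) for i in range(64)]
frames = stft(signal, width=32, hop=16)
peak_bins = [max(range(17), key=lambda k: abs(f[k])) for f in frames]
```

Each frame is an independent FFT, so on the GPU the whole set can be handled as one batched one-dimensional FFT.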
So that’s it for Fast Fourier Transforms. Next, I’ll hand the presentation over to Sam, who has some great news for you if you want to get a better understanding of your MPSGraph structures.
Thanks Kamal! Hi everyone, I’m Sam Colbran, and I'm also a software engineer from the GPU, Graphics and Display team. Now, if you’re not familiar, Metal includes advanced tools in Xcode and Instruments to help you take full advantage of Apple GPUs. With so much power at your fingertips, your Metal pipelines and the AI models that run on-device can be bigger and more complex. However, while you can visualize your Metal pipelines with the dependency viewer in Xcode, it hasn’t been possible to visualize MPSGraph until now!
Today, I’m excited to introduce the newest addition to the Metal tools, coming in Xcode 16: the MPSGraph Viewer! It’s a brand new tool, specifically designed for machine learning and AI. Now you can directly open MPSGraph packages in Xcode and visualize how your operations are connected. Before I jump into a demo, let’s first recap how to actually create an MPSGraph package, whether you’re using MPSGraph directly or have been developing your ML models in another framework.
If you’ve created your model directly with MPSGraph, first, compile your graph into an MPSGraph executable. Then, use the serialize API on the executable, to create the package.
New this year, you can also create an MPSGraph executable directly from a Core ML package. As before, you can then serialize the executable to an MPSGraph package.
Alternatively, if you’re coming from another framework, like one that exports to ONNX, you can use mpsgraphtool to convert your model. Let’s go through an example together.
I’m using Mistral’s model with 7 billion parameters, which was converted to Core ML during this year’s “Bring your machine learning and AI models to Apple silicon” video.
Now, mpsgraphtool can be accessed through the command line. So open up your favorite Terminal and run mpsgraphtool with the convert argument.
And that’s it! Your freshly created mpsgraph package is ready to go.
And now, viewing it is easy, with the new MPSGraph Viewer.
I’ve already opened the converted Mistral package in Xcode 16, so let me describe what’s on my screen. Starting at the top left are the compilation options. By default, the viewer is showing the graph as-is. That is, it’s not optimized for any particular device. So the operations should appear the same, regardless of the device you’re using.
Beneath that is the operations navigator, which shows you a list of all of the operations used in your graph.
In the middle is the graph itself. And finally, on the right, there’s the operation inspector, which I’ll come back to later.
At this level, it’s a bit hard to see anything, so I’ll zoom in.
And now, I can see high-level structures, and the further in I go, even more detail! Here, I can scroll around and see all of the inputs and outputs for each operation and how they’re connected. This makes it easy to visualize and understand the structure of your graph.
Now, Mistral is a transformer model, and, as Kamal explained earlier, these are made of layers of transformer blocks. Let’s try to find them. I’ll start by looking for the new Scaled Dot-Product Attention operation, which should be part of the multihead attention in each transformer block. I could search for it, but I can already see in the operations navigator on the left that there are 32 of them.
I’ll expand this group, and click on the first one, to jump to it in the graph.
It looks like this operation has 5 inputs, and, hopefully, you recognize the Query, Key and Value, which Kamal went through earlier.
I’ll zoom out a little bit to get a better view of the whole transformer block.
And, following the connections, I can see the blocks that make up the query, the key, and the value.
And even at this level, I can see variables in both the key and the value. I’ll zoom-in on the ones in the key.
And, since this model was exported from Core ML with states, it’s using a KV cache and taking advantage of the new assign-to-variable and read-from-variable operations in MPS, which, as Kamal showed, will improve compute performance. Now, to simplify the graph, the viewer might show some operations and variables, like this one, in multiple places. And, once I’ve selected the variable, in the inspector on the right I can see where it’s first created and all of the places that it’s used.
Okay, so that was one transformer block. How about the rest? Well, just like the inspector, I can actually select multiple operations at the same time in the operation navigator. And just like that, the high level structure is clear, and I can see all of the layers repeated over, and over again.
Now, let’s talk about constants. You might’ve already noticed the green previews shown directly inside the graph, but I can also find them, sorted by size, in the constants navigator tab on the left. I’ll select the biggest one. Then, double click to open the constant viewer. Here, you can inspect the trained weights and gain insights into what your model has learned. This can help you to uncover opportunities to optimize your model for better integration on-device.
However, remember: the viewer is showing the graph as-is, it’s not optimized for any particular device.
In reality, the graph that’s executed might be different. For example, MPSGraph might automatically optimize operations by stitching them together into a single Metal kernel. You can use the viewer to visualize this. Let me show you how.
I’ve opened up an MPSGraph package containing ResNet50, and, like before, I can zoom in to see all of the operations and constants. But now, let’s see what the graph looks like for my device.
In the compilation options at the top left, I’ll select my device.
Now, zooming in, I can see that the operations have been grouped inside these Metal Stitched Shader regions, which I can expand to see inside.
Because these operations are fused into a single optimized Metal shader, internally, they have no memory overhead, which dramatically improves performance. In general, knowing how your graph ultimately gets executed on hardware can be useful to truly understanding its runtime performance characteristics. And that’s it for the new MPSGraph Viewer!
Now, let’s recap what was shown today. As Kamal explained earlier, you can accelerate your machine learning with Metal, using Metal Performance Shaders Graph. It’s already used under the hood in popular frameworks, such as Core ML, to give you the best performance on Apple silicon.
This year, new features for transformers can help you improve compute performance, with the new Scaled Dot-Product Attention operation combined with a KV cache; memory bandwidth, with quantization; and quality, with adapters, through callables. Fourier transforms can now be computed even faster in MPSGraph with the new strided NDArray API. And lastly, the new MPSGraph Viewer makes it easy to understand and gain insights into how your model is executed on Apple silicon. Make sure to check out the documentation and sample code for MPSGraph. And, of course, model integration is the last piece of the puzzle, so, if you haven’t already, make sure to check out these other great videos to learn more about training and deployment. Thanks for watching, and have a great WWDC!