Streaming is available in most browsers and in the Developer app.
-
Accelerate machine learning with Metal
Learn how to accelerate your machine learning transformer models with the new features in Metal Performance Shaders Graph. We'll also cover how to improve the compute bandwidth and quality of your own models, and visualize them with the all-new MPSGraph viewer.
Chapters
- 0:00 - Introduction
- 2:07 - Transformer support
- 13:41 - Fast Fourier transforms
- 16:42 - MPS Graph viewer
Resources
- Filtering Images with MPSGraph FFT Operations
- Forum: Machine Learning and AI
- Metal Performance Shaders Graph
- MPSGraph
Related Videos
WWDC24
WWDC23
WWDC22
WWDC21
-
Download
Hello! My name is Kamal Ramamoorthy and I'm a software engineer from the GPU, Graphics and Display team. In this video, my colleague Sam and I will present how you can accelerate your machine learning models using Metal.
Training is the first step of deploying models on Apple's platforms. The second step is to prepare the model for deployment on device.
Finally, the model is ready to be integrated into your application, which is what I will focus on in this video.
If you are using Core ML to deploy your models, MPSGraph provides GPU acceleration using Metal.
To learn more about Core ML, watch the “Explore Machine Learning on Apple Platforms” video. You may also want to watch the “Deploy Machine Learning Models On-Device With Core ML” video.
You can also train your models using frameworks like PyTorch, TensorFlow and JAX. Check out the “Train Your ML Models With Apple Silicon GPUs” video to learn more.
All these frameworks build on top of Metal Performance Shaders Graph, which is a framework for constructing and running general purpose compute graphs using Metal. MPSGraph provides low level control over GPU synchronization and memory, so in some cases you may want to use MPSGraph directly.
For example, if your application uses Metal, you can use MPSGraph to sequence ML tasks with other GPU work. You can also share low level Metal resources such as buffers. If you’re new to accelerating machine learning using Metal, check out our videos from previous years’ WWDCs.
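To make this concrete, here is a minimal sketch of what constructing and running an MPSGraph compute graph looks like. The tiny elementwise-add graph, the shapes, and the values are illustrative placeholders, not code from this session.

```swift
import Metal
import MetalPerformanceShadersGraph

// Minimal sketch: build a tiny elementwise-add graph and run it on the GPU.
// The shapes and values are illustrative only.
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

let graph = MPSGraph()
let a = graph.placeholder(shape: [4], dataType: .float32, name: "a")
let b = graph.placeholder(shape: [4], dataType: .float32, name: "b")
let sum = graph.addition(a, b, name: "sum")

// Wrap the input data in MPSGraphTensorData and run the graph on a Metal command queue.
let values: [Float] = [1, 2, 3, 4]
let bytes = values.withUnsafeBufferPointer { Data(buffer: $0) }
let mpsDevice = MPSGraphDevice(mtlDevice: device)
let aData = MPSGraphTensorData(device: mpsDevice, data: bytes, shape: [4], dataType: .float32)
let bData = MPSGraphTensorData(device: mpsDevice, data: bytes, shape: [4], dataType: .float32)

let results = graph.run(with: commandQueue,
                        feeds: [a: aData, b: bData],
                        targetTensors: [sum],
                        targetOperations: nil)
// results[sum] now holds the GPU-computed output.
```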
Sam and I will talk about three things in this video. First, improvements to MPS and MPS Graph. Many of these are focused on improving the efficiency of transformer models, so I will use those models as an example. Second, new features which accelerate FFT-based ML models. Finally, Sam will introduce MPSGraph Viewer, which allows you to visualize your ML models. Let’s start with the new features focused on improving the performance of transformer models. Transformers are commonly used in language models to translate, predict and generate text. The input is a sequence of tokens. For example, a simple sentence like “The quick brown” is a 3 token input.
The language model responds by predicting the next token, “fox”.
This new sentence is repeatedly fed back into the model to generate new tokens.
MPS and MPSGraph have new features which enable you to improve your transformer models. These improvements fall into three categories. First, improved compute performance. Next, memory bandwidth savings. And finally, quality improvements for transformer models. I’ll start with compute performance.
Transformer-based models are made of layers of transformer blocks. Focusing further on the internals of a transformer block, it consists of multihead attention, normalization and feed forward blocks.
The multihead attention block is one of the most compute intensive blocks. This block computes large multidimensional matrix multiplications, which are compute heavy operations. The input matrix is projected through a matrix multiplication layer to produce a query matrix called Q, a key matrix K and a value matrix V, which are then fed to the Scaled Dot-Product attention block. This is the heart of the transformer model.
There are two ways you can optimize the performance of this attention block. If you look inside the attention block, it is made of several operations.
MPSGraph now has an operation which combines this sequence of operations into a single kernel which executes more efficiently.
To use this operation, call the scaledDotProductAttention method on an MPSGraph object. This method takes query, key, and value tensors as arguments.
Using the fused SDPA operation should enable you to improve the performance of your transformer models.
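As a rough sketch, the fused operation can be used along these lines. The shapes (batch, heads, sequence length, head dimension), the scale value, and the tensor names are hypothetical, so treat the exact argument labels as an approximation rather than a definitive listing of the MPSGraph API.

```swift
import MetalPerformanceShadersGraph

// Sketch of the fused scaled dot-product attention operation.
// Shapes follow a hypothetical [batch, heads, sequence, headDim] layout.
let graph = MPSGraph()

let query = graph.placeholder(shape: [1, 8, 128, 64], dataType: .float16, name: "Q")
let key   = graph.placeholder(shape: [1, 8, 128, 64], dataType: .float16, name: "K")
let value = graph.placeholder(shape: [1, 8, 128, 64], dataType: .float16, name: "V")

// One fused kernel instead of separate matmul, scale, softmax and matmul operations.
let attention = graph.scaledDotProductAttention(query: query,
                                                key: key,
                                                value: value,
                                                scale: 1.0 / Float(64).squareRoot(),
                                                name: "sdpa")
```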
Let’s revisit the multihead attention block so I can show you another opportunity to improve compute performance using these tensors.
Here you can see how the Query, Key and Value projections work for the first token. A matrix multiplication operation projects the embedding vector for query, key and value.
To generate the next output token, we have to feed all the previously generated tokens into the matrix multiplication. This results in recomputing the K and V projections that were already calculated in previous iterations.
This cost adds up the longer the sequence length gets. To mitigate this problem, you can cache these projections as they are generated so they can be reused in future iterations. The first thing you need to do is create K and V tensors which will store the cached K and V values. These tensors are called the KV cache. In the first iteration, the K and V values are computed for the first token and inserted into the KV cache.
You can now reuse the cached values, so that in the second iteration, you only need to compute the K and V values for the second token. This simplifies the matrix-matrix multiplication into a matrix-vector multiplication.
You could append the K and V projections to the end of the KV cache by creating a new tensor for every iteration, but this would use a lot of memory. Instead, you can update the existing tensor in-place using the slice update operation.
You can then use the slice operation to extract just the portion of the KV cache which has been computed.
Let’s look at how to do this in code.
First, create a placeholder representing the cache. The dimensions of this tensor depend on the details of your model. For this example, I will focus on just the key portion of the cache, but the value portion works the same way.
To be able to update the KV cache in-place, you will need to create a variable which represents the current state of the cache. Unlike the results of a normal graph operation, you can update this variable to refer to a different value later.
For every token, you’ll need to insert the key projection into the cache. You can do this using the sliceUpdateDataTensor method on the MPSGraph object. The start and end arrays indicate where to put the new value. This example appends it to the end of the valid portion of the cache. In this case, the stride is uniform.
You can now assign the updated cache back to the original variable. MPSGraph will optimize this to update the cache allocation in-place.
Finally, you can use the slice operation to extract just the portion of the KV cache which has been computed. This is the portion from the beginning of the cache up to the most recently inserted key projection.
You can then pass the updated key cache to the SDPA operation.
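Putting those steps together, here is a simplified sketch of the key half of the cache. The shapes, the zero-initialized variable, the current token index, and the exact argument labels are assumptions made for illustration; the value cache is handled the same way.

```swift
import MetalPerformanceShadersGraph

// Sketch of an in-place updated key cache. Shapes, names and the zero-filled
// initial data are hypothetical; the value cache works the same way.
let graph = MPSGraph()
let maxSequenceLength = 1024
let headDim = 64

// Variable holding the current state of the key cache.
let cacheShape: [NSNumber] = [1, NSNumber(value: maxSequenceLength), NSNumber(value: headDim)]
let zeros = Data(count: maxSequenceLength * headDim * MemoryLayout<Float>.size)
let keyCache = graph.variable(with: zeros, shape: cacheShape, dataType: .float32, name: "keyCache")

// Key projection for the current token, and its (hypothetical) position in the sequence.
let newKey = graph.placeholder(shape: [1, 1, NSNumber(value: headDim)], dataType: .float32, name: "newKey")
let tokenIndex = 5

// Insert the new key projection at the end of the valid portion of the cache.
let updatedCache = graph.sliceUpdateDataTensor(keyCache,
                                               update: newKey,
                                               starts: [0, NSNumber(value: tokenIndex), 0],
                                               ends: [1, NSNumber(value: tokenIndex + 1), NSNumber(value: headDim)],
                                               strides: [1, 1, 1],
                                               name: "updateKeyCache")

// Assign the updated cache back to the variable; MPSGraph optimizes this into an in-place update.
let assignKeys = graph.assign(keyCache, tensor: updatedCache, name: "assignKeyCache")

// Extract only the computed portion of the cache to feed into the SDPA operation.
let validKeys = graph.sliceTensor(updatedCache, dimension: 1, start: 0,
                                  length: tokenIndex + 1, name: "validKeys")
```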
Once you’ve made these compute improvements, the memory bandwidth becomes the new bottleneck.
The memory required to store the weights for large language models can be in the order of tens of gigabytes. These weights are usually represented using 16-bit floating point. However, MPS supports quantizing these weights to 8-bit integers to reduce the memory footprint by half.
New this year, MPS also supports a 4-bit integer format. This allows you to reduce the size of the weights even further. MPS supports several techniques to map your weights to these quantized formats.
Here is an example tensor where the elements are distributed on a number line. For 8-bit quantization, there are 256 possible quantization points linearly distributed along the number line. For 4-bit quantization there are 16 points.
During quantization, each value is rounded to the closest quantization point. The quantization scale factor can be determined from the range of the input values, as in the sketch below. This introduces a slight error, but saves 2x or 4x in memory and bandwidth. Use the dequantize method on the MPSGraph object to dequantize the values.
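For a b-bit affine quantization of values in the range [min, max], the scale is commonly computed as scale = (max − min) / (2^b − 1), so 255 points for 8-bit and 15 intervals for 4-bit. Below is a minimal sketch of dequantizing 8-bit weights with a single scalar scale and zero point; the weight shape and the numeric values are hypothetical.

```swift
import MetalPerformanceShadersGraph

// Sketch of dequantizing 8-bit weights with a single scale and zero point.
// scale ≈ (max - min) / 255 for 8-bit affine quantization; the values here are made up.
let graph = MPSGraph()

let quantizedWeights = graph.placeholder(shape: [512, 512], dataType: .int8, name: "qWeights")

let dequantizedWeights = graph.dequantize(quantizedWeights,
                                          scale: 0.02,
                                          zeroPoint: -8.0,
                                          dataType: .float16,
                                          name: "dequantizedWeights")
```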
Another quantization technique uses a lookup table. This technique is useful when your weights are clustered around different areas on the number line. With affine quantization, the quantized values are uniformly distributed, but the input values are not. This causes most of the quantized bits to go unused as most input values cluster around only a few quantized points. You can use the quantized bits better by using a lookup table.
In this technique, you choose your own quantized points based on the distribution of your data. You store these quantized values in a lookup table. Then, you assign each weight a 4 or 8-bit index into this table. This way, you get a lot more flexibility while sacrificing only a small amount of performance to look up the values in the table.
Use the dequantize method to convert these quantized values back into 32-bit floating point values. Simply pass in your quantized weights and the 32-bit lookup table. You can then use the dequantized tensor as usual, for example as an input to a matrix multiplication. In fact, in cases like this, MPSGraph goes one step further.
If your graph contains a dequantize operation on the weights preceding a matrix multiplication, MPSGraph will replace the two operations with a single quantized matrix multiplication operation. This operation will dequantize weights on the fly as needed rather than storing a temporary copy of the dequantized weights.
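As a sketch of this pattern, the code below dequantizes weights through a lookup table and feeds the result into a matrix multiplication. The argument label for the lookup table (written here as lutTensor:), the use of uInt8 index storage, and the shapes are assumptions; consult the MPSGraph headers for the exact spelling.

```swift
import MetalPerformanceShadersGraph

// Sketch of lookup-table dequantization feeding a matrix multiplication.
// The lutTensor: label, the uInt8 index storage and the shapes are assumptions.
let graph = MPSGraph()

// Indices into a 16-entry lookup table of float32 values (4-bit quantization).
let weightIndices = graph.placeholder(shape: [512, 512], dataType: .uInt8, name: "weightIndices")
let lookupTable = graph.placeholder(shape: [16], dataType: .float32, name: "lookupTable")

// Dequantize the weights through the lookup table.
let weights = graph.dequantize(weightIndices, lutTensor: lookupTable, name: "dequantizedWeights")

// Because the dequantize feeds this matmul, MPSGraph can fuse the two into a single
// quantized matrix multiplication that dequantizes weights on the fly.
let activations = graph.placeholder(shape: [1, 512], dataType: .float32, name: "x")
let output = graph.matrixMultiplication(primary: activations, secondary: weights, name: "output")
```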
Quantization can save memory and memory bandwidth, but it can also introduce numerical inaccuracies. Now, let me show you 2 ways to improve the quality of your transformer models.
When you quantize your weights, each weight is mapped to a lower precision value. You also choose a scale and, optionally, an offset to apply to the quantized values when dequantizing. However, applying a single scale and offset value to all of the weights will limit how accurate the reconstructed values can be.
Instead, you can quantize blocks of elements individually, each with their own scale and offset values. This allows you to match the scale and offset values more precisely for each block.
The code to do this is similar to the earlier example, except, instead of passing in a single scale and zero point value, you pass in a tensor containing the scale and zero point for each block.
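Here is a sketch of that blockwise variant. The block size, the shapes of the per-block scale and zero-point tensors, and the exact dequantize overload used are assumptions for illustration.

```swift
import MetalPerformanceShadersGraph

// Sketch of blockwise dequantization: each block of weights gets its own scale
// and zero point, passed as tensors instead of single scalar values.
// The block size (64), the shapes and the exact overload are assumptions.
let graph = MPSGraph()

// 8-bit weights quantized in blocks of 64 along the last dimension:
// 512 / 64 = 8 blocks per row, so one scale and zero point per block.
let quantizedWeights = graph.placeholder(shape: [512, 512], dataType: .int8, name: "qWeights")
let blockScales = graph.placeholder(shape: [512, 8], dataType: .float16, name: "blockScales")
let blockZeroPoints = graph.placeholder(shape: [512, 8], dataType: .float16, name: "blockZeroPoints")

// Dequantize using per-block parameters rather than one scalar pair for the whole tensor.
let weights = graph.dequantize(quantizedWeights,
                               scaleTensor: blockScales,
                               zeroPointTensor: blockZeroPoints,
                               dataType: .float16,
                               name: "dequantizedWeights")
```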
So, that’s it for quantization. Next, I’ll show you a different way you can improve the quality of your transformer models using adapters.
Adapters are small layers that you can insert into your model