Optimize custom machine learning operations with Metal tensors

Optimize custom machine learning operations with Metal tensors

Unlock powerful machine learning performance with the Metal Tensor API and Metal Performance Primitives (MPP) Tensor Ops library. Discover how to create portable operations that take advantage of Neural Accelerators in Apple M5 and A19 GPUs. Learn to build custom machine learning kernels for your Core AI applications, and find out how to work effectively with quantized data formats and GPU memory optimization.

Chapters
- 0:00 - Introduction
- 0:21 - Apple's ML software stack
- 2:25 - Managing quantized data
- 4:23 - Multi-plane tensors
- 5:17 - Quantized matrix multiplication
- 9:31 - Building advanced ops
- 13:35 - Integrating custom ops into Core AI
- 15:25 - Next steps
Resources
Related Videos

Tech Talks
- Accelerate your machine learning workloads with the M5 and A19 GPUs
WWDC25
- Combine Metal 4 machine learning and graphics
Hello, my name is Shiyao. I'm a GPU Software Engineer. Today, I am pleased to guide you through an exploration of Metal tensors, and show you how to write optimized custom ML kernels with TensorOps. Apple platforms provide first-class support for running ML models at every layer of the software stack. High-level frameworks like Core AI and MLX make it easy to deploy your models with minimal code, while lower-level APIs like Metal Performance Shaders provide access to high-performance Metal kernels. These layers all build on the low-level acceleration provided by Metal Performance Primitives and the TensorOps library. There are a few reasons why you might want to work at the Metal level. ML research moves quickly, so you might want to implement custom operations which can plug into a higher level frameworks such as Core AI. You may also need to write Metal kernels if you're contributing to an ML framework such as MLX or llama.cpp or if you're working on a Metal-based application. The easiest way to get started is using the TensorOps library. TensorOps is a Metal Shading Language API which accelerates tensor operations on the GPU, including matrix multiplication and convolution. It automatically uses any available hardware acceleration across all Apple Silicon GPU generations, so you don't need to worry about the differences between hardware generations. In particular, it takes full advantage of the neural accelerator in the M5 chip family.
The neural accelerator is a new hardware block in M5, located directly in each shader core. It sits alongside the other GPU pipelines and is designed to accelerate dense compute-bound work such as the prefill stage of an LLM.
You can check out the related sessions to learn the basics of getting started with TensorOps. In this session, I'll build on those basics, starting with best practices for working with quantized data. Then, I'll show you how to build advanced custom operations such as FlashAttention.
Let's dive into the first topic — working with quantized data.
As we know, state-of-the-art machine learning models are getting larger. The inference stage is typically memory bandwidth bound, so compressing the weights becomes necessary both to better fit models into memory and to save memory bandwidth.
The standard approach for compressing weights is quantization. The idea is simple — take higher-precision weights and reduce them into lower-precision data types. For example, 16-bit half-precision weights could be compressed down to just 4-bits. These quantized weights are paired with scale factors, which let us scale the quantized values back into the original range when it's time to compute.
In addition to 16- and 32-bit floating point types, TensorOps now natively supports quantized data types. We added a support for 4- and 8-bit integer types in an update to macOS and iOS 26, and we're extending support to even more data types in macOS and iOS 27. This includes 4- and 8-bit floating point types and 2-bit integer types.
You can simply create and pass your app's quantized tensors to TensorOps and it will automatically take advantage of any available hardware acceleration.
Creating a tensor with a quantized data type is very similar to creating a regular tensor. You fill in your descriptor's properties like any other tensor, but simply specify a quantized dataType. Then create the tensor by calling newTensorWithDescriptor on your Metal device.
So that's how you can store your quantized element data. Next, let's talk about the scale factors. In macOS and iOS 27, a single MTLTensor object can now represent your scales alongside your tensor's quantized data as an additional scale plane. This plane supports the popular FP8 E8M0 block-wise scale factor format. Each element of the scale plane applies to a block of elements in the data plane. Declaring the scale plane is similar to declaring a tensor.
First, create a descriptor object for the scale plane. Then fill in the dataType and blockFactors. Finally, create an auxiliary plane map to specify that this plane is for scales.
Then simply attach the auxiliary planes map to your original tensorDescriptor. The quantized data, scales, and metadata will all be packed into a single tensor object.
Now let's put this into practice by extending a basic matrix multiplication kernel to support quantization.
Matrix multiplication is the core operation in machine learning workloads. For instance, LLMs perform millions of matrix multiplications during inference.
We covered the basics of how to write a high performance matrix multiplication kernel with TensorOps in the M5 machine learning talk. The basic approach is to slice the input matrices into smaller tiles, and then perform tile-wise matrix multiplications using TensorOps. This maximizes parallelism and keeps data in the cache.
We can use quantization to further reduce memory traffic and fit larger models in memory. In the kernel, it helps to define type aliases up front before binding the tensors. Here we declare a scales factor plane with fp8_e8m0_ data type, and a block size of 32 by 1. That means every 32 elements in the data plane share a single element in the scales_plane. Then we declare a full tensor type, specifying an FP8 data type along with the scales_plane. You can simply bind these tensors to buffer binding points. The kernel will then have access to the tensors you've allocated on the host side. Alternatively, if you don't want to create a full MTLTensor on the host, you can create a temporary tensor right on the shader's stack. The syntax is almost identical, just swap the tag tensor_handle with tensor_inline. Then pass your buffer pointers and other metadata to the tensor constructor to create a tensor on the stack.
As I mentioned earlier, we'll divide the problem over many threadgroups for better parallelism. First, we'll slice out the tile for each threadgroup and then perform the multiplication with TensorOps.
To do this, simply call slice on your input and output tensors using the threadgroup ID. The data and scales plane will both be sliced simultaneously according to the block size. Setting up the matrix multiplication with quantized tensors is identical to normal tensors. First, set up the matmul2d_descriptor, specifying the tile sizes and other parameters. Then create a matmul2d op, specifying the number of simdgroups in the threadgroup. Then simply pass in your quantized tensors and TensorOps will handle dequantization for you.
In most cases, you should feed your quantized data straight into TensorOps so that it can automatically utilize any available hardware acceleration. However, if you need to dequantize a custom format, TensorOps still have you covered. The simplest approach is to have each thread load a chunk of quantized data from device memory and dequantize it to f16 values in threadgroup memory. You can then pass it as an inline threadgroup tensor to TensorOps. However, this approach requires extra loads and stores through threadgroup memory. Ideally, we would keep all this data in thread registers instead. You can do this by dequantizing the data into a cooperative tensor, which can now be passed as an input to the matmul2d op. Cooperative tensors distribute their storage across the thread private memory of the threads participating in the matmul operation. So if you can't use quantized tensors directly, you can still skip the round trip through threadgroup memory.
To recap — Metal tensors natively support a wide range of quantized data types, including the new MX scaling formats and E8M0 scale factors coming in iOS and macOS 27. Note that these new data types have additional alignment requirements compared to the larger data types, so be sure to check the Metal documentation for details.
Now let's take it up a notch — building a full, more complex custom operation with TensorOps.
Attention is at the core of every transformer network, including LLMs. To compute attention, you first multiply two matrices together called Q and K. Next, you compute SoftMax using reductions on the rows of the intermediate matrix.
Finally, you multiply by a third matrix called V. The popular FlashAttention algorithm fuses all of these operations together into a single kernel.
To implement this with TensorOps, you'll first need to set up a custom simd group mapping so that each simd group owns complete rows of the intermediate matrix. This allows you to compute the SoftMax without exchanging data between simd groups. You can do this using the execution_simdgroup operation scope. This means that each simd group will perform an independent matrix multiplication in parallel.
You can use the simd group ID to slice your input tiles. We'll use a cooperative tensor to store the intermediate matrix so that we can use it as an input to the next step without writing it to the memory. We'll compute SoftMax on the result.
To do this, we'll need to compute a couple of reductions on the cooperative tensor. TensorOps includes a reduce_rows function to help with this.
Threads will exchange data amongst themselves to calculate the max for each row. The result is returned in another cooperative tensor.
Let's set it up. First, create a cooperative tensor to store the reduction output. Then pass the source and destination to the reduce_rows function. Here we'll use the max reduction_operation with an initial value of negative INFINITY.
These two cooperative tensors have different shapes, so to help map between them, TensorOps also includes a map_iterator function. Given an iterator pointing to an element in the 2D tensor, it returns an iterator pointing to the corresponding element in the reduction destination.
First, set up a loop over the 2D cooperative tensor using iterators. Then call map_iterator to map each element to its corresponding row max. Finally, dereference these iterators to compute SoftMax and store the result back into the cooperative tensor.
Now we're ready to multiply this cooperative tensor by V. In macOS 26, you would have had to first store it to threadgroup memory. But it's now possible to use cooperative tensors directly as inputs to matmul operations.
To do this, call get_left_input_cooperative_tensor method, passing the source cooperative tensor as an argument. You can then pass the result as an input to the second matmul operation. One thing to watch out for: not every cooperative tensor can be reused as an input. The layouts may different depending on the data types and other factors. So before you do this, call the is_compatible_as_left or right _input method to check for compatibility.
If it returns true, you're good to go. If not, you'll need to store and reload the data through threadgroup memory to convert it to the correct layout. Either way, the call to op.run is the same. Those are the key TensorOps features you'll need to build an advanced operation like FlashAttention using TensorOps. Now that we've walked through how to build this operation, let's see how it runs in a real model using Core AI. Core AI provides tools for Python developers to convert Pytorch models to Core AI models, including support for custom Metal kernels. Check out the "Deep Dive into Core AI Model authoring and Optimization" session for the details of how to integrate a Metal kernel into a Core AI model.
I've followed the steps outlined in that session to integrate our custom FlashAttention kernel into a Sam3 image segmentation model. We define the body of our custom attention kernel as a string in Python and register the TorchMetalKernel object, shown here.
Then, we replace the default huggingface attention implementation with one that calls our kernel, shown here.
Finally, we load the model from huggingface and export it from PyTorch as an optimized Core AI asset. The export will take a moment to finish.
Now we're ready to do inference.
Sam3 performs promptable concept segmentation, so we provide the model with an image and text, and then it responds with a segmentation mask indicating where objects are located in the image. Here, I'm prompting the model to label all pixels containing a car in this image.
Ok, now, I'll run the segmentation.
Looking at the final result, we can see the model correctly segmented the image. The car is highlighted in blue, so our attention kernel is fully integrated into the model as expected.
Today, I've covered all the tools you can use to build optimized custom ML kernels on Apple Silicon. From quantized data types, to advanced TensorOps features like cooperative tensors and reductions, to integrating with Core AI. To go further, explore the Metal Performance Primitives documentation for the full API reference, and the programming guide for more performance optimization guidelines. You can also download the TensorOps sample code to see the details that I couldn't cover here. And be sure to check out the related sessions to learn more about Core AI and Metal. Thank you!

// Creating a tensor with a quantized data type from device

#define RANK 2

MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];

tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
tensorDesc.usage = MTLTensorUsageCompute;

NSInteger dimensions[RANK] = {NumCols, NumRows};
tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];

NSError *err = nil;
id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];

4:48 - Declare a multi-plane tensor with scale factors

// Creating a tensor with a scales auxiliary plane from device

#define RANK 2

MTLTensorAuxiliaryPlaneDescriptor *planeDesc = [MTLTensorAuxiliaryPlaneDescriptor new];
planeDesc.dataType = MTLTensorDataTypeMetalFloat8UE8M0;

NSInteger blockFactors[RANK] = {32, 1};
planeDesc.blockFactors = [[MTLTensorExtents alloc] initWithRank:RANK values:blockFactors];

MTLTensorAuxiliaryPlaneDescriptorMap *auxiliaryPlanes =
    [MTLTensorAuxiliaryPlaneDescriptorMap new];
[auxiliaryPlanes setDescriptor:planeDesc forPlane:MTLTensorPlaneTypeScales];

MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
tensorDesc.usage = MTLTensorUsageCompute;

NSInteger dimensions[RANK] = {NumCols, NumRows};
tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
tensorDesc.auxiliaryPlanes = auxiliaryPlanes;

NSError *err = nil;
id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];

6:07 - MSL type aliases for an MXFP8 tensor handle

// Type aliases for a MXFP8 multi-plane tensor handle

#include <metal_tensor>

using namespace metal;

using scales_plane = tensor_blockwise<tensor_plane_scales,
                                      device metal_fp8_ue8m0_format,
                                      32, 1>;

using mxfp8_tensor = tensor<device metal_fp8_e4m3_format,
                            dextents<int, 2>,
                            tensor_handle,
                            scales_plane>;

kernel void matmul(mxfp8_tensor matrixA [[buffer(0)]],
                   mxfp8_tensor matrixB [[buffer(1)]],
                   tensor<device half, dextents<int, 2>> matrixC [[buffer(2)]])
{
    // ...
}

6:51 - Declare an inline MXFP8 tensor on the stack

// Type aliases for a MXFP8 multi-plane tensor inline

#include <metal_tensor>

using namespace metal;

using scales_plane = tensor_blockwise<tensor_plane_scales,
                                      device metal_fp8_ue8m0_format,
                                      32, 1>;

using mxfp8_tensor_inline = tensor<device metal_fp8_e4m3_format,
                                   dextents<int, 2>,
                                   tensor_inline,
                                   scales_plane>;

// Construct tensor on the stack from buffer pointers
mxfp8_tensor_inline matrixA(dataBufferA,
                             dextents<int, 2>(K, M),
                             array<int, 2>({ 1, K }),
                             scales_plane(scalesBufferA));

7:19 - Slice tensors and run a quantized matmul

// Slice the tensors to extract the relevant tile
auto tA = matrixA.slice(0, tgid.y * TILEM);
auto tB = matrixB.slice(tgid.x * TILEN, 0);
auto tC = matrixC.slice(tgid.x * TILEN, tgid.y * TILEM);

// Set up the matmul descriptor
constexpr auto descriptor = matmul2d_descriptor(TILEM,                  // M
                                                TILEN,                  // N
                                                dynamic_length_v<int>,  // K
                                                false,   // Left matrix transposed
                                                false);  // Right matrix transposed

matmul2d<descriptor, execution_simdgroups<4>> op;

// Run the op — TensorOps handles dequantization automatically
op.run(tA, tB, tC);

10:27 - Set up simdgroup-scoped QxK multiplication

// Setup QxK matrix multiplication op
constexpr auto mul_qk_op_desc = matmul2d_descriptor(/* ... */);
matmul2d<mul_qk_op_desc, execution_simdgroups> mul_qk_op;

// Slice Q, K, V
auto tQSlice = tQ.slice<D, ROWS_PER_SIMD>(0, sgid * ROWS_PER_SIMD);
auto tKSlice = tK.slice<D, BK>(0, k);
auto tVSlice = tV.slice<D, BK>(0, k);

// Create cooperative tensor to store tile of QxK
auto ctQK = mul_qk_op.get_destination_cooperative_tensor<decltype(tQSlice),
                                                         decltype(tKSlice),
                                                         float>();

// Multiply QxK
mul_qk_op.run(tQSlice, tKSlice, ctQK);

11:18 - Compute row-wise reduction for SoftMax

// Create a cooperative tensor to store row reduction output
auto ctTileRowMax = mul_qk_op.get_row_reduction_destination_cooperative_tensor<
                        decltype(tQSlice),
                        decltype(tKSlice),
                        float>();

// Compute max over each row of QxK tile
reduce_rows(ctQK, ctTileRowMax, reduction_operation::max, -INFINITY);

11:56 - Compute element-wise SoftMax with map_iterator

// Iterate over elements of QxK tile
#pragma clang loop unroll(full)
for (auto it = ctQK.begin(); it != ctQK.end(); it++) {
    // Fetch row max corresponding to this element
    auto row_it = ctRowMax.map_iterator(it);

    // Subtract row max from each element and compute exponent
    *it = exp(*it - *row_it);
}

12:33 - Reuse cooperative tensor as matmul input

constexpr auto mul_sv_op_desc = matmul2d_descriptor(/* ... */);
matmul2d<mul_sv_op_desc, metal::execution_simdgroup> mul_sv_op;

if (mul_sv_op.is_compatible_as_left_input<float, half, float>(ctQK)) {
    // Directly reuse cooperative tensor as input
    auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>(ctQK);
    mul_sv_op.run(ctQKIn, tVSlice, ctO);
} else {
    // Store and reload through threadgroup memory if layout is not compatible
    ctQK.store(tgTensor);
    simdgroup_barrier(mem_flags::mem_threadgroup);

    auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>();
    ctQKIn.load(tgTensor);
    mul_sv_op.run(ctQKIn, tVSlice, ctO);
}

- 0:00 - Introduction
- Overview of how Metal tensors and TensorOps enable you to write optimized custom ML kernels on Apple Silicon.
- 0:21 - Apple's ML software stack
- A tour of Apple's ML software stack, from high-level frameworks like Core AI and MLX down to Metal Performance Shaders, Metal Performance Primitives, and the TensorOps library — and why you might want to work at the Metal level.
- 2:25 - Managing quantized data
- How quantization reduces memory bandwidth requirements for large models, and the new quantized data types natively supported by TensorOps, including MX scaling formats.
- 4:23 - Multi-plane tensors
- How a single MTLTensor object can now represent both quantized element data and scale factors as separate planes, and how to configure multi-plane tensor descriptors in your Metal shaders.
- 5:17 - Quantized matrix multiplication
- How to extend a tiled matrix multiplication kernel to support quantized inputs, including binding scales planes, using inline tensors, slicing with threadgroup IDs, and handling custom dequantization formats.
- 9:31 - Building advanced ops
- How to implement Flash Attention with TensorOps, covering custom SIMD group mappings, cooperative tensors, row reductions, SoftMax, and the new API for passing cooperative tensors directly as matrix multiplication inputs — eliminating the threadgroup memory round-trip.
- 13:35 - Integrating custom ops into Core AI
- How to integrate a custom Metal TensorOps kernel into a Core AI application, using Core AI's Python tools to convert PyTorch models and plug in custom Metal operations.
- 15:25 - Next steps
- A summary of the TensorOps features covered — quantized types, multi-plane tensors, Flash Attention, and Core AI integration — with pointers to sample code and related sessions on Core AI and Metal.

Explore Get Started

Stay Updated

Explore Platforms

Featured

Explore Technologies

Featured

Explore Community

Featured

Explore Documentation

Release Notes

Explore Downloads

Featured

Explore Support

Featured

Quick Links

Chapters

Resources

Related Videos

Tech Talks

WWDC25