Build real-time neural rendering pipelines with Metal

Discover how to integrate machine learning into your real-time rendering pipeline using Metal 4. We'll explore practical adoption patterns and best practices for achieving production-quality results with MetalFX neural denoising, featuring real-world insights from Maxon's Redshift Live. Learn how to train and deploy a neural tone mapper using the ML command encoder inline with your graphics work. Finally, dive into the new tensor API to build and evaluate small, specialized neural networks directly within your shaders.

Chapters
- 0:00 - Introduction
- 2:16 - MetalFX Denoising
- 9:57 - Deploy custom ML networks with Metal 4
- 13:40 - Inline neural networks with tensorOps
- 20:55 - Next steps
Related Videos

WWDC25
- Combine Metal 4 machine learning and graphics
- Go further with Metal 4 games
Hi, I'm Yulia, a GPU Software Engineer here at Apple.
Today, I'll share how to bring machine learning to your real-time rendering pipeline with Metal 4. You'll learn practical ways to integrate machine learning into your renderer, best practices for building high-performance pipelines, and two techniques to start adopting today.
Machine learning is moving from research into production in real-time rendering. Across the rendering pipeline, many established techniques that have traditionally relied on analytical methods can also be implemented with machine learning.
Neural denoising, neural textures, learned tone mapping and many others are among the techniques that can leverage machine learning. At every stage of the pipeline, these approaches can improve quality, performance, or memory footprint. I'll share just how this works in Metal.
On Apple platforms, you have a complete machine learning toolset for your rendering needs. At the highest level, MetalFX provides a ready-to-use neural denoising and upscaling API as a fully integrated, black box solution.
The Metal 4 ML command encoder lets you run pre-trained models directly in your command buffer, giving you more control over integration and scheduling.
And at the most flexible level, the TensorOps API provides the building blocks to design and run custom models directly in your shaders, enabling you to fully leverage the neural accelerator introduced in our M5 and A19 Pro Apple silicon GPUs. Today, I'll talk about all of these in turn.
Here is the plan. I'll cover how to adopt MetalFX and achieve production-quality results in your rendering pipeline, using Maxon's Redshift Live as an example of a modern real-time path tracing viewport that adopted MetalFX Denoising following Apple's best practices.
Then, I will describe how you can train a neural tone mapper and deploy it with Metal 4.
Finally, I will explain how to build a small network directly in a shader using the TensorOps API. It starts with MetalFX.
In your path tracer, your frame budget might allow only one or a few samples per pixel to stay interactive. However, an image rendered with one sample per pixel is naturally noisy.
To keep the quality bar, use MetalFX Denoising. It is designed specifically for the low latency demands of a live viewport.
MetalFX Denoising is a combined neural upscaler and denoiser, the platform-integrated solution, optimized for Apple silicon.
You can integrate it easily in your pipeline. You will need to generate a few extra auxiliary inputs like diffuse albedo, depth, and a few others. Depending on your renderer, you might already have produced those. You feed all these inputs to MetalFX, which produces a beautiful denoised image.
From there, you complete your pipeline with post processing and displaying the output.
This is Redshift Live, Maxon's modern real-time path tracer, rendering one of their high-quality 3D assets in Cinema 4D on Apple silicon. You get all the benefits of path tracing directly in the viewport, but during camera movement you can see some noise from the one sample-per-pixel presentation. Enable the MetalFX denoiser, and the image becomes dramatically more stable and noise-free.
Redshift Live can now deliver clean, near-final image quality at interactive frame rates, with real-time ray-traced lighting, shadows, and global illumination. Now artists can see lighting effects take place in real-time in their viewport, like this tree being moved. This becomes possible when you combine hardware-accelerated ray tracing with MetalFX neural denoising.
Here is an example of a one-sample-per-pixel frame rendered by Redshift Live. By leveraging both spatial and temporal techniques, MetalFX is able to transform the noisy one-sample-per-pixel frame into an image with near-final quality, in real time. To get all the details on the inputs and how to leverage MetalFX in your application, check out the "Go further with Metal 4 games" session.
I'll outline three key best practices that Maxon used to get the best quality from MetalFX, starting with denoiser inputs and noise.
The output quality of the denoiser depends directly on the quality of your inputs. Normally, your auxiliary inputs are noise free; do your best to keep them that way. Among all the inputs, the diffuse albedo is the strongest signal for denoising. When in doubt, make it as close as possible to a noise-free version of the final result you want to see on the screen.
Consider building debug views for each input directly in your engine.
Use a GPU capture to inspect textures frame-by-frame. This will allow you to validate your inputs and make sure they look the way the model expects.
You might have some noise-free layers in your scene, or some parts you don't want to denoise as strongly. You have two tools at your disposal: the transparency overlay and the denoiser strength mask. Using them will help you maximize the quality in these scenarios.
Particles, fog, volumetrics, and sky are effects that don't have a meaningful surface and might be already noise free based on your rendering pipeline.
MetalFX will denoise and upscale your noisy input.
For those noise free effects, you can leverage the MetalFX transparency overlay input instead. The overlay input will only be upscaled and composited in the final result for you. For areas that are already noise free, like the sky, you can configure MetalFX to skip denoising for those pixels, using the denoiser strength mask. I'll share an example.
Here, the sky has been marked as not to be denoised. Depending on your use case, the value can be tuned from zero, meaning no denoising, all the way to one, meaning denoising at maximum strength.
This gives you control over the denoising effect in the scene. By now you should already have a great output from MetalFX, but there are a few tricky cases with reflection and transmission. This is what the second best practice will help you with.
A mirror has no color of its own. The viewer sees the reflected surface. As discussed previously, your inputs, and especially the diffuse albedo, should represent the final desired output as closely as possible. For mirror-like objects, store the properties of the reflected geometry, like albedo, normal, and roughness.
Glass builds on the same foundational concepts and pushes them a bit further. The viewer sees a combination of what is reflected and what is transmitted, which could be noisy. One solution is to blend the reflected and transmitted geometry properties, like diffuse albedo, by the Fresnel term, substantially reducing the noise of your inputs. The Fresnel term tells you how much light is reflected versus refracted at a given intersection point.
On the left, you can see the primary surface albedo while on the right, it is replaced by the combined reflected and refracted albedo.
This is a well known technique called primary surface replacement. Getting this right will keep the reflection beautiful and sharp.
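The Fresnel-weighted blend described above can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions: it uses Schlick's approximation for the Fresnel term (the session does not say which formulation Redshift Live uses), and the names `fresnelSchlick`, `blendedAlbedo`, and `Color` are hypothetical, not part of MetalFX.

```cpp
#include <cassert>
#include <cmath>

struct Color { float r, g, b; };

// Schlick's approximation of the Fresnel reflectance (an assumed formulation).
// cosTheta: cosine of the angle between the view direction and the normal.
// f0: reflectance at normal incidence (roughly 0.04 for common glass).
float fresnelSchlick(float cosTheta, float f0) {
    assert(cosTheta >= 0.0f && cosTheta <= 1.0f);
    return f0 + (1.0f - f0) * std::pow(1.0f - cosTheta, 5.0f);
}

// Blend the albedo seen through reflection with the albedo seen through
// refraction, weighted by the Fresnel term. The result is what you would
// write to the diffuse albedo input in place of the glass surface's own albedo.
Color blendedAlbedo(Color reflected, Color refracted, float fresnel) {
    return { fresnel * reflected.r + (1.0f - fresnel) * refracted.r,
             fresnel * reflected.g + (1.0f - fresnel) * refracted.g,
             fresnel * reflected.b + (1.0f - fresnel) * refracted.b };
}
```

At grazing angles the Fresnel term approaches one and the reflected albedo dominates; when you look straight at the surface, the refracted albedo dominates.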
Now that your materials look rich, and your reflection and refraction are sharp, let's dive into the third best practice: get your motion vectors right. Correct motion vectors are essential for temporal stability.
Motion vectors are per-pixel screen-space displacements from the current frame to the previous frame.
For every pixel, the motion vector should answer the question: where was this pixel in the previous frame? Motion vectors have been a staple of modern rendering techniques.
Getting motion vectors right makes the difference between a blurry result and a sharp output under motion. The model uses motion vectors to understand the scene under motion and over time.
MetalFX expects dejittered motion vectors, meaning without the sub-pixel shifts. Without dejittering, MetalFX might receive motion vectors that are off by up to one pixel, creating edge shimmering. Here is how you can compute them correctly.
Here is the code to compute camera-only motion vectors for static objects. You start by computing the projected position of the current vertex. Then project the same position through the previous frame's matrix.
Your motion vector is the difference between the two.
However since the camera matrices were jittered, subtract the jitter deltas from the current and previous frame.
Finally, you get a clean, unjittered motion vector for cleaner motion. For moving objects and deforming geometry, the camera-only path won't see the displacement. Store each vertex's previous-frame world position, or skin twice, and compute the actual motion vector. For cases where motion is genuinely unreliable, such as fast motion or alpha-blended particles, use the reactive mask. For more on the reactive mask, check out "Go further with Metal 4 games".
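For moving or deforming geometry, the same idea can be sketched as a CPU-side C++ reference. It is a minimal sketch, assuming the jitter is baked into each frame's view-projection matrix as a translation in normalized device coordinates, and that you stored the vertex's previous-frame world position. The types `Float2`, `Float3`, `Mat4` and the function names are hypothetical helpers, not Metal or MetalFX API.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };
using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, hypothetical helper type

// Project a world-space point to normalized device coordinates (x and y only).
Float2 projectToNdc(const Mat4& viewProj, Float3 p) {
    float clip[4];
    for (int row = 0; row < 4; ++row) {
        clip[row] = viewProj[row][0] * p.x + viewProj[row][1] * p.y +
                    viewProj[row][2] * p.z + viewProj[row][3];
    }
    assert(std::fabs(clip[3]) > 0.0f);  // the point must not sit on the camera plane
    return { clip[0] / clip[3], clip[1] / clip[3] };
}

// Motion vector for a moving vertex: the current position goes through the
// current matrix, the stored previous-frame position goes through the previous
// matrix, and the jitter delta is removed afterwards.
// Assumption: both jitter values are expressed in NDC units.
Float2 objectMotionVector(const Mat4& viewProjCurrent, const Mat4& viewProjPrevious,
                          Float3 worldPosCurrent, Float3 worldPosPrevious,
                          Float2 jitterCurrent, Float2 jitterPrevious) {
    Float2 ndcCurrent = projectToNdc(viewProjCurrent, worldPosCurrent);
    Float2 ndcPrevious = projectToNdc(viewProjPrevious, worldPosPrevious);
    return { (ndcPrevious.x - ndcCurrent.x) - (jitterPrevious.x - jitterCurrent.x),
             (ndcPrevious.y - ndcCurrent.y) - (jitterPrevious.y - jitterCurrent.y) };
}
```

The only difference from the camera-only path is that the two projections use two different world positions, so object motion and camera motion both end up in the result.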
This is what it looks like in practice. Redshift Live from Maxon ships every best practice I just covered, getting the most out of MetalFX Denoising, running on Apple silicon, and delivering near-final image quality.
Now, I'll take you beyond platform solutions, and share how you can build your own ML-powered solutions. Neural rendering goes well beyond denoising. More and more techniques across the pipeline are becoming machine learning based, and with Metal 4, you have the tools to build and deploy your own.
Metal 4 gives you two ways to bring your own machine learning techniques into the pipeline. The machine learning command encoder lets you deploy a trained model right in your command buffer, in the same pipeline, without a context switch. The TensorOps API lets you build a small hardware-accelerated network directly in your shader.
For more details on both APIs, check out "Combine Metal 4 machine learning and graphics". Today I'll focus on tone mapping.
Most renderers have extended post-processing pipelines, with stages like tone mapping, color grading, or film emulation, to correctly map the HDR image to something that can be displayed and matches the artistic vision. The pipeline is composed of multiple stages, each with its own parameters and concatenated outputs.
The pipeline can grow arbitrarily complex. The best results come from understanding the content of the image, and that's exactly what a neural network can learn.
The idea is simple. Take your existing whole color pipeline or part of it and replace it with a single neural network. The network will learn the color transformation.
An example of such a workflow is HDRNet, a 2017 architecture from Gharbi and colleagues.
Here's the bird's-eye view on how it works. The network works on a small downsampled version of the image. It performs two types of analysis, a global and local one to capture both scene level and small details. This process allows the network to create color transformations for 16x16 tiles of the image. These localized transformations are applied with smart, edge-aware techniques to produce the beautiful tone mapped final result.
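To make the last step more concrete, here is a rough CPU-side C++ sketch of applying a per-tile color transform to a pixel. It is heavily simplified and rests on assumptions: each tile's transform is modeled as a 3x4 affine matrix, the tile is chosen by a plain nearest-tile lookup rather than the edge-aware technique the real architecture uses, and the grid size is a parameter. All names (`Affine3x4`, `applyAffine`, `toneMapPixel`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Rgb { float r, g, b; };

// One affine color transform: out = M * [r, g, b, 1]^T (assumed 3x4 layout).
using Affine3x4 = std::array<std::array<float, 4>, 3>;

Rgb applyAffine(const Affine3x4& m, Rgb c) {
    return { m[0][0] * c.r + m[0][1] * c.g + m[0][2] * c.b + m[0][3],
             m[1][0] * c.r + m[1][1] * c.g + m[1][2] * c.b + m[1][3],
             m[2][0] * c.r + m[2][1] * c.g + m[2][2] * c.b + m[2][3] };
}

// Look up the transform of the tile containing pixel (x, y) and apply it.
// gridSize is the number of tiles per axis; tiles are stored row by row.
// Simplification: nearest-tile lookup, with no edge-aware interpolation.
Rgb toneMapPixel(const std::vector<Affine3x4>& tiles, int gridSize,
                 int x, int y, int width, int height, Rgb hdr) {
    assert(static_cast<int>(tiles.size()) == gridSize * gridSize);
    assert(x >= 0 && x < width && y >= 0 && y < height);
    int tileX = x * gridSize / width;
    int tileY = y * gridSize / height;
    return applyAffine(tiles[tileY * gridSize + tileX], hdr);
}
```

A nearest-tile lookup like this would produce visible seams at tile borders, which is why the network's transformations are applied with edge-aware techniques instead.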
To create this solution, you would first develop and train the network in your framework of choice, for example PyTorch. The training data could be derived from previous projects that were tone mapped manually, or from a large set of tone mapped images generated by your renderer. Once the model is trained, export it to an MTLPackage.
To execute your network in Metal 4, there are a few steps for both setup and execution. First, set up the pipeline by loading the MTLPackage, specifying the network function with a function descriptor, and creating a machine learning pipeline descriptor. This process is very similar to loading regular pipelines.
The next step is to dispatch your network execution. To do that, you create an encoder, create an argument table with the inputs and outputs, and finally dispatch the command buffer. That kicks off the execution, where you have a mix of compute, machine learning, and rendering work happening at the same time.
Here's the updated pipeline. First, your path tracer produces samples, followed by MetalFX denoising and the new neural tone mapper, all encoded in the same command buffer, executing in the same frame.
The ML encoder replaced your entire multi-stage post-processing chain with a single neural evaluation.
I've shared how you can train and deploy your networks. Now, go one level deeper and build small networks directly in your shaders with the TensorOps API.
So far you have explored large general-purpose networks trained offline on a very large dataset. Now I will show you the opposite approach: tiny networks for one specific task, with a few thousand parameters or fewer, trained on your scene data, sometimes even trained online every few frames. The network only sees one scenario, so it does not need to generalize.
So far you have learned how to execute ML in the same command buffer as a standalone step.
Here it is executing alongside compute and render.
However a small network can fit inline in your shader, among the rest of your code, ALU and texture sampling instructions.
The key enabling technology is TensorOps, available in any stage of the rendering pipeline.
All this combined unlocks new possibilities and workflows that involve online training.
Here's an example, a skybox used for image based lighting. The skybox is casting light on the geometry in the scene, creating a natural soft illumination. The soft illumination is the result of the average light coming from all visible directions at a specific point. Normally, this result is precomputed offline and sampled at runtime.
However, a scene is rarely static. You might have a dynamic day-night cycle.
Your offline learned signal may be out of sync.
This is a learnable function for a neural network, and this is where online training comes into play. Here is how you could recreate this technique.
Based on what you learned about the machine learning encoder so far, a simplified rendering loop might look like this. First, you update your world so that all the information is up to date for rendering.
Next, you dispatch the machine learning encoder to run the inference on the model, and produce the necessary lighting information that you will use later for shading.
Online training disrupts this paradigm. By creating your own training and inference routines, you can run one or more training iterations per frame to improve the model accuracy.
This is what the online training loop would look like for the sky illumination model. You start by generating a direction you wish to sample and run inference on your model to get the result.
Then you compute the analytical solution to the sky illumination problem, which you use to compute the error, and finally, run a back propagation pass to progressively improve the model. This is the exact same flow you could use to train offline, but this time, training iterations are repeated across frames.
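The loop just described (sample a direction, run inference, compare against the analytical answer, update the model) can be sketched on the CPU in C++. To keep it short, this sketch makes several assumptions: the model is a single linear layer rather than a full network, `referenceSkyColor` is a made-up vertical gradient standing in for the real analytical solution, and the learning rate and iteration counts are arbitrary. None of these names come from Metal.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <random>

using Vec3 = std::array<float, 3>;

// Hypothetical stand-in for the analytical sky illumination: a vertical gradient.
Vec3 referenceSkyColor(const Vec3& dir) {
    float up = 0.5f * (dir[1] + 1.0f);  // 0 at the nadir, 1 at the zenith
    return { 0.2f + 0.3f * up, 0.3f + 0.4f * up, 0.5f + 0.5f * up };
}

// Deliberately tiny model: color = W * dir + b (one linear layer, an assumption).
struct LinearSkyModel {
    std::array<Vec3, 3> w{};
    Vec3 b{};

    Vec3 infer(const Vec3& dir) const {
        Vec3 out = b;
        for (int o = 0; o < 3; ++o)
            for (int i = 0; i < 3; ++i) out[o] += w[o][i] * dir[i];
        return out;
    }

    // One training iteration: inference, error against the reference, then a
    // gradient-descent update on the squared error. Returns the squared error.
    float trainStep(const Vec3& dir, float learningRate) {
        Vec3 predicted = infer(dir);
        Vec3 target = referenceSkyColor(dir);
        float loss = 0.0f;
        for (int o = 0; o < 3; ++o) {
            float error = predicted[o] - target[o];
            loss += error * error;
            // d(error^2)/dw = 2 * error * dir[i], d(error^2)/db = 2 * error
            for (int i = 0; i < 3; ++i) w[o][i] -= learningRate * 2.0f * error * dir[i];
            b[o] -= learningRate * 2.0f * error;
        }
        return loss;
    }
};

// Generate a uniformly distributed unit direction to sample.
Vec3 randomDirection(std::mt19937& rng) {
    auto unit = [&rng]() { return static_cast<float>(rng() >> 8) / 16777216.0f; };
    float y = 2.0f * unit() - 1.0f;
    float phi = 6.2831853f * unit();
    float r = std::sqrt(std::fmax(0.0f, 1.0f - y * y));
    return { r * std::cos(phi), y, r * std::sin(phi) };
}

// Run a few training iterations per frame; returns the last iteration's error.
float trainFrames(LinearSkyModel& model, int frames, int iterationsPerFrame) {
    assert(frames > 0 && iterationsPerFrame > 0);
    std::mt19937 rng(1234u);
    float lastLoss = 0.0f;
    for (int f = 0; f < frames; ++f)
        for (int k = 0; k < iterationsPerFrame; ++k)
            lastLoss = model.trainStep(randomDirection(rng), 0.1f);
    return lastLoss;
}
```

Calling `trainFrames(model, 200, 4)` mimics 200 frames with four training iterations each. In a real renderer, the reference function would change as the sky changes, and the model would keep tracking it frame after frame.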
So, you are now running your own inference and training routines. This enables you to run the inference pass inline in your shading pass, and TensorOps will allow you to implement this very efficiently. You now have a model that adapts to the new world conditions every frame, and you can use this information for shading right away. This would not be possible with the standard offline training workflow. This concept generalizes to any technique that can learn a signal.
Here is how to start building your own solutions. At a high level, a neural network is composed of three main building blocks: the input layer, which processes the network inputs, also known as input features; the output layer, which generates the network's final predictions; and finally the hidden layers, where the magic of learning happens. The sky probe is a small network: its hidden layers group is composed of two hidden layers of four neurons each.
The network takes three floats as input to encode a direction, and produces three floats as output that represent the average illumination coming from that direction, as a color.
This is called a fully connected multilayer perceptron, or MLP for short: here, a 3-4-4-3 network. You can experiment with the input sizes and the number and size of layers to get the best result for your application.
To be able to evaluate your network, you need to prepare your input tensor. It's best to batch multiple inputs at the same time, making it a 2D matrix. For the sky probe example, this will be a 2D matrix of a batch of input directions you wish to evaluate. But the input can contain whatever data might be useful to the network, like positional or material data. The same principle applies to the output tensor. For the sky probe, make it a 2D matrix of a batch of colors.
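As a reference for the shape of the computation, here is a plain C++ sketch of the 3-4-4-3 network evaluated over a batch of directions on the CPU. It does not use the TensorOps API. The ReLU activation and the per-layer bias are assumptions, since the session does not name them, and `Dense` and `SkyProbeMlp` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// A fully connected layer with In inputs and Out outputs: y = W * x + b.
template <std::size_t In, std::size_t Out>
struct Dense {
    std::array<std::array<float, In>, Out> weights{};
    std::array<float, Out> bias{};  // assumption: the layer has a bias term

    std::array<float, Out> forward(const std::array<float, In>& x, bool relu) const {
        std::array<float, Out> y{};
        for (std::size_t o = 0; o < Out; ++o) {
            float sum = bias[o];
            for (std::size_t i = 0; i < In; ++i) sum += weights[o][i] * x[i];
            // Assumed activation: ReLU on hidden layers, none on the output layer.
            y[o] = (relu && sum < 0.0f) ? 0.0f : sum;
        }
        return y;
    }
};

// The 3-4-4-3 sky probe: direction in, color out.
struct SkyProbeMlp {
    Dense<3, 4> hidden1;
    Dense<4, 4> hidden2;
    Dense<4, 3> output;

    // Evaluate a batch: each row of the input is one direction, and each row
    // of the result is the predicted color for that direction.
    std::vector<std::array<float, 3>> evaluate(
        const std::vector<std::array<float, 3>>& directions) const {
        std::vector<std::array<float, 3>> colors;
        colors.reserve(directions.size());
        for (const auto& d : directions) {
            auto h1 = hidden1.forward(d, true);
            auto h2 = hidden2.forward(h1, true);
            colors.push_back(output.forward(h2, false));
        }
        assert(colors.size() == directions.size());
        return colors;
    }
};
```

With these sizes, the network has 16 + 20 + 15 = 51 parameters including biases, which shows how small such a specialized network can be.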
Now that you know the structure of an MLP, here is how you can implement it in your shader and evaluate it in a forward pass.
Now you are ready to begin the evaluation. You have your input tensor and the first hidden layer weights tensor. You can multiply the two together using a matmul 2D tensor operation.
You will obtain a pre-activation result on which you want to apply your activation function. Before doing that, you will need to store your matrix multiplication result. I'll explain how to do that efficiently. You may be familiar with the thread execution scope, where a single thread will be in charge of executing the whole tensor operation. This works great for executing divergent work or in pipeline stages where you don't have full control of a thread group.
However, when you do have full control, new possibilities arise.
In a compute stage, you can use the SIMD group execution scope, where all participating threads work on the same matrix multiplication.
This execution mode also gives you access to cooperative tensors. Cooperative tensor storage is distributed among multiple threads in the thread group, avoiding an expensive round trip to main memory.
By using a cooperative tensor as an output of your first multiplication, the result will stay in fast thread storage memory. Then you can apply your activation function in place.
You can now repeat the same operation of matrix multiplication and activation for the next layer.
And all the subsequent layers, all the way to the output layer, where you can store the resulting tensor and leverage the result in your compute shader immediately, or at a later stage.
On the left, there is the ground truth render computed using raytracing. On the right the neural rendering version. The small neural network was capable of learning the signal efficiently.
This was a high level overview of how you can construct an MLP and evaluate it in your shader using TensorOps. The same exact building blocks can be used to create an efficient back propagation pass needed for the online training step. For all the code details, please check the "Metal Performance Primitives (MPP) Programming Guide".
To recap, today I covered three levels of ML in your rendering pipeline. First, MetalFX gives you platform-integrated neural denoising, with three best practices: keep your inputs clean, store what the viewer sees, and get motion vectors right. Next, the MTLPackage lets you export your offline-trained models and deploy them at runtime. You learned how to replace an entire post-processing pipeline with one neural evaluation. Finally, I covered the TensorOps API, which lets you build tiny networks directly in your shaders, running on the neural accelerator. Each level gives you more control. Pick the one that's right for your app.
Download Xcode and explore the Metal 4 sample code. If your app has real-time requirements, like viewports in pro apps or games, adopt MetalFX Denoising and Upscaling.
Try training a neural tone mapper with your own post-processing pipeline.
And experiment with small specialized networks using the tensor API.
Check out our sessions from previous years for more details.
I can't wait to see what you build.

8:46 - Compute camera-only motion vectors

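#include <metal_stdlib>
using namespace metal;

// Compute a camera-only motion vector for a static point at worldPos.
// getJitter() is the renderer's own helper, defined elsewhere. It is assumed
// to return a frame's sub-pixel offset in the same units as the motion vector.
float2 cameraOnlyMotionVector(float3 worldPos, float4x4 viewProjCurrent,
                              float4x4 viewProjPrevious, uint frameIndex, uint frameIndexPrevious)
{
    // Project the position with the current frame's matrix
    float4 clipCurrent = viewProjCurrent * float4(worldPos, 1.0);
    float2 ndcCurrent = clipCurrent.xy / clipCurrent.w;

    // Project the same position with the previous frame's matrix
    float4 clipPrevious = viewProjPrevious * float4(worldPos, 1.0);
    float2 ndcPrevious = clipPrevious.xy / clipPrevious.w;

    // Displacement from the current frame to the previous frame
    float2 motion = ndcPrevious - ndcCurrent;

    // Get subpixel offset for current and previous frames, then remove the jitter delta
    float2 jitterCurrent = getJitter(frameIndex);
    float2 jitterPrevious = getJitter(frameIndexPrevious);
    motion -= jitterPrevious - jitterCurrent;
    return motion;
}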

- 0:00 - Introduction
- An overview of how machine learning is transforming real-time rendering pipelines on Apple platforms, and a preview of three levels of ML integration: MetalFX Denoising, deploying custom networks with Metal 4, and building tiny networks inline in shaders with tensorOps.
- 2:16 - MetalFX Denoising
- How to integrate MetalFX Denoising into a path tracer running at one sample per pixel. Covers auxiliary inputs (albedo, depth, motion vectors), best practices for clean inputs, transparency overlays, the denoiser strength mask, and primary surface replacement for mirrors and glass — illustrated with Redshift Live from Maxon.
- 9:57 - Deploy custom ML networks with Metal 4
- How to train a neural tone mapper offline (e.g., HDRNet), export it to an MTLPackage, and execute it inside a Metal 4 command buffer alongside your existing rendering passes to replace complex post-processing pipelines with a single network.
- 13:40 - Inline neural networks with tensorOps
- How to build and run small multilayer perceptrons directly inside Metal shaders using the TensorOps API and cooperative tensors. Demonstrates an online-trained sky illumination probe that adapts to dynamic scenes each frame, enabling ML inference and training that runs alongside your existing compute and render work.
- 20:55 - Next steps
- A recap of the three levels of ML integration in rendering pipelines, and guidance on where to start: download Xcode, explore Metal 4 sample code, and adopt MetalFX denoising for real-time applications first.
