hi
When analyzing our game using Instruments, I've always been confused about the two items "Drawable Present" and "Drawable Presented" in the GPU column. The timing of Drawable Present seems to be when the CPU layer calls commandbuffer:present, rather than when the actual encoding is completed on the GPU. Also, what does drawable presented specifically mean? In our case, when a CPU stall occurs, it appears that the vsync interval changes in the next frame, and a surface that has already been calculated is not displayed. Why is this happening?
Metal
RSS for tagRender advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
Context
I’m deploying large language models on iPhone using llama.cpp. A new iPhone Air (12 GB RAM) reports a Metal MTLDevice.recommendedMaxWorkingSetSize of 8,192 MB, and my attempt to load Llama-2-13B Q4_K (~7.32 GB weights) fails during model initialization.
Environment
Device: iPhone Air (12 GB RAM)
iOS: 26
Xcode: 26.0.1
Build: Metal backend enabled llama.cpp
App runs on device (not Simulator)
What I’m seeing
MTLCreateSystemDefaultDevice().recommendedMaxWorkingSetSize == 8192 MiB
Loading Llama-2-13B Q4_K (7.32 GB) fails to complete. Logs indicate memory pressure / allocation issues consistent with the 8 GB working-set guidance.
Smaller models (e.g., 7B/8B with similar quantization) load and run (8B Q4_K provide around 9 tokens/second decoding speed).
Questions
Is 8,192 MB an expected recommendedMaxWorkingSetSize on a 12 GB iPhone?
What values should I expect on other 2025 devices including iPhone 17 (8 GB RAM) and iPhone 17 Pro (12 GB RAM)
Is it strictly enforced by Metal allocations (heaps/buffers), or advisory for best performance/eviction behavior?
Can a process practically exceed this for long-lived buffers without immediate Jetsam risk?
Any guidance for LLM scenarios near the limit?
The following minimal snippet SEGFAULTS with SDK 26.0 and 26.1. Won't crash if I remove async from the enclosing function signature - but it's impractical in a real project.
import Metal
import MetalPerformanceShaders
let SEED = UInt64(0x0)
typealias T = Float16
/* Why ran in async context? Because global GPU object,
and async makeMTLFunction,
and async makeMTLComputePipelineState.
Nevertheless, can trigger the bug without using global
@MainActor
let myGPU = MyGPU()
*/
@main
struct CMDLine {
static func main() async {
let ptr = UnsafeMutablePointer<T>.allocate(capacity: 0)
async let future: Void = randomFillOnGPU(ptr, count: 0)
print("Main thread is playing around")
await future
print("Successfully reached the end.")
}
static func randomFillOnGPU(_ buf: UnsafeMutablePointer<T>, count destbufcount: Int) async {
// let (device, queue) = await (myGPU.device, myGPU.commandqueue)
let myGPU = MyGPU()
let (device, queue) = (myGPU.device, myGPU.commandqueue)
// Init MTLBuffer, async let makeFunction, makeComputePipelineState, etc.
let tempDataType = MPSDataType.uInt32
let randfiller = MPSMatrixRandomMTGP32(device: device, destinationDataType: tempDataType, seed: Int(bitPattern:UInt(SEED)))
print("randomFillOnGPU: successfully created MPSMatrixRandom.")
// try await computePipelineState
// ^ Crashes before this could return
// Or in this minimal case, after randomFillOnGPU() returns
// make encoder, set pso, dispatch, commit...
}
}
actor MyGPU {
let device : MTLDevice
let commandqueue : MTLCommandQueue
init() {
guard let dev: MTLDevice = MPSGetPreferredDevice(.skipRemovable),
let cq = dev.makeCommandQueue(),
dev.supportsFamily(.apple6) || dev.supportsFamily(.mac2)
else { print("Unable to get Metal Device! Exiting"); exit(EX_UNAVAILABLE) }
print("Selected device: \(String(format: "%llX", dev.registryID))")
self.device = dev
self.commandqueue = cq
print("myGPU: initialization complete.")
}
}
See FB20916929. Apparently objc autorelease pool is releasing the wrong address during context switch (across suspension points). I wonder why such obvious case has not been caught before.
After watching WWDC 2025 session "Combine Metal 4 machine learning and graphics", I have decided to give it a shot to integrate the latest MTL4MachineLearningCommandEncoder to my existing render pipeline. After a lot of trial and errors, I managed to set up the pipeline and have the app compiled.
However, I am now stuck on creating a MTLLibrary with .mtlpackage.
Here is the code I have to create a MTLLibrary according the WWDC session https://developer.apple.com/videos/play/wwdc2025/262/?time=550:
let coreMLFilePath = bundle.path(forResource: "my_model", ofType: "mtlpackage")!
let coreMLURL = URL(string: coreMLFilePath)!
do {
metalDevice.makeLibrary(URL: coreMLURL)
} catch {
print("error: \(error)")
}
With the above code, I am getting error:
Error Domain=MTLLibraryErrorDomain Code=1 "Invalid metal package" UserInfo={NSLocalizedDescription=Invalid metal package}
What is the correct way to create a MTLLibrary with .mtlpackage? Do I see this error because the .mtlpackage I am using is incorrect? How should I go with debugging this?
I'd really appreciate if I could get some help on this as I have been stuck with it for some time now. Thanks in advance!
Hi,
I’m testing Unity’s Spaceship HDRP demo on iPhone 17 Pro Max and iPad Pro M4 (iOS 26.1).
Everything renders correctly, and my custom MetalFX Spatial plugin initializes successfully — it briefly reports active scaling (e.g. 1434×660 → 2868×1320 at 50% scaling), then reverts to native rendering a few frames later.
Setup:
Xcode 16.1 (targeting iOS 18)
Unity 2022.3.62f3 (HDRP)
Metal backend
Dynamic Resolution enabled in HDRP assets and cameras
Relevant Xcode console excerpt:
[MetalFXPlugin] MetalFX_Enable(True) called.
[SpaceshipOptions] MetalFX enabled with HDRP dynamic resolution integration.
[SpaceshipOptions] Disabled TAA for MetalFX Spatial.
[SpaceshipOptions] Created runtime RenderTexture: 1434x660
[MetalFX] Spatial scaler created (1434x660 → 2868x1320).
[MetalFX] Processed frame with scaler.
[MetalFXPlugin] Sent RenderTexture (1434x660) to MetalFX. Output target 2868x1320.
[SpaceshipOptions] MetalFX target set: 1434x660
[SpaceshipOptions] Camera targetTexture cleared after MetalFX handoff.
It looks like HDRP clears the camera’s target texture right after MetalFX submits the frame, which causes it to revert to native rendering.
Is there a recommended way to persist or rebind the MetalFX output texture when using HDRP on iOS?
Unity doesn’t appear to support MetalFX in the Editor either:
Thanks!
Hi all,
I'm encountering an issue with Metal raytracing on my M5 MacBook Pro regarding Instance Acceleration Structure (IAS).
Intersection tests suddenly stop working after a certain point in the sampling loop.
Situation
I implemented an offline GPU path tracer that runs the same kernel multiple times per pixel (sampleCount) using metal::raytracing.
Intersection tests are performed using an IAS.
Since this is an offline path tracer, geometries inside the IAS never changes across samples (no transforms or updates).
As sampleCount increases, there comes a point where the number of intersections drops to zero, and remains zero for all subsequent samples.
Here's a code sketch:
let sampleCount: UInt16 = 1024
for sampleIndex: UInt16 in 0..<sampleCount {
// ...
do {
let commandBuffer = commandQueue.makeCommandBuffer()
// Dispatch the intersection kernel.
await commandBuffer.completed()
}
do {
let commandBuffer = commandQueue.makeCommandBuffer()
// Use the intersection test results from the previous command buffer.
await commandBuffer.completed()
}
// ...
}
kernel void intersectAlongRay(
const metal::uint32_t threadIndex [[thread_position_in_grid]],
// ...
const metal::raytracing::instance_acceleration_structure accelerationStructure [[buffer(2)]],
// ...
)
{
// ...
const auto result = intersector.intersect(ray, accelerationStructure);
switch (result.type) {
case metal::raytracing::intersection_type::triangle: {
// Write intersection result to device buffers.
break;
}
default:
break;
}
Observations
Encoding both the intersection kernel and the subsequent result usage in the same command buffer does not resolve the problem.
Switching from IAS to Primitive Acceleration Structure (PAS) fixes the problem.
Rebuilding the IAS for each sample also resolves the issue.
Intersections produce inconsistent results even though the IAS and rays are identical — Image 1 shows a hit, while Image 2 shows a miss.
Questions
Am I misusing IAS in some way ?
Could this be a Metal bug ?
Any guidance or confirmation would be greatly appreciated.
My app has a number of heterogeneous GPU workloads that all run concurrently. Some of these should be executed with the highest priority because the app’s responsiveness depends on them, while others are triggered by file imports and the like which should have a low priority. If this was running on the CPU I’d assign the former User Interactive QoS and the latter Utility QoS. Is there an equivalent to this for GPU work?
Deterministic RNG behaviour across Mac M1 CPU and Metal GPU – BigCrush pass & structural diagnostics
Hello,
I am currently working on a research project under ENINCA Consulting, focused on advanced diagnostic tools for pseudorandom number generators (structural metrics, multi-seed stability, cross-architecture reproducibility, and complementary indicators to TestU01).
To validate this diagnostic framework, I prototyped a small non-linear 64-bit PRNG (not as a goal in itself, but simply as a vehicle to test the methodology).
During these evaluations, I observed something interesting on Apple Silicon (Mac M1):
• bit-exact reproducibility between M1 ARM CPU and M1 Metal GPU,
• full BigCrush pass on both CPU and Metal backends,
• excellent p-values,
• stable behaviour across multiple seeds and runs.
This was not the intended objective, the goal was mainly to validate the diagnostic concepts, but these results raised some questions about deterministic compute behaviour in Metal.
My question: Is there any official guidance on achieving (or expecting) deterministic RNG or compute behaviour across CPU ↔ Metal GPU on Apple Silicon? More specifically:
• Are deterministic compute kernels expected or guaranteed on Metal for scientific workloads?
• Are there recommended patterns or best practices to ensure reproducibility across GPU generations (M1 → M2 → M3 → M4)?
• Are there known Metal features that can introduce non-determinism?
I am not sharing the internal recurrence (this work is proprietary), but I can discuss the high-level diagnostic observations if helpful.
Thank you for any insight, very interested in how the Metal engineering team views deterministic compute patterns on Apple Silicon.
Pascal ENINCA Consulting
Topic:
Graphics & Games
SubTopic:
Metal
Tags:
ML Compute
Metal
Metal Performance Shaders
Apple Silicon
Deterministic RNG behaviour across Mac M1 CPU and Metal GPU – BigCrush pass & structural diagnostics
Hello,
I am currently working on a research project under ENINCA Consulting, focused on advanced diagnostic tools for pseudorandom number generators (structural metrics, multi-seed stability, cross-architecture reproducibility, and complementary indicators to TestU01).
To validate this diagnostic framework, I prototyped a small non-linear 64-bit PRNG (not as a goal in itself, but simply as a vehicle to test the methodology).
During these evaluations, I observed something interesting on Apple Silicon (Mac M1): • bit-exact reproducibility between M1 ARM CPU and M1 Metal GPU, • full BigCrush pass on both CPU and Metal backends, • excellent p-values, • stable behaviour across multiple seeds and runs.
This was not the intended objective, the goal was mainly to validate the diagnostic concepts, but these results raised some questions about deterministic compute behaviour in Metal.
My question: Is there any official guidance on achieving (or expecting) deterministic RNG or compute behaviour across CPU ↔ Metal GPU on Apple Silicon? More specifically:
• Are deterministic compute kernels expected or guaranteed on Metal for scientific workloads?
• Are there recommended patterns or best practices to ensure reproducibility across GPU generations (M1 → M2 → M3 → M4)? • Are there known Metal features that can introduce non-determinism?
I am not sharing the internal recurrence (this work is proprietary), but I can discuss the high-level diagnostic observations if helpful.
Thank you for any insight, very interested in how the Metal engineering team views deterministic compute patterns on Apple Silicon.
Pascal ENINCA Consulting
Topic:
Graphics & Games
SubTopic:
Metal
I am puzzled by the setAddress(_:attributeStride:index:) of MTL4ArgumentTable. Can anyone please explain what the attributeStride parameter is for? The doc says that it is "The stride between attributes in the buffer." but why?
Who uses this for what? On the C++ side in the shaders the stride is determined by the C++ type, as far as I know. What am I missing here?
Thanks!