I have a Core Image filter in my app that uses Metal. I cannot compile it because it complains that the executable tool metal is not available, but I have installed it in Xcode.
If I go to the "Components" section of Xcode Settings, it shows it as downloaded. And if I run the suggested command, it also shows it as installed. Any advice?
Xcode Version
Version 26.0 beta (17A5241e)
Build Output
Showing All Errors Only
Build target Lessons of project StudyJapanese with configuration Light
RuleScriptExecution /Users/chris/Library/Developer/Xcode/DerivedData/StudyJapanese-glbneyedpsgxhscqueifpekwaofk/Build/Intermediates.noindex/StudyJapanese.build/Light-iphonesimulator/Lessons.build/DerivedSources/OtsuThresholdKernel.ci.air /Users/chris/Code/SerpentiSei/Shared/iOS/CoreImage/OtsuThresholdKernel.ci.metal normal undefined_arch (in target 'Lessons' from project 'StudyJapanese')
cd /Users/chris/Code/SerpentiSei/StudyJapanese
/bin/sh -c xcrun\ metal\ -w\ -c\ -fcikernel\ \"\$\{INPUT_FILE_PATH\}\"\ -o\ \"\$\{SCRIPT_OUTPUT_FILE_0\}\"'
'
error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain
/Users/chris/Code/SerpentiSei/StudyJapanese/error:1:1: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain
Build failed 6/9/25, 8:31 PM 27.1 seconds
Result of xcodebuild -downloadComponent MetalToolchain (after switching Xcode-beta.app with xcode-select)
xcodebuild -downloadComponent MetalToolchain
Beginning asset download...
Downloaded asset to: /System/Library/AssetsV2/com_apple_MobileAsset_MetalToolchain/4d77809b60771042e514cfcf39662c6d1c195f7d.asset/AssetData/Restore/022-19457-035.dmg
Done downloading: Metal Toolchain (17A5241c).
Screenshots from Xcode
Result of "Copy Information"
Metal Toolchain 26.0 [com.apple.MobileAsset.MetalToolchain: 17.0 (17A5241c)] (Installed)
Metal
RSS for tagRender advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Created
Hey, I've been struggling with this for some days now.
I am trying to write to a sparse texture in a compute shader. I'm performing the following steps:
Set up a sparse heap and create a texture from it
Map the whole area of the sparse texture using updateTextureMapping(..)
Overwrite every value with the value "4" in a compute shader
Blit the texture to a shared buffer
Assert that the values in the buffer are "4".
I have a minimal example (which is still pretty long unfortunately).
It works perfectly when removing the line heapDesc.type = .sparse.
What am I missing? I could not find any information that writes to sparse textures are unsupported. Any help would be greatly appreciated.
import Metal
func sparseTexture64x64Demo() throws {
// ── Metal objects
guard let device = MTLCreateSystemDefaultDevice()
else { throw NSError(domain: "SparseNotSupported", code: -1) }
let queue = device.makeCommandQueue()!
let lib = device.makeDefaultLibrary()!
let pipeline = try device.makeComputePipelineState(function: lib.makeFunction(name: "addOne")!)
// ── Texture descriptor
let width = 64, height = 64
let format: MTLPixelFormat = .r32Uint // 4 B per texel
let desc = MTLTextureDescriptor()
desc.textureType = .type2D
desc.pixelFormat = format
desc.width = width
desc.height = height
desc.storageMode = .private
desc.usage = [.shaderWrite, .shaderRead]
// ── Sparse heap
let bytesPerTile = device.sparseTileSizeInBytes
let meta = device.heapTextureSizeAndAlign(descriptor: desc)
let heapBytes = ((bytesPerTile + meta.size + bytesPerTile - 1) / bytesPerTile) * bytesPerTile
let heapDesc = MTLHeapDescriptor()
heapDesc.type = .sparse
heapDesc.storageMode = .private
heapDesc.size = heapBytes
let heap = device.makeHeap(descriptor: heapDesc)!
let tex = heap.makeTexture(descriptor: desc)!
// ── CPU buffers
let bytesPerPixel = MemoryLayout<UInt32>.stride
let rowStride = width * bytesPerPixel
let totalBytes = rowStride * height
let dstBuf = device.makeBuffer(length: totalBytes, options: .storageModeShared)!
let cb = queue.makeCommandBuffer()!
let fence = device.makeFence()!
// 2. Map the sparse tile, then signal the fence
let rse = cb.makeResourceStateCommandEncoder()!
rse.updateTextureMapping(
tex,
mode: .map,
region: MTLRegionMake2D(0, 0, width, height),
mipLevel: 0,
slice: 0)
rse.update(fence) // ← capture all work so far
rse.endEncoding()
let ce = cb.makeComputeCommandEncoder()!
ce.waitForFence(fence)
ce.setComputePipelineState(pipeline)
ce.setTexture(tex, index: 0)
let threadsPerTG = MTLSize(width: 8, height: 8, depth: 1)
let tgCount = MTLSize(width: (width + 7) / 8,
height: (height + 7) / 8,
depth: 1)
ce.dispatchThreadgroups(tgCount, threadsPerThreadgroup: threadsPerTG)
ce.updateFence(fence)
ce.endEncoding()
// Blit texture into shared buffer
let blit = cb.makeBlitCommandEncoder()!
blit.waitForFence(fence)
blit.copy(
from: tex,
sourceSlice: 0,
sourceLevel: 0,
sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
sourceSize: MTLSize(width: width, height: height, depth: 1),
to: dstBuf,
destinationOffset: 0,
destinationBytesPerRow: rowStride,
destinationBytesPerImage: totalBytes)
blit.endEncoding()
cb.commit()
cb.waitUntilCompleted()
assert(cb.error == nil, "GPU error: \(String(describing: cb.error))")
// ── Verify a few texels
let out = dstBuf.contents().bindMemory(to: UInt32.self, capacity: width * height)
print("first three texels:", out[0], out[1], out[width]) // 0 1 64
assert(out[0] == 4 && out[1] == 4 && out[width] == 4)
}
Metal shader:
#include <metal_stdlib>
using namespace metal;
kernel void addOne(texture2d<uint, access::write> tex [[texture(0)]],
uint2 gid [[thread_position_in_grid]])
{
tex.write(4, gid);
}
hi
When analyzing our game using Instruments, I've always been confused about the two items "Drawable Present" and "Drawable Presented" in the GPU column. The timing of Drawable Present seems to be when the CPU layer calls commandbuffer:present, rather than when the actual encoding is completed on the GPU. Also, what does drawable presented specifically mean? In our case, when a CPU stall occurs, it appears that the vsync interval changes in the next frame, and a surface that has already been calculated is not displayed. Why is this happening?
Hello!
I'm a developer working on a plugin for the Elgato Stream Deck, called GPU Metrics. The plugin currently only works on Windows but I'd like to bring it to macOS. However, based on forum posts I've read (and StackOverflow) there isn't a very clear path to query GPU metrics like usage, temperature, used GPU memory, and power consumption. There are some tools out there that do similar things, but I wanted to see what would be the recommendation from Apple's engineering team to get this data via a public API.
Requirements:
Access GPU utilization, temperature, memory usage, power usage
C/C++ based API for querying the metrics so I can expose the data to JavaScript via Node Addon
No need to compatibile with Intel-based Macs, as Apple silicon will be fine for now
Plugin GitHub
Thank you!
Noah
In the Creating A 3D Application With Hydra Rendering tutorial on the Apple Developer website, on the last step where I execute this command:
cmake -S ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/ -B ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/
I keep getting an error:
CMake Error at CMakeLists.txt:5 (include):
include could not find requested file:
/Users/macuser/USDInstall/bin/pxrConfig.cmake
I've tried to follow the instructions as mentioned in the README.md file included in the project files at least 5 times as well as moving the pxrConfig.cmake file around and copying it in different folders, then executed the command and was still unsuccessful into generating the proper file expected to compile and render the HydraPlayer renderer. How do I get cmake to generate the Xcode file to create the HydraPlayer renderer?
I have run into an issue where I am trying to use atomic_float in a swift package but I cannot get things to compile because it appears that the Swift Package Manager doesn't support Metal 3 (atomic_float is Metal 3 functionality). Is there any way around this? I am using
// swift-tools-version: 6.1
and my Metal code includes:
#include <metal_stdlib>
#include <metal_geometric>
#include <metal_math>
#include <metal_atomic>
using namespace metal;
kernel void test(device atomic_float* imageBuffer [[buffer(1)]],
uint id [[ thread_position_in_grid ]]) {
}
But I get an error on the definition of atomic_float .
Any help, one more importantly, where I could have found this information about this limitation, would be helpful.
-RadBobby
Topic:
Graphics & Games
SubTopic:
Metal
Hi,
seems MSL is missing support for a clock() shader instruction available in other graphics APIs like Vulkan or OpenGL for example..
useful for counting cost in number of clock cycles of some code insider shader with much finer granularity than launching a micro kernel with same instructions and measuring cycles cost from CPU..
also useful for MoltenVK to support that extensions..
thanks..
The game physics work as expected using GTPK 2.0 using Crossover 24 or Whisky. However, using GPTK 2.1 with Crossover 25, the player and camera physics misbehave. See https://www.reddit.com/r/WWEGames/comments/1jx9mph/the_siamese_elbow/ and https://www.reddit.com/r/WWEGames/comments/1jx9ow4/camera_glitch/
Full video also linked in the Reddit post.
I have also submitted this bug via the feedback assistant.
The code is pretty simple
kernel void naive(
constant RunParams *param [[ buffer(0) ]],
const device float *A [[ buffer(1) ]], // [N, K]
device float *output [[ buffer(2) ]],
uint2 gid [[ thread_position_in_grid ]]) {
uint a_ptr = gid.x * param->K;
for (uint i = 0; i < param->K; i++, a_ptr++) {
val += A[b_ptr];
}
output[ptr] = val;
}
when uint a_ptr = gid.x * param->K, the code got 150 GFLops
when uint a_ptr = gid.y * param->K, the code got 860 GFLops
param->K = 256;
thread per group: [16, 16]
I'd like to understand why the performance is so different, and how can I profile/diagnose this to help with further optimization.
Topic:
Graphics & Games
SubTopic:
Metal
View Layout
Add the following views in a view controller:
Label
View A, with a subview of the same size: MTKView A
View B, with a subview of the same size: MTKView B
Refresh Rates of Each View
The label view refreshes at 60fps (driven by CADisplayLink).
MTKView A and B refresh at 15fps.
MTKView Implementation Details
The corresponding CAMetalLayer's maximumDrawableCount is set to 2, changed to double buffering.
The scheduling mechanism is modified; drawing is not driven by the internal loop but is done manually. The draw call is triggered immediately upon receiving a frame.
self.metalView.enableSetNeedsDisplay = NO;
self.metalView.paused = YES;
A new high-priority queue is created for drawing, instead of handling it on the main queue.
MTKView Latency Tracking
The GPU completion time T1 is observed through the addCompletedHandler callback of the CommandBuffer.
The presentation time T2 of the frame is observed through the addPresentedHandler callback of the currentDrawable in MTKView.
Testing shows that T2 - T1 > 16.6ms (the Vsync period at 60Hz). This means that after the GPU rendering in MTLView is finished, the frame is not actually displayed at the next Vsync instruction but only at the Vsync instruction after that.
I believe there is an extra 16.6ms of latency here, which I want to eliminate by adjusting the rendering mechanism.
Observation from Instruments
From Instruments, the Surface presentation aligns with the above test results. After the Metal encoder finishes, the Surface in Display switches only after the next-next Vsync instruction. See the image in the link for details.
Questions
According to a beginner's understanding, after MTKView's GPU rendering is finished, the next Vsync instruction should officially display (make it visible). However, this is not what is observed. Does the subview MTKView need to wait for another Vsync cycle to be drawn to the actual display buffer?
The label updates its text at 60fps, so the entire interface should be displayed at 60fps. Is the content of MTKView not synchronized when the display happens?
Explanation of the Reasoning Behind Some MTKView Code Details
Changing from the default triple buffering to double buffering helps reduce the latency introduced by rendering.
Not using MTKView's own scheduling mechanism but using manual triggering of the draw method is because MTKView's own scheduling mechanism is driven by CADisplayLink. Therefore, if a frame falls within a Vsync window, it needs to wait for the next Vsync window to trigger the draw operation, which introduces waiting latency.
I'm implementing optimized matmul on metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal
I notice that performance is significantly different with different threadgroup memory set in
[computeEncoder setThreadgroupMemoryLength]
All other lines are exactly same, the only difference is this parameter.
Matmul performance is roughly 250 GFLops if I set 32768 (max bytes allowed on this M1 Max),
but 400 GFLops if I set 8192.
Why does this happen? How can I optimize it?
Topic:
Graphics & Games
SubTopic:
Metal
Hello
I am trying to get thread group memory access in fragment shader. In essence, I would like to have all the fragments in a tile to bitwiseOR some value. My idea was to use simd_or across the SIMD group, then make each SIMD group thread 0 to atomic or the value into thread group memory. Finally very first thread of the tile would be tasked with writing the value down to texture with write access.
Now, I can allocate the thread group memory argument to the fragment function all right. MTLRenderEncoder has setThreadgroupMemoryLength call, which I am using the following way
[renderEncoder setThreagroupMemoryLength: 16 offset: 0 atIndex:0]
Unfortunately, all I am getting is the following error (runtime assertion)
-[MTLDebugRenderCommandEncoder setThreadgroupMemoryLength:offset:atIndex:]:3487: failed assertion Set Threadgroup Memory Length Validation
offset + length(16) must be <= threadgroupMemoryLength(0).`
What I am doing wrong? How I can get thread group memory in the fragment shader? I know I could use tile shading and compute function but the problem is that here I really like to use fragment stuff. Will be grateful for help.
Hi,
I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected, however the compile times have gotten completely out of hand and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt however if I am wondering if there is a way to reduce the compile times in such a scenario, and how to investigate what the root cause of the problem is. I suspect it could have to do with the fact that every materials metallib contains duplications of all the inline functions. Any ideas on how to profile and debug this?
Thanks,
Rasmus
Anyone else unable to download the "Rendering a Scene with Deferred Lighting in C++" (https://developer.apple.com/documentation/metal/rendering-a-scene-with-deferred-lighting-in-c++?language=objc)?
I just an error page:
Is there another place to download this sample?
Topic:
Graphics & Games
SubTopic:
Metal
Hello!
I have a question about how thread groups work with tile shading. When running "traditional" compute, I get to choose both thread group size and the grid size. However, when using tile shading kernel I only have dispatchThreadsPerTile method - this controls how many threads will be ran in each tile. So far so good, but what about thread groups?
The examples in video "Tile Shading on A11" seem to suggest that there will be only one thread group per tile. In the video, [[thread_index_in_threadgroup]] is called "local_id" and it is used to access the image block.
I assume this is the default configuration. So when one does the following:
Creates MTLRenderPassDescriptor with tileWidth set to W and tileHeight set to H
Fires up the tile shading kernel using dispatchThreadsPerTile with MTLSize size = { W, H, 1 }
I understand that the result is 1-to-1 mapping between the tile "pixels" and kernel threads. Now, what I would like to do is to have more than one thread group there. I want this for performance reasons: I have a certain compute kernel which I know executes very well with small thread group size. In fact, { 32, 1, 1 } seems to be the fastest. My understanding is that even if I set tile size to 16x16, and so I am executing 256 threads there, there will only be one SIMD group active in a thread group. Meaning that this SIMD group has to execute 8 times over the tile.
Is it possible somehow? Or perhaps the limitations of the API are pointing at the limitations of hardware itself, and if I want to execute with SIMD group sized thread groups I have to use "traditional" compute encoder?
Will be grateful for help.
Michał
I have this drawing app that I have been working on for the past few years when I have free time. I recently rebuilt the app in Metal to build out other brushes and improve performance, need to render 10000s of lines in realtime.
I’m running into this issue trying to create a uniform opacity per path. I have a solution but do not love it - as this is a realtime app and the solution could have some bottlenecks. If I just generate a triangle strip from touch points and do my best to smooth, resample, and handle miters I will always get some overlaps. See:
To create a uniform opacity I render to an offscreen texture with blending disabled. I then pre-multiply the color and draw that texture to a composite texture with blending on (I do this per path). This works but gets tricky when you introduce a textured brush, the edges of the texture in the frag shader cut out the line.
Pasted Graphic 1.png
Solution: I discard below a threshold
fragment float4 fragment_line(VertexOut in [[stage_in]],
texture2d<float> texture [[ texture(0) ]]) {
constexpr sampler s(coord::normalized, address::mirrored_repeat, filter::linear);
float2 texCoord = in.texCoord;
float4 texColor = texture.sample(s, texCoord);
if (texColor.a < 0.01) discard_fragment(); // may be slow (from what I read)
return in.color * texColor;
}
Better but still not perfect.
Question: I'm looking for better ways to create a uniform opacity per path. I tried .max blending but that will cause no blending of other paths. Any tips, ideas, much appreciated. If this is too detailed of a question just achieve.
I'm trying to use MTLBinaryArchive. I collected a BinaryArchive from one device and used metal-tt to translate it for all supported iPhone devices, ranging from iPhone 7 Plus to iPhone 16.
However, this BinaryArchive is quite large, around 1.5GB uncompressed, and about 500MB compressed in the IPA. I'm wondering how to address the size issue.
I watched the WWDC 2022 video, which mentioned that the operating system or app installation process would handle compatibility. Does this compatibility support different GPU chips? I tried installing an IPA with a BinaryArchive collected only from an iPhone 12 on an iPhone 13, but the BinaryArchive didn't take effect.
I also saw that Apple supports App Thinning. However, it seems that resources in the Asset Catalog cannot be accessed via URL, and creating an MTLBinaryArchive requires a URL. Is it possible for MTLBinaryArchive to be distributed through App Thinning?
The WWDC 2022 video also mentioned using the -Os optimization flag to reduce size. Can this give an estimate of how much compression it would achieve? Are there any methods to solve the BinaryArchive size issue without impacting performance?
Topic:
Graphics & Games
SubTopic:
Metal
In this video, tile fragment shading is recommended for image processing. In this example, the unpack function takes two arguments, one of which is RasterizerData. As I understand it, this is the data passed to us from the previous stage (Vertex) of the graphics pipeline.
However, the properties of MTLTileRenderPipelineDescriptor do not include an option for specifying a Vertex function. Therefore, in this render pass, a mix of commands is used: first, a draw command is executed to obtain UV coordinates, and then threads are dispatched.
My question is: without using a draw command, only dispatch, how can I get pixel coordinates in the fragment tile function? For the kernel tile function, everything is clear.
typedef struct
{
float4 OPTexture [[ color(0) ]];
float4 IntermediateTex [[ color(1) ]];
} FragmentIO;
fragment FragmentIO Unpack(RasterizerData in [[ stage_in ]],
texture2d<float, access::sample> srcImageTexture [[texture(0)]])
{
FragmentIO out;
//...
// Run necessary per-pixel operations
out.OPTexture = // assign computed value;
out.IntermediateTex = // assign computed value;
return out;
}
The flushContextInternal function in glr_sync.mm:262 called abort internally. What caused this? Was it due to high device temperature or some other reason?
Date/Time: 2024-08-29 09:20:09.3102 +0800
Launch Time: 2024-08-29 08:53:11.3878 +0800
OS Version: iPhone OS 16.7.10 (20H350)
Release Type: User
Baseband Version: 8.50.04
Report Version: 104
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Triggered by Thread: 0
Thread 0 name:
Thread 0 Crashed:
0 libsystem_kernel.dylib 0x00000001ed053198 __pthread_kill + 8 (:-1)
1 libsystem_pthread.dylib 0x00000001fc5e25f8 pthread_kill + 208 (pthread.c:1670)
2 libsystem_c.dylib 0x00000001b869c4b8 abort + 124 (abort.c:118)
3 AppleMetalGLRenderer 0x00000002349f574c GLDContextRec::flushContextInternal() + 700 (glr_sync.mm:262)
4 DiSpecialDriver 0x000000010824b07c Di::RHI::onRenderFrameEnd() + 184 (RHIDevice.cpp:118)
5 DiSpecialDriver 0x00000001081b85f8 Di::Client::drawFrame() + 120 (Client.cpp:155)
2024-08-27_14-44-10.8104_+0800-07d9de9207ce4c73289507e608e5de4320d02ccf.crash
Topic:
Graphics & Games
SubTopic:
Metal
Recently, I adopted MetalFX for Upscale feature.
However, I have encountered a persistent build failure for the iOS Simulator with the error message, 'MetalFX is not available when building for iOS Simulator.'
To address this, I modified the MetalFX.framework status to 'Optional' within Build Phases > Link Binary With Libraries, adding the linker option (-weak_framework). Despite this adjustment, the build process continues to fail.
Furthermore, I observed that the MetalFX sample application provided by Apple, specifically the one found at https://developer.apple.com/documentation/metalfx/applying-temporal-antialiasing-and-upscaling-using-metalfx, also fails to build for the iOS Simulator target.
Has anyone encountered this issue?