Render advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.

Metal Documentation

Posts under Metal subtopic

Post

Replies

Boosts

Views

Activity

Can't profile Metal on Apple TV
Hi, I can capture a frame on the Apple TV, but when I try to profile the capture for GPU timing information, I got "Abort Trap 6" error and with following error in the report: Exception Type: EXC_CRASH (SIGABRT) Exception Codes: 0x0000000000000000, 0x0000000000000000 Triggered by Thread: 7 Application Specific Information: abort() called Last Exception Backtrace: 0 CoreFoundation 0x18c0a99d0 __exceptionPreprocess + 160 1 libobjc.A.dylib 0x18b596d24 objc_exception_throw + 71 2 CoreFoundation 0x18bfa7308 -[__NSArrayM insertObject:atIndex:] + 1239 3 MTLReplayController 0x101f5d148 DYMTLReplayFrameProfiler_loadAnalysis + 1140 4 MTLReplayController 0x101e97f90 GTMTLReplayClient_collectGPUShaderTimelineData + 224 5 MTLReplayController 0x101e81794 __30-[GTMTLReplayService profile:]_block_invoke_4 + 288 6 Foundation 0x18eb6072c __NSOPERATION_IS_INVOKING_MAIN__ + 11 7 Foundation 0x18eb5cc1c -[NSOperation start] + 623 8 Foundation 0x18eb60edc __NSOPERATIONQUEUE_IS_STARTING_AN_OPERATION__ + 11 9 Foundation 0x18eb60bc4 __NSOQSchedule_f + 167 10 libdispatch.dylib 0x18b8d6a84 _dispatch_block_async_invoke2 + 103 11 libdispatch.dylib 0x18b8c9420 _dispatch_client_callout + 15 12 libdispatch.dylib 0x18b8cc5d0 _dispatch_continuation_pop + 531 13 libdispatch.dylib 0x18b8cbcd4 _dispatch_async_redirect_invoke + 635 14 libdispatch.dylib 0x18b8d9224 _dispatch_root_queue_drain + 335 15 libdispatch.dylib 0x18b8d9a08 _dispatch_worker_thread2 + 163 16 libsystem_pthread.dylib 0x18b6e652c _pthread_wqthread + 223 17 libsystem_pthread.dylib 0x18b6ed8d0 start_wqthread + 7 It's Xcode 16.0 + Apply TV 4K (4th Gen) tvOS 18, does anyone know what's the cause of this error and is there any solution for it? Thank you very much, Kai
0
0
335
Oct ’24
Getting stuck in first frame of renderLoop.
First I get this ar_world_tracking_provider_query_device_anchor_at_timestamp <0x302b9c0a0>: The device_anchor can only be queried when the world tracking provider is running. This seemed to all break with the auto-update to 2.0.1. Simulator runs the code fine. I seem to see an infinite stall here frameLayer.endUpdate() // Pace frames by waiting for the optimal prediction time. try await LayerRenderer.Clock().sleep(until: timing.optimalInputTime, tolerance: nil) // Start submitting the updated frame. frameLayer.startSubmission() <-
0
0
504
Oct ’24
Strange Metal related shader issue
Hi everyone, I encountered a very strange shader bug that seems related to Metal only (not OpenGL). You can find the full description of the issue on the Babylon.js forums here: https://forum.babylonjs.com/t/strange-shader-related-issue-on-macos-with-safari-and-chrome-not-firefox/54289 (sorry, I couldn't post a clickable link here as this seems to be blocked here). I have a workaround to fix the issue (as described in the link above), but this really looks like an issue in Metal itself. Let me know if you need more details or explanations.
0
0
397
Oct ’24
Metal Compute Overhead
Hello, We are experimenting with Metal to accelerate some peculiar numerical computation. Our workloads are relatively small, so the ability to avoid moving data to and from the GPU's memory is very appealing. However, we are observing higher overhead compared to CUDA, which negates the benefits of avoiding data transfer. In our tests using an empty kernel, CUDA completes in 0.001 ms (Intel i7 10700K, RTX 3080), while Metal's waitUntilCompleted takes 0.12 ms (M2 Max). As we do not have prior experience with Metal, we are wondering if we are using the APIs just fine and this timing is expected, or if there is a way to reduce it. Thank you in advance for any comment! test-metal.cpp
0
0
629
Nov ’24
What does CAMetalLayerWantsCompositingDependencies in Info.plist do?
I've noticed a major third-party app has the following flag set to 1/true in its Info.plist: CAMetalLayerWantsCompositingDependencies Does anyone know if it’s recognized by Core Animation / Metal, and what it’s supposed to do? It might obviously have zero relationship to the OS, defined by that app and for that app... but since it looks very much like an unofficial/undocumented environment setting, it might be great to know what problem it solves. I happen to have issues related to compositing other CALayers over a CAMetalLayer in my app... so this definitely stood out as interesting. Thank you!
0
0
425
Nov ’24
jpegData(compressionQuality:) background crash on iOS18
Ever since the release of iOS18, we've been seeing a new crash related to calling jpegData(compressionQuality:). From reports, this isn't crashing during foreground usage of the app, but instead will prompt the user about a background app crash upon foregrounding. The stacks from crash reports show this crash is happening from a variety of callers, but all go through jpegData(compressionQuality:), down through [HDRImageConverter_Metal init] and end up in apthread_mutex_local call when it crashes. Attached is a sample crash report from 18.2(22C5125e), but we've been seeing this since the first iOS18 release. Did something change with around these calls in iOS18 that prohibits their use in the background? crash.txt
0
2
676
Nov ’24
Rendering YCbCr input using Metal
I would like to take YCbCr CVPixelBuffers from AVCaptureVideoDataOutput, apply some processing in RGB space, render to an MTKView, and pass to AVAssetWriter for recording. Right now, I'm doing this all manually – deswing the incoming data if necessary, choose the right matrix to convert to RGB, apply processing, etc. I also have to convert back to YCbCr before feeding the frames to AVAssetWriter because encoding performs much better if I do. Is there any efficient, built-in way to achieve the same? I can't use AVCaptureVideoPreviewLayer, since I need to do some further processing before display. I can't use AVCaptureVideoDataOutput's videoSettings to get automatic BGRA conversion because that would lose bit depth for 10 bit video formats (and isn't available on all formats anyway). I see these Accelerate functions, but they seemingly don't use the GPU, nor do they support all the formats and bit depths I'd need. I found reference to some undocumented MTLPixelFormats that seem to do exactly what I want, but I don't want to rely on something like this unless it's explicitly endorsed. This would also incur an RGB/YCbCr conversion on every texture read and write, right? Is there anything I'm missing here?
0
0
542
Nov ’24
D3DMetal unsupported CheckFeatureSupport query 53 while running simple vulkaninfo using Mesa 24.3 Dozen (Vulkanon12) driver..
Hi, wanted to test if possible to use Mesa3D Dozen driver(Vulkan on D3D12 )+D3DMetal 2b3 to get maybe better Vulkan driver on Wine than default MoltenVK.. this will support Vulkan windows apps via using D3D12Metal.. using vulkan_dzn.dll,dzn_icd.x86_64.json,dxil.dll from x64 folder from: https://github.com/pal1000/mesa-dist-win/releases/download/24.3.0-rc1/mesa3d-24.3.0-rc1-release-msvc.7z using simple vulkaninfo app and running like: wine64 vulkaninfo I get error: [D3DMetal:LOG:2A825] Unsupported API: CheckFeatureSupport, unhandled support query 53 also seems D3DMetal Wine integration on Whisky doesn't expose d3d12core.dll and d3d12.dll like new Agility D3D12 dlls or VKD3D, so getting: MESA: error: Failed to retrieve D3D12GetInterface MESA: error: Failed to load DXCore but anyways seems to try to load the driver as: WARNING: dzn is not a conformant Vulkan implementation, testing use only. full log: MESA: error: Failed to retrieve D3D12GetInterface MESA: error: Failed to load DXCore WARNING: dzn is not a conformant Vulkan implementation, testing use only. [D3DMetal:LOG:2A825] Unsupported API: CheckFeatureSupport, unhandled support query 53 00bc:fixme:dcomp:DCompositionCreateDevice 0000000000000000, {c37ea93a-e7aa-450d-b16f-9746cb0407f3}, 000000000011E328. MESA: error: Failed to load DXCore WARNING: dzn is not a conformant Vulkan implementation, testing use only. [D3DMetal:LOG:2A825] Unsupported API: CheckFeatureSupport, unhandled support query 53 00bc:fixme:dcomp:DCompositionCreateDevice 0000000000000000, {c37ea93a-e7aa-450d-b16f-9746cb0407f3}, 000000000011E578. ERROR: [Loader Message] Code 0 : setup_loader_term_phys_devs: Call to 'vkEnumeratePhysicalDevices' in ICD c:\windows\system32\.\vulkan_dzn.dll failed with error code -3 ERROR: [Loader Message] Code 0 : setup_loader_term_phys_devs: Failed to detect any valid GPUs in the current config ERROR at C:\j\msdk0\build\Khronos-Tools\repo\vulkaninfo\vulkaninfo.h:241:vkEnumeratePhysicalDevices failed with ERROR_INITIALIZATION_FAILED
0
0
776
Nov ’24
What are the CAMetalLayer.nextDrawable threading rules?
What evidence exists that it's safe to call nextDrawable() on CAMetalLayer off the main thread? I have seen developers claiming that it's OK, but the official docs are silent on the topic. Attempting to do so with Strict Concurrency Checking set to Complete complains that CAMetalLayer is not @Sendable. I want to call it off the main thread since there doesn't seem to be any way to prevent it from blocking the UI for up to a second. I have read hints and allegations that this won't happen if you avoid asking for too many drawables, but that doesn't seem to be true 100% of the time in my experience. Supposing it is allowed, I wonder how races are handled such as when the layer's size is changed on the main thread, or if the layer is removed from the layer hierarchy.
0
0
497
Dec ’24
Texture Definitions for MPSSVGF Denoise
I am trying to use the SVGF denoiser to denoise my ray traced shadows (and also other textures later). I do get a smoothed image, but with wonky denoising. I need the depth-normal textures and motion textures for the SVGF and assume that these are badly filled in my case. However, neither in the above linked documentation nor in the WWDC19 video I find how they should be defined. I am looking to answers to: Is depth in red or alpha channel for the depth-normal texture? Are the normals in screen space? Is depth linear? Is it distance or z coordinate in view space? Or even logarithmically scaled or something else? Are the motion vectors supposed to be in pixels per frame? What is the orientation of the axis? Is y up or down? Are there are other restrictions on the formats? Also the linked code did not help me (I have not found any SVGF so far; also all the code is in Objective-C++, not Swift, but that's a different topic). So how should I fill these textures. Can someone point me to the documentation where these kinds of questions are answered?
0
0
528
Dec ’24
Concurrent conflicting texture writes
Hello! I need to "draw" a set of particles into the texture. It would be trivial in render encoder of course. However, I would like to implement the task in compute kernel. Every particle draw operation is expected to set 5 texels - "center" one and left/right/upper/lower. Particles can and will overlap, so concurrent draws are to be expected. I tried using texture atomics - atomic_store() to be more precise. This worked, albeit pretty slowly - too slow for my purpose. Just to test what would happen, I tried using normal texture write(). I was expecting to see some kind of visual artefacts, but to my surprise, it worked very well (and much faster). My question: is it safe? I understand that calling write() doesn't guarantee any ordering of the operations, so if multiple threads write to the same texel, the final value may come from any of those threads. But suppose all the threads were to write the very same color? Can I assume that the texel in question will have said color after the compute kernel finishes? I am using M2 Pro MacBook, but ideally I would love to get the answer for the all Apple Silicon devices. My texture format is R32Int (so as to be able to use atomics), but I could do with any single-channel format, the purpose of the texture is to be binary mask of sorts. Thanks!
0
0
387
Feb ’25
How to properly pass a Metal layer from SwiftUI MTKView to C++ for use with metal-cpp?
Hello! I'm currently porting a videogame console emulator to iOS and I'm trying to make the renderer (tested on MacOS) work on iOS as well. The emulator core is written in C++ and uses metal-cpp for rendering, whereas the iOS frontend is written in Swift with SwiftUI. I have an Objective-C++ bridging header for bridging the Swift and C++ sides. On the Swift side, I create an MTKView. Inside the MTKView delegate, I run the emulator for 1 video frame and pass it the view's backing layer for it to render the final output image with. The emulator runs and returns, but when it returns I get a crash in Swift land (callstack attached below), inside objc_release, which indicates I'm doing something wrong with memory management. My bridging interface (ios_driver.h): #pragma once #include <Foundation/Foundation.h> #include <QuartzCore/QuartzCore.h> void iosCreateEmulator(); void iosRunFrame(CAMetalLayer* layer); Bridge implementation (ios_driver.mm): #import <Foundation/Foundation.h> extern "C" { #include "ios_driver.h" } <...> #define IOS_EXPORT extern "C" __attribute__((visibility("default"))) std::unique_ptr<Emulator> emulator = nullptr; IOS_EXPORT void iosCreateEmulator() { ... } // Runs 1 video frame of the emulator and IOS_EXPORT void iosRunFrame(CAMetalLayer* layer) { void* layerBridged = (__bridge void*)layer; // Pass the CAMetalLayer to the emulator emulator->getRenderer()->setMTKLayer(layerBridged); // Runs the emulator for 1 frame and renders the output image using our layer emulator->runFrame(); } My MTKView delegate: class Renderer: NSObject, MTKViewDelegate { var parent: ContentView var device: MTLDevice! init(_ parent: ContentView) { self.parent = parent if let device = MTLCreateSystemDefaultDevice() { self.device = device } super.init() } func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {} func draw(in view: MTKView) { var metalLayer = view.layer as! CAMetalLayer // Run the emulator for 1 frame & display the output image iosRunFrame(metalLayer) } } Finally, the emulator's render function that interacts with the layer: void RendererMTL::setMTKLayer(void* layer) { metalLayer = (CA::MetalLayer*)layer; } void RendererMTL::display() { CA::MetalDrawable* drawable = metalLayer->nextDrawable(); if (!drawable) { return; } MTL::Texture* texture = drawable->texture(); <rest of rendering follows here using the drawable & its texture> } This is the Swift callstack at the time of the crash: To my understanding, I shouldn't be violating ARC rules as my bridging header uses CAMetalLayer* instead of void* and Swift will automatically account for ARC when passing CoreFoundation objects to Objective-C. However I don't have any other idea as to what might be causing this. I've been trying to debug this code for a couple of days without much success. If you need more info, the emulator code is also on Github Metal renderer: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/core/renderer_mtl/renderer_mtl.cpp#L58-L68 Bridge implementation: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/ios_driver.mm Bridging header: https://github.com/wheremyfoodat/Panda3DS/blob/ios/include/ios_driver.h Any help is more than appreciated. Thank you for your time in advance.
0
0
425
Mar ’25
MTLBinaryArchive Size
I'm trying to use MTLBinaryArchive. I collected a BinaryArchive from one device and used metal-tt to translate it for all supported iPhone devices, ranging from iPhone 7 Plus to iPhone 16. However, this BinaryArchive is quite large, around 1.5GB uncompressed, and about 500MB compressed in the IPA. I'm wondering how to address the size issue. I watched the WWDC 2022 video, which mentioned that the operating system or app installation process would handle compatibility. Does this compatibility support different GPU chips? I tried installing an IPA with a BinaryArchive collected only from an iPhone 12 on an iPhone 13, but the BinaryArchive didn't take effect. I also saw that Apple supports App Thinning. However, it seems that resources in the Asset Catalog cannot be accessed via URL, and creating an MTLBinaryArchive requires a URL. Is it possible for MTLBinaryArchive to be distributed through App Thinning? The WWDC 2022 video also mentioned using the -Os optimization flag to reduce size. Can this give an estimate of how much compression it would achieve? Are there any methods to solve the BinaryArchive size issue without impacting performance?
0
1
60
Mar ’25
Threadgroup configuration for tile shading
Hello! I have a question about how thread groups work with tile shading. When running "traditional" compute, I get to choose both thread group size and the grid size. However, when using tile shading kernel I only have dispatchThreadsPerTile method - this controls how many threads will be ran in each tile. So far so good, but what about thread groups? The examples in video "Tile Shading on A11" seem to suggest that there will be only one thread group per tile. In the video, [[thread_index_in_threadgroup]] is called "local_id" and it is used to access the image block. I assume this is the default configuration. So when one does the following: Creates MTLRenderPassDescriptor with tileWidth set to W and tileHeight set to H Fires up the tile shading kernel using dispatchThreadsPerTile with MTLSize size = { W, H, 1 } I understand that the result is 1-to-1 mapping between the tile "pixels" and kernel threads. Now, what I would like to do is to have more than one thread group there. I want this for performance reasons: I have a certain compute kernel which I know executes very well with small thread group size. In fact, { 32, 1, 1 } seems to be the fastest. My understanding is that even if I set tile size to 16x16, and so I am executing 256 threads there, there will only be one SIMD group active in a thread group. Meaning that this SIMD group has to execute 8 times over the tile. Is it possible somehow? Or perhaps the limitations of the API are pointing at the limitations of hardware itself, and if I want to execute with SIMD group sized thread groups I have to use "traditional" compute encoder? Will be grateful for help. Michał
0
0
48
Mar ’25
Slow compilation
Hi, I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected, however the compile times have gotten completely out of hand and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt however if I am wondering if there is a way to reduce the compile times in such a scenario, and how to investigate what the root cause of the problem is. I suspect it could have to do with the fact that every materials metallib contains duplications of all the inline functions. Any ideas on how to profile and debug this? Thanks, Rasmus
0
1
65
Mar ’25
Diagnose data access latency
The code is pretty simple kernel void naive( constant RunParams *param [[ buffer(0) ]], const device float *A [[ buffer(1) ]], // [N, K] device float *output [[ buffer(2) ]], uint2 gid [[ thread_position_in_grid ]]) { uint a_ptr = gid.x * param->K; for (uint i = 0; i < param->K; i++, a_ptr++) { val += A[b_ptr]; } output[ptr] = val; } when uint a_ptr = gid.x * param->K, the code got 150 GFLops when uint a_ptr = gid.y * param->K, the code got 860 GFLops param->K = 256; thread per group: [16, 16] I'd like to understand why the performance is so different, and how can I profile/diagnose this to help with further optimization.
0
0
61
Apr ’25