Metal Performance Shaders

Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.

Posts under Metal Performance Shaders tag

23 Posts

Post | Replies | Boosts | Views | Activity

“Accelerate Transformer Training on Apple Devices from Months to Hours!”
I am excited to share that I have developed a Metal kernel for Flash Attention that eliminates race conditions and fully leverages Apple Silicon’s shared memory and registers. This kernel can dramatically accelerate training of transformer-based models. Early benchmarks suggest that models which previously required months to train could see reductions to just a few hours on Apple hardware, while maintaining numerical stability and accuracy. I plan to make the code publicly available to enable the broader community to benefit. I would be happy to keep you updated on the latest developments and improvements as I continue testing and optimizing the kernel. I believe this work could provide valuable insights for Apple’s machine learning research and products.
0
0
102
1d
MPSMatrixRandom SEGFAULTs when run in an async context
The following minimal snippet SEGFAULTS with SDK 26.0 and 26.1. It does not crash if I remove async from the enclosing function signature, but that is impractical in a real project.

    import Metal
    import MetalPerformanceShaders

    let SEED = UInt64(0x0)
    typealias T = Float16

    /* Why ran in async context? Because global GPU object, and async makeMTLFunction, and async makeMTLComputePipelineState.
       Nevertheless, can trigger the bug without using global
       @MainActor let myGPU = MyGPU()
    */

    @main
    struct CMDLine {
        static func main() async {
            let ptr = UnsafeMutablePointer<T>.allocate(capacity: 0)
            async let future: Void = randomFillOnGPU(ptr, count: 0)
            print("Main thread is playing around")
            await future
            print("Successfully reached the end.")
        }

        static func randomFillOnGPU(_ buf: UnsafeMutablePointer<T>, count destbufcount: Int) async {
            // let (device, queue) = await (myGPU.device, myGPU.commandqueue)
            let myGPU = MyGPU()
            let (device, queue) = (myGPU.device, myGPU.commandqueue)
            // Init MTLBuffer, async let makeFunction, makeComputePipelineState, etc.
            let tempDataType = MPSDataType.uInt32
            let randfiller = MPSMatrixRandomMTGP32(device: device, destinationDataType: tempDataType, seed: Int(bitPattern:UInt(SEED)))
            print("randomFillOnGPU: successfully created MPSMatrixRandom.")
            // try await computePipelineState
            // ^ Crashes before this could return
            // Or in this minimal case, after randomFillOnGPU() returns
            // make encoder, set pso, dispatch, commit...
        }
    }

    actor MyGPU {
        let device : MTLDevice
        let commandqueue : MTLCommandQueue

        init() {
            guard let dev: MTLDevice = MPSGetPreferredDevice(.skipRemovable),
                  let cq = dev.makeCommandQueue(),
                  dev.supportsFamily(.apple6) || dev.supportsFamily(.mac2)
            else {
                print("Unable to get Metal Device! Exiting")
                exit(EX_UNAVAILABLE)
            }
            print("Selected device: \(String(format: "%llX", dev.registryID))")
            self.device = dev
            self.commandqueue = cq
            print("myGPU: initialization complete.")
        }
    }

See FB20916929.
Apparently the Objective-C autorelease pool is releasing the wrong address during a context switch (across suspension points). I wonder why such an obvious case has not been caught before.
0
0
40
1d
“Unleashing the MacBook Air M2: 673 TFLOPS Achieved with Highly Optimized Metal Shading Language”
Using highly optimized Metal Shading Language (MSL) code, I pushed the MacBook Air M2 to its performance limits with the deformable_attention_universal kernel. The results demonstrate both the efficiency of the code and the exceptional power of Apple Silicon. The total computational workload exceeded 8.455 quadrillion FLOPs, equivalent to processing 8,455 trillion operations. On average, the code sustained a throughput of 85.37 TFLOPS, showcasing the chip’s remarkable ability to handle massive workloads. Peak instantaneous performance reached approximately 673.73 TFLOPS, reflecting near-optimal utilization of the GPU cores. Despite this intensity, the cumulative GPU runtime remained under 100 seconds, highlighting the code’s efficiency and time optimization. The fastest iteration achieved a record processing time of only 0.051 ms, demonstrating minimal bottlenecks and excellent responsiveness. Memory management was equally impressive: peak GPU memory usage never exceeded 2 MB, reflecting efficient use of the M2’s Unified Memory. This minimizes data transfer overhead and ensures smooth performance across repeated workloads. Overall, these results confirm that a well-optimized Metal implementation can unlock the full potential of Apple Silicon, delivering exceptional computational density, processing speed, and memory efficiency. The MacBook Air M2, often considered an energy-efficient consumer laptop, is capable of handling highly intensive workloads at performance levels typically expected from much larger GPUs. This test validates both the robustness of the Metal code and the extraordinary capabilities of the M2 chip for high-performance computing tasks.
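The figures quoted in this post can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above (total FLOPs, average throughput, peak throughput). The variable names are illustrative. It derives the runtime implied by the total work and the average rate, plus the ratio of peak to average throughput.

```swift
// Figures as quoted in the post.
let totalFLOPs = 8.455e15        // "8.455 quadrillion FLOPs"
let averageFLOPS = 85.37e12      // "85.37 TFLOPS" sustained average
let peakFLOPS = 673.73e12        // "673.73 TFLOPS" peak instantaneous

// Runtime implied by total work divided by the average rate.
let impliedSeconds = totalFLOPs / averageFLOPS

// How far the quoted peak sits above the quoted average.
let peakToAverage = peakFLOPS / averageFLOPS

print("implied runtime (s):", impliedSeconds)
print("peak / average:", peakToAverage)
```

The implied runtime comes out at roughly 99 seconds, which matches the post's statement that cumulative GPU runtime stayed under 100 seconds. This only shows the quoted numbers are internally consistent. It says nothing about how the FLOPs were counted or measured.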
0
0
314
3d
Can't I use Metal in a DeviceActivityReportExtension?
I am trying to build an app that shows a beautiful visualization of the user's activity, but I found that if I write Metal code in the View of a DeviceActivityReportScene, the Metal code does not work (the same Metal code works in other targets). I can switch to Canvas, but its performance is poor compared with Metal. Can Metal be used here, or is it simply not supported?
0
0
202
Sep ’25
Metal IR reference
Hello! I'm developing a GPU (shader) language, where I aim to target multiple backends with a common frontend. I wanted to avoid having to round trip through Metal, and go straight to IR just like I have with SPIRV, in order to have a fast and efficient compilation process. I've been looking for a reference page where I can read about Metal's IR, and as far as I'm aware, it exists, but I can't seem to find it anywhere. Furthermore, if such a reference is available, is there also a toolkit where I can run validation on the output IR, and perhaps even run optimizations, much like spv-tools for SPIRV? Any help would be appreciated! Thanks, Gustav
2
0
255
Jul ’25
VisionOS 26 - threadsPerThreadgroup limit causing crash on device (but not in simulator)
Hi all, I'm running into an issue with an app that previously worked fine on device using visionOS 2.0. After updating to visionOS 26, the same code runs fine in the simulator but crashes on the device with the following error: -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1330: failed assertion `(threadsPerThreadgroup.width(32) * threadsPerThreadgroup.height(32) * threadsPerThreadgroup.depth(1))(1024) must be <= 832. (kernel threadgroup size limit)` Is there any documented way to check or increase the allowed threadsPerThreadgroup size on Apple Vision Pro? Or any recommended workaround for this regression? Thanks in advance!
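The assertion above reports a kernel threadgroup size limit of 832, below the 1024 threads (32 x 32 x 1) being dispatched. In Metal this per-pipeline limit is exposed as maxTotalThreadsPerThreadgroup on MTLComputePipelineState, alongside threadExecutionWidth. A minimal sketch of deriving the threadgroup size from those two values, rather than hard-coding 32 x 32, might look like the following. The helper name threadgroupDimensions is hypothetical, and the execution width of 32 is an assumption chosen to match the width in the error message.

```swift
/// Picks a 2D threadgroup size that stays within a pipeline's thread limit.
/// `maxTotalThreads` stands in for MTLComputePipelineState.maxTotalThreadsPerThreadgroup,
/// `executionWidth` for MTLComputePipelineState.threadExecutionWidth.
func threadgroupDimensions(maxTotalThreads: Int, executionWidth: Int) -> (width: Int, height: Int) {
    precondition(maxTotalThreads > 0 && executionWidth > 0)
    // Use the execution width as the row length, then fit as many rows as the limit allows.
    let width = min(executionWidth, maxTotalThreads)
    let height = max(1, maxTotalThreads / width)
    return (width, height)
}

// Values taken from the error message (limit 832) and an assumed execution width of 32.
let dims = threadgroupDimensions(maxTotalThreads: 832, executionWidth: 32)
print(dims.width, dims.height, dims.width * dims.height)
```

On device, the two inputs would be read from the compute pipeline state at runtime, and the result would be passed as an MTLSize with depth 1. Since the limit can differ between the simulator and hardware, querying it per pipeline avoids baking in either value.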
3
0
141
Jun ’25
CoreML memory allocation logic
Hello, I have a question about Core ML. I loaded a Core ML model in my project and set the compute units to CPU+GPU. When I used Instruments to analyze performance, I found that there was an overhead of prepare gpu request before each inference. I also checked the freezing point graph and found that memory was being allocated frequently. Is this expected? Is there any way to avoid the frequent prepares? I have tried some approaches, such as sharing memory for the input parameters of the predict interface, but they seem to be ineffective.
0
0
79
May ’25
CoreML Model Conversion Help
I’m trying to follow Apple’s “WWDC24: Bring your machine learning and AI models to Apple Silicon” session to convert the Mistral-7B-Instruct-v0.2 model into a Core ML package, but I’ve run into a roadblock that I can’t seem to overcome. I’ve uploaded my full conversion script here for reference: https://pastebin.com/T7Zchzfc When I run the script, it progresses through tracing and MIL conversion but then fails at the backend_mlprogram stage with this error: https://pastebin.com/fUdEzzKM The core of the error is: ValueError: Op "keyCache_tmp" (op_type: identity) Input x="keyCache" expects list, tensor, or scalar but got state[tensor[1,32,8,2048,128,fp16]] I’ve registered my KV-cache buffers in a StatefulMistralWrapper subclass of nn.Module, matching the keyCache and valueCache state names in my ct.StateType definitions, but Core ML’s backend pass reports the state tensor as an invalid input. I’m using Core ML Tools 8.3.0 on Python 3.9.6, targeting iOS18, and forcing CPU conversion (MPS wasn’t available). Any pointers on how to satisfy the handle_unused_inputs pass or properly declare/cache state for GQA models in Core ML would be greatly appreciated! Thanks in advance for your help, Usman Khan
0
0
162
May ’25
Slow compilation
Hi, I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected; however, the compile times have gotten completely out of hand, and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt, but I am wondering whether there is a way to reduce the compile times in such a scenario, and how to investigate the root cause of the problem. I suspect it could have to do with the fact that every material's metallib contains duplicates of all the inline functions. Any ideas on how to profile and debug this? Thanks, Rasmus
0
1
70
Mar ’25
Tile Shaders performance when writing to tile texture vs. resolve texture
I am working on a custom resolve tile shader for a client. I see a big difference in performance depending on where we write to: 1) the resolve texture of the color attachment, or 2) a read-write tile shader texture set via [renderEncoder setTileTexture: myResolvedTexture]. Option 2 is more than twice as slow as option 1. Our compute shader writes to 4 UAVs, so just using the resolve texture entry is not possible. Why is there such a difference when no more data is being written? Can option 2 be as fast as option 1? I can demonstrate the issue in a modified version of the Multisample code sample.
5
0
547
Feb ’25
Instruments showing incorrect values
Hello, I’m encountering an issue with the Instruments app while running a benchmark on an M2 Ultra Mac Studio. Despite being certain that GPU activities involving memory read and write operations are occurring, all related performance counters consistently return 0. Interestingly, this problem does not occur when using the same code on an M1 MacBook Air, where the counters behave as expected. What could be causing this discrepancy? Any insights or suggestions would be greatly appreciated. Thank you!
0
0
443
Jan ’25
SwiftUI glitch with coloreffect shader & orientation change
Hi, I have the following SwiftUI code:

    Image(uiImage: image)
        .resizable()
        .aspectRatio(contentMode: .fit)
        .colorEffect(ShaderLibrary.AlphaConvert())

and the following shader:

    [[ stitchable ]] half4 AlphaConvert(float2 position, half4 currentColor) {
        return half4(currentColor.r>0.5,currentColor.r<=0.5,0,(currentColor.r>0.5));
    }

I am loading a full-res image from my photo library (24MP). The image initially displays fine, with portions of the image red and the rest black (due to alpha blending). However, after rotating the device, I get an image that is a combination of red and green. Note that the green pixels from the shader have alpha 0 and hence should never be seen. Is there something special that needs to be done on orientation changes so that the shader works fine?
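The shader above returns green with alpha 0 for dark pixels. SwiftUI's colorEffect documentation describes the incoming color as premultiplied. One possible factor is that a color with nonzero RGB and zero alpha is not a valid premultiplied value. This is an assumption, not a confirmed diagnosis of the rotation glitch. The sketch below models the shader's logic on the CPU in Swift and scales RGB by alpha before returning. The function name alphaConvertPremultiplied is hypothetical.

```swift
/// CPU-side model of the post's AlphaConvert logic, returning a premultiplied color
/// (RGB already scaled by alpha) as (r, g, b, a).
func alphaConvertPremultiplied(red: Float) -> SIMD4<Float> {
    // Straight (non-premultiplied) channels, mirroring the original shader.
    let straightRed: Float = red > 0.5 ? 1 : 0
    let straightGreen: Float = red <= 0.5 ? 1 : 0
    let alpha: Float = red > 0.5 ? 1 : 0
    // Premultiply: fully transparent pixels end up with all channels at zero.
    return SIMD4<Float>(straightRed * alpha, straightGreen * alpha, 0, alpha)
}

let darkPixel = alphaConvertPremultiplied(red: 0.2)
let brightPixel = alphaConvertPremultiplied(red: 0.8)
print(darkPixel, brightPixel)
```

With premultiplication, the dark branch yields a fully zero color instead of green with zero alpha, so it carries no green to leak through however it is later composited. The same scaling could be applied in the Metal shader's return expression.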
0
0
375
Dec ’24
Texture Definitions for MPSSVGF Denoise
I am trying to use the SVGF denoiser to denoise my ray traced shadows (and also other textures later). I do get a smoothed image, but with wonky denoising. I need the depth-normal textures and motion textures for the SVGF and assume that these are badly filled in my case. However, neither in the above linked documentation nor in the WWDC19 video do I find how they should be defined. I am looking for answers to the following: Is depth in the red or alpha channel of the depth-normal texture? Are the normals in screen space? Is depth linear? Is it distance or the z coordinate in view space? Or is it logarithmically scaled or something else? Are the motion vectors supposed to be in pixels per frame? What is the orientation of the axes? Is y up or down? Are there other restrictions on the formats? The linked code did not help me either (I have not found any SVGF so far; also, all the code is in Objective-C++, not Swift, but that's a different topic). So how should I fill these textures? Can someone point me to the documentation where these kinds of questions are answered?
0
0
533
Dec ’24
How to use imageblock_slice
Is there a working example of imageblock_slice with implicit layout somewhere? I get a compilation error when I write this:

    imageblock_slice color_slice = img_blk.slice(frag->color);

Error:

    No matching member function for call to 'slice'
    candidate template ignored: couldn't infer template argument 'E'
    candidate function template not viable: requires 2 arguments, but 1 was provided
    Too few template arguments for class template 'imageblock_slice'

It seems the syntax has changed since the Imageblocks presentation https://developer.apple.com/videos/play/tech-talks/603/ I tried supplying the struct type of the image block between <> but it still does not work.
1
0
640
Dec ’24
Normally distributed MPSMatrixRandom number generation generates NaN
When generating large arrays of random numbers, NaNs show up. They also show up at the same indices when using the same seed, leading me to believe that this is a bug with MPSMatrixRandom's normally distributed Float32 random number distribution. Happens with both Philox and MTGP32. Is this intentional and how do I work around this? See the original post for a MWE in Swift and Julia: https://github.com/JuliaGPU/Metal.jl/issues/474
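Until the cause is known, one stopgap is to scan the generated values on the CPU and patch any NaNs. The sketch below assumes the buffer contents have already been copied into a Swift array. The helper names nanIndices and replacingNaNs are hypothetical and are not part of MPS.

```swift
/// Returns the positions of every NaN in `values`.
func nanIndices(in values: [Float]) -> [Int] {
    values.indices.filter { values[$0].isNaN }
}

/// Returns a copy of `values` with each NaN replaced by a value from `fallback`.
func replacingNaNs(in values: [Float], with fallback: () -> Float) -> [Float] {
    values.map { $0.isNaN ? fallback() : $0 }
}

// Stand-in data; in practice this would be read back from the MTLBuffer
// that MPSMatrixRandom filled.
let sample: [Float] = [0.3, .nan, -1.2, .nan, 0.9]
let bad = nanIndices(in: sample)
let cleaned = replacingNaNs(in: sample) { 0 }   // 0 is the mean of a standard normal
print(bad, cleaned)
```

Because the post reports that NaNs recur at the same indices for a given seed, logging the output of nanIndices across runs could also help in a bug report. Replacing NaNs with a constant slightly distorts the distribution, so passing a closure that draws a fresh normal sample would be the more faithful choice.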
1
1
639
Dec ’24