Metal Performance Shaders

Optimize graphics and compute performance with kernels that are fine-tuned for the unique characteristics of each Metal GPU family using Metal Performance Shaders.

Posts under Metal Performance Shaders tag

23 Posts

Post

Replies

Boosts

Views

Activity

Powermetrics GPU power vs system DC power discrepancy on M4 Max
While analyzing system power on an M4 Max under GPU-heavy compute workloads, I noticed that the GPU power reported by powermetrics does not come anywhere close to the total system DC power reported by the SMC counter PDTR (as used by utilities like mactop). For example, under a heavy GPU workload, powermetrics reports a 65 W idle-to-load delta on the GPU, while at the same time system DC power rises by 179 W. That leaves 114 W, nearly two-thirds of the rise in system DC power on a Mac Studio M4 Max, unexplained. From measurements, the gap appears to correlate with the amount of on-chip data movement (for example, varying bytes-per-FLOP in the workload changes the observed gap). Using SMC and IOReport, I was able to reverse engineer an energy model for the GPU that explains almost all of the energy flow, with less than 2% error on the workload I studied. The result is a simple two-term energy roofline model: P_GPU (the GPU_combined term in the plot) ≈ a * bytes + b * FLOPs, with a ≈ 5 pJ/byte for SRAM movement and b ≈ 2.7 pJ/FLOP for compute. Has anyone observed similar behavior, or is there guidance on how GPU power reported by IOReport/powermetrics should be interpreted relative to total system power? In particular, I’m interested in whether certain classes of GPU activity may not be attributed to the GPU component in IOReport. Full details with the methodology and results are available here: https://youtu.be/HKxIGgyeISM
Replies: 0 · Boosts: 0 · Views: 64 · Activity: 2w
MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)
Summary
Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code run correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend.

Environment
Affected: Mac mini M1, Macmini9,1 (A14-class)
Not affected: M2 and newer (A15-class and up)
Last known good: macOS Sequoia
First broken: macOS 26 (Tahoe)?
Confirmed broken on: macOS 26.3.1
Framework: Core ML + MPS backend
Language: C++ (via CoreML C++ API)

Description
We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on the Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs, including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails, either producing garbage output or crashing. The same binary, the same .mlpackage, and the same inputs work correctly on M2+. Our Apple contact has suggested that the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware.

Steps to Reproduce
1. Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model).
2. Run inference with MLComputeUnits::cpuAndGPU (MPS active).
3. Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1.
4. Compare the output to the same model running on M2 / macOS Sequoia.

Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior.
Actual: Incorrect or corrupted output (or a crash), only on A14-class hardware running macOS 26+.

Workaround
Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming that the issue is in the MPS compute path. This is not acceptable as a shipping workaround because of the performance impact.

Additional Notes
The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug.
We first became aware of this through a customer report.
Happy to provide a symbolicated crash log if helpful.

This text was summarized by AI and verified by a human.
Replies: 1 · Boosts: 0 · Views: 192 · Activity: 1w
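The "compare output" step in the SDPA regression post above can be made mechanical with a small element-wise check. The sketch below is an assumption-laden illustration, not the poster's method: it deliberately avoids any Core ML calls, the function names are hypothetical, the buffers stand in for model outputs already copied into float vectors, and the default tolerance of 1e-3 is an arbitrary placeholder that would need tuning per model.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest element-wise absolute difference between a reference output
// (for example, from a CPU-only run) and a candidate output (for example,
// from a GPU run). Returns +infinity when the sizes differ or when either
// buffer contains NaN or infinity, so corrupted output never passes.
double max_abs_diff(const std::vector<float>& reference,
                    const std::vector<float>& candidate) {
    const double kInf = std::numeric_limits<double>::infinity();
    if (reference.size() != candidate.size()) {
        return kInf;
    }
    double worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (!std::isfinite(reference[i]) || !std::isfinite(candidate[i])) {
            return kInf;
        }
        const double diff = std::fabs(static_cast<double>(reference[i]) -
                                      static_cast<double>(candidate[i]));
        if (diff > worst) {
            worst = diff;
        }
    }
    return worst;
}

// True when the candidate stays within the tolerance of the reference.
// The default tolerance is an assumed placeholder value.
bool outputs_match(const std::vector<float>& reference,
                   const std::vector<float>& candidate,
                   double tolerance = 1e-3) {
    return max_abs_diff(reference, candidate) <= tolerance;
}
```

Running the same input through a CPU-only configuration and a GPU configuration, then passing both outputs to outputs_match, gives a pass/fail signal that could be logged per device and OS version when narrowing down a regression like this one.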
Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3
I want to address the missing or incomplete DirectX calls in D3DMetal and Game Porting Toolkit 3. These gaps have caused issues in our porting process, and we are reconsidering the port.

Missing or Incomplete Calls
DXGI_FEATURE_PRESENT_ALLOW_TEARING (IDXGIFactory5::CheckFeatureSupport): this call relates to how VSync is handled, and some modern games require it to initialize. Currently D3DMetal returns 0, maybe by design but most likely because it is not integrated. Adding a stub that returns 1 can fix this. In my use case I simply NOP'd the check and forced it to continue.
D3D12_FEATURE_D3D12_OPTIONS2.DepthBoundsTestSupported: this call is also not present, which causes games to fail to initialize rendering. Thankfully this was fixed by once again skipping the check, but the feature is essential for water rendering. This could be one reason water is currently not rendering in our game.
IDXGIOutput6::GetDesc1().ColorSpace: returns DXGI_COLOR_SPACE_RGB_FULL_G22_NONE_P709 (SDR) on external HDR-compatible displays. We were able to work around this by forcing HDR to be enabled. It should report HDR support.

These calls may exist, but they need to be updated to return the correct values. For the depth bounds test specifically, you can reference MoltenVK, which implements it on top of Metal since it is not a native feature. The water issue could also be caused by how the shaders are compiled, but I’m unable to check because of the closed-source nature of GPTK and its debuggers. What is a better way to debug our game to see why the water isn’t rendering? Does D3DMetal have debug options or something similar?

Feedback Number: FB22330617 - Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3

We hope these issues are resolved quickly because we were planning a simultaneous release with our Windows version, but we can't ship with such large bugs.
Replies: 5 · Boosts: 3 · Views: 233 · Activity: 1w