Core ML

Integrate machine learning models into your app using Core ML.

Core ML Documentation

Post

Replies

Boosts

Views

Activity

关于我使用Swift和Metal制作的神经网络引擎

我今年18岁。没有机器学习背景，没有上过大学，高中都没去上，没有导师。几天前我盯着一张纸发呆。突然想：为什么计算机神经网络一定要是2D的？可以模拟生物吗？为什么一定要在平面上算？如果多个平面，岂不是翻倍？如果把六张纸想象成一个魔方，六个面各自承载神经元，八条体对角线变成新的通信通道会怎么样？我真的很喜欢折腾这些，然后我立刻制定了详细计划，使用AI工具辅助写下了第一个 kernel。跑崩了。我又重新想了一下，和qq群友分享了我的目标，又写。又崩。连续几十次。没有 PyTorch，没有 TensorFlow，没有 CUDA。只有Swift和Metal。因为我的电脑显卡是AMD Vega 64，没装任何框架辅助，因为我想明白最底层的运行方式是什么原理。这就是CubeNN。 ##以下为AI的详细解答，内容与架构改动太多，我在这里一次讲不清楚它是什么一个用魔方几何作为计算架构的神经网络引擎。标准 Transformer: 把数据排成一行，O(n²) 地互相看 CubeNN: 把数据分布在 14 个面上，只在该看的地方看 6 个标准面 → 块稀疏注意力（粗看全局 + 细看局部） 8 个 X 面对角线 → 跨面信息桥（不做 Attention，只负责传递）每轮：6 面算 → 投影到 8 X 面 → 上采样精炼 → 融合回 6 面最关键的是 Cube Cascade——一个树+链级联推理：树阶段: 1 个魔方 spawn 8 个 → 8 个 spawn 64 个 → 73 个并行探索 GPU 上同时跑，选最优路径链阶段: 最优叶子无限深度精炼 3-5 步收敛，方差提升 ~7% 怎么实现的纯 Swift + Metal。零依赖。零框架。 // 大致代码就是这些 import Metal import Foundation let device = MTLCreateSystemDefaultDevice()! let library = try! device.makeLibrary(filepath: "cube_nn.metallib") // ...12 个 GPU kernel，12,000 次 dispatch 关键技术决策：单 Command Buffer：整个树阶段 73 个魔方的全部 kernel dispatch 打包进一个 CB，0 次 CPU-GPU 同步 Pipeline State 缓存：编码从 1022ms 降到 42ms Buffer 偏移：所有 73 个魔方的 14 个面存进一个连续 buffer，kernel 通过 buffer(15) 传偏移量 FP16：N≥64 时半精度提速 21% 性能 ##经过测试，但是因设备差异可能不准确，仅参考 AMD Radeon RX Vega 64 (2017 年显卡, 14nm, 295W): 规模神经元魔方数耗时 N=32 6,144 73 (树) 435ms N=64 24,576 21 (树) 817ms N=128 98,304 1 116ms N=32 全连接 Attention 每层 201M FLOP → CubeNN 块稀疏 370K FLOP (544× 减少) N=128 全连接需要 32GB 显存（物理上不存在）→ CubeNN 用 192KB N=256 全连接需要 2.2T FLOP → CubeNN 52M FLOP (42,300× 减少) 代码体积：161KB。对比 PyTorch 的 800MB。我经历了什么这个项目最困难的不是写 kernel，是在没有任何人告诉我"能不能做"的情况下，靠反复试错找到路。第一次试图跑 73 个魔方，GPU 直接 hang 了。花了 3 天定位到是 Command Buffer 堆叠过多。改了 single encoder 方案，又碰上 SIGILL——Metal 不允许 makeBuffer(length: 0)，B=0 时创建了零长度 buffer。想用 threadgroup memory 做 kernel fusion，结果跨 threadgroup 读不到数据，才明白 LDS 是 per-group 的。 N=64 的 FP16 要手动写 float↔half 转换函数，因为 macOS 11 上 Float16 类型被标为 unavailable。每一次崩溃都教会我一个 Metal 的底层细节。没有人教我，但 Metal 的报错信息就是最好的老师。为什么发在 Apple 开发者论坛因为这是为苹果生态而生的项目。CubeNN 从头到尾只用了两个东西：Swift 和 Metal。它不需要移植就能跑在任何 Apple Silicon Mac 上（API兼容）。如果未来能把部分 kernel 映射到 Neural Engine，效率会再翻几倍。我想问 Apple 的 Metal 工程师和 Core ML 团队： ** 有没有更好的 GPU 任务调度方式？**目前表现仍然欠佳（对于我这个完美主义者来说），可能改得有点乱了有没有兴趣评估这个架构在 M4 上的表现？我手里只有 Vega 64。M4 GPU + ANE方法跑 CubeNN 会是什么效果？源代码 ├── run.swift # 统一 CLI，参数化 N/B/depth ├── src/ │ ├── cube_nn.metal # FP16 kernel │ └── cube_nn_fp32.metal # FP32 kernel └── benchmarks/ # 实测数据如果你读到了这里——谢谢你。一个门外汉靠痴狂的，纯粹到几乎是妄想的主意和Metal走到了这里。我懂的不是很多，如果这个架构有任何价值，我想让它变得更好。任何建议、批评、或者指教，都非常欢迎。

Machine Learning & AI Core ML Swift Metal

Core ML RIP?

No mention of Core ML at WWDC26... Shall we assume it was replaced by Core AI? What about Adapters?

Machine Learning & AI Core ML

127

Silent FP16 Overflow in coremltools: 5 Numerical Failures Affecting ANE Inference (With Fixes)

Silent FP16 Overflow in coremltools: 5 Numerical Failures Affecting ANE Inference (With Fixes) Hi everyone, With the announcements at WWDC26 regarding Core AI and "automatic stable decompositions," it is clear that managing mathematical stability in constrained FP16 environments is a major priority for the ecosystem. To support developers maintaining existing models that cannot migrate to the newest architectures overnight, I have published a research paper and an open-source static analysis tool documenting 5 silent numerical failures in the standard coremltools pipeline. Because the Apple Neural Engine (ANE) executes inference in FP16, the maximum representable value is 65,504 ($\exp(11.09)$). Inputs exceeding these tight bounds cause silent overflows to infinity or collapses to zero without warnings. Deployed Operations Currently Affected softplus (YOLOv5/v8): Outputs silently collapse to 0.0 at $x > 10.4$ on ANE. logsumexp (Attention mechanisms): Overflows at $x > 7.63$ for 32 channels. For vocabulary-sized reductions, the threshold drops below $5$. log_softmax (Classifiers like BERT, GPT, ViT): Softmax probabilities underflow to 0, causing $\log(0) = -\infty$. logcumsumexp (CTC decoders): Overflows at $x > 11.09$. mish (YOLO variants): Inherits the softplus overflow limits. The Immediate Safety Net: Algebraically Equivalent Reformulations We can bypass these hardware limits entirely by rewriting the operations into mathematically stable forms. For example, rewriting softplus as: $$\max(x, 0) + \log(1 + \exp(-|x|))$$ Because $-|x| \le 0$, $\exp(-|x|)$ is bound strictly between $(0, 1]$. Overflow becomes mathematically impossible in any precision, yielding bit-identical outputs for all valid inputs. While PyTorch AMP traditionally classifies these operations as FP32-only, the ANE has no such fallback—making stable decomposition mandatory. Tools & Patches Deployed Today The Paper: "Silent Numerical Failures in On-Device ML Converters: A Systematic Audit of FP16 Overflow in Apple Neural Engine Deployment." (Complete vulnerability census, discrepancy pattern analysis, formal proofs, and quantitative evaluation). The Tool (ane-fp16-lint): A CLI that scans .mlpackage files and flags FP16-unsafe operations before you push to production. It detects nine patterns and provides stable alternatives for each. The Fixes: We have submitted three Pull Requests to the official apple/coremltools repository implementing these stable decompositions, which are currently under review by Apple's Core ML team. While Core AI introduces great automated stability for new architectures like the 20B AFM 3 Core Advanced, millions of deployed production models still need an immediate safety net. Full technical paper, proofs, and the linting tool are available on GitHub: github.com/apple-f16-overflow-audit (Note: Replace with your direct, clean GitHub repository link—avoiding social media redirects so the forum filters do not auto-flag the post) Looking forward to hearing if anyone else has run into these unexpected discrepancy patterns in production!

Machine Learning & AI Core ML

LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?

We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive. In several cases, MoE models significantly outperform dense models despite having similar total parameter counts. Examples (simplified): Dense model: ~30B parameters MoE model: ~30B total parameters, ~3B active parameters On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead. A few hypotheses we're considering: Active parameter count appears to matter more than total parameter count for decode throughput. Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected. Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not. Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput. What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically. Has anyone profiled decode throughput across: dense models large-expert MoE many-small-expert MoE and identified which hardware characteristics are actually driving the difference? I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.

Machine Learning & AI Core ML

Apple GPU forward progress guarantees for persistent-thread synchronization?

We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs. We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics. What we've observed: Device-scope atomics are available. It is possible to build global counters and persistent-thread style coordination structures. However, we cannot find any documented guarantee regarding: threadgroup co-residency, global forward progress, occupancy-bounded synchronization safety. In our experiments, synchronization schemes that rely on all threadgroups making progress eventually can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust. Questions: Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification? Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions? For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API? We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems. Thanks.

Machine Learning & AI Core ML

Resolving co channel interference VOIP

Subject: Inquiry Regarding Architectural Overhead and Buffer Access in the Push to Talk Framework for Real-Time Core ML Blind Source Separation Dear Apple Engineering Team, We are currently developing an Apple-native communication platform that utilizes the Push to Talk framework alongside Core ML to handle real-time, on-device audio processing. We are working to resolve the issue of single-channel, co-channel interference (overlapping voice streams) directly on the edge. Our current challenge lies in the pipeline latency and background lifecycle constraints when intercepting incoming audio buffers. To cleanly separate overlapping voices before they hit the audio output mixer, we need to process the raw PCM data immediately upon arrival. Could you please provide guidance on the following architectural questions: Low-Latency Buffer Interception: What is the recommended design pattern within the PTChannelManagerDelegate flow to pass raw incoming audio buffers directly to a Core ML model running on the Apple Neural Engine (ANE) before the system routes them to AVAudioEngine for playback? Background Thread Management: Given the strict background execution boundaries enforced by the Push to Talk framework, how can we best optimize thread scheduling to ensure our speech separation model completes its execution without triggering an OS background processing timeout or process termination? Dynamic UI Manifestation: Once a combined audio stream is separated into two clean, distinct voice vectors on-device, what is the best approach for registering multiple PTParticipant states simultaneously so that the native system UI (like the Dynamic Island) accurately reflects both speakers? Thank you for your time, insights, and continued support of developer innovation within the iOS and iPadOS ecosystems. Best regards, Ken Zakreski Founder, Marine Link Pro

Machine Learning & AI Core ML

_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU

When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights). the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer: import tensorflow as tf import numpy as np # Any shape works; larger K makes the error more obvious M, K, N = 64, 2048, 1024 W = tf.Variable(tf.random.normal([K, N])) b = tf.Variable(tf.random.normal([N])) x = tf.random.normal([M, K]) @tf.function def graph_fused(x): return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) @tf.function def graph_safe(x): return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct fused_out = graph_fused(x) # Metal graph mode = WRONG safe_out = graph_safe(x) # Metal graph mode = correct print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}") # ^ typically 30–80+ (WRONG) print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}") # ^ typically ~1e-5 (correct) Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.

Machine Learning & AI Core ML Metal tensorflow-metal

1.1k

When will mps support fp8 dtypes?

https://github.com/pytorch/pytorch/issues/132624 this fp8 dtypes unsupport issue has been existed for 2 years, does mlx have any plan to it?

Machine Learning & AI Core ML ML Compute

614

Do loading multiple functions that share model weights multiply memory use?

Hi, I have a multifunction model where the functions share the same model weights, and for latency I have multiple functions loaded at the same time. According to what Codex found this multiplies RAM usage, so if the single model weights 2GB, loading two functions that share the underlying weights still doubles RAM usage to 4GB (seems that it is something like neural wired memory). Does anyone have any knowledge relating to this?

Machine Learning & AI Core ML

1.2k

CoreML model cache causes fake hard drive memory usage

Hi, I experiment by creating and compiling a lot of CoreML models and I have the issue that this causes a lot of disk usage, but when I try to delete everything (I search in the disk for possible CoreML cache directories) the disk space is not actually freed up. This is a picture of my disk usage according to what is shown inside of Settings>General>Storage and the Disk Utility app. I am running on macOS 15.7.5

Machine Learning & AI Core ML

1.5k

May ’26

Does using Vision API offline to label a custom dataset for Core ML training violate DPLA?

Hello everyone, I am currently developing a smart camera app for iOS that recommends optimal zoom and exposure values on-device using a custom Core ML model. I am still waiting for an official response from Apple Support, but I wanted to ask the community if anyone has experience with a similar workflow regarding App Review and the DPLA. Here is my training methodology: I gathered my own proprietary dataset of original landscape photos. I generated multiple variants of these photos with different zoom and exposure settings offline on my Mac. I used the CalculateImageAestheticsScoresRequest (Vision framework) via a local macOS command-line tool to evaluate and score each variant. Based on those scores, I labeled the "best" zoom and exposure parameters for each original photo. I used this labeled dataset to train my own independent neural network using PyTorch, and then converted it to a Core ML model to ship inside my app. Since the app uses my own custom model on-device and does not send any user data to a server, the privacy aspect is clear. However, I am curious if using the output of Apple's Vision API strictly offline to label my own dataset could be interpreted as "reverse engineering" or a violation of the Developer Program License Agreement (DPLA). Has anyone successfully shipped an app using a similar knowledge distillation or automated dataset labeling approach with Apple's APIs? Did you face any pushback during App Review? Any insights or shared experiences would be greatly appreciated!

Machine Learning & AI Core ML App Review Vision Machine Learning Core ML

531

Apr ’26

Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all, I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. Sharing it here in case anyone else hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream. Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core Why As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text. What's in the package A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer) Per-Layer Embedding (PLE) support — the shared embedding table that feeds every decoder layer through a gated MLP as a third residual KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte Measured on iPhone (A-series, 7.4 GB RAM) Model: mlx-community/gemma-4-e2b-it-4bit Warm load: ~6 s Memory after load: 341–392 MB Time to first token (end-to-end, 333-token system prompt): 2.82 s Generation throughput: 12–14 tok/s What I'd love feedback on Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed? The chat-template bypass works but feels like a workaround. Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely? Anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging a 0.2.0. Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.

Machine Learning & AI Core ML

416

Apr ’26

CoreML MLE5ProgramLibrary AOT recompilation hangs/crashes on iOS 26.4 — C++ exception in espresso IR compiler bypasses Swift error handling

Area: CoreML / Machine Learning Describe the issue: On iOS 26.4, calling MLModel(contentsOf:configuration:) to load an .mlpackage model hangs indefinitely and eventually kills the app via watchdog. The same model loads and runs inference successfully in under 1 second on iOS 26.3.1. The hang occurs inside eort_eo_compiler_compile_from_ir_program (espresso) during on-device AOT recompilation triggered by MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:. A C++ exception (__cxa_throw) is thrown inside libBNNS.dylib during the exception unwind, which then hangs inside __cxxabiv1::dyn_cast_slow and __class_type_info::search_below_dst. Swift's try/catch does not catch this — the exception originates in C++ and the process hangs rather than terminating cleanly. Setting config.computeUnits = .cpuOnly does not resolve the issue. MLE5ProgramLibrary initialises as shared infrastructure regardless of compute units. Steps to reproduce: Create an app with an .mlpackage CoreML model using the MLE5/espresso backend Call MLModel(contentsOf: modelURL, configuration: config) at runtime Run on a device on iOS 26.3.1 — loads successfully in <1 second Update device to iOS 26.4 — hangs indefinitely, app killed by watchdog after 60–745 seconds Expected behaviour: Model loads successfully, or throws a catchable Swift error on failure. Actual behaviour: Process hangs in MLE5ProgramLibrary.lazyInitQueue. App killed by watchdog. No Swift error thrown. Full stack trace at point of hang: Thread 1 Queue: com.apple.coreml.MLE5ProgramLibrary.lazyInitQueue (serial) frame 0: __cxxabiv1::__class_type_info::search_below_dst libc++abi.dylib frame 1: __cxxabiv1::(anonymous namespace)::dyn_cast_slow libc++abi.dylib frame 2: ___lldb_unnamed_symbol_23ab44dd4 libBNNS.dylib frame 23: eort_eo_compiler_compile_from_ir_program espresso frame 24: -[MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:] CoreML frame 25: -[MLE5ProgramLibrary _programLibraryHandleWithForceRespecialization:error:] CoreML frame 26: __44-[MLE5ProgramLibrary prepareAndReturnError:]_block_invoke CoreML frame 27: _dispatch_client_callout libdispatch.dylib frame 28: _dispatch_lane_barrier_sync_invoke_and_complete libdispatch.dylib frame 29: -[MLE5ProgramLibrary prepareAndReturnError:] CoreML frame 30: -[MLE5Engine initWithContainer:configuration:error:] CoreML frame 31: +[MLE5Engine loadModelFromCompiledArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML frame 32: +[MLLoader _loadModelWithClass:fromArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML frame 45: +[MLModel modelWithContentsOfURL:configuration:error:] CoreML frame 46: @nonobjc MLModel.__allocating_init(contentsOf:configuration:) GKPersonalV2 frame 47: MDNA_GaitEncoder_v1_3.__allocating_init(contentsOf:configuration:) frame 48: MDNA_GaitEncoder_v1_3.__allocating_init(configuration:) frame 50: GaitModelInference.loadModel() frame 51: GaitModelInference.init() iOS version: Reproduced on iOS 26.4. Works correctly on iOS 26.3.1. Xcode version: 26.2 Device: iPhone (model used in testing) Model format: .mlpackage

Machine Learning & AI Core ML ML Compute

940

Apr ’26

How does ARKit achieve low-latency and stable head tracking using only RGB camera ?

Hi, I’m working on a real-time head/face tracking pipeline using a standard 2D RGB camera, and I’m trying to better understand how ARKit achieves such stable and responsive results in comparable conditions. To clarify upfront: I’m specifically interested in RGB-only tracking and the underlying vision/ML pipeline. I’m not using TrueDepth or any depth/IR-based sensors, and I’d like to understand how similar stability and responsiveness can be achieved under those constraints. In my current setup, I estimate head pose from RGB frames (facial landmarks + PnP) and apply temporal filtering (e.g., exponential smoothing and Kalman filtering). This significantly reduces jitter, but introduces noticeable latency, especially during faster head movements. What stands out in ARKit is that it appears to maintain both: Very low jitter Very low perceived latency even when operating with camera input alone. I’m trying to understand what techniques might contribute to this behavior. In particular: Does ARKit use predictive tracking (e.g., velocity or acceleration-based pose extrapolation) to compensate for camera and processing delays in RGB-only scenarios? Are there recommended strategies for balancing temporal smoothing and responsiveness without introducing visible lag in camera-based pose estimation pipelines? Is the tracking pipeline internally decoupled from rendering (e.g., asynchronous processing with prediction applied at render time)? Are there general best practices for minimizing end-to-end latency in vision-based head tracking systems beyond standard filtering approaches? I understand that implementation details may not be public, but any high-level insights or pointers would be greatly appreciated. Thanks!

Machine Learning & AI Core ML ARKit

324

Mar ’26

MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)

Summary Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code runs correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend. Environment Affected Mac mini M1 — Macmini9,1 (A14-class) Not affected M2 and newer (A15-class and up) Last known good macOS Sequoia First broken macOS 26 (Tahoe) ? Confirmed broken on macOS 26.3.1 Framework Core ML + MPS backend Language C++ (via CoreML C++ API) Description We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails — either producing garbage output or crashing. The same binary, same .mlpackage, same inputs work correctly on M2+. Our Apple contact has suggested the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware. Steps to Reproduce Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model) Run inference with MLComputeUnits::cpuAndGPU (MPS active) Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1 Compare output to the same model running on M2 / macOS Sequoia Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior Actual: Incorrect / corrupted output (or crash), only on A14-class hardware running macOS 26+ Workaround Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming the issue is in the MPS compute path. This is not acceptable as a shipping workaround due to performance impact. Additional Notes The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug We first became aware of this through a customer report Happy to provide a symbolicated crash log if helpful this text was summarized by AI and human verified

Machine Learning & AI Core ML Metal Performance Shaders

447

Apr ’26

Building a 4-agent autonomous coding pipeline on Apple Silicon — MLX backend questions

Hi, I'm building ANF (Autonomous Native Forge) — a cloud-free, 4-agent autonomous software production pipeline running on local hardware with local LLM inference. No middleware, pure Node.js native. Currently running on NVIDIA Blackwell GB10 with vLLM + DeepSeek-R1-32B. Now porting to Apple Silicon. Three technical questions: How production-ready is mlx-lm's OpenAI-compatible API server for long context generation (32K tokens)? What's the recommended approach for KV Cache management with Unified Memory architecture — any specific flags or configurations for M4 Ultra? MLX vs GGUF (llama.cpp) for a multi-agent pipeline where 4 agents call the inference endpoint concurrently — which handles parallel requests better on Apple Silicon? GitHub: github.com/trgysvc/AutonomousNativeForge Any guidance appreciated.

Machine Learning & AI Core ML Interface Builder Core ML Apple Silicon

594

Mar ’26

tensorflow-metal ReLU activation fails to clip negative values on M4 Apple Silicon

Environment: Hardware: Mac M4 OS: macOS Sequoia 15.7.4 TensorFlow-macOS Version: 2.16.2 TensorFlow-metal Version: 1.2.0 Description: When using the tensorflow-metal plug-in for GPU acceleration on M4, the ReLU activation function (both as a layer and as an activation argument) fails to correctly clip negative values to zero. The same code works correctly when forced to run on the CPU. Reproduction Script: import os import numpy as np import tensorflow as tf # weights and biases = -1 weights = [np.ones((10, 5)) * -1, np.ones(5) * -1] # input = 1 data = np.ones((1, 10)) # comment this line => GPU => get negative values # uncomment this line => CPU => no negative values # tf.config.set_visible_devices([], 'GPU') # create model model = tf.keras.Sequential([ tf.keras.layers.Input(shape=(10,)), tf.keras.layers.Dense(5, activation='relu') ]) # set weights model.layers[0].set_weights(weights) # get output output = model.predict(data) # check if negative is present print(f"min value: {output.min()}") print(f"is negative present? {np.any(output < 0)}")

Machine Learning & AI Core ML Foundation ML Compute tensorflow-metal

677

Mar ’26

How can I change the output dimensions of a CoreML model in Xcode when the outputs come from a NonMaximumSuppression layer?

After exerting a custom model with nms=True. In Xcode, the outputs show as: confidence: MultiArray (0 × 5) coordinates: MultiArray (0 × 4) I want to set fixed shapes (e.g., 100 × 5, 100 × 4), but Xcode does not allow editing—the shape fields are locked. The model graph shows both outputs come directly from a NonMaximumSuppression layer. Is it possible to set fixed output dimensions for NMS outputs in CoreML?

Machine Learning & AI Core ML ML Compute Swift Xcode Core ML

666

Mar ’26

Massive CoreML latency spike on live AVFoundation camera feed vs. offline inference (CPU+ANE)

Hello, I’m experiencing a severe performance degradation when running CoreML models on a live AVFoundation video feed compared to offline or synthetic inference. This happens across multiple models I've converted (including SCI, RTMPose, and RTMW) and affects multiple devices. The Environment OS: macOS 26.3, iOS 26.3, iPadOS 26.3 Hardware: Mac14,6 (M2 Max), iPad Pro 11 M1, iPhone 13 mini Compute Units: cpuAndNeuralEngine The Numbers When testing my SCI_output_image_int8.mlpackage model, the inference timings are drastically different: Synthetic/Offline Inference: ~1.34 ms Live Camera Inference: ~15.96 ms Preprocessing is completely ruled out as the bottleneck. My profiling shows total preprocessing (nearest-neighbor resize + feature provider creation) takes only ~0.4 ms in camera mode. Furthermore, no frames are being dropped. What I've Tried I am building a latency-critical app and have implemented almost every recommended optimization to try and fix this, but the camera-feed penalty remains: Matched the AVFoundation camera output format exactly to the model input (640x480 at 30/60fps). Used IOSurface-backed pixel buffers for everything (camera output, synthetic buffer, and resize buffer). Enabled outputBackings. Loaded the model once and reused it for all predictions. Configured MLModelConfiguration with reshapeFrequency = .frequent and specializationStrategy = .fastPrediction. Wrapped inference in ProcessInfo.processInfo.beginActivity(options: .latencyCritical, reason: "CoreML_Inference"). Set DispatchQueue to qos: .userInteractive. Disabled the idle timer and enabled iOS Game Mode. Exported models using coremltools 9.0 (deployment target iOS 26) with ImageType inputs/outputs and INT8 quantization. Reproduction To completely rule out UI or rendering overhead, I wrote a standalone Swift CLI script that isolates the AVFoundation and CoreML pipeline. The script clearly demonstrates the ~15ms latency on live camera frames versus the ~1ms latency on synthetic buffers. (I have attached camera_coreml_benchmark.swift and coreml model (very light low light enghancement model) to this repo on github https://github.com/pzoltowski/apple-coreml-camera-latency-repro). My Question: Is this massive overhead expected behavior for AVFoundation + Core ML on live feeds, or is this a framework/runtime bug? If expected, what is the Apple-recommended pattern to bypass this camera-only inference slowdown? One think found interesting when running in debug model was faster (not as fast as in performance benchmark but faster than 16ms. Also somehow if I did some dummy calculation on on different DispatchQueue also seems like model got slightly faster. So maybe its related to ANE Power State issues (Jitter/SoC Wake) and going to fast to sleep and taking a long time to wakeup? Doing dummy calculation in background thought is probably not a solution. Thanks in advance for any insights!

Machine Learning & AI Core ML Performance AVFoundation

1.2k

Mar ’26

Core Model Editor and Params

Optimal Precision • Current Precision: Mixed (Float32, int32) • Optimal Precision: Not specified in the image, but typically involves using the most efficient data type for the model's operations to balance speed and memory usage without significant loss of accuracy. Comparison: • Mixed Precision: Utilizes both Float32 and int32 to optimize performance. Float32 provides high precision, while int32 reduces memory usage and increases computational speed. • Optimal Precision: Aimed at achieving the best trade-off between performance and accuracy, potentially using other data types like Float16 (bfloat16) for even greater efficiency in certain hardware environments. Operation Distribution • Current Distribution: • iOS18.mul: 168 • iOS18.transpose: 126 • iOS18.linear: 98 • iOS18.add: 97 • iOS18.sliceByIndex: 96 • iOS18.expandDims: 74 • iOS18.concat: 72 • iOS18.squeeze: 72 • iOS18.reshape: 67 • iOS18.layerNorm: 49 • iOS18.matmul: 48 • iOS18.gelu: 26 • iOS18.softmax: 24 • Split: 24 • conv: 1 • iOS18.conv: 1 Comparison: • Operation Count: Indicates how frequently each operation is executed. High counts for operations like mul, transpose, and linear suggest these are computationally intensive parts of the model. • Optimization Opportunities: Reducing the count of high-frequency operations or optimizing their execution can improve performance. This might involve pruning unnecessary operations, optimizing algorithms, or leveraging hardware acceleration. General Recommendations • Precision Tuning: Experiment with different precision levels to find the best balance for your specific hardware and accuracy requirements. • Operation Optimization: Focus on optimizing the most frequent operations. Techniques include using more efficient algorithms, parallelizing computations, or utilizing specialized hardware like GPUs or TPUs. • Benchmarking: Regularly benchmark the model to assess the impact of changes and ensure that optimizations lead to meaningful performance improvements. By focusing on these areas, you can potentially enhance the efficiency and performance of your ML model.

Machine Learning & AI Core ML

265

Feb ’26

关于我使用Swift和Metal制作的神经网络引擎