Integrate machine learning models into your app using Core ML.

Core ML Documentation

Posts under Core ML subtopic

Post

Replies

Boosts

Views

Activity

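About the neural network engine I built with Swift and Metal
I'm 18. I have no machine learning background, never went to university, didn't even attend high school, and have no mentor.

A few days ago I was staring blankly at a sheet of paper when a thought struck me: why do computer neural networks have to be 2D? Could they imitate biology? Why must everything be computed on a plane? With multiple planes, wouldn't the capacity multiply? What if six sheets of paper were imagined as a Rubik's cube, with each of the six faces carrying neurons and the eight body diagonals becoming new communication channels?

I really enjoy tinkering with this kind of thing, so I immediately drew up a detailed plan and, with help from AI tools, wrote my first kernel. It crashed. I thought it over again, shared my goal with friends in a QQ group, and wrote it again. It crashed again. Dozens of times in a row. No PyTorch, no TensorFlow, no CUDA. Only Swift and Metal, because my graphics card is an AMD Vega 64 and I installed no framework to help, since I wanted to understand how things work at the lowest level.

This is CubeNN.

## The detailed explanation below was written by AI; the content and architecture have changed too much for me to explain it all here in one go

What it is

A neural network engine that uses the geometry of a Rubik's cube as its compute architecture.

Standard Transformer: lays the data out in a row, and every element attends to every other at O(n²)
CubeNN: distributes the data across 14 faces and attends only where it needs to
6 standard faces → block-sparse attention (coarse global view + fine local view)
8 X-face diagonals → cross-face information bridges (no attention, transfer only)
Each round: compute on the 6 faces → project onto the 8 X faces → upsample and refine → fuse back into the 6 faces

The key piece is Cube Cascade, a tree + chain cascaded inference scheme:
Tree stage: 1 cube spawns 8 → 8 spawn 64 → 73 cubes explore in parallel, all running on the GPU at once, and the best path is selected
Chain stage: the best leaf is refined to unlimited depth; it converges in 3-5 steps and variance improves by ~7%

How it is implemented

Pure Swift + Metal. Zero dependencies. Zero frameworks.

    // This is roughly the code
    import Metal
    import Foundation
    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(filepath: "cube_nn.metallib")
    // ...12 GPU kernels, 12,000 dispatches

Key technical decisions:
Single command buffer: every kernel dispatch for all 73 cubes in the tree stage is packed into one CB, with 0 CPU-GPU synchronizations
Pipeline state caching: encoding time dropped from 1022ms to 42ms
Buffer offsets: the 14 faces of all 73 cubes are stored in one contiguous buffer, and kernels receive the offset through buffer(15)
FP16: half precision is 21% faster when N≥64

Performance

## Tested, but possibly inaccurate because of device differences; for reference only

AMD Radeon RX Vega 64 (2017 GPU, 14nm, 295W):

    Size    Neurons   Cubes       Time
    N=32    6,144     73 (tree)   435ms
    N=64    24,576    21 (tree)   817ms
    N=128   98,304    1           116ms

N=32: fully connected attention takes 201M FLOP per layer → CubeNN block-sparse takes 370K FLOP (544× reduction)
N=128: fully connected needs 32GB of VRAM (which physically does not exist) → CubeNN uses 192KB
N=256: fully connected needs 2.2T FLOP → CubeNN needs 52M FLOP (42,300× reduction)

Code size: 161KB, compared with PyTorch's 800MB.

What I went through

The hardest part of this project was not writing kernels. It was finding a path through repeated trial and error with nobody to tell me whether it could be done at all.

The first time I tried to run 73 cubes, the GPU simply hung. It took 3 days to trace this to too many stacked command buffers.
I switched to a single-encoder approach and then hit SIGILL: Metal does not allow makeBuffer(length: 0), and a zero-length buffer was being created when B=0.
I tried to use threadgroup memory for kernel fusion, but data could not be read across threadgroups. Only then did I understand that LDS is per-group.
FP16 at N=64 required hand-written float↔half conversion functions, because the Float16 type is marked unavailable on macOS 11.

Every crash taught me a low-level detail of Metal. Nobody taught me, but Metal's error messages were the best teacher.

Why I am posting on the Apple Developer Forums

Because this project was born for the Apple ecosystem. From start to finish CubeNN uses only two things: Swift and Metal. It can run on any Apple Silicon Mac without porting (the API is compatible). If some kernels could be mapped to the Neural Engine in the future, efficiency would improve several times over.

I would like to ask Apple's Metal engineers and the Core ML team:
Is there a better way to schedule GPU work? Performance is still not good enough (for a perfectionist like me), and the code may have become a bit messy after all the changes.
Would anyone be interested in evaluating this architecture on M4? All I have is a Vega 64. How would CubeNN perform on the M4 GPU plus an ANE path?

Source code

    ├── run.swift                  # unified CLI, parameterized by N/B/depth
    ├── src/
    │   ├── cube_nn.metal          # FP16 kernel
    │   └── cube_nn_fp32.metal     # FP32 kernel
    └── benchmarks/                # measured data

If you have read this far, thank you. An outsider got here on an obsessive idea, pure almost to the point of delusion, and on Metal. I don't know very much. If this architecture has any value, I want to make it better. Any suggestions, criticism, or guidance are very welcome.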
0
0
35
20h
Silent FP16 Overflow in coremltools: 5 Numerical Failures Affecting ANE Inference (With Fixes)
Hi everyone,

With the announcements at WWDC26 regarding Core AI and "automatic stable decompositions," it is clear that managing mathematical stability in constrained FP16 environments is a major priority for the ecosystem. To support developers maintaining existing models that cannot migrate to the newest architectures overnight, I have published a research paper and an open-source static analysis tool documenting 5 silent numerical failures in the standard coremltools pipeline.

Because the Apple Neural Engine (ANE) executes inference in FP16, the maximum representable value is 65,504 (about $\exp(11.09)$). Inputs that push an intermediate value past this bound silently overflow to infinity or collapse to zero, with no warning.

Deployed Operations Currently Affected
softplus (YOLOv5/v8): Outputs silently collapse to 0.0 at $x > 10.4$ on ANE.
logsumexp (Attention mechanisms): Overflows at $x > 7.63$ for 32 channels. For vocabulary-sized reductions, the threshold drops below $5$.
log_softmax (Classifiers like BERT, GPT, ViT): Softmax probabilities underflow to 0, causing $\log(0) = -\infty$.
logcumsumexp (CTC decoders): Overflows at $x > 11.09$.
mish (YOLO variants): Inherits the softplus overflow limits.

The Immediate Safety Net: Algebraically Equivalent Reformulations
We can bypass these hardware limits entirely by rewriting the operations into mathematically stable forms. For example, rewriting softplus as:

$$\max(x, 0) + \log(1 + \exp(-|x|))$$

Because $-|x| \le 0$, $\exp(-|x|)$ is bounded to the interval $(0, 1]$. Overflow of the exponential becomes mathematically impossible in any precision, yielding bit-identical outputs for all valid inputs. While PyTorch AMP traditionally classifies these operations as FP32-only, the ANE has no such fallback, which makes stable decomposition mandatory.
Tools & Patches Deployed Today
The Paper: "Silent Numerical Failures in On-Device ML Converters: A Systematic Audit of FP16 Overflow in Apple Neural Engine Deployment." (Complete vulnerability census, discrepancy pattern analysis, formal proofs, and quantitative evaluation.)
The Tool (ane-fp16-lint): A CLI that scans .mlpackage files and flags FP16-unsafe operations before you push to production. It detects nine patterns and provides stable alternatives for each.
The Fixes: We have submitted three Pull Requests to the official apple/coremltools repository implementing these stable decompositions, which are currently under review by Apple's Core ML team.

While Core AI introduces great automated stability for new architectures like the 20B AFM 3 Core Advanced, millions of deployed production models still need an immediate safety net. The full technical paper, proofs, and the linting tool are available on GitHub: github.com/apple-f16-overflow-audit

Looking forward to hearing if anyone else has run into these unexpected discrepancy patterns in production!
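The softplus reformulation described in this post can be sanity-checked on a CPU without any ANE hardware. The following is a minimal sketch, assuming that rounding every intermediate to IEEE 754 half precision is a fair stand-in for FP16 execution; the helper names (`to_fp16`, `softplus_naive_fp16`, `softplus_stable_fp16`) are hypothetical and are not part of coremltools or the linting tool. It emulates the number format only, so it shows an overflow to infinity and does not reproduce whatever the ANE actually returns.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the FP16 range; model that as +/-inf.
        return math.copysign(math.inf, x)


def softplus_naive_fp16(x: float) -> float:
    """log(1 + exp(x)) with every intermediate rounded to FP16."""
    e = to_fp16(math.exp(x))
    return to_fp16(math.log(to_fp16(1.0 + e)))


def softplus_stable_fp16(x: float) -> float:
    """max(x, 0) + log(1 + exp(-|x|)) with every intermediate rounded to FP16."""
    e = to_fp16(math.exp(-abs(x)))  # exp(-|x|) is in (0, 1] before rounding
    tail = to_fp16(math.log(to_fp16(1.0 + e)))
    return to_fp16(max(to_fp16(x), 0.0) + tail)


for x in (0.0, 5.0, 11.0, 12.0, 20.0):
    print(x, softplus_naive_fp16(x), softplus_stable_fp16(x))
```

In this emulation the naive form turns infinite once exp(x) no longer fits in FP16, while the stable form keeps returning a finite value close to x.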
0
0
22
2d
LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?
We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive. In several cases, MoE models significantly outperform dense models despite having similar total parameter counts. Examples (simplified): Dense model: ~30B parameters MoE model: ~30B total parameters, ~3B active parameters On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead. A few hypotheses we're considering: Active parameter count appears to matter more than total parameter count for decode throughput. Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected. Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not. Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput. What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically. Has anyone profiled decode throughput across: dense models large-expert MoE many-small-expert MoE and identified which hardware characteristics are actually driving the difference? I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.
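The memory-traffic hypothesis in this post can be turned into a back-of-the-envelope estimate. The sketch below assumes that M=1 decode is purely bandwidth-bound and that each active weight is read exactly once per token; the 400 GB/s bandwidth and the 4-bit weight format are illustrative assumptions, not measurements, and the function name is hypothetical. KV-cache reads, routing, and attention are ignored.

```python
def decode_tokens_per_second(active_params: float, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on M=1 decode rate if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 400.0  # assumed memory bandwidth, not a measured figure
BITS_PER_WEIGHT = 4.0   # assumed 4-bit quantization

dense = decode_tokens_per_second(30e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe = decode_tokens_per_second(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"dense ~30B active: {dense:.1f} tok/s upper bound")
print(f"MoE   ~3B active:  {moe:.1f} tok/s upper bound")
print(f"ratio: {moe / dense:.1f}x")
```

Under these assumptions the bound scales with active parameters rather than total parameters, which is consistent with the first two hypotheses. It says nothing about expert matrix geometry or quantization layout, which would need Metal profiling to separate.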
0
0
30
2d
Apple GPU forward progress guarantees for persistent-thread synchronization?
We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs. We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics. What we've observed: Device-scope atomics are available. It is possible to build global counters and persistent-thread style coordination structures. However, we cannot find any documented guarantee regarding: threadgroup co-residency, global forward progress, occupancy-bounded synchronization safety. In our experiments, synchronization schemes that rely on all threadgroups making progress eventually can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust. Questions: Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification? Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions? For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API? We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems. Thanks.
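The co-residency concern raised in this post can be illustrated with a toy scheduler model. This is a sketch under stated assumptions, not a description of how Apple GPUs schedule work: it assumes a non-preemptive scheduler in which a threadgroup keeps its slot until it retires, and the function and parameter names are hypothetical. A global barrier built on a device-scope counter completes only if every group can be resident at once, whereas groups that do not wait on each other always drain.

```python
from collections import deque


def all_groups_finish(num_groups: int, resident_slots: int,
                      global_barrier: bool, max_steps: int = 1000) -> bool:
    """Toy model: do all threadgroups retire under a non-preemptive scheduler?"""
    waiting = deque(range(num_groups))  # launched but not yet resident
    resident = []                       # groups currently holding a slot
    arrived = 0                         # stands in for a device-scope atomic counter
    finished = 0
    for _ in range(max_steps):
        # A slot is reused only after the group occupying it has retired.
        while waiting and len(resident) < resident_slots:
            resident.append(waiting.popleft())
            arrived += 1
        # With a global barrier, residents spin until every group has arrived.
        can_retire = (arrived == num_groups) if global_barrier else True
        if can_retire:
            finished += len(resident)
            resident.clear()
        if finished == num_groups:
            return True
    return False  # residents are still spinning and the rest never got a slot


print(all_groups_finish(8, 8, global_barrier=True))    # every group fits at once
print(all_groups_finish(9, 8, global_barrier=True))    # one group can never start
print(all_groups_finish(9, 8, global_barrier=False))   # no cross-group waiting
```

The second case is the failure mode behind the question: without a documented co-residency or forward-progress guarantee, the launch size at which it starts to happen cannot be known from the API alone.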
0
0
23
2d
Resolving co channel interference VOIP
Subject: Inquiry Regarding Architectural Overhead and Buffer Access in the Push to Talk Framework for Real-Time Core ML Blind Source Separation Dear Apple Engineering Team, We are currently developing an Apple-native communication platform that utilizes the Push to Talk framework alongside Core ML to handle real-time, on-device audio processing. We are working to resolve the issue of single-channel, co-channel interference (overlapping voice streams) directly on the edge. Our current challenge lies in the pipeline latency and background lifecycle constraints when intercepting incoming audio buffers. To cleanly separate overlapping voices before they hit the audio output mixer, we need to process the raw PCM data immediately upon arrival. Could you please provide guidance on the following architectural questions: Low-Latency Buffer Interception: What is the recommended design pattern within the PTChannelManagerDelegate flow to pass raw incoming audio buffers directly to a Core ML model running on the Apple Neural Engine (ANE) before the system routes them to AVAudioEngine for playback? Background Thread Management: Given the strict background execution boundaries enforced by the Push to Talk framework, how can we best optimize thread scheduling to ensure our speech separation model completes its execution without triggering an OS background processing timeout or process termination? Dynamic UI Manifestation: Once a combined audio stream is separated into two clean, distinct voice vectors on-device, what is the best approach for registering multiple PTParticipant states simultaneously so that the native system UI (like the Dynamic Island) accurately reflects both speakers? Thank you for your time, insights, and continued support of developer innovation within the iOS and iPadOS ecosystems. Best regards, Ken Zakreski Founder, Marine Link Pro
2
0
51
3d
_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output, with errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (it reproduces with random weights). Only the three-op combination of MatMul + BiasAdd + Relu triggers the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct: tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values.
A minimal reproducer:

    import tensorflow as tf
    import numpy as np

    # Any shape works; larger K makes the error more obvious
    M, K, N = 64, 2048, 1024
    W = tf.Variable(tf.random.normal([K, N]))
    b = tf.Variable(tf.random.normal([N]))
    x = tf.random.normal([M, K])

    @tf.function
    def graph_fused(x):
        return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))

    @tf.function
    def graph_safe(x):
        return tf.nn.relu(tf.matmul(x, W) + b)  # AddV2 instead of BiasAdd

    eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))  # eager = correct
    fused_out = graph_fused(x)  # Metal graph mode = WRONG
    safe_out = graph_safe(x)    # Metal graph mode = correct

    print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
    # ^ typically 30–80+ (WRONG)
    print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
    # ^ typically ~1e-5 (correct)

Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac.

Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
0
0
1.1k
2w
Does loading multiple functions that share model weights multiply memory use?
Hi, I have a multifunction model where the functions share the same model weights, and to keep latency low I keep multiple functions loaded at the same time. According to what Codex found, this multiplies RAM usage: if the single model weighs 2GB, loading two functions that share the underlying weights still doubles RAM usage to 4GB (it seems to be something like neural wired memory). Does anyone have any knowledge relating to this?
0
0
1.2k
3w
CoreML model load failed with this error : Failed to set up decrypt context for /private/var/mobile/Containers/Data/Application/ACB94507-F8DE-494B-8499-B0CF75FC3B55/Library/Caches/temp.m/xxx.mlmodelc. error:-42905"
Hi there. We use a Core ML model for image processing, and because loading the model takes a long time (~10 sec), we preload it at app launch. On some devices, however, loading the model fails with the error in the title. We download the Core ML model from a server and then load it from local storage. The loading code is typical: MLModel.load(contentsOf: compliedUrl, configuration: config). Once this error happens, loading keeps failing until we restart the device. (+) In this thread, I saw that it is related to some "limitation of decrypt session": https://developer.apple.com/forums/thread/707622, but it also happens with in-house TestFlight builds that are used by fewer than 5 people. Can I know why this happens?
4
1
2.5k
3w
CoreML model cache causes fake hard drive memory usage
Hi, I experiment by creating and compiling a lot of CoreML models and I have the issue that this causes a lot of disk usage, but when I try to delete everything (I search in the disk for possible CoreML cache directories) the disk space is not actually freed up. This is a picture of my disk usage according to what is shown inside of Settings>General>Storage and the Disk Utility app. I am running on macOS 15.7.5
0
0
1.5k
May ’26
Does using Vision API offline to label a custom dataset for Core ML training violate DPLA?
Hello everyone, I am currently developing a smart camera app for iOS that recommends optimal zoom and exposure values on-device using a custom Core ML model. I am still waiting for an official response from Apple Support, but I wanted to ask the community if anyone has experience with a similar workflow regarding App Review and the DPLA. Here is my training methodology: I gathered my own proprietary dataset of original landscape photos. I generated multiple variants of these photos with different zoom and exposure settings offline on my Mac. I used the CalculateImageAestheticsScoresRequest (Vision framework) via a local macOS command-line tool to evaluate and score each variant. Based on those scores, I labeled the "best" zoom and exposure parameters for each original photo. I used this labeled dataset to train my own independent neural network using PyTorch, and then converted it to a Core ML model to ship inside my app. Since the app uses my own custom model on-device and does not send any user data to a server, the privacy aspect is clear. However, I am curious if using the output of Apple's Vision API strictly offline to label my own dataset could be interpreted as "reverse engineering" or a violation of the Developer Program License Agreement (DPLA). Has anyone successfully shipped an app using a similar knowledge distillation or automated dataset labeling approach with Apple's APIs? Did you face any pushback during App Review? Any insights or shared experiences would be greatly appreciated!
1
0
525
Apr ’26
MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)
Summary
Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code run correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend.

Environment
Affected: Mac mini M1, Macmini9,1 (A14-class)
Not affected: M2 and newer (A15-class and up)
Last known good: macOS Sequoia
First broken: macOS 26 (Tahoe)? Confirmed broken on macOS 26.3.1
Framework: Core ML + MPS backend
Language: C++ (via CoreML C++ API)

Description
We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails, either producing garbage output or crashing. The same binary, same .mlpackage, same inputs work correctly on M2+. Our Apple contact has suggested the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware.

Steps to Reproduce
Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model)
Run inference with MLComputeUnits::cpuAndGPU (MPS active)
Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1
Compare output to the same model running on M2 / macOS Sequoia

Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior
Actual: Incorrect / corrupted output (or crash), only on A14-class hardware running macOS 26+

Workaround
Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming the issue is in the MPS compute path. This is not acceptable as a shipping workaround due to performance impact.

Additional Notes
The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug
We first became aware of this through a customer report
Happy to provide a symbolicated crash log if helpful

This text was summarized by AI and human verified.
2
0
445
Apr ’26
CoreML MLE5ProgramLibrary AOT recompilation hangs/crashes on iOS 26.4 — C++ exception in espresso IR compiler bypasses Swift error handling
Area: CoreML / Machine Learning Describe the issue: On iOS 26.4, calling MLModel(contentsOf:configuration:) to load an .mlpackage model hangs indefinitely and eventually kills the app via watchdog. The same model loads and runs inference successfully in under 1 second on iOS 26.3.1. The hang occurs inside eort_eo_compiler_compile_from_ir_program (espresso) during on-device AOT recompilation triggered by MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:. A C++ exception (__cxa_throw) is thrown inside libBNNS.dylib during the exception unwind, which then hangs inside __cxxabiv1::dyn_cast_slow and __class_type_info::search_below_dst. Swift's try/catch does not catch this — the exception originates in C++ and the process hangs rather than terminating cleanly. Setting config.computeUnits = .cpuOnly does not resolve the issue. MLE5ProgramLibrary initialises as shared infrastructure regardless of compute units. Steps to reproduce: Create an app with an .mlpackage CoreML model using the MLE5/espresso backend Call MLModel(contentsOf: modelURL, configuration: config) at runtime Run on a device on iOS 26.3.1 — loads successfully in <1 second Update device to iOS 26.4 — hangs indefinitely, app killed by watchdog after 60–745 seconds Expected behaviour: Model loads successfully, or throws a catchable Swift error on failure. Actual behaviour: Process hangs in MLE5ProgramLibrary.lazyInitQueue. App killed by watchdog. No Swift error thrown. 
Full stack trace at point of hang:

Thread 1 Queue: com.apple.coreml.MLE5ProgramLibrary.lazyInitQueue (serial)
frame 0: __cxxabiv1::__class_type_info::search_below_dst libc++abi.dylib
frame 1: __cxxabiv1::(anonymous namespace)::dyn_cast_slow libc++abi.dylib
frame 2: ___lldb_unnamed_symbol_23ab44dd4 libBNNS.dylib
frame 23: eort_eo_compiler_compile_from_ir_program espresso
frame 24: -[MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:] CoreML
frame 25: -[MLE5ProgramLibrary _programLibraryHandleWithForceRespecialization:error:] CoreML
frame 26: __44-[MLE5ProgramLibrary prepareAndReturnError:]_block_invoke CoreML
frame 27: _dispatch_client_callout libdispatch.dylib
frame 28: _dispatch_lane_barrier_sync_invoke_and_complete libdispatch.dylib
frame 29: -[MLE5ProgramLibrary prepareAndReturnError:] CoreML
frame 30: -[MLE5Engine initWithContainer:configuration:error:] CoreML
frame 31: +[MLE5Engine loadModelFromCompiledArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
frame 32: +[MLLoader _loadModelWithClass:fromArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
frame 45: +[MLModel modelWithContentsOfURL:configuration:error:] CoreML
frame 46: @nonobjc MLModel.__allocating_init(contentsOf:configuration:) GKPersonalV2
frame 47: MDNA_GaitEncoder_v1_3.__allocating_init(contentsOf:configuration:)
frame 48: MDNA_GaitEncoder_v1_3.__allocating_init(configuration:)
frame 50: GaitModelInference.loadModel()
frame 51: GaitModelInference.init()

iOS version: Reproduced on iOS 26.4. Works correctly on iOS 26.3.1.
Xcode version: 26.2
Device: iPhone (model used in testing)
Model format: .mlpackage
4
0
934
Apr ’26
Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome
Hi all, I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. Sharing it here in case anyone else hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream. Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core Why As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text. What's in the package A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer) Per-Layer Embedding (PLE) support — the shared embedding table that feeds every decoder layer through a gated MLP as a third residual KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte Measured on iPhone (A-series, 7.4 GB RAM) Model: mlx-community/gemma-4-e2b-it-4bit Warm load: ~6 s Memory after load: 341–392 MB Time to first token (end-to-end, 333-token system prompt): 2.82 s Generation throughput: 12–14 tok/s What I'd love feedback on Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed? The chat-template bypass works but feels like a workaround. 
Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely? Anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging a 0.2.0. Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.
1
0
414
Apr ’26
CoreML GPU NaN bug with fused QKV attention on macOS Tahoe
Problem: CoreML produces NaN on GPU (works fine on CPU) when running transformer attention with fused QKV projection on macOS 26.2.

Root cause: The common::fuse_transpose_matmul optimization pass triggers a Metal kernel bug when sliced tensors feed into matmul(transpose_y=True).

Workaround:

    import coremltools as ct

    # Drop the pass that triggers the bug, then convert as usual
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(['common::fuse_transpose_matmul'])
    mlmodel = ct.convert(model, ..., pass_pipeline=pipeline)

Minimal repro: https://github.com/imperatormk/coreml-birefnet/blob/main/apple_bug_repro.py

Affected: Any ViT/Swin/transformer with fused QKV attention (BiRefNet, etc.)

Has anyone else hit this? Filed FB report too.
1
0
650
Apr ’26
Memory stride warning when loading CoreML models on ANE
When I do an uncached load of a CoreML model on the ANE, I receive this warning in the Xcode console: Type of hiddenStates in function main's I/O contains unknown strides. Using unknown strides for MIL tensor buffers with unknown shapes is not recommended in E5ML. Please use row_alignment_in_bytes property instead. Refer to https://e5-ml.apple.com/more-info/memory-layouts.html for more information. However, the web link does not seem to be working. Where can I find more information about this, and how can I fix it?
2
0
834
Mar ’26
CoreML regression between macOS 26.0.1 and macOS 26.1 Beta causing scrambled tensor outputs
We’ve encountered what appears to be a CoreML regression between macOS 26.0.1 and macOS 26.1 Beta. In macOS 26.0.1, CoreML models run and produce correct results. However, in macOS 26.1 Beta, the same models produce scrambled or corrupted outputs, suggesting that tensor memory is being read or written incorrectly. The behavior is consistent with a low-level stride or pointer arithmetic issue — for example, using 16-bit strides on 32-bit data or other mismatches in tensor layout handling. Reproduction Install ON1 Photo RAW 2026 or ON1 Resize 2026 on macOS 26.0.1. Use the newest Highest Quality resize model, which is Stable Diffusion–based and runs through CoreML. Observe correct, high-quality results. Upgrade to macOS 26.1 Beta and run the same operation again. The output becomes visually scrambled or corrupted. We are also seeing similar issues with another Stable Diffusion UNet model that previously worked correctly on macOS 26.0.1. This suggests the regression may affect multiple diffusion-style architectures, likely due to a change in CoreML’s tensor stride, layout computation, or memory alignment between these versions. Notes The affected models are exported using standard CoreML conversion pipelines. No custom operators or third-party CoreML runtime layers are used. The issue reproduces consistently across multiple machines. It would be helpful to know if there were changes to CoreML’s tensor layout, precision handling, or MLCompute backend between macOS 26.0.1 and 26.1 Beta, or if this is a known regression in the current beta.
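For readers unfamiliar with what a stride mismatch looks like, here is a minimal sketch of the general failure mode the post hypothesizes. It is an illustration only: the buffer layout, the 16-byte row stride, and the helper names are assumptions made for the example and are not taken from CoreML. A writer pads each row to an aligned stride, and a reader that assumes tightly packed rows picks up padding and shifted values.

```python
import struct


def write_rows(rows, row_stride_bytes):
    """Pack float32 rows into a buffer in which each row starts every row_stride_bytes."""
    buf = bytearray(row_stride_bytes * len(rows))
    for r, row in enumerate(rows):
        struct.pack_into(f"<{len(row)}f", buf, r * row_stride_bytes, *row)
    return bytes(buf)


def read_rows(buf, num_rows, num_cols, row_stride_bytes):
    """Read float32 rows back, trusting the caller's idea of the row stride."""
    return [list(struct.unpack_from(f"<{num_cols}f", buf, r * row_stride_bytes))
            for r in range(num_rows)]


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
padded = write_rows(rows, row_stride_bytes=16)  # 12 bytes of data + 4 bytes of padding

correct = read_rows(padded, 3, 3, row_stride_bytes=16)
wrong = read_rows(padded, 3, 3, row_stride_bytes=12)  # reader assumes tight packing

print(correct)
print(wrong)
```

Only the first row survives the mismatched read; every later row is shifted and interleaved with padding, which on image tensors would show up as visually scrambled output.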
8
4
2.4k
Mar ’26
How does ARKit achieve low-latency, stable head tracking using only an RGB camera?
Hi, I’m working on a real-time head/face tracking pipeline using a standard 2D RGB camera, and I’m trying to better understand how ARKit achieves such stable and responsive results in comparable conditions. To clarify upfront: I’m specifically interested in RGB-only tracking and the underlying vision/ML pipeline. I’m not using TrueDepth or any depth/IR-based sensors, and I’d like to understand how similar stability and responsiveness can be achieved under those constraints. In my current setup, I estimate head pose from RGB frames (facial landmarks + PnP) and apply temporal filtering (e.g., exponential smoothing and Kalman filtering). This significantly reduces jitter, but introduces noticeable latency, especially during faster head movements. What stands out in ARKit is that it appears to maintain both: Very low jitter Very low perceived latency even when operating with camera input alone. I’m trying to understand what techniques might contribute to this behavior. In particular: Does ARKit use predictive tracking (e.g., velocity or acceleration-based pose extrapolation) to compensate for camera and processing delays in RGB-only scenarios? Are there recommended strategies for balancing temporal smoothing and responsiveness without introducing visible lag in camera-based pose estimation pipelines? Is the tracking pipeline internally decoupled from rendering (e.g., asynchronous processing with prediction applied at render time)? Are there general best practices for minimizing end-to-end latency in vision-based head tracking systems beyond standard filtering approaches? I understand that implementation details may not be public, but any high-level insights or pointers would be greatly appreciated. Thanks!
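The smoothing-versus-latency trade-off described in this post can be made concrete with a small experiment. The sketch below compares plain exponential smoothing with an alpha-beta filter, which is one common form of velocity-based prediction; it is not a description of what ARKit does internally. The gains, the noise-free constant-velocity signal, and the function names are all assumptions chosen for illustration.

```python
def ema_track(measurements, alpha):
    """Exponential smoothing: low jitter, but lags behind a moving target."""
    est = measurements[0]
    out = []
    for z in measurements:
        est = alpha * z + (1.0 - alpha) * est
        out.append(est)
    return out


def alpha_beta_track(measurements, a, b, dt=1.0):
    """Alpha-beta filter: predicts with an estimated velocity, then corrects."""
    x_est, v_est = measurements[0], 0.0
    out = []
    for z in measurements:
        pred = x_est + v_est * dt        # extrapolate using the velocity estimate
        residual = z - pred
        x_est = pred + a * residual      # correct the position
        v_est = v_est + (b / dt) * residual  # correct the velocity
        out.append(x_est)
    return out


velocity = 1.0  # assumed head rotation per frame, arbitrary units
truth = [velocity * t for t in range(200)]

ema_lag = truth[-1] - ema_track(truth, alpha=0.2)[-1]
ab_lag = truth[-1] - alpha_beta_track(truth, a=0.5, b=0.1)[-1]

print(f"EMA steady-state lag:        {ema_lag:.3f}")
print(f"alpha-beta steady-state lag: {ab_lag:.3e}")
```

For a constant-velocity input the exponential smoother settles at a lag of velocity × (1 − alpha) / alpha, while the alpha-beta filter's lag decays toward zero. With real, noisy landmarks the gains would have to be tuned against jitter, and abrupt direction changes cause overshoot.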
0
0
321
Mar ’26
Massive CoreML latency spike on live AVFoundation camera feed vs. offline inference (CPU+ANE)
Hello, I’m experiencing a severe performance degradation when running CoreML models on a live AVFoundation video feed compared to offline or synthetic inference. This happens across multiple models I've converted (including SCI, RTMPose, and RTMW) and affects multiple devices. The Environment OS: macOS 26.3, iOS 26.3, iPadOS 26.3 Hardware: Mac14,6 (M2 Max), iPad Pro 11 M1, iPhone 13 mini Compute Units: cpuAndNeuralEngine The Numbers When testing my SCI_output_image_int8.mlpackage model, the inference timings are drastically different: Synthetic/Offline Inference: ~1.34 ms Live Camera Inference: ~15.96 ms Preprocessing is completely ruled out as the bottleneck. My profiling shows total preprocessing (nearest-neighbor resize + feature provider creation) takes only ~0.4 ms in camera mode. Furthermore, no frames are being dropped. What I've Tried I am building a latency-critical app and have implemented almost every recommended optimization to try and fix this, but the camera-feed penalty remains: Matched the AVFoundation camera output format exactly to the model input (640x480 at 30/60fps). Used IOSurface-backed pixel buffers for everything (camera output, synthetic buffer, and resize buffer). Enabled outputBackings. Loaded the model once and reused it for all predictions. Configured MLModelConfiguration with reshapeFrequency = .frequent and specializationStrategy = .fastPrediction. Wrapped inference in ProcessInfo.processInfo.beginActivity(options: .latencyCritical, reason: "CoreML_Inference"). Set DispatchQueue to qos: .userInteractive. Disabled the idle timer and enabled iOS Game Mode. Exported models using coremltools 9.0 (deployment target iOS 26) with ImageType inputs/outputs and INT8 quantization. Reproduction To completely rule out UI or rendering overhead, I wrote a standalone Swift CLI script that isolates the AVFoundation and CoreML pipeline. 
The script clearly demonstrates the ~15ms latency on live camera frames versus the ~1ms latency on synthetic buffers. (I have attached camera_coreml_benchmark.swift and the Core ML model, a very light low-light enhancement model, to this GitHub repo: https://github.com/pzoltowski/apple-coreml-camera-latency-repro.) My Question: Is this massive overhead expected behavior for AVFoundation + Core ML on live feeds, or is this a framework/runtime bug? If expected, what is the Apple-recommended pattern to bypass this camera-only inference slowdown? One thing I found interesting: when running in debug, the model was faster (not as fast as in the performance benchmark, but faster than 16ms). Also, when I ran some dummy calculation on a different DispatchQueue, the model seemed to get slightly faster. So maybe this is related to ANE power-state issues (jitter/SoC wake), with the ANE going to sleep too quickly and taking a long time to wake up? Running dummy calculations in the background is probably not a solution, though. Thanks in advance for any insights!
5
0
1.2k
Mar ’26
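About the neural network engine I built with Swift and Metal
I'm 18 years old. I have no machine learning background, never went to university, didn't even attend high school, and have no mentor.

A few days ago I was staring blankly at a sheet of paper when a thought hit me: why do computer neural networks have to be 2D? Could they imitate biology? Why must everything be computed on a plane? With several planes, wouldn't capacity multiply? What if you picture six sheets of paper as a Rubik's cube, with each of the six faces carrying neurons and the eight body diagonals becoming new communication channels?

I really enjoy tinkering with things like this, so I immediately drew up a detailed plan and wrote my first kernel with the help of AI tools. It crashed. I thought it over, shared my goal with friends in a QQ group, and wrote it again. It crashed again. Dozens of times in a row. No PyTorch, no TensorFlow, no CUDA. Only Swift and Metal. My computer's graphics card is an AMD Vega 64, and I installed no supporting frameworks because I wanted to understand how things work at the lowest level.

This is CubeNN.

## The following is a detailed explanation written by AI; the content and architecture have changed too much for me to explain it all clearly here

What it is

A neural network engine that uses the geometry of a Rubik's cube as its compute architecture.

Standard Transformer: lays the data out in a row, and every element looks at every other in O(n²)
CubeNN: distributes the data across 14 faces and only looks where it needs to

6 standard faces → block-sparse attention (coarse global view + fine local view)
8 X-face diagonals → cross-face information bridges (no attention, they only pass information along)
Each round: compute on the 6 faces → project onto the 8 X faces → upsample and refine → fuse back into the 6 faces

The most important part is Cube Cascade, a tree + chain cascaded inference scheme:

Tree stage: 1 cube spawns 8 → 8 spawn 64 → 73 explore in parallel, all running on the GPU at the same time, and the best path is selected
Chain stage: the best leaf is refined to unlimited depth; it converges in 3-5 steps, and variance improves by ~7%

How it's implemented

Pure Swift + Metal. Zero dependencies. Zero frameworks.

    // This is roughly what the code looks like
    import Metal
    import Foundation
    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(filepath: "cube_nn.metallib")
    // ...12 GPU kernels, 12,000 dispatches

Key technical decisions:

Single command buffer: every kernel dispatch for all 73 cubes of the tree stage is packed into one command buffer, with 0 CPU-GPU synchronizations
Pipeline state caching: encoding time dropped from 1022 ms to 42 ms
Buffer offsets: the 14 faces of all 73 cubes are stored in one contiguous buffer, and kernels receive the offset through buffer(15)
FP16: for N≥64, half precision gives a 21% speedup

Performance

## Tested, but the numbers may be inaccurate because of device differences; for reference only

AMD Radeon RX Vega 64 (2017 graphics card, 14nm, 295W):

Scale    Neurons   Cubes       Time
N=32     6,144     73 (tree)   435ms
N=64     24,576    21 (tree)   817ms
N=128    98,304    1           116ms

N=32: fully connected attention costs 201M FLOP per layer → CubeNN block-sparse costs 370K FLOP (544× reduction)
N=128: fully connected would need 32GB of VRAM (which physically doesn't exist) → CubeNN uses 192KB
N=256: fully connected needs 2.2T FLOP → CubeNN needs 52M FLOP (42,300× reduction)

Code size: 161KB, compared with 800MB for PyTorch.

What I went through

The hardest part of this project wasn't writing kernels. It was finding a path through repeated trial and error with nobody to tell me whether it "could be done".

The first time I tried to run 73 cubes, the GPU simply hung. It took 3 days to trace the cause to too many stacked command buffers.
I switched to a single-encoder design and then hit SIGILL: Metal does not allow makeBuffer(length: 0), and a zero-length buffer was being created when B=0.
I tried to use threadgroup memory for kernel fusion, but the data couldn't be read across threadgroups, and only then did I understand that LDS is per-group.
For FP16 at N=64 I had to hand-write float↔half conversion functions, because the Float16 type is marked unavailable on macOS 11.

Every crash taught me a low-level detail of Metal. Nobody taught me, but Metal's error messages were the best teacher.

Why post on the Apple Developer Forums

Because this is a project born for the Apple ecosystem. From start to finish CubeNN uses only two things: Swift and Metal. It can run on any Apple Silicon Mac without porting (the APIs are compatible). If some kernels could be mapped to the Neural Engine in the future, efficiency would improve several times over.

I'd like to ask Apple's Metal engineers and the Core ML team:

Is there a better way to schedule GPU work? Performance is still not good enough (for a perfectionist like me), and the code may have become a bit messy after all the changes.
Would anyone be interested in evaluating this architecture on M4? All I have is a Vega 64. How would CubeNN perform with an M4 GPU + ANE approach?

Source code

├── run.swift                  # unified CLI, parameterized by N/B/depth
├── src/
│   ├── cube_nn.metal          # FP16 kernel
│   └── cube_nn_fp32.metal     # FP32 kernel
└── benchmarks/                # measured data

If you've read this far, thank you. An outsider got here with Metal and an obsessive idea, pure almost to the point of delusion. I don't know very much, and if this architecture has any value, I want to make it better. Any suggestions, criticism, or guidance are very welcome.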
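The cube and neuron counts quoted in the post can be cross-checked with a little arithmetic. This Python sketch is not taken from CubeNN's source. It assumes that each of the 6 standard faces holds N×N neurons (which matches the neuron column of the table) and that the tree stage is a full tree with a fixed branching factor. The function names are hypothetical.

```python
def tree_cubes(branching: int = 8, depth: int = 2) -> int:
    """Cubes in a full tree: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(depth + 1))


def standard_face_neurons(n: int, faces: int = 6) -> int:
    """Neurons on the standard faces, assuming n * n neurons per face."""
    return faces * n * n


# 1 root + 8 children + 64 grandchildren
print("tree cubes:", tree_cubes())
for n in (32, 64, 128):
    print(f"N={n}: {standard_face_neurons(n)} neurons")
```

Under these assumptions the sketch reproduces the 73 cubes and the 6,144 / 24,576 / 98,304 neuron counts. The 21 cubes reported for N=64 happen to equal 1 + 4 + 16 (a branching factor of 4), although the post does not say how that run was configured.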
Replies
0
Boosts
0
Views
35
Activity
20h
Core ML RIP?
No mention of Core ML at WWDC26... Shall we assume it was replaced by Core AI? What about Adapters?
Replies
1
Boosts
0
Views
125
Activity
2d
Silent FP16 Overflow in coremltools: 5 Numerical Failures Affecting ANE Inference (With Fixes)
Hi everyone, With the announcements at WWDC26 regarding Core AI and "automatic stable decompositions," it is clear that managing mathematical stability in constrained FP16 environments is a major priority for the ecosystem. To support developers maintaining existing models that cannot migrate to the newest architectures overnight, I have published a research paper and an open-source static analysis tool documenting 5 silent numerical failures in the standard coremltools pipeline. Because the Apple Neural Engine (ANE) executes inference in FP16, the maximum representable value is 65,504 (approximately $\exp(11.09)$). Inputs exceeding these tight bounds cause silent overflows to infinity or collapses to zero without warnings. Deployed Operations Currently Affected softplus (YOLOv5/v8): Outputs silently collapse to 0.0 at $x > 10.4$ on ANE. logsumexp (Attention mechanisms): Overflows at $x > 7.63$ for 32 channels. For vocabulary-sized reductions, the threshold drops below $5$. log_softmax (Classifiers like BERT, GPT, ViT): Softmax probabilities underflow to 0, causing $\log(0) = -\infty$. logcumsumexp (CTC decoders): Overflows at $x > 11.09$. mish (YOLO variants): Inherits the softplus overflow limits. The Immediate Safety Net: Algebraically Equivalent Reformulations We can bypass these hardware limits entirely by rewriting the operations into mathematically stable forms. For example, rewriting softplus as: $$\max(x, 0) + \log(1 + \exp(-|x|))$$ Because $-|x| \le 0$, $\exp(-|x|)$ is bounded within $(0, 1]$. Overflow becomes mathematically impossible in any precision, yielding bit-identical outputs for all valid inputs. While PyTorch AMP traditionally classifies these operations as FP32-only, the ANE has no such fallback, which makes stable decomposition mandatory. 
Tools & Patches Deployed Today The Paper: "Silent Numerical Failures in On-Device ML Converters: A Systematic Audit of FP16 Overflow in Apple Neural Engine Deployment." (Complete vulnerability census, discrepancy pattern analysis, formal proofs, and quantitative evaluation). The Tool (ane-fp16-lint): A CLI that scans .mlpackage files and flags FP16-unsafe operations before you push to production. It detects nine patterns and provides stable alternatives for each. The Fixes: We have submitted three Pull Requests to the official apple/coremltools repository implementing these stable decompositions, which are currently under review by Apple's Core ML team. While Core AI introduces great automated stability for new architectures like the 20B AFM 3 Core Advanced, millions of deployed production models still need an immediate safety net. Full technical paper, proofs, and the linting tool are available on GitHub: github.com/apple-f16-overflow-audit (Note: Replace with your direct, clean GitHub repository link—avoiding social media redirects so the forum filters do not auto-flag the post) Looking forward to hearing if anyone else has run into these unexpected discrepancy patterns in production!
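The softplus rewrite described in the post can be tried without any ML framework. The sketch below is not code from the paper, the linter, or coremltools. It emulates FP16 rounding with Python's `struct` half-precision format, and the helper names (`to_fp16`, `softplus_naive_fp16`, `softplus_stable_fp16`) are hypothetical. It only shows the intermediate overflowing under IEEE rounding; what the ANE actually returns in that case is as reported in the post, not something this emulation can confirm.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    struct raises OverflowError for magnitudes too large for FP16;
    that case is mapped to +/-inf, as round-to-nearest would do.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def softplus_naive_fp16(x: float) -> float:
    """log(1 + exp(x)) with every intermediate rounded to FP16."""
    e = to_fp16(math.exp(x))          # overflows once exp(x) exceeds 65,504
    s = to_fp16(1.0 + e)
    return to_fp16(math.log(s))


def softplus_stable_fp16(x: float) -> float:
    """max(x, 0) + log(1 + exp(-|x|)) with every intermediate rounded to FP16."""
    e = to_fp16(math.exp(-abs(x)))    # at most 1, so it cannot overflow
    tail = to_fp16(math.log1p(e))
    return to_fp16(to_fp16(max(x, 0.0)) + tail)


for x in (0.0, 5.0, 12.0):
    print(x, softplus_naive_fp16(x), softplus_stable_fp16(x))
```

In this emulation the two forms agree for moderate inputs, while at x = 12 the naive form's intermediate `exp(x)` no longer fits in FP16 and the stable form still returns a finite value.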
Replies
0
Boosts
0
Views
22
Activity
2d
LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?
We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive. In several cases, MoE models significantly outperform dense models despite having similar total parameter counts. Examples (simplified): Dense model: ~30B parameters MoE model: ~30B total parameters, ~3B active parameters On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead. A few hypotheses we're considering: Active parameter count appears to matter more than total parameter count for decode throughput. Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected. Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not. Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput. What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically. Has anyone profiled decode throughput across: dense models large-expert MoE many-small-expert MoE and identified which hardware characteristics are actually driving the difference? I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.
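The memory-traffic hypothesis in the post can be turned into a back-of-the-envelope bound. The sketch below assumes that single-token decode reads every active weight exactly once per token and that nothing else limits throughput. The 400 GB/s bandwidth and 4-bit weights are illustrative assumptions, not measurements from the post, and the function name is hypothetical.

```python
def decode_tokens_per_second(active_params: float,
                             bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 400.0  # assumed memory bandwidth, not a measurement
BITS = 4.0              # assumed 4-bit weight quantization

dense = decode_tokens_per_second(30e9, BITS, BANDWIDTH_GB_S)  # ~30B active
moe = decode_tokens_per_second(3e9, BITS, BANDWIDTH_GB_S)     # ~3B active

print(f"dense upper bound: {dense:.1f} tok/s")
print(f"MoE upper bound:   {moe:.1f} tok/s")
print(f"ratio:             {moe / dense:.1f}x")
```

Under this simple model the bound scales with active parameters rather than total parameters. It ignores KV-cache reads, router cost, expert locality, and kernel efficiency, which is where the expert-geometry and quantization-layout effects raised in the post would show up.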
Replies
0
Boosts
0
Views
30
Activity
2d
Apple GPU forward progress guarantees for persistent-thread synchronization?
We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs. We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics. What we've observed: Device-scope atomics are available. It is possible to build global counters and persistent-thread style coordination structures. However, we cannot find any documented guarantee regarding: threadgroup co-residency, global forward progress, occupancy-bounded synchronization safety. In our experiments, synchronization schemes that rely on all threadgroups making progress eventually can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust. Questions: Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification? Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions? For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API? We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems. Thanks.
Replies
0
Boosts
0
Views
23
Activity
2d
Resolving co channel interference VOIP
Subject: Inquiry Regarding Architectural Overhead and Buffer Access in the Push to Talk Framework for Real-Time Core ML Blind Source Separation Dear Apple Engineering Team, We are currently developing an Apple-native communication platform that utilizes the Push to Talk framework alongside Core ML to handle real-time, on-device audio processing. We are working to resolve the issue of single-channel, co-channel interference (overlapping voice streams) directly on the edge. Our current challenge lies in the pipeline latency and background lifecycle constraints when intercepting incoming audio buffers. To cleanly separate overlapping voices before they hit the audio output mixer, we need to process the raw PCM data immediately upon arrival. Could you please provide guidance on the following architectural questions: Low-Latency Buffer Interception: What is the recommended design pattern within the PTChannelManagerDelegate flow to pass raw incoming audio buffers directly to a Core ML model running on the Apple Neural Engine (ANE) before the system routes them to AVAudioEngine for playback? Background Thread Management: Given the strict background execution boundaries enforced by the Push to Talk framework, how can we best optimize thread scheduling to ensure our speech separation model completes its execution without triggering an OS background processing timeout or process termination? Dynamic UI Manifestation: Once a combined audio stream is separated into two clean, distinct voice vectors on-device, what is the best approach for registering multiple PTParticipant states simultaneously so that the native system UI (like the Dynamic Island) accurately reflects both speakers? Thank you for your time, insights, and continued support of developer innovation within the iOS and iPadOS ecosystems. Best regards, Ken Zakreski Founder, Marine Link Pro
Replies
2
Boosts
0
Views
51
Activity
3d
_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output: errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (it reproduces with random weights). The three-op combination of MatMul + BiasAdd + Relu triggers the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct: tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values.

A minimal reproducer:

    import tensorflow as tf
    import numpy as np

    # Any shape works; larger K makes the error more obvious
    M, K, N = 64, 2048, 1024
    W = tf.Variable(tf.random.normal([K, N]))
    b = tf.Variable(tf.random.normal([N]))
    x = tf.random.normal([M, K])

    @tf.function
    def graph_fused(x):
        return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))

    @tf.function
    def graph_safe(x):
        return tf.nn.relu(tf.matmul(x, W) + b)  # AddV2 instead of BiasAdd

    eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))  # eager = correct
    fused_out = graph_fused(x)  # Metal graph mode = WRONG
    safe_out = graph_safe(x)    # Metal graph mode = correct

    print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
    # ^ typically 30–80+ (WRONG)
    print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
    # ^ typically ~1e-5 (correct)

Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
Replies
0
Boosts
0
Views
1.1k
Activity
2w
When will mps support fp8 dtypes?
https://github.com/pytorch/pytorch/issues/132624 This issue about unsupported fp8 dtypes has been open for 2 years. Does MLX have any plans to support them?
Replies
0
Boosts
0
Views
608
Activity
2w
Does loading multiple functions that share model weights multiply memory use?
Hi, I have a multifunction model where the functions share the same model weights, and for latency I keep multiple functions loaded at the same time. According to what Codex found, this multiplies RAM usage: if a single model weighs 2 GB, loading two functions that share the underlying weights still doubles RAM usage to 4 GB (it seems to be something like neural wired memory). Does anyone know more about this?
Replies
0
Boosts
0
Views
1.2k
Activity
3w
CoreML model load failed with this error : Failed to set up decrypt context for /private/var/mobile/Containers/Data/Application/ACB94507-F8DE-494B-8499-B0CF75FC3B55/Library/Caches/temp.m/xxx.mlmodelc. error:-42905"
Hi there. We use a Core ML model for image processing, and because loading the model takes a long time (~10 s), we preload it at app launch. On some devices, however, loading the model fails with this error. We download the Core ML model from our server and then load it from local storage. The loading code is typical and looks like this: MLModel.load(contentsOf: compliedUrl, configuration: config). Once this error happens, loading keeps failing until we restart the device. (+) In this thread, I saw that it is related to a "limitation of decrypt session": https://developer.apple.com/forums/thread/707622. However, it also happens in in-house TestFlight builds that are used by fewer than 5 people. Can anyone tell me why this happens?
Replies
4
Boosts
1
Views
2.5k
Activity
3w
CoreML model cache causes fake hard drive memory usage
Hi, I experiment by creating and compiling a lot of CoreML models and I have the issue that this causes a lot of disk usage, but when I try to delete everything (I search in the disk for possible CoreML cache directories) the disk space is not actually freed up. This is a picture of my disk usage according to what is shown inside of Settings>General>Storage and the Disk Utility app. I am running on macOS 15.7.5
Replies
0
Boosts
0
Views
1.5k
Activity
May ’26
Does using Vision API offline to label a custom dataset for Core ML training violate DPLA?
Hello everyone, I am currently developing a smart camera app for iOS that recommends optimal zoom and exposure values on-device using a custom Core ML model. I am still waiting for an official response from Apple Support, but I wanted to ask the community if anyone has experience with a similar workflow regarding App Review and the DPLA. Here is my training methodology: I gathered my own proprietary dataset of original landscape photos. I generated multiple variants of these photos with different zoom and exposure settings offline on my Mac. I used the CalculateImageAestheticsScoresRequest (Vision framework) via a local macOS command-line tool to evaluate and score each variant. Based on those scores, I labeled the "best" zoom and exposure parameters for each original photo. I used this labeled dataset to train my own independent neural network using PyTorch, and then converted it to a Core ML model to ship inside my app. Since the app uses my own custom model on-device and does not send any user data to a server, the privacy aspect is clear. However, I am curious if using the output of Apple's Vision API strictly offline to label my own dataset could be interpreted as "reverse engineering" or a violation of the Developer Program License Agreement (DPLA). Has anyone successfully shipped an app using a similar knowledge distillation or automated dataset labeling approach with Apple's APIs? Did you face any pushback during App Review? Any insights or shared experiences would be greatly appreciated!
Replies
1
Boosts
0
Views
525
Activity
Apr ’26
MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)
Summary
Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code run correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend.

Environment
Affected: Mac mini M1, Macmini9,1 (A14-class)
Not affected: M2 and newer (A15-class and up)
Last known good: macOS Sequoia
First broken: macOS 26 (Tahoe)?
Confirmed broken on: macOS 26.3.1
Framework: Core ML + MPS backend
Language: C++ (via CoreML C++ API)

Description
We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on the Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs, including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails, either producing garbage output or crashing. The same binary, same .mlpackage, and same inputs work correctly on M2+. Our Apple contact has suggested the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware.

Steps to Reproduce
Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model)
Run inference with MLComputeUnits::cpuAndGPU (MPS active)
Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1
Compare output to the same model running on M2 / macOS Sequoia

Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior
Actual: Incorrect / corrupted output (or crash), only on A14-class hardware running macOS 26+

Workaround
Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming the issue is in the MPS compute path. This is not acceptable as a shipping workaround due to its performance impact.

Additional Notes
The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug
We first became aware of this through a customer report
Happy to provide a symbolicated crash log if helpful

This text was summarized by AI and human verified.
Replies
2
Boosts
0
Views
445
Activity
Apr ’26
CoreML MLE5ProgramLibrary AOT recompilation hangs/crashes on iOS 26.4 — C++ exception in espresso IR compiler bypasses Swift error handling
Area: CoreML / Machine Learning Describe the issue: On iOS 26.4, calling MLModel(contentsOf:configuration:) to load an .mlpackage model hangs indefinitely and eventually kills the app via watchdog. The same model loads and runs inference successfully in under 1 second on iOS 26.3.1. The hang occurs inside eort_eo_compiler_compile_from_ir_program (espresso) during on-device AOT recompilation triggered by MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:. A C++ exception (__cxa_throw) is thrown inside libBNNS.dylib during the exception unwind, which then hangs inside __cxxabiv1::dyn_cast_slow and __class_type_info::search_below_dst. Swift's try/catch does not catch this — the exception originates in C++ and the process hangs rather than terminating cleanly. Setting config.computeUnits = .cpuOnly does not resolve the issue. MLE5ProgramLibrary initialises as shared infrastructure regardless of compute units. Steps to reproduce: Create an app with an .mlpackage CoreML model using the MLE5/espresso backend Call MLModel(contentsOf: modelURL, configuration: config) at runtime Run on a device on iOS 26.3.1 — loads successfully in <1 second Update device to iOS 26.4 — hangs indefinitely, app killed by watchdog after 60–745 seconds Expected behaviour: Model loads successfully, or throws a catchable Swift error on failure. Actual behaviour: Process hangs in MLE5ProgramLibrary.lazyInitQueue. App killed by watchdog. No Swift error thrown. 
Full stack trace at point of hang:

Thread 1 Queue: com.apple.coreml.MLE5ProgramLibrary.lazyInitQueue (serial)
    frame 0: __cxxabiv1::__class_type_info::search_below_dst libc++abi.dylib
    frame 1: __cxxabiv1::(anonymous namespace)::dyn_cast_slow libc++abi.dylib
    frame 2: ___lldb_unnamed_symbol_23ab44dd4 libBNNS.dylib
    frame 23: eort_eo_compiler_compile_from_ir_program espresso
    frame 24: -[MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:] CoreML
    frame 25: -[MLE5ProgramLibrary _programLibraryHandleWithForceRespecialization:error:] CoreML
    frame 26: __44-[MLE5ProgramLibrary prepareAndReturnError:]_block_invoke CoreML
    frame 27: _dispatch_client_callout libdispatch.dylib
    frame 28: _dispatch_lane_barrier_sync_invoke_and_complete libdispatch.dylib
    frame 29: -[MLE5ProgramLibrary prepareAndReturnError:] CoreML
    frame 30: -[MLE5Engine initWithContainer:configuration:error:] CoreML
    frame 31: +[MLE5Engine loadModelFromCompiledArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 32: +[MLLoader _loadModelWithClass:fromArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 45: +[MLModel modelWithContentsOfURL:configuration:error:] CoreML
    frame 46: @nonobjc MLModel.__allocating_init(contentsOf:configuration:) GKPersonalV2
    frame 47: MDNA_GaitEncoder_v1_3.__allocating_init(contentsOf:configuration:)
    frame 48: MDNA_GaitEncoder_v1_3.__allocating_init(configuration:)
    frame 50: GaitModelInference.loadModel()
    frame 51: GaitModelInference.init()

iOS version: Reproduced on iOS 26.4. Works correctly on iOS 26.3.1.
Xcode version: 26.2
Device: iPhone (model used in testing)
Model format: .mlpackage
Replies
4
Boosts
0
Views
934
Activity
Apr ’26
Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome
Hi all, I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. Sharing it here in case anyone else hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream. Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core Why As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text. What's in the package A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer) Per-Layer Embedding (PLE) support — the shared embedding table that feeds every decoder layer through a gated MLP as a third residual KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte Measured on iPhone (A-series, 7.4 GB RAM) Model: mlx-community/gemma-4-e2b-it-4bit Warm load: ~6 s Memory after load: 341–392 MB Time to first token (end-to-end, 333-token system prompt): 2.82 s Generation throughput: 12–14 tok/s What I'd love feedback on Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed? The chat-template bypass works but feels like a workaround. 
Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely? Anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging a 0.2.0. Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.
Replies
1
Boosts
0
Views
414
Activity
Apr ’26
CoreML GPU NaN bug with fused QKV attention on macOS Tahoe
Problem: CoreML produces NaN on GPU (works fine on CPU) when running transformer attention with fused QKV projection on macOS 26.2.

Root cause: The common::fuse_transpose_matmul optimization pass triggers a Metal kernel bug when sliced tensors feed into matmul(transpose_y=True).

Workaround:

    import coremltools as ct

    # Start from the default pass pipeline and drop the pass that triggers the bug
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(['common::fuse_transpose_matmul'])
    mlmodel = ct.convert(model, ..., pass_pipeline=pipeline)

Minimal repro: https://github.com/imperatormk/coreml-birefnet/blob/main/apple_bug_repro.py

Affected: Any ViT/Swin/transformer with fused QKV attention (BiRefNet, etc.)

Has anyone else hit this? Filed FB report too.
Replies
1
Boosts
0
Views
650
Activity
Apr ’26
Memory stride warning when loading CoreML models on ANE
When I do an uncached load of a Core ML model on the ANE, I receive this warning in the Xcode console: Type of hiddenStates in function main's I/O contains unknown strides. Using unknown strides for MIL tensor buffers with unknown shapes is not recommended in E5ML. Please use row_alignment_in_bytes property instead. Refer to https://e5-ml.apple.com/more-info/memory-layouts.html for more information. However, the web link does not seem to work. Where can I find more information about this, and how can I fix it?
Replies
2
Boosts
0
Views
834
Activity
Mar ’26
CoreML regression between macOS 26.0.1 and macOS 26.1 Beta causing scrambled tensor outputs
We’ve encountered what appears to be a CoreML regression between macOS 26.0.1 and macOS 26.1 Beta. In macOS 26.0.1, CoreML models run and produce correct results. However, in macOS 26.1 Beta, the same models produce scrambled or corrupted outputs, suggesting that tensor memory is being read or written incorrectly. The behavior is consistent with a low-level stride or pointer arithmetic issue — for example, using 16-bit strides on 32-bit data or other mismatches in tensor layout handling. Reproduction Install ON1 Photo RAW 2026 or ON1 Resize 2026 on macOS 26.0.1. Use the newest Highest Quality resize model, which is Stable Diffusion–based and runs through CoreML. Observe correct, high-quality results. Upgrade to macOS 26.1 Beta and run the same operation again. The output becomes visually scrambled or corrupted. We are also seeing similar issues with another Stable Diffusion UNet model that previously worked correctly on macOS 26.0.1. This suggests the regression may affect multiple diffusion-style architectures, likely due to a change in CoreML’s tensor stride, layout computation, or memory alignment between these versions. Notes The affected models are exported using standard CoreML conversion pipelines. No custom operators or third-party CoreML runtime layers are used. The issue reproduces consistently across multiple machines. It would be helpful to know if there were changes to CoreML’s tensor layout, precision handling, or MLCompute backend between macOS 26.0.1 and 26.1 Beta, or if this is a known regression in the current beta.
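To make the suspected failure class concrete, the Python sketch below shows how a row stride computed for 16-bit elements scrambles a tightly packed float32 buffer. It is a purely hypothetical illustration of the kind of mismatch the post speculates about. It says nothing about what Core ML actually does internally, and the names used are invented for the example.

```python
import struct

ROWS, COLS = 2, 4
values = [float(i) for i in range(ROWS * COLS)]      # 0.0 .. 7.0
buffer = struct.pack(f"<{ROWS * COLS}f", *values)    # tightly packed float32


def read_rows(buf: bytes, row_stride_bytes: int) -> list:
    """Read ROWS rows of COLS float32 values, advancing row_stride_bytes per row."""
    return [list(struct.unpack_from(f"<{COLS}f", buf, r * row_stride_bytes))
            for r in range(ROWS)]


correct = read_rows(buffer, COLS * 4)  # 4 bytes per float32 element
wrong = read_rows(buffer, COLS * 2)    # stride computed as if elements were 16-bit

print("correct:", correct)
print("wrong:  ", wrong)
```

With the wrong stride, the second row starts halfway through the first, so rows overlap and part of the data is never read. Applied to an image tensor, that kind of overlap would look like the scrambled output described above.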
Replies
8
Boosts
4
Views
2.4k
Activity
Mar ’26
How does ARKit achieve low-latency and stable head tracking using only RGB camera ?
Hi, I’m working on a real-time head/face tracking pipeline using a standard 2D RGB camera, and I’m trying to better understand how ARKit achieves such stable and responsive results in comparable conditions. To clarify upfront: I’m specifically interested in RGB-only tracking and the underlying vision/ML pipeline. I’m not using TrueDepth or any depth/IR-based sensors, and I’d like to understand how similar stability and responsiveness can be achieved under those constraints. In my current setup, I estimate head pose from RGB frames (facial landmarks + PnP) and apply temporal filtering (e.g., exponential smoothing and Kalman filtering). This significantly reduces jitter, but introduces noticeable latency, especially during faster head movements. What stands out in ARKit is that it appears to maintain both: Very low jitter Very low perceived latency even when operating with camera input alone. I’m trying to understand what techniques might contribute to this behavior. In particular: Does ARKit use predictive tracking (e.g., velocity or acceleration-based pose extrapolation) to compensate for camera and processing delays in RGB-only scenarios? Are there recommended strategies for balancing temporal smoothing and responsiveness without introducing visible lag in camera-based pose estimation pipelines? Is the tracking pipeline internally decoupled from rendering (e.g., asynchronous processing with prediction applied at render time)? Are there general best practices for minimizing end-to-end latency in vision-based head tracking systems beyond standard filtering approaches? I understand that implementation details may not be public, but any high-level insights or pointers would be greatly appreciated. Thanks!
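One published technique aimed at the jitter-versus-lag trade-off described above is the One Euro filter (Casiez, Roussel and Vogel), which raises its low-pass cutoff as the signal moves faster. The Python sketch below is a minimal single-channel version for experimentation. It is not a description of how ARKit works, and the parameter values and the `ramp_lag` helper are assumptions chosen for illustration.

```python
import math


class OneEuroFilter:
    """Adaptive low-pass filter: smooth when slow, responsive when fast."""

    def __init__(self, min_cutoff: float = 1.0, beta: float = 0.0,
                 d_cutoff: float = 1.0):
        self.min_cutoff = min_cutoff  # cutoff (Hz) used when the signal is still
        self.beta = beta              # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff      # cutoff (Hz) for the speed estimate
        self._x_prev = None
        self._dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff_hz: float, dt: float) -> float:
        tau = 1.0 / (2.0 * math.pi * cutoff_hz)
        return 1.0 / (1.0 + tau / dt)

    def update(self, x: float, dt: float) -> float:
        if self._x_prev is None:      # first sample passes through unchanged
            self._x_prev = x
            return x
        dx = (x - self._x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self._dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self._x_prev
        self._x_prev, self._dx_prev = x_hat, dx_hat
        return x_hat


def ramp_lag(filt: OneEuroFilter, speed: float = 100.0, fps: float = 60.0,
             frames: int = 120) -> float:
    """Final lag (input minus output) while tracking a constant-speed ramp."""
    dt = 1.0 / fps
    x = out = 0.0
    for i in range(frames):
        x = speed * i * dt            # e.g. yaw in degrees at 100 deg/s
        out = filt.update(x, dt)
    return x - out


print("fixed cutoff lag:   ", ramp_lag(OneEuroFilter(beta=0.0)))
print("adaptive cutoff lag:", ramp_lag(OneEuroFilter(beta=0.5)))
```

With `beta = 0` the filter reduces to plain exponential smoothing at a fixed cutoff. A positive `beta` reduces lag during fast head motion while keeping the same smoothing at rest. Both parameters would need tuning against real landmark noise, and prediction at render time is a separate step that this sketch does not cover.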
Replies
0
Boosts
0
Views
321
Activity
Mar ’26