Core ML

Integrate machine learning models into your app using Core ML.


Post

Replies

Boosts

Views

Activity

CoreML GPU NaN bug with fused QKV attention on macOS Tahoe

Problem: CoreML produces NaN on GPU (works fine on CPU) when running transformer attention with fused QKV projection on macOS 26.2.

Root cause: The common::fuse_transpose_matmul optimization pass triggers a Metal kernel bug when sliced tensors feed into matmul(transpose_y=True).

Workaround:

    import coremltools as ct

    # Drop the pass that triggers the bug, then convert as usual
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(['common::fuse_transpose_matmul'])
    mlmodel = ct.convert(model, ..., pass_pipeline=pipeline)

Minimal repro: https://github.com/imperatormk/coreml-birefnet/blob/main/apple_bug_repro.py

Affected: Any ViT/Swin/transformer with fused QKV attention (BiRefNet, etc.)

Has anyone else hit this? I have filed an FB report too.

Machine Learning & AI Core ML

601

Apr ’26

Is it possible to instantiate MLModel strictly from memory (Data) to support custom encryption?

We are trying to implement a custom encryption scheme for our Core ML models. Our goal is to bundle encrypted models, decrypt them into memory at runtime, and instantiate the MLModel without the unencrypted model file ever touching the disk.

We have looked into the native Apple encryption described here: https://developer.apple.com/documentation/coreml/encrypting-a-model-in-your-app but it has limitations: it does not work on Intel Macs or without SIP, and it does not work when loading from a dylib.

Most of the Core ML APIs seem to require a file path. There are the MLModelAsset APIs, but I think they just write a modelc back to disk when compiling; I can't find any information confirming that. I am also concerned that this seems to be an older API and that it means we would need to compile at runtime.

I am aware that the native encryption will be much more secure, but we would like to avoid having the models in readable text on disk. Does anyone know whether this is possible, or know of any alternatives for obfuscating Core ML models? Thanks.

Machine Learning & AI Core ML Security Core ML

688

Feb ’26

MLX/Ollama Benchmarking Suite - Open Source and Free

Hi all, I spent the last few months developing an MLX/Ollama local AI benchmarking suite for Apple Silicon. It is written in pure Swift, signed with an Apple Developer certificate, open source (GPL), and free. I would love some feedback to continue development. It is the only benchmarking suite I know of that supports live power metrics and MLX natively, as well as quick exports of benchmark results and an arena mode (Model A vs. B, with history). I really want this project to succeed and see widespread use; reaching 75 stars on the GitHub repo would make it eligible for Homebrew/Cask distribution. GitHub repo

Machine Learning & AI Core ML

295

Feb ’26

Unable to load a quantized Qwen 1.7B model on an iPhone SE 3

I am trying to benchmark whether the Qwen3 1.7B model can run on an iPhone SE 3 [4 GB RAM]. My core problem: even with weight quantization, the model cannot be loaded into memory on the SE 3.

What I've tried: I am converting a Torch model to the Core ML format using coremltools. I have tried the following combinations of quantization and context length:
• 8 bit + 1024
• 8 bit + 2048
• 4 bit + 1024
• 4 bit + 2048

All of the above conversions use a dynamic shape, with the default being [1,1], in the hope that the whole context length does not get allocated in memory.
• The 4-bit model is approximately 865 MB on disk.
• The 8-bit model is approximately 1.7 GB on disk.

During load: With the int4 quantization, memory spikes a lot during the initial load. Could this be because many operations are converted to int8 or fp16, as Core ML does not perform operations natively on int4? With int8, the profiler shows that memory does not go above 2 GB (only 900 MB), but the model still fails to load with the following error. 2 GB is the limit at which jetsam kills the app on the iPhone SE 3.

    E5RT: Error(s) occurred compiling MIL to BNNS graph: [CreateBnnsGraphProgramFromMIL]: BNNS Graph Compile: failed to preallocate file with error: No space left on device for path: /var/mobile/Containers/Data/Application/5B8BB7D2-06A6-4BAE-A042-407B6D805E7C/Library/Caches/com.tss.qwen3-coreml/com.apple.e5rt.e5bundlecache/23A341/<long key>.tmp.12586_4362093968.bundle/H14.bundle/main/main_bnns/bnns_program.bnnsir

Some online sources have suggested activation quantization, but I am unsure whether that will have any impact on loading [as the spike is during load and not inference]. The model spec also suggests that there is no dequantization happening (e.g. from 4 bit -> fp16).

So I have a couple of questions:
• Has anyone faced similar issues?
• What could be the reasons for the temporary memory spike during load?
• What approaches can be adopted to deal with this issue?

Any help would be greatly appreciated. Thank you.
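As a rough sanity check on the sizes quoted above, the weight footprint can be estimated from the parameter count and the bit width. This is only a back-of-the-envelope sketch: it assumes roughly 1.7 billion parameters (taken from the model name) and ignores quantization scales, any tensors kept at higher precision, and file metadata. The function name is illustrative, not part of coremltools.

```python
def estimated_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


PARAMS = 1_700_000_000  # assumption: "1.7B" taken from the model name

for bits in (4, 8, 16):
    size_gb = estimated_weight_bytes(PARAMS, bits) / 1e9
    print(f"{bits}-bit weights: ~{size_gb:.2f} GB")
```

The 4-bit and 8-bit estimates (about 0.85 GB and 1.7 GB) line up with the reported on-disk sizes. A 16-bit copy of the same weights would be about 3.4 GB, so if any stage of loading materialized fp16 weights, that alone would exceed the 2 GB limit mentioned in the post. Whether the loader actually does this is not confirmed here.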

Machine Learning & AI Core ML Core ML

426

Mar ’26

Qwen3 VL CoreML

I'm looking for help with, or offering to help with, the Vibe Coders edition of the cml editor, given the pending document enhancement. I'd also like more information on how to use the .mlkey, and on whether my model is supposed to say iOS18 when I am planning to use it on a Mac. Apple Intelligence seems to think Core ML is for iOS, but are the capabilities extended when running the NPU on the book? How do I use this graph? Coming in hot, sorry. By the way, there are hundreds of feedback and crash reports sent in from me, if additional info is needed. I attached an image that might help with updating Tags

Machine Learning & AI Core ML

426

Mar ’26

Core Model Editor and Params

Optimal Precision
• Current Precision: Mixed (Float32, int32)
• Optimal Precision: Not specified in the image, but typically involves using the most efficient data type for the model's operations to balance speed and memory usage without significant loss of accuracy.

Comparison:
• Mixed Precision: Uses both Float32 and int32. Float32 provides high precision for weights and activations, while int32 typically holds indices and shape values.
• Optimal Precision: Aimed at achieving the best trade-off between performance and accuracy, potentially using other data types like Float16 or bfloat16 for even greater efficiency in certain hardware environments.

Operation Distribution
• Current Distribution:
  • iOS18.mul: 168
  • iOS18.transpose: 126
  • iOS18.linear: 98
  • iOS18.add: 97
  • iOS18.sliceByIndex: 96
  • iOS18.expandDims: 74
  • iOS18.concat: 72
  • iOS18.squeeze: 72
  • iOS18.reshape: 67
  • iOS18.layerNorm: 49
  • iOS18.matmul: 48
  • iOS18.gelu: 26
  • iOS18.softmax: 24
  • Split: 24
  • conv: 1
  • iOS18.conv: 1

Comparison:
• Operation Count: Indicates how many times each operation appears in the model. High counts for operations like mul, transpose, and linear suggest where the model spends much of its work, although a count alone does not measure cost.
• Optimization Opportunities: Reducing the count of high-frequency operations or optimizing their execution can improve performance. This might involve pruning unnecessary operations, optimizing algorithms, or leveraging hardware acceleration.

General Recommendations
• Precision Tuning: Experiment with different precision levels to find the best balance for your specific hardware and accuracy requirements.
• Operation Optimization: Focus on optimizing the most frequent operations. Techniques include using more efficient algorithms, parallelizing computations, or utilizing specialized hardware like the GPU or the Neural Engine.
• Benchmarking: Regularly benchmark the model to assess the impact of changes and ensure that optimizations lead to meaningful performance improvements.

By focusing on these areas, you can potentially enhance the efficiency and performance of your ML model.

Machine Learning & AI Core ML

216

Feb ’26

Massive CoreML latency spike on live AVFoundation camera feed vs. offline inference (CPU+ANE)

Hello, I’m experiencing a severe performance degradation when running CoreML models on a live AVFoundation video feed compared to offline or synthetic inference. This happens across multiple models I've converted (including SCI, RTMPose, and RTMW) and affects multiple devices.

The Environment
• OS: macOS 26.3, iOS 26.3, iPadOS 26.3
• Hardware: Mac14,6 (M2 Max), iPad Pro 11 M1, iPhone 13 mini
• Compute Units: cpuAndNeuralEngine

The Numbers
When testing my SCI_output_image_int8.mlpackage model, the inference timings are drastically different:
• Synthetic/Offline Inference: ~1.34 ms
• Live Camera Inference: ~15.96 ms

Preprocessing is completely ruled out as the bottleneck. My profiling shows total preprocessing (nearest-neighbor resize + feature provider creation) takes only ~0.4 ms in camera mode. Furthermore, no frames are being dropped.

What I've Tried
I am building a latency-critical app and have implemented almost every recommended optimization to try to fix this, but the camera-feed penalty remains:
• Matched the AVFoundation camera output format exactly to the model input (640x480 at 30/60fps).
• Used IOSurface-backed pixel buffers for everything (camera output, synthetic buffer, and resize buffer).
• Enabled outputBackings.
• Loaded the model once and reused it for all predictions.
• Configured MLModelConfiguration with reshapeFrequency = .frequent and specializationStrategy = .fastPrediction.
• Wrapped inference in ProcessInfo.processInfo.beginActivity(options: .latencyCritical, reason: "CoreML_Inference").
• Set DispatchQueue to qos: .userInteractive.
• Disabled the idle timer and enabled iOS Game Mode.
• Exported models using coremltools 9.0 (deployment target iOS 26) with ImageType inputs/outputs and INT8 quantization.

Reproduction
To completely rule out UI or rendering overhead, I wrote a standalone Swift CLI script that isolates the AVFoundation and CoreML pipeline. The script clearly demonstrates the ~15 ms latency on live camera frames versus the ~1 ms latency on synthetic buffers. (I have attached camera_coreml_benchmark.swift and the Core ML model, a very light low-light enhancement model, to this repo on GitHub: https://github.com/pzoltowski/apple-coreml-camera-latency-repro).

My Question
Is this massive overhead expected behavior for AVFoundation + Core ML on live feeds, or is this a framework/runtime bug? If expected, what is the Apple-recommended pattern to bypass this camera-only inference slowdown?

One thing I found interesting: when running in debug mode, the model was faster (not as fast as in the performance benchmark, but faster than 16 ms). Also, when I ran some dummy calculation on a different DispatchQueue, the model seemed to get slightly faster. So maybe this is related to ANE power-state issues (jitter/SoC wake), with the ANE going to sleep too fast and taking a long time to wake up? Doing dummy calculations in the background is probably not a solution, though. Thanks in advance for any insights!

Machine Learning & AI Core ML Performance AVFoundation

1.1k

Mar ’26

How can I change the output dimensions of a CoreML model in Xcode when the outputs come from a NonMaximumSuppression layer?

After exporting a custom model with nms=True, the outputs in Xcode show as:
• confidence: MultiArray (0 × 5)
• coordinates: MultiArray (0 × 4)

I want to set fixed shapes (e.g., 100 × 5, 100 × 4), but Xcode does not allow editing; the shape fields are locked. The model graph shows both outputs come directly from a NonMaximumSuppression layer. Is it possible to set fixed output dimensions for NMS outputs in CoreML?

Machine Learning & AI Core ML ML Compute Swift Xcode Core ML

572

Mar ’26

tensorflow-metal ReLU activation fails to clip negative values on M4 Apple Silicon

Environment:
• Hardware: Mac M4
• OS: macOS Sequoia 15.7.4
• TensorFlow-macOS Version: 2.16.2
• TensorFlow-metal Version: 1.2.0

Description: When using the tensorflow-metal plug-in for GPU acceleration on M4, the ReLU activation function (both as a layer and as an activation argument) fails to correctly clip negative values to zero. The same code works correctly when forced to run on the CPU.

Reproduction Script:

    import os
    import numpy as np
    import tensorflow as tf

    # weights and biases = -1
    weights = [np.ones((10, 5)) * -1, np.ones(5) * -1]
    # input = 1
    data = np.ones((1, 10))

    # comment this line => GPU => get negative values
    # uncomment this line => CPU => no negative values
    # tf.config.set_visible_devices([], 'GPU')

    # create model
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(5, activation='relu')
    ])
    # set weights
    model.layers[0].set_weights(weights)
    # get output
    output = model.predict(data)
    # check if negative is present
    print(f"min value: {output.min()}")
    print(f"is negative present? {np.any(output < 0)}")
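For reference, the value a correct run should produce can be worked out by hand. The sketch below recomputes the dense layer in plain Python with the same all-ones input and all-minus-one weights and biases. It does not use TensorFlow, and the helper name is made up for illustration.

```python
def dense_relu(inputs, weights, biases):
    """Compute relu(inputs @ weights + biases) for a single input row."""
    outputs = []
    for j, bias in enumerate(biases):
        # Dot product of the input row with column j of the weight matrix
        pre_activation = sum(x * row[j] for x, row in zip(inputs, weights)) + bias
        outputs.append(max(0.0, pre_activation))
    return outputs


inputs = [1.0] * 10                        # mirrors np.ones((1, 10))
weights = [[-1.0] * 5 for _ in range(10)]  # mirrors np.ones((10, 5)) * -1
biases = [-1.0] * 5                        # mirrors np.ones(5) * -1

output = dense_relu(inputs, weights, biases)
print(f"min value: {min(output)}")
print(f"is negative present? {any(v < 0 for v in output)}")
```

Each pre-activation is 10 × (1 × -1) + (-1) = -11, so a correct ReLU yields 0 for all five units, and a correct run reports a minimum of 0 with no negative values.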

Machine Learning & AI Core ML Foundation ML Compute tensorflow-metal

616

Mar ’26

Building a 4-agent autonomous coding pipeline on Apple Silicon — MLX backend questions

Hi, I'm building ANF (Autonomous Native Forge), a cloud-free, 4-agent autonomous software production pipeline running on local hardware with local LLM inference. No middleware, pure Node.js native. Currently running on NVIDIA Blackwell GB10 with vLLM + DeepSeek-R1-32B. Now porting to Apple Silicon.

Three technical questions:
1. How production-ready is mlx-lm's OpenAI-compatible API server for long context generation (32K tokens)?
2. What's the recommended approach for KV Cache management with Unified Memory architecture? Any specific flags or configurations for M4 Ultra?
3. MLX vs GGUF (llama.cpp) for a multi-agent pipeline where 4 agents call the inference endpoint concurrently: which handles parallel requests better on Apple Silicon?

GitHub: github.com/trgysvc/AutonomousNativeForge

Any guidance appreciated.

Machine Learning & AI Core ML Interface Builder Core ML Apple Silicon

508

Mar ’26

MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)

Summary
Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code runs correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend.

Environment
• Affected: Mac mini M1, Macmini9,1 (A14-class)
• Not affected: M2 and newer (A15-class and up)
• Last known good: macOS Sequoia
• First broken: macOS 26 (Tahoe)?
• Confirmed broken on: macOS 26.3.1
• Framework: Core ML + MPS backend
• Language: C++ (via CoreML C++ API)

Description
We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on the Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails, either producing garbage output or crashing. The same binary, same .mlpackage, and same inputs work correctly on M2+.

Our Apple contact has suggested the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware.

Steps to Reproduce
1. Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model).
2. Run inference with MLComputeUnits::cpuAndGPU (MPS active).
3. Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1.
4. Compare output to the same model running on M2 / macOS Sequoia.

Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior.
Actual: Incorrect / corrupted output (or crash), only on A14-class hardware running macOS 26+.

Workaround
Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming the issue is in the MPS compute path. This is not acceptable as a shipping workaround due to performance impact.

Additional Notes
• The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug.
• We first became aware of this through a customer report.
• Happy to provide a symbolicated crash log if helpful.

This text was summarized by AI and human verified.

Machine Learning & AI Core ML Metal Performance Shaders

397

Apr ’26

How does ARKit achieve low-latency and stable head tracking using only an RGB camera?

Hi, I’m working on a real-time head/face tracking pipeline using a standard 2D RGB camera, and I’m trying to better understand how ARKit achieves such stable and responsive results in comparable conditions.

To clarify upfront: I’m specifically interested in RGB-only tracking and the underlying vision/ML pipeline. I’m not using TrueDepth or any depth/IR-based sensors, and I’d like to understand how similar stability and responsiveness can be achieved under those constraints.

In my current setup, I estimate head pose from RGB frames (facial landmarks + PnP) and apply temporal filtering (e.g., exponential smoothing and Kalman filtering). This significantly reduces jitter, but introduces noticeable latency, especially during faster head movements.

What stands out in ARKit is that it appears to maintain both of the following, even when operating with camera input alone:
• Very low jitter
• Very low perceived latency

I’m trying to understand what techniques might contribute to this behavior. In particular:
• Does ARKit use predictive tracking (e.g., velocity or acceleration-based pose extrapolation) to compensate for camera and processing delays in RGB-only scenarios?
• Are there recommended strategies for balancing temporal smoothing and responsiveness without introducing visible lag in camera-based pose estimation pipelines?
• Is the tracking pipeline internally decoupled from rendering (e.g., asynchronous processing with prediction applied at render time)?
• Are there general best practices for minimizing end-to-end latency in vision-based head tracking systems beyond standard filtering approaches?

I understand that implementation details may not be public, but any high-level insights or pointers would be greatly appreciated. Thanks!
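The smoothing-versus-latency trade-off described above can be illustrated on a single pose angle. The sketch below is a generic illustration and not a description of how ARKit works: it compares plain exponential smoothing with a simple alpha-beta filter that tracks both the angle and its velocity and can extrapolate by a fixed lead time. All gains, the frame rate, the test signal, and the function names are assumptions chosen for illustration.

```python
def exponential_smoothing(samples, alpha):
    """Exponential moving average: reduces jitter but lags a moving target."""
    estimate = samples[0]
    out = []
    for z in samples:
        estimate += alpha * (z - estimate)
        out.append(estimate)
    return out


def alpha_beta_predict(samples, dt, alpha, beta, lead_time):
    """Alpha-beta filter tracking value and velocity, extrapolated by lead_time."""
    value, velocity = samples[0], 0.0
    out = []
    for z in samples:
        predicted = value + velocity * dt         # predict one frame ahead
        residual = z - predicted                  # measurement innovation
        value = predicted + alpha * residual      # correct the value
        velocity += beta * residual / dt          # correct the velocity
        out.append(value + velocity * lead_time)  # extrapolate for display
    return out


dt = 1 / 30                                # assumed 30 fps camera
ramp = [60.0 * i * dt for i in range(60)]  # noise-free turn at 60 deg/s

ema = exponential_smoothing(ramp, alpha=0.3)
ab = alpha_beta_predict(ramp, dt, alpha=0.5, beta=0.1, lead_time=0.0)

ema_lag = ramp[-1] - ema[-1]
ab_lag = ramp[-1] - ab[-1]
print(f"EMA lag after 2 s: {ema_lag:.2f} deg")
print(f"alpha-beta lag after 2 s: {abs(ab_lag):.2f} deg")
```

On this noise-free ramp the exponential filter settles about 4.7 degrees behind the true angle, while the velocity-tracking filter converges onto the ramp. A positive lead_time would extrapolate further to offset pipeline delay. With real, noisy landmarks the velocity estimate amplifies noise, so the gains and the lead time would need tuning.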

Machine Learning & AI Core ML ARKit

280

Mar ’26

CoreML MLE5ProgramLibrary AOT recompilation hangs/crashes on iOS 26.4 — C++ exception in espresso IR compiler bypasses Swift error handling

Area: CoreML / Machine Learning

Describe the issue:
On iOS 26.4, calling MLModel(contentsOf:configuration:) to load an .mlpackage model hangs indefinitely and eventually kills the app via watchdog. The same model loads and runs inference successfully in under 1 second on iOS 26.3.1.

The hang occurs inside eort_eo_compiler_compile_from_ir_program (espresso) during on-device AOT recompilation triggered by MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:. A C++ exception (__cxa_throw) is thrown inside libBNNS.dylib during the exception unwind, which then hangs inside __cxxabiv1::dyn_cast_slow and __class_type_info::search_below_dst. Swift's try/catch does not catch this: the exception originates in C++ and the process hangs rather than terminating cleanly.

Setting config.computeUnits = .cpuOnly does not resolve the issue. MLE5ProgramLibrary initialises as shared infrastructure regardless of compute units.

Steps to reproduce:
1. Create an app with an .mlpackage CoreML model using the MLE5/espresso backend.
2. Call MLModel(contentsOf: modelURL, configuration: config) at runtime.
3. Run on a device on iOS 26.3.1: loads successfully in <1 second.
4. Update device to iOS 26.4: hangs indefinitely, app killed by watchdog after 60–745 seconds.

Expected behaviour: Model loads successfully, or throws a catchable Swift error on failure.

Actual behaviour: Process hangs in MLE5ProgramLibrary.lazyInitQueue. App killed by watchdog. No Swift error thrown.

Full stack trace at point of hang:

    Thread 1 Queue: com.apple.coreml.MLE5ProgramLibrary.lazyInitQueue (serial)
    frame 0: __cxxabiv1::__class_type_info::search_below_dst libc++abi.dylib
    frame 1: __cxxabiv1::(anonymous namespace)::dyn_cast_slow libc++abi.dylib
    frame 2: ___lldb_unnamed_symbol_23ab44dd4 libBNNS.dylib
    frame 23: eort_eo_compiler_compile_from_ir_program espresso
    frame 24: -[MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:] CoreML
    frame 25: -[MLE5ProgramLibrary _programLibraryHandleWithForceRespecialization:error:] CoreML
    frame 26: __44-[MLE5ProgramLibrary prepareAndReturnError:]_block_invoke CoreML
    frame 27: _dispatch_client_callout libdispatch.dylib
    frame 28: _dispatch_lane_barrier_sync_invoke_and_complete libdispatch.dylib
    frame 29: -[MLE5ProgramLibrary prepareAndReturnError:] CoreML
    frame 30: -[MLE5Engine initWithContainer:configuration:error:] CoreML
    frame 31: +[MLE5Engine loadModelFromCompiledArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 32: +[MLLoader _loadModelWithClass:fromArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 45: +[MLModel modelWithContentsOfURL:configuration:error:] CoreML
    frame 46: @nonobjc MLModel.__allocating_init(contentsOf:configuration:) GKPersonalV2
    frame 47: MDNA_GaitEncoder_v1_3.__allocating_init(contentsOf:configuration:)
    frame 48: MDNA_GaitEncoder_v1_3.__allocating_init(configuration:)
    frame 50: GaitModelInference.loadModel()
    frame 51: GaitModelInference.init()

iOS version: Reproduced on iOS 26.4. Works correctly on iOS 26.3.1.
Xcode version: 26.2
Device: iPhone (model used in testing)
Model format: .mlpackage

Machine Learning & AI Core ML ML Compute

827

Apr ’26

Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all, I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. Sharing it here in case anyone else hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why
As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround, reusing the Gemma 3 text implementation with a patched config, fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text.

What's in the package
• A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
• Per-Layer Embedding (PLE) support: the shared embedding table that feeds every decoder layer through a gated MLP as a third residual
• KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset
• A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize
• A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte

Measured on iPhone (A-series, 7.4 GB RAM)
• Model: mlx-community/gemma-4-e2b-it-4bit
• Warm load: ~6 s
• Memory after load: 341–392 MB
• Time to first token (end-to-end, 333-token system prompt): 2.82 s
• Generation throughput: 12–14 tok/s

What I'd love feedback on
• Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed?
• The chat-template bypass works but feels like a workaround. Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely?
• Anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging a 0.2.0.

Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.

Machine Learning & AI Core ML

353

Apr ’26

Does using Vision API offline to label a custom dataset for Core ML training violate DPLA?

Hello everyone, I am currently developing a smart camera app for iOS that recommends optimal zoom and exposure values on-device using a custom Core ML model. I am still waiting for an official response from Apple Support, but I wanted to ask the community if anyone has experience with a similar workflow regarding App Review and the DPLA.

Here is my training methodology:
1. I gathered my own proprietary dataset of original landscape photos.
2. I generated multiple variants of these photos with different zoom and exposure settings offline on my Mac.
3. I used the CalculateImageAestheticsScoresRequest (Vision framework) via a local macOS command-line tool to evaluate and score each variant.
4. Based on those scores, I labeled the "best" zoom and exposure parameters for each original photo.
5. I used this labeled dataset to train my own independent neural network using PyTorch, and then converted it to a Core ML model to ship inside my app.

Since the app uses my own custom model on-device and does not send any user data to a server, the privacy aspect is clear. However, I am curious if using the output of Apple's Vision API strictly offline to label my own dataset could be interpreted as "reverse engineering" or a violation of the Developer Program License Agreement (DPLA).

Has anyone successfully shipped an app using a similar knowledge distillation or automated dataset labeling approach with Apple's APIs? Did you face any pushback during App Review? Any insights or shared experiences would be greatly appreciated!
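The labeling step described above (picking the best-scoring variant of each photo) amounts to an argmax per photo. Here is a minimal sketch in plain Python with made-up scores and field names. The real scores would come from the Vision request, which is not called here, and the data layout is an assumption.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    photo_id: str
    zoom: float
    exposure: float
    score: float  # hypothetical aesthetics score for this variant


def best_labels(variants):
    """Return {photo_id: (zoom, exposure)} for the top-scoring variant of each photo."""
    best = {}
    for v in variants:
        current = best.get(v.photo_id)
        if current is None or v.score > current.score:
            best[v.photo_id] = v
    return {pid: (v.zoom, v.exposure) for pid, v in best.items()}


# Made-up example data: two photos with a few variants each
variants = [
    Variant("img_001", 1.0, 0.0, 0.41),
    Variant("img_001", 2.0, 0.5, 0.63),
    Variant("img_001", 3.0, -0.5, 0.38),
    Variant("img_002", 1.0, 0.0, 0.72),
    Variant("img_002", 2.0, 0.5, 0.55),
]

labels = best_labels(variants)
print(labels)
```

The resulting mapping from photo to (zoom, exposure) is the kind of target a separately trained network would regress toward. Ties keep the first variant seen, which is an arbitrary choice in this sketch.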

Machine Learning & AI Core ML App Review Vision Machine Learning Core ML

437

Apr ’26

CoreML model cache causes fake hard drive memory usage

Hi, I experiment by creating and compiling a lot of CoreML models and I have the issue that this causes a lot of disk usage, but when I try to delete everything (I search in the disk for possible CoreML cache directories) the disk space is not actually freed up. This is a picture of my disk usage according to what is shown inside of Settings>General>Storage and the Disk Utility app. I am running on macOS 15.7.5

Machine Learning & AI Core ML

1.5k

Does loading multiple functions that share model weights multiply memory use?

Hi, I have a multifunction model where the functions share the same model weights, and for latency reasons I keep multiple functions loaded at the same time. According to what Codex found, this multiplies RAM usage: if the single model weighs 2 GB, loading two functions that share the underlying weights still doubles RAM usage to 4 GB (it seems to be something like neural wired memory). Does anyone have any knowledge about this?

Machine Learning & AI Core ML

1.1k

When will MPS support fp8 dtypes?

https://github.com/pytorch/pytorch/issues/132624 This issue about unsupported fp8 dtypes has existed for 2 years. Does MLX have any plans to address it?

Machine Learning & AI Core ML ML Compute

389

_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU

When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output, with errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights).

The three-op combination of MatMul + BiasAdd + Relu triggers the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct: tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit.

Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values.

A minimal reproducer:

    import tensorflow as tf
    import numpy as np

    # Any shape works; larger K makes the error more obvious
    M, K, N = 64, 2048, 1024
    W = tf.Variable(tf.random.normal([K, N]))
    b = tf.Variable(tf.random.normal([N]))
    x = tf.random.normal([M, K])

    @tf.function
    def graph_fused(x):
        return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))

    @tf.function
    def graph_safe(x):
        return tf.nn.relu(tf.matmul(x, W) + b)  # AddV2 instead of BiasAdd

    eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))  # eager = correct
    fused_out = graph_fused(x)  # Metal graph mode = WRONG
    safe_out = graph_safe(x)    # Metal graph mode = correct

    print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
    # ^ typically 30–80+ (WRONG)
    print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
    # ^ typically ~1e-5 (correct)

Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac.

Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.

Machine Learning & AI Core ML Metal tensorflow-metal

116

22h
