tensorflow-metal

Using TensorFlow on Apple Silicon gives inaccurate results compared to a Google Colab GPU (9-15% differences). Here are the installed versions for my 4 Anaconda envs. I understand that floating-point precision, batch size, and activation functions can all be factors, but how do you rectify this issue, which has persisted for the past 3 years?

1.) TF: 2.12.0, Python: 3.10.13, tensorflow-deps: 2.9.0, tensorflow-metal: 1.2.0, h5py: 3.6.0, keras: 2.12.0

2.) TF: 2.19.0, Python: 3.11.0, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, jax: 0.6.0, jax-metal: 0.1.1, jaxlib: 0.6.0, ml_dtypes: 0.5.1

3.) TF: 2.19.0, Python: 3.10.13, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, ml_dtypes: 0.5.1

4.) TF: 2.16.2, Python: 3.10.16, tensorflow-deps: 2.9.0, tensorflow-macos: 2.16.2, tensorflow-metal: 1.2.0, h5py: 3.13.0, keras: 3.9.2, ml_dtypes: 0.3.2

Install steps for each env, with a common example:

Create env: conda create --name TF_Env_V2 --no-default-packages

Start env: conda activate TF_Env_V2

ENV_1.) conda install -c apple tensorflow-deps, conda install tensorflow, pip install tensorflow-metal, conda install ipykernel

ENV_2.) conda install pip python==3.11, pip install tensorflow, pip install tensorflow-metal, conda install ipykernel

ENV_3.) conda install pip python==3.10.13, pip install tensorflow, pip install tensorflow-metal, conda install ipykernel

ENV_4.) conda install -c apple tensorflow-deps, pip install tensorflow-macos, pip install tensorflow-metal, conda install ipykernel

Example used on all 4 env:

import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

Answered by DTS Engineer in 838231022

I had previously suggested filing a bug report about this matter, but after reading the posts on this thread I now think it would be better for you to take these questions to the maintainers of the TensorFlow libraries that you are using, and possibly to the support forums for those products and libraries.

Any update on this? Everyone on our team has an M2-M4 MacBook Pro, and they do not trust the machines because simple matrix computations give wrong results.

This is a simple test we did:

from rich import print

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "4"

import numpy as np
import tensorflow as tf

RT = tf.constant(
    [
        [-0.25497323, -0.81989247, -0.5126062, 0.3136883],
        [-0.32365915, 0.57191426, -0.75376326, 0.36354592],
        [0.9111716, -0.02627973, -0.41118845, 0.511739],
        [0.0, 0.0, 0.0, 1.0],
    ]
)

def invert(RT):
    """
    I found this bug while applying an inverse transform to a vector
    and the inverted transform was wrong
    """

    # test data from my dataset
    R = RT[..., :3, :3]
    T = RT[..., :3, 3]
    R_inv = tf.einsum("...ij->...ji", R)
    T_inv = -tf.einsum("...ij,...j->...i", R_inv, T)  # (..., 3)

    return T_inv

print("Numpy Sanity check")
print(f"np inv:\n{np.linalg.inv(RT)[:3, 3]}")
print(f"np inv Float32:\n{np.linalg.inv(tf.cast(RT, tf.float32))[:3, 3]}")
print(f"np inv Float16:\n{np.linalg.inv(tf.cast(RT, tf.float16))}")

with tf.device("/GPU:0"):
    res = invert(RT)
    print(f"\nTF on {res.device}")
    print(f"tf inv:\n{res}")
    print(f"tf inv Float32:\n{invert(tf.cast(RT, tf.float32))}")
    print(f"tf inv Float16:\n{invert(tf.cast(RT, tf.float16))}")

with tf.device("/CPU:0"):
    res = invert(RT)
    print(f"\nTF on {res.device}")
    print(f"tf inv:\n{res}")
    print(f"tf inv Float32:\n{invert(tf.cast(RT, tf.float32))}")
    print(f"tf inv Float16:\n{invert(tf.cast(RT, tf.float16))}")

Sorry for the mangled example above; I was on my iPhone when posting. I have nailed down the culprit: it is T = RT[..., :3, 3], which works as expected on CPU but not on GPU. Using T = tf.reshape(tf.slice(RT, [0, 3], [3, 1]), [3]) produces consistent results on CPU and GPU.
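For anyone who wants a known-good answer to compare device output against, here is a minimal NumPy-only sketch of the same computation as invert() above. The helper name invert_translation is hypothetical (not from the post), and it assumes RT is a rigid transform whose 3x3 block is orthonormal, so the translation part of the inverse is -R^T T.

```python
import numpy as np

RT = np.array(
    [
        [-0.25497323, -0.81989247, -0.5126062, 0.3136883],
        [-0.32365915, 0.57191426, -0.75376326, 0.36354592],
        [0.9111716, -0.02627973, -0.41118845, 0.511739],
        [0.0, 0.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)


def invert_translation(rt):
    """Translation part of the inverse of a rigid transform (hypothetical helper)."""
    r = rt[..., :3, :3]
    t = rt[..., :3, 3]
    # Assumes r is orthonormal, so its transpose is its inverse.
    r_inv = np.swapaxes(r, -1, -2)
    return -np.einsum("...ij,...j->...i", r_inv, t)


# Reference values computed entirely on the CPU with NumPy.
t_ref = RT[..., :3, 3]
t_inv_ref = invert_translation(RT)

print(f"reference T:     {t_ref}")
print(f"reference T_inv: {t_inv_ref}")

# Cross-check against a full matrix inverse, with a tolerance because the
# rotation block is only orthonormal up to float32 rounding.
print(np.allclose(t_inv_ref, np.linalg.inv(RT)[:3, 3], atol=1e-4))
```

Any TensorFlow device whose T or T_inv differs from these reference values by more than rounding error is computing the slice or the einsum incorrectly.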

I did a few more tests: RT[:3, 3] works correctly, while RT[..., :3, 3] does not. Check this minimal repro:

import tensorflow as tf

RT = tf.constant(
    [
        [-0.25497323, -0.81989247, -0.5126062, 0.3136883],
        [-0.32365915, 0.57191426, -0.75376326, 0.36354592],
        [0.9111716, -0.02627973, -0.41118845, 0.511739],
        [0.0, 0.0, 0.0, 1.0],
    ]
)

with tf.device("/GPU:0"):
    T1 = RT[..., :3,3]
    T2 = RT[:3,3]

print(f"RT[..., :3,3]: {T1}")
print(f"RT[:3,3]: {T2}")

output:

2025-10-04 15:16:12.602764: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M4 Max
2025-10-04 15:16:12.602789: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 128.00 GB
2025-10-04 15:16:12.602794: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 53.76 GB
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759583772.602803 5120482 pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
I0000 00:00:1759583772.602816 5120482 pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
RT[..., :3,3]: [0.3136883 0.3136883 0.3136883]
RT[:3,3]: [0.3136883  0.36354592 0.511739  ]

Versions: TensorFlow 2.19, Python 3.12, tensorflow-metal 1.2
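The repro above checks one device by hand. As a rough sketch, the same check can be looped over every logical device TensorFlow reports, using NumPy as ground truth. The names RT_NP, EXPECTED, and check_ellipsis_slice are hypothetical, and the TensorFlow import is guarded only so the NumPy reference is still defined when TensorFlow is not installed.

```python
import numpy as np

try:
    import tensorflow as tf
except ImportError:  # without TensorFlow, only the NumPy reference is available
    tf = None

RT_NP = np.array(
    [
        [-0.25497323, -0.81989247, -0.5126062, 0.3136883],
        [-0.32365915, 0.57191426, -0.75376326, 0.36354592],
        [0.9111716, -0.02627973, -0.41118845, 0.511739],
        [0.0, 0.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)

# NumPy slice of the last column, used as the expected value.
EXPECTED = RT_NP[:3, 3]


def check_ellipsis_slice():
    """Return {device_name: bool}: does RT[..., :3, 3] match NumPy on that device?"""
    results = {}
    if tf is None:
        return results
    for dev in tf.config.list_logical_devices():
        with tf.device(dev.name):
            rt = tf.constant(RT_NP)
            sliced = rt[..., :3, 3]
        results[dev.name] = bool(np.allclose(sliced.numpy(), EXPECTED))
    return results


for name, ok in check_ellipsis_slice().items():
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Based on the output reported above, the GPU entry would presumably show MISMATCH with tensorflow-metal 1.2, while the CPU entry should match.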
