TensorFlow model predictions are incorrect on M1 GPU

I have a TensorFlow 2.x object detection model (SSD ResNet50 v1) that was trained on an Ubuntu 20.04 box with a GPU.

The model's predictions perform as expected on Linux (CPU & GPU), Windows 10 (CPU & GPU), the Intel MacBook Air (CPU), and the M1 MacBook Air (CPU).

However, when I install the tensorflow-metal plugin on the M1, I can see the GPU is being used, but the predictions are garbage.
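For reference, this is how I'm confirming the GPU is actually in use; it's the stock TensorFlow device-placement log, nothing Metal-specific:

```python
import tensorflow as tf

# Print the device every op is placed on; with the Metal plugin active,
# ops should be reported on .../device:GPU:0.
tf.debugging.set_log_device_placement(True)
```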

I followed these install instructions:

https://developer.apple.com/metal/tensorflow-plugin/

Which gives me:

  • tensorflow-macos 2.6.0
  • tensorflow-metal 0.2.0

and

  • Python 3.9.5
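For completeness, the versions above are what I see from inside Python:

```python
import tensorflow as tf

print(tf.__version__)                     # 2.6.0 (tensorflow-macos)
print(tf.config.list_physical_devices())  # the Metal GPU appears as a GPU PhysicalDevice
```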

Anyone have insight as to what may be the problem? The M1 Air is running the public release of Monterey.

  • UPDATE: It may be something specific to the SSD ResNet50 v1 architecture. I have several other models built with the same pipeline and data which do seem to be working.

  • Hi AdkPete, I have the same problem here. I compared my results with the same model on Windows 10, an Intel MacBook, and Linux CPU. The predictions are very bad once I install the tensorflow-metal plugin. For now I have created another environment without the plugin and train the model on CPU only. Do you have any idea how to deal with this? Many thanks.

  • Mona190: I don't have a fix, and the other models that I thought were working are not actually working either. I looked at the outputs from the models: scores are numbers like 90000 when they should be between 0 and 1, and a prediction can also produce wacky, non-existent class values. Something is very wrong with the plugin. A minimal way to check this (and to fall back to CPU without a second environment) is sketched just below.
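This is a sketch rather than my exact code; it assumes a TF Object Detection API export in a placeholder directory `saved_model_dir` and the API's standard `detection_scores` output key:

```python
import tensorflow as tf

# CPU fallback: hide the Metal GPU before any model is loaded, so TF
# runs everything on CPU. Comment this line out to test the GPU path.
tf.config.set_visible_devices([], "GPU")

# "saved_model_dir" is a placeholder for your exported detection model.
detect_fn = tf.saved_model.load("saved_model_dir")

# Dummy uint8 batch; SSD ResNet50 v1 FPN exports typically expect 640x640.
image = tf.zeros([1, 640, 640, 3], dtype=tf.uint8)
detections = detect_fn(image)

scores = detections["detection_scores"].numpy()
print(scores.min(), scores.max())  # should lie in [0, 1]; values like 90000 reproduce the bug
```

Run it once as-is (CPU baseline), then once with the `set_visible_devices` line commented out (GPU), and compare the score ranges.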

Replies



We are developing a simple GAN, and when training it, the convergence behavior of the discriminator differs on the GPU from what we see on CPU only, or even when executing in Colab. We've read a lot, but this is the only post that seems to describe similar behavior. Unfortunately, the problem persists after updating to version 0.4.

My hardware/software:

  • MacBook Pro, model MacBookPro18,2
  • Chip: Apple M1 Max, 10 cores (8 performance and 2 efficiency)
  • Memory: 64 GB
  • Firmware: 7459.101.3
  • OS: Monterey 12.3.1 (OS version 7459.101.3)
  • Python 3.8

The most relevant libraries from pip freeze:

  • keras==2.8.0
  • Keras-Preprocessing==1.1.2
  • …
  • tensorboard==2.8.0
  • tensorboard-data-server==0.6.1
  • tensorboard-plugin-wit==1.8.1
  • tensorflow-datasets==4.5.2
  • tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b
  • tensorflow-macos==2.8.0
  • tensorflow-metadata==1.7.0
  • tensorflow-metal==0.4.0

Code to reproduce: it does not fit in this message, so I've shared a Google Colab notebook at https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing. You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but if the GPU is disabled everything is OK.
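To make the comparison concrete without the full notebook, here is a minimal sketch (a toy discriminator of my own, not the notebook's code, and assuming the Metal GPU is visible) that runs one identical training step on CPU and on GPU; on a correct backend the two printed losses should agree to within float tolerance:

```python
import numpy as np
import tensorflow as tf

def make_discriminator():
    # Re-seed before building so both devices start from identical weights.
    tf.random.set_seed(0)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(1),  # logit: real vs. fake
    ])

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16)).astype("float32")          # toy "samples"
y = rng.integers(0, 2, size=(32, 1)).astype("float32")   # real/fake labels
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

for device in ("/CPU:0", "/GPU:0"):
    with tf.device(device):
        model = make_discriminator()
        optimizer = tf.keras.optimizers.Adam(1e-3)
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(device, "first-batch loss:", float(loss))
```

A large gap between the two values reproduces the divergence described above in a few lines instead of a full training run.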