M1 Max GPU fails to converge in more complex models

We ran into an issue where a more complex model fails to converge on the M1 Max GPU, while it converges on its CPU and on non-M1 machines.

Performance is the same for CPU and GPU with a single-RNN model, but once we use two RNNs, the GPU fails to converge.

That said, the example below is based on nonsensical data for the model architecture used, but it exhibits the same behavior we observe in our production models (which, for obvious reasons, we cannot share here). Mainly:

  • the loss goes down to the bottom of the 1e-06 range in all cases except when we use two RNNs on the GPU; during training we often reach the 1e-07 level

  • for the double-RNN-on-GPU condition, the loss does not go that low, sometimes only reaching the 1e-05 level.

  • on our production data, double RNN with GPU results in a loss of 1.0 that basically stays the same from the first epoch, while the other conditions often reach the 0.2 level with a clear learning curve.

  • in the production model, increasing LSTM_Cells made the divergence more visible (with this synthetic data it does not happen)

  • the more complex the model is (after the RNN layers), the more visible the issue.

Suspected issues:

  • different precision used in CPU and GPU training; we had to decrease the data values a lot to make the effect visible (if you work with raw data, all approaches seem to produce comparable results). A quick probe for this is sketched after this list.

  • the vanishing-gradient problem is somehow more pronounced on the GPU, as indicated by worse performance as model complexity increases.
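
The probe mentioned above is a minimal sketch, not our production code: it assumes the model built in the sample script below and a made-up input batch, and compares the same forward pass on both devices:

import numpy as np
import tensorflow as tf

# Made-up batch with the same shape and the same 0.001 scaling as the sample script.
x = np.random.rand(8, 2, 2).astype(np.float32) * 0.001

with tf.device('/CPU:0'):
    y_cpu = model(x, training=False).numpy()
with tf.device('/GPU:0'):
    y_gpu = model(x, training=False).numpy()

# If this difference grows as the input values shrink, a precision/kernel issue is likely.
print("max abs diff:", np.abs(y_cpu - y_gpu).max())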

Please let me know if you need any further details.

Software stack: macOS 12.1, TensorFlow 2.7, tensorflow-metal 0.3; also tested on TF 2.8.

Sample Script:


TEST CONDITIONS:

#conditions with issue: 1,2
gpu = 1         # 0 CPU, 1 GPU
model_size = 2  # 1 single RNN, 2 double RNN

#PARAMETERS
LSTM_Cells = 64
epochs = 300
batch = 128

import numpy as np
import pandas as pd
import sys
from sklearn import preprocessing

#""" if 'tensorflow' in sys.modules: print("tensorflow uploaded") del sys.modules["tensorflow"] #del tf import tensorflow as tf

else: print("tensorflow not uploaded") import tensorflow as tf

if gpu == 1:
    pass
else:
    tf.config.set_visible_devices([], 'GPU')

#print("GPUs:", tf.config.list_physical_devices('GPU')) print("GPUs:", tf.config.list_logical_devices('GPU')) #print("CPUs:", tf.config.list_physical_devices('CPU')) print("CPUs:", tf.config.list_logical_devices('CPU')) #"""

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']

dataset = pd.read_csv(url, names=column_names, na_values='?', comment='\t', sep=' ', skipinitialspace=True).dropna()

scaler = preprocessing.StandardScaler().fit(dataset)

X_scaled = scaler.transform(dataset)

X_scaled = X_scaled * 0.001

# Large Values
#x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1,2,2)
#y_train = np.array(dataset[['MPG','Displacement']]).reshape(-1,2,2)

# Small Values
x_train = np.array(X_scaled[:,2:]).reshape(-1,2,2)
y_train = np.array(X_scaled[:,:2]).reshape(-1,2,2)

#print(dataset)
print(x_train.shape)
print(y_train.shape)
#print(weight.shape)  # 'weight' is not defined in this sample

train_data = (tf.data.Dataset.from_tensor_slices((x_train[:,:,:8], y_train))
              .cache().shuffle(x_train.shape[0]).batch(batch)
              .repeat().prefetch(tf.data.experimental.AUTOTUNE))

if model_size == 2:
    # MINIMAL NOT WORKING: double RNN
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    encoder_l2 = tf.keras.layers.LSTM(LSTM_Cells, return_state=True)
    encoder_l2_outputs = encoder_l2(encoder_l1_outputs[0])
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l2_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    # dense_3/dense_4 are built but not wired to the output; they were used to vary model complexity
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)
    flat = tf.keras.layers.Flatten()(dense_2)
    dense_5 = tf.keras.layers.Dense(2*2)(flat)  # 4 outputs to match the [2,2] reshape
    reshape_output = tf.keras.layers.Reshape([2,2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)
else:
    # WORKING: single RNN
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l1_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)
    flat = tf.keras.layers.Flatten()(dense_2)
    dense_5 = tf.keras.layers.Dense(2*2)(flat)  # 4 outputs to match the [2,2] reshape
    reshape_output = tf.keras.layers.Reshape([2,2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)

print(model.summary())

loss_tf = tf.keras.losses.MeanSquaredError()

model.compile(optimizer='adam', loss=loss_tf, run_eagerly=True)

model.fit(train_data, epochs = epochs, steps_per_epoch = 3)

Replies

Posting this at https://feedbackassistant.apple.com in the appropriate category is a direct path to Apple engineers, where it can be prioritized and tracked for investigation.

  • Thanks for the suggestion.


Hi @sebtac

Thanks for reporting the issue and posting the sample script for studying it! We will try to reproduce it and see if we can find out where the issue might be. I will update this thread when we have some results regarding this.

In the meantime, just to double-check: did you also test this using tensorflow-metal 0.4.0 (with the TF base 2.8.0), which was released last week? In case you still have the test setup ready, could you try upgrading the metal plugin and see if that makes a difference in this test? The LSTM/GRU layers got a fair bit of reworking there, mainly aimed at improving speed, but it is possible that it could affect these findings as well.
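
For reference, assuming a standard pip-based install as described in the tensorflow-plugin instructions, upgrading just the plugin is a one-liner that leaves the base TensorFlow untouched:

python -m pip install --upgrade tensorflow-metal==0.4.0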

  • Thanks for the swift response. I will check out the new version today. (I only tested with TF 2.7/2.8 and tensorflow-metal 0.3.)


Updating to tensorflow-metal 0.4.0 fixes the issue (even without updating to TF 2.8)! Thanks!

Our company bought 12 MacBook Pro M1 Max machines to start a new AI business branch. While training the people of the new department, we are experiencing some weird effects that I assume are related to the issue in this post.

We are developing a simple GAN, and when training it, the convergence behavior of the discriminator is different when we use the GPU than when using only the CPU, or even when executing in Colab.

We've read a lot, but this is the only post that seems to describe similar behavior.

Unfortunately, after updating to version 0.4 the problem persists.

My hardware/software: MacBook Pro. Model: MacBookPro18,2. Chip: Apple M1 Max. Cores: 10 (8 performance and 2 efficiency). Memory: 64 GB. System Firmware Version: 7459.101.3. OS: Monterey 12.3.1. OS Loader Version: 7459.101.3.

Python version 3.8; the most relevant libraries from pip freeze:

keras==2.8.0
Keras-Preprocessing==1.1.2
....
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-datasets==4.5.2
tensorflow-docs @ git+https://github.com/tensorflow/docs@7d5ea2e986a4eae7573be3face00b3cccd4b8b8b
tensorflow-macos==2.8.0
tensorflow-metadata==1.7.0
tensorflow-metal==0.4.0

CODE TO REPRODUCE: the code does not fit within this message's length limit, so I've shared a Google Colab notebook at:

https://colab.research.google.com/drive/1oDS8EV0eP6kToUYJuxHf5WCZlRL0Ypgn?usp=sharing

You can easily see that the loss goes to 0 after 1 or 2 epochs when the GPU is enabled, but if the GPU is disabled everything is OK.

I have found the same behaviour many times: starting from the same model initialization weights, I cannot get the same results with tf.device("gpu"): as with tf.device("cpu"):. There is also a huge performance difference, always in favor of the CPU version. I have TF 2.10, tensorflow-deps 2.10, and tensorflow-metal 0.6.
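
One way to pin this down (a minimal sketch with made-up names: build_model, x, and y stand in for the code I cannot share) is to snapshot the initial weights once and train from that exact snapshot on each device:

import tensorflow as tf

tf.keras.utils.set_random_seed(42)
model = build_model()             # placeholder for the real model definition
model.save_weights('init.h5')

for device in ('/CPU:0', '/GPU:0'):
    model.load_weights('init.h5')  # identical starting weights on both devices
    with tf.device(device):
        hist = model.fit(x, y, epochs=5, shuffle=False, verbose=0)
    print(device, hist.history['loss'])

If the loss curves differ beyond float32 noise despite identical weights and data order, the divergence comes from the device kernels rather than from the model.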

unfortunately I cannot share the code.

I'm encountering a similar issue on a MacBook M1 Pro 14". I'd appreciate any guidance to resolve or work around it. [Description] I ran the official TensorFlow single-RNN example (source code at https://www.tensorflow.org/text/tutorials/text_classification_rnn) on my MacBook M1 Pro 14", and the model fails to converge. However, when I copied the exact same code to Google Colab, the model converged well, which proves the RNN example code itself has no problem.

[Hardware] Model Identifier: MacBookPro18,3 Chip: Apple M1 Pro Total Number of Cores: 8 (6 performance and 2 efficiency) Memory: 16 GB System Firmware Version: 8419.60.44 OS Loader Version: 8419.60.44

[Software] System Version: macOS 13.1 (22C65) Kernel Version: Darwin 22.2.0

[Runtime Environment] Anaconda: 23.3.1 Python: 3.9.17 tensorflow-metal: 1.0.1

A similar problem here with tensorflow.

Training of my CNN model (NVIDIA PilotNet) works fine in a standard Python runtime environment, but when using a virtual environment with tensorflow-metal, it fails to converge (the loss diverges).

[hardware] MacBook Air M1 (2020) [os] 13.5.2 (22G91) [runtime] virtual env, Python 3.11.5 + tensorflow-metal (as described in https://developer.apple.com/metal/tensorflow-plugin/)
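
One way to isolate the plugin without rebuilding the environment (mirroring the gpu flag in the original script above) is to hide the GPU inside the tensorflow-metal env and rerun the training:

import tensorflow as tf

# Must run before any op touches the GPU; TensorFlow then falls back to the CPU.
tf.config.set_visible_devices([], 'GPU')
print("visible GPUs:", tf.config.get_visible_devices('GPU'))  # expect []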

Any solution?

Same problem with torch on a MacBook M2. Using the CPU it converges well, and it also converges well in Colab, but if I set the device to MPS it fails to converge. Link to the code: https://colab.research.google.com/drive/1xG_R3RpmTVLCTCTeGTG8yo-e7iwIFYbt?usp=sharing. Torch version 2.1.2.

  • Applying contiguous after the reshape solved the issue.

    % python -c "import torch;import torch.nn.functional as f;x=torch.arange(1000,dtype=torch.float).reshape(10,10,10).permute(2,0,1);y=x.to('mps');print((f.gelu(x)-f.gelu(y).cpu()).abs().max().item())"
    999.0
    % python -c "import torch;import torch.nn.functional as f;x=torch.arange(1000,dtype=torch.float).reshape(10,10,10).permute(2,0,1);y=x.to('mps');print((f.gelu(x)-f.gelu(y.contiguous()).cpu()).abs().max().item())"
    0.0
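  • In model code, the same workaround looks like this (a minimal sketch, not the notebook's actual module; the class name is made up): force a contiguous layout right after the permute/reshape, before kernels that misbehave on MPS:

    import torch
    import torch.nn.functional as F

    class PatchedBlock(torch.nn.Module):
        def forward(self, x):
            # .contiguous() materializes the permuted layout so the MPS gelu
            # kernel sees plain dense memory instead of a strided view.
            x = x.permute(0, 2, 1).contiguous()
            return F.gelu(x)

    device = 'mps' if torch.backends.mps.is_available() else 'cpu'
    x = torch.arange(24, dtype=torch.float).reshape(2, 3, 4).to(device)
    print(PatchedBlock()(x).shape)  # torch.Size([2, 4, 3])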
