Tensorflow metal: The Metal Performance Shaders operations encoded on it may not have completed.

This does not seem to be affecting the training, but it looks somewhat important (I have no clue how to read it, however):

Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x29b027b50>
    label = <none> 
    device = <AGXG13XDevice: 0x12da25600>
        name = Apple M1 Max 
    commandQueue = <AGXG13XFamilyCommandQueue: 0x106477000>
        label = <none> 
        device = <AGXG13XDevice: 0x12da25600>
            name = Apple M1 Max 
    retainedReferences = 1

This happens while training a "heavy" model on a "heavy" dataset, so it may be related to a memory issue, but I have no clue how to tackle it.

Replies

Hi @Alberto1999!

Thanks for reporting the issue. Would you happen to have a test script you could provide us that would reproduce this error message? I understand however that if this only happens sporadically during very memory heavy training it might be difficult to reproduce consistently. But I can confirm that this does not look like expected behavior so I would like to investigate it in more detail.

Additionally, on which OS version, tensorflow-macos version and tensorflow-metal version did you observe this?
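In case it helps, here is a minimal sketch for collecting those versions in one go. It relies only on the Python standard library; the helper names package_version and collect_versions are made up for illustration, and the package names are assumed to match what pip installed on your machine.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_versions(packages=("tensorflow-macos", "tensorflow-metal")):
    """Gather OS, Python and package versions into a dict for a bug report."""
    info = {
        # mac_ver() returns an empty release string on non-macOS systems.
        "macOS": platform.mac_ver()[0] or "not macOS",
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        info[name] = package_version(name) or "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a reply should be enough. A package that reports "not installed" may simply have been installed under a different name.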

Hi there. The problem is very sporadic and not really deterministic; it happens while training a heavy TF model. However, I can provide a link to a ZIP file with the Jupyter notebook and the dataset.

Alternatively, since the images come from the facades dataset, I can just share the code. The dataset can be downloaded from https://www.kaggle.com/datasets/balraj98/facades-dataset and needs to be placed in the notebook's directory, something like this:

...
├── notebook.ipynb
└── dataset
       ├── trainA
       ├── trainB
       ├── testA
       └── testB

The whole code can be downloaded from here: https://drive.google.com/file/d/1Clqf1uSzMIntA551dp8B1Z-hZFPAa8VL/view?usp=sharing
It requires only basic packages, and the likelihood of seeing the error message grows with the batch size (so I suspect it has something to do with memory).

My machine is a 2021 16" MacBook Pro with an M1 Max (26-core GPU, 32 GB RAM, 2 TB SSD) running macOS 12.4 (21F79).

  • Thanks, I'll try to reproduce this locally next.


Sure, let me know if you need more info about this

Hi there
I have a much simpler snippet that causes the same error, without any external datasets:

import tensorflow as tf
import tensorflow.keras as K
import numpy as np

# Load the IMDB reviews, keeping only the 10,000 most frequent words.
num_words = 10000
(X_train, y_train), (X_test, y_test) = K.datasets.imdb.load_data(num_words=num_words)

# Split the original test set in half: one half for validation, one for testing.
(X_valid, X_test) = X_test[:12500], X_test[12500:]
(y_valid, y_test) = y_test[:12500], y_test[12500:]

# Pad or truncate every review to 500 tokens.
maxlen = 500
X_train_trim = K.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_trim = K.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)
X_valid_trim = K.preprocessing.sequence.pad_sequences(X_valid, maxlen=maxlen)

# Small embedding + SimpleRNN binary classifier.
model_K = K.models.Sequential([
    K.layers.Embedding(input_dim=num_words, output_dim=10),
    K.layers.SimpleRNN(32),
    K.layers.Dense(1, "sigmoid")
])
model_K.compile(loss='binary_crossentropy', optimizer="adam", metrics=["accuracy"])

# Training is pinned to the CPU because SimpleRNN does not run on the M1 GPU.
with tf.device("/device:CPU:0"):
    history_K = model_K.fit(X_train_trim, y_train, epochs=10, batch_size=128, validation_data=(X_valid_trim, y_valid))

In addition, SimpleRNN does not work on the M1 GPU at all (hence the tf.device), as reported here: https://github.com/tensorflow/tensorflow/issues/56082 (LSTM, on the other hand, works fine).

I think this might be due to graph creation, since a simple reimplementation of SimpleRNN has the same issue (although this explanation does not really hold up, because otherwise LSTM would have the same issue).