Performance issue on MacBook Pro M1

Question

OriAlpha OP

Created Jul ’21

System information

Script can be found below
MacBook Pro M1 (macOS Big Sur 11.5.1)
TensorFlow installed from: source
TensorFlow version: 2.5 with Metal support
Python version: 3.9
GPU model and memory: MacBook Pro M1, 16 GB

Steps for installing TensorFlow with Metal support: https://developer.apple.com/metal/tensorflow-plugin/

I am trying to train a model on a MacBook Pro M1, but the performance is very bad and training doesn't work properly. It takes a ridiculously long time just for a single epoch.

Code needed for reproducing this behavior.

import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Embedding, Dense, LSTM
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
 
# Model configuration
additional_metrics = ['accuracy']
batch_size = 128
embedding_output_dims = 15
loss_function = BinaryCrossentropy()
max_sequence_length = 300
num_distinct_words = 5000
number_of_epochs = 5
optimizer = Adam()
validation_split = 0.20
verbosity_mode = 1
 
# Disable eager execution
tf.compat.v1.disable_eager_execution()
 
# Load dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words)
print(x_train.shape)
print(x_test.shape)
 
# Pad all sequences
padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with <PAD>
padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with <PAD>
 
# Define the Keras model
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(LSTM(10))
model.add(Dense(1, activation='sigmoid'))
 
# Compile the model
model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics)
 
# Give a summary
model.summary()
 
# Train the model
history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split)
 
# Test the model after training
test_results = model.evaluate(padded_inputs_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')

I have noticed this same problem with LSTM layers.

This issue has also been reported to Keras, and they couldn't debug it.

Keras issue: https://github.com/keras-team/keras/issues/15003

Answer 1

OriAlpha OP

Jul ’21

I tried for a few hours. Because training is so slow, I only trained for 1 epoch. This is the log:

2021-07-26 23:09:28.130352: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-07-26 23:09:28.185390: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-07-26 23:09:28.217406: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-07-26 23:09:28.229984: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Epoch 1/1
20000/20000 [==============================] - loss: 0.5489 - accuracy: 0.6923
--- 6894.8485770225524902 seconds ---

Just one epoch takes around 2 hours. That's a nightmare.
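
To put the log in perspective, here is a rough sketch of the arithmetic. It uses only the two numbers printed in the log (20000 samples and the elapsed time); the variable names are just illustrative and are not part of the original script.

```python
# Numbers copied from the log above.
elapsed_seconds = 6894.8485770225524902
samples_per_epoch = 20000

# Convert the epoch time to hours and derive the training throughput.
elapsed_hours = elapsed_seconds / 3600
samples_per_second = samples_per_epoch / elapsed_seconds

print(f"{elapsed_hours:.2f} hours per epoch")          # 1.92 hours per epoch
print(f"{samples_per_second:.1f} samples per second")  # 2.9 samples per second
```

At that rate, the 5 epochs configured in the script would take roughly 9.6 hours.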

Answer 2

OriAlpha OP

Jul ’21

It is not fair to archive the TensorFlow repo before fixing the issues in the code.

Answer 3

Frameworks Engineer

Apple

Aug ’21

Hi @OriAlpha, we recommend that users upgrade to macOS 12.0 for the best support and performance of the Metal plugin. I tried the attached script with macOS 12.0 on an M1 machine and tensorflow-metal==0.1.2 (I recommend updating to the latest Metal plugin version) and got the following performance. Please let us know if that helps.

 2021-08-24 23:20:50.927094: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
 
157/157 [==============================] - 46s 271ms/step - loss: 0.6877 - accuracy: 0.5416 - val_loss: 0.6579 - val_accuracy: 0.6034
 
Epoch 2/5
 
157/157 [==============================] - 38s 243ms/step - loss: 0.5634 - accuracy: 0.7459 - val_loss: 0.4508 - val_accuracy: 0.8192
 
Epoch 3/5
 
157/157 [==============================] - 38s 244ms/step - loss: 0.4140 - accuracy: 0.8303 - val_loss: 0.3805 - val_accuracy: 0.8410
 
Epoch 4/5
 
157/157 [==============================] - 38s 245ms/step - loss: 0.3474 - accuracy: 0.8609 - val_loss: 0.4135 - val_accuracy: 0.8380
 
Epoch 5/5
 
157/157 [==============================] - 39s 251ms/step - loss: 0.3075 - accuracy: 0.8814 - val_loss: 0.3535 - val_accuracy: 0.8554
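
For anyone who wants to compare their own run against this log, here is a minimal sketch that pulls the per-epoch times out of Keras progress lines. The regular expression and variable names are my own assumptions, not part of Keras, and the step count assumes the 20000 training samples shown in the original poster's log together with the script's batch size of 128.

```python
import math
import re

# Progress lines copied from the log above (metrics trimmed for brevity).
log = """\
157/157 [==============================] - 46s 271ms/step
157/157 [==============================] - 38s 243ms/step
157/157 [==============================] - 38s 244ms/step
157/157 [==============================] - 38s 245ms/step
157/157 [==============================] - 39s 251ms/step
"""

# Matches: steps done / total steps, the progress bar, epoch time, per-step time.
pattern = re.compile(r"(\d+)/(\d+) \[=+\] - (\d+)s (\d+)ms/step")

epoch_seconds = [int(match.group(3)) for match in pattern.finditer(log)]
total_seconds = sum(epoch_seconds)

# 20000 training samples in batches of 128; the last batch is partial.
steps_per_epoch = math.ceil(20000 / 128)

print(epoch_seconds)    # [46, 38, 38, 38, 39]
print(total_seconds)    # 199
print(steps_per_epoch)  # 157
```

The computed step count matches the 157/157 shown in the log, so both runs processed the same amount of data per epoch.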

Answer 4

cantab

Aug ’21

I saw the same issue: over 7000 seconds per epoch and a lot of warning messages. Then I tried using tf.device("/gpu:0"), and each epoch takes about 38 seconds. However, when I tried tf.device("/cpu:0"), each epoch takes only about 7 seconds. So GPU performance is still awful.
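
One way such a device comparison could be set up is sketched below. This is an assumption about the setup, not the exact code behind the numbers above: time_call, run_on_device and speedup are hypothetical helper names, and train_fn stands for any function that builds and fits the model. tf.device itself is the standard TensorFlow context manager for device placement.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], None]) -> float:
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_on_device(device_name: str, train_fn: Callable[[], None]) -> float:
    """Time train_fn with its ops placed on device_name, e.g. "/cpu:0"."""
    # Imported lazily so the pure-Python helpers work without TensorFlow.
    import tensorflow as tf

    with tf.device(device_name):
        return time_call(train_fn)


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


# Per-epoch times reported above: about 38 s on the GPU, about 7 s on the CPU.
cpu_vs_gpu = speedup(38, 7)
print(f"CPU is about {cpu_vs_gpu:.1f}x faster than the GPU for this model")
```

With TensorFlow installed, run_on_device("/cpu:0", train_fn) and run_on_device("/gpu:0", train_fn) would give the two timings. The model should be built inside train_fn so that its ops are created under the device scope.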

I have not yet found a neural net architecture where the M1 GPU is faster than the CPU. For matrix multiplication, the GPU can be 9x faster, but this does not carry over to network training.

Based on other threads and on the comment above by an Apple engineer, it looks like the Apple team doesn't even realize how bad their TensorFlow speed is.

MacBook Air M1 (macOS 12 beta)
TensorFlow version: 2.5 with Metal support
Python version: 3.8
GPU model and memory: MacBook Air M1, 16 GB

Answer 5

mrt77

Sep ’21

I have the exact same problem!! I started noticing really long training times for a simple BLSTM and decided to test the above code. I'm also using a MacBook Air M1 (macOS 12 beta) with 16 GB of memory, TensorFlow 2.5 with Metal support, and Python 3.9. This completely undermines my work! Apple should do something!

Answer 6

10686142

Oct ’21

Yep, for me both CPU and GPU performance are not good at all. A relatively simple CNN took about 7 minutes to train on a free Google Colab instance (with a K80), while the same model took about 30 minutes on the GPU and 42 minutes on the CPU with TF 2.6 on my Mac mini M1 (16 GB).

I have seen multiple posts from people experiencing the same issue, and the suggested solution always seems to be to upgrade to macOS 12.0 or to use the CPU (for smaller batch sizes), neither of which seems to fix the issue in most cases.

I would really expect Apple to come up with some solution to this. It has been a year since this M1 model was released, and I am paying for third-party notebooks, while I would expect a machine marketed as optimised for ML to at least run TF at a similar pace to a free Colab notebook.

Answer 7

joe44

Jul ’22

Hello, I am still getting the same issue today, in 2022.

It seems the problem has never been solved... I will be starting a class on TensorFlow soon, and getting something as slow as this is just awful.

I have no choice but to use Google Colab.

Answer 8

chuongmep

Dec ’22

Any new update as of 22/12/2022?

Answer 9

TorbenXD

Apr ’23

Same issue in 2023

Answer 10

OriAlpha OP

Apr ’23

Apple doesn't seem to care about providing support for its M-series laptops. If your tasks are GPU and ML work, use NVIDIA GPUs. Those are the best and work out of the box.

Answer 11

erezkatz

Apr ’23

I am having the same issue as of 4/14/2023. Not to mention that I still get the warning telling me to use from keras.optimizers import Adam as AdamLegacy to make my binary classifier work. Is there any update I should be aware of?

Answer 12

erezkatz

Apr ’23

Also, I don't see a distribution for tensorflow-metal==0.12.0 (the latest version is 0.8.0). Where can I get it?
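
Before comparing against version numbers quoted in this thread, it can help to confirm what is actually installed locally. Here is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8). installed_versions is a hypothetical helper name, and the package names are the ones mentioned in this thread.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["tensorflow-macos", "tensorflow-metal"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Posting this output along with the macOS version makes it easier for others to reproduce a slow run.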

Answer 13

quner

Jul ’23

Same issue with TensorFlow on the newest system environment, macOS 14.0 Beta (23A5286i). Please help us, dear Apple!

Answer 14

erezkatz

Oct ’23

It's October 2023 and the issue is still there. After my upgrade to macOS Sonoma, I can't get tensorflow-metal to behave well with a batch size of 128. I used to run at 64 just fine (it was speedy), and now with larger batches I do see some (not great) performance improvements, but the model overfits with large batch sizes. I have read blogs for all sorts of suggestions, such as reverting to an older version of TF for Mac (I don't want to do that). One suggestion I saw in some posts is to disable the GPU altogether. Has anyone had any success with that?
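
For the suggestion of disabling the GPU altogether, a minimal sketch is shown below. It assumes TensorFlow 2.x, where tf.config.set_visible_devices is available. hide_gpus is a hypothetical helper name, and whether this actually speeds up a given model is something to measure rather than assume.

```python
from typing import List


def hide_gpus() -> List:
    """Hide all GPUs from TensorFlow so that ops run on the CPU.

    Call this before TensorFlow initializes its devices, i.e. before any
    model or tensor is created; otherwise TensorFlow raises a RuntimeError.
    """
    # Imported inside the function so that defining it does not need TensorFlow.
    import tensorflow as tf

    tf.config.set_visible_devices([], "GPU")
    # Return the GPUs that remain visible (expected to be an empty list).
    return tf.config.get_visible_devices("GPU")
```

Calling hide_gpus() at the top of the training script, right after the imports, should make model.fit run on the CPU only.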

Answer 15

JuanAmay

Feb ’24

Hey team, any update on this? I am still having the issue with the following environment:

absl-py==1.3.0 aio-pika==8.2.3 aiofiles==22.1.0 aiogram==2.23.1 aiohttp==3.8.3 aiormq==6.4.0 aiosignal==1.3.1 APScheduler==3.9.1.post1 astunparse==1.6.3 async-timeout==4.0.2 attrs==22.1.0 Babel==2.9.1 bert-serving-client==1.10.0 bidict==0.22.1 boto3==1.26.136 botocore==1.29.136 CacheControl==0.12.11 cachetools==5.2.1 certifi==2023.7.22 cffi==1.15.1 charset-normalizer==2.1.1 click==8.1.3 cloudpickle==2.2.0 colorclass==2.2.2 coloredlogs==15.0.1 colorhash==1.2.1 confluent-kafka==1.9.2 cryptography==41.0.7 cycler==0.11.0 dask==2022.10.2 dnspython==2.3.0 docopt==0.6.2 fbmessenger==6.0.0 fire==0.5.0 flatbuffers fonttools==4.38.0 frozenlist==1.3.3 fsspec==2022.11.0 future==0.18.3 gast==0.2.1 google-auth==2.16.0 google-auth-oauthlib==0.4.1 google-pasta==0.2.0 greenlet==3.0.3 grpcio==1.51.1 h5py==3.10.0 httptools==0.5.0 humanfriendly==10.0 idna==3.4 jmespath==1.0.1 joblib==1.2.0 jsonpickle==2.2.0 jsonschema==4.16.0 keras Keras-Preprocessing==1.1.2 kiwisolver==1.4.4 libclang==15.0.6.1 locket==1.0.0 magic-filter==1.0.9 Markdown==3.4.1 MarkupSafe==2.1.2 matplotlib==3.5.3 mattermostwrapper==2.2 msgpack==1.0.4 multidict==5.2.0 networkx==2.6.3 numpy==1.23.5 oauthlib==3.2.2 opt-einsum==3.3.0 packaging pamqp==3.2.0 partd==1.3.0 Pillow==9.4.0 pip==22.3.1 pluggy==1.0.0 prompt-toolkit==3.0.28 protobuf psycopg2-binary==2.9.5 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.21 pydot==1.4.2 PyJWT==2.6.0 pykwalify==1.8.0 pymongo==4.0.1 pyparsing==3.0.9 pyrsistent==0.19.3 python-crfsuite==0.9.8 python-dateutil==2.8.2 python-engineio==4.3.4 python-socketio==5.7.2 pytz==2022.7.1 pytz-deprecation-shim==0.1.0.post0 PyYAML==6.0.1 pyzmq==25.0.0 questionary==1.10.0 randomname==0.1.5 rasa rasa-sdk redis==4.5.3 regex==2022.10.31 requests==2.28.2 requests-oauthlib==1.3.1 requests-toolbelt==0.10.1 rocketchat-API==1.28.1 rsa==4.9 ruamel.yaml==0.17.21 ruamel.yaml.clib==0.2.7 s3transfer==0.6.0 sanic==21.12.2 Sanic-Cors==2.0.1 sanic-jwt==1.8.0 sanic-routing==0.7.2 scikit-learn==1.1.3 scipy==1.12 sentry-sdk==1.11.1 setuptools==65.6.3 six sklearn-crfsuite==0.3.6 slack-sdk==3.19.5 SQLAlchemy==1.4.46 tabulate==0.9.0 tarsafe==0.0.3 tensorboard==2.9 tensorboard-data-server tensorboard-plugin-wit==1.8.1 tensorflow-macos==2.9 tensorflow-metal==0.5.0 tensorflow-addons==0.18.0 tensorflow-estimator==2.9 tensorflow-hub==0.13.0 tensorflow-io-gcs-filesystem==0.36.0 tensorflow-text termcolor==2.2.0 terminaltables==3.1.10 threadpoolctl==3.1.0 toolz==0.12.0 tqdm==4.64.1 twilio==7.14.2 typeguard==2.13.3 typing_extensions==4.4.0 typing-utils==0.1.0 tzdata==2022.7 tzlocal==4.2 ujson==5.7.0 urllib3==1.26.14 uvloop==0.17.0 wcwidth==0.2.6 webexteamssdk==1.6.1 websockets==10.4 Werkzeug==2.2.2 wheel==0.38.1 wrapt==1.14.1 yarl==1.8.2

