tensorflow-macos slow (Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.)

Hi all,

I installed TensorFlow following the steps at https://developer.apple.com/metal/tensorflow-plugin/

But when I run my training I get the log below, and performance is worse than on the CPU.

Init Plugin
Init Graph Optimizer
Init Kernel
Metal device set to: Apple M1
2021-07-17 18:40:53.716687: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-07-17 18:40:53.716767: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
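A note on reading these lines: the single letter after the timestamp is TensorFlow's log severity (I for info, W for warning, E for error, F for fatal), so the NUMA line is logged at info level rather than as a warning or error. Below is a minimal sketch that extracts the severity from such a line. The regular expression and the `parse_tf_log_line` helper are hypothetical, written for illustration, and not part of TensorFlow.

```python
import re

# Pattern for lines such as:
# 2021-07-17 18:40:53.716687: I path/to/file.cc:305] message text
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)

SEVERITY = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def parse_tf_log_line(text):
    """Return a dict describing one log line, or None if it does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    return {
        "severity": SEVERITY[match.group("level")],
        "source": match.group("source"),
        "line": int(match.group("line")),
        "message": match.group("message"),
    }


numa_line = (
    "2021-07-17 18:40:53.716687: I tensorflow/core/common_runtime/"
    "pluggable_device/pluggable_device_factory.cc:305] Could not identify "
    "NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not "
    "have been built with NUMA support."
)
print(parse_tf_log_line(numa_line)["severity"])  # info
```

Lines that do not follow this layout, such as `Init Plugin`, simply return `None`.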

My versions:

% python -V
Python 3.9.6

% pip freeze
absl-py @ file:///home/conda/feedstock_root/build_artifacts/absl-py_1606234718434/work
astunparse @ file:///home/conda/feedstock_root/build_artifacts/astunparse_1610696312422/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
cachetools==4.2.2
certifi==2021.5.30
charset-normalizer==2.0.3
cloudpickle==1.6.0
flatbuffers==1.12
gast @ file:///home/conda/feedstock_root/build_artifacts/gast_1596839682936/work
google-auth==1.33.0
google-auth-oauthlib==0.4.4
google-pasta==0.2.0
grpcio @ file:///Users/runner/miniforge3/conda-bld/grpcio_1610588577338/work
gym==0.18.3
h5py @ file:///Users/runner/miniforge3/conda-bld/h5py_1609497507927/work
idna==3.2
keras-nightly==2.5.0.dev2021032900
Keras-Preprocessing @ file:///home/conda/feedstock_root/build_artifacts/keras-preprocessing_1610713559828/work
Markdown==3.3.4
numpy @ file:///Users/runner/miniforge3/conda-bld/numpy_1610324554245/work
oauthlib==3.1.1
opt-einsum @ file:///home/conda/feedstock_root/build_artifacts/opt_einsum_1617859230218/work
Pillow==8.2.0
protobuf==3.17.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyglet==1.5.15
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scipy @ file:///Users/runner/miniforge3/conda-bld/scipy_1624824941870/work
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
tensorboard==2.5.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow-estimator==2.5.0
tensorflow-macos==2.5.0
tensorflow-metal==0.1.1
termcolor==1.1.0
typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1602702424206/work
urllib3==1.26.6
Werkzeug==2.0.1
wrapt @ file:///Users/runner/miniforge3/conda-bld/wrapt_1624972047019/work

How can I speed up training on the GPU?

Best Regards, Fernando

Can you please provide a script to reproduce the issue?

The script is in my GitHub: my script code

Just follow the steps to reproduce [1] with GPU (slow) and [2] with CPU (fast)

How to reproduce:

[1]: SLOW (with tensorflow-metal installed) == GPU

conda create --name lab_slow python=3.9
conda activate lab_slow
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal
python -m pip install gym
python -m pip install icecream

With [1] I can see the GPU working hard, but the run is very slow and some informational messages are shown:

(lab_slow) fernando@minidefernando restml-muzero % python mymuzero.py            
Init Plugin
Init Graph Optimizer
Init Kernel
Metal device set to: Apple M1
2021-07-23 17:10:41.158886: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-07-23 17:10:41.159108: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-07-23 17:10:41.301535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-23 17:10:41.303570: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-07-23 17:10:41.304081: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
ic| elapsed: -8.216638209
ic| elapsed: -11.702662209000001
ic| elapsed: -7.210992999999998
ic| elapsed: -7.76780325
ic| elapsed: -6.552902541999998
ic| elapsed: -7.051604083000001
ic| elapsed: -7.802484458000002

The time of each iteration is ~7 seconds using GPU.

[2]: FAST (without tensorflow-metal installed) == CPU

conda create --name lab_fast python=3.9
conda activate lab_fast
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install gym
python -m pip install icecream

With [2] the CPU is used and the speed is good.

I got this:

(lab_fast) fernando@minidefernando restml-muzero % python mymuzero.py            
2021-07-23 17:12:54.201964: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-23 17:12:54.204263: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
ic| elapsed: -0.6991699170000001
ic| elapsed: -1.001101625
ic| elapsed: -0.745720833
ic| elapsed: -0.5131082500000002
ic| elapsed: -0.680853667
ic| elapsed: -0.5841365830000003
ic| elapsed: -0.6107562499999997
ic| elapsed: -0.5290834160000006
ic| elapsed: -1.3289379590000001
ic| elapsed: -0.508716166000001
ic| elapsed: -1.1658726250000004
ic| elapsed: -0.5299397080000006
ic| elapsed: -0.5714273750000007
ic| elapsed: -0.6107442499999998

The time of each iteration is ~0.5 seconds using 1 CPU.
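To put a number on the difference, the `elapsed` lines from the two runs above can be compared directly. This is a rough sketch: the `parse_elapsed` helper is hypothetical, and the sign is dropped because the script prints the values as negative.

```python
import statistics

GPU_LOG = """
ic| elapsed: -8.216638209
ic| elapsed: -11.702662209000001
ic| elapsed: -7.210992999999998
ic| elapsed: -7.76780325
ic| elapsed: -6.552902541999998
ic| elapsed: -7.051604083000001
ic| elapsed: -7.802484458000002
"""

CPU_LOG = """
ic| elapsed: -0.6991699170000001
ic| elapsed: -1.001101625
ic| elapsed: -0.745720833
ic| elapsed: -0.5131082500000002
ic| elapsed: -0.680853667
ic| elapsed: -0.5841365830000003
ic| elapsed: -0.6107562499999997
ic| elapsed: -0.5290834160000006
ic| elapsed: -1.3289379590000001
ic| elapsed: -0.508716166000001
ic| elapsed: -1.1658726250000004
ic| elapsed: -0.5299397080000006
ic| elapsed: -0.5714273750000007
ic| elapsed: -0.6107442499999998
"""


def parse_elapsed(log_text):
    """Collect the absolute value of every 'ic| elapsed: <number>' line."""
    prefix = "ic| elapsed:"
    values = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            values.append(abs(float(line[len(prefix):])))
    return values


gpu_median = statistics.median(parse_elapsed(GPU_LOG))
cpu_median = statistics.median(parse_elapsed(CPU_LOG))
print(f"GPU median: {gpu_median:.3f} s")
print(f"CPU median: {cpu_median:.3f} s")
print(f"GPU / CPU:  {gpu_median / cpu_median:.1f}x")
```

The median is used instead of the mean so that the occasional slow iteration does not skew the comparison.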

@IPSec - did you get this figured out? I'm running into a similar issue.

I have a similar issue as well.

On my iMac 27" with Monterey 12.0.1 it crashes with the GPU in tensorflow-metal:

% python muzero.py
2021-10-21 08:36:21.088556: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Metal device set to: AMD Radeon Pro 5700 XT

systemMemory: 128.00 GB
maxCacheSize: 7.99 GB

2021-10-21 08:36:21.089347: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-10-21 08:36:21.089966: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-10-21 08:36:21.753689: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-10-21 08:36:21.759239: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-10-21 08:36:34.888 python[14296:730686] -[MPSGraph adamUpdateWithLearningRateTensor:beta1Tensor:beta2Tensor:epsilonTensor:beta1PowerTensor:beta2PowerTensor:valuesTensor:momentumTensor:velocityTensor:gradientTensor:name:]: unrecognized selector sent to instance 0x600001b26220
zsh: segmentation fault  python muzero.py

It runs with the CPU.

 % python --version
Python 3.8.5

% pip freeze
absl-py==0.12.0
anyio==3.3.2
appnope==0.1.2
argon2-cffi==21.1.0
asttokens==2.0.5
astunparse==1.6.3
attrs==21.2.0
Babel==2.9.1
backcall==0.2.0
bleach==4.1.0
bokeh==2.3.3
cachetools==4.2.4
certifi==2021.5.30
cffi==1.14.6
charset-normalizer==2.0.6
clang==5.0
cloudpickle==2.0.0
colorama==0.4.4
cycler==0.10.0
Cython==0.29.24
debugpy==1.5.0
decorator==5.1.0
defusedxml==0.7.1
dill==0.3.4
distinctipy==1.1.5
dm-tree==0.1.6
dotmap==1.3.24
entrypoints==0.3
executing==0.8.2
flatbuffers==1.12
future==0.18.2
gast==0.4.0
gensim==3.8.3
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.53.0
grpcio==1.41.0
gviz-api==1.9.0
gym==0.21.0
h5py==3.1.0
hdbscan==0.8.27
icecream==2.1.1
idna==3.2
importlib-resources==5.2.2
ipykernel==6.4.1
ipython==7.28.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.0
Jinja2==3.0.2
joblib==1.1.0
json5==0.9.6
jsonschema==4.0.1
jupyter-client==7.0.6
jupyter-core==4.8.1
jupyter-server==1.11.1
jupyterlab==3.1.18
jupyterlab-pygments==0.1.2
jupyterlab-server==2.8.2
jupyterlab-widgets==1.0.2
keras==2.6.0
Keras-Preprocessing==1.1.2
kiwisolver==1.3.2
llvmlite==0.37.0
Markdown==3.3.4
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
memory-profiler==0.58.0
mistune==0.8.4
nbclassic==0.3.2
nbclient==0.5.4
nbconvert==6.2.0
nbformat==5.1.3
nest-asyncio==1.5.1
nmslib==2.1.1
notebook==6.4.4
numba==0.54.0
numpy==1.20.3
oauthlib==3.1.1
opt-einsum==3.3.0
packaging==21.0
pandas==1.3.3
pandocfilters==1.5.0
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.3.2
prometheus-client==0.11.0
promise==2.3
prompt-toolkit==3.0.20
protobuf==3.18.1
psutil==5.8.0
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.6.1
pycparser==2.20
Pygments==2.10.0
pynndescent==0.5.4
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytz==2021.3
PyYAML==5.4.1
pyzmq==22.3.0
requests==2.26.0
requests-oauthlib==1.3.0
requests-unixsocket==0.2.0
rsa==4.7.2
scikit-learn==1.0
scipy==1.7.1
Send2Trash==1.8.0
six==1.15.0
smart-open==5.2.1
sniffio==1.2.0
tabulate==0.8.9
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.5.0
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-consciousness==0.1
tensorflow-datasets==4.4.0
tensorflow-estimator==2.6.0
tensorflow-gan==2.1.0
tensorflow-hub==0.12.0
tensorflow-macos==2.6.0
tensorflow-metadata==1.2.0
tensorflow-metal==0.2.0
tensorflow-probability==0.14.1
tensorflow-similarity==0.13.45
tensorflow-text==2.6.0
termcolor==1.1.0
terminado==0.12.1
testpath==0.5.0
threadpoolctl==3.0.0
top2vec==1.0.26
tornado==6.1
tqdm==4.62.3
traitlets==5.1.0
typing-extensions==3.7.4.3
umap-learn==0.5.1
urllib3==1.26.7
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
Werkzeug==2.0.2
widgetsnbextension==3.5.1
wordcloud==1.8.1
wrapt==1.12.1
zipp==3.6.0
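The list above contains both `tensorflow==2.6.0` and `tensorflow-macos==2.6.0`. Both distributions install the same `tensorflow` Python package, so having them side by side in one environment may be worth ruling out; whether it is related to the crash here is not established. Below is a small sketch for spotting this in `pip freeze` output. The `tensorflow_distributions` helper is hypothetical.

```python
TF_NAMES = {"tensorflow", "tensorflow-macos", "tensorflow-metal"}


def tensorflow_distributions(freeze_text):
    """Return {name: version} for the TensorFlow distributions in pip freeze output."""
    found = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip blank lines and 'name @ file://...' entries
        name, version = line.split("==", 1)
        if name.lower() in TF_NAMES:
            found[name.lower()] = version
    return found


# A few lines taken from the list above.
sample = """
tensorboard==2.6.0
tensorflow==2.6.0
tensorflow-estimator==2.6.0
tensorflow-macos==2.6.0
tensorflow-metal==0.2.0
"""

found = tensorflow_distributions(sample)
print(found)
if "tensorflow" in found and "tensorflow-macos" in found:
    print("Both tensorflow and tensorflow-macos are installed in this environment.")
```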

Same issue with new MacBook Pro 16 running 12.0.1. Defaults to 0 MB GPU.

Same issue: Monterey 12.0.1, MacBook Pro 14" M1 with 10-core CPU and 14-core GPU. I followed the steps at https://developer.apple.com/metal/tensorflow-plugin/. GPU execution is very slow compared to CPU.

Code is:

import tensorflow as tf
import time

with tf.device('/CPU:0'):
    start = time.time()
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test)
    result = time.time() - start
    print("Total time: {:0.2f}ms".format(1000 * result))

with tf.device('/GPU:0'):
    start = time.time()
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test)
    result = time.time() - start
    print("Total time: {:0.2f}ms".format(1000 * result))
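One thing that may be worth checking with this script (an assumption, not something confirmed in this thread): `model.fit` uses a batch size of 32 when none is given, so each epoch over the 60,000 MNIST training images is split into many very small steps. For a model this small, the per-step overhead of dispatching work to the GPU could outweigh the compute itself. The arithmetic is below, with `steps_per_epoch` as a hypothetical helper.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches Keras runs per epoch (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


MNIST_TRAIN_SAMPLES = 60_000

print(steps_per_epoch(MNIST_TRAIN_SAMPLES, 32))    # 1875 (default batch size)
print(steps_per_epoch(MNIST_TRAIN_SAMPLES, 1024))  # 59
```

Passing a larger `batch_size` to `model.fit` in both the CPU and the GPU block would show whether the gap narrows. It is only an experiment to try, not a confirmed fix.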

Same here. It appears that the GPU is not being used...? But then it should not be slower than CPU-only. Totally weird.

Multi-GPU Not Supported

The current version (0.2.0) of tensorflow-metal seems to use only one GPU.

https://developer.apple.com/metal/tensorflow-plugin/

Currently Not Supported

  • Multi-GPU support

  • Acceleration for Intel GPUs

  • V1 TensorFlow Networks

I'm seeing the same problem (slow GPU training and the NUMA node warning) on my M1 Max. I'd be very happy to see a fix for this issue.
