Is it possible to use HuggingFace via TF-macOS and TF-Metal?

Dear All Developers,

It is great that we finally have TF-macOS and TF-Metal for GPU/NPU acceleration. After some tests, everything seems to work well.

So I am wondering whether it is possible to solve NLP tasks with HuggingFace via TF-Metal for GPU acceleration. To figure it out, I installed all the required packages and ran the testing code.

Here is what I got. So far so good, right?
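For context, the environment and the quick test I ran look roughly like this (a sketch; package versions and the exact script may differ, and the model name is only illustrative):

# Environment, roughly: python -m pip install tensorflow-macos tensorflow-metal transformers
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Confirm the Metal PluggableDevice is visible to TensorFlow.
print(tf.config.list_physical_devices('GPU'))

# Quick sanity check: a single forward pass through a HuggingFace TF model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("TF-Metal sanity check", return_tensors="tf")
print(model(inputs).logits)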

However, an error pops up when I attempt to fine-tune a BERT model.
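The fine-tuning step is roughly the standard Keras compile/fit flow (again a sketch; the dataset, labels, and hyperparameters here are only illustrative):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy data just to exercise the training path; a real dataset goes here.
texts = ["great movie", "terrible movie"] * 32
labels = [1, 0] * 32
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
ds = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(8)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(ds, epochs=1)  # the colocation error below is raised during fit()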

Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
Sqrt: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
ResourceScatterAdd: GPU CPU 
Unique: CPU 
AddV2: GPU CPU 
ResourceGather: GPU CPU 
Const: GPU CPU 

It looks like the GPU is not being assigned correctly, so I checked whether the GPU is detected by TensorFlow. Here is the GPU info reported by TensorFlow.

WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-06-29 01:56:25.862829: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-06-29 01:56:25.862893: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Out[2]: True
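As the deprecation warning suggests, the non-deprecated check is a one-liner (a short sketch):

import tensorflow as tf

# Lists the Metal PluggableDevice if tensorflow-metal is installed correctly, e.g.
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices('GPU'))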

Apparently, the problem comes from HuggingFace. I know that Apple is not responsible for packages other than TF-macOS and TF-Metal; I am just curious whether anyone here has a solution.

Sincerely,

hawkiyc

Replies

Hi @hawkiyc, we were able to reproduce this issue and are working on resolving it. The issue is related to the following ops:

UnsortedSegmentSum: CPU                      <---
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
ResourceScatterAdd: GPU CPU 
Unique: CPU                                    <---

The highlighted ops above are not registered on the GPU, which causes the colocation error during device placement in core TensorFlow. We will post an update here. Thanks for filing this issue.
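If you want to see exactly where each op lands in the meantime, device placement logging can help; enabling soft device placement may also be worth a try, though it is not guaranteed to satisfy this particular colocation group:

import tensorflow as tf

# Allow ops without a GPU kernel (e.g. Unique, UnsortedSegmentSum) to be placed
# on the CPU instead of failing device placement outright.
tf.config.set_soft_device_placement(True)

# Optional: log where each op actually runs, to confirm any CPU fallback.
tf.debugging.set_log_device_placement(True)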

  • Hi, @Frameworks Engineer. I appreciate your time and am looking forward to the new version of TF-Metal / TF-macOS.


I have the same problem

  • Hi, @itsloudc. According to Apple, this GPU assignment issue happens when you use TF-Hub and/or HuggingFace for NLP tasks, and there is no known workaround right now. It looks like we can only wait for a new version of TF-macOS and TF-Metal.


Hello, is there any news on that front?

I'm a total newbie with TF, so I have zero sense of what is going on, but I consistently get this message: "Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support."

either with this test (the first reply here) or with the "TensorFlow 2 quickstart for beginners" tutorial.

Strangely, the training does seem to run: simple tests go through epochs pretty fast (I guess), and my AMD GPU usage sits around 30-50%.

My specs: Intel MacBook Pro running macOS Monterey, AMD Radeon Pro 5500M 8 GB, Python 3.8.10.
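For reference, the kind of simple test that produces the output below is roughly a Keras MNIST model trained for 12 epochs with batch size 128 (a sketch; the actual script from the first reply may differ):

import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 60000 / 128 ≈ 469 training steps and 10000 / 128 ≈ 79 validation steps per epoch.
model.fit(x_train, y_train, batch_size=128, epochs=12,
          validation_data=(x_test, y_test))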

Here's an example of the simple test output:

(tensorflow-metal-test) jv@192 tensorflow-metal-test % python /Users/jv/tensorflow-exp/test.py                       

2021-11-22 23:50:48.066315: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Metal device set to: AMD Radeon Pro 5500M

systemMemory: 32.00 GB
maxCacheSize: 3.99 GB

2021-11-22 23:50:48.067311: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-22 23:50:48.067826: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-11-22 23:50:48.505048: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-22 23:50:48.505092: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-11-22 23:50:48.712043: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.734335: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.827487: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.858801: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.081885: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.113821: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.169179: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.208235: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

2021-11-22 23:50:49.243817: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)

Train on 469 steps, validate on 79 steps

2021-11-22 23:50:49.282608: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Epoch 1/12

2021-11-22 23:50:49.309804: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

469/469 [==============================] - ETA: 0s - batch: 234.0000 - size: 1.0000 - loss: 0.1564 - accuracy: 0.9539/Users/julienvincenot/tensorflow-metal-test/lib/python3.8/site-packages/keras/engine/training.py:2470: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.

  warnings.warn('`Model.state_updates` will be removed in a future version. '

2021-11-22 23:51:01.268461: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
469/469 [==============================] - 14s 21ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.1564 - accuracy: 0.9539 - val_loss: 0.0707 - val_accuracy: 0.9782
Epoch 2/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0453 - accuracy: 0.9857 - val_loss: 0.0487 - val_accuracy: 0.9848
Epoch 3/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0284 - accuracy: 0.9912 - val_loss: 0.0378 - val_accuracy: 0.9878
Epoch 4/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0191 - accuracy: 0.9939 - val_loss: 0.0346 - val_accuracy: 0.9886
Epoch 5/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0135 - accuracy: 0.9958 - val_loss: 0.0400 - val_accuracy: 0.9892
Epoch 6/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0099 - accuracy: 0.9968 - val_loss: 0.0332 - val_accuracy: 0.9902
Epoch 7/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0069 - accuracy: 0.9978 - val_loss: 0.0376 - val_accuracy: 0.9894
Epoch 8/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0078 - accuracy: 0.9973 - val_loss: 0.0389 - val_accuracy: 0.9889
Epoch 9/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0059 - accuracy: 0.9980 - val_loss: 0.0448 - val_accuracy: 0.9887
Epoch 10/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0047 - accuracy: 0.9985 - val_loss: 0.0434 - val_accuracy: 0.9902
Epoch 11/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0053 - accuracy: 0.9984 - val_loss: 0.0486 - val_accuracy: 0.9873
Epoch 12/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0047 - accuracy: 0.9984 - val_loss: 0.0383 - val_accuracy: 0.9896