Is it possible to use HuggingFace via TF-macOS and TF-Metal?

Dear All Developers,

It is great that we finally have TF-macOS and TF-Metal for GPU/NPU acceleration. After some tests, everything seems to work well.

So I am wondering whether it is possible to solve NLP tasks with HuggingFace via TF-Metal for GPU acceleration. To figure it out, I installed all the required packages and ran the testing code.

Here is what I got. So far so good, right?
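For context, the environment and the quick test I ran look roughly like this (a sketch; package versions and the exact script may differ, and the model name is only illustrative):

# Environment, roughly: python -m pip install tensorflow-macos tensorflow-metal transformers
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Confirm the Metal PluggableDevice is visible to TensorFlow.
print(tf.config.list_physical_devices('GPU'))

# Quick sanity check: a single forward pass through a HuggingFace TF model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
inputs = tokenizer("TF-Metal sanity check", return_tensors="tf")
print(model(inputs).logits)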

However, an error pops up when I attempt to fine-tune a BERT model.
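The fine-tuning step is roughly the standard Keras compile/fit flow (again a sketch; the dataset, labels, and hyperparameters here are only illustrative):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy data just to exercise the training path; a real dataset goes here.
texts = ["great movie", "terrible movie"] * 32
labels = [1, 0] * 32
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
ds = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(8)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(ds, epochs=1)  # the colocation error below is raised during fit()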

Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
Sqrt: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
ResourceScatterAdd: GPU CPU 
Unique: CPU 
AddV2: GPU CPU 
ResourceGather: GPU CPU 
Const: GPU CPU 

It looks like the GPU is not being assigned correctly, so I checked whether the GPU is detected by TensorFlow. Here is the GPU info reported by TensorFlow.

WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-06-29 01:56:25.862829: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-06-29 01:56:25.862893: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Out[2]: True
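As the deprecation warning suggests, the non-deprecated check is a one-liner (a short sketch):

import tensorflow as tf

# Lists the Metal PluggableDevice if tensorflow-metal is installed correctly, e.g.
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices('GPU'))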

Apparently, the problem comes from HuggingFace. I know that Apple is not responsible for packages other than TF-macOS and TF-Metal; I am just curious whether anyone here has a solution.

Sincerely,

hawkiyc

Replies

Hi @hawkiyc, we were able to reproduce this issue and are working on resolving it. The issue is related to the following ops:

UnsortedSegmentSum: CPU                      <---
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
ResourceScatterAdd: GPU CPU 
Unique: CPU                                    <---

The highlighted ops above are not registered on the GPU, which causes the colocation error during device placement in core TensorFlow. We will post an update here. Thanks for filing this issue.
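If you want to see exactly where each op lands in the meantime, device placement logging can help; enabling soft device placement may also be worth a try, though it is not guaranteed to satisfy this particular colocation group:

import tensorflow as tf

# Allow ops without a GPU kernel (e.g. Unique, UnsortedSegmentSum) to be placed
# on the CPU instead of failing device placement outright.
tf.config.set_soft_device_placement(True)

# Optional: log where each op actually runs, to confirm any CPU fallback.
tf.debugging.set_log_device_placement(True)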

  • Hi, @Frameworks Engineer. I appreciate your time and am looking forward to the new version of TF-Metal / TF-macOS.


I have the same problem

  • Hi, @itsloudc. According to Apple, this GPU assignment issue happens when you use TF-Hub and/or HuggingFace for NLP tasks, and there is no known workaround right now. It looks like we can only wait for a new version of TF-macOS and TF-Metal.


Hello, is there any news on that front?

I'm a total newbie with TF, so I have zero sense of what is going on, but I consistently get this message: "Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support."

either with this test (the first reply here) or with the "TensorFlow 2 quickstart for beginners" tutorial.

Strangely, the training does seem to run: simple tests go through epochs pretty fast (I guess), and my AMD GPU usage sits around 30-50%.

My specs: Intel MacBook Pro running macOS Monterey, AMD Radeon Pro 5500M 8 GB, Python 3.8.10.
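For reference, the kind of simple test that produces the output below is roughly a Keras MNIST model trained for 12 epochs with batch size 128 (a sketch; the actual script from the first reply may differ):

import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 60000 / 128 ≈ 469 training steps and 10000 / 128 ≈ 79 validation steps per epoch.
model.fit(x_train, y_train, batch_size=128, epochs=12,
          validation_data=(x_test, y_test))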

Here's an example of the simple test output:

(tensorflow-metal-test) jv@192 tensorflow-metal-test % python /Users/jv/tensorflow-exp/test.py                       

2021-11-22 23:50:48.066315: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2 AVX AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Metal device set to: AMD Radeon Pro 5500M

systemMemory: 32.00 GB
maxCacheSize: 3.99 GB

2021-11-22 23:50:48.067311: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-22 23:50:48.067826: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-11-22 23:50:48.505048: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-22 23:50:48.505092: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-11-22 23:50:48.712043: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.734335: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.827487: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:48.858801: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.081885: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.113821: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.169179: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2021-11-22 23:50:49.208235: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

2021-11-22 23:50:49.243817: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)

Train on 469 steps, validate on 79 steps

2021-11-22 23:50:49.282608: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Epoch 1/12

2021-11-22 23:50:49.309804: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

469/469 [==============================] - ETA: 0s - batch: 234.0000 - size: 1.0000 - loss: 0.1564 - accuracy: 0.9539/Users/julienvincenot/tensorflow-metal-test/lib/python3.8/site-packages/keras/engine/training.py:2470: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.

  warnings.warn('`Model.state_updates` will be removed in a future version. '

2021-11-22 23:51:01.268461: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
469/469 [==============================] - 14s 21ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.1564 - accuracy: 0.9539 - val_loss: 0.0707 - val_accuracy: 0.9782
Epoch 2/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0453 - accuracy: 0.9857 - val_loss: 0.0487 - val_accuracy: 0.9848
Epoch 3/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0284 - accuracy: 0.9912 - val_loss: 0.0378 - val_accuracy: 0.9878
Epoch 4/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0191 - accuracy: 0.9939 - val_loss: 0.0346 - val_accuracy: 0.9886
Epoch 5/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0135 - accuracy: 0.9958 - val_loss: 0.0400 - val_accuracy: 0.9892
Epoch 6/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0099 - accuracy: 0.9968 - val_loss: 0.0332 - val_accuracy: 0.9902
Epoch 7/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0069 - accuracy: 0.9978 - val_loss: 0.0376 - val_accuracy: 0.9894
Epoch 8/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0078 - accuracy: 0.9973 - val_loss: 0.0389 - val_accuracy: 0.9889
Epoch 9/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0059 - accuracy: 0.9980 - val_loss: 0.0448 - val_accuracy: 0.9887
Epoch 10/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0047 - accuracy: 0.9985 - val_loss: 0.0434 - val_accuracy: 0.9902
Epoch 11/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0053 - accuracy: 0.9984 - val_loss: 0.0486 - val_accuracy: 0.9873
Epoch 12/12
469/469 [==============================] - 12s 19ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.0047 - accuracy: 0.9984 - val_loss: 0.0383 - val_accuracy: 0.9896