GPU cannot be assigned properly during NLP tasks.

Dear All Developers,

I previously reported an issue with the HuggingFace package in thread 683992.

At first, I thought the problem came from HuggingFace. However, after some further tests, it seems to result from TensorFlow-Hub instead.

Here is the thing: I built a BERT fine-tuning model with TF and TF-Hub only, and I got the same error as before.
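
For reference, here is a minimal sketch of the kind of setup involved. The TF Hub handles, the classification head, and the hyperparameters are illustrative placeholders rather than my exact script, and the optimizer comes from the tf-models-official package (the AdamWeightDecay named in the error below).

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the BERT preprocessing model
from official.nlp import optimization  # create_optimizer builds AdamWeightDecay

# Illustrative TF Hub handles; any matching preprocessing/encoder pair behaves the same way.
preprocess_handle = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_handle = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"

# Build a small classifier on top of the pooled BERT output.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = hub.KerasLayer(preprocess_handle)(text_input)
encoder_outputs = hub.KerasLayer(encoder_handle, trainable=True)(encoder_inputs)
logits = tf.keras.layers.Dense(2)(encoder_outputs["pooled_output"])
model = tf.keras.Model(text_input, logits)

# AdamWeightDecay with a linear warmup/decay schedule, as in the official BERT examples.
optimizer = optimization.create_optimizer(
    init_lr=3e-5, num_train_steps=1000, num_warmup_steps=100, optimizer_type="adamw")

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# model.fit(train_ds, epochs=3)  # this is the step that raises the error below on the GPU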

Here are the details of the error.

InvalidArgumentError: Cannot assign a device for operation AdamWeightDecay/AdamWeightDecay/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU
ResourceGather: GPU CPU
AddV2: GPU CPU
Sqrt: GPU CPU
Unique: CPU
ResourceScatterAdd: GPU CPU
UnsortedSegmentSum: CPU
AssignVariableOp: GPU CPU
AssignSubVariableOp: GPU CPU
ReadVariableOp: GPU CPU
NoOp: GPU CPU
Mul: GPU CPU
Shape: GPU CPU
Identity: GPU CPU
StridedSlice: GPU CPU
_Arg: GPU CPU
Const: GPU CPU

So, clearly, something is wrong on the TF side, and I don't think there is a quick fix.

Since transformers and related models are so powerful in the NLP area, it would be a great shame if we could not solve NLP tasks with GPU acceleration.

I will also raise this issue in the Feedback Assistant app; please comment here if you would also like Apple to fix it.

Sincerely,

hawkiyc

Hi hawkiyc!

Thank you so much for reporting this issue. The team is aware of it, has reproduced it, and is working on a fix. There is no known workaround at this time. The fix will be provided in the upcoming seeds.

Please file a Feedback Assistant ticket and post its number here, so we can update you on the progress.

Have a great day!

I have been experiencing a similar issue while training a GAN.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation loader/GeneratorDataset: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
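
As a hypothetical reconstruction (not my actual training script): the GeneratorDataset op in that message is the op that tf.data.Dataset.from_generator creates, so an input pipeline built under an explicit GPU device scope ends up requesting a placement that op cannot satisfy, roughly like this:

import numpy as np
import tensorflow as tf

def noise_batches():
    # Toy generator standing in for the GAN's data loader.
    while True:
        yield np.random.normal(size=(64,)).astype(np.float32)

with tf.device('/device:GPU:0'):  # explicit GPU scope around the loader
    dataset = tf.data.Dataset.from_generator(
        noise_batches,
        output_signature=tf.TensorSpec(shape=(64,), dtype=tf.float32))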

Any news on when and how this issue will be resolved?

Is there any update on this? Any ETA?

I am seeing this error when training a TensorFlowTTS model on an M1 Mac.

Metal device set to: Apple M1 Max
...
systemMemory: 64.00 GB
maxCacheSize: 21.33 GB
Traceback (most recent call last):
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 528, in <module>
    main()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 516, in main
    trainer.fit(
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 113, in _train_step
    self.one_step_forward(batch)
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gradients/tacotron2/decoder/while_grad/tacotron2/decoder/while/Placeholder_0/accumulator: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Merge: GPU CPU
AddV2: GPU CPU

Same issue when trying to fine-tune the Universal Sentence Encoder (tfhub). CPU training works, though slowly. To be able to train at all, just add tf.config.set_visible_devices([], 'GPU') to hide the GPUs. Any updates on this?
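
For anyone who just needs training to run in the meantime, here is a minimal sketch of that workaround (call it right after importing TensorFlow, before any model or dataset is built):

import tensorflow as tf

# Hide the Metal GPU so every op falls back to its CPU kernel.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: only CPU devices should be listed now.
print(tf.config.get_visible_devices())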

InvalidArgumentError: Cannot assign a device for operation Adam/Adam/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
