GPU cannot be assigned properly during NLP tasks.

Dear All Developers,

I previously reported an issue involving the HuggingFace package in thread 683992.

At first, I thought the problem came from HuggingFace. However, after some further tests, it seems to come from TensorFlow-Hub instead.

Here is the thing: I built a BERT fine-tuning model with TF and TF-Hub only, and I got the same error as before.

Here are the details of the error.

InvalidArgumentError: Cannot assign a device for operation AdamWeightDecay/AdamWeightDecay/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU 
ResourceGather: GPU CPU 
AddV2: GPU CPU 
Sqrt: GPU CPU 
Unique: CPU 
ResourceScatterAdd: GPU CPU 
UnsortedSegmentSum: CPU 
AssignVariableOp: GPU CPU 
AssignSubVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
Identity: GPU CPU 
StridedSlice: GPU CPU 
_Arg: GPU CPU 
Const: GPU CPU 

So there is clearly something wrong on the TF side, and I don't think there is a quick fix.
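For what it's worth, the colocation dump above shows that Unique and UnsortedSegmentSum list only CPU kernels, which is presumably why the optimizer's sparse-gradient update cannot be placed on the GPU. As a minimal sketch (not a fix for the optimizer itself), a standalone call to one of those ops succeeds when run under an explicit CPU device scope:

```python
import tensorflow as tf

# Unique has no GPU kernel here (per the colocation dump above),
# so pin this standalone call to the CPU explicitly.
with tf.device('/CPU:0'):
    values, idx = tf.unique(tf.constant([1, 1, 2, 3, 3]))

print(values.numpy())  # [1 2 3]
print(idx.numpy())     # [0 0 1 2 2]
```

This only demonstrates the per-op placement; the error in the report happens because the optimizer graph colocates these CPU-only ops with GPU-pinned variables.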

Since transformers and related models are so powerful in the NLP area, it would be a great shame if we could not run NLP tasks with GPU acceleration.

I will raise this issue through the Feedback Assistant app as well. Please comment here if you would also like Apple to fix it.

Sincerely,

hawkiyc


Replies

Hi hawkiyc!

Thank you so much for reporting this issue. The team is aware of it, has reproduced it, and is working on a fix. There is no known workaround at this time. The fix will be provided in an upcoming seed.

Please file a Feedback Assistant ticket and post its number here, so we can update you on progress.

Have a great day!

  • Dear Eugene,

    I submitted the feedback last night; the number is FB9220496. I have also attached the requirements.txt file from my environment and two .py files of NLP models to the feedback. Those files should help you reproduce this issue more easily.

    Sincerely,

    hawkiyc

  • Hi, Eugene. I ran a new test, fine-tuning BERT without the TF-Hub package, yet it still does not work. It looks like neither TF-Hub nor TF_model_official gets the device assignment right. I have also updated the feedback with the test .py file and hope it helps.

  • Hi hawkiyc!

    Yes, we are aware of this issue and are already working on a fix. We will update the Feedback Assistant ticket once the fix is included in a macOS 12 beta seed.

    Thanks!

I have been experiencing a similar issue while training a GAN.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation loader/GeneratorDataset: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

Any news on when and how this issue will be solved?
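In the meantime, one general mitigation sometimes suggested for this class of placement error is soft device placement, which lets TensorFlow fall back to the CPU when an op has no GPU kernel instead of raising InvalidArgumentError. Whether it actually helps under tensorflow-metal is an assumption I have not verified:

```python
import tensorflow as tf

# Allow ops without a GPU kernel to fall back to the CPU instead of
# failing with InvalidArgumentError. Call this before building the model.
tf.config.set_soft_device_placement(True)

print(tf.config.get_soft_device_placement())  # True
```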

  • Hi! Same issue for me when trying to fit a pre-trained EfficientNetB7.

    InvalidArgumentError: Cannot assign a device for operation sequential/efficientnetb7/stem_conv/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node sequential/efficientnetb7/stem_conv/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. Colocation Debug Info: Colocation group had the following types and supported devices: Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[] ResourceApplyAdaMax: CPU ReadVariableOp: GPU CPU _Arg: GPU CPU

Is there any update on this? Any ETA?

I am seeing this error when training the TensorFlowTTS model on a Mac with an M1 chip.

Metal device set to: Apple M1 Max
...
systemMemory: 64.00 GB
maxCacheSize: 21.33 GB

Traceback (most recent call last):
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 528, in <module>
    main()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 516, in main
    trainer.fit(
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 113, in _train_step
    self.one_step_forward(batch)
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gradients/tacotron2/decoder/while_grad/tacotron2/decoder/while/Placeholder_0/accumulator: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Merge: GPU CPU 
AddV2: GPU CPU 

Same here when trying to fine-tune the Universal Sentence Encoder (TF-Hub). CPU training works, though slowly. To be able to train, just add tf.config.set_visible_devices([], 'GPU') to hide the GPUs. Any updates on this?

InvalidArgumentError: Cannot assign a device for operation Adam/Adam/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available. Colocation Debug Info: Colocation group had the following types and supported devices: Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
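Spelling out the hide-the-GPU workaround mentioned above as a complete snippet; note it must run before any op touches the GPU:

```python
import tensorflow as tf

# Workaround from this thread: hide the GPU entirely so every op is
# placed on the CPU. Must be called before any tensor is created.
tf.config.set_visible_devices([], 'GPU')

print(tf.config.get_visible_devices('GPU'))  # []
```

Training then proceeds on the CPU only, so this trades the placement error for speed rather than fixing it.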