GPU cannot be assigned properly during NLP tasks.

Dear All Developers,

I previously reported an issue with the HuggingFace package in thread 683992.

At first, I thought the problem came from HuggingFace. However, after some further tests, it seems to result from TensorFlow-Hub instead.

Here is the thing: I built a BERT fine-tuning model with TF and TF-Hub only, and I got the same error as before.
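
For reference, here is a minimal sketch of the kind of setup involved. The TF Hub handles, the classification head, and the hyperparameters are illustrative placeholders rather than my exact script, and the optimizer comes from the tf-models-official package (the AdamWeightDecay named in the error below).

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the BERT preprocessing model
from official.nlp import optimization  # create_optimizer builds AdamWeightDecay

# Illustrative TF Hub handles; any matching preprocessing/encoder pair behaves the same way.
preprocess_handle = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_handle = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2"

# Build a small classifier on top of the pooled BERT output.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = hub.KerasLayer(preprocess_handle)(text_input)
encoder_outputs = hub.KerasLayer(encoder_handle, trainable=True)(encoder_inputs)
logits = tf.keras.layers.Dense(2)(encoder_outputs["pooled_output"])
model = tf.keras.Model(text_input, logits)

# AdamWeightDecay with a linear warmup/decay schedule, as in the official BERT examples.
optimizer = optimization.create_optimizer(
    init_lr=3e-5, num_train_steps=1000, num_warmup_steps=100, optimizer_type="adamw")

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# model.fit(train_ds, epochs=3)  # this is the step that raises the error below on the GPU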

Here are the details of the error.

InvalidArgumentError: Cannot assign a device for operation AdamWeightDecay/AdamWeightDecay/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
RealDiv: GPU CPU
ResourceGather: GPU CPU
AddV2: GPU CPU
Sqrt: GPU CPU
Unique: CPU
ResourceScatterAdd: GPU CPU
UnsortedSegmentSum: CPU
AssignVariableOp: GPU CPU
AssignSubVariableOp: GPU CPU
ReadVariableOp: GPU CPU
NoOp: GPU CPU
Mul: GPU CPU
Shape: GPU CPU
Identity: GPU CPU
StridedSlice: GPU CPU
_Arg: GPU CPU
Const: GPU CPU

So, clearly, something is wrong on the TF side, and I don't think there is a quick fix.

Since transformers and related models are so powerful in the NLP area, it would be a great shame if we could not solve NLP tasks with GPU acceleration.

I will also raise this issue in the Feedback Assistant app; please comment here if you would also like Apple to fix it.

Sincerely,

hawkiyc

Hi hawkiyc!

Thank you so much for reporting this issue. The team is aware of it, has reproduced it, and is working on a fix. There is no known workaround at this time. The fix will be provided in the upcoming seeds.

Please file a Feedback Assistant ticket and post its number here, so we can update you on the progress.

Have a great day!

I have been experiencing a similar issue while training a GAN.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation loader/GeneratorDataset: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
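
As a hypothetical reconstruction (not my actual training script): the GeneratorDataset op in that message is the op that tf.data.Dataset.from_generator creates, so an input pipeline built under an explicit GPU device scope ends up requesting a placement that op cannot satisfy, roughly like this:

import numpy as np
import tensorflow as tf

def noise_batches():
    # Toy generator standing in for the GAN's data loader.
    while True:
        yield np.random.normal(size=(64,)).astype(np.float32)

with tf.device('/device:GPU:0'):  # explicit GPU scope around the loader
    dataset = tf.data.Dataset.from_generator(
        noise_batches,
        output_signature=tf.TensorSpec(shape=(64,), dtype=tf.float32))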

Any news on when and how this issue will be resolved?

Is there any update on this? Any ETA?

I am seeing this error when training a TensorFlowTTS model on an M1 Mac.

Metal device set to: Apple M1 Max
...
systemMemory: 64.00 GB
maxCacheSize: 21.33 GB
Traceback (most recent call last):
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 528, in <module>
    main()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 516, in main
    trainer.fit(
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 1010, in fit
    self.run()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 104, in run
    self._train_epoch()
  File "/Users/bemnet.merha/P4/TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py", line 126, in _train_epoch
    self._train_step(batch)
  File "/Users/bemnet.merha/P4/TensorFlowTTS/./examples/tacotron2/train_tacotron2.py", line 113, in _train_step
    self.one_step_forward(batch)
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/bemnet.merha/miniforge3/envs/tensorflow/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gradients/tacotron2/decoder/while_grad/tacotron2/decoder/while/Placeholder_0/accumulator: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Merge: GPU CPU
AddV2: GPU CPU

Same issue when trying to fine-tune the Universal Sentence Encoder (tfhub). CPU training works, though slowly. To be able to train at all, just add tf.config.set_visible_devices([], 'GPU') to hide the GPUs. Any updates on this?
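
For anyone who just needs training to run in the meantime, here is a minimal sketch of that workaround (call it right after importing TensorFlow, before any model or dataset is built):

import tensorflow as tf

# Hide the Metal GPU so every op falls back to its CPU kernel.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: only CPU devices should be listed now.
print(tf.config.get_visible_devices())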

InvalidArgumentError: Cannot assign a device for operation Adam/Adam/update/Unique: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
