InvalidArgumentError: Cannot assign a device for operation agent/VerifyFinite/CheckNumerics

Hardware:

  • MacBook Pro (13-inch, M1, 2020)
  • macOS Monterey 12.1

Libs:

  • tensorflow-macos 2.7.0 pypi_0 pypi
  • tensorflow-metal 0.3.0 pypi_0 pypi
  • tensorforce 0.6.5 dev_0

Validate tensorflow:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

...
 % python3 tensorflow_demo.py 
Num GPUs Available:  1

Install tensorforce from source:

git clone https://github.com/tensorforce/tensorforce.git
cd tensorforce
# edit requirements.txt
# -tensorflow == 2.7.0
# +tensorflow-macos == 2.7.0
pip3 install -e .

Copy the Quickstart Example code to tensorforce_demo.py and run it:

$ python3 tensorforce_demo.py
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

WARNING:root:Infinite min_value bound for state.
Traceback (most recent call last):
  File "/Users/derek/Sites/ml/lottoenv/tensorforce_demo.py", line 31, in <module>
    actions = agent.act(states=states)
  File "/Users/derek/Sites/ml/tensorforce/tensorforce/agents/agent.py", line 415, in act
    return super().act(
  File "/Users/derek/Sites/ml/tensorforce/tensorforce/agents/recorder.py", line 262, in act
    actions, internals = self.fn_act(
  File "/Users/derek/Sites/ml/tensorforce/tensorforce/agents/agent.py", line 462, in fn_act
    actions, timesteps = self.model.act(
  File "/Users/derek/Sites/ml/tensorforce/tensorforce/core/module.py", line 136, in decorated
    output_args = function_graphs[str(graph_params)](*graph_args)
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation agent/VerifyFinite/CheckNumerics: Could not satisfy explicit device specification '' because the node {{colocation_node agent/VerifyFinite/CheckNumerics}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=1 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Identity: GPU CPU 
Switch: GPU CPU 
CheckNumerics: CPU 
_Arg: GPU CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  args_0 (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  agent/VerifyFinite/CheckNumerics (CheckNumerics) 
  agent/VerifyFinite/control_dependency (Identity) 
  agent/assert_greater_equal/Assert/AssertGuard/args_0/_16 (Switch) 
  agent/assert_less_equal/Assert/AssertGuard/args_0/_26 (Switch) 
  Func/agent/StatefulPartitionedCall/input/_80 (Identity) /job:localhost/replica:0/task:0/device:GPU:0
  Func/agent/assert_greater_equal/Assert/AssertGuard/then/_10/input/_150 (Identity) 
  Func/agent/assert_greater_equal/Assert/AssertGuard/else/_11/input/_156 (Identity) 
  Func/agent/assert_less_equal/Assert/AssertGuard/then/_20/input/_162 (Identity) 
  Func/agent/assert_less_equal/Assert/AssertGuard/else/_21/input/_168 (Identity) 
  Func/agent/StatefulPartitionedCall/state_preprocessing/PartitionedCall/input/_257 (Identity) /job:localhost/replica:0/task:0/device:GPU:0
  Func/agent/StatefulPartitionedCall/state_preprocessing/PartitionedCall/linear_normalization0/PartitionedCall/input/_350 (Identity) /job:localhost/replica:0/task:0/device:GPU:0

         [[{{node agent/VerifyFinite/CheckNumerics}}]] [Op:__inference_act_1212]

Run the code on CPU as follows:

import tensorflow as tf

with tf.device("/cpu:0"):
    ...  # placeholder: indent the demo code above under this block

This returns no error:

% python3 tensorforce_demo.py
Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

WARNING:root:Infinite min_value bound for state.
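
A variant of the CPU workaround that avoids re-indenting the whole script is to hide the GPU from TensorFlow before anything else runs. This is a minimal sketch using the standard tf.config API. Whether it behaves identically with tensorflow-metal 0.3.0 is an assumption and has not been verified in this thread.

```python
import tensorflow as tf

# This must run before TensorFlow initializes its devices,
# i.e. before any tensor, op, or agent is created.
tf.config.set_visible_devices([], "GPU")

# With no GPU visible, every op is placed on the CPU.
visible_gpus = tf.config.get_visible_devices("GPU")
print("Visible GPUs:", visible_gpus)

# ... the Quickstart agent/environment code would follow here, unchanged
```

The trade-off is the same as with tf.device("/cpu:0"): the whole script runs on the CPU, not only the op that lacks a GPU kernel.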

Hi @Derek C

Thanks for reporting this and providing the sample code. It looks like the CheckNumerics op is not registered for the GPU, which causes this colocation issue. We have added it to the list of missing ops to be implemented, and we will update this thread once a fix is released. In the meantime, unfortunately, the portion of the code containing this op has to run on the CPU, as you noted.
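
To illustrate what running only this op on the CPU looks like in isolation, here is a minimal sketch with a standalone tensor. The helper name check_finite_on_cpu is hypothetical. Tensorforce builds its own VerifyFinite check internally, so this shows the placement idea and is not a patch for the agent itself.

```python
import tensorflow as tf


def check_finite_on_cpu(tensor, message="tensor contains NaN or Inf"):
    """Run the finiteness check on the CPU and return the checked tensor."""
    # Pin only this op to the CPU; ops outside the block keep default placement.
    with tf.device("/CPU:0"):
        return tf.debugging.check_numerics(tensor, message)


states = tf.constant([1.0, 2.0, 3.0])
checked = check_finite_on_cpu(states)
print(checked.device)

# A non-finite input makes the check raise InvalidArgumentError.
try:
    check_finite_on_cpu(tf.constant([1.0, float("nan")]))
    nan_detected = False
except tf.errors.InvalidArgumentError:
    nan_detected = True
print("NaN detected:", nan_detected)
```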

Hi @Derek C, @Frameworks Engineer

I have run into a similar issue. Have you found a way to fix it while still running on the GPU?

Thanks

I tried to run training with a Tensorforce agent today and got the same error. Is there any fix? Many thanks.
