Could someone please give me an update on this?
This error prevents me from running my code, and I'm sure it will affect anyone on an M1 Mac who needs to use tensorflow_addons.
In summary:
1- My finetuning code ran without problems on my old Mac (it loads a previously trained TCN model and creates one finetuned model per file in the dataset);
2- When I bought the new M1, almost 1 year ago, the same code started producing the following error:
Cannot assign a device for operation model/conv_1_convolution/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model/conv_1_convolution/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Equal: CPU
AssignSubVariableOp: GPU CPU
AssignVariableOp: GPU CPU
GreaterEqual: GPU CPU
FloorDiv: CPU
Sqrt: GPU CPU
NoOp: GPU CPU
Pow: GPU CPU
Mul: CPU
Cast: GPU CPU
Identity: GPU CPU
SelectV2: GPU CPU
ReadVariableOp: GPU CPU
RealDiv: GPU CPU
Sub: GPU CPU
AddV2: GPU CPU
Const: GPU CPU
Square: GPU CPU
_Arg: GPU CPU
3- I worked around that problem by calling tf.config.set_soft_device_placement(True) and wrapping everything in with tf.device('/cpu:0'): before any call to TensorFlow. However, during long finetuning sessions I inevitably get, at some random point, the error I reported in the first post (i.e. "Incompatible shapes: [0] vs. [5,40,20]", with the reported shapes varying).
4- I've tried 2 different versions of tf + tf-addons (separate conda environments) and got the same type of errors in both, probably more frequently in the pylast environment. You can see the environment.yml attached.
pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0
5- The developers of tensorflow_addons have been really helpful (see https://github.com/tensorflow/addons/issues/2578), but they said "As tensorflow-macos and tensorflow-metal are closed source packages we cannot do anything here in the case we cannot reproduce the issue on another platform."
6- The code finetunes a TCN network on specific audio files (with annotations), so you really need this data to debug it. Furthermore, the problem happens during long runs, so the dataset must be big for you to run into the issue.
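As a side note, the colocation debug info from point 2 can be summarized with a few lines of Python to see which op types list CPU as their only supported device. This is only a rough sketch: the function name cpu_only_ops is made up for illustration, and the log string is simply the "Op: devices" lines copied from the error above.

```python
def cpu_only_ops(debug_info):
    """Return op types whose supported-device list does not include GPU."""
    result = []
    for line in debug_info.splitlines():
        # Each relevant line looks like "OpType: GPU CPU" or "OpType: CPU".
        name, sep, devices = line.partition(":")
        if not sep:
            continue
        supported = devices.split()
        if supported and "GPU" not in supported:
            result.append(name.strip())
    return result


# Lines copied from the colocation debug info in the error message.
log = """\
Equal: CPU
AssignSubVariableOp: GPU CPU
AssignVariableOp: GPU CPU
GreaterEqual: GPU CPU
FloorDiv: CPU
Sqrt: GPU CPU
NoOp: GPU CPU
Pow: GPU CPU
Mul: CPU
Cast: GPU CPU
Identity: GPU CPU
SelectV2: GPU CPU
ReadVariableOp: GPU CPU
RealDiv: GPU CPU
Sub: GPU CPU
AddV2: GPU CPU
Const: GPU CPU
Square: GPU CPU
_Arg: GPU CPU
"""

print(cpu_only_ops(log))
```

For the log above this prints ['Equal', 'FloorDiv', 'Mul'], so those three appear to be the op types in this colocation group that are reported without GPU support.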
To reproduce the issue, please clone the repository at https://github.com/MR-T77/M1_tf_problems (~1.4GB) and extract it to the same path as the py file.
You will see:
a stripped-down version of my code (problem_TCNv2.py) - just run it as it is; go into run_me() to change the dataset or the data augmentation;
2 datasets, one big and one small. The code breaks during long runs, so I'm pretty sure the issue will appear if you run the code with the big dataset;
pretrained_model.h5 - the pretrained network model;
yml files of the conda environments that I tested.
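To double-check which package versions are actually active after creating an environment from one of those yml files, something like the following can be run inside it. This is a minimal sketch using only the standard library (importlib.metadata, Python 3.8+); the helper name installed_versions is made up for illustration, and the package names are the ones listed in point 4.

```python
from importlib import metadata

# Distribution names as listed for the two conda environments above.
PACKAGES = ("tensorflow-macos", "tensorflow-metal", "tensorflow-addons")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output together with the error makes it easier to tell which of the two environments a given failure came from.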
I really hope you can figure out what the problem is, as I really need this code to work.
Please keep me posted.