Unable to use tensorflow addons on Mac M1

Question

Created May ’22

First of all, I understand that this problem involves tensorflow addons, so I've already been in contact with the tfa developers (https://github.com/tensorflow/addons/issues/2578). The issue only happens on M1, so they think it has to do with Apple's tensorflow-metal.

I've been getting spurious errors during model.fit with the Lookahead optimizer. I'm fine-tuning on big datasets, and my code breaks while fitting to different files in a non-reproducible way: each time I run it, it breaks on a different file and on a different operation. The errors are clearly related to the Lookahead optimizer. I've tried 2 different versions of tf+tfaddons (conda environments) and got the same type of errors, probably more frequently with the pylast conda environment:

pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0
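
A quick way to confirm these version numbers from inside each conda environment is a small stdlib-only script. This is only a sketch; the function name installed_versions is made up for illustration, and the package names are the ones listed above.

```python
from importlib import metadata

# Distribution names as listed in the two environments above.
PACKAGES = ("tensorflow-macos", "tensorflow-metal", "tensorflow-addons")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once per environment gives a report that can be pasted next to the error output.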

The base code is always the same. I use tf.config.set_soft_device_placement(True) and wrap every call to tensorflow in with tf.device('/cpu:0'):, otherwise I get errors. As explained above, my code just loads a model and fine-tunes it to each file of a dataset.
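
The workaround described above could look roughly like the sketch below. The helper name fit_on_cpu and its arguments are hypothetical; only tf.config.set_soft_device_placement and tf.device come from the post. Whether this fully keeps work off the GPU probably depends on where the model's variables were created, which is presumably why the post wraps every tensorflow call and not just model.fit.

```python
def fit_on_cpu(model, dataset, **fit_kwargs):
    """Run model.fit inside a CPU device scope (hypothetical helper)."""
    # Imported here so the helper can be defined before TensorFlow is loaded.
    import tensorflow as tf

    # Allow TensorFlow to pick another device when an op cannot run on the
    # requested one, instead of raising a placement error.
    tf.config.set_soft_device_placement(True)

    # Ops created or executed inside this scope are requested on the CPU.
    with tf.device("/cpu:0"):
        return model.fit(dataset, **fit_kwargs)
```

A call such as fit_on_cpu(model, ft, epochs=10) would then stand in for the plain model.fit call.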

Here is a pair of example error outputs (obtained with the pylast conda environment):

File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
    history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'Lookahead/Lookahead/update_64/mul_11' defined at (most recent call last):
    
    File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
      history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 893, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 539, in minimize
      return self.apply_gradients(grads_and_vars, name=name)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 104, in apply_gradients
      return super().apply_gradients(grads_and_vars, name, **kwargs)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 678, in apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 723, in _distributed_apply
      update_op = distribution.extended.update(
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 706, in apply_grad_to_update_var
      update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 130, in _resource_apply_dense
      train_op = self._optimizer._resource_apply_dense(
    File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/rectified_adam.py", line 249, in _resource_apply_dense
      coef["r_t"] * m_corr_t / (v_corr_t + coef["epsilon_t"]),
Node: 'Lookahead/Lookahead/update_64/mul_11'
Incompatible shapes: [0] vs. [5,40,20]
	 [[{{node Lookahead/Lookahead/update_64/mul_11}}]] [Op:__inference_train_function_30821]

and

Another error output
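
For reference, the "Incompatible shapes: [0] vs. [5,40,20]" message in the first trace means the two operands of the multiplication cannot be broadcast together, because one of them arrived as an empty tensor of shape [0]. The sketch below illustrates the NumPy-style broadcasting rule that element-wise ops follow; the function name broadcast_compatible is made up for illustration.

```python
def broadcast_compatible(shape_a, shape_b):
    """Return True if two shapes can be broadcast together.

    The shapes are aligned from the last dimension, and each pair of
    sizes must either be equal or contain a 1.
    """
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True


print(broadcast_compatible([0], [5, 40, 20]))   # False: the shapes from the error
print(broadcast_compatible([20], [5, 40, 20]))  # True: a vector matching the last axis
print(broadcast_compatible([], [5, 40, 20]))    # True: a scalar
```

A scalar or a tensor shaped like the variable would pass this check, so the [0] operand is the unexpected one.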

Answer 1

mrt77 OP

Jun ’22

Sorry for insisting, but this issue prevents me from using tensorflow, which I really need.

Answer 2

Frameworks Engineer OP

Apple

Jun ’22

Hi @mrt77

Thanks for reporting the issue! Do you have a sample script we could use to reproduce this issue locally? That would speed up the debugging process. Based on the error trace, though, it does seem that something on our side returns unexpected shapes.

Answer 3

mrt77 OP

Jun ’22

Hi there! That will be difficult, given that these are spurious errors during model.fit with the Lookahead optimizer (I'm fine-tuning on big datasets, and my code breaks while fitting to different files in a non-reproducible way, i.e. each run breaks on a different file and on different operations). So the only way for me to share this would be to trim down my code a bit (it will still be big) and also send you one of the datasets (>2G), to be sure it also breaks on your side. I don't see any other way to share this with you. Is that ok? I'm asking because this will take me some hours, time that I don't really have, but I will do it if you can look at the code I send. I'll wait for your feedback.
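
Since the failures move between files and operations, one option is to record every failure during a long run and compare the records across runs. The sketch below shows the idea under stated assumptions: run_and_record and finetune_one are hypothetical names, and finetune_one stands in for whatever per-file fine-tuning routine the real code uses.

```python
import traceback


def run_and_record(files, finetune_one):
    """Call finetune_one(path) for each file and collect any failures.

    Both arguments are placeholders: `files` is any iterable of paths and
    `finetune_one` is the per-file fine-tuning routine.
    """
    failures = []
    for path in files:
        try:
            finetune_one(path)
        except Exception as exc:  # keep going so later files still run
            failures.append({
                "file": path,
                "error": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            })
    return failures
```

The returned list can be written to disk after each run, which gives a file-by-file record of which node and shapes were involved each time.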

Answer 4

mrt77 OP

Jun ’22

Please let me know something about this. This error makes me unable to run my code, and I'm sure the same will happen to anyone on M1 who needs to use tensorflow_addons. In summary:

1- I had fine-tuning code running without problems on my old Mac (it loads a previously trained TCN model and creates a fine-tuned model per file in the dataset);

2- When I bought the new M1, almost 1 year ago, the same code started producing the following error:

Cannot assign a device for operation model/conv_1_convolution/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model/conv_1_convolution/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Equal: CPU 
AssignSubVariableOp: GPU CPU 
AssignVariableOp: GPU CPU 
GreaterEqual: GPU CPU 
FloorDiv: CPU 
Sqrt: GPU CPU 
NoOp: GPU CPU 
Pow: GPU CPU 
Mul: CPU 
Cast: GPU CPU 
Identity: GPU CPU 
SelectV2: GPU CPU 
ReadVariableOp: GPU CPU 
RealDiv: GPU CPU 
Sub: GPU CPU 
AddV2: GPU CPU 
Const: GPU CPU 
Square: GPU CPU 
_Arg: GPU CPU

3- I avoided the previous problem by setting tf.config.set_soft_device_placement(True) and wrapping every call to tensorflow in with tf.device('/cpu:0'):, but when I do long fine-tuning sessions, inevitably at some random point I get the error I reported in the 1st post (i.e. "Incompatible shapes: [0] vs. [5,40,20]", with varying shapes).

4- I've tried 2 different versions of tf+tfaddons (conda environments) and got the same type of errors, probably more frequently with the pylast conda environment. You can see the environment.yml attached.

pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0

5- The developers of tensorflow_addons have been really helpful (https://github.com/tensorflow/addons/issues/2578), but they said "As tensorflow-macos and tensorflow-metal are closed source packages we cannot do anything here in the case we cannot reproduce the issue on another platform."

6- The code fine-tunes a TCN network to specific audio files (with annotations), so you really need this data to debug it. Furthermore, the problem happens on long runs, so the dataset must be big for you to run into the issue.
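
Regarding the colocation error in item 2: the debug info shows the group was requested on GPU:0, while a few op types in it (Equal, FloorDiv, Mul) list CPU as their only supported device, which leaves possible_devices_ empty. A small sketch for pulling those CPU-only ops out of such a dump is below; the function name cpu_only_ops is made up for illustration, and the sample text is an excerpt of the lines quoted in item 2.

```python
import re

# Excerpt of the "Colocation Debug Info" quoted in item 2.
DEBUG_INFO = """\
Equal: CPU
AssignSubVariableOp: GPU CPU
GreaterEqual: GPU CPU
FloorDiv: CPU
Sqrt: GPU CPU
Mul: CPU
ReadVariableOp: GPU CPU
"""

# Matches lines such as "Mul: CPU" or "Sqrt: GPU CPU".
OP_LINE = re.compile(r"^(\w+):\s+((?:GPU|CPU)(?:\s+(?:GPU|CPU))*)\s*$")


def cpu_only_ops(debug_info):
    """Return the op types whose supported-device list does not include GPU."""
    ops = []
    for line in debug_info.splitlines():
        match = OP_LINE.match(line)
        if match and "GPU" not in match.group(2).split():
            ops.append(match.group(1))
    return ops


print(cpu_only_ops(DEBUG_INFO))  # ['Equal', 'FloorDiv', 'Mul']
```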

So, please clone https://github.com/MR-T77/M1_tf_problems (~1.4GB) and extract it to the same path as the py file. You will see:

- a stripped-down version of my code (problem_TCNv2.py): just run it as it is; go into run_me() to change the dataset or data augmentation;
- 2 datasets, one big and one small. The code breaks when I'm doing long runs, so I'm pretty sure that if you run the code with the big dataset, the issue will appear;
- pretrained_model.h5: the pretrained network model;
- yml files of the conda environments that I tested.

I really hope you can figure out what the problem is, as I really need this code to work. Please keep me posted.