Training a custom dataset model using mask_rcnn_inception from the TensorFlow Model Zoo

I am running into a GPU-related error while working with the latest TensorFlow (2.13). Please note that the test model training provided on the tensorflow-metal page, which I used to verify my setup, works fine.

Please advise:

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
	 [[{{node GatherV2_7}}]]
	 [[MultiDeviceIteratorGetNextFromShard]]
	 [[RemoteCall]] [Op:IteratorGetNext] name: 

The above are the last lines of the error message. The full log from the model training script is below:

https://stackoverflow.com/questions/77076602/training-custom-data-set-model-using-mask-rcnn-inception-from-tensorflow-model-z

I posted it on Stack Overflow because I can't share the full log here due to length restrictions.
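
For anyone reading the message itself: `indices[0] = 0 is not in [0, 0)` is the bounds check of a gather op, and the range `[0, 0)` means the tensor being indexed is empty along that axis. Below is a rough plain-Python sketch of that check. The `gather` helper is hypothetical and only mimics the message format; it is not TensorFlow's implementation.

```python
def gather(params, indices):
    """Pick params[i] for each i in indices, with a gather-style bounds check."""
    size = len(params)
    picked = []
    for pos, idx in enumerate(indices):
        # Valid indices are 0 <= idx < size; an empty params list has no valid index.
        if not 0 <= idx < size:
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {size})")
        picked.append(params[idx])
    return picked


print(gather([10, 20, 30], [2, 0]))  # [30, 10]

try:
    gather([], [0])  # indexing into an empty list
except IndexError as err:
    print(err)  # indices[0] = 0 is not in [0, 0)
```

So something in the input pipeline appears to be handing the gather an empty tensor, at least as far as the last lines of the log show.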

Please help.

Replies

Happy update: although I couldn't run the mask_rcnn_inception training job because of the error described above, my M2 is able to train other models such as faster_rcnn. The issue seems to arise from the fact that some of the models are not TPU compatible, and any device that uses the TensorFlow PluggableDevice mechanism, such as tensorflow-metal, is apparently treated as a TPU rather than a GPU, hence the error.
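
If you want to check whether the failure is tied to the Metal device, one option is to hide the GPU from TensorFlow and rerun the same job on CPU. This is a minimal sketch, assuming TensorFlow 2.x; `hide_gpus` is a hypothetical helper name, and it has to be called before TensorFlow initialises its devices, i.e. at the very top of the training script.

```python
def hide_gpus():
    """Hide all GPU devices so later ops run on CPU.

    Returns the list of GPUs still visible afterwards,
    or None if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    try:
        # An empty device list makes no GPU visible to the runtime.
        tf.config.set_visible_devices([], "GPU")
    except RuntimeError:
        # Devices were already initialised; visibility can no longer change.
        pass
    return tf.config.get_visible_devices("GPU")


remaining = hide_gpus()
print("Visible GPUs after hiding:", remaining)
```

Training on CPU will be much slower, so this is only meant as a quick comparison, not a fix.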

Hope it helps save someone's weekend :)