The new tensorflow-macos and tensorflow-metal incapacitate training

Not only does upgrading tensorflow-macos and tensorflow-metal break Conv2D with the groups argument, it also prevents training from finishing.

Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook can no longer make progress after training for around 16 minutes.

I tested four times. Each run completed around 17 to 18 epochs, at about 55 seconds per epoch. After that, it just stopped making progress.

I checked Activity Monitor: both CPU and GPU usage were 0 at that point.
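A hang like this (no error, CPU and GPU both idle) is easy to miss when the notebook runs unattended. Below is a minimal sketch of a stall detector in plain Python. `StallDetector`, `beat`, and `stalled` are hypothetical names introduced here for illustration, not part of TensorFlow, and the timeout is an arbitrary assumption.

```python
import time


class StallDetector:
    """Remembers when progress was last reported (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        # Call whenever training makes progress, e.g. at the end of a batch.
        self.last_beat = self.clock()

    def stalled(self):
        # True once no progress has been reported for longer than the timeout.
        return self.clock() - self.last_beat > self.timeout_s


# Example: treat two minutes without a finished batch as a stall
# (epochs took about 55 seconds in the runs described above).
detector = StallDetector(timeout_s=120)
detector.beat()
print(detector.stalled())
```

With Keras, `beat()` could be called from a callback's `on_train_batch_end`, while `stalled()` would have to be polled from a separate thread or process, since the training thread itself is the one that stops.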

I accidentally found that there are a lot of kernel faults in the Console app.

The last one before I force-killed the process:

IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)

The PID 68905 is in fact the training process.
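To track how the count grows over time, the PID and count can be pulled out of such a message. A minimal Python sketch follows; the regular expression is inferred only from the single line quoted above, so other macOS versions may word the message differently, and `parse_leak_line` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one log line quoted in this post (assumption).
LEAK_RE = re.compile(r"PID (\d+) likely leaking IOGPUResource \(count=(\d+)\)")


def parse_leak_line(line):
    """Return (pid, count) for an IOGPUResource leak warning, else None."""
    match = LEAK_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, "
    "struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): "
    "PID 68905 likely leaking IOGPUResource (count=200000)"
)
print(parse_leak_line(sample))
```

Comparing the returned PID with `os.getpid()` inside the notebook would confirm whether the warning refers to the training process.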

I have observed this kind of issue for several months, but it used to be less frequent, and I could restart my notebook and train successfully. No luck today.

Hope Apple engineers can find the cause and fix it.

  • I ran my notebook again and observed that training paused once the count reached 200000.

    So maybe tensorflow-metal is indeed leaking kernel resources, which prevents training from continuing at some point.


Replies

I'm facing the same issue on macOS Ventura with tensorflow-macos 2.11 and tensorflow-metal 0.7.

I also previously had the same problem with training coming to a near-halt mid-epoch. Today, I (again) followed the steps on https://developer.apple.com/metal/tensorflow-plugin/ and installed tensorflow-deps 2.9.0, tensorflow-macos 2.12.0, and tensorflow-metal 0.8.0. So far, I have not experienced any training "deadlocks".
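Since the reports in this thread depend on exact version combinations, it helps to print what is actually installed in the active environment. Here is a minimal sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper name. tensorflow-deps is left out because it is typically installed through conda and may not be visible as a Python distribution (an assumption).

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["tensorflow-macos", "tensorflow-metal"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in the same kernel as the training notebook avoids confusion between multiple conda environments.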

  • Hi @johnny_A,

    Which Python version did you use? I want to give it a try (after months of lost hope with TF).

    Thanks

  • Hi @karbapi, I'm using Python 3.10.10 in a conda environment. Good luck!


Can anyone here confirm or deny that the newest versions of tensorflow-metal are free of this issue? Reading back in this thread, the problem seems to first appear in version 0.5.0 and reportedly disappears starting in 0.8.0, but I'd really appreciate it if any additional people could confirm that.

Thanks.