The new tensorflow-macos and tensorflow-metal incapacitate training

Question

Created May ’22

Upgrading tensorflow-macos and tensorflow-metal not only breaks Conv2D with the groups argument, it also prevents training from finishing.

Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook can no longer make progress after training for around 16 minutes.

I tested 4 times. Each time it ran happily for around 17 to 18 epochs, each taking around 55 seconds. After that, it just stopped making progress.

I checked Activity Monitor: both CPU and GPU usage were 0 at that point.

I then happened to notice a lot of kernel faults in the Console app.

The last one before I force-killed the process:

IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)

The PID 68905 is in fact the training process.
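For anyone who wants to check whether a flagged PID matches their own training process, the Console message can be parsed with a small script. This is only a minimal sketch: the function name `parse_leak_warning` and the regular expression are illustrative assumptions inferred from the single log line quoted above, not from any documented log format.

```python
import re

# Pattern inferred from the one log line quoted in this post (an assumption);
# the actual kernel message format may differ between macOS versions.
LEAK_PATTERN = re.compile(
    r"PID (\d+) likely leaking IOGPUResource \(count=(\d+)\)"
)


def parse_leak_warning(line):
    """Return (pid, count) if the line looks like the leak warning, else None."""
    match = LEAK_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The log line quoted above, split across string literals for readability.
sample = (
    "IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, "
    "struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): "
    "PID 68905 likely leaking IOGPUResource (count=200000)"
)

print(parse_leak_warning(sample))
```

The extracted PID can then be compared with the value of `os.getpid()` printed from inside the notebook.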

I have observed this kind of issue for several months, but it was less frequent, and I could restart my notebook and train successfully. No luck today.

I hope Apple engineers can find the cause and fix it.

Answer 1

Thalesian OP

Jan ’23

Can confirm that Python 3.9 + tensorflow-macos 2.8 + tensorflow-metal 0.4.0 is the combination you want in order to avoid the deadlock/freezing issue. The model ran successfully overnight.
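When comparing setups like this, it helps to print exactly which versions are installed in the active environment. A minimal sketch using only the standard library follows; the helper names `installed_version` and `report_versions` are hypothetical, and the package names are simply the ones mentioned in this thread.

```python
from importlib import metadata

# Package names as they are written in this thread.
PACKAGES = ("tensorflow-macos", "tensorflow-metal")


def installed_version(name):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=PACKAGES):
    """Map each package name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in packages}


for name, version in report_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter the notebook kernel uses, since a different environment may report different versions.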

Answer 2

deketh OP

Jan ’23

I'm facing the same issue on macOS Ventura with tensorflow-macos 2.11 and tensorflow-metal 0.7.

Answer 3

johnny_A OP

Apr ’23

I also previously had the same problem with training coming to a near-halt mid-epoch. Today, I (again) followed the steps on https://developer.apple.com/metal/tensorflow-plugin/ and installed tensorflow-deps 2.9.0, tensorflow-macos 2.12.0, and tensorflow-metal 0.8.0. So far, I have not experienced any training "deadlocks".

Answer 4

wbattel4607 OP

Jul ’23

Can anyone here confirm or deny that the newest versions of tensorflow-metal are free of this issue? Reading back through this thread, the problem seems to first appear in version 0.5.0 and reportedly disappears starting in 0.8.0, but I'd really appreciate it if any additional people could confirm that.

Thanks.
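The range described above can be written down as a quick check. This is a rough sketch that encodes only what posters in this thread have reported (first seen in 0.5.0, reportedly gone from 0.8.0); the bounds are not taken from any release notes, and the function names are hypothetical.

```python
def parse_version(text):
    """Turn '0.8.0' or '0.7' into a tuple of ints, padded to three parts."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


# Bounds based solely on reports in this thread (an assumption, not a fact
# confirmed by Apple).
FIRST_REPORTED_BAD = parse_version("0.5.0")
FIRST_REPORTED_GOOD = parse_version("0.8.0")


def in_reported_range(metal_version):
    """True if the tensorflow-metal version falls in the reported problem range."""
    v = parse_version(metal_version)
    return FIRST_REPORTED_BAD <= v < FIRST_REPORTED_GOOD


# Versions mentioned by posters in this thread.
for v in ("0.4.0", "0.5.0", "0.7", "0.8.0"):
    print(v, in_reported_range(v))
```

A result of False here only means nobody in this thread reported the freeze for that version, not that the version is confirmed fixed.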
