The new tensorflow-macos and tensorflow-metal make training hang

Not only does upgrading tensorflow-macos and tensorflow-metal break Conv2D with the groups argument, it also makes training unable to finish.

Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook can no longer make progress after training for around 16 minutes.

I tested 4 times. It ran happily for around 17 to 18 epochs, each epoch taking around 55 seconds. After that, it simply stopped making progress.

I checked Activity Monitor; both CPU and GPU usage were 0 at that point.

I then happened to notice a lot of kernel fault messages in the Console app.

The last one before I force-killed the process:

IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)

The PID 68905 is in fact the training process.

I have observed this kind of issue occasionally over the past several months, but it was not as frequent, and restarting my notebook let training finish successfully. No luck today.

Hope Apple engineers can find the cause and fix it.

  • I ran my notebook again and observed that training paused once the count reached 200000.

    So maybe tensorflow-metal is indeed leaking kernel resources, which prevents training from continuing at some point.


Replies

Hi @wangcheng

Thanks for reporting this issue! Can you confirm whether this issue was reproduced with the same sample script you provided in the other thread? The leak seems to be related to a specific op, so that would help us narrow the search.

No, that script was for reproducing the groups arg issue only.

Here is a script to reproduce the problem. You need to obtain the data from Kaggle first: https://www.kaggle.com/competitions/tabular-playground-series-may-2022/data

On my MacBook Pro, after running the script for around 10 minutes, I started to see the leaking IOGPUResource messages in Console.app.

Hardware info: 14" M1 Pro, 32 GB, 10-core CPU, 16-core GPU

Software info: macOS 12.4, python 3.9.7, tensorflow-macos 2.9.2, tensorflow-metal 0.5.0

If I train the model on the CPU, training finishes, and surprisingly each epoch is nearly 20% faster.
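For anyone who wants to try the same CPU fallback, a minimal sketch (assuming a standard tensorflow-macos install) is to hide the GPU from TensorFlow before any op is placed on a device:

```python
import tensorflow as tf

# Hide the Metal GPU so TensorFlow falls back to the CPU.
# This must run before any op has been placed on a device.
tf.config.set_visible_devices([], "GPU")

# Sanity check: no GPU devices should remain visible.
print(tf.config.get_visible_devices("GPU"))  # []
```

Everything after this point, including model.fit, will run on the CPU without any other changes to the training script.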


I think I'm facing the same issue, but with a transformer model (https://developer.apple.com/forums/thread/703081?answerId=719068022#719068022)

I've also had this issue with this network on TF 2.9.1 (both tensorflow-macos and built from source), with tensorflow-metal 0.4.1 and 0.5.0. It can be mitigated by restarting the Python script every 10-30 epochs (usually corresponding to about 20 minutes of run time), transferring the weights, and then resuming training, but sometimes it hangs right off the bat. Eventually I had to switch to CPU training, and I can concur that it seems to be faster for some reason.
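The restart-and-resume workaround described above can be sketched roughly like this; the model architecture, checkpoint path, and data names are placeholders, not the actual network from the thread:

```python
import os
import tensorflow as tf

CKPT = "resume.weights"  # hypothetical checkpoint path

def make_model():
    # Placeholder architecture; substitute the real network here.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])

model = make_model()
model.compile(optimizer="adam", loss="mse")

# On a fresh launch, reload the weights saved before the previous hang.
if os.path.exists(CKPT + ".index"):  # TF-format checkpoints write an .index file
    model.load_weights(CKPT)

# Save weights every epoch so a hang costs at most one epoch of work;
# pass initial_epoch on the next run to keep the epoch count continuous.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(CKPT, save_weights_only=True)
# model.fit(x_train, y_train, epochs=100, initial_epoch=last_epoch,
#           callbacks=[ckpt_cb])
```

Killing and relaunching the process this way releases the leaked IOGPUResource handles, at the cost of re-initializing TensorFlow each time.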

Hardware info: Mac Studio with M1 Max, 10 CPU cores, 24 GPU cores, 32 GB RAM

Software info: macOS 12.4, Python 3.9.13 (homebrew)

I also get this issue when training a PointNet model (https://keras.io/examples/vision/pointnet/). My setup is tensorflow-macos 2.9.2, tensorflow-metal 0.5.0, and Python 3.9.13 on a Mac Studio M1 Max with 64 GB of memory running macOS 12.4. For the time being I've reverted to an older Ubuntu machine for training. I could just use CPU training on the M1, I suppose.

It would be great to get this resolved because I love using the M1 for everything. It's so fast and quiet.