Huge memory leakage issue with tf.keras.models.predict()

Comparison between a Mac Studio M1 Ultra (20-core CPU, 64-core GPU, 128GB RAM) and a 2017 Intel i5 MacBook Pro (16GB RAM) for the subject matter, i.e. memory leakage while using tf.keras.models.predict() on a saved model, on both machines:

MBP-2017: the first prediction takes around 10MB and subsequent calls ~0-1MB each.

MACSTUDIO-2022: the first prediction takes around 150MB and subsequent calls ~70-80MB each.

After, say, 10000 such calls to predict(), my MBP's memory usage stays under 10GB, while the MACSTUDIO climbs to ~80GB (and keeps counting up with a higher number of calls).

Even using keras.backend.clear_session() after each call on MACSTUDIO did not help.

Can anyone with insight into tensorflow-metal and/or Mac M1 machines help?

Thanks, Bapi

  • Moreover, when I turned on multiprocessing in the predict() call as below, it only uses 4 cores (as seen in Activity Monitor's CPU history). The rest of the cores are idle!

    predict(data, max_queue_size=10, workers=8, use_multiprocessing=True)


Replies

Hi @karbapi!

Thanks for reporting this issue. While the "base" level of memory reserved for running TF is expected to vary from machine to machine based on the total memory available at the moment, the trend of memory demand growing throughout the execution of the program certainly is not. Would you be able to share a script that reproduces this behavior so we can triage which ops are responsible for the memory leak?

Hi,

Please see the output of memory_profiler below (for example, from the first call to predict() at ~2.3GB up to the nth call at ~30.6GB). As mentioned in my previous comment, it was going up to ~80GB and still climbing. Sorry, I could not share the code, but I can tell you that it is pretty straightforward to recreate; see the sketch below. Any help would be appreciated. Thanks!
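For context, a minimal sketch of how this can be recreated (this is not the poster's actual code; the saved-model path, input shape, and call count are placeholders):

    import os
    import numpy as np
    import psutil                       # used here only to log process memory (RSS)
    import tensorflow as tf

    model = tf.keras.models.load_model("my_saved_model")    # placeholder path
    proc = psutil.Process(os.getpid())
    x = np.random.rand(1, 128).astype("float32")             # placeholder input shape

    for i in range(10000):
        _ = model.predict(x, verbose=0)
        tf.keras.backend.clear_session()                      # does not stop the growth with tf-metal
        if i % 100 == 0:
            print(f"call {i}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")

On the MBP the printed RSS stays roughly flat after the first call, while on the Mac Studio it keeps climbing.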

############################################### # First instance of leakage (see the GNN.predict() line below):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    29   2337.5 MiB   2337.5 MiB           1   @profile
    30                                         def predict_hpwl(graph, graph_label, model):
    31   2337.5 MiB      0.0 MiB           1       lindex = range(len([graph_label]))
    32   2337.6 MiB      0.0 MiB           2       gdata = DataGenerator("Prediction",graphs=[graph],
    33   2337.5 MiB      0.0 MiB           1                                    labels=[graph_label],
    34   2337.5 MiB      0.0 MiB           1                                    indices=lindex,
    35   2337.5 MiB      0.0 MiB           1                                    shuffle=True,
    36   2337.5 MiB      0.0 MiB           1                                    cache_size=10,
    37   2337.5 MiB      0.0 MiB           1                                    debug=False,
    38   2337.5 MiB      0.0 MiB           1                                    isSparse=True)
    39
    40                                             ## Test the GNN
    41   2487.5 MiB    149.9 MiB           2       hpwl = GNN.predict(gdata,
    42   2337.6 MiB      0.0 MiB           1               max_queue_size=10,
    43   2337.6 MiB      0.0 MiB           1               workers=8,
    44   2337.6 MiB      0.0 MiB           1               use_multiprocessing=True
    45                                                     )
    46
    47
    48   2486.5 MiB     -1.0 MiB           1       keras.backend.clear_session()
    49
    50
    51   2486.5 MiB      0.0 MiB           1       return hpwl

############################################### # n'th instance of leakage (see the GNN.predict() line below):

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    29  30661.9 MiB  30661.9 MiB           1   @profile
    30                                         def predict_hpwl(graph, graph_label, model):
    31  30661.9 MiB      0.0 MiB           1       lindex = range(len([graph_label]))
    32  30661.9 MiB      0.0 MiB           2       gdata = DataGenerator("Prediction",graphs=[graph],
    33  30661.9 MiB      0.0 MiB           1                                    labels=[graph_label],
    34  30661.9 MiB      0.0 MiB           1                                    indices=lindex,
    35  30661.9 MiB      0.0 MiB           1                                    shuffle=True,
    36  30661.9 MiB      0.0 MiB           1                                    cache_size=10,
    37  30661.9 MiB      0.0 MiB           1                                    debug=False,
    38  30661.9 MiB      0.0 MiB           1                                    isSparse=True)
    39
    40                                             ## Test the GNN
    41  30720.0 MiB     58.1 MiB           2       hpwl = GNN.predict(gdata,
    42  30661.9 MiB      0.0 MiB           1               max_queue_size=10,
    43  30661.9 MiB      0.0 MiB           1               workers=8,
    44  30661.9 MiB      0.0 MiB           1               use_multiprocessing=True
    45                                                     )
    46
    47
    48  30720.0 MiB     -0.0 MiB           1       keras.backend.clear_session()
    49
    50
    51  30720.0 MiB      0.0 MiB           1       return hpwl

Hi @karbapi,

Alright, understandable. We do have a number of memory-leak fixes for various ops included in the next outgoing update for tf-metal, so we hope they will also address the leaks you are seeing here. At least they do resolve this memory issue in the networks we have test coverage for.

I'll update here once the next version of tf-metal goes out so you can confirm if this fixes the issue for you as well.

Thanks for the response. I will await your next fixes/updates.

Just thought to share that the above results are based on TF-MACOS (2.8.0) and TF-METAL (0.4.0) with python=3.8.13 in my CURRENT ENV.

Although my BASE ENV of TF-MACOS (2.9.2) and TF-METAL (0.5.0) with python=3.9.13 does exhibit the same behaviour, I faced some other issues with it as well (beyond the scope of this thread); that's why I am using the above ENV.

Finally, I would like to ask: why does the latest TF-MACOS have version number 2.9.2, while TensorFlow.org shows the latest TF version as 2.9.1 (ref: https://www.tensorflow.org/api_docs/python/tf)?

Thanks, Bapi

Hi @karbapi,

Thanks for the additional information about the package versions used.

Regarding the last question: the minor version number in tf-macos is currently ahead of the tensorflow baseline minor version because of a patch we had to make after the release of the 2.9.1 version of the package. Due to how PyPI works, we cannot replace an already released package; instead we have to advance the release version if there is a need for a bug fix. The version numbering will be back in sync with the TF baseline when the next major version update happens.

A quick update:

  1. When I use the CPU only, as:

    with tf.device('/CPU'):
        predict_hpwl(graph, graph_label, model)

the memory leak is insignificant (1-10MB max initially, then <=1MB). Please see a single memory_profiler output instance for predict_hpwl() below:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    31   2448.5 MiB   2448.5 MiB           1   @profile
    32                                         def predict_hpwl(graph, graph_label, model):
    33   2448.5 MiB      0.0 MiB           1       lindex = range(len([graph_label]))
    34   2448.5 MiB      0.0 MiB           2       gdata = DataGenerator("Prediction",graphs=[graph],
    35   2448.5 MiB      0.0 MiB           1                                    labels=[graph_label],
    36   2448.5 MiB      0.0 MiB           1                                    indices=lindex,
    37   2448.5 MiB      0.0 MiB           1                                    shuffle=True,
    38   2448.5 MiB      0.0 MiB           1                                    cache_size=10,
    39   2448.5 MiB      0.0 MiB           1                                    debug=False,
    40   2448.5 MiB      0.0 MiB           1                                    isSparse=True)
    41
    42                                             ## Test the GNN
    43   2449.4 MiB      0.9 MiB           2       hpwl = model.predict(gdata,
    44   2448.5 MiB      0.0 MiB           1               max_queue_size=10,
    45   2448.5 MiB      0.0 MiB           1               workers=8,
    46   2448.5 MiB      0.0 MiB           1               use_multiprocessing=True
    47                                                     )
    48
    49   2449.3 MiB     -0.0 MiB           1       tf.keras.backend.clear_session()
    50
    51   2449.3 MiB      0.0 MiB           1       return hpwl

  2. When I use the GPU only, as:

    with tf.device('/GPU'):
        predict_hpwl(graph, graph_label, model)

I see a similar (large) memory leak as reported earlier.

Apparently, the GPU path is causing the memory leak! I hope this helps you in providing the fix.
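As a side note (not part of the original post), one way to double-check which device the ops actually land on during these CPU/GPU comparisons is TensorFlow's device-placement logging; it has to be enabled before the model is loaded or run:

    import tensorflow as tf

    tf.debugging.set_log_device_placement(True)   # logs the assigned device for each op
    print(tf.config.list_physical_devices())      # lists the CPU and (Metal) GPU devices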

Note: my env is still python 3.8.13 with tensorflow-macos==2.8.0 and tensorflow-metal==0.4.0.

Thanks, Bapi

Hi @karbapi,

We have now released tensorflow-metal==0.5.1 with multiple memory leak issues fixed. If you can, please try out tensorflow-macos==2.9.2 and tensorflow-metal==0.5.1 to see if these fixes address the problem you are seeing.
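A quick way to confirm which builds are actually installed after upgrading (a small sketch, not part of the original reply; package names as used in this thread):

    import importlib.metadata as md

    print("tensorflow-macos:", md.version("tensorflow-macos"))
    print("tensorflow-metal:", md.version("tensorflow-metal"))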

Hi, thanks for your update.

I don't see any improvement with tensorflow-metal==0.5.1 (tried both with tensorflow-macos==2.9.2 / python 3.9.13 and with tensorflow-macos==2.8.0 / python 3.8.13). In fact, I see quite similar memory_profiler output, as below:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    31   3534.6 MiB   3534.6 MiB           1   @profile
    32                                         def predict_hpwl(graph, graph_label, model):
    33   3534.6 MiB      0.0 MiB           1       lindex = range(len([graph_label]))
    34   3534.6 MiB      0.0 MiB           2       gdata = DataGenerator("Prediction",graphs=[graph],
    35   3534.6 MiB      0.0 MiB           1                                    labels=[graph_label],
    36   3534.6 MiB      0.0 MiB           1                                    indices=lindex,
    37   3534.6 MiB      0.0 MiB           1                                    shuffle=True,
    38   3534.6 MiB      0.0 MiB           1                                    cache_size=10,
    39   3534.6 MiB      0.0 MiB           1                                    debug=False,
    40   3534.6 MiB      0.0 MiB           1                                    isSparse=True)
    41
    42                                             ## Test the GNN
    43   3594.9 MiB     60.3 MiB           2       hpwl = model.predict(gdata,
    44   3534.6 MiB      0.0 MiB           1               max_queue_size=10,
    45   3534.6 MiB      0.0 MiB           1               workers=8,
    46   3534.6 MiB      0.0 MiB           1               use_multiprocessing=True
    47                                                     )
    48
    49   3594.9 MiB     -0.0 MiB           1       tf.keras.backend.clear_session()
    50
    51   3594.9 MiB      0.0 MiB           1       return hpwl

  • Any update/comment on this?


Hi there, kindly look at the CPU, GPU, and RAM usage (and the fan speed along with the CPU/GPU temperatures at the top of the attached image). Sorry if the image quality is bad!

Takeaway: CPU/GPU usage is extremely POOR, perhaps due to the memory leakage and sub-optimal process scheduling across the CPUs/GPUs.

Environment: TF-MACOS==2.9.2 and TF-METAL==0.5.1 along with python 3.9.

I AM STUCK. KINDLY HELP. --Bapi

Hello

I am encountering the same issue on my Mac M1. I have a saved model used for prediction only; it predicts one batch after another.

  • Tensorflow v2.10.0
  • tensorflow-metal v0.6.0

When using the CPU, it only consumes ~400MB of memory.

When using the GPU via tf-metal, RAM consumption keeps increasing and never stops for as long as prediction is running.

I have 16GB of RAM in total, and it can easily run out.

  • Yes, you are right. I have just tested it, and TF 2.10 / METAL 0.6 shows the same leaky behaviour with the GPU.

  • I tried deleting the existing session and reloading the saved model every time before prediction; that does not help at all, memory keeps increasing. See the sketch below.
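For reference, a sketch of that attempted workaround (clear the session and reload the saved model before every predict call); the model path and batch are placeholders, and as noted above it still leaks on the GPU:

    import tensorflow as tf

    def predict_once(batch, model_path="my_saved_model"):
        tf.keras.backend.clear_session()                 # drop previous Keras/session state
        model = tf.keras.models.load_model(model_path)   # reload the saved model every time
        return model.predict(batch, verbose=0)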


Even if you use keras.backend.clear_session(), it does not help. The only saviour here is CPU mode, but this is not why I paid hefty money for the GPU cores!
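One way to force CPU-only execution as a stopgap (a workaround sketch, not an official fix; it must run before any model is loaded or any op is executed):

    import tensorflow as tf

    tf.config.set_visible_devices([], 'GPU')   # hide the Metal GPU from TensorFlow
    print(tf.config.get_visible_devices())     # should now list only CPU devices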

Just FYI, TF 2.10 / METAL 0.6 with python 3.10.6 in CPU mode gave me a segmentation fault. Maybe it is specific to my setup!

  • Yep. Mac M1/M2 GPU cannot be used for machine learning if this bug is not fixed.


I have the same problem: in Activity Monitor I see the Python process using 111 GB of memory, while my RAM is only 32 GB. Some models crash completely during training when memory is constantly growing like this.

Still no solution!!! Environment: TF-MACOS==2.9.2 and TF-METAL==0.5.1 along with python 3.9, in CPU mode ONLY. Memory usage had gone up to 162GB (RAM is 128GB). The strange thing is that the step time increased from ~140ms to 64s to a whopping 496s before getting STUCK. How could someone use these BRITTLE (METAL) GPUs? :-(

1/1 [==============================] - 0s 144ms/step
1/1 [==============================] - 0s 141ms/step
1/1 [==============================] - 0s 142ms/step
1/1 [==============================] - 64s 64s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step
1/1 [==============================] - 496s 496s/step

Just FYI, 3 months ago I saw a similar leak-until-memory-full problem on the GPU. I found this forum and a now 4-month-old report of what seemed an identical problem by user @wangcheng ("The new tensorflow-macos and tensorflow-metal incapacitate training"). I've been able to limp along by switching to CPU-only prediction since then.


I don't see any improvement after upgrading to macOS Ventura. The only difference is that the step time has reduced to ~80ms from ~140ms. The rest remains the same.

  • I'm having the same issue and would really appreciate an answer to this! This machine has cost me a fortune, and I've gone through a lot of trouble to get the Metal/M1 versions of Python and TensorFlow installed.

    Tensorflow 2.10.0 Python 3.9.13
