TensorFlow Memory Usage

I am noticing huge memory usage with TensorFlow. Memory usage keeps increasing, reaching 36 GB after only one epoch.

The following is the dataset preprocessing code:

with tf.device('CPU:0'):
    data_augmentation = keras.Sequential([
      keras.layers.experimental.preprocessing.RandomFlip("horizontal"),
      keras.layers.experimental.preprocessing.RandomRotation(0.2), 
      keras.layers.experimental.preprocessing.RandomHeight(0.2), 
      keras.layers.experimental.preprocessing.RandomWidth(0.2),
      keras.layers.experimental.preprocessing.RandomZoom(0.2), 
    ], name="data_augmentation")

train_data = train_data.map(map_func=lambda x, y: (data_augmentation(x), y), num_parallel_calls=tf.data.AUTOTUNE).prefetch(buffer_size=tf.data.AUTOTUNE)
test_data = test_data.prefetch(buffer_size=tf.data.AUTOTUNE)

And the following is the model I used:

base_model = keras.applications.EfficientNetB0(include_top=False)
base_model.trainable = False

inputs = keras.layers.Input(shape=(224, 224, 3), name='input_layer')
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D(name='global_average_pooling')(x)
outputs = keras.layers.Dense(101, activation='softmax', name='output_layer')(x)
model = keras.Model(inputs, outputs)
# Compile
model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(), # use Adam with default settings
              metrics=["accuracy"])

from tqdm.keras import TqdmCallback
tqdm_callback = TqdmCallback()


# Fit
history_all_classes_10_percent = model.fit(train_data,
                                           verbose=0,
                                           epochs=5, 
                                           validation_data=test_data,
                                           validation_steps=int(0.15 * len(test_data)), 
                                           callbacks=[checkpoint_callback, tqdm_callback]) # save best model weights to file

Hi @Leozz99

Thanks for reporting this behaviour! I ran our memory profiler on a script using the transformations and the network you defined in your snippets, but could not see any leaks or unexpected memory consumption. My suspicion is that the train_data tf.data.Dataset may have .cache() applied on a line before the ones you pasted here. That would explain the pattern of increasing persistent memory usage throughout the first epoch, as the preprocessed data gets stored in the cache for subsequent epochs. You should, however, see the memory usage plateau after the first epoch, since there is no additional data to cache at that point.
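For reference, an in-memory .cache() in the input pipeline would look roughly like this (a sketch only; its exact placement in your pipeline is an assumption on my part):

# Sketch: an in-memory cache placed before prefetch. With no filename argument,
# .cache() keeps every preprocessed element in RAM, so memory grows steadily
# during the first epoch and plateaus afterwards.
train_data = train_data \
    .map(map_func=lambda x, y: (data_augmentation(x), y), num_parallel_calls=tf.data.AUTOTUNE) \
    .cache() \
    .prefetch(buffer_size=tf.data.AUTOTUNE)

# Caching to a file instead keeps the cached data on disk rather than in RAM:
# train_data = train_data.cache('train_cache')  # hypothetical cache file path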

In order for me to investigate further could you let us know:

  • Which OS version is this on?
  • What tensorflow-macos and tensorflow-metal versions are installed in your python environment?
  • Which machine are you running on? An x86 or arm64 based machine?
  • Could you provide a full script, runnable as is, that demonstrates the memory behaviour you are seeing? Right now I had to construct the original tf.data.Dataset myself, so the parameters I chose there could also be preventing me from reproducing this issue.

Hi, thanks for your reply. I am running this on macOS 12.4, and my machine is an M1 Max with 64 GB of memory. The current tensorflow-macos version is 2.9.1, and tensorflow-metal is 0.5.0.

I was using a Jupyter notebook in PyCharm, and here is the code:

import cv2
import os
import pandas as pd
import tensorflow as tf
import psutil

# for auto-completion
import typing
from tensorflow import keras
if typing.TYPE_CHECKING:
    from keras.api._v2 import keras

from helper_functions import *
import gc
import tqdm.notebook as tqdm

# tf.config.run_functions_eagerly(False)
gc.enable()
gc.collect()

if not os.path.exists('101_food_classes_10_percent'):
    if not os.path.exists('101_food_classes_10_percent.zip'):
        !wget https://storage.googleapis.com/ztm_tf_course/food_vision/101_food_classes_10_percent.zip
    # !unzip 10_food_classes_10_percent.zip
    unzip_data('101_food_classes_10_percent.zip')

train_dir = '101_food_classes_10_percent/train/'
test_dir = '101_food_classes_10_percent/test/'

IMG_SIZE = (224, 224)

train_data = keras.preprocessing.image_dataset_from_directory(train_dir,
                                                              label_mode='categorical',
                                                              image_size=IMG_SIZE,
                                                              seed=45)
test_data = keras.preprocessing.image_dataset_from_directory(test_dir,
                                                             label_mode='categorical',
                                                             image_size=IMG_SIZE,
                                                             shuffle=False,
                                                             seed=45)

# train_generator = keras.preprocessing.image.ImageDataGenerator()

# Setup data augmentation
with tf.device('CPU:0'):
    data_augmentation = keras.Sequential([
      keras.layers.experimental.preprocessing.RandomFlip("horizontal"), # randomly flip images on horizontal edge
      keras.layers.experimental.preprocessing.RandomRotation(0.2), # randomly rotate images by a specific amount
      keras.layers.experimental.preprocessing.RandomHeight(0.2), # randomly adjust the height of an image by a specific amount
      keras.layers.experimental.preprocessing.RandomWidth(0.2), # randomly adjust the width of an image by a specific amount
      keras.layers.experimental.preprocessing.RandomZoom(0.2), # randomly zoom into an image
      # keras.layers.experimental.preprocessing.Rescaling(1./255) # keep for models like ResNet50V2, remove for EfficientNet
    ], name="data_augmentation")

train_data = train_data\
    .map(map_func=lambda x, y: (data_augmentation(x, training=True), y), num_parallel_calls=tf.data.AUTOTUNE)\
    .prefetch(tf.data.AUTOTUNE)
test_data = test_data\
    .prefetch(tf.data.AUTOTUNE)
valid_data = test_data\
    .take(int(0.15*len(test_data)))\
    .prefetch(tf.data.AUTOTUNE)

base_model = keras.applications.EfficientNetB0(include_top=False)
base_model.trainable = False
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)

inputs = keras.layers.Input(shape=(224, 224, 3), name='input_layer')
# x = keras.layers.experimental.preprocessing.Rescaling(1./255)(inputs)
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D(name='global_average_pooling')(x)
outputs = keras.layers.Dense(101, activation='softmax', name='output_layer')(x)
model = keras.Model(inputs, outputs)

gc.collect()
# Create checkpoint callback to save model for later use
checkpoint_path = "101_classes_10_percent_data_model_checkpoint"
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                         save_weights_only=True, # save only the model weights
                                                         monitor="val_accuracy", # save the model weights which score the best validation accuracy
                                                         save_best_only=True) # only keep the best model weights on file (delete the rest)

# def free_memory(epochs, logs):
#     gc.collect()
#
# free_memory_callback = keras.callbacks.LambdaCallback(on_batch_end=free_memory)

# Compile
model.compile(loss="categorical_crossentropy",
              run_eagerly=False,
              optimizer=tf.keras.optimizers.Adam(), # use Adam with default settings
              metrics=["accuracy"])
# from tqdm.keras import TqdmCallback
# tqdm_callback = TqdmCallback()


# Fit
history_all_classes_10_percent = model.fit(train_data,
                                           epochs=5,
                                           validation_data=test_data,
                                           validation_steps=int(0.15 * len(test_data)), # evaluate on smaller portion of test data
                                           callbacks=[checkpoint_callback],
                                           max_queue_size=10,
                                           workers=10,
                                           use_multiprocessing=True) # save best model weights to file
plot_loss_curves(history_all_classes_10_percent)
model.load_weights(checkpoint_path)
print(model.evaluate(test_data))

I also notice that training slows down significantly after the first epoch.

Hi @Leozz99

Thanks for the script! I ran it locally on my macOS 12.4 M1 Air using tensorflow-macos==2.9.2 and tensorflow-metal==0.5.0 (the difference between 2.9.1 and 2.9.2 shouldn't matter in this case) and unfortunately could not reproduce the issue you are seeing. Below is the persistent memory usage I'm seeing with Xcode Instruments Leaks profiling.

So I'm seeing a fairly standard memory usage pattern with no increasing trend between epochs. Similarly, my per-epoch training times do not show the slowdown you are describing.

However, there is at least one possibly significant difference between our setups: I'm executing the code directly from the command line instead of running it through a Jupyter notebook in PyCharm. I will have to check whether I can test with PyCharm before I can verify whether the cause lies there, and it might take some time to get that answer.

In the meantime, could you try running your script directly from the command line, without PyCharm or a notebook in between, and see whether the issue still reproduces? Being able to confirm or rule out the IDE as a suspect could speed up the diagnosis significantly.
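If it helps quantify memory usage without Instruments, a minimal per-epoch memory logger along these lines could work (a sketch using psutil, which your script already imports; the callback name and print format are just illustrative):

import psutil
from tensorflow import keras

class MemoryLogger(keras.callbacks.Callback):
    # Print the resident set size of the process at the end of every epoch.
    def on_epoch_end(self, epoch, logs=None):
        rss_gb = psutil.Process().memory_info().rss / 1e9
        print(f"epoch {epoch}: resident memory ~ {rss_gb:.2f} GB")

# e.g. model.fit(train_data, epochs=5, callbacks=[MemoryLogger(), checkpoint_callback])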

I tried running a script from the command line instead of PyCharm, but I am seeing similar issues. I did not use Xcode Instruments Leaks profiling since it slows things down quite a lot, but looking at Activity Monitor during the last epoch, the process was using over 55.40 GB of memory. The epochs also keep getting slower: 89s -> 126s -> 126s -> 185s -> 246s.

OK, I have confirmed that I am able to reproduce this locally on an M1 Ultra. I will start investigating and will update here once I know the reason and the fix. Thanks for reporting the issue!

We have added multiple fixes for memory leaks in tensorflow-metal==0.5.1. Could you check whether they solve your issue? Additionally, the RngReadAndSkip op is now registered, so the data_augmentation you are performing should be able to run on the GPU as well.
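To confirm which versions are active in the environment after upgrading, something like this should work (a sketch using importlib.metadata):

from importlib.metadata import version

# Sketch: print the installed tensorflow-macos / tensorflow-metal versions.
print("tensorflow-macos:", version("tensorflow-macos"))
print("tensorflow-metal:", version("tensorflow-metal"))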
