'ReLU' activation problem when running inference on GPU

Hi,

There seems to be a difference in behavior when running inference on a trained Keras model with the model's __call__ method vs. the predict or predict_on_batch methods. This only happens when using the GPU for inference: for certain sequences of operations and float types, the 'relu' activation doesn't work as expected and appears to do nothing.

I can replicate the problem with the following code. It only fails with the 'relu' activation and the tf.float16 and tf.float32 dtypes, while it works fine with tf.float64 (see the small dtype sweep right after the code).

import tensorflow as tf
import numpy as np

DATA_LENGTH = 16
DENSE_WIDTH = 16
BATCH_SIZE = 8
DTYPE = tf.float32
ACTIVATION = 'relu'


def TestModel():
    inputs = tf.keras.Input(DATA_LENGTH, dtype=DTYPE)
    u = tf.keras.layers.Dense(DENSE_WIDTH, activation=ACTIVATION, dtype=DTYPE)(inputs)
    # u = tf.maximum(u, 0.0)  # uncommenting this explicit ReLU removes the discrepancy
    output = u * tf.constant(1.0, dtype=DTYPE)  # removing this multiply also makes it go away

    model = tf.keras.Model(inputs, output, name="TestModel")
    return model


model = TestModel()
model.compile()

x = np.random.uniform(size=(BATCH_SIZE, DATA_LENGTH)).astype(DTYPE.as_numpy_dtype)
with tf.device('/GPU:0'):
    out_gpu_call = model(x, training=False)
    out_gpu_predict = model.predict_on_batch(x)

with tf.device('/CPU:0'):
    out_cpu_call = model(x, training=False)
    out_cpu_predict = model.predict_on_batch(x)

print(f'\nDTYPE {DTYPE}, ACTIVATION: {ACTIVATION}')
print("\tMean Abs. Difference GPU (__call__ vs. predict):", np.mean(np.abs(out_gpu_call - out_gpu_predict)))
print("\tMean Abs. Difference CPU (__call__ vs. predict):", np.mean(np.abs(out_cpu_call - out_cpu_predict)))
print("\tMean Abs. Difference GPU-CPU __call__:", np.mean(np.abs(out_gpu_call - out_cpu_call)))
print("\tMean Abs. Difference GPU-CPU predict():", np.mean(np.abs(out_gpu_predict - out_cpu_predict)))

With DTYPE = tf.float32, the reproduction script produces, for example, the following output:

DTYPE <dtype: 'float32'>, ACTIVATION: relu
Mean Abs. Difference GPU (__call__ vs. predict): 0.1955472
Mean Abs. Difference CPU (__call__ vs. predict): 0.0
Mean Abs. Difference GPU-CPU __call__: 1.3573299e-08
Mean Abs. Difference GPU-CPU predict(): 0.1955472

And the results for the GPU are:

out_gpu_call

<tf.Tensor: shape=(8, 16), dtype=float32, numpy=
array([[0.1496982 , 0.        , 0.        , 0.73772687, 0.26131183,
        0.27757105, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.4164225 , 1.0367445 , 0.        , 0.5860609 ,
        0.        ], ...

out_gpu_predict

array([[ 1.49698198e-01, -3.48425686e-01, -2.44667321e-01,
         7.37726867e-01,  2.61311829e-01,  2.77571052e-01,
        -2.26729304e-01, -1.06500387e-01, -3.66294265e-01,
        -2.93850392e-01, -4.51043218e-01,  4.16422486e-01,
         1.03674448e+00, -1.39347658e-01,  5.86060882e-01,
        -2.05334812e-01], ...

Inspecting the results, it seems that the problem is that the 'relu' activation is not setting the values < 0 to 0 when calling predict_on_batch. When the commented-out u = tf.maximum(u, 0.0) line after the Dense layer is uncommented, there is no difference between the two calls (as expected).
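
For reference, this is what the model looks like with that workaround in place (just TestModel() with the tf.maximum line enabled; the function name is only for illustration):

def TestModelExplicitRelu():
    inputs = tf.keras.Input(DATA_LENGTH, dtype=DTYPE)
    u = tf.keras.layers.Dense(DENSE_WIDTH, activation=ACTIVATION, dtype=DTYPE)(inputs)
    # Explicit element-wise max clamps any negatives that slip through the
    # fused 'relu' on the GPU predict path.
    u = tf.maximum(u, 0.0)
    output = u * tf.constant(1.0, dtype=DTYPE)
    return tf.keras.Model(inputs, output, name="TestModelExplicitRelu")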

It also turns out that removing the multiplication by a constant after the Dense layer, output = u * tf.constant(1.0, dtype=DTYPE), makes the problem disappear (even with the u = tf.maximum(u, 0.0) line left commented out).
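
In other words, this variant (the fused 'relu' left as-is, only the trailing multiply removed) also gives matching __call__ and predict_on_batch outputs on the GPU; again, the name is only for illustration:

def TestModelNoMultiply():
    inputs = tf.keras.Input(DATA_LENGTH, dtype=DTYPE)
    # Same Dense layer with fused 'relu', but the model output is taken
    # directly, without the elementwise multiply by a constant.
    output = tf.keras.layers.Dense(DENSE_WIDTH, activation=ACTIVATION, dtype=DTYPE)(inputs)
    return tf.keras.Model(inputs, output, name="TestModelNoMultiply")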

This is running with the following setup:

  • MacBook Pro, Apple M2 Max chip, macOS Sonoma 14.2
  • tf version 2.15.0
  • tensorflow-metal 1.1.0
  • Python 3.10.13

Replies

Same here.

  • MacBook Pro, Apple M1 Pro, macOS Sonoma 14.3.1
  • tensorflow 2.15.0
  • tensorflow-metal 1.1.0
  • Python 3.9.6