TensorFlow Metal 1.2.0 on M2 fails to converge on common toy models

I've been trying to get some basic models to work on an M2 with tensorflow-metal 1.2.0 and Keras under TensorFlow 2.15 and 2.18, and none of them train as expected.

I'm running models copied and pasted from common tutorials, such as Jason Brownlee's Machine Learning Mastery object classification tutorial using CIFAR-10. When run on the GPU, I can't get any reasonable results. Under TensorFlow 2.15 the best validation accuracy ends up around 10-15%. Under TensorFlow 2.18, validation goes off the rails around epoch 5, with accuracy near zero and loss values reported as "nan".

Epoch 4/25
782/782: 19s 24ms/step - accuracy: 0.3450 - loss: 2.8925 - val_accuracy: 0.2992 - val_loss: 1.9869

Epoch 5/25
782/782: 19s 24ms/step - accuracy: 0.2553 - loss: nan - val_accuracy: 0.0000e+00 - val_loss: nan
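For longer runs it can help to find the first bad epoch automatically rather than by eye. The following is a minimal sketch, assuming the training output has been saved as text in the format shown above; the function name `first_nan_epoch` is hypothetical and not part of Keras.

```python
import math
import re

# Matches epoch headers such as "Epoch 5/25".
EPOCH_RE = re.compile(r"^Epoch (\d+)/\d+")
# Matches "name: value" pairs where the value is a number or the literal "nan".
METRIC_RE = re.compile(r"(\w+): (nan|[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def first_nan_epoch(log_text):
    """Return the first epoch number whose metrics contain NaN, or None."""
    epoch = None
    for line in log_text.splitlines():
        header = EPOCH_RE.match(line.strip())
        if header:
            epoch = int(header.group(1))
            continue
        for _name, value in METRIC_RE.findall(line):
            if epoch is not None and math.isnan(float(value)):
                return epoch
    return None


sample_log = """Epoch 4/25
782/782: 19s 24ms/step - accuracy: 0.3450 - loss: 2.8925 - val_accuracy: 0.2992 - val_loss: 1.9869

Epoch 5/25
782/782: 19s 24ms/step - accuracy: 0.2553 - loss: nan - val_accuracy: 0.0000e+00 - val_loss: nan
"""
print(first_nan_epoch(sample_log))
```

Keras also provides the tf.keras.callbacks.TerminateOnNaN callback, which stops training when the loss becomes NaN, so a run does not spend the remaining epochs on NaN weights.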

Running the same code on the CPU under TensorFlow 2.15, using tf.config.experimental.set_visible_devices([], 'GPU'), yields a reasonable result with validation accuracy around 75%, as expected. Running the same code under TensorFlow 2.15 on a CPU-only Linux instance gives similar results.
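For anyone repeating the CPU comparison, here is a minimal sketch of the CPU-only setup. It assumes a fresh Python process (TensorFlow refuses to change device visibility once devices are initialized) and uses tf.config.set_visible_devices, the non-experimental name for the same call. The helper name `hide_gpus` is hypothetical, and the guarded import is only there so the sketch loads where TensorFlow is not installed.

```python
# Assumption: this runs before any tensor or model has been created.
try:
    import tensorflow as tf
except ImportError:  # lets the sketch load where TensorFlow is missing
    tf = None


def hide_gpus():
    """Hide every GPU from TensorFlow and return the GPUs still visible.

    Returns None when TensorFlow is not installed.
    """
    if tf is None:
        return None
    tf.config.set_visible_devices([], "GPU")
    return tf.config.get_visible_devices("GPU")


remaining = hide_gpus()
print("GPUs visible after hiding:", remaining)
```

Training the same model once with this call and once without it, using the same seed, gives a direct CPU versus GPU comparison on one machine.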

The tutorial can be found here: https://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/

The only place I've deviated from the provided tutorial is the optimizer:

sdg = tf.keras.optimizers.legacy.SGD(learning_rate=lrate, momentum=0.9, nesterov=False)

I did this on the advice of this warning:

WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.

Is there something special that I need to do to make this work? I've followed the instructions here: https://developer.apple.com/metal/tensorflow-plugin/

I've purged the venv a few times and started from scratch, but all with similarly terrible results.

Here are my platform details:

  • Chip: Apple M2
  • Memory: 16 GB
  • macOS: Sequoia 15.2
  • Python venv: 3.11
  • Jupyter Lab Version: 4.3.3
  • TensorFlow versions: 2.15, 2.18
  • tensorflow-metal: 1.2.0
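Since results differ between package versions, it may help to capture the exact environment for each run. This is a minimal sketch using only the standard library; the package list is an assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata

# Assumed set of distributions worth reporting for this setup.
PACKAGES = ("tensorflow", "tensorflow-macos", "tensorflow-metal", "keras")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python:", platform.python_version())
print("Machine:", platform.machine())
print("macOS:", platform.mac_ver()[0] or "n/a")
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside each venv before training makes it easier to tell which combination produced which result.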

Thanks for any assistance or advice.

I have an M2 Max and faced similar issues (not fully resolved yet).

What helped me make progress was trying different versions of TensorFlow and Python (3.13.2 generally performed best). Give TensorFlow 2.16 a shot; otherwise MLX is the only option :-(

pip install "tensorflow-macos==2.16.*"

The downside is that when you then try coremltools, it won't convert the model to .mlmodel.

@Xcode-K I tried TensorFlow 2.18 with the Metal plugin and got different bad results: fit() just coughed up a bunch of NaN for loss, and the accuracy fell to almost zero after about 5 epochs.

I also have a project requirement of Keras 2. I can probably work around that, but the uncertainty of the results leaves me very suspicious.

Is there something about 2.16 that fixes whatever is broken in the other versions?

@txoof

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
import logging

# Suppress warnings
logging.getLogger("tensorflow").setLevel(logging.ERROR)
logging.getLogger("urllib3").setLevel(logging.ERROR)

# Verify GPU availability
physical_devices = tf.config.list_physical_devices('GPU')
print(f"GPUs Available: {bool(physical_devices)}")

# Load data with flexible features
def load_data(filename, features=['Close']):
    df = pd.read_csv(filename, parse_dates=['DateTime'])
    df['DateTime'] = pd.to_datetime(df['DateTime'], format='%Y.%m.%d %H:%M:%S')
    required_columns = ['DateTime'] + features
    df_cleaned = df[required_columns].dropna().sort_values('DateTime')
    if df_cleaned.empty:
        raise ValueError("Dataset is empty after cleaning!")
    return df_cleaned, features

# Example using multiple features
selected_features = ['Open', 'High', 'Low', 'Close']
df, features = load_data('hourdata.csv', features=selected_features)
print("Using features:", features)

# Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df[features])

# Sequence creation with dynamic close index
def create_sequences(data, time_steps):
    X, y = [], []
    close_idx = features.index('Close')
    for i in range(time_steps, len(data)):
        X.append(data[i-time_steps:i])
        y.append(data[i, close_idx])  # Predict Close price
    return np.array(X), np.array(y)

# Train/test split
def split_data(data, time_steps, split_ratio=0.8):
    split_idx = int(len(data) * split_ratio)
    train = data[:split_idx]
    test = data[split_idx - time_steps:]
    return train, test

# Enhanced prediction function
def predict_price(model, scaler, df, features, time_steps):
    close_idx = features.index('Close')
    last_seq = df[features].iloc[-time_steps:]
    scaled_seq = scaler.transform(last_seq)
    X = scaled_seq.reshape(1, time_steps, len(features))
    pred_scaled = model.predict(X, verbose=0)
    
    # Create dummy array for inverse transform
    dummy_row = np.zeros((1, len(features)))
    dummy_row[0, close_idx] = pred_scaled[0][0]
    return scaler.inverse_transform(dummy_row)[0, close_idx]

# Time steps to evaluate
time_steps_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 100]
log_df = pd.DataFrame(columns=['time_steps', 'test_loss', 'test_mae', 'current_price', 'predicted_price'])

for time_steps in time_steps_list:
    print(f"\nProcessing time_steps={time_steps}")
    
    # Data preparation
    train_data, test_data = split_data(scaled_data, time_steps)
    X_train, y_train = create_sequences(train_data, time_steps)
    X_test, y_test = create_sequences(test_data, time_steps)
    
    # Reshape data
    X_train = X_train.reshape(-1, time_steps, len(features))
    X_test = X_test.reshape(-1, time_steps, len(features))
    
    # Model architecture
    model = Sequential([
        Input(shape=(time_steps, len(features))),
        LSTM(100, return_sequences=True),
        Dropout(0.3),
        LSTM(100, return_sequences=True),
        Dropout(0.3),
        LSTM(50, return_sequences=True),
        Dropout(0.2),
        LSTM(25),
        Dropout(0.2),
        Dense(25, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    
    # Training
    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    history = model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        epochs=100,
        batch_size=256,
        verbose=0,
        callbacks=[early_stop]
    )
    
    # Evaluation
    test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
    current_price = df['Close'].iloc[-1]
    predicted_price = predict_price(model, scaler, df, features, time_steps)
    
    # Logging
    new_row = pd.DataFrame([{
        'time_steps': time_steps,
        'test_loss': test_loss,
        'test_mae': test_mae,
        'current_price': current_price,
        'predicted_price': predicted_price
    }])
    log_df = pd.concat([log_df, new_row], ignore_index=True)

# Save and display results
log_df.to_csv('multi_feature_analysis.csv', index=False)
print("\nFinal Results:")
print(log_df)

# Plot results with actual dates
split_index = int(len(df) * 0.8)
plt.figure(figsize=(14, 7))
plt.plot(df['DateTime'].iloc[split_index:], df['Close'].iloc[split_index:], label='Actual Prices')
plt.title('Actual vs Model Predictions')
plt.xlabel('DateTime')
plt.ylabel('Price')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

I had code like this for an LSTM. I tried many versions and 2.16 simply performed best, but unfortunately I can't recall the details because it was some time ago. The difference seemed to be in the pip packages, which differ for macOS in later versions.

@Xcode-K After reading a bunch of git issues, it looks like the problem resides in the relu activation and possibly softmax. Apparently tanh and softplus are not affected. I managed to squeeze out some good results by using those.
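One way to probe the activation claim is to run the same activation on the CPU and on the GPU and compare the outputs. This is a rough sketch under stated assumptions: the helper name `activation_cpu_gpu_gap` is hypothetical, the tiny default input is only illustrative and may well be too small to trigger whatever the underlying problem is, and the function returns None when TensorFlow or a visible GPU is missing.

```python
try:
    import tensorflow as tf
except ImportError:  # lets the sketch load where TensorFlow is missing
    tf = None


def activation_cpu_gpu_gap(name, values=(-2.0, -0.5, 0.0, 0.5, 2.0)):
    """Max absolute difference between CPU and GPU results for one activation.

    Returns None when TensorFlow or a visible GPU is not available.
    """
    if tf is None or not tf.config.list_logical_devices("GPU"):
        return None
    fn = tf.keras.activations.get(name)
    with tf.device("/CPU:0"):
        x = tf.constant([list(values)], dtype=tf.float32)
        cpu = fn(x)
    with tf.device("/GPU:0"):
        gpu = fn(x)
    return float(tf.reduce_max(tf.abs(cpu - gpu)).numpy())


for activation in ("relu", "softmax", "tanh", "softplus"):
    print(activation, activation_cpu_gpu_gap(activation))
```

A gap near zero for every activation would not rule out a problem elsewhere (for example in the backward pass or in larger convolution outputs), so this only narrows things down.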

I don't know enough to verify this, so I'm abandoning Apple Metal for the moment. I've got a second-hand GPU and stuffed it in a Linux box. I'll just work on that for the time being.

@txoof Fair enough. Have you had a go with MLX?

Here is a CIFAR-10 example on GitHub: https://github.com/ml-explore/mlx-examples/blob/main/cifar/README.md

Thanks for the tip! I'll have to try MLX out later. Right now I need some results. I'm taking a course and am a full week behind, because all the data and models that I generated in the previous week are now suspect.
