Low performance on matrix multiplication

I'm using the latest MacBook Pro with the M1 Max chip, and I ran some benchmarks using tensorflow-metal (v2.6):

import tensorflow as tf
from tqdm import tqdm


def foo():
  # create two fresh 12288 x 12288 random tensors on the device and multiply them
  x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  z = x * y
  return z


if __name__ == '__main__':
  z0 = None
  for _ in tqdm(range(10000000000)):
    zz = foo()
    if z0 is None:
      z0 = zz
    else:
      z0 += zz

The above code runs at an average of 29.66 it/s on the M1 Max, which should be using all of its GPU cores (GPU utilisation is also 100%).

But the same code running on a Zotac RTX 3090 Trinity reaches 175.22 it/s, which means the M1 Max delivers only 16.9% of the RTX 3090's performance.

Note that the M1 Max is rated at 10.4 TFLOPS in theory and the RTX 3090 at roughly 35.5 TFLOPS, so the M1 Max should reach about 29.3% of the RTX 3090's performance.
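
Just to make the arithmetic explicit (these are only the ratios of the numbers above, nothing new measured):

print(29.66 / 175.22)  # ~0.169 -> measured: 16.9% of the RTX 3090
print(10.4 / 35.5)     # ~0.293 -> theoretical peak: 29.3% of the RTX 3090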

I also notice that the M1 Max only consumes 12 watts during this benchmark, while the RTX 3090 consumes 340 watts. When gaming, the M1 Max usually consumes 30~40 watts, which is much higher than in this deep learning setting.

My guess is that tensorflow-metal somehow doesn't utilise the full performance of the M1 Max, which is why both the power consumption and the matrix multiplication performance are so low.

Can you look into this issue?

Did you post a bug report? Or you could also burn a DTS ticket.

Hello,

If your goal is to benchmark the performance of matrix multiplication on the M1 Max chip, I would recommend creating the x and y tensors outside the loop and then looping over the matmul alone in the for loop. This ensures that you don't pay the penalty of creating a random matrix on the GPU each time, and the runtime measured will be for the matrix multiplication alone. An example Python script:

import tensorflow as tf
from tqdm import tqdm

def foo(x, y):
  # multiply the two pre-created tensors; this is the only work timed in the loop
  z = x * y
  return z

if __name__ == '__main__':
  z0 = None
  x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  for _ in tqdm(range(10000000000)):
    zz = foo(x, y)
    if z0 is None:
      z0 = zz
    else:
      z0 += zz

Thanks for pointing out the issue. The reason I want to create a random tensor each time inside the loop is to avoid potential "caching" of the same calculation on the same two variables, so the result reflects performance in a real scenario (loading different data during the for-loop for training / inference).
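
As a rough sketch of what I have in mind (not something I measured), one middle ground would be to pre-generate a small pool of random tensors outside the loop and cycle through them, so each iteration still sees different data without paying the generation cost inside the loop; the pool size of 4 is arbitrary:

import tensorflow as tf
from tqdm import tqdm

POOL = 4  # arbitrary; just enough to vary the inputs between iterations
xs = [tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32) for _ in range(POOL)]
ys = [tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32) for _ in range(POOL)]

if __name__ == '__main__':
  z0 = None
  for i in tqdm(range(1000000)):
    zz = xs[i % POOL] * ys[i % POOL]
    z0 = zz if z0 is None else z0 + zz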

I also ran the experiment with your attached code. The M1 Max scores 103.10 it/s and the RTX 3090 scores 234.82 it/s, which is 43.9% of the RTX 3090's performance. But I think this is due to some internal caching: when I run actual deep learning models on the M1 Max, the training performance is also roughly 1/6 that of an RTX 3090, which is consistent with my result above (an example would be training a qa model from the huggingface tensorflow examples).

The interesting part is the wattage. GPU utilisation and power consumption are about the same for deep learning and gaming on an RTX 3090, but they differ a lot on the M1 Max (power consumption is much lower for deep learning than for gaming), which suggests the GPU cores of the M1 Max might not be fully utilised for deep learning.
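
To at least rule out ops silently falling back to the CPU (only a sanity check, not a profiler), device placement can be logged with standard TensorFlow calls:

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # should list the Metal GPU device
tf.debugging.set_log_device_placement(True)    # log which device each op executes on

x = tf.random.uniform((1024, 1024), dtype=tf.float32)
y = tf.random.uniform((1024, 1024), dtype=tf.float32)
z = x * y  # the placement log should show this op running on GPU:0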

Hope you can find the issue and improve the performance of TensorFlow running on M1 chips.

More experiments below, which show an interesting result:

import tensorflow as tf
from tqdm import tqdm

def foo(x, y):
  z = x * y
  return z

if __name__ == '__main__':
  z0 = None
  x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
  for i in tqdm(range(1000000)):
    zz = foo(x, y)
    x += 1  # perturb the inputs each iteration so the same result cannot be reused
    y += 1
    if z0 is None:
      z0 = zz
    else:
      z0 += zz

This experiment avoids caching as well as the extra cost of creating new tensors. The M1 Max scores 61.9 it/s and the RTX 3090 scores 160.2 it/s, which definitely shows the potential of the M1 Max (38.6% of the RTX 3090's performance). Interestingly, in this experiment the M1 Max consumes roughly 50 watts in total.
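
One caveat about reading it/s off tqdm: eager ops can be dispatched asynchronously, so a wall-clock measurement that forces the final result back to the host may be more reliable. A sketch (the 1000 iterations are arbitrary):

import time

import tensorflow as tf

x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)

start = time.perf_counter()
z0 = None
for _ in range(1000):
  zz = x * y
  z0 = zz if z0 is None else z0 + zz
_ = z0.numpy()  # force the computation to complete before stopping the clock
elapsed = time.perf_counter() - start
print(1000 / elapsed, 'it/s')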

I'm trying to find the performance bottleneck of the M1 Max using deep learning models like Transformers, since they are currently very slow to run on an M1-series chip. I will update here when I find something.
