I'm using the latest MacBook Pro with the M1 Max chip, and I ran a benchmark using tensorflow-metal (v2.6):
import tensorflow as tf
from tqdm import tqdm

def foo():
    # Generate two 12288 x 12288 float32 tensors and multiply them element-wise.
    x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    z = x * y
    return z

if __name__ == '__main__':
    z0 = None
    for _ in tqdm(range(10000000000)):
        zz = foo()
        if z0 is None:
            z0 = zz
        else:
            z0 += zz  # accumulate so the results stay live and iterations can't be optimised away
The above code runs at an average of 29.66 it/s on the M1 Max, which should be using all of its GPU cores (GPU utilisation is also 100%).
But the same code running on a Zotac RTX 3090 Trinity reaches 175.22 it/s, which means the M1 Max delivers only 16.9% of the RTX 3090's performance.
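One thing I'm not sure about: if the Metal backend dispatches kernels asynchronously, tqdm's it/s could partly reflect queued work rather than completed GPU work (this is an assumption on my part, not something I've confirmed). A variant that forces a device sync each iteration would rule that out:

import tensorflow as tf
from tqdm import tqdm

def foo():
    x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    return x * y

if __name__ == '__main__':
    for _ in tqdm(range(1000)):  # smaller iteration count so the run terminates
        z = foo()
        # Reduce on-device to a scalar and pull it to the host, which blocks
        # until the multiply has actually finished on the GPU.
        _ = float(tf.reduce_sum(z))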
Note that the M1 Max should in theory deliver 10.4 TFLOPS while the RTX 3090 is roughly 35.5 TFLOPS, so the M1 Max should reach about 29.3% of the RTX 3090's performance.
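For reference, here is the arithmetic behind those percentages, plus a back-of-envelope FLOP count for the benchmark itself (the FLOP estimate and the bandwidth remark are my own guesses):

# Figures quoted above.
m1_its, rtx_its = 29.66, 175.22        # measured iterations/second
m1_tflops, rtx_tflops = 10.4, 35.5     # theoretical FP32 peaks

print(f"measured:    {m1_its / rtx_its:.1%}")        # -> 16.9%
print(f"theoretical: {m1_tflops / rtx_tflops:.1%}")  # -> 29.3%

# Each iteration does one multiply per element of a 12288 x 12288 tensor,
# i.e. ~0.15 GFLOP -- tiny relative to either chip's peak, so the loop may
# be limited by random-number generation and memory bandwidth rather than
# raw compute (my assumption).
n = 1024 * 12
print(f"M1 Max effective: {m1_its * n * n / 1e9:.2f} GFLOP/s")  # ~4.5 GFLOP/s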
I also notice that the M1 Max draws only 12 watts, while the RTX 3090 draws 340 watts. When gaming, the M1 Max usually draws 30~40 watts, which is much higher than in this deep learning setting.
My guess is that tensorflow-metal somehow doesn't utilise the full performance of the M1 Max, which would explain both the low power consumption and the poor performance of this element-wise multiplication benchmark.
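To help separate "the plugin underperforms" from "this particular op underperforms", a compute-bound matmul loop might be a better probe of the FP32 peak (a sketch; the 4096 size and iteration count are arbitrary choices of mine):

import tensorflow as tf
from tqdm import tqdm

@tf.function
def step(x, y):
    # One 4096 x 4096 matmul is ~137 GFLOP (2 * n^3), enough to be compute-bound.
    return tf.matmul(x, y)

if __name__ == '__main__':
    x = tf.random.uniform((4096, 4096), dtype=tf.float32)
    y = tf.random.uniform((4096, 4096), dtype=tf.float32)
    z = None
    for _ in tqdm(range(1000)):
        z = step(x, y)
    print(float(tf.reduce_sum(z)))  # sync so the timing reflects completed GPU work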
Can you look into this issue?