I got 3203.23 GFLOPS (FP16) on the M3 Macbook Pro and only 2833.24 GFLOPS (FP16) on the M4 Macbook Air for 4096x4096 matrix multiplications for a PyTorch MPS FP16 Benchmark. Wasn't the performance supposed to be twice as high on the M4 compared to the M3 even with the termal throtling on the Macbook Air? What went wrong?