I am running the same Python script using the TensorFlow Metal module on computers with M3 and M4 GPUs. While 1 epoch takes 5 minutes on the M3 device, it takes 15 minutes on the M4 device. What could be the reason for this? Could it be that TensorFlow Metal is not yet optimized for the M4 architecture?