“Accelerate Transformer Training on Apple Devices from Months to Hours!”

I am excited to share that I have developed a Metal kernel for Flash Attention that is free of race conditions and makes full use of Apple Silicon’s threadgroup (shared) memory and registers. Because it never materializes the full attention matrix, the kernel can substantially accelerate training of transformer-based models.
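
To give a concrete picture of what the kernel computes, below is a minimal Swift sketch of the streaming ("online softmax") recurrence that Flash Attention is built on. It is only an illustrative CPU reference under assumed names (flashAttentionRow, plain array types), not the Metal kernel itself; on the GPU the same recurrence runs over key/value tiles staged in threadgroup memory and registers.

import Foundation

// Illustrative CPU reference (not the actual Metal kernel) of the
// streaming "online softmax" used by Flash Attention: attention for one
// query row is accumulated one key/value pair at a time, so the full
// score matrix is never stored.
func flashAttentionRow(query: [Double], keys: [[Double]], values: [[Double]]) -> [Double] {
    let d = query.count
    let scale = 1.0 / Double(d).squareRoot()

    var runningMax = -Double.infinity                            // largest score seen so far
    var runningSum = 0.0                                         // running softmax denominator
    var acc = [Double](repeating: 0.0, count: values[0].count)   // unnormalized output

    for (key, value) in zip(keys, values) {
        // Scaled dot-product score for this key.
        var score = 0.0
        for i in 0..<d { score += query[i] * key[i] }
        score *= scale

        // When a larger score appears, rescale what has already been
        // accumulated; this keeps every exponential at most 1 and the
        // computation numerically stable.
        let newMax = max(runningMax, score)
        let correction = exp(runningMax - newMax)
        let weight = exp(score - newMax)

        runningSum = runningSum * correction + weight
        for i in acc.indices {
            acc[i] = acc[i] * correction + weight * value[i]
        }
        runningMax = newMax
    }

    // Normalize once at the end.
    return acc.map { $0 / runningSum }
}

The point of the recurrence is that each key/value pair only updates a running maximum, a running denominator, and an unnormalized output, so per-row memory traffic grows with the sequence length rather than with its square.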

Early benchmarks suggest large reductions in training time on Apple hardware, in favorable cases from months down to hours, while maintaining numerical stability and accuracy. I plan to make the code publicly available so the broader community can benefit.

I would be happy to keep you updated as I continue testing and optimizing the kernel. I believe this work could provide valuable insights for Apple’s machine learning research and products.
