Add FFT in Metal Performance Shaders

I am working on the implementation of a highly-demanding signal processing algorithm, and I am not able to reach an acceptable execution time with vDSP's routines.

I am now looking at Metal and learning how to use it. It seems like Metal Performance Shaders and MPS Graph could replace almost all of my vDSP calls, but not the Fast Fourier Transform (which is the most time-consuming part of the algorithm).

I was wondering if there's a way for FFT methods to be included in MPS, because it could be insanely fast if optimized for the unified memory architecture of the M1.

Thanks!

  • Curious as to why vDSP's routines weren't acceptable. Was the CPU version of the algorithm too slow? On Apple Silicon machines, do the vDSP routines automatically use the GPU, or is that only possible through the Metal API?

  • I had to perform 2D FFTs on very large tensors. So I did some research on vDSP's FFT routines and found out the following:

    There's no batch function for 2D FFTs (something like fftm for 1D), so to apply fft2 across my whole tensor I had to call it in a loop, batching manually by moving my pointers across the tensor for each call, which was obviously pretty slow. I couldn't make just one fft2 call on the whole tensor because the log2N parameter would be too big. I found a trick: do an fftm, transpose the tensor so that the columns become rows and are therefore contiguous in memory, then do another fftm. This was the fastest approach I could find, even though the transpose costs some time too.

    Basically I followed all the tips I found in the documentation to get the best performance out of vDSP: using a stride of 1 as much as possible (that's why I transposed my tensor between the two fftm calls) and allocating 16-byte-aligned memory with posix_memalign.

    However, I am a beginner developer and could definitely have missed something that made my vDSP FFT too slow. The fact is that my current results didn't match the performance of some GPU frameworks, such as cuFFT in CUDA. That is why I thought a highly optimized Metal FFT could exist in MPS.

  • thank you
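
The fftm, transpose, fftm trick described in the comments above can be sketched in a few lines. This is only an illustration in plain Python, not vDSP code: the function names (fft, fft_rows, transpose, fft2_by_transpose, dft2_direct, max_abs_diff) are hypothetical, fft_rows merely stands in for a batched 1D routine such as fftm, and both dimensions are assumed to be powers of two.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_rows(matrix):
    """Stand-in for a batched 1D FFT: transform every row independently."""
    return [fft(row) for row in matrix]


def transpose(matrix):
    """Swap rows and columns so that former columns become contiguous rows."""
    return [list(col) for col in zip(*matrix)]


def fft2_by_transpose(matrix):
    """2D FFT as: row FFTs, transpose, row FFTs, transpose back."""
    step1 = fft_rows(matrix)             # FFT along each row
    step2 = fft_rows(transpose(step1))   # FFT along each original column
    return transpose(step2)              # restore the original layout


def dft2_direct(matrix):
    """Slow reference 2D DFT straight from the definition (for checking only)."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += matrix[r][c] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out


def max_abs_diff(a, b):
    """Largest element-wise magnitude difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Comparing fft2_by_transpose against dft2_direct on a small matrix (say 4x8) is a quick way to confirm that the two passes of row transforms really are equivalent to a 2D transform. The final transpose is only needed if the result has to come back in the original layout.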

Replies

Thanks for the suggestion. Can you create a request using Feedback Assistant? (The more of these we get the higher priority we'll give it).

Repost the number here and I can route it to the MPS team ASAP.

  • Whoever made this comment, consider routing the link to MetalFFT to the MPS team. I would ideally like to get in contact with them over this through email - use the one posted on my GitHub profile. I've had a bad experience with Apple not responding to my stuff (see the comment below about a bug), so please leave a reply when you read this.

  • I have a feeling that my last post on this thread didn't get through to you because it wasn't a reply to a comment you made.

  • Please disregard the replies above.

It has the number FB9791504.

Thank you for taking this request into consideration.

I'm thinking of adding 1D, 2D, and 3D FFTs to an open-source project. They'll either end up in a Metal backend for Swift for TensorFlow, or in a related project. I am wondering whether the MPS team could use my open-source work to save themselves time. Right now, the MPS team could postpone making the FFT shaders and use their time for another project. Once I have open-sourced my implementation, they could use it as a reference, jump-starting their efforts and saving time.

From my experience with bug FB9653639, the Metal team is very slow to implement changes. In addition, they may need to rigorously test the shaders for bugs, which are very frequent and difficult to solve in GPGPU contexts. @CaptainHarlock my open-source effort might solve your needs before FFT shaders are added to MPS. We could discuss this more off of developer forums if it's time-sensitive - my GitHub account is "philipturner".

  • Your bug has been routed to the proper Xcode team for review and may be fixed in a future update.

Another thing I'd like to see in MPS is support for encoding into indirect compute commands. I recently thought of plans for how to add a Metal backend to DL4S, a deep learning framework for Swift. This requires commands to be dispatched semi-eagerly, where you can't pre-compile them into graphs like with MPSGraph. Being able to utilize indirect command buffers in a JIT compiler like XLA (tensorflow.org/xla) would provide opportunities to reduce encoding overhead.

This isn't encouraged by Apple, but I found a way to load the raw MPS shaders by peering into a private Metallib directory accessible from public APIs. I'll go into as little detail as possible for obvious reasons, but it was possible to create compute pipeline states from MPS shaders. If I had studied them longer, I could have made an indirect command buffer workflow using them. However, there are numerous details about MPS's internals that I don't know, so I might accidentally do something unsafe. I mention this because it shows the MPS team can theoretically pull this off - they just need to expose a safe public API for it. There is also a precedent for unique features geared toward rare performance use cases - MTLCommandQueue.makeCommandBufferWithUnretainedReferences().

I ended up scrapping plans for ICBs because I would need entirely custom shaders to securely execute GPU work, and Apple's MMX shaders far outperformed mine. With that restriction gone, I readily changed my plans to use MPS. For more context on how this played out, you can check out some of the closed issues under the DL4S Evolution repository. I later shifted my efforts to Swift for TensorFlow, so that repo shouldn't experience major updates in the future.

I'm debating whether I should jump-start MetalFFT now, while I wait for the S4TF project to gain momentum in the Swift community (also to help out @CaptainHarlock). I would structure its API similarly to MPS, but you need to input either a MTLComputeCommandEncoder or a MTLIndirectComputeCommand instead of kernel.encode(commandBuffer:, ...). Perhaps the completion of MetalFFT will help the MPS team better understand my suggestion about ICBs. To the Graphics and Games Engineer responding to this post - could you route the info about MetalFFT and ICBs to the MPS team?

  • Since ICB support is a separate feature request, you should make another request in Feedback Assistant. If you post the number here I can send it to MPS.

  • Number: FB9797575

    I just made a short message directing them to the comment this falls under. Is that good enough?

  • It would be preferable for you to put all the details into the feedback item, just for ease of access for the MPS team. But I think it should be sufficient as it is, since the request is pretty straightforward.

I just published over a week's worth of work on implementing Fast Fourier Transforms in Metal: MetalFFT. @CaptainHarlock worked with me throughout the process, and this thread is effectively resolved.

I have one more request for the MPS team, which is listed in my repository's README. I have no way of knowing whether any Apple engineers review a specific issue in the Feedback Assistant, and I especially do not want this one to be ignored. Graphics and Games Engineer, please relay this development directly to the MPS team (sorry for this being the third time you are asked that on this thread). I would like them to carry on my work and integrate it into Metal Performance Shaders, but we must establish communication first.

  • Please disregard this reply. The repository’s license has been changed to remove any restrictions possibly mentioned above.

Hi,

Please could you tell us the FFT sizes that you're using that are running slowly under vDSP?

Thanks!

  • Hi,

    I'm working with 128x128x512 tensors, so to run the Fourier transform over the whole tensor I'm looping 512 times over the FFT2 function with LOGN0 and LOGN1 = 7, because my vectors have length 128. I am still experimenting with ways to get the fastest FFT2 over tensors with vDSP, so if you have any tips on how I can speed it up I would really appreciate it.

    Thank you!

  • Would it help if I linked the file in MetalFFT showing the profiling concerns? I mirrored the FFT sizes @CaptainHarlock told me and showed benchmarks concerning system-level cache thrashing. The cache bottleneck was one reason I gave up on MetalFFT, but Apple might be better suited to investigating it. This is about the GPU implementation, not vDSP, so I don't know if it's relevant.
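
For reference, the numbers behind the 128x128x512 case above can be worked out with a short sketch. It rests on assumptions not stated in the thread: the 512 slices are stored back to back in memory, and each complex sample takes 8 bytes (single-precision real plus imaginary part). The helper names (log2_exact, slice_offsets, tensor_bytes) are hypothetical.

```python
def log2_exact(n):
    """Return log2(n) for a power of two; raise ValueError otherwise."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def slice_offsets(height, width, depth):
    """Element offset of each 2D slice, assuming slices are stored back to back."""
    slice_len = height * width
    return [k * slice_len for k in range(depth)]


def tensor_bytes(height, width, depth, bytes_per_complex=8):
    """Total size in bytes, assuming 8 bytes per complex sample (assumption)."""
    return height * width * depth * bytes_per_complex


HEIGHT, WIDTH, DEPTH = 128, 128, 512

log_n0 = log2_exact(HEIGHT)                      # 7, the LOGN0 from the post
log_n1 = log2_exact(WIDTH)                       # 7, the LOGN1 from the post
offsets = slice_offsets(HEIGHT, WIDTH, DEPTH)    # one pointer offset per FFT2 call
total_mib = tensor_bytes(HEIGHT, WIDTH, DEPTH) / (1024 * 1024)
```

Under these assumptions each of the 512 FFT2 calls starts 16,384 elements after the previous one, and the whole tensor occupies 64 MiB, which is worth keeping in mind when reading the cache-related benchmarks mentioned above.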

Hello,

I can add a Metal backend for VkFFT (https://github.com/DTolm/VkFFT), as it is already abstracted to support code generation for FFTs across different APIs. However, I don't currently have an Apple machine to test this on. There are reports from people who have run the VkFFT Vulkan backend through MoltenVK, and OpenCL still seems to be supported by Apple as well.

If Apple engineers are interested in this, feel free to contact me.

Best regards, Dmitrii

  • I can test it for you. I have both an Apple silicon Mac and an Intel Mac. But I strongly recommend that you read through my MetalFFT project thoroughly first. In fact, if you could transfer code over from VkFFT to MetalFFT, you would complete the project. I don’t have much time to spend, but we could work out a plan where I test or translate code for you.

    If you want the best performance, you need to make native Metal shaders, not a virtualized graphics technology like MoltenVK or SPIR-V. And as I learned the hard way with MetalFFT, performance can surprise you.

  • Also, you can finish MetalFFT on your own if you can access an iPad and download Swift Playgrounds. That’s a benefit of MetalFFT being written entirely in Swift.
