Another thing I'd like to see in MPS is support for encoding into indirect compute commands. I recently drafted plans for adding a Metal backend to DL4S, a deep learning framework for Swift. That backend requires commands to be dispatched semi-eagerly, so they can't be pre-compiled into graphs the way MPSGraph allows. Being able to utilize indirect command buffers from a JIT compiler like XLA (tensorflow.org/xla) would provide opportunities to reduce encoding overhead.
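The encoding-overhead argument can be illustrated with a language-agnostic toy model: re-encoding every command on each step costs CPU time proportional to the number of steps, while recording the commands once and replaying them keeps that cost constant. This is only a loose analogy for what an indirect command buffer enables on the GPU; the class and function names below are hypothetical, not Metal API.

```python
class RecordedCommands:
    """Toy stand-in for a reusable command list (loose ICB analogy)."""

    def __init__(self):
        self.commands = []
        self.encode_count = 0  # models the CPU-side cost of encoding

    def encode(self, fn, *args):
        """Record one command; each call pays an encoding cost."""
        self.encode_count += 1
        self.commands.append((fn, args))

    def execute(self, state):
        """Replay every recorded command; no further encoding cost."""
        for fn, args in self.commands:
            fn(state, *args)


def accumulate(state, key, amount):
    """A trivial 'kernel': accumulate into a named buffer."""
    state[key] = state.get(key, 0.0) + amount
```

An eager dispatcher calls encode once per kernel per step, so encoding cost grows with step count; a replayable list calls encode once per kernel total, which is the kind of overhead reduction a JIT compiler could exploit.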
This isn't encouraged by Apple, but I found a way to load the raw MPS shaders by peering into a private Metallib directory accessible from public APIs. I'll go into as little detail as possible for obvious reasons, but it was possible to create compute pipeline states from MPS shaders. If I had studied them longer, I could have built an indirect command buffer workflow on top of them. However, there are numerous details about MPS's internals that I don't know, so I might accidentally do something unsafe. I mention this because it proves the MPS team could theoretically pull this off - they just need to expose a safe public API for it. There is also precedent for unique features geared toward rare performance use cases - MTLCommandQueue.makeCommandBufferWithUnretainedReferences().
I ended up scrapping my plans for ICBs because I would need entirely custom shaders to securely execute GPU work, and Apple's MMX shaders far outperformed mine. With that restriction gone, I readily changed my plans to use MPS. For more context on how this played out, you can check out some of the closed issues in the DL4S Evolution repository. I later shifted my efforts to Swift for TensorFlow, so that repo shouldn't see major updates in the future.
I'm debating whether I should jump-start MetalFFT now, while I wait for the S4TF project to gain momentum in the Swift community (and also to help out @CaptainHarlock). I would structure its API similarly to MPS, except that instead of calling kernel.encode(commandBuffer:...), you would pass in either an MTLComputeCommandEncoder or an MTLIndirectComputeCommand. Perhaps the completion of MetalFFT will help the MPS team better understand my suggestion about ICBs. To the Graphics and Games Engineer responding to this post - could you route the info about MetalFFT and ICBs to the MPS team?
fotonism

CaptainHarlock

fotonism
Curious as to why vDSP's routines weren't acceptable. Was the CPU version of the algorithms too slow? On Apple Silicon machines, do the vDSP routines automatically use the GPU, or is that only possible through the Metal API?
I had to perform 2D FFTs on very large tensors. So I did some research on vDSP's FFT routines and found out the following:
- There's no batch function for 2D FFT (something like fftm for 1D). To run fft2 over my whole tensor, I had to put it in a loop and batch it manually by moving my pointers across the tensor for each call of the function, which was obviously pretty slow. I couldn't make just one fft2 call over the entire tensor because the log2N parameter would be too big.
- I found a trick: do an fftm, then transpose the tensor so the columns become rows (and therefore contiguous in memory), then do another fftm. This was the fastest approach I could find, even though the transpose operation costs some time too.
- Basically, I followed all the tips I found in the documentation for getting the best performance out of vDSP: using a stride of 1 as much as possible (that's why I transposed my tensor between the two fftm calls) and allocating 16-byte-aligned memory with posix_memalign.
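The fftm -> transpose -> fftm trick described above is exactly the row-column decomposition of the 2D DFT. Here is a minimal sketch in plain Python for clarity (a naive DFT stands in for vDSP's optimized batched FFT, and the function names are mine, not Accelerate's):

```python
import cmath


def dft(row):
    """Naive 1D DFT of one row (stand-in for one batch entry of fftm)."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]


def dft_rows(matrix):
    """Batched 1D DFT over contiguous rows -- the role fftm plays."""
    return [dft(row) for row in matrix]


def transpose(matrix):
    """Make the columns contiguous so the second batched pass
    also runs with stride 1, as described above."""
    return [list(col) for col in zip(*matrix)]


def dft2_via_transpose(matrix):
    """2D DFT as: batched row DFT, transpose, batched row DFT, transpose."""
    return transpose(dft_rows(transpose(dft_rows(matrix))))
```

The two batched passes plus transposes produce the same result as the direct 2D DFT double sum, which is why the trick is valid and not just faster.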
However, I am a beginner developer and I definitely could have missed something that made my vDSP FFT too slow. Still, the fact is that my current results didn't match the performance of other GPU frameworks, like cuFFT in CUDA. That is why I thought a highly optimized Metal FFT could exist in MPS.
Thank you.