Surprising HPC results with M1 Max (OpenCL is stellar, Metal not so much)

I have a project that solves the viscoelastic equation for sound transmission in biological media: https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends, and we have done a lot of fine-tuning for each backend to get the best performance possible on each platform. Details of the numerical simulation and the hardware used are given in the link above. Here you can see a summary of the results.

First of all, the M1 Max is a knockout for both AMD and Nvidia, but only when using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also excellent. It is simply mind-blowing that the M1 Max is neck and neck with an Nvidia RTX A6000 that costs more than the MacBook Pro used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on the AMD W6800 Pro (the best computing time of all tested GPUs), but not so much on a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
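For context, the first formula on that Apple page sets the threadgroup width to the pipeline's threadExecutionWidth and the height to maxTotalThreadsPerThreadgroup divided by that width. A quick sketch of the arithmetic in Python (device values hardcoded for illustration, not queried from a real pipeline state):

```python
def threadgroup_size(thread_execution_width, max_total_threads_per_threadgroup):
    """First formula from 'Calculating Threadgroup and Grid Sizes':
    MTLSize(width: w, height: maxTotal / w, depth: 1)."""
    w = thread_execution_width
    h = max_total_threads_per_threadgroup // w
    return (w, h, 1)

# Illustrative values for a pipeline with threadExecutionWidth = 32
# and maxTotalThreadsPerThreadgroup = 1024:
print(threadgroup_size(32, 1024))   # -> (32, 32, 1)
```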

Further tests with different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal still lags by a lot.

Is there something else for the M1 Max with Metal that I could be missing or that is worth exploring? I want to be sure our applications are future-proof; it was surprising to see OpenCL still alive in Monterey, but we know it is supposed to be discontinued at some point.

Accepted Reply

I just want to wrap up this thread, as we finally managed to bring OpenCL and Metal to the same level of performance. It took a lot of changes (replacing the C++ wrappers with Swift, and later with a modified Python library that compiles kernels on the fly), but ultimately the biggest difference came from packing as many constants as possible into #define statements instead of passing them through constant memory. Once all those changes were done, the M1 Max Metal performance is now slightly better than OpenCL's, an improvement of roughly 300%, which was dramatic. For those interested, the newest and much more simplified code is at https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. Below is a screenshot of the performance test (I also pushed it to a more challenging test that better illustrates how the M1 processors stand vs the A6000).
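For what it's worth, a minimal sketch of that constant-baking change (hypothetical constant and kernel names; the real compile-on-the-fly path is in the repo): run-time constants are prepended to the kernel source as #define lines before compilation, so the compiler can fold them instead of loading them from constant memory.

```python
KERNEL_TEMPLATE = """
kernel void stress_step(device float *p [[buffer(0)]],
                        uint gid [[thread_position_in_grid]])
{
    if (gid >= N_TOTAL) return;   // N_TOTAL is baked in, not a buffer argument
    p[gid] *= DT * INV_RHO;
}
"""

def bake_constants(source, constants):
    """Prepend one #define per constant to the kernel source before compiling."""
    defines = "".join(f"#define {name} ({value})\n" for name, value in constants.items())
    return defines + source

src = bake_constants(KERNEL_TEMPLATE,
                     {"N_TOTAL": 128 * 128 * 128, "DT": 1.5e-8, "INV_RHO": 1e-3})
```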

Replies

Hi Sam,

We've looked at your project some, but we haven't had time to go through it thoroughly. Our theory is that you're hitting some pathological case in the compiler. Since the delta is so large, it could even be something as bad as your kernels spilling out of register space, but it's hard to know exactly how that could happen.

It would be helpful if you performed a Metal capture. This will give you utilization and limiter statistics which may give a hint as to what's going awry here.

  • An Instruments trace would probably be helpful here too. There are some situations where Metal manages power more aggressively than OpenCL, and Instruments would show this.

Add a Comment

Hi guys, thanks for taking a look, really appreciate it.

I tried to explore the Metal capture at some point, but because the project is coded outside Xcode, I didn't manage to make it work. For example, I tried to follow these suggestions (https://alia-traces.github.io/metal/tools/xcode/2020/07/18/adding-framecapture-outside-of-xcode.html) (https://developer.apple.com/documentation/metal/frame_capture_debugging_tools/capturing_gpu_command_data_programmatically?language=objc#3325255) to capture Metal outside Xcode into a gputrace file, but MTLCaptureManager.supportsDestination(.gpuTraceDocument) reported that it is not supported. I read that I needed to enable it with an Info.plist containing the MetalCaptureEnabled flag set to true:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>MetalCaptureEnabled</key>
    <true/>
</dict>
</plist>

But I never managed to make it work. I put the Info.plist in the directory where I run my application (via Python), as suggested here (https://stackoverflow.com/questions/63707866/macos-metal-failing-to-capture-gpu-frame-from-command-line-app), as well as in the location of the libraries, next to the Python binary, etc., but it never seemed to be picked up.

Do you have any other suggestions to enable the capture? Same for the instruments trace, any hint to do it from a command-line application would be welcome.

Cheers,

S

  • Okay, so this needs to be handled a bit differently since you're not trying to capture a true macOS app. Fortunately, there is an easy way to enable this: you should be able to set the METAL_CAPTURE_ENABLED=1 environment variable to enable support in the API and initiate a capture.

    I don't think you need to do anything in particular to perform an Instruments trace on a Python script; you should just be able to attach to the process. I will confirm this, however.

  • Also, we've noted that some OpenCL developers stumble on achieving CPU/GPU parallelism with Metal, since OpenCL handles this for you. You should confirm that your app is not waiting for command buffers to complete before encoding and submitting more work. This usually stems from calling MTLCommandBuffer.waitUntilCompleted() after committing command buffers. It can also occur if your app frequently waits on a semaphore or other synchronization primitive. The waiting itself may yield slower performance, but it can also have a secondary effect: it can cause the device to switch to a lower power state, since the app won't be feeding much work to the GPU, which in turn causes the OS to assume that the performance of a higher power state isn't necessary.
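The pattern the reply describes (keep several command buffers in flight instead of waiting on each one) can be sketched in a language-agnostic way. Below, `commit` is a stand-in for MTLCommandBuffer's commit with a completion handler, and a worker thread plays the role of the GPU finishing the work; only the semaphore/handler structure is the point here.

```python
import threading

MAX_IN_FLIGHT = 3                      # akin to Metal triple buffering
in_flight = threading.Semaphore(MAX_IN_FLIGHT)
completed = []
done = threading.Event()

def commit(step, on_complete):
    """Stand-in for MTLCommandBuffer.addCompletedHandler + commit();
    a worker thread fakes the GPU completing the work."""
    threading.Thread(target=on_complete, args=(step,)).start()

def run(n_steps):
    for step in range(n_steps):
        in_flight.acquire()            # blocks only when MAX_IN_FLIGHT buffers are queued
        def handler(s):
            completed.append(s)
            in_flight.release()        # free a slot for the next command buffer
            if len(completed) == n_steps:
                done.set()
        commit(step, handler)
    done.wait()                        # one wait at the end instead of one per buffer

run(10)
```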

Add a Comment

Amazing, that METAL_CAPTURE_ENABLED should be mentioned in the official documentation! :) It worked without a problem; now I can capture the gputrace.
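For anyone else driving Metal from Python, a minimal sketch of injecting the variable when launching the script (run_simulation.py is just a placeholder name): it has to be present in the environment of the process that creates the MTLDevice, so it is set when spawning the child process rather than after the device exists.

```python
import os
import subprocess

# Build an environment with capture support enabled; setting the variable
# after the Metal device has already been created would be too late.
env = dict(os.environ, METAL_CAPTURE_ENABLED="1")
# subprocess.run(["python3", "run_simulation.py"], env=env)  # placeholder script
```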

There is indeed some register spilling in 2 of the 3 kernels run in the main for loop of the FDTD process. As commented, I need to call MTLCommandBuffer.waitUntilCompleted() on each loop iteration, otherwise the results become unstable given the nature of the finite-difference time-domain method (you need to wait for the stress values to finish being calculated before calculating the particle values, and so on).

Regarding the spilling, after analyzing the trace I have an idea to try: split the most intense kernels (they are currently massive) into mini kernels, which should reduce the number of registers. These mini kernels can be run independently using a single command buffer in a single encoding operation, hopefully maximizing occupancy. It will be interesting to see how all 3 backends (OpenCL, Metal, and CUDA) behave, since this splitting will be common to all of them.
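To make the splitting idea concrete, here is a minimal sketch (hypothetical guard names; the real kernels live in FDTD3D_GPU_VERSION.h): the same source is compiled several times, each with a different preprocessor guard defined, so each mini kernel only contains the code behind its guard.

```python
# One source, several specializations: emulates passing a different
# -D flag to the on-the-fly kernel compiler for each mini kernel.
KERNEL_BODY = """
kernel void step(device float *field [[buffer(0)]],
                 uint gid [[thread_position_in_grid]])
{
#ifdef DO_STRESS
    // stress update goes here
#endif
#ifdef DO_PARTICLE
    // particle-velocity update goes here
#endif
}
"""

def specialize(body, guard):
    """Prepend the #define that activates one section of the kernel."""
    return f"#define {guard}\n" + body

stress_src = specialize(KERNEL_BODY, "DO_STRESS")
particle_src = specialize(KERNEL_BODY, "DO_PARTICLE")
```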

Thanks again for that METAL_CAPTURE_ENABLED info, I can see I will use it heavily in the future.

  • Okay, so you're doing the equivalent of a waitUntilCompleted() on OpenCL (clFinish) and CUDA then?

    And in CUDA and OpenCL, the kernels that exhibit spilling in Metal have not been split apart as you're thinking of doing?

  • Indeed, all 3 backends use similar sync conditions to those you describe, so they are being compared on the same grounds as best I can manage.

    I managed to recode things so the splitting can be tailored very granularly using macros (from running the original large kernels to splitting into multiple mini kernels), ensuring that only the strictly necessary code is present at compilation time for each mini kernel.

    The mini-kernel approach (putting all of them in a single encoder before committing) improved the Metal computing time on the M1 Max by roughly 15%, but it is still far from the OpenCL times (157 s in Metal vs 57 s with OpenCL). I verified with the gputrace that none of the mini kernels shows register spilling anymore.

    My guess is that Metal's power management may be limiting the execution; these are indeed very compute-intense kernels. The original large kernels do 3D operations over 30+ 3D arrays; splitting into mini kernels limited the number of buffers being accessed each time.

    Something worth noting is that the Metal execution differs from OpenCL and CUDA. As mentioned above, these kernels need to access a lot of separate 3D arrays. In CUDA, they are passed as a single parameter structure containing pointers to all the 3D arrays. In OpenCL, this didn't work, so I simply pass something like 55 input parameters. But in Metal, I had to do some packing to merge several 3D arrays into a single buffer to be sure everything could be passed in 32 or fewer input parameters. The indexing into the arrays can be managed smartly via macros, but in Metal this means that accessing the beginning of array X always involves adding an offset, an operation that is not present in the CUDA and OpenCL versions. I do not think this should translate into a big penalty, though: as noted in the first post, the W6800 Pro Metal execution was the best of the pack and includes these operations.

    Coming back to power management, is there a programmatic way to disable Metal power management that could be explored?

  • There is no way to directly control power management, but we think OpenCL shouldn't be using a drastically different power-management policy when idling the GPU. However, comparing Instruments traces could give us a clue as to whether that's what you're hitting. Running a Metal System Trace on the OpenCL workload should give you info we could compare against the native Metal workload.

    Were you ever able to use Instruments to trace your workload?
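To illustrate the buffer-merging described a couple of replies up (assumed layout and hypothetical array names; the real macros are in the repo): several same-sized 3D arrays are concatenated into one flat buffer and addressed with a per-array base offset plus the usual linear 3D index.

```python
nx, ny, nz = 4, 5, 6                       # illustrative grid size
names = ["Vx", "Vy", "Vz", "Sigma_xx"]     # hypothetical array names
n = nx * ny * nz
offsets = {name: i * n for i, name in enumerate(names)}  # base of each array

merged = [0.0] * (len(names) * n)          # stands in for the single MTLBuffer

def idx(name, i, j, k):
    """Equivalent of the indexing macro: per-array base offset + linear 3D index."""
    return offsets[name] + (k * ny + j) * nx + i

merged[idx("Vy", 1, 2, 3)] = 42.0          # write into "Vy" through the merged buffer
```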

Add a Comment

The Instruments trace shows that the OpenCL execution has higher occupancy and much higher memory bandwidth (390 GB/s in OpenCL vs 110 GB/s in Metal). The memory-access strategy of the OpenCL path alone could explain the difference. Keep in mind that this OpenCL run used the very large kernels, so memory is being accessed aggressively; still, it is nice to see it hitting peak performance.

This is the Metal instruments overview using the "mini-kernels" approach

and here is OpenCL using the original large kernels

From the Instruments trace (and apologies, since I'm just learning how to use it correctly), besides seeing some of this evidence, how can I go into more detail to identify which specific bottleneck may explain the difference?

So we took a look at the code a bit and found a couple of items of note.

  • It doesn't appear that the OpenCL path is actually waiting anywhere, or at least we couldn't find where it does. There appears to be logic that selects which API to call to process each portion of work. Each batch creates a queue on the Metal side and also waits for completion on every iteration inside the processing function. However, we don't see this in the OpenCL path.
  • This segment of code recreates buffers every iteration. Buffers are meant to be reused, so recreating them this often can incur significant overhead. The data you're passing to the kernel through these buffers is small enough that you can probably just use the MTLComputeCommandEncoder.setBytes(_:length:index:) method to send it.
  • Here it looks like you're recreating the compute pipeline each iteration. In the worst case, this means you're invoking the shader compiler every time, which is an incredibly expensive operation. This should be done before any processing, and the pipeline should be reused each iteration.

Also, it's difficult to tell what's happening in the Instruments traces you posted, as we can't see the command buffers and encoders responsible for the work measured here. Including the Metal Application and GPU tracks would show us that.

To send us info, it's probably easiest to create a Feedback Assistant report (and post the number here). You can attach any files with the data you're gathering.

Awesome, thanks for taking a look, and apologies, I should have pointed you to the right spot beforehand: you are looking at the wrong function :). That segment of code is part of the Rayleigh-Sommerfeld integral that is included in the library to complement the FDTD solution of the viscoelastic solver, which is mainly covered in the FDTD3D_GPU_VERSION.h file. I'm linking to the experimental branch FirstTestAppleSilicon, which has all the recent "mini-kernels" work; those changes have not yet been merged into the main branch. All the recent Instruments profiling tests were done with that new branch.

As a side note, yes, in the ForwardSimple function I should move the defaultLibrary.makeFunction(name: "ForwardSimpleMetal")! call out of the main loop, but that function is already working quite well; the one that really initiated this thread is in FDTD3D_GPU_VERSION.h.

So back to FDTD3D_GPU_VERSION.h: there you can see the compilation of all 3 GPU backends controlled with macro conditions (i.e., #if defined(METAL) ...). You can see (here) that for the code in question I didn't make that mistake of creating the library inside the main loop. Be aware that for FDTD3D_GPU_VERSION.h I used the mtlpp wrapper instead of a Swift-based interface, since FDTD3D_GPU_VERSION.h was originally developed as a C-based Python extension (I have a student who will work on changing that so we can use Swift instead, but that is for another day). You can also see there how each of the GPU backends gets its synchronization.

Btw, the mini-kernel Metal approach showed significant improvements for the AMD GPUs: the W6800 now shows even better results, and the Vega 56 now also performs better than OpenCL. Here you can see the updated benchmarks.

The Vega 56 showed the most dramatic improvement, from 144 s to 83 s with the mini-kernels approach. I haven't yet finished the proper tests in OpenCL and CUDA to see how the mini-kernels approach works with those backends. But at least this exercise has been quite productive for improving performance on Metal-supported GPUs; it is clear that, with adequate fine-tuning, Metal can and should do better than OpenCL. Now I just need to keep investigating how to ensure Metal shows the same trend on the M1 Max as it does on the AMD GPUs.

Thanks again for taking a look.

  • We're glad you're seeing gains on AMD with the mini-kernel approach. The M1 performance with Metal still concerns us and is really unexpected.

    Thanks for pointing us to the code exhibiting this behavior. Is it possible to create a report with Feedback Assistant with instructions to run this workload? Attaching the Instruments traces you created would also be helpful. This will allow us to take a closer look at the problem.

  • Awesome,

    I just created report FB9882670, so you can take a look at the Instruments trace now. As noted in the report, the capture ran a shorter simulation than the one used for the table above, but it still shows the same level of difference between OpenCL and Metal on the M1 Max. In the Instruments capture, you will find the following 3 runs:

    Run 1: Metal-based capture on the M1 Max using the newest kernel implementation, based on my chats with the engineers in this thread, where I split my problem into "mini-kernels". This version shows the best Metal-based performance, but is still significantly slower than OpenCL.

    Run 2: OpenCL-based capture on the M1 Max showing the best performance. This one used the original (very large) kernels for my FDTD solver.

    Run 3: Metal-based capture of the same kernels used for OpenCL. This shows the worst performance of the 3 runs.

    Looking forward to hearing your thoughts on the potential issues in my Metal implementation,

    Cheers

    S

  • Thanks for the report. I see that it doesn't contain a trace of the encoders working, so it will be tricky to tell what's going on, but we can try to reproduce this ourselves with the code on GitHub. Can you give us any specific instructions to follow so we can reproduce what you're seeing?

It is a pity this seems to have fallen through the cracks, and now I have a bigger issue: the latest macOS 12.3 update truly broke the operation of my W6800 GPU! I can't run any of my Metal code anymore; it just hangs there indefinitely. I have seen multiple reports of a drop in performance with eGPUs associated with the latest macOS release. Please check what is going on! This is highly disruptive.

Hi Sam,

Your issue hasn't been lost. We're still looking at it, but I'm sorry for the lack of updates.

Wow, so none of your code runs. Do you have any more details you can share? Does the whole system hang, or just your app?

Thanks for the reply. The problem with my eGPU was fixed just yesterday with the rollout of macOS 12.3.1; there were reports all over many forums of eGPU issues with 12.3. After the update, my programs run again. It was just my app that hung, not the whole system. Anyway, it is not a problem anymore; I'm looking forward to hearing more on the original issue. Maybe to give extra motivation: given the excellent experience with the M1 Max, we are now considering Mac Studios with the M1 Ultra (and anything in your pipeline of new processors) for neuronavigated procedures in human participants, where this numerical model running in OpenCL/Metal will be tightly integrated, so being future-proof with Metal is truly critical for us moving forward.

Hi guys, any updates on what I could try? Or is this something you think could be addressed with Metal 3 once it becomes available?
