Indeterminate "while" loop in metal shader causes (IOAF code 5) on DTK and M1

Hi all - I am stuck on a kernel that works on a 2014 MacBook, a 2014 Mac mini, and a 2012 iMac, but doesn't work on the DTK, the M1 Mac mini, or the M1 MacBook Air.

It's a simple chained prefix sum ("StreamScan" for anyone who cares: 10.1145/2442516.2442539). Specifically, the first threadgroup does a prefix sum on its items, then updates a global array with its inclusive sum and a flag reporting that it is finished and that the inclusive sum is available. Each subsequent threadgroup waits on the flag and, when it sees that the prior threadgroup is done, updates its own global inclusive sum and its status, and so on.

Nvidia showed that a similar but slightly more complex strategy (called decoupled look-back) was faster than multi-level scans, which is why I was trying it out.

#include <metal_stdlib>
using namespace metal;

kernel void ChainedPrefixExclusiveSum(device uint* output_data,
                                      volatile device uchar* status_flags,
                                      volatile device uint* inclusive_sums,
                                      const device uint* input_data,
                                      constant uint& n,
                                      uint global_id [[thread_position_in_grid]],
                                      uint group_id [[threadgroup_position_in_grid]],
                                      ushort local_id [[thread_position_in_threadgroup]],
                                      ushort local_size [[threads_per_threadgroup]],
                                      ushort simd_size [[threads_per_simdgroup]],
                                      ushort simd_lane_id [[thread_index_in_simdgroup]],
                                      ushort simd_group_id [[simdgroup_index_in_threadgroup]])
{
    //  load from global memory into registers
    uint value = (global_id < n) ? input_data[global_id] : 0;

    //  simple but slow - scan within each simdgroup and save each
    //  simdgroup's inclusive sum into threadgroup memory
    uint scan = simd_prefix_exclusive_sum(value);
    threadgroup uint local_partial_sums[32];
    if (simd_lane_id == simd_size - 1)
        local_partial_sums[simd_group_id] = scan + value;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    //  scan the partial sums in threadgroup memory and save the local
    //  inclusive sum (threadgroup variables cannot have initializers;
    //  thread 0 writes prefix below, before the next barrier)
    threadgroup uint prefix;
    if (simd_group_id == 0) {
        uint aggregate = local_partial_sums[simd_size - 1];
        local_partial_sums[simd_lane_id] = simd_prefix_exclusive_sum(local_partial_sums[simd_lane_id]);
        aggregate += local_partial_sums[simd_size - 1];
        if (simd_lane_id == 0) {
            //==================================================//
            //  THE WHILE LOOP HERE IS THE PROBLEM:
            if (group_id != 0)
                while (status_flags[group_id - 1] == 0) {}
            //==================================================//
            prefix = (group_id == 0) ? 0 : inclusive_sums[group_id - 1];
            inclusive_sums[group_id] = aggregate + prefix;
            status_flags[group_id] = 'X';
        }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    //  add the per-simdgroup sum and the chained prefix to each thread's
    //  scan value and store
    if (global_id < n)
        output_data[global_id] = scan + local_partial_sums[simd_group_id] + prefix;
}

I get an IOAF code 5 on the while loop on the DTK and M1. The debugger really isn't helping me much; basically it's an infinite wait because the device buffer, specifically inclusive_sums[group_id - 1], never gets a value.

Any ideas? Could this be something with tile memory and deferred writes, and if so, is there a way to always force the write out to device memory via a buffer descriptor somewhere?

I did file a report (FB8967586), but on further reflection I wondered whether this is actually TBDR working as expected and I am just not finding some API I need to use with a compute kernel. So I figured I would ask here!

Thanks in advance.



Accepted Reply

Thanks for the feedback request.

This kernel relies on undefined behavior, which is why it works on some GPUs but not others. There are three invalid assumptions this kernel relies on:
  1. That the GPU will launch threadgroups in an order consistent with threadgroup_position_in_grid.

  2. That for any pair of threadgroups that run concurrently [in real time] on the GPU, the GPU will fairly schedule them so that both make forward progress concurrently.

  3. That operations on memory marked as volatile are atomic.

Specifically, the problem is how you're using status_flags. Each threadgroup waits for the previous one to finish before setting its own flag. However, Metal does not guarantee that the previous threadgroup will execute before the current one (they may run in any order). Also, since status_flags is not atomic, one thread can squash the results from another thread.

We suggest using a more traditional prefix sum algorithm.
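
For reference, the usual multi-pass structure (scan each block, scan the block sums, then add them back) is sketched below. It reuses the local scan from the kernel in the question and, like that kernel, assumes local_size == simd_size * simd_size so that local_partial_sums is fully populated; the kernel names and buffer bindings are illustrative, not part of the reply above.

    #include <metal_stdlib>
    using namespace metal;

    //  Pass 1 of 3: each threadgroup scans its own slice independently
    //  and writes its total to block_sums.
    kernel void BlockExclusiveSum(device uint* output_data        [[buffer(0)]],
                                  device uint* block_sums         [[buffer(1)]],
                                  const device uint* input_data   [[buffer(2)]],
                                  constant uint& n                [[buffer(3)]],
                                  uint   global_id     [[thread_position_in_grid]],
                                  uint   group_id      [[threadgroup_position_in_grid]],
                                  ushort local_id      [[thread_position_in_threadgroup]],
                                  ushort local_size    [[threads_per_threadgroup]],
                                  ushort simd_size     [[threads_per_simdgroup]],
                                  ushort simd_lane_id  [[thread_index_in_simdgroup]],
                                  ushort simd_group_id [[simdgroup_index_in_threadgroup]])
    {
        threadgroup uint local_partial_sums[32];

        uint value = (global_id < n) ? input_data[global_id] : 0;
        uint scan  = simd_prefix_exclusive_sum(value);

        //  last lane of each simdgroup publishes that simdgroup's inclusive sum
        if (simd_lane_id == simd_size - 1)
            local_partial_sums[simd_group_id] = scan + value;
        threadgroup_barrier(mem_flags::mem_threadgroup);

        //  first simdgroup scans the per-simdgroup sums
        if (simd_group_id == 0)
            local_partial_sums[simd_lane_id] =
                simd_prefix_exclusive_sum(local_partial_sums[simd_lane_id]);
        threadgroup_barrier(mem_flags::mem_threadgroup);

        uint result = scan + local_partial_sums[simd_group_id];
        if (global_id < n)
            output_data[global_id] = result;

        //  last thread of the threadgroup records the whole group's total
        if (local_id == local_size - 1)
            block_sums[group_id] = result + value;
    }

    //  Pass 2 scans block_sums itself (rerun pass 1 over block_sums, or do
    //  it on the CPU if there are few blocks). Pass 3 then folds the
    //  scanned block sums back into each block's results:
    kernel void AddBlockOffsets(device uint* output_data              [[buffer(0)]],
                                const device uint* scanned_block_sums [[buffer(1)]],
                                constant uint& n                      [[buffer(2)]],
                                uint global_id [[thread_position_in_grid]],
                                uint group_id  [[threadgroup_position_in_grid]])
    {
        if (global_id < n)
            output_data[global_id] += scanned_block_sums[group_id];
    }

Because each pass is a separate dispatch, the inter-threadgroup dependency is expressed through the command stream rather than by spinning on device memory, so none of the scheduling assumptions above are needed.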

Replies

Thanks so much for the thoughtful reply; I really appreciate you being on here looking at these discussions. Would you indulge two follow-up questions? Point 1 is of course an easy fix: use an atomic counter instead of threadgroup_position_in_grid to sequence the threadgroups (I had tried that but stripped it out for the example here).
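
(For concreteness, the counter trick I mean looks roughly like this inside a kernel; group_counter is an extra one-element device buffer that the host zeroes before each dispatch:)

    #include <metal_stdlib>
    using namespace metal;

    //  Sketch of the atomic-counter trick: derive a launch-ordered group
    //  id instead of trusting threadgroup_position_in_grid.
    kernel void OrderedGroups(device atomic_uint* group_counter [[buffer(0)]],
                              ushort local_id [[thread_position_in_threadgroup]])
    {
        threadgroup uint ordered_group_id;
        if (local_id == 0)
            ordered_group_id = atomic_fetch_add_explicit(group_counter, 1u,
                                                         memory_order_relaxed);
        threadgroup_barrier(mem_flags::mem_threadgroup);
        //  ... use ordered_group_id wherever the kernel used group_id ...
    }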

Point 3 I am a little stuck on. Since only one threadgroup writes to a given index, isn't it only the relative order, within a threadgroup, of writing the inclusive sum before writing the status flag that matters, because the subsequent threadgroup waits until that single byte is written? I had assumed a single-byte write is intrinsically atomic, but even if that assumption is wrong, the other threadgroup can't do anything until something (anything) is written. I guess I am not understanding how, if only one threadgroup can write to an index, another threadgroup can squash the result. Or is it really that writes to device memory are so slow that the threadgroup timeout gets hit before the next threadgroup sees the write?
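
(To be concrete, the atomic version of the flag I'm picturing is sketched below. As I understand it, memory_order_relaxed is the only ordering Metal atomics accept, so this would remove any tearing concern on the flag itself, but it still would not order the inclusive_sums write relative to the flag write:)

    //  status_flags redeclared as: device atomic_uint* status_flags

    //  producer (one thread per threadgroup):
    inclusive_sums[group_id] = aggregate + prefix;
    atomic_store_explicit(&status_flags[group_id], 1u, memory_order_relaxed);

    //  consumer (one thread in the next threadgroup):
    while (atomic_load_explicit(&status_flags[group_id - 1],
                                memory_order_relaxed) == 0) {}
    prefix = inclusive_sums[group_id - 1];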

Point 2 is obviously a deal killer, but it is also a limitation of other platforms that people have worked around. I was going to try to implement "Inter-Block GPU Communication via Fast Barrier Synchronization" (10.1109/IPDPS.2010.5470477). Any other ideas or resources on how to do global synchronization in Metal?
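
(Very roughly, the barrier from that paper transliterated into MSL would look like the sketch below; arrival is a hypothetical num_groups-element buffer zeroed by the host. It leans on exactly the forward-progress and fairness assumptions you warned about, so I include it only to show the pattern:)

    #include <metal_stdlib>
    using namespace metal;

    void global_barrier(device atomic_uint* arrival,
                        uint group_id, uint num_groups,
                        ushort local_id, ushort local_size)
    {
        //  every threadgroup announces its arrival
        if (local_id == 0)
            atomic_store_explicit(&arrival[group_id], 1u, memory_order_relaxed);

        if (group_id == 0) {
            //  threadgroup 0 waits for all groups, then releases them
            for (uint i = local_id; i < num_groups; i += local_size)
                while (atomic_load_explicit(&arrival[i], memory_order_relaxed) == 0) {}
            threadgroup_barrier(mem_flags::mem_device);
            for (uint i = local_id; i < num_groups; i += local_size)
                atomic_store_explicit(&arrival[i], 0u, memory_order_relaxed);
        }

        //  everyone spins until threadgroup 0 resets their slot
        if (local_id == 0)
            while (atomic_load_explicit(&arrival[group_id], memory_order_relaxed) != 0) {}
        threadgroup_barrier(mem_flags::mem_device);
    }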

In the interim I will take your advice and do it the old way :) Thanks again!!!


After thinking about your responses more, I think I get it. Since I can't edit my last post, I figured I would just post my updated thoughts in case anyone else was thinking about this.

About point 3, when I was thinking of a single-byte volatile write as already being atomic and ordered within a threadgroup, I was missing that nothing ensures it is globally ordered, and that assumption is obviously wrong in hindsight. AND, this naive strategy wouldn't necessarily work in OpenCL or CUDA either; we would use something like CLK_GLOBAL_MEM_FENCE in OpenCL or __threadfence() in CUDA to make sure the memory transaction completes. So I will just have to wait for something like that to be added to Metal at some point to make an algorithm like this work.

About point 2, I know you all are really busy, but it would be useful, if it's not purposefully a secret, to have more documentation on how the scheduler actually works beyond "no assumptions allowed". For example, once a threadgroup begins execution, if it stalls, does it get evicted, or can we assume it will finish once the stall clears (i.e., once a threadgroup is running, it runs until it completes)? If there were a way to handle the memory-transaction part of the data dependency, and if the dependencies were all forward, i.e., threadgroup 2 depends on threadgroup 1, threadgroup 1 depends on threadgroup 0, with no cycles, and we correctly handled threadgroups being launched out of order (by using an atomic counter as the group id, for example), could this deadlock, or would it be OK?


Part of the problem with documenting #2 is that Metal is supported on a variety of different GPUs, all of which schedule work differently. Perhaps we could make some guarantees as to how all GPUs should do things; of course, we'd have to find that common set of guarantees, which is a nontrivial task in and of itself. We'd have to work out not only how all current GPUs schedule threads to find commonalities (if there are any at all), but also enforce those commonalities in future GPUs supporting Metal. CUDA is able to do a better job of this because it only supports Nvidia hardware and, I'm guessing, because Nvidia believes it can keep those guarantees in future hardware.

That said, this is possible, especially if the hardware supporting Metal becomes more similar. If you have an idea of what you'd like to see, please create a request via the Feedback Assistant.

As far as #3 goes: there are many caches, and the copy of a variable in each cache can get updated independently (unless the variable is explicitly marked as atomic).