Threadgroup memory for fragment shader

Hello

I am trying to get threadgroup memory access in a fragment shader. In essence, I would like all the fragments in a tile to bitwise-OR some value together. My idea was to use simd_or across each SIMD-group, then have thread 0 of each SIMD-group atomically OR its result into threadgroup memory. Finally, the very first thread of the tile would be tasked with writing the value out to a texture with write access.
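In shader terms, the intended fragment function would look roughly like this (a sketch only; the function name and the per-fragment value are illustrative, and the threadgroup binding is exactly the part that fails to validate):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: OR a per-fragment value across the SIMD-group, then have one
// lane per SIMD-group atomically OR the result into tile-wide
// threadgroup memory bound at [[threadgroup(0)]].
fragment void tileOrFragment(
    float4 pos [[position]],
    threadgroup atomic_uint *tileAccum [[threadgroup(0)]])
{
    // Illustrative per-fragment value; in practice this would come
    // from whatever data the pass is classifying.
    uint value = 1u << (uint(pos.x) & 31u);

    // Reduce across the SIMD-group first, so only one lane per
    // SIMD-group touches threadgroup memory.
    uint groupOr = simd_or(value);
    if (simd_is_first()) {
        atomic_fetch_or_explicit(tileAccum, groupOr, memory_order_relaxed);
    }
}
```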

Now, I can declare the threadgroup memory argument of the fragment function all right. MTLRenderCommandEncoder has a setThreadgroupMemoryLength:offset:atIndex: call, which I am using the following way: [renderEncoder setThreadgroupMemoryLength:16 offset:0 atIndex:0]

Unfortunately, all I am getting is the following runtime assertion:

-[MTLDebugRenderCommandEncoder setThreadgroupMemoryLength:offset:atIndex:]:3487: failed assertion `Set Threadgroup Memory Length Validation offset + length(16) must be <= threadgroupMemoryLength(0).`

What am I doing wrong? How can I get threadgroup memory in a fragment shader? I know I could use tile shading and a compute function, but here I would really like to stay in the fragment stage. I will be grateful for any help.

Feeling a bit like I am in an echo chamber, but anyway: I was able to "hack" around it and perform a pretty efficient reduce over tile memory from the fragment shader in the following way:

  1. [[simdgroup_index_in_threadgroup]] is not available in fragment functions, but I computed my own out of [[pixel_position_in_tile]]. The Metal spec says a bit about how SIMD-groups are laid out when processing a tile, and the layout can also be checked by writing thread indices to a memory buffer.
  2. Given the SIMD-group index, one can perform a SIMD-group-wide reduce, then write each SIMD-group's result to a temporary buffer (which needs enough space for all the SIMD-groups in the tile; in my case 32x32 / 32 = 32).
  3. An extra compute kernel is then dispatched to read the temporary buffer and perform the final reduction of the 32 values per tile into the final tile value.
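The three steps above can be sketched as follows. This assumes 32x32 tiles, 32-wide SIMD-groups, and a linear row-major SIMD-group layout across the tile; that layout is a guess that should be verified per GPU by dumping thread indices, as noted above. All names are mine:

```metal
#include <metal_stdlib>
using namespace metal;

constant uint kTileDim       = 32;                                // assumed tile size
constant uint kSimdSize      = 32;                                // assumed SIMD width
constant uint kGroupsPerTile = (kTileDim * kTileDim) / kSimdSize; // 32

// Pass 1: per-SIMD-group reduce in the fragment shader, partial
// results written to a device buffer (kGroupsPerTile entries per tile).
fragment void reduceFragment(
    float4  pos      [[position]],
    ushort2 tilePos  [[pixel_position_in_tile]],
    device atomic_uint *partials    [[buffer(0)]],
    constant uint      &tilesPerRow [[buffer(1)]])
{
    // Illustrative per-fragment value to OR together.
    uint value = 1u << (uint(pos.x) & 31u);

    // Hand-rolled SIMD-group index: linear pixel index in tile / SIMD width.
    uint linear  = uint(tilePos.y) * kTileDim + uint(tilePos.x);
    uint simdIdx = linear / kSimdSize;

    // Which tile this fragment belongs to.
    uint tileIdx = (uint(pos.y) / kTileDim) * tilesPerRow
                 + (uint(pos.x) / kTileDim);

    uint groupOr = simd_or(value);
    if (simd_is_first()) {
        // Atomic OR, in case the assumed layout is wrong and two
        // SIMD-groups map to the same slot.
        atomic_fetch_or_explicit(&partials[tileIdx * kGroupsPerTile + simdIdx],
                                 groupOr, memory_order_relaxed);
    }
}

// Pass 2: one compute thread per tile folds the 32 partials
// into the final tile value.
kernel void finalizeTiles(device const uint *partials    [[buffer(0)]],
                          device uint       *tileResults [[buffer(1)]],
                          uint tid [[thread_position_in_grid]])
{
    uint acc = 0;
    for (uint i = 0; i < kGroupsPerTile; ++i) {
        acc |= partials[tid * kGroupsPerTile + i];
    }
    tileResults[tid] = acc;
}
```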

Overall, this is still a win, because I can use the data that is already loaded into tile memory and avoid re-reading it for the reduction.

BUT it is a pity that it doesn't work as it should. The Metal spec, version 3.2, mentions usage of threadgroup memory in fragment functions several times. For example, on page 86 it says: "See the Metal Feature Set Tables to learn which GPUs support threadgroup space arguments for fragment shaders." Yet I couldn't find any such entry in the Feature Set Tables.

This is an omission/bug in what is otherwise an excellent feature of Metal and Apple GPUs. I hope it can be rectified some day...
