Hello
I am trying to get thread group memory access in fragment shader. In essence, I would like to have all the fragments in a tile to bitwiseOR some value. My idea was to use simd_or across the SIMD group, then make each SIMD group thread 0 to atomic or the value into thread group memory. Finally very first thread of the tile would be tasked with writing the value down to texture with write access.
Now, I can allocate the thread group memory argument to the fragment function all right. MTLRenderEncoder has setThreadgroupMemoryLength call, which I am using the following way
[renderEncoder setThreagroupMemoryLength: 16 offset: 0 atIndex:0]
Unfortunately, all I am getting is the following error (runtime assertion)
-[MTLDebugRenderCommandEncoder setThreadgroupMemoryLength:offset:atIndex:]:3487: failed assertion Set Threadgroup Memory Length Validation
offset + length(16) must be <= threadgroupMemoryLength(0).`
What I am doing wrong? How I can get thread group memory in the fragment shader? I know I could use tile shading and compute function but the problem is that here I really like to use fragment stuff. Will be grateful for help.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
Hello!
I have a question about how thread groups work with tile shading. When running "traditional" compute, I get to choose both thread group size and the grid size. However, when using tile shading kernel I only have dispatchThreadsPerTile method - this controls how many threads will be ran in each tile. So far so good, but what about thread groups?
The examples in video "Tile Shading on A11" seem to suggest that there will be only one thread group per tile. In the video, [[thread_index_in_threadgroup]] is called "local_id" and it is used to access the image block.
I assume this is the default configuration. So when one does the following:
Creates MTLRenderPassDescriptor with tileWidth set to W and tileHeight set to H
Fires up the tile shading kernel using dispatchThreadsPerTile with MTLSize size = { W, H, 1 }
I understand that the result is 1-to-1 mapping between the tile "pixels" and kernel threads. Now, what I would like to do is to have more than one thread group there. I want this for performance reasons: I have a certain compute kernel which I know executes very well with small thread group size. In fact, { 32, 1, 1 } seems to be the fastest. My understanding is that even if I set tile size to 16x16, and so I am executing 256 threads there, there will only be one SIMD group active in a thread group. Meaning that this SIMD group has to execute 8 times over the tile.
Is it possible somehow? Or perhaps the limitations of the API are pointing at the limitations of hardware itself, and if I want to execute with SIMD group sized thread groups I have to use "traditional" compute encoder?
Will be grateful for help.
Michał
Hello!
I need to "draw" a set of particles into the texture. It would be trivial in render encoder of course. However, I would like to implement the task in compute kernel. Every particle draw operation is expected to set 5 texels - "center" one and left/right/upper/lower. Particles can and will overlap, so concurrent draws are to be expected.
I tried using texture atomics - atomic_store() to be more precise. This worked, albeit pretty slowly - too slow for my purpose.
Just to test what would happen, I tried using normal texture write(). I was expecting to see some kind of visual artefacts, but to my surprise, it worked very well (and much faster).
My question: is it safe? I understand that calling write() doesn't guarantee any ordering of the operations, so if multiple threads write to the same texel, the final value may come from any of those threads. But suppose all the threads were to write the very same color? Can I assume that the texel in question will have said color after the compute kernel finishes?
I am using M2 Pro MacBook, but ideally I would love to get the answer for the all Apple Silicon devices. My texture format is R32Int (so as to be able to use atomics), but I could do with any single-channel format, the purpose of the texture is to be binary mask of sorts.
Thanks!