MTLComputeCommandEncoder dispatchThreads hangs

Question

Created Dec ’17

Replies 3

Boosts 0

Participants 2

Howdy All,

I've been wrestling with this issue for a few days now, and eventually gave up on using the MTLComputeCommandEncoder dispatchThreads method altogether, just to be safe. The problem I've been facing is calling the encoder method dispatchThreads with a threadsSize that has a smaller dimension than threadsPerThreadgroup. On my oldish laptop I get a width of 32 threads, for example. If I use that default value to build the threadsPerThreadgroup size, which I believe is pretty much what the documentation says to do, and then start an operation that is less than 32 threads wide, my laptop comes to a total halt.

I'm totally guessing, but maybe someone has managed to let an unsigned value underflow; those tend to produce very long loops, at least.
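
To make that guess concrete, here is a minimal C++ sketch of how an unsigned subtraction wraps around when the grid is narrower than the threadgroup. The function names are hypothetical and nothing here is known about the actual driver code; it only illustrates the kind of bug being speculated about.

```cpp
#include <cstdint>

// Hypothetical helper: threads left in the grid after one threadgroup-wide
// step, written naively with unsigned arithmetic.
constexpr std::uint32_t naiveRemaining(std::uint32_t gridWidth, std::uint32_t groupWidth) {
    // Wraps modulo 2^32 whenever groupWidth > gridWidth.
    return gridWidth - groupWidth;
}

// Guarded variant: saturates at zero instead of wrapping.
constexpr std::uint32_t safeRemaining(std::uint32_t gridWidth, std::uint32_t groupWidth) {
    return gridWidth > groupWidth ? gridWidth - groupWidth : 0u;
}

// A grid at least as wide as the threadgroup behaves as expected...
static_assert(naiveRemaining(8u, 8u) == 0u, "no wraparound when sizes match");
// ...but a 4-wide grid with an 8-wide threadgroup wraps to 2^32 - 4.
static_assert(naiveRemaining(4u, 8u) == 4294967292u, "unsigned wraparound");
static_assert(safeRemaining(4u, 8u) == 0u, "guarded version saturates");
```

A loop bounded by the wrapped value would run for roughly four billion iterations, which would at least be consistent with a hang rather than a crash.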

As a workaround, it seems possible to clamp threadsPerThreadgroup myself to the minimum of the threadsSize and threadsPerThreadgroup dimensions. But seeing something as simple as this fail tells me not to touch this convenience method until it works with basic inputs.
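
The clamp described above could look roughly like this. It is a sketch only: Size3 is a stand-in for MTLSize (assumed here to carry the same three unsigned fields) so that it compiles without Metal, and clampThreadgroupToGrid is a hypothetical helper name.

```cpp
#include <cstddef>

// Stand-in for MTLSize so the sketch builds without the Metal headers.
struct Size3 {
    std::size_t width;
    std::size_t height;
    std::size_t depth;
};

constexpr std::size_t minOf(std::size_t a, std::size_t b) {
    return a < b ? a : b;
}

// Shrink each threadgroup dimension so it never exceeds the grid dimension.
constexpr Size3 clampThreadgroupToGrid(Size3 threadsPerThreadgroup, Size3 grid) {
    return Size3{minOf(threadsPerThreadgroup.width, grid.width),
                 minOf(threadsPerThreadgroup.height, grid.height),
                 minOf(threadsPerThreadgroup.depth, grid.depth)};
}

// An 8x8x1 threadgroup against a 4x4x1 grid is reduced to 4x4x1.
static_assert(clampThreadgroupToGrid(Size3{8, 8, 1}, Size3{4, 4, 1}).width == 4, "width clamped");
static_assert(clampThreadgroupToGrid(Size3{8, 8, 1}, Size3{4, 4, 1}).height == 4, "height clamped");
```

In real code the three clamped values would be passed through MTLSizeMake and used as the threadsPerThreadgroup argument.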

This does not necessarily affect all Apple computers, but a late 2013 MacBook Pro with the integrated Intel Iris Pro GPU totally dislikes the situation described.

--

H

Answer 1

Graphics and Games Engineer OP

Apple

Dec ’17

It's difficult to say what exactly is going on without seeing the kernel code, but I can make a guess based on common causes of these issues.

It could be an out-of-bounds memory access in the kernel. It may work in some cases because the driver for one GPU may pad buffer or texture allocations, so that an access outside the length of a buffer or the dimensions of a texture doesn't actually touch unallocated memory. Some GPU drivers do this and others do not, so your kernel would crash inconsistently.

Make sure that you're using a parameter with the [[thread_position_in_grid]] attribute qualifier to test whether the kernel is processing outside the bounds of your buffers and/or textures, and return early if so. For example, 'myKernel' below returns early if 'gid' falls outside the bounds of 'textureForProcessing'. This prevents the write that occurs later from touching memory outside the dimensions of the texture.

kernel void myKernel(uint2 gid [[thread_position_in_grid]],
                     texture2d<half, access::write> textureForProcessing [[texture(0)]])
{
    if ((gid.x >= textureForProcessing.get_width()) || (gid.y >= textureForProcessing.get_height()))
    {
        return;
    }

    ...

    textureForProcessing.write(value, gid);
}
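
To see why the early return matters, consider a dispatch made of whole threadgroups (as with dispatchThreadgroups:threadsPerThreadgroup:), where the launched grid is rounded up past the texture edge. The C++ sketch below only simulates thread positions on the CPU; simulateGuard and GuardCount are hypothetical names, and the sizes are example values.

```cpp
#include <cstddef>

// Ceiling division: threadgroups needed to cover `extent` threads.
constexpr std::size_t groupsToCover(std::size_t extent, std::size_t groupSize) {
    return (extent + groupSize - 1) / groupSize;
}

struct GuardCount {
    std::size_t launched;  // threads started by the rounded-up dispatch
    std::size_t inBounds;  // threads that get past the bounds check
};

// Visit every thread position of a 2D dispatch built from whole threadgroups
// and apply the same test the kernel guard uses.
constexpr GuardCount simulateGuard(std::size_t texW, std::size_t texH,
                                   std::size_t groupW, std::size_t groupH) {
    const std::size_t gridW = groupsToCover(texW, groupW) * groupW;
    const std::size_t gridH = groupsToCover(texH, groupH) * groupH;
    GuardCount count{0, 0};
    for (std::size_t y = 0; y < gridH; ++y) {
        for (std::size_t x = 0; x < gridW; ++x) {
            ++count.launched;
            if (x >= texW || y >= texH) {
                continue;  // corresponds to the early return in the kernel
            }
            ++count.inBounds;
        }
    }
    return count;
}

// A 4x4 texture covered by one 8x8 threadgroup: 64 threads run, 16 may write.
static_assert(simulateGuard(4, 4, 8, 8).launched == 64, "one whole threadgroup");
static_assert(simulateGuard(4, 4, 8, 8).inBounds == 16, "only the texture area");
```

Without the guard, the remaining 48 threads in this example would attempt writes outside the texture. With dispatchThreads the grid is not rounded up in this way, so the guard should be redundant there, but it is cheap to keep.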

Answer 2

harism1234 OP

Jan ’18

Hello,

Actually, I can reproduce this issue with a 'nop-kernel' too:

kernel void TestHang(uint3 gid [[thread_position_in_grid]])
{
    return;
}

And the problem seems to be triggered by calling dispatchThreads with an execution grid smaller than the size given as threadsPerThreadgroup:

// Normally this would be calculated with threadExecutionWidth and
// maxTotalThreadsPerThreadgroup to be dynamic on GPU capabilities
const auto threadsSize = MTLSizeMake(8, 8, 1);
auto enc = [commandBuffer computeCommandEncoder];
[enc setComputePipelineState:_testHangPipeline];
[enc dispatchThreads:MTLSizeMake(4, 4, 1) threadsPerThreadgroup:threadsSize];
[enc endEncoding];

This code snippet hangs, but the same code works just fine as long as dispatchThreads is called with an execution grid size of MTLSizeMake(8, 8, 1) or larger, i.e. at least the same size as threadsPerThreadgroup. If I use any smaller size, it freezes the whole OS.

Anyway, I can avoid the freeze by limiting the threadsPerThreadgroup MTLSize to min(dispatchThreads, threadsSize). But I'm writing here mainly because of the whole-OS freeze this can cause. Also, the dispatchThreads documentation indicates it should handle this kind of situation optimally too, where the dispatchThreads size is smaller than threadsPerThreadgroup. The latter can vary significantly between GPUs, after all.
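
For context, the usual way to derive threadsPerThreadgroup from the pipeline state is width = threadExecutionWidth and height = maxTotalThreadsPerThreadgroup / width. The C++ sketch below shows how easily that default exceeds a small grid. Size3 is a stand-in for MTLSize, the helper names are hypothetical, and the limit of 512 total threads is an assumed example value, not a measurement from the laptop in question.

```cpp
#include <cstddef>

// Stand-in for MTLSize so the sketch builds without the Metal headers.
struct Size3 {
    std::size_t width;
    std::size_t height;
    std::size_t depth;
};

// Default 2D threadgroup shape from the two pipeline-state limits.
// Assumes threadExecutionWidth is nonzero, as it is on real devices.
constexpr Size3 defaultThreadgroupSize(std::size_t threadExecutionWidth,
                                       std::size_t maxTotalThreadsPerThreadgroup) {
    return Size3{threadExecutionWidth,
                 maxTotalThreadsPerThreadgroup / threadExecutionWidth,
                 1};
}

// True if the threadgroup is larger than the grid in any dimension,
// which is the situation reported to hang in this thread.
constexpr bool exceedsGrid(Size3 threadgroup, Size3 grid) {
    return threadgroup.width > grid.width ||
           threadgroup.height > grid.height ||
           threadgroup.depth > grid.depth;
}

// Width 32 (as reported above) with an assumed limit of 512 gives 32x16x1.
static_assert(defaultThreadgroupSize(32, 512).height == 16, "512 / 32");
static_assert(exceedsGrid(defaultThreadgroupSize(32, 512), Size3{4, 4, 1}), "4x4 grid is smaller");
static_assert(!exceedsGrid(Size3{8, 8, 1}, Size3{8, 8, 1}), "equal sizes are fine");
```

A check like exceedsGrid could decide when to apply the min() clamp before encoding the dispatch.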

--

H

Answer 3

harism1234 OP

Aug ’24

Accepted Answer

It was nice to see this issue has been resolved in a way I wasn’t even expecting. Nice job 🤜🤛

https://youtu.be/xCA1hNLMZ8M
