In Metal compute kernels, when do thread variables get spilled into the device memory?

How many 32-bit variables can I use concurrently in a single thread of a Metal compute kernel without worrying about the variables getting spilled into the device memory? Alternatively: how many 32-bit registers does a single thread have available for itself?

Let's say that each thread of my compute kernel needs to store and work with its own array of N float variables, where N can be 128, 256, 512 or more. To achieve the maximum possible performance, I do not want the local thread variables to get spilled into the slow device memory. I want all N variables to be stored "on-chip", in the thread memory space.

To make my question more concrete, let's say there is an array thread float localArray[N]. Assuming an unrealistic hypothetical scenario where localArray is the only variable in the whole kernel, what is the maximum value of N for which no portion of localArray would get spilled into the device memory?

I searched in the Metal feature set tables, but I could not find any details.

Answered by DTS Engineer in 827215022

Hello,

You'll want to consider maxThreadgroupMemoryLength:

"The maximum threadgroup memory available to a compute kernel, in bytes."

Likewise recommendedMaxWorkingSetSize.
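As a rough illustration of how those two device properties could be read and used for a budget estimate, here is a minimal Swift sketch. It assumes the per-thread array would be placed in threadgroup memory rather than in the thread address space, which is an assumption made for the example and not something the question states. The helper names threadgroupBytesNeeded and maxFloatsPerThread are hypothetical, and the threadgroup size of 32 is an arbitrary example value. The sketch says nothing about how many registers a thread has.

```swift
import Foundation
#if canImport(Metal)
import Metal
#endif

/// Bytes of threadgroup memory needed if every thread in a threadgroup
/// keeps its own array of `n` floats there (hypothetical helper, not a Metal API).
func threadgroupBytesNeeded(floatsPerThread n: Int, threadsPerThreadgroup: Int) -> Int {
    return n * MemoryLayout<Float>.stride * threadsPerThreadgroup
}

/// Largest per-thread float count that fits in `limit` bytes of threadgroup memory
/// (hypothetical helper, not a Metal API).
func maxFloatsPerThread(threadsPerThreadgroup: Int, limit: Int) -> Int {
    return limit / (MemoryLayout<Float>.stride * threadsPerThreadgroup)
}

#if canImport(Metal)
// Query the limits on the default GPU, if one is available.
if let device = MTLCreateSystemDefaultDevice() {
    let limit = device.maxThreadgroupMemoryLength
    print("maxThreadgroupMemoryLength: \(limit) bytes")
    print("recommendedMaxWorkingSetSize: \(device.recommendedMaxWorkingSetSize) bytes")
    // 32 threads per threadgroup is an arbitrary example value.
    let n = maxFloatsPerThread(threadsPerThreadgroup: 32, limit: limit)
    print("floats per thread that fit with 32 threads per threadgroup: \(n)")
}
#endif
```

The arithmetic shows the trade-off: the larger the per-thread array, the fewer threads can share a threadgroup before the threadgroup memory limit is reached.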

Spilled bytes are shown in a GPU trace with the Metal Debugger.
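Besides the GPU trace, a quick experiment is to compile the kernel for several values of N and compare what the resulting pipeline state reports. The sketch below is only an illustration: the kernel named scratch and the macro N are hypothetical, and treating a drop in maxTotalThreadsPerThreadgroup as a hint of higher per-thread resource usage is an assumption. The spilled-bytes figure in the Metal Debugger remains the direct measurement.

```swift
import Foundation
#if canImport(Metal)
import Metal
#endif

// Hypothetical kernel used only for illustration; N is supplied as a preprocessor macro.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void scratch(device float *out [[buffer(0)]],
                    uint gid [[thread_position_in_grid]]) {
    thread float localArray[N];
    for (uint i = 0; i < N; ++i) { localArray[i] = float(i + gid); }
    float sum = 0.0f;
    // Data-dependent indexing so the array cannot be trivially optimized away.
    for (uint i = 0; i < N; ++i) { sum += localArray[(i * 7 + gid) % N]; }
    out[gid] = sum;
}
"""

#if canImport(Metal)
if let device = MTLCreateSystemDefaultDevice() {
    for n in [128, 256, 512] {
        let options = MTLCompileOptions()
        options.preprocessorMacros = ["N": NSNumber(value: n)]
        do {
            let library = try device.makeLibrary(source: kernelSource, options: options)
            guard let function = library.makeFunction(name: "scratch") else { continue }
            let pipeline = try device.makeComputePipelineState(function: function)
            print("N=\(n): maxTotalThreadsPerThreadgroup=\(pipeline.maxTotalThreadsPerThreadgroup), "
                + "threadExecutionWidth=\(pipeline.threadExecutionWidth)")
        } catch {
            print("N=\(n): compilation failed: \(error)")
        }
    }
}
#endif
```

The printed values depend on the GPU and compiler version, so they are best compared against a GPU trace of the same kernel rather than read in isolation.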

We recommend "Metal Compute on MacBook Pro" to learn more.
