I was familiarising myself with Metal mesh shaders and ran into some issues. First, a trivial application that uses mesh shaders to generate simple rectangular geometry hangs the GPU when dispatching 2D grids of mesh shader threadgroups, and the hang is oddly sensitive to the grid shape. E.g.
// these work!
meshGridProperties.set_threadgroups_per_grid(uint3(512, 1, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(16, 8, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 5, 1));
// these (and anything "bigger") hang!
meshGridProperties.set_threadgroups_per_grid(uint3(16, 9, 1));
meshGridProperties.set_threadgroups_per_grid(uint3(32, 6, 1));
The sample shader code is attached. The invocation is trivial enough:
re.drawMeshThreadgroups(
    MTLSizeMake(1, 1, 1),
    threadsPerObjectThreadgroup: MTLSizeMake(1, 1, 1),
    threadsPerMeshThreadgroup: MTLSizeMake(1, 1, 1)
)
For apple engineers: a bug has been submitted under FB10367407
Mesh shader code:
2d_grid_mesh_shader_hangs.metal
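The attachment isn't reproduced here, but a minimal object/mesh pair in the same spirit (names and geometry are illustrative, not the exact attached code) would look like:

```metal
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
};

// Up to 4 vertices and 2 triangles per mesh threadgroup.
using QuadMesh = mesh<VertexOut, void, 4, 2, topology::triangle>;

// Object stage: dispatch a 2D grid of mesh threadgroups.
[[object]]
void objectMain(mesh_grid_properties meshGridProperties) {
    // 16x8 works; 16x9 is one of the shapes that hangs.
    meshGridProperties.set_threadgroups_per_grid(uint3(16, 9, 1));
}

// Mesh stage: emit one small quad per threadgroup.
[[mesh]]
void meshMain(QuadMesh output,
              uint2 tgid [[threadgroup_position_in_grid]]) {
    output.set_primitive_count(2);
    float2 origin = float2(tgid) * 0.05 - 0.5;
    for (uint i = 0; i < 4; ++i) {
        VertexOut v;
        v.position = float4(origin + float2(i & 1, i >> 1) * 0.04, 0, 1);
        output.set_vertex(i, v);
    }
    // Two triangles: 0-1-2 and 2-1-3.
    const ushort idx[6] = {0, 1, 2, 2, 1, 3};
    for (uint i = 0; i < 6; ++i)
        output.set_index(i, idx[i]);
}
```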
I also have a more complex application where mesh shaders are used to generate sphere geometry: each mesh shader threadgroup generates a single slice of the sphere. Here the problem is similar: once there are more than X slices to render, some of the dispatched mesh threadgroups don't seem to do anything (see screenshot below). But the funny thing is that the geometry is produced, as it occasionally flickers in and out of existence, and if I manually block some threadgroups from running (e.g. with something like if(threadgroup_index > 90) return; in the mesh shader), the "hidden" geometry appears! It almost looks like different mesh shader threadgroups reuse the same memory allocation for storing the output mesh data, and the output of some threadgroups gets overwritten. I have not submitted this as a bug, since the code is more complex and messy, but I can do so if someone from the Apple team wants to have a look.
Apple's documentation states the following about the Tier 2 argument buffer hardware capability:
The maximum per-app resources available at any given time are:
500,000 buffers or textures
What does this mean exactly? Does the number refer to the maximum count of attachment points (e.g. unique indices) across all bound argument buffers, the maximum count of only the bound resources across the argument buffers (e.g. when using dynamic indexing and sparsely binding resources), or the number of resource objects that the application can create and manage at a given time?
Prompted by some discussions in the community, I decided to run some tests and was surprised to discover that I could bind many millions of buffer attachments to a single argument buffer in a Metal shader on my M1 Max laptop, way in excess of the quoted 500,000 limit. Is that just undefined behaviour that one should not rely on, or does "500,000" refer to something other than the number of attachment points?
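The test was along these lines (a sketch, not the exact code; the struct layout and array size are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

// A Tier 2 argument buffer is just device memory holding resource
// handles; an array like this can be dynamically indexed. The array
// length here is far above the documented 500,000 limit.
struct Arguments {
    device const float *buffers[1024 * 1024];
};

kernel void readFirstElements(device const Arguments &args [[buffer(0)]],
                              device float *result [[buffer(1)]],
                              uint tid [[thread_position_in_grid]]) {
    // Sparse/dynamic indexing: only the slots actually dereferenced
    // need to point at a resident resource (made resident on the CPU
    // side via useResource:/useHeap:).
    result[tid] = args.buffers[tid][0];
}
```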
I hope that someone from the Apple GPU team can clarify this. If this is not the correct venue for this question, please tell me where I can send my inquiry.
Looking at the new Metal 3 API diffs, I noticed that objects now expose a new gpuHandle/gpuResourceID property, that MTLArgumentEncoder is marked as deprecated, and that there is a family of new MTLBinding APIs that looks like a replacement for it. Does this mean that we are getting a new resource binding model? I was not able to find any details in the documentation, and Tuesday's Metal session did not mention these API changes at all. And the APIs themselves seem to be in flux, as gpuHandle is already marked as deprecated even though it is still beta :)
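If it is indeed a new binding model, I would expect argument buffers to be writable directly from the CPU, without an MTLArgumentEncoder. A sketch of what that might look like (the layout is hypothetical and must match the shader-side struct; MTLResourceID is nominally opaque, so peeking at _impl is purely illustrative):

```swift
import Metal

// Metal 3 sketch: write resource handles/addresses into an argument
// buffer by hand instead of going through MTLArgumentEncoder.
func encodeArguments(device: MTLDevice,
                     textures: [MTLTexture],
                     vertexBuffer: MTLBuffer) -> MTLBuffer? {
    let stride = MemoryLayout<UInt64>.stride
    let length = stride * (textures.count + 1)
    guard let argBuffer = device.makeBuffer(length: length,
                                            options: .storageModeShared)
    else { return nil }

    let ptr = argBuffer.contents().bindMemory(to: UInt64.self,
                                              capacity: textures.count + 1)
    for (i, texture) in textures.enumerated() {
        ptr[i] = texture.gpuResourceID._impl      // texture handle
    }
    ptr[textures.count] = vertexBuffer.gpuAddress // raw buffer address
    return argBuffer
}
```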
Will there be a WWDC session about these APIs or could you share some details here?
There is currently an ongoing discussion about the validity of GPU compute performance estimates like those offered by popular benchmarking tools such as Geekbench 5. It has been observed that Apple GPUs have a relatively slow frequency ramp-up and do not reach their peak performance if the submitted kernels have a runtime under a few seconds. I understand that these GPUs are designed for throughput rather than latency, but sometimes one does work with "small" work packages (such as processing a single image). Is there an official way to tell the system that it should use peak performance for such work? E.g. some sort of hint along the lines of "I will now submit some GPU work and I want you to power up all the relevant subsystems", instead of relying on the OS to lazily adjust the performance profile?
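Absent such a hint, the only workaround I'm aware of is to submit a short throwaway workload before the real one so that the OS ramps the GPU up first. A sketch (the warm-up shape and iteration count are arbitrary, not a documented recipe):

```swift
import Metal

// Run a few cheap dispatches and wait for them, so the GPU is already
// clocked up by the time the workload we actually care about is timed.
func warmUp(queue: MTLCommandQueue, pipeline: MTLComputePipelineState) {
    for _ in 0..<3 {
        guard let cb = queue.makeCommandBuffer(),
              let enc = cb.makeComputeCommandEncoder() else { return }
        enc.setComputePipelineState(pipeline)
        enc.dispatchThreadgroups(
            MTLSize(width: 256, height: 1, depth: 1),
            threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
        enc.endEncoding()
        cb.commit()
        cb.waitUntilCompleted()
    }
}
```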
In the talk "Create image processing apps powered by Apple Silicon", Harsh Patil mentioned that one should use threadgroup memory to load a chunk of the image containing all the pixels required to run a convolution kernel. Unfortunately there was no code example, and I have difficulty figuring out how something like that would be set up. I can imagine using imageblocks, but how would one load/store them in the shader? Could anyone offer some guidance (ideally with a code snippet)?
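For reference, here is my current understanding of the plain threadgroup-memory variant (without imageblocks), as a sketch: a 16x16 threadgroup cooperatively loads a tile padded by the kernel radius, synchronises, then convolves from the tile only. It assumes the grid is dispatched in exact 16x16 threadgroups; the box filter stands in for a real convolution.

```metal
#include <metal_stdlib>
using namespace metal;

#define RADIUS 2                    // 5x5 kernel
#define TG     16                   // 16x16 threadgroup
#define TILE   (TG + 2 * RADIUS)    // padded tile edge

kernel void boxBlur5x5(texture2d<float, access::read>  src [[texture(0)]],
                       texture2d<float, access::write> dst [[texture(1)]],
                       uint2 gid  [[thread_position_in_grid]],
                       uint2 lid  [[thread_position_in_threadgroup]],
                       uint2 tgid [[threadgroup_position_in_grid]]) {
    // Tile in threadgroup memory, padded by the kernel radius on each side.
    threadgroup float4 tile[TILE][TILE];

    // Cooperatively load the padded tile; each thread loads a few texels,
    // clamping at the image border.
    int2 tileOrigin = int2(tgid) * TG - RADIUS;
    for (int y = int(lid.y); y < TILE; y += TG)
        for (int x = int(lid.x); x < TILE; x += TG) {
            int2 p = clamp(tileOrigin + int2(x, y), int2(0),
                           int2(src.get_width() - 1, src.get_height() - 1));
            tile[y][x] = src.read(uint2(p));
        }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Convolve from the tile only; no further device-memory reads.
    float4 sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[int(lid.y) + RADIUS + dy][int(lid.x) + RADIUS + dx];
    dst.write(sum / float((2 * RADIUS + 1) * (2 * RADIUS + 1)), gid);
}
```

What I still can't tell is whether the imageblock path is supposed to replace this manual tiling, or only applies to tile shading in render passes.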
The Metal Shading Language specification states in section 5.10:
"If a vertex function writes to one or more buffers or textures, its return type must be void"
However, writing to buffers from vertex functions works correctly on Intel, AMD and A13 GPUs. Has this restriction been removed on later hardware? Can one rely on this behavior going forward? Or is it just a fluke?
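To be concrete, this is the kind of function I mean (an illustrative sketch, formally out of spec because it both writes to a buffer and returns a non-void value):

```metal
#include <metal_stdlib>
using namespace metal;

struct VOut {
    float4 position [[position]];
};

// Writes a side-effect into `captured` AND returns a value; per section
// 5.10 this should require a void return type, yet it happens to work
// on the hardware mentioned above.
vertex VOut passthroughAndRecord(const device float3 *positions [[buffer(0)]],
                                 device float4 *captured [[buffer(1)]],
                                 uint vid [[vertex_id]]) {
    VOut out;
    out.position = float4(positions[vid], 1.0);
    captured[vid] = out.position;   // buffer write from a vertex stage
    return out;
}
```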
I found this new API on MTLDevice, "supportsPullModelInterpolation", but there was no additional info. Does anyone know what this is about?
Apple has so far been very enigmatic about the capabilities of the A14. The updated Metal feature tables suggest that the A14 gains some features up to now reserved for desktop GPUs (e.g. barycentrics support). Is there anything else? Will there be an updated tech note?
With the release of 10.13.4, Apple mentions "Pro applications and 3D games that accelerate the built-in display of an iMac or MacBook Pro. (This capability must be enabled by the application's developer.)". I was not able to find any developer documentation about this. How exactly can this be "enabled"? Do they mean drawing on the external GPU and then "manually" copying the buffer to the iGPU?