DrawIndexedIndirectCount functionality under Metal

Hello everybody,

I have a situation here.

I cannot implement vkCmdDrawIndexedIndirectCount functionality by using argument buffers (actually, they are useless and buggy). I tried to reach developer support about these issues, but nobody has answered.

So maybe somebody has an idea of how to execute multiple indirect draw calls based on a GPU-generated count?

Moreover, it is impossible to use indirect command buffers for that: "Fragment shader cannot be used with indirect command buffers".

Current issues with indirect command buffers:
  1. Intel UHD Graphics 630 is not rendering all elements from the buffer.

  2. eGPU RX Vega 56 hangs the whole system for 5-6 seconds when command generation is performed by the vertex shader.

  3. "Compiler encountered an internal error" on Intel Iris Plus Graphics.

  4. Apple M1 renders a magenta screen when the generation is performed in a compute shader.

  5. Apple M1 renders a magenta screen, with only about a 20% chance of rendering correctly, when the generation is performed in a vertex shader.

Thank you!
Link a minimal project that demonstrates this - then I can review it and confirm or deny whether it is your error, or Apple's.
Reproduction applications are available here:

https://www.icloud.com/iclouddrive/0fIpVg83LFG-OACxsMtjVtZHw#apple1

The problem with partial rendering on Intel UHD 630 is fixed under Big Sur.
More information is inside the Readme.txt file.

Thank you!

Please send an actual xcode project for analysis.

What you uploaded has telling signs that some of this is Apple's fault, and some of it, like #3 should be an expected error.

But if you want someone outside or inside Apple to try to help, you'll need to spoon feed everyone a normal Xcode project. On a normal day, they are inadequate to address basic problems and do not test thoroughly, but they are even less inclined to help if you send it in this format.



We are not using Xcode for development. Moreover, it will take time to isolate the source code for issue replication. Usually, such binaries are more than enough for driver developers, because they can track all API calls internally and they have much better tools for that. An Xcode project will not help if nobody cares about software quality on the Apple side. I have been trying to find answers to these questions for the last few months. This forum is the last hope :) And it's not possible to talk about missing functionality in Metal because nobody is listening.
No, what you provided is not enough.

It is not merely tracking api calls. You should provide a project that compiles to allow them to use the full diagnostic tools that are unavailable with just the binary.

If you don't provide them this, then the other party has to write it themselves.

Often, this causes several tangent problems to occur during triage, that delays identifying and resolving the true problems. (As opposed to the misconstrued notions of what the problems are thought to be)

These things occur unnecessarily, and you can do something about that today.

You can go to File -> New Project in Xcode, and make a minimal project that replicates what you are seeing in your main project.

I am available to provide a second look on your work today to confirm without doubt the issues, but if you neglect to make the sample project and provide this, it will sit on the shelf further.

After you have sent this to me, and I have confirmed it in its entirety, we can both submit crystal clear reports, to make the complaint more effective.

(Also, in case it isn't obvious, the projects you submit should be in Objective C, not swift or C++, and they genuinely should be the minimum that depicts the bug without dependencies.)
Can you create a request with Feedback Assistant and post the FB number you get here? We can have someone look at fixing this and hopefully provide a workaround in the interim.
The Feedback Assistant numbers are:
  • FB8254449

  • FB8638856

Thank you!
I'm looking at the feedback reports.

It sounds like in FB8254449 you're requesting vkCmdDrawIndexedIndirectCount functionality in Metal, but Metal already has this here:

-[MTLRenderCommandEncoder drawIndexedPrimitives:indexType:indexBuffer:indexBufferOffset:indirectBuffer:indirectBufferOffset:].


With FB8638856, where you're linking to your project that's not rendering on an Intel UHD Graphics 630, it looks like someone on the Metal team tried to reproduce it, but could not. I don't know what version of the OS he tried though, so I'm following up with him. What version of macOS did you try this on? I'm wondering if this is a bug that has been fixed in a later OS build.


Hello,

Regarding FB8254449: yes, there is a function to draw indirect primitives, but that command executes only a single draw command. Vulkan and other APIs (OpenGL, D3D12, D3D11 (via extensions)) provide more advanced functions that draw multiple commands with a CPU- or GPU-generated count.

The command below renders multiple indirect commands, and the number of draw commands is stored in a GPU buffer. Unfortunately, there is no such functionality in the Metal API:

void vkCmdDrawIndexedIndirectCount(
    VkCommandBuffer commandBuffer,
    VkBuffer buffer,
    VkDeviceSize offset,
    VkBuffer countBuffer,
    VkDeviceSize countBufferOffset,
    uint32_t maxDrawCount,
    uint32_t stride);
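
As a rough illustration of what this command asks of the driver, here is a CPU-side reference of the count-driven loop in C. It is only a sketch: the names `DrawIndexedCommand`, `DrawFn`, `draw_indexed_indirect_count`, and `count_indices` are made up for this example, plain byte arrays stand in for GPU buffers, and the callback stands in for one single indirect draw. The field order follows VkDrawIndexedIndirectCommand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order as VkDrawIndexedIndirectCommand. */
typedef struct {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
} DrawIndexedCommand;

/* Hypothetical callback standing in for one single indirect draw. */
typedef void (*DrawFn)(const DrawIndexedCommand *cmd, void *user);

/* Reads the draw count from count_buffer, clamps it to max_draw_count,
   and issues one draw per command, stepping by stride bytes.
   Returns the number of draws issued. */
static uint32_t draw_indexed_indirect_count(const uint8_t *buffer, size_t offset,
                                            const uint8_t *count_buffer, size_t count_offset,
                                            uint32_t max_draw_count, uint32_t stride,
                                            DrawFn draw, void *user)
{
    uint32_t count;
    memcpy(&count, count_buffer + count_offset, sizeof count);
    if (count > max_draw_count)
        count = max_draw_count;
    for (uint32_t i = 0; i < count; ++i) {
        DrawIndexedCommand cmd;
        memcpy(&cmd, buffer + offset + (size_t)i * stride, sizeof cmd);
        draw(&cmd, user);
    }
    return count;
}

/* Example callback: accumulates the index counts of all issued draws. */
static void count_indices(const DrawIndexedCommand *cmd, void *user)
{
    *(uint32_t *)user += cmd->indexCount;
}
```

The important part is that both the count and the commands live in GPU memory, so on a real GPU the loop bound is not known to the CPU when the command is recorded.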

FB8254449 also contains another request: to add an indirect buffer offset that is read from a GPU buffer:

void vkCmdDrawIndexedIndirectCountOffset(
    VkCommandBuffer commandBuffer,
    VkBuffer buffer,
    VkDeviceSize offset,
    VkBuffer offsetBuffer,
    VkDeviceSize offsetBufferOffset,
    VkBuffer countBuffer,
    VkDeviceSize countBufferOffset,
    uint32_t maxDrawCount,
    uint32_t stride);

FB8638856: Everything is fine with UHD 630 under Big Sur. The remaining problems are the 20-second start time with the eGPU and the inability to use textures with indirect command buffers (except on M1).

Thank you!
FB8254449: Metal's solution for multi-draw commands is for a kernel to create an Indirect Command Buffer with multiple draw commands. This is essentially what the driver does for you anyways for multi-draw commands in other APIs. It sounds like you've used ICBs. Why does this not work for you?

FB8638856: Okay so item 1 is no longer a problem. But each of the other 4 issues still occur? (FYI, usually better to create separate feedback requests for separate issues. The guy trying to repro it, probably just tried the first one).
FB8254449:

Yes, that is what I tried to achieve with ICB. But the following issues make it impossible at the moment:
  1. A rendering pipeline for ICB cannot use textures, except on the M1 GPU.

  2. 20-second start time with an AMD eGPU, with the whole system frozen during this time.

  3. High chance of a magenta screen instead of normal rendering on M1 while using ICB.

  4. Argument buffer tier 2 is not available on iPhone/iPad/DTK.

The ICB and argument buffer specifications are very flexible, which makes it impossible to implement them on all HW.
So maybe a single-function solution, with an internal driver implementation for each kind of HW, would be more flexible as a result?

FB8638856:

Reproduction applications for 2 and 3 are available in the single archive with all descriptions.
https://www.icloud.com/iclouddrive/0fIpVg83LFG-OACxsMtjVtZHw#apple1/
Both of them are related to ICB creation/execution.

Thank you!
Regarding FB8254449:

A render pipeline using ICBs can definitely use textures. The texture references just need to be in an argument buffer set for the ICB render command.

iPhone 11 and 12 support tier 2 argument buffers. iPhone 10 and 10S can only access 96 textures and 96 buffers for an executeIndirect command on the CPU encoder, but, unlike earlier devices, you can write to the argument buffers in a shader or kernel. In other words, iPhone 10 and 10S support all the tier 2 features, but cannot access thousands of buffers and textures per executeIndirect command as iPhone 11 and iPhone 12 devices can. iPads of a similar generation have the same features and limitations. Although the DTK may not support tier 2, the M1 in retail products does.

Regarding FB8638856:

The magenta screen issue was not mentioned in the feedback report. Does this happen on any particular device? I can add a note, but I think it would be clearer to the Metal team if you created a separate report with Feedback Assistant and post the number here.
But what if somebody doesn't need thousands of textures and buffers? We need 12 textures and 4 buffers for the whole scene rendering. Accessing textures through an argument buffer is an additional indirection during shader execution.

What we need to execute is just a simple loop:
for(pipeline in pipelines) {
    bind pipeline
    bind 12 textures
    bind 4 buffers
    drawIndexedIndirectCount(indirect buffer, count buffer)
}

So my idea with ICB was to implement code like this:
for(pipeline in pipelines) {

    bind ICB generation rendering pipeline with rasterizer discard
    bind indirect buffer
    bind ICB
    drawPointsIndirect(count buffer)

    bind pipeline
    bind 12 textures
    bind 4 buffers
    executeIndirect(ICB)
}

But it looks like I additionally have to patch the pipeline shaders for ICB.

I will submit the magenta screen issue on M1 and the 20-second startup time with the eGPU as separate FBs.

Thank you!
The separate error for the magenta screen is FB8928674

The report for 20 seconds startup time with eGPU is FB8928678

Thank you!
Thanks for the Feedback requests. I've assigned them to the teams that can help here.

As far as indirections go: argument buffers are designed to minimize this compared to other APIs. The object metadata itself is stored in the buffer rather than having a table with the data from which you need to index. (You can, of course, create your own table with another argument buffer so long as your indexing code properly takes into account the object size).
Hello,

The iPhone 11 Pro Max (A13) (13.5.1 and 14.2) reports that Tier 2 is supported.

ICB generation in a compute shader works, but ~5% of objects are rendered only partially (or with corruption).

ICB generation in a vertex shader produces a black screen with this console error:
"Execution of the command buffer was aborted due to an error during execution. Ignored (for causing prior/excessive GPU errors) (IOAF code 4)"

The iPhone XR (A12, which is newer than the iPhone 10) (13.5.1 and 14.2) reports that Tier 2 is not supported.

iPad Pro (12.9-inch) (4th generation with LiDAR A12) (13.5.1 and 14.2) reports that argument buffer Tier 2 is not supported (same as DTK).

What am I doing wrong, guys? Ignoring the Tier 2 test makes a random magenta pattern over the screen. Does that mean that all currently available iPad Pro models are not compatible with ICB? So it's just technically impossible to create vkCmdDrawIndexedIndirectCount() functionality.

Thank you!

Hello,

I have compared the ICB performance of serial drawIndexedPrimitives commands with the indirect drawPrimitives method.
The test scene is 16K DIPs, each a quad of 2 triangles. The static ICB is created on the CPU.

Vega 56:
Combined geometry (single DIP): 200M tri/sec
Serial drawPrimitivesIndirect: 12M tri/sec
Single executeCommandsInBuffer: 7M tri/sec
CPU and GPU ICB are working without any issues. GPU ICB is 4-5 times faster than the CPU ICB. The funny thing is that the AMD GPU has a native multiDrawIndirectCount command, which works much faster...

Apple M1 (MacBook Air):
Combined geometry (single DIP): 50M tri/sec
Serial drawPrimitivesIndirect: 8M tri/sec
Single executeCommandsInBuffer: hangs after 1 second of execution with random magenta noise. The debug runtime reports nothing.

Apple A12 (iPhone XR):
Combined geometry (single DIP): 27M tri/sec
Serial drawPrimitivesIndirect: 13M tri/sec
Single executeCommandsInBuffer: hangs after 1 second of execution (with CPU ICB).
Copying from CPU ICB to Private ICB causes app crash.

Intel Iris Plus (MacBook Air 2020):
Combined geometry (single DIP): 4.3M tri/sec
Serial drawPrimitivesIndirect: 1.46M tri/sec
Single executeCommandsInBuffer: draws nothing, and the debug runtime crashes with the message that the ICB is empty. executeCommandsInBuffer reports that the source CPU ICB is not an ICB.

Thank you!
The IOAF error you're seeing and the hangs you're experiencing could be a driver bug, but they're also typical symptoms of accessing memory out-of-bounds in a shader or kernel. The fact that it works on an AMD GPU could just mean that AMD happens to handle that particular out-of-bounds condition in a favorable manner.

Have you tried running your app with Xcode shader validation? (Go to the Scheme, select the Diagnostics tab, and check Shader Validation) This will perform bounds checking and also check for use of many undefined behaviors.
Hello,

Can you please advise me how to run an existing .ipa file with Xcode shader validation/debug?
We are not using Xcode project files. We have a couple of bash scripts and Makefiles, which do all the jobs well and fast for all platforms. On macOS it's possible to set the METAL_DEVICE_WRAPPER_TYPE=1 variable to run the Metal debug layer, but unfortunately, we cannot do the same on iOS.

The Xcode feature to run an already installed app on the device would be awesome.

I can provide you with reproduction samples if you need them.

Thank you!
To get shader validation via an env var you would set METAL_DEVICE_WRAPPER_TYPE=4. If you can rebuild the source, you can use setenv to set this before you create the Metal Device. (Still trying to find out if there is a better way to do this and where the output goes when you don't use Xcode).

Just curious, has shader validation on M1 or another Mac shown you anything?
Thank you for the new value for the device wrapper type. I will retest everything. A validation message on M1 says that ICB is not yet supported :)

-[MTLGPUDebugDevice newIndirectCommandBufferWithDescriptor:maxCommandCount:options:]:1035: failed assertion `Indirect Command Buffers are not currently supported with Shader Validation'

I will check it on other devices a bit later.
Checking with some engineers on the Metal frameworks team; ICB support for shader validation is limited on Big Sur and not yet supported on iOS. Testing of this support has not been fully validated, so it must be explicitly enabled on Big Sur by setting another env var:

MTL_SHADER_VALIDATION_GPUOPT_ENABLE_INDIRECT_COMMAND_BUFFERS=1

The driver team will look at the feedback requests you submitted and hopefully will have further explanation for the iOS failures.
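
For builds that do not go through Xcode, the two variables named in this thread could be set from code along these lines. This is a minimal sketch: `enable_metal_shader_validation` is a hypothetical helper name, and it assumes the call happens before the Metal device is created, as described above. Per the note above, the ICB variable applies to Big Sur only.

```c
/* Ask for POSIX declarations (setenv) when compiling in strict ISO C mode. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Sets the validation variables named in this thread.
   Call before the Metal device is created.
   Returns 0 on success, -1 if either setenv call fails. */
static int enable_metal_shader_validation(void)
{
    /* 4 = shader validation, per the thread. */
    if (setenv("METAL_DEVICE_WRAPPER_TYPE", "4", 1) != 0)
        return -1;
    /* Opt in to ICB support in shader validation (Big Sur, per the thread). */
    if (setenv("MTL_SHADER_VALIDATION_GPUOPT_ENABLE_INDIRECT_COMMAND_BUFFERS", "1", 1) != 0)
        return -1;
    return 0;
}
```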
Thanks for the new variable. There are no errors from the debug/GPU validation layers during execution, except that nothing is rendered during GPU ICB generation. I will wait for the answers.

PS: iOS debug layers are working great with setenv(). Thank you for that!
Is there any update about that?
Thank you
Hi,

A12 devices are not able to draw more than 512 drawindirect commands (CPU unroll for multidrawindirectcount).
Beyond that, the rendered objects start flickering. A13 and M1 devices work fine even with 50K draw calls.
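
One possible reading of the "CPU unroll" above is that the CPU always encodes a fixed maximum number of single indirect draws, while the entries beyond the GPU-generated count are turned into empty draws. The C sketch below is a CPU reference of that padding step only. It rests on assumptions not stated in the thread: that the padding would really run in a compute kernel, and that a draw with zero index and instance counts is skipped by the GPU. `IndexedDrawArgs` and `neutralize_unused_draws` are made-up names; the field order mirrors MTLDrawIndexedPrimitivesIndirectArguments.

```c
#include <stdint.h>

/* Field order mirrors MTLDrawIndexedPrimitivesIndirectArguments. */
typedef struct {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t indexStart;
    int32_t  baseVertex;
    uint32_t baseInstance;
} IndexedDrawArgs;

/* Zeroes the counts of every entry from gpu_count up to max_draw_count,
   so that a fixed number of encoded indirect draws only renders the live ones.
   Returns the number of live draws (gpu_count clamped to max_draw_count). */
static uint32_t neutralize_unused_draws(IndexedDrawArgs *args,
                                        uint32_t gpu_count,
                                        uint32_t max_draw_count)
{
    uint32_t live = gpu_count < max_draw_count ? gpu_count : max_draw_count;
    for (uint32_t i = live; i < max_draw_count; ++i) {
        args[i].indexCount = 0;
        args[i].instanceCount = 0;
    }
    return live;
}
```

With this scheme the CPU cost grows with the maximum count rather than the actual count, so the maximum should be kept as small as the scene allows.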

Thank you