Metal ICB: set_compute_pipeline not working?

Hi,

I'm currently working on a GPU driven pipeline and am in need of setting compute pipeline state from within the compute shader which fills in the ICB. (I've narrowed down the problem to how the commands get encoded in the ICB - the pipelines which I'm encoding into the ICB all work individually).

For some reason, if I encode more than one compute_command into my ICB, the Metal GPU Replayer acts up (the presentable is all purple and none of the resources are inspectable) which makes me feel like I'm doing something very illegal in this compute shader which fills the ICB but am completely oblivious as to what it could be.

This is the compute shader which fills the ICB:

struct ExclusivePrefixSumICB
{
    command_buffer exclusivePrefixSumCommands    [[id(0)]];
    command_buffer exclusivePrefixSumAccCommands [[id(1)]];

    compute_pipeline_state exclusivePrefixSumPipeline[[id(2)]];
    compute_pipeline_state exclusivePrefixSumAccPipeline[[id(3)]];

    compute_pipeline_state increaseCommandNumber[[id(4)]];
};

template <typename T, uint kBlockSize, uint kGrainSize>
kernel void kernelExclusivePrefixSumDispatch(device T *inputData                      [[buffer(0)]],
                                             device T *outputData                     [[buffer(1)]],
                                             device ExclusivePrefixSumInfo *info      [[buffer(2)]],
                                             device ExclusivePrefixSumICB  &icb       [[buffer(3)]],
                                             device PrefixSumNBuffer *nBuffer         [[buffer(4)]],
                                             uint lid [[thread_position_in_threadgroup]])
{
    uint numValuesInInput = info->numElements;
    uint offset = 0;

    uint offsets[MAX_COMMANDS_IN_PREFIX_SUM][2] = {};
    uint offsetsCount = 0;

    uint iterCount = ceil(log2((float)numValuesInInput) / log2((float)PREFIX_SUM_NUM_VALUES_PER_THREADGROUP));
    iterCount = 1;

    uint commandCount = 0;

    for (uint i = 0; i < iterCount; ++i)
    {
        { /* Prefix scan pipeline. */
            compute_command cmd(icb.exclusivePrefixSumCommands, i*2);

            offsets[offsetsCount][0] = offset;
            offsets[offsetsCount][1] = offset+numValuesInInput;

            nBuffer[commandCount].commandNumber = 0;
            nBuffer[commandCount].n = numValuesInInput;
            nBuffer[commandCount].inputDataOffset = offset;
            nBuffer[commandCount].outputDataOffset = offset;
            nBuffer[commandCount].partialSumOffset = offset + numValuesInInput;

            cmd.set_compute_pipeline_state(icb.exclusivePrefixSumPipeline);
            cmd.set_kernel_buffer(outputData, 0);
            cmd.set_kernel_buffer(inputData, 1);
            cmd.set_kernel_buffer(nBuffer, 2);

            uint3 threadgroupGridSize = uint3(divideRoundUp((uint)numValuesInInput, (uint)PREFIX_SUM_NUM_VALUES_PER_THREADGROUP), 1, 1);
            uint3 threadgroupSize = uint3(PREFIX_SUM_NUM_THREADS_PER_THREADGROUP, 1 ,1);

            cmd.concurrent_dispatch_threadgroups(threadgroupGridSize, threadgroupSize);

            cmd.set_barrier();

            offset += numValuesInInput;
            numValuesInInput = divideRoundUp((uint)numValuesInInput, (uint)PREFIX_SUM_NUM_VALUES_PER_THREADGROUP);
        }


        
        { /* Increase command number: */
            compute_command increaseCmd(icb.exclusivePrefixSumCommands, i*2+1);
            increaseCmd.set_compute_pipeline_state(icb.increaseCommandNumber);
            increaseCmd.set_kernel_buffer(nBuffer, 0);
            increaseCmd.concurrent_dispatch_threadgroups(uint3(1, 1, 1), uint3(1, 1, 1));
            increaseCmd.set_barrier();
        }


        ++offsetsCount;
        ++commandCount;
    }
}

Is there anything obviously wrong with the way I am encoding the commands into the ICB?

Furthermore, the GPU replayed complains that the pipeline exclusivePrefixSumPipeline is missing all its buffer bindings even though I'm setting them right there in the shader. Furthermore, the replayer warns that "The ICB doesn’t inherit the parent encoder’s buffer arguments; specify the command’s buffer parameters when you encode commands into the ICB." which I'm guessing it's telling me because it thinks that this means that the pipelines are missing their buffer bindings (I intentionally made the ICB not inherit pipelines/buffers so I can set them in the shader).

Sorry for the blurt of stuff, but I am at a complete loss of how to do this and there are no examples of compute commands being encoded into ICBs on the GPU.

I would really appreciate any support / tips to figure out what is wrong.

(P.S. all these issues would go away if there is a way to get the command index (passed into the constructor of compute_command) from the shader which that command invokes so if there is a way to do that, please let me know. All these issues stem from the fact that I need each invocation of the exclusivePrefixSumPipeline to be able to access data that is individual to each invocation. Equivalently if there is an equivalent of setBytes or something that would be exactly what I need).

Also, in the GPU replayer all commands executed in the ICB seem to have the last pipeline that was bound. In the code I posted, that would be icb.increaseCommandNumber. If instead, I created the command which invokes increaseCommandNumber before the one which invokes icb.exclusivePrefixSumPipeline, all the compute commands in the replayer will have the icb.exclusivePrefixSumPipeline bound because it was the last one to be bound. Is this normal?

Could you please create a Feedback Assistant and post it here in the comments? And also, could you show the code you use to create the indirect command buffers and the pipelines?

Hi,

Thank you so much for the swift response!

I am new to the Apple Forums - how do I create a Feedback Assistant?

Also, here is the code for creating the pipelines and indirect command buffers:

    NSError *error;

    auto prefixSumLib = File::projectPath("build-shaders/parallel-primitives.metallib");
    NSURL *pathURL = [NSURL fileURLWithPath:[NSString stringWithUTF8String:prefixSumLib.c_str()] isDirectory:false];
    id<MTLLibrary> library = [ctx.gpuDevice newLibraryWithURL:pathURL error:&error];

    if (library == nil)
        NSLog(@"%@", error);

    { /* Prefix sum function. */
        id<MTLFunction> prefixSumFunction = [library newFunctionWithName:@"kernelExclusivePrefixSumUint32"];
        MTLComputePipelineDescriptor *descriptor = [MTLComputePipelineDescriptor new];
        descriptor.supportIndirectCommandBuffers = YES;
        descriptor.computeFunction = prefixSumFunction;
        mExclusivePrefixScan = [ctx.gpuDevice newComputePipelineStateWithDescriptor:descriptor options:MTLPipelineOptionNone reflection:nil error:&error];
    }

    { /* Prefix sum accumulation function. */
        id<MTLFunction> prefixSumAccFunction = [library newFunctionWithName:@"kernelExclusivePrefixSumAccumulateUint32"];
        MTLComputePipelineDescriptor *descriptor = [MTLComputePipelineDescriptor new];
        descriptor.supportIndirectCommandBuffers = YES;
        descriptor.computeFunction = prefixSumAccFunction;
        mExclusivePrefixScanAcc = [ctx.gpuDevice newComputePipelineStateWithDescriptor:descriptor options:MTLPipelineOptionNone reflection:nil error:&error];
    }

    { /* Increase Command Number kernel. */
        id<MTLFunction> increaseCommandN = [library newFunctionWithName:@"kernelIncreaseCommandNumber"];
        MTLComputePipelineDescriptor *descriptor = [MTLComputePipelineDescriptor new];
        descriptor.supportIndirectCommandBuffers = YES;
        descriptor.computeFunction = increaseCommandN;
        mIncreaseCommandNumber = [ctx.gpuDevice newComputePipelineStateWithDescriptor:descriptor options:MTLPipelineOptionNone reflection:nil error:&error];
    }

    { /* Make prefix sum indirect command buffer argument. */
        id<MTLFunction> prefixSumDispatch = [library newFunctionWithName:@"kernelExclusivePrefixSumDispatchUint32"];
        mExclusivePrefixScanDispatch = [ctx.gpuDevice newComputePipelineStateWithFunction:prefixSumDispatch error:&error];

        /* Make the indirect command buffer */
        MTLIndirectCommandBufferDescriptor *icbDescriptor = [MTLIndirectCommandBufferDescriptor new];
        icbDescriptor.commandTypes = MTLIndirectCommandTypeConcurrentDispatch;
        icbDescriptor.inheritBuffers = NO;
#if defined TARGET_MACOS || defined(__IPHONE_13_0)
        if (@available(iOS 13.0, *))
            icbDescriptor.inheritPipelineState = NO;
#endif

        mExclusivePrefixScanCommands = [ctx.gpuDevice newIndirectCommandBufferWithDescriptor:icbDescriptor 
                                        maxCommandCount:MAX_COMMANDS_IN_PREFIX_SUM options:MTLResourceStorageModePrivate];
        mExclusivePrefixScanAccCommands = [ctx.gpuDevice newIndirectCommandBufferWithDescriptor:icbDescriptor 
                                          maxCommandCount:MAX_COMMANDS_IN_PREFIX_SUM options:MTLResourceStorageModePrivate];
        
        mExclusivePrefixScanCommands.label = @"Prefix Sum Commands";
        mExclusivePrefixScanAccCommands.label = @"Prefix Sum Accumulate Commands";

        /* Make argument buffer for the ICBs. */
        id<MTLArgumentEncoder> argumentEncoder = [prefixSumDispatch newArgumentEncoderWithBufferIndex:3];
        mExclusivePrefixArgumentBuffer = [ctx.gpuDevice newBufferWithLength:argumentEncoder.encodedLength options:MTLResourceStorageModeShared];
        mExclusivePrefixArgumentBuffer.label = @"Prefix Sum Argument Encoder";

        [argumentEncoder setArgumentBuffer:mExclusivePrefixArgumentBuffer offset:0];
        [argumentEncoder setIndirectCommandBuffer:mExclusivePrefixScanCommands atIndex:0];
        [argumentEncoder setIndirectCommandBuffer:mExclusivePrefixScanAccCommands atIndex:1];

        [argumentEncoder setComputePipelineState:mExclusivePrefixScan atIndex:2];
        [argumentEncoder setComputePipelineState:mExclusivePrefixScanAcc atIndex:3];
        [argumentEncoder setComputePipelineState:mIncreaseCommandNumber atIndex:4];

        mExclusivePrefixScanScratch = [ctx.gpuDevice newBufferWithLength: sizeof(PrefixSumNBuffer)*2*MAX_COMMANDS_IN_PREFIX_SUM options:MTLResourceStorageModePrivate];
    }

Which category should I choose for the feedback assistant? (I see a list of items under profiles and logs)

Metal ICB: set_compute_pipeline not working?
 
 
Q