How to dispatch my `MLCustomLayer` to the GPU instead of the CPU

My MLCustomLayer implementation always dispatches to the CPU instead of the GPU

Background:

I am trying to run my CoreML model with a custom layer on an iPhone 13 Pro. My custom layer runs successfully on the CPU, but it always dispatches to the CPU instead of the phone's GPU, even though the encodeToCommandBuffer member function is defined in the application's binding class for the custom layer.

I have been following the Swift example suggested in the CoreMLTools documentation to get this working, but note that my implementation is purely in Objective-C++.

Despite reading the documentation in depth, I have not yet come across a resolution to this problem. Any help looking into the issue (or perhaps even a bug in CoreML) would be much appreciated!

Below, I provide a minimal example based on the Swift example mentioned above.

Implementation

My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and the GPU.

PyTorch model to CoreML MLModel transformation

For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML MLModel, but I can provide these if necessary. Just note that the Python layer's name and bindings must match the name of the class defined below, i.e. ToySwish.

To convert the scripted/traced PyTorch model (called torchscript_model in the listing below) to a CoreML MLModel, I use CoreMLTools (from Python) and then save the model as follows:

import coremltools

input_shapes = [[1, 64, 256, 256]]

mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
            for i, input_shape in enumerate(input_shapes)],
    add_custom_layers=True,
    minimum_deployment_target=coremltools.target.iOS14,
    compute_units=coremltools.ComputeUnit.CPU_AND_GPU,
)

mlmodel.save('toy_swish_model.mlmodel')

Metal shader

I use the same Metal shader function swish from Swish.metal here.

MLCustomLayer binding class for Swish MLModel layer

I define an Objective-C++ class analogous to the Swift example. The class inherits from NSObject and conforms to the MLCustomLayer protocol, following the guidelines in the Apple documentation for integrating a custom layer into a CoreML MLModel. It is defined as follows:

  • Class definition and resource setup:
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject<MLCustomLayer>{}
@end

@implementation ToySwish{
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype) initWithParameterDictionary:(NSDictionary<NSString *,id> *)parameters error:(NSError *__autoreleasing _Nullable *)error{

    self = [super init];
    if(self == nil){
        return nil;
    }

    // Build the compute pipeline state for the "swish" kernel from the app's default Metal library.
    NSError* errorPSO = nil;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
    id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
    swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
    assert(errorPSO == nil);

    return self;
}

- (BOOL) setWeightData:(NSArray<NSData *> *)weights error:(NSError *__autoreleasing _Nullable *) error{
    return YES;
}

- (NSArray<NSArray<NSNumber *> * > *) outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes error:(NSError *__autoreleasing _Nullable *) error{
    return inputShapes;
}
  • CPU compute method (this is only shown for completeness):
- (BOOL) evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs outputs:(NSArray<MLMultiArray *> *)outputs error:(NSError *__autoreleasing _Nullable *)error{

    NSLog(@"Dispatching to CPU");

    for(NSInteger i = 0; i < inputs.count; i++){
        NSInteger num_elems = inputs[i].count;
        float* input_ptr = (float *) inputs[i].dataPointer;
        float* output_ptr = (float *) outputs[i].dataPointer;

        for(NSInteger j = 0; j < num_elems; j++){
            // Swish: x * sigmoid(x) = x / (1 + exp(-x))
            output_ptr[j] = input_ptr[j] / (1.0f + exp(-input_ptr[j]));
        }
    }

    return YES;
}
  • Encode GPU commands to the command buffer:
    • Note that, according to the documentation, this command buffer must not be committed here; CoreML commits and executes it after this method returns.
- (BOOL) encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error{

    NSLog(@"Dispatching to GPU");

    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer
        computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);

    [computeEncoder setComputePipelineState:swishPipeline];

    for(NSUInteger i = 0; i < inputs.count; i++){

        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];

        NSUInteger w = swishPipeline.threadExecutionWidth;
        NSUInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;

        MTLSize threadGroupSize = MTLSizeMake(w, h, 1);

        // Cover the current input texture (array length = number of texture slices).
        NSUInteger groupWidth  = (inputs[i].width       + threadGroupSize.width  - 1) / threadGroupSize.width;
        NSUInteger groupHeight = (inputs[i].height      + threadGroupSize.height - 1) / threadGroupSize.height;
        NSUInteger groupDepth  = (inputs[i].arrayLength + threadGroupSize.depth  - 1) / threadGroupSize.depth;

        MTLSize threadGroups = MTLSizeMake(groupWidth, groupHeight, groupDepth);

        // threadGroups holds the number of threadgroups, so dispatchThreadgroups: is the correct call
        // (dispatchThreads: expects the total grid size in threads).
        [computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupSize];
    }

    // End encoding once, after all inputs have been encoded; the buffer is committed by CoreML.
    [computeEncoder endEncoding];

    return YES;
}

Run inference for a given input

The MLModel is loaded and compiled in the application. I check that the model configuration's computeUnits property is set to MLComputeUnitsAll, which should allow the MLModel's layers to be dispatched to the CPU, GPU and Neural Engine (ANE).
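For reference, a minimal sketch of the loading path with the compute units set explicitly (the model URL and variable names below are illustrative, and MLComputeUnitsAll is also the default):

NSError* loadError = nil;
NSURL* modelURL = [NSURL fileURLWithPath:@"/path/to/toy_swish_model.mlmodel"]; // illustrative path
NSURL* compiledURL = [MLModel compileModelAtURL:modelURL error:&loadError];

MLModelConfiguration* config = [[MLModelConfiguration alloc] init];
config.computeUnits = MLComputeUnitsAll; // allow CPU, GPU and Neural Engine

MLModel* model = [MLModel modelWithContentsOfURL:compiledURL
                                   configuration:config
                                           error:&loadError];
assert(model != nil);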

I define an MLDictionaryFeatureProvider object called feature_provider from an NSMutableDictionary of input features (input tensors in this case), and then pass it to the predictionFromFeatures method of my loaded model, model, as follows:

@autoreleasepool {
    [model predictionFromFeatures:feature_provider error:error];
} 
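For completeness, feature_provider is built roughly as follows (a sketch: the feature name input_0 matches the name generated by the conversion script above, and the fill values are purely illustrative):

NSError* inputError = nil;
MLMultiArray* input = [[MLMultiArray alloc] initWithShape:@[@1, @64, @256, @256]
                                                 dataType:MLMultiArrayDataTypeFloat32
                                                    error:&inputError];

// Fill the input tensor with illustrative values.
float* input_ptr = (float *) input.dataPointer;
for(NSInteger j = 0; j < input.count; j++){
    input_ptr[j] = 1.0f;
}

NSMutableDictionary<NSString*, id>* features = [NSMutableDictionary dictionary];
features[@"input_0"] = [MLFeatureValue featureValueWithMultiArray:input];

MLDictionaryFeatureProvider* feature_provider =
    [[MLDictionaryFeatureProvider alloc] initWithDictionary:features error:&inputError];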

This computes a single forward pass of my model. When it executes, the 'Dispatching to CPU' string is printed instead of the 'Dispatching to GPU' string. This (along with the slow execution time) indicates that the Swish layer is being run via the evaluateOnCPUWithInputs method, i.e. on the CPU rather than on the GPU as expected.

I am quite new to developing for iOS and to Objective-C++, so I may well have missed something simple, but from reading the documentation and examples it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)

Environment

  • Xcode 13.1
  • iPhone 13
    • iOS 15.1.1
  • iOS deployment target 15.0

Accepted Reply

After a lot of experimentation and further debugging, I've found the issue: the input tensors to a custom layer must be at most rank 4. In my case, one of my inputs was rank 5, which seems to force the fallback evaluateOnCPUWithInputs method to be used.

I have not found this restriction mentioned anywhere in the documentation (although I might have missed it). It would be very helpful to others if Apple/CoreML documented it, or even provided an example detailing the restrictions on custom layer inputs.

Replies

Hi, can you please check whether switching to compiling the model on device (using this API) helps?

  • Hi, thank you for your response. Yes, my current implementation already compiles the model on the device as follows:

    NSString* modelPath = [NSString stringWithUTF8String:model_path.c_str()];
    NSURL* modelURL = [NSURL fileURLWithPath:modelPath];
    NSURL* compiledModel = [MLModel compileModelAtURL:modelURL error:error];
    MLModel* module = [MLModel modelWithContentsOfURL:compiledModel error:NULL];

    where model_path is a std::string set to the correct model path on the device (and checked to see that it exists).

Add a Comment


You run the custom model from an Objective-C++ implementation? I tried to do this but got some errors, as I described in https://developer.apple.com/forums/thread/698611. The MLCustomLayer protocol methods, including evaluateOnCPUWithInputs, initWithParameterDictionary, outputShapesForInputShapes and setWeightData, are all implemented, but it still doesn't work. Are there any special settings in your project? Any help with this would be greatly appreciated. Thank you!