MLCustomLayer implementation always dispatches to CPU instead of GPU
Background:
I am trying to run my CoreML model with a custom layer on an iPhone 13 Pro. My custom layer runs successfully on the CPU, but it always dispatches to the CPU instead of the phone's GPU, despite the `encodeToCommandBuffer:` method being defined in the application's binding class for the custom layer.
I have been following the Swift example suggested in the coremltools documentation to get this working, but note that my implementation is purely in Objective-C++.
Despite reading the documentation in depth, I have not come across a resolution to this problem. Any help looking into this issue (or perhaps even a bug in CoreML) would be much appreciated!
Below, I provide a minimal example based on the Swift example mentioned above.
Implementation
My toy Objective-C++ implementation is based on the Swift example here. It implements the Swish activation function for both the CPU and the GPU.
PyTorch model to CoreML MLModel transformation
For brevity, I will not define my toy PyTorch model, nor the Python bindings that allow the custom Swish layer to be scripted/traced and then converted to a CoreML `MLModel`, but I can provide these if necessary. Just note that the Python layer's name and bindings must match the name of the class defined below, i.e. `ToySwish`.
To convert the scripted/traced PyTorch model (called `torchscript_model` in the listing below) to a CoreML `MLModel`, I use coremltools (from Python) and then save the model as follows:
```python
import coremltools

input_shapes = [[1, 64, 256, 256]]
mlmodel = coremltools.converters.convert(
    torchscript_model,
    source='pytorch',
    inputs=[coremltools.TensorType(name=f'input_{i}', shape=input_shape)
            for i, input_shape in enumerate(input_shapes)],
    add_custom_layers=True,
    minimum_deployment_target=coremltools.target.iOS14,
    compute_units=coremltools.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save('toy_swish_model.mlmodel')
```
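As a sanity check (not part of the original pipeline), one can verify that the converter actually emitted a custom layer, rather than lowering Swish to built-in ops. A small helper, assuming the saved model is a plain neural network (not a pipeline); the model-loading lines are sketched in comments:

```python
def custom_layer_class_names(spec):
    """Return the className of every custom layer in a (non-pipeline) model spec."""
    return [
        layer.custom.className
        for layer in spec.neuralNetwork.layers
        if layer.WhichOneof('layer') == 'custom'
    ]

# Usage (assuming coremltools is installed and the model was saved as above):
# import coremltools
# spec = coremltools.utils.load_spec('toy_swish_model.mlmodel')
# print(custom_layer_class_names(spec))  # should include 'ToySwish'
```

If `'ToySwish'` does not appear in the result, the GPU path was never going to be exercised in the first place.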
Metal shader
I use the same Metal shader function `swish` from `Swish.metal` here.
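For reference, the function the shader is expected to compute is f(x) = x · sigmoid(x) = x / (1 + e^(-x)). A pure-Python version is handy for checking either the CPU or the GPU path's output against expected values:

```python
import math

def swish(x: float) -> float:
    """Swish activation: x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

print(swish(0.0))  # 0.0
print(swish(1.0))  # ~0.7311, i.e. 1 * sigmoid(1)
```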
MLCustomLayer binding class for Swish MLModel layer
I define an Objective-C++ class analogous to the Swift example. The class inherits from `NSObject` and conforms to the `MLCustomLayer` protocol, following the guidelines in the Apple documentation for integrating a CoreML `MLModel` with a custom layer. It is defined as follows:
- Class definition and resource setup:

```objc
#import <Foundation/Foundation.h>
#import <CoreML/CoreML.h>
#import <Metal/Metal.h>

@interface ToySwish : NSObject <MLCustomLayer>
@end

@implementation ToySwish {
    id<MTLComputePipelineState> swishPipeline;
}

- (instancetype)initWithParameterDictionary:(NSDictionary<NSString *, id> *)parameters error:(NSError *__autoreleasing _Nullable *)error {
    self = [super init];
    if (self) {
        NSError *errorPSO = nil;
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
        id<MTLFunction> swishFunction = [defaultLibrary newFunctionWithName:@"swish"];
        swishPipeline = [device newComputePipelineStateWithFunction:swishFunction error:&errorPSO];
        assert(errorPSO == nil);
    }
    return self;
}

- (BOOL)setWeightData:(NSArray<NSData *> *)weights error:(NSError *__autoreleasing _Nullable *)error {
    return YES;  // an activation layer has no weights
}

- (NSArray<NSArray<NSNumber *> *> *)outputShapesForInputShapes:(NSArray<NSArray<NSNumber *> *> *)inputShapes error:(NSError *__autoreleasing _Nullable *)error {
    return inputShapes;  // element-wise op: output shapes match input shapes
}
```
- CPU compute method (shown only for completeness):

```objc
- (BOOL)evaluateOnCPUWithInputs:(NSArray<MLMultiArray *> *)inputs outputs:(NSArray<MLMultiArray *> *)outputs error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to CPU");
    for (NSInteger i = 0; i < inputs.count; i++) {
        NSInteger num_elems = inputs[i].count;
        float *input_ptr = (float *)inputs[i].dataPointer;
        float *output_ptr = (float *)outputs[i].dataPointer;
        for (NSInteger j = 0; j < num_elems; j++) {
            // Swish: f(x) = x * sigmoid(x) = x / (1 + exp(-x))
            output_ptr[j] = input_ptr[j] / (1.0f + expf(-input_ptr[j]));
        }
    }
    return YES;
}
```
- Encode GPU commands to the command buffer:
- Note: according to the documentation, this command buffer should not be committed here, as it is executed by CoreML after the method returns.
```objc
- (BOOL)encodeToCommandBuffer:(id<MTLCommandBuffer>)commandBuffer inputs:(NSArray<id<MTLTexture>> *)inputs outputs:(NSArray<id<MTLTexture>> *)outputs error:(NSError *__autoreleasing _Nullable *)error {
    NSLog(@"Dispatching to GPU");
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoderWithDispatchType:MTLDispatchTypeSerial];
    assert(computeEncoder != nil);
    for (NSUInteger i = 0; i < inputs.count; i++) {
        [computeEncoder setComputePipelineState:swishPipeline];
        [computeEncoder setTexture:inputs[i] atIndex:0];
        [computeEncoder setTexture:outputs[i] atIndex:1];
        NSUInteger w = swishPipeline.threadExecutionWidth;
        NSUInteger h = swishPipeline.maxTotalThreadsPerThreadgroup / w;
        MTLSize threadgroupSize = MTLSizeMake(w, h, 1);
        // Ceiling division so the grid covers the whole texture (sized from texture i, not texture 0)
        NSUInteger groupWidth  = (inputs[i].width  + threadgroupSize.width  - 1) / threadgroupSize.width;
        NSUInteger groupHeight = (inputs[i].height + threadgroupSize.height - 1) / threadgroupSize.height;
        NSUInteger groupDepth  = (inputs[i].arrayLength + threadgroupSize.depth - 1) / threadgroupSize.depth;
        MTLSize threadgroupCount = MTLSizeMake(groupWidth, groupHeight, groupDepth);
        // These are threadgroup counts, so dispatchThreadgroups: is the matching call
        // (dispatchThreads: expects the total grid size instead).
        [computeEncoder dispatchThreadgroups:threadgroupCount threadsPerThreadgroup:threadgroupSize];
    }
    // End encoding once, after all textures have been encoded
    [computeEncoder endEncoding];
    return YES;
}
```
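The threadgroup arithmetic in `encodeToCommandBuffer:` is plain ceiling division over the texture's width, height, and arrayLength. A quick sketch of the same computation in Python, under the (hypothetical, illustrative) assumption that CoreML packs 4 channels per texture slice, so the 1x64x256x256 input above maps to width=256, height=256, arrayLength=16, with a typical 32x32x1 threadgroup:

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest n such that n * b >= a."""
    return (a + b - 1) // b

def threadgroup_count(width, height, array_length, tg_w, tg_h, tg_d=1):
    """Number of threadgroups needed to cover a (width x height x array_length) grid."""
    return (ceil_div(width, tg_w),
            ceil_div(height, tg_h),
            ceil_div(array_length, tg_d))

# Hypothetical geometry for the 1x64x256x256 input, assuming 4 channels per
# slice (arrayLength = 64 / 4 = 16) and a 32x32x1 threadgroup:
print(threadgroup_count(256, 256, 16, 32, 32))  # (8, 8, 16)
```

The ceiling division over-covers textures whose dimensions are not multiples of the threadgroup size, which is why the shader must bounds-check its thread position.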
Run inference for a given input
The MLModel is loaded and compiled in the application. I check that the model configuration's `computeUnits` is set to `MLComputeUnitsAll`, as desired; this should allow CoreML to dispatch the model's layers to the CPU, GPU, and ANE (Apple Neural Engine).
I define an `MLDictionaryFeatureProvider` object called `feature_provider` from an `NSMutableDictionary` of input features (input tensors in this case), and then pass it to the `predictionFromFeatures:error:` method of my loaded model, `model`, as follows:
```objc
@autoreleasepool {
    [model predictionFromFeatures:feature_provider error:error];
}
```
This computes a single forward pass of my model. When it executes, the 'Dispatching to CPU' string is printed instead of 'Dispatching to GPU'. This (along with the slow execution time) indicates that the Swish layer is being run through the `evaluateOnCPUWithInputs:` method, and thus on the CPU, instead of on the GPU as expected.
I am quite new to iOS development and to Objective-C++, so I may have missed something simple; however, from reading the documentation and examples, it is not at all clear to me what the issue is. Any help or advice would be really appreciated :)
Environment
- Xcode 13.1
- iPhone 13
- iOS 15.1.1
- iOS deployment target 15.0