Run Time Issues with Swift/Core ML

Hello!

I have a Swift program that tracks the location of a ball through the back camera. It seems to be working correctly, but the one issue is the run time, particularly in my concatenate, normalize, and argmax functions, which are meant to be a one-to-one copy of PyTorch's argmax and the following Python lines:

imgs = np.concatenate((img, img_prev, img_preprev), axis=2)
imgs = imgs.astype(np.float32)/255.0
imgs = np.rollaxis(imgs, 2, 0)
inp = np.expand_dims(imgs, axis=0) # used to pass into model

However, I need my program to run in real time, and ideally well under real time. Below is a breakdown of the run times produced by my code:

Starting model inference
Setup took: 0.0 seconds
Resize took: 0.03741896152496338 seconds
Concatenation took: 0.3359949588775635 seconds
Normalization took: 0.9906361103057861 seconds
Model prediction took: 0.3425499200820923 seconds
Argmax took: 28.17007803916931 seconds
Postprocess took: 0.054128050804138184 seconds
Model inference took 29.934185028076172 seconds

Here are the concatenateBuffers, normalizeBuffer, and argmax functions that I use:

func concatenateBuffers(_ buffers: [CVPixelBuffer?]) -> CVPixelBuffer? {
    guard buffers.count == 3, let first = buffers[0] else { return nil }
    let width = CVPixelBufferGetWidth(first)
    let height = CVPixelBufferGetHeight(first)
    let targetChannels = 9
    
    var concatenated: CVPixelBuffer?
    let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue] as CFDictionary
    CVPixelBufferCreate(kCFAllocatorDefault, width, height, kCVPixelFormatType_32BGRA, attrs, &concatenated)
    guard let output = concatenated else { return nil }
    
    CVPixelBufferLockBaseAddress(output, [])
    defer { CVPixelBufferUnlockBaseAddress(output, []) }
    
    guard let outputData = CVPixelBufferGetBaseAddress(output) else { return nil }
    let outputPtr = UnsafeMutablePointer<UInt8>(OpaquePointer(outputData))
    
    // Lock all input buffers at once
    buffers.forEach { buffer in
        guard let buffer = buffer else { return }
        CVPixelBufferLockBaseAddress(buffer, .readOnly)
    }
    defer {
        buffers.forEach { CVPixelBufferUnlockBaseAddress($0!, .readOnly) }
    }
    
    // Process each input buffer
    for (frameIdx, buffer) in buffers.enumerated() {
        guard let buffer = buffer,
              let inputData = CVPixelBufferGetBaseAddress(buffer) else { continue }
        
        let inputPtr = UnsafePointer<UInt8>(OpaquePointer(inputData))
        let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)
        let totalPixels = width * height
        
        // Process all pixels in one go for this frame
        for i in 0..<totalPixels {
            let y = i / width
            let x = i % width
            
            let inputOffset = y * bytesPerRow + x * 4
            let outputOffset = i * targetChannels + frameIdx * 3
            
            // BGR order to match numpy
            outputPtr[outputOffset] = inputPtr[inputOffset + 2]     // B
            outputPtr[outputOffset + 1] = inputPtr[inputOffset + 1] // G
            outputPtr[outputOffset + 2] = inputPtr[inputOffset]     // R
        }
    }
    
    return output
}

func normalizeBuffer(_ buffer: CVPixelBuffer?) -> MLMultiArray? {
    guard let input = buffer else { return nil }
    
    let width = CVPixelBufferGetWidth(input)
    let height = CVPixelBufferGetHeight(input)
    let channels = 9
    
    CVPixelBufferLockBaseAddress(input, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(input, .readOnly) }
    
    guard let inputData = CVPixelBufferGetBaseAddress(input) else { return nil }
    
    let shape = [1, NSNumber(value: channels), NSNumber(value: height), NSNumber(value: width)]
    guard let output = try? MLMultiArray(shape: shape, dataType: .float32) else { return nil }
    
    let inputPtr = inputData.assumingMemoryBound(to: UInt8.self)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(input)
    
    let ptr = UnsafeMutablePointer<Float>(OpaquePointer(output.dataPointer))
    let totalSize = width * height
    
    for c in 0..<channels {
        for idx in 0..<totalSize {
            let h = idx / width
            let w = idx % width
            let inputIdx = h * bytesPerRow + w * channels + c
            ptr[c * totalSize + idx] = Float(inputPtr[inputIdx]) / 255.0
        }
    }
    
    return output
}

func argmax(_ array: MLMultiArray) -> MLMultiArray? {
    let shape = array.shape.map { $0.intValue }
    guard shape.count == 3,
          shape[0] == 1,
          shape[1] == 256,
          shape[2] == 230400 else {
        return nil
    }
    
    guard let output = try? MLMultiArray(shape: [1, NSNumber(value: 230400)], dataType: .int32) else { return nil }
    
    let ptr = UnsafePointer<Float>(OpaquePointer(array.dataPointer))
    let outputPtr = UnsafeMutablePointer<Int32>(OpaquePointer(output.dataPointer))
    
    let channelSize = 230400
    
    for pos in 0..<230400 {
        var maxValue = -Float.infinity
        var maxIndex: Int32 = 0
        
        for channel in 0..<256 {
            let value = ptr[channel * channelSize + pos]
            if value > maxValue {
                maxValue = value
                maxIndex = Int32(channel)
            }
        }
        
        outputPtr[pos] = maxIndex
    }
    
    return output
}

Are there any glaring inefficiencies that can be reduced to allow faster-than-real-time processing while following exactly the same logic as the Python code? Would using Obj-C speed things up for some reason? Are there any tools I can use so I don't have to write these functions myself?

Additionally, in the class's init function, I tried to check the compute units being used, since I feel 0.34 seconds for a single model prediction is also far too long, but no print statements are showing for some reason:

init() {
    guard let loadedModel = try? BallTrackerModel() else {
        fatalError("Could not load model")
    }
    let config = MLModelConfiguration()
    config.computeUnits = .all
    guard let configuredModel = try? BallTrackerModel(configuration: config) else {
        fatalError("Could not configure model")
    }
    self.model = configuredModel
    print("model loaded with compute units \(config.computeUnits.rawValue)")
}

Thanks!

Answered by Frameworks Engineer in 822891022

Hi @michaeldegoat

If you are targeting iOS 18.0+, you may find MLTensor useful.

The equivalent code for the NumPy snippet you showed would look something like:

var imgs = MLTensor(
    concatenating: [img, imgPrev, imgPrevPrev],
    alongAxis: 2
).cast(to: Float.self) / 255
// Assuming rollaxis is used to move the channel to the first dimension
// (aka transpose)
imgs = imgs.transposed()
// Add batch dimension
imgs = imgs.expandingShape(at: 0)

If these are large images, then you may find it beneficial to dispatch the workload to the GPU.

let imgs = withMLTensorComputePolicy(.cpuAndGPU) {
    var imgs = MLTensor(
        concatenating: [img, imgPrev, imgPrevPrev],
        alongAxis: 2
    ).cast(to: Float.self) / 255
    imgs = imgs.transposed()
    imgs = imgs.expandingShape(at: 0)
    return imgs
}

But to instantiate an MLTensor from a CVPixelBuffer, you will need to instantiate an MLMultiArray and then an MLShapedArray (which also supports some transformation operations you may find useful).
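A rough sketch of that bridging path might look something like the following, where pixelBufferToMultiArray stands in for your own copy routine (e.g. a variant of your normalize step that fills a Float32 MLMultiArray):

func makeTensor(from pixelBuffer: CVPixelBuffer) -> MLTensor? {
    // Placeholder helper: copies the pixel data into a Float32 MLMultiArray.
    guard let multiArray = pixelBufferToMultiArray(pixelBuffer) else { return nil }
    // Bridge the MLMultiArray into a Swift-native MLShapedArray...
    let shaped = MLShapedArray<Float>(multiArray)
    // ...and then into an MLTensor for further operations.
    return MLTensor(shaped)
}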

Alternatively, you could extend the existing model to perform the preprocessing, i.e., wrap the model in a module/layer which takes in multiple inputs and stitches them together before passing to the pre-trained model.

Regarding compute device compatibility of the model, you can use either the MLComputePlan API or the Xcode Core ML Performance Report. Check out the following WWDC sessions to learn more.

https://developer.apple.com/videos/play/wwdc2024/10161/
https://developer.apple.com/videos/play/wwdc2023/10049/

Ok thanks! This is really helpful!

I had a couple of follow up questions:

  1. So the general approach to minimize run time would be to get the CVPixelBuffers from the camera, convert them into vImages so I can resize them quickly, then turn those vImages into Float arrays, which can be reshaped and converted to MLTensors on which we can perform the needed operations?
  2. When I am coding in Xcode, I am getting "cannot find 'MLTensor' in scope", even though I have imported CoreML. How can I use MLTensor?
  3. Right now my model takes MultiArrays as input and produces a MultiArray as output. How do I set the inputs and outputs to be MLTensors? I have pasted my current converter code below:
import logging

import numpy as np
import torch
import coremltools as ct

# BallTrackerNet (the PyTorch module definition) is assumed to be importable
# from the training code.

def convert_to_coreml(model_path):
    logging.basicConfig(level=logging.DEBUG)

    model = BallTrackerNet()
    model.load_state_dict(torch.load(model_path, map_location='cpu'))
    model.eval()
    
    example_input = torch.rand(1, 9, 360, 640)
    
    # Trace the model to verify shapes
    traced_model = torch.jit.trace(model, example_input)
    
    model_coreml = ct.convert(
        traced_model,
        inputs=[
            ct.TensorType(
                name="input_frames",
                shape=(1, 9, 360, 640),
                dtype=np.float32
            )
        ],
        convert_to="mlprogram",
    )
    
    model_coreml.save("BallTracker2.mlpackage")
    return model_coreml

# Run conversion
try:
    model = convert_to_coreml("balltrackerbest.pt")
    print("Conversion successful!")
except Exception as e:
    print(f"Conversion error: {str(e)}")

Thanks!

Hi @michaeldegoat

Using MLTensor is one option; another option is to perform this inside the model itself, i.e., wrap the BallTrackerNet module in a parent module which accepts three inputs and performs the necessary transformations before passing the concatenated input to BallTrackerNet. As you’ve suggested, you could also investigate using other accelerated frameworks for the pre-processing, such as Accelerate (for the CPU) or Metal (for the GPU).
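For instance, a minimal vDSP sketch of the normalization step (assuming the 8-bit channel data has already been copied into a contiguous buffer) could replace the per-element Swift loop with two bulk operations:

import Accelerate

// Sketch: bulk UInt8 -> Float conversion and divide-by-255 with vDSP.
// `bytes` is assumed to hold the tightly packed channel data to normalize.
func normalize(_ bytes: [UInt8]) -> [Float] {
    var floats = [Float](repeating: 0, count: bytes.count)
    vDSP_vfltu8(bytes, 1, &floats, 1, vDSP_Length(bytes.count)) // UInt8 -> Float
    return vDSP.divide(floats, 255)                             // elementwise x / 255
}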

If you don’t see MLTensor, it might be because you need to update Xcode (see availability here). Download the latest version and see if that fixes the issue.

MLModel also accepts MLTensor; check out the API documentation here.
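For illustration, a tensor-based prediction call (iOS 18+) could look roughly like the snippet below; the feature names are placeholders, and the argmax call is an assumption about how you might fold your post-processing into MLTensor as well:

// Sketch: `model` is the loaded MLModel, `imgs` is the MLTensor built above.
// "input_frames" matches the converter code; "output" is a placeholder for
// the model's actual output feature name.
let outputs = try await model.prediction(from: ["input_frames": imgs])
if let heatmap = outputs["output"] {
    // Assumption: argmax over the class/channel axis, replacing the hand-written loop.
    let best = heatmap.argmax(alongAxis: 1)
    let result = await best.shapedArray(of: Int32.self)
    print(result.shape)
}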
