MLModelAsset(specification:blobMapping:) with mlprogram model: correct predictions but drastically slower inference than compiled .mlmodelc path

I'm distributing an encrypted .mlpackage to my app and want to load it entirely in memory without ever writing decrypted weights to disk. I tried MLModelAsset(specification:blobMapping:) as the path to achieve this, but ran into a significant inference performance gap compared to the compiled code path.

What I'm trying to do

The encrypted .enc file is a serialized FileWrapper of the full .mlpackage, sealed with AES-GCM. At runtime I decrypt it in memory, deserialize the FileWrapper, extract the spec and weight blob, and load via MLModelAsset:

static func loadEncryptedPackage(url: URL, configuration: MLModelConfiguration) async throws -> MLModel {
    // AES-GCM decryption → decryptedData (full serialized .mlpackage)

    guard let wrapper = FileWrapper(serializedRepresentation: decryptedData) else { throw ... }

    guard let (specWrapper, specParent) = findSpecWrapper(in: wrapper),
          let spec = specWrapper.regularFileContents else { throw ... }

    var blobs: [URL: Data] = [:]
    collectBlobs(in: specParent, relativePath: "", excluding: specWrapper, into: &blobs)
    // keys built as URL(fileURLWithPath: rel), e.g. "weights/weight.bin"

    let asset = try MLModelAsset(specification: spec, blobMapping: blobs)
    let model = try await MLModel.load(asset: asset, configuration: configuration)

    // See observation #3 below — must retain these for the model's lifetime
    objc_setAssociatedObject(model, &retentionKey, Retainer(spec: spec, blobs: blobs), .OBJC_ASSOCIATION_RETAIN)
    return model
}
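The helpers called above are not shown. As a rough sketch, collectBlobs could walk the FileWrapper tree as follows. The body is an assumption reconstructed from the call site and the key comment, not the actual implementation, and the tiny in-memory package at the bottom is made-up sample data.

```swift
import Foundation

/// Hypothetical helper (signature inferred from the call site above).
/// Recursively gathers every regular file under `directory` into `blobs`,
/// keyed by its path relative to the directory the walk started from.
func collectBlobs(in directory: FileWrapper,
                  relativePath: String,
                  excluding excluded: FileWrapper,
                  into blobs: inout [URL: Data]) {
    guard directory.isDirectory, let children = directory.fileWrappers else { return }
    for (name, child) in children {
        // Skip the spec itself; it is passed separately as `specification:`.
        if child === excluded { continue }
        let rel = relativePath.isEmpty ? name : relativePath + "/" + name
        if child.isDirectory {
            collectBlobs(in: child, relativePath: rel, excluding: excluded, into: &blobs)
        } else if child.isRegularFile, let data = child.regularFileContents {
            // Key format assumed from the comment: URL(fileURLWithPath: rel)
            blobs[URL(fileURLWithPath: rel)] = data
        }
    }
}

// Made-up stand-in for the directory that contains the spec and its weights.
let specFile = FileWrapper(regularFileWithContents: Data([0]))
let weightsDir = FileWrapper(directoryWithFileWrappers: [
    "weight.bin": FileWrapper(regularFileWithContents: Data([1, 2, 3]))
])
let specParent = FileWrapper(directoryWithFileWrappers: [
    "model.mlmodel": specFile,
    "weights": weightsDir
])

var blobs: [URL: Data] = [:]
collectBlobs(in: specParent, relativePath: "", excluding: specFile, into: &blobs)
print(blobs.keys.map { $0.relativePath })
```

Passing the spec wrapper as `excluding:` keeps it out of the blob dictionary, so only the weight files end up in blobMapping.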

What I observed

  1. Predictions are accurate. The blobs are found, weights are applied, and the model produces correct results.

  2. Inference is drastically slower than the compiled code path. The same model loaded via MLModel.compileModel(at:) + MLModel.load(contentsOf:) runs inference much faster on the same device with the same MLModelConfiguration (computeUnits = .all). With MLModelAsset the slowdown is consistent across every prediction call, not just the first one.

  3. The spec and blob Data objects must stay alive for the model's lifetime. Without retaining them via objc_setAssociatedObject, inference produces NaN outputs or crashes. This suggests Core ML holds a reference back into those Data buffers beyond the load() call, rather than copying them into its own memory during loading.

  4. Using the exact blob URI from the spec as the blobMapping key triggers a compilation error. The spec (inspected via strings on the .mlmodel protobuf) stores blob references as @model_path/weights/weight.bin. When I key the blobMapping with URL(string: "@model_path/weights/weight.bin"), MLModel.load(asset:) throws:

compiler error: Encountered an error while compiling a model: validator error: The in-memory ML Program must not have a blob file reference but found a reference to mem://weights/weight.bin.

With other key formats (e.g. URL(fileURLWithPath: "weights/weight.bin")), this error does not appear: the model loads and predictions are accurate, but inference is slow as in observation #2.
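To make the difference between the two key formats in observation #4 concrete, the snippet below prints how Foundation parses each one. It only inspects the URL values. How Core ML matches blobMapping keys against the spec's blob references is not something I can confirm, and describeKey is a hypothetical helper written for this post.

```swift
import Foundation

// Key copied from the blob reference stored in the spec.
let specStyleKey = URL(string: "@model_path/weights/weight.bin")
// Key built from a relative file path, as in collectBlobs.
let fileStyleKey = URL(fileURLWithPath: "weights/weight.bin")

/// Hypothetical helper: summarizes the parts of a URL that could matter for key matching.
func describeKey(_ label: String, _ url: URL?) -> String {
    guard let url = url else { return "\(label): not a valid URL" }
    return "\(label): scheme=\(url.scheme ?? "nil") isFileURL=\(url.isFileURL) "
         + "path=\(url.path) absolute=\(url.absoluteString)"
}

print(describeKey("URL(string:)", specStyleKey))
print(describeKey("URL(fileURLWithPath:)", fileStyleKey))
```

The first key has no scheme, while the second is a file URL resolved against the current working directory, so the two are never equal as dictionary keys even though both end in weights/weight.bin.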

The working alternative (which I want to avoid)

Decrypting to a temporary directory, calling MLModel.compileModel(at:), loading from the compiled .mlmodelc, then deleting the temp files produces fast inference. Same model, same device, same configuration. The only difference is the compilation step — and the fact that decrypted weights touch disk, which I want to avoid for security reasons.
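For anyone wanting to reproduce the comparison, a harness along these lines can put numbers on the gap between the two load paths. medianLatencyMs is a hypothetical helper, and the closure below is a placeholder workload; in the real test it would wrap the same prediction call once for the MLModelAsset-loaded model and once for the .mlmodelc-loaded one.

```swift
import Foundation

/// Hypothetical helper: runs `body` `warmup` times untimed, then `iterations`
/// times timed, and returns the median wall-clock latency in milliseconds.
func medianLatencyMs(warmup: Int = 5,
                     iterations: Int = 50,
                     _ body: () throws -> Void) rethrows -> Double {
    precondition(warmup >= 0 && iterations > 0)
    for _ in 0..<warmup { try body() }

    var samples: [Double] = []
    samples.reserveCapacity(iterations)
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        try body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }

    samples.sort()
    let mid = samples.count / 2
    return samples.count % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2
}

// Placeholder workload standing in for a prediction call.
var calls = 0
let ms = medianLatencyMs(warmup: 2, iterations: 4) {
    calls += 1
    _ = (0..<10_000).reduce(0, &+)
}
print("median latency: \(ms) ms over \(calls) calls")
```

Using the median after a few warm-up runs keeps first-call setup cost out of the figure, which matters here because the slowdown shows up on every prediction, not only the first.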

Questions

  1. Is MLModelAsset(specification:blobMapping:) expected to produce inference performance equivalent to loading from a compiled .mlmodelc? If not, is the performance gap fundamental to the API or something that can be addressed?

  2. Is there any supported way to load an mlprogram model with external weight blobs entirely in memory and achieve inference performance comparable to the compiled code path — i.e. without writing decrypted model data to disk at any point?

  3. The validator error "in-memory ML Program must not have a blob file reference" is a hard block when Core ML successfully resolves the blobs and attempts mlprogram compilation. Is this an intended constraint, and does it mean MLModelAsset(specification:blobMapping:) is not the right API for this use case?

For context: my goal is to keep the model from being stolen, on both iOS and macOS. The app downloads the encrypted model from our server and decrypts it with a key stored in the keychain (the key also comes from the server). I don't want the decrypted model written to disk at any point, which is why I'm trying MLModelAsset: it can be initialized directly from decrypted bytes held in memory.
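The decryption step elided at the top of loadEncryptedPackage stays entirely in memory. A minimal CryptoKit sketch of it follows, assuming the .enc payload is stored in the AES-GCM combined form (nonce, ciphertext, tag) and that the key has already been read from the keychain. decryptInMemory is a hypothetical helper, and the random key and sample bytes are stand-ins.

```swift
import CryptoKit
import Foundation

/// Hypothetical helper: opens an AES-GCM sealed payload without touching disk.
/// Assumes `sealedBytes` is the combined representation (nonce + ciphertext + tag).
func decryptInMemory(_ sealedBytes: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.SealedBox(combined: sealedBytes)
    // Throws if the authentication tag does not match.
    return try AES.GCM.open(box, using: key)
}

// Stand-ins: in the app the key comes from the keychain and the sealed bytes
// from the downloaded .enc file.
let key = SymmetricKey(size: .bits256)
let plaintext = Data("serialized FileWrapper bytes".utf8)
let sealedBox = try AES.GCM.seal(plaintext, using: key)
guard let sealedBytes = sealedBox.combined else {
    fatalError("combined form is unavailable for non-default nonce sizes")
}

let decryptedData = try decryptInMemory(sealedBytes, using: key)
print(decryptedData == plaintext)
```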
