MLCompute performance - control (pre)allocation of result and output tensors?

With the new MLCompute framework in Objective-C, I built a simple MLCInferenceGraph for testing. It runs just one layer, a batched MatMul.
It works fine except for one thing.
With all three MLCTensors (two inputs, one output) allocated on the CPU, when you repeatedly call (e.g. with new input data):
  • [MLCInferenceGraph executeWithInputsData:batchSize:options:completionHandler:]

the resultTensor received in the completionHandler is newly allocated after each call to executeWithInputsData.

I want to avoid that, as I have already pre-allocated, properly cache-aligned CPU memory for all the input and output buffers, i.e. no new allocations should be necessary. And this is CPU-only, i.e. there are no synchronization needs.
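For reference, here is roughly how the test case is set up. This is a sketch from memory, not the exact project code: the buffer names (`bufA`, `bufB`), shapes, input names, and the `MLCMatMulDescriptor` convenience constructor are illustrative and may need adjusting against the headers.

```objc
#import <MLCompute/MLCompute.h>
#include <stdlib.h>

MLCDevice *device = [MLCDevice deviceWithType:MLCDeviceTypeCPU];

// Two input tensors feeding one batched-MatMul node.
MLCTensor *inA = [MLCTensor tensorWithShape:@[@1, @64, @64] dataType:MLCDataTypeFloat32];
MLCTensor *inB = [MLCTensor tensorWithShape:@[@1, @64, @64] dataType:MLCDataTypeFloat32];

MLCGraph *graph = [MLCGraph graph];
MLCTensor *result = [graph nodeWithLayer:[MLCMatMulLayer layerWithDescriptor:
                                             [MLCMatMulDescriptor descriptor]]
                                 sources:@[inA, inB]];

MLCInferenceGraph *inference = [MLCInferenceGraph graphWithGraphObjects:@[graph]];
[inference addInputs:@{@"A" : inA, @"B" : inB}];
[inference compileWithOptions:MLCGraphCompilationOptionsNone device:device];

// Pre-allocated, cache-aligned CPU buffers, wrapped without copying.
size_t len = 64 * 64 * sizeof(float);
void *bufA, *bufB;
posix_memalign(&bufA, 64, len);
posix_memalign(&bufB, 64, len);
MLCTensorData *dataA = [MLCTensorData dataWithBytesNoCopy:bufA length:len];
MLCTensorData *dataB = [MLCTensorData dataWithBytesNoCopy:bufB length:len];

[inference executeWithInputsData:@{@"A" : dataA, @"B" : dataB}
                       batchSize:1
                         options:MLCExecutionOptionsSynchronous
               completionHandler:^(MLCTensor *resultTensor, NSError *error,
                                   NSTimeInterval executionTime) {
                   // resultTensor's backing store is a fresh allocation
                   // on every call -- this is the problem.
               }];
```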

First, I tried binding the output data via addOutputs: (as with the inputs) on the MLCInferenceGraph [1,2], but the resultTensor data in the completionHandler is still newly allocated each time.
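Concretely, that first attempt looked roughly like this (a sketch; the output name `@"out"`, the `result` tensor, and the buffer names `outBuf`/`outLen` are my illustrative placeholders):

```objc
// Register the result tensor as a named graph output, then pre-bind my
// cache-aligned buffer to it, analogous to the input binding.
[inference addOutputs:@{@"out" : result}];

MLCTensorData *outData = [MLCTensorData dataWithBytesNoCopy:outBuf length:outLen];
[result bindAndWriteData:outData toDevice:device];

// Observed: the resultTensor delivered to the completionHandler still
// has a freshly allocated backing store -- it does not point at outBuf.
```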

Second, I tried passing outputsData: myPreallocatedOutdata to executeWithInputsData:... [1,3], and the result is still newly allocated on every invocation:
  • [MLCInferenceGraph executeWithInputsData:outputsData:batchSize:options:completionHandler:]
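The second attempt, as a sketch (again with my illustrative names; `dataA`/`dataB` stand for the MLCTensorData objects already supplied for the inputs, `outBuf`/`outLen` for my pre-allocated output buffer):

```objc
MLCTensorData *myPreallocatedOutdata =
    [MLCTensorData dataWithBytesNoCopy:outBuf length:outLen];

[inference executeWithInputsData:@{@"A" : dataA, @"B" : dataB}
                     outputsData:@{@"out" : myPreallocatedOutdata}
                       batchSize:1
                         options:MLCExecutionOptionsSynchronous
               completionHandler:^(MLCTensor *resultTensor, NSError *error,
                                   NSTimeInterval executionTime) {
                   // Still newly allocated on every invocation.
               }];
```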


Question: How can we, as users, avoid the new allocation of output data on every invocation of
  • [MLCInferenceGraph executeWithInputsData:batchSize:options:completionHandler:]

please?

(Via crash backtraces I can see that an internal call to
  • [MLCDeviceCPU(MLCEngineDispatch) dispatchForwardMatMulLayer:sourceTensor:secondarySourceTensor:resultTensor:resultTensorIsTemporary:resultTensorAllocate:] + 204

might be involved, but the user seems unable to control the allocation behavior of the output tensor.)

PS: I check [resultTensor data] in the completionHandler to verify whether I got my pre-allocated tensor/data buffers or not.
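The check itself is a simple pointer comparison inside the handler (sketch; `outBuf` is my pre-allocated buffer):

```objc
completionHandler:^(MLCTensor *resultTensor, NSError *error,
                    NSTimeInterval executionTime) {
    const void *got = resultTensor.data.bytes;  // NSData backing the result
    if (got != outBuf) {
        NSLog(@"output re-allocated: got %p, expected %p", got, outBuf);
    }
}
```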

What am I missing :) ? Any solutions?

[1] https://developer.apple.com/documentation/mlcompute/mlcinferencegraph?language=objc
[2] https://developer.apple.com/documentation/mlcompute/mlcinferencegraph/3579690-addoutputs?language=objc
[3] https://developer.apple.com/documentation/mlcompute/mlcinferencegraph/3579696-executewithinputsdata?language=objc

After two months of waiting for a response from Apple, just a ping from me: I have heard nothing yet on whether there is a bug in my code, a bug in MLCompute, or whether I need a different workaround in ObjC/MLCompute. I mailed the friendly Apple developer contact as well, but there was no follow-up from the MLCompute team. Even with my effort on the ObjC/MLCompute test case in Feedback Assistant, no response. The GitHub repos for TF-M1 and PyTorch are also very quiet on MLCompute.
