We built a setup where a model split into an encoder and a decoder can run each part on a different backend, using our own component protocols. Is mixing Core AI and Core ML within a single inference pass something you would recommend, and what is the realistic cost at the boundary where we convert between MLMultiArray / MLTensor and NDArray? Is there a way to keep the encoder output resident on the GPU or ANE so it does not need a host round trip into the other backend?
Mixing compute between Core ML and Core AI is definitely possible. Some parts can be done without any bridging cost; others may require synchronization:
For buffer types, MLMultiArray and MLShapedArray can be bridged without a copy: use the NDArray.View/RawView types to construct NDArray views over the memory of the MLMultiArray or MLShapedArray. Those views can then be used as inputs to the Core AI work, and you can construct mutable views the same way for the outputs.
You can similarly bridge an MLTensor to an NDArray in two steps: first bridge the MLTensor to an MLShapedArray, then bridge that MLShapedArray to an NDArray as described above.
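A minimal sketch of the Core ML half of that two-step bridge, assuming the tensor holds Float scalars. The helper names hostArray and storageLayout are made up for illustration. The final step, constructing the NDArray view over the borrowed memory, is deliberately left out because its exact signature is not given in this thread.

```swift
import CoreML

// Step 1: materialize the MLTensor as an MLShapedArray.
// Awaiting here waits for the tensor's pending work to finish.
// Assumption: the tensor's scalar type is Float.
@available(macOS 15.0, iOS 18.0, *)
func hostArray(from tensor: MLTensor) async -> MLShapedArray<Float> {
    await tensor.shapedArray(of: Float.self)
}

// Step 2: borrow the shaped array's storage without copying it.
// The pointer is only valid inside the closure, so anything built over
// that memory (for example an NDArray view) must not outlive the closure.
// This helper only reports the layout such a view would need.
@available(macOS 15.0, iOS 18.0, *)
func storageLayout(of array: MLShapedArray<Float>)
    -> (bufferCount: Int, shape: [Int], strides: [Int]) {
    array.withUnsafeShapedBufferPointer { buffer, shape, strides in
        (bufferCount: buffer.count, shape: shape, strides: strides)
    }
}
```

The shape and strides handed to the closure are the layout information a view over the same memory would have to respect.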
However, for the MLTensor case it sounds like you want an optimized asynchronous pipeline between the Core ML work and the Core AI model, so that when both run on the GPU or Neural Engine they share the same async flow without synchronizing to the CPU in between. We support this for separate models/functions that all run through Core AI, but there isn't a way to put both the Core ML and the Core AI inference on the same async pipeline. You have to wait on the CPU for the first to complete, then dispatch the second.
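The control flow that results can be sketched in plain Swift. Everything here is hypothetical: InferenceStage stands in for your own component protocol, and Scale and Offset are toy stages in place of the real encoder and decoder. Neither backend's API is used.

```swift
// Hypothetical component protocol; a real one would wrap a Core ML or
// Core AI model behind the same interface.
protocol InferenceStage {
    func run(_ input: [Float]) async throws -> [Float]
}

// Toy stand-in for the encoder.
struct Scale: InferenceStage {
    let factor: Float
    func run(_ input: [Float]) async throws -> [Float] {
        input.map { $0 * factor }
    }
}

// Toy stand-in for the decoder.
struct Offset: InferenceStage {
    let amount: Float
    func run(_ input: [Float]) async throws -> [Float] {
        input.map { $0 + amount }
    }
}

// Runs two stages back to back. The first await is the host-side
// synchronization point: the decoder is not dispatched until the
// encoder's output is available to the CPU.
func runSequentially<E: InferenceStage, D: InferenceStage>(
    encoder: E, decoder: D, input: [Float]
) async throws -> [Float] {
    let hidden = try await encoder.run(input)
    return try await decoder.run(hidden)
}
```

Called as `try await runSequentially(encoder: Scale(factor: 2), decoder: Offset(amount: 1), input: [1, 2])`, the decoder only starts after the encoder has returned.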