ANE Error with Stateful Model: "Unable to compute prediction" when State Tensor width is not 32-aligned

Hi everyone,

I believe I’ve encountered a potential bug or a hardware alignment limitation in the Core ML Framework / ANE Runtime specifically affecting the new Stateful API (introduced in iOS 18/macOS 15).

The Issue:

A Stateful mlprogram fails to run on the Apple Neural Engine (ANE) if the width of the state tensor is not a multiple of 32. The same model works on CPU and GPU, but fails on ANE both at runtime and when generating a Performance Report in Xcode.

Error Message in Xcode UI:

"There was an error creating the performance report Unable to compute the prediction using ML Program. It can be an invalid input data or broken/unsupported model."

Observations:

Case A (Fails): State shape = (1, 3, 480, 270). Prediction fails on ANE.

Case B (Success): State shape = (1, 3, 480, 256). Prediction succeeds on ANE.

This suggests an internal memory alignment or tiling issue within the ANE driver when handling Stateful buffers that don't meet the 32-pixel/element alignment.
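
To make the alignment arithmetic concrete, here is a small sketch. The 32-element figure is my inference from the two cases above, not a documented ANE requirement, and `is_aligned` and `round_up` are just illustrative helper names:

```python
ALIGNMENT = 32  # assumed from the observations above, not a documented ANE constraint


def is_aligned(width, alignment=ALIGNMENT):
    """Return True if width is an exact multiple of alignment."""
    return width % alignment == 0


def round_up(width, alignment=ALIGNMENT):
    """Return the smallest multiple of alignment that is >= width."""
    return -(-width // alignment) * alignment


# Widths mentioned in this post: 270 and 255 fail on ANE, 256 succeeds.
for w in (270, 255, 256):
    print(w, is_aligned(w), round_up(w))
```

Under that assumption, 270 would need to grow to 288 and 255 to 256 to become aligned.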

Reproduction Code (PyTorch + coremltools):

import torch
import torch.nn as nn
import coremltools as ct
import numpy as np

class RNN_Stateful(nn.Module):
    def __init__(self, hidden_shape):
        super(RNN_Stateful, self).__init__()
        # Simple conv to update state
        self.conv1 = nn.Conv2d(3 + hidden_shape[1], hidden_shape[1], kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden_shape[1], 3, kernel_size=3, padding=1)
        self.register_buffer("hidden_state", torch.ones(hidden_shape, dtype=torch.float16))
    
    def forward(self, imgs):
        self.hidden_state = self.conv1(torch.cat((imgs, self.hidden_state), dim=1))
        return self.conv2(self.hidden_state)

# w=255 (not a multiple of 32) causes the ANE failure; w=256 works.
b, ch, h, w = 1, 3, 480, 255
model = RNN_Stateful((b, ch, h, w)).eval()
traced_model = torch.jit.trace(model, torch.randn(b, 3, h, w))

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_image", shape=(b, 3, h, w), dtype=np.float16)],
    outputs=[ct.TensorType(name="output", dtype=np.float16)],
    states=[ct.StateType(wrapped_type=ct.TensorType(shape=(b, ch, h, w), dtype=np.float16), name="hidden_state")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram"
)
mlmodel.save("rnn_stateful.mlpackage")

Steps to see the error:

1. Open the generated .mlpackage in Xcode 16.0+.

2. Go to the Performance tab and run a test on a device with ANE (e.g., iPhone 15/16 or M-series Mac).

3. The report fails to generate with the error mentioned above.
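
The runtime failure can also be checked outside Xcode. This is a minimal sketch, assuming coremltools 8.0+ on macOS 15+ and the file name, input name, and shape from the script above; `sweep_compute_units` is a hypothetical helper name. Note that `CPU_AND_NE` only permits the ANE and Core ML may still dispatch elsewhere, so a success there does not by itself prove the ANE was used.

```python
import numpy as np

# Values taken from the conversion script above.
MODEL_PATH = "rnn_stateful.mlpackage"
INPUT_SHAPE = (1, 3, 480, 255)


def sweep_compute_units(model_path=MODEL_PATH, input_shape=INPUT_SHAPE):
    """Run one stateful prediction per compute-unit setting.

    Returns a dict mapping the compute-unit name to None on success,
    or to the error text if loading or prediction raised.
    """
    import coremltools as ct  # imported lazily; prediction needs macOS 15+

    x = np.random.rand(*input_shape).astype(np.float16)
    results = {}
    for unit in (
        ct.ComputeUnit.CPU_ONLY,
        ct.ComputeUnit.CPU_AND_GPU,
        ct.ComputeUnit.CPU_AND_NE,
    ):
        try:
            model = ct.models.MLModel(model_path, compute_units=unit)
            state = model.make_state()  # fresh state buffer for "hidden_state"
            model.predict({"input_image": x}, state=state)
            results[unit.name] = None
        except Exception as exc:  # record the failure instead of stopping the sweep
            results[unit.name] = str(exc)
    return results
```

Calling `sweep_compute_units()` and printing the returned dict shows which compute-unit settings raise an error for a given width.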

Environment:

OS: macOS 15.2

Xcode: 16.3

Hardware: M4

Has anyone else encountered this 32-pixel alignment requirement for StateType tensors on ANE? Is this a known hardware constraint or a bug in the Core ML runtime?

Any insights or workarounds (other than manual padding) would be appreciated.
