I've been optimizing a similar STT-to-action pipeline on macOS 26 and found a few additional tricks beyond prepareToAnalyze that helped bring the finalization latency down: Use volatileResults aggressively for UI feedback, but trigger your downstream action (FoundationModel call) on the volatile transcript as soon as it stabilizes — don't wait for the finalized event. In my testing, the volatile transcript matches the final one ~95% of the time for short utterances. You can always correct if the final differs. Audio format matters more than you'd expect. If your input is coming through at 48kHz (common from ScreenCaptureKit or external mics), the internal resample to 16kHz adds measurable overhead. Setting up your AVAudioEngine tap at 16kHz mono from the start shaves ~200ms off the pipeline. The large variance Bersaelor observed with prepareToAnalyze (0.05s to 3s) likely correlates with whether the ANE was already warm. If other CoreML workloads are running concurrently (even system ones like Visual
Topic:
Media Technologies
SubTopic:
Audio
Tags: