SpeechTranscriber/SpeechAnalyzer being relatively slow compared to FoundationModel and TTS

So,

I've been wondering how fast an offline STT -> ML Prompt -> TTS round trip would be.

Interestingly, in many of my tests the SpeechTranscriber (STT) step takes the bulk of the time, compared to generating a FoundationModel response and synthesizing the audio with TTS.

E.g.

        InteractionStatistics:
        - listeningStarted:             21:24:23 4480 2423
        - timeTillFirstAboveNoiseFloor: 01.794
        - timeTillLastNoiseAboveFloor:  02.383
        - timeTillFirstSpeechDetected:  02.399
        - timeTillTranscriptFinalized:  04.510
        - timeTillFirstMLModelResponse: 04.938
        - timeTillMLModelResponse:      05.379
        - timeTillTTSStarted:           04.962
        - timeTillTTSFinished:          11.016
        - speechLength:                 06.054
        - timeToResponse:               02.578
        - transcript:                   This is a test.
        - mlModelResponse:              Sure! I'm ready to help with your test. What do you need help with?

Here, between my audio input ending and the text-to-speech starting to play (using AVSpeechUtterance), the total response time was 2.5s. Of that, the SpeechAnalyzer took 2.1s to finalize the transcript, while the FoundationModel took only 0.4s to respond (and TTS started playing almost instantly).

I'm already using reportingOptions: [.volatileResults, .fastResults], so it's probably as fast as it can be right now? I'm just surprised the STT takes so much longer than the other parts (they're all Core ML based, aren't they?)
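For reference, here is roughly how my transcriber is wired up. This is a sketch from memory, not a verified snippet — the exact initializer labels and result-property names may differ slightly between SDK seeds, so check them against the current Speech framework headers:

```swift
import Speech
import AVFoundation

// Transcriber module with the reporting options mentioned above.
let transcriber = SpeechTranscriber(
    locale: Locale.current,
    transcriptionOptions: [],
    reportingOptions: [.volatileResults, .fastResults],
    attributeOptions: []
)
let analyzer = SpeechAnalyzer(modules: [transcriber])

// Audio is fed through an async stream of AnalyzerInput buffers.
let (inputSequence, inputBuilder) = AsyncStream.makeStream(of: AnalyzerInput.self)
try await analyzer.start(inputSequence: inputSequence)

// Elsewhere, from the audio engine tap:
// inputBuilder.yield(AnalyzerInput(buffer: pcmBuffer))

// Volatile results stream in first; the finalized result follows.
for try await result in transcriber.results {
    print(result.isFinal ? "final:" : "volatile:", result.text)
}
```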

Accepted Answer

We've added some advice on improving performance to our documentation, at https://developer.apple.com/documentation/speech/speechanalyzer#Improve-responsiveness.

The prepareToAnalyze method may be useful to preheat the analyzer and get the transcription started a bit sooner.
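If I understand the docs correctly, the pre-warming call looks something like the following — treat the exact signature (and the `bestAvailableAudioFormat` helper) as an assumption to verify against the SDK:

```swift
// Pre-warm the analyzer so model loading happens before audio arrives.
// Passing the format you intend to feed lets it set up conversion early.
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
try await analyzer.prepareToAnalyze(in: format)
```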

Ah, nice, let's see. First, a baseline without prepareToAnalyze:

The KPI I'm interested in is the time between the last audio above the noise floor and the final transcript (i.e. between the user finishing speaking and the transcription being ready to trigger actions): n: 11, avg: 2.2s, Var: 0.75

Then, with prepareToAnalyze called: n: 11, avg: 1.45s, Var: 1.305 (the delay varied greatly, between 0.05s and 3s)

So yeah, based on this small sample, preparing did seem to decrease the delay.

I've been optimizing a similar STT-to-action pipeline on macOS 26 and found a few additional tricks beyond prepareToAnalyze that helped bring the finalization latency down:

  1. Use volatileResults aggressively for UI feedback, but trigger your downstream action (FoundationModel call) on the volatile transcript as soon as it stabilizes — don't wait for the finalized event. In my testing, the volatile transcript matches the final one ~95% of the time for short utterances. You can always correct if the final differs.

  2. Audio format matters more than you'd expect. If your input is coming through at 48kHz (common from ScreenCaptureKit or external mics), the internal resample to 16kHz adds measurable overhead. Setting up your AVAudioEngine tap at 16kHz mono from the start shaves ~200ms off the pipeline.

  3. The large variance Bersaelor observed with prepareToAnalyze (0.05s to 3s) likely correlates with whether the ANE was already warm. If other CoreML workloads are running concurrently (even system ones like Visual Intelligence), the first inference after a cold ANE is significantly slower. Keeping a lightweight keep-alive inference running in the background can help, though it's a tradeoff with power consumption.

  4. For the specific use case of voice-triggered actions, I found that monitoring the noise floor drop (timeTillLastNoiseAboveFloor) and immediately calling prepareToAnalyze at that moment — rather than at session start — gives more consistent results because the analyzer context is fresher when the actual finalization happens.
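Points 2 and 4 can be combined in the tap itself: convert the hardware-rate buffers down to 16 kHz mono with an AVAudioConverter (input taps must use the hardware format, so you can't just request 16 kHz from the tap), and compute a cheap RMS level on each buffer so you can react the moment input drops below the noise floor. A minimal sketch — `noiseFloor` and `onSilence` are illustrative names I'm introducing, not framework API:

```swift
import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode
let hwFormat = input.outputFormat(forBus: 0)
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                 sampleRate: 16_000, channels: 1,
                                 interleaved: false)!
let converter = AVAudioConverter(from: hwFormat, to: targetFormat)!
let noiseFloor: Float = 0.01   // tune for your mic/environment
let onSilence: () -> Void = {
    // e.g. Task { try await analyzer.prepareToAnalyze(in: targetFormat) }
}

input.installTap(onBus: 0, bufferSize: 4096, format: hwFormat) { buffer, _ in
    // Downsample to 16 kHz mono.
    let capacity = AVAudioFrameCount(Double(buffer.frameLength)
                                     * targetFormat.sampleRate / hwFormat.sampleRate)
    guard let out = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                     frameCapacity: capacity) else { return }
    var consumed = false
    var err: NSError?
    converter.convert(to: out, error: &err) { _, status in
        // Hand the converter exactly one input buffer per tap callback.
        if consumed { status.pointee = .noDataNow; return nil }
        consumed = true
        status.pointee = .haveData
        return buffer
    }

    // Cheap RMS estimate on the mono output to detect the silence onset.
    let samples = UnsafeBufferPointer(start: out.floatChannelData![0],
                                      count: Int(out.frameLength))
    let rms = sqrt(samples.reduce(0) { $0 + $1 * $1 } / Float(max(out.frameLength, 1)))
    if rms < noiseFloor {
        onSilence()
    }
    // …then yield `out` to the analyzer's input stream.
}
```

A real implementation would debounce the silence trigger (one callback below the floor isn't "the user stopped speaking"), but the structure is the same.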
