SpeechTranscriber/SpeechAnalyzer being relatively slow compared to FoundationModel and TTS

So,

I've been wondering how fast an offline STT -> ML Prompt -> TTS roundtrip would be.

Interestingly, in many of my tests the SpeechTranscriber (STT) takes the bulk of the time, compared to generating a FoundationModel response and creating the audio using TTS.

E.g.

        InteractionStatistics:
        - listeningStarted:             21:24:23 4480 2423
        - timeTillFirstAboveNoiseFloor: 01.794
        - timeTillLastNoiseAboveFloor:  02.383
        - timeTillFirstSpeechDetected:  02.399
        - timeTillTranscriptFinalized:  04.510
        - timeTillFirstMLModelResponse: 04.938
        - timeTillMLModelResponse:      05.379
        - timeTillTTSStarted:           04.962
        - timeTillTTSFinished:          11.016
        - speechLength:                 06.054
        - timeToResponse:               02.578
        - transcript:                   This is a test.
        - mlModelResponse:              Sure! I'm ready to help with your test. What do you need help with?

Here, between my audio input ending and the text-to-speech starting to play (using AVSpeechUtterance), the total response time was 2.5s. Of that time, the SpeechAnalyzer took 2.1s to finalize the transcript, while FoundationModel only took 0.4s to respond (and TTS started playing nearly instantly).
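For context, the TTS side is just the standard AVSpeechSynthesizer flow; a minimal sketch (the utterance text and voice are placeholders, not my actual setup):

```swift
import AVFoundation

// Keep the synthesizer alive beyond the call site; deallocating it
// mid-utterance stops playback.
let synthesizer = AVSpeechSynthesizer()

let utterance = AVSpeechUtterance(string: "Sure! I'm ready to help with your test.")
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")

// Playback starts almost immediately, which matches the timings above.
synthesizer.speak(utterance)
```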

I'm already using reportingOptions: [.volatileResults, .fastResults], so it's probably as fast as it can get right now? I'm just surprised the STT takes so much longer than the other parts (they're all Core ML based, aren't they?)
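For reference, here's roughly how the transcriber is configured. This is a sketch of the new SpeechAnalyzer API, not verbatim from my project; the result-stream handling in particular is my understanding of the API and the audio-input plumbing is omitted:

```swift
import Speech

// Transcriber with volatile (partial) results plus fast (quicker,
// less accurate) results enabled.
let transcriber = SpeechTranscriber(
    locale: Locale(identifier: "en-US"),
    transcriptionOptions: [],
    reportingOptions: [.volatileResults, .fastResults],
    attributeOptions: []
)
let analyzer = SpeechAnalyzer(modules: [transcriber])

// Volatile partials and finalized results both arrive on the results stream.
Task {
    for try await result in transcriber.results where result.isFinal {
        print("final:", String(result.text.characters))
    }
}
```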

Answered by Engineer in 851595022

We've added some advice on improving performance to our documentation, at https://developer.apple.com/documentation/speech/speechanalyzer#Improve-responsiveness.

The prepareToAnalyze method may be useful to preheat the analyzer and get the transcription started a bit sooner.
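A sketch of that pre-warming step, assuming `prepareToAnalyze(in:)` and `bestAvailableAudioFormat(compatibleWith:)` as described in the SpeechAnalyzer documentation (call it while e.g. showing the mic UI, before any audio arrives):

```swift
import Speech

let transcriber = SpeechTranscriber(
    locale: Locale(identifier: "en-US"),
    transcriptionOptions: [],
    reportingOptions: [.volatileResults, .fastResults],
    attributeOptions: []
)
let analyzer = SpeechAnalyzer(modules: [transcriber])

// Pre-warm before audio starts flowing, so asset/model loading
// doesn't count against the perceived response time.
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
try await analyzer.prepareToAnalyze(in: format)
```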


Ah, nice, let's see. First, the baseline without prepareToAnalyze:

The KPI I'm interested in is the time between the last audio above the noise floor and the finalized transcript (i.e. between the user stopping speaking and the transcription being ready to trigger actions): n: 11, avg: 2.2s, Var: 0.75

Then, with prepareToAnalyze called: n: 11, avg: 1.45s, Var: 1.305 (the delay varied greatly, between 0.05s and 3s)

So yeah, based on this small sample, preparing did seem to decrease the delay.
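(For the curious: avg/Var above are plain sample mean and population variance over the per-run delays. A trivial helper, with made-up sample values rather than my real measurements:)

```swift
import Foundation

// Sample mean and population variance over measured delays, in seconds.
func stats(_ xs: [Double]) -> (avg: Double, variance: Double) {
    let n = Double(xs.count)
    let avg = xs.reduce(0, +) / n
    let variance = xs.map { ($0 - avg) * ($0 - avg) }.reduce(0, +) / n
    return (avg, variance)
}

// Hypothetical delays from 4 runs, not my real data:
let delays = [0.8, 1.2, 2.0, 1.6]
let result = stats(delays)
print(result.avg, result.variance)  // ≈ 1.4, ≈ 0.2
```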
