Are there any background processing restrictions for Audio background mode?

Hi,

I'd like to develop an iOS application that keeps the mic open for voice recording and processing even when the screen is off. I want to perform speech-to-text requests whenever voice is detected (using a voice activity detection library) and also send requests to the cloud based on what is spoken.

I've enabled the Audio background mode and preliminary testing seems to indicate that this is working. That is, I can press "record" in my app, switch to another app, then shut the screen off, and speak for several seconds before auto-stopping the recording and sending it to an SFSpeechRecognizer task, which appears to succeed.
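For reference, my recording setup is the standard AVAudioSession/AVAudioEngine arrangement; a rough sketch (the class name and details are mine):

    import AVFoundation

    // Rough sketch: configure the audio session for recording (the "audio"
    // UIBackgroundModes entry must also be present in Info.plist) and tap
    // the microphone with AVAudioEngine.
    final class MicRecorder {
        private let engine = AVAudioEngine()

        func start() throws {
            let session = AVAudioSession.sharedInstance()
            try session.setCategory(.playAndRecord, mode: .measurement)
            try session.setActive(true)

            let input = engine.inputNode
            let format = input.outputFormat(forBus: 0)
            input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
                // Feed PCM buffers to the VAD / speech recognition request here.
            }
            try engine.start()
        }

        func stop() {
            engine.inputNode.removeTap(onBus: 0)
            engine.stop()
            try? AVAudioSession.sharedInstance().setActive(false)
        }
    }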

However, I have read that this should not be supported, so before going further down this path I wanted to understand exactly what the processing limitations in this mode are. The documentation doesn't seem very clear to me.

Thanks,

-- B.

Replies

What sort of limitations are you concerned about?

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Hi Quinn,

CPU and network usage. I would like to at least:

  1. Continuously perform voice activity detection (this does seem to work with a basic VAD algo, sketched after this list, and I imagine streaming apps are doing more work decoding audio than this anyway).
  2. Send voice to a server for processing.
  3. Receive and store (with minimal processing) JSON responses.
  4. Play back synthesized voice.
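For reference, the basic VAD from item 1 is little more than an energy threshold over PCM buffers; a toy version (the threshold value is made up):

    import AVFoundation

    // Toy energy-threshold VAD: flag a buffer as "voice" when its RMS level
    // exceeds a fixed threshold. Real detectors add smoothing and hangover.
    func bufferContainsVoice(_ buffer: AVAudioPCMBuffer, threshold: Float = 0.02) -> Bool {
        guard let samples = buffer.floatChannelData?[0], buffer.frameLength > 0 else {
            return false
        }
        let n = Int(buffer.frameLength)
        var sum: Float = 0
        for i in 0..<n { sum += samples[i] * samples[i] }
        return sqrt(sum / Float(n)) > threshold
    }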

Ideally, rather than sending voice to the server, I'd like to perform Siri speech-to-text transcription and speech synthesis on the way back, allowing me to upload only text and receive text responses.

My understanding is there are some limitations on CPU usage for at least some of these cases. However, I imagine that audio streaming apps (YouTube, Spotify, etc.) must be doing a fair bit of decoding work themselves?

Thank you,

-- B.

  1. Continuously perform voice activity detection (this does seem to work with a basic VAD algo, and I imagine streaming apps are doing more work decoding audio than this anyway).

  2. Send voice to a server for processing.

  3. Receive and store (with minimal processing) JSON responses.

  4. Play back synthesized voice.

Items 1 and 4 are covered by the background mode itself.

Items 2 and 3 cover three separate topics:

  • Network

  • File system

  • CPU

Each of these has its own challenges.


On the network front, networking in the background is generally the same as networking in the foreground as long as your app doesn’t get suspended (and the active audio session, allowed for by the audio background mode, should prevent that). The one thing you have to watch out for is network interface transitions. For example, running in the background means you have more opportunity to experience a Wi-Fi to WWAN transition, or vice versa.

Our general advice on this front is that you not worry about specific network hardware but instead express your networking requirements in terms of the constrained and expensive flags. See WWDC 2019 Session 712 Advances in Networking, Part 1.
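Concretely, those flags show up per-session (and per-request) on URLSession and per-path on NWPathMonitor; a quick sketch:

    import Foundation
    import Network

    // Express requirements via the constrained/expensive flags rather than
    // checking for specific interfaces (Wi-Fi vs. WWAN).
    let config = URLSessionConfiguration.default
    config.allowsExpensiveNetworkAccess = false    // e.g. defer bulk uploads on expensive paths
    config.allowsConstrainedNetworkAccess = false  // respect Low Data Mode
    let session = URLSession(configuration: config) // use this session for requests

    // Observe path changes, including the Wi-Fi to WWAN transitions mentioned above.
    let monitor = NWPathMonitor()
    monitor.pathUpdateHandler = { path in
        print("expensive: \(path.isExpensive), constrained: \(path.isConstrained)")
    }
    monitor.start(queue: .main)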


On the file system front, the main thing to worry about is data protection. You have to make sure that the files you access have data protection set up such that you can access them while the device might be locked.
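In practice that means choosing a protection class at write time that stays readable after first unlock, for example:

    import Foundation

    // Files written with complete protection become unreadable while the
    // device is locked. For data the app must touch in the background,
    // pick a class such as "until first user authentication" instead.
    func saveResponse(_ data: Data, to url: URL) throws {
        try data.write(to: url, options: .completeFileProtectionUntilFirstUserAuthentication)
    }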


On the CPU front, yes, there are potential limits here. These are not well documented, with the only saving grace being that most people don’t hit them. If you’re only doing a limited amount of processing you should be fine.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Thanks, Quinn, that is incredibly helpful!

Re: item 1, I currently have my own VAD but am also thinking of just using Siri speech-to-text; I will test it out. CPU limits are definitely a concern, but I can test and see what happens. It becomes a CPU vs. network trade-off (the network path having a stable CPU cost). On-device voice transcription is highly desirable from an economic perspective, because doing this on the server is costly (not to mention the user's data-plan bandwidth caps).
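The on-device path I have in mind is SFSpeechRecognizer's requiresOnDeviceRecognition flag, something like this (assuming speech authorization has already been granted):

    import Speech

    // Transcribe a saved clip without the network round trip, where supported.
    func transcribeOnDevice(_ recordingURL: URL) {
        guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")) else {
            return
        }
        let request = SFSpeechURLRecognitionRequest(url: recordingURL)
        if recognizer.supportsOnDeviceRecognition {
            request.requiresOnDeviceRecognition = true  // keep audio off the server
        }
        // Keep a reference to the task in real code so it can be cancelled.
        _ = recognizer.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                print(result.bestTranscription.formattedString)
            }
        }
    }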

Did not know about constrained vs. expensive network flags. Very helpful.


Another related question (though maybe I should start a new thread for this?):

Background Bluetooth mode (separate but related project): apps can receive Bluetooth events in the background, but do similar constraints apply? That is, can I safely perform a REST API request and be confident that I will have time to process the response?

Specific use case:

  1. Receive an audio sample from a Bluetooth peripheral (neither headphones nor anything that can present itself as such).
  2. Upload audio to a voice-to-text API (or use Siri speech-to-text).
  3. Receive result of [2].
  4. Hit a REST service with text obtained from [2].
  5. Receive result of [4].
  6. Send result of [4] (just some text data) back to the peripheral.
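Roughly, steps 1 and 6 would be the usual Core Bluetooth delegate plumbing; a placeholder sketch (the class and characteristic names are mine):

    import CoreBluetooth

    // Placeholder sketch of the peripheral side: receive an audio chunk
    // (step 1) and write the final text back (step 6). Steps 2-5 happen
    // inside handleAudioChunk.
    final class PeripheralLink: NSObject, CBPeripheralDelegate {
        var replyCharacteristic: CBCharacteristic?

        func peripheral(_ peripheral: CBPeripheral,
                        didUpdateValueFor characteristic: CBCharacteristic,
                        error: Error?) {
            guard let audioChunk = characteristic.value else { return }
            handleAudioChunk(audioChunk)
        }

        func sendReply(_ text: String, to peripheral: CBPeripheral) {
            guard let characteristic = replyCharacteristic,
                  let data = text.data(using: .utf8) else { return }
            peripheral.writeValue(data, for: characteristic, type: .withResponse)
        }

        private func handleAudioChunk(_ data: Data) { /* upload / transcribe / REST */ }
    }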

apps can receive Bluetooth events in the background, but do similar constraints apply?

Yes.

In general, when the system resumes (or relaunches) your app in the background it takes out a do-not-suspend assertion on its behalf. If the relevant API has a completion handler, the system drops that assertion when you call the completion handler. If not, the system typically drops the assertion after a few seconds.

If you need more background time, you may be able to use a UIApplication background task to take out your own assertion. However, that API has strict limits. See UIApplication Background Task Notes.
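A typical sketch of that pattern (the task name is arbitrary):

    import UIKit

    // Take out your own do-not-suspend assertion around a network round trip.
    // If the expiration handler runs, you must release the assertion promptly.
    func performRequestWithBackgroundTask(_ request: URLRequest) {
        var taskID = UIBackgroundTaskIdentifier.invalid
        taskID = UIApplication.shared.beginBackgroundTask(withName: "upload") {
            // Out of time; release the assertion.
            UIApplication.shared.endBackgroundTask(taskID)
            taskID = .invalid
        }
        URLSession.shared.dataTask(with: request) { data, response, error in
            // Process the response, then release the assertion.
            UIApplication.shared.endBackgroundTask(taskID)
            taskID = .invalid
        }.resume()
    }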

The story with networking is the same here as it is elsewhere. Most networking APIs function fine in the background as long as your app isn’t suspended.

The primary exception to that rule is tasks in a URLSession background session, which are run by a system daemon on your behalf and thus don't care if your app gets suspended. However, the system treats such requests as discretionary, which often means there's significant latency before they run. So, if latency is important, a background session isn't really a great option.
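For completeness, here's what that looks like; the session identifier is a placeholder:

    import Foundation

    // Background session: a system daemon runs the transfer even if the app
    // is suspended or terminated, but the request may be deferred.
    func enqueueBackgroundUpload(_ request: URLRequest, fromFile fileURL: URL,
                                 delegate: URLSessionDelegate) {
        let config = URLSessionConfiguration.background(withIdentifier: "com.example.uploads")
        config.sessionSendsLaunchEvents = true  // relaunch the app for delegate callbacks
        let session = URLSession(configuration: config, delegate: delegate, delegateQueue: nil)
        session.uploadTask(with: request, fromFile: fileURL).resume()
    }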

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"