Speech

Question about Apple Vision Pro audio input sampling rate for research

I am a graduate student conducting research in speech/audio signal processing and multimodal interaction. Apple Vision Pro is widely recognized as a multimodal interactive system supporting voice, eye, and gesture inputs. However, I could not find detailed specifications or documentation about the audio input sampling rate used by the device’s built-in microphone array when capturing user audio. Specifically, I would like to understand: What is the default audio input sampling rate (e.g., 16 kHz, 44.1 kHz, 48 kHz, etc.) for the Vision Pro’s microphones? When developing with visionOS / AVAudioSession / AVAudioEngine, is there a documented or recommended sampling rate for audio capture? Are there any best practices or settings for enabling high-quality voice capture on Vision Pro (especially for voice research tasks)? For context, my work involves voice processing, analysis, and possibly on-device real-time speech recognition. Any pointers to relevant APIs, documentation or examples (especially regarding audio capture buffer size or available formats on visionOS) would be very helpful. Thank you in advance! Best regards.

Media Technologies Audio Speech AVAudioSession ReplayKit visionOS

0

218

Jan ’26

Is Speech framework allowed?

Hello, I want to use the Speech framework in my app. However, I found that if I want it to work offline, it must be downloaded separately on the device. Do I understand correctly that it is not allowed to use it in a Swift Student Challenge submission if English (as the speech language) must be downloaded by the tester on their device using the internet beforehand?

Community Swift Student Challenge Swift Student Challenge Frameworks Speech

1

2

501

Feb ’26

AVAudioEngine fails to start during FaceTime call (error 2003329396)

Is it possible to perform speech-to-text using AVAudioEngine to capture microphone input while being on a FaceTime call at the same time? I tried implementing this, but whenever I attempt to start the AVAudioEngine while a FaceTime call is active, I get the following error: “The operation couldn’t be completed. (OSStatus error 2003329396)” I assume this might be due to microphone resource restrictions during FaceTime, but I’d like to confirm whether this limitation is at the system level or if there’s any possible workaround or entitlement that allows concurrent microphone access. Has anyone encountered this issue or found a solution?

Media Technologies General Speech AVAudioEngine

2

1

1.1k

Mar ’26

SpeechAnalyzer.start(inputSequence:) fails with _GenericObjCError nilError, while the same WAV succeeds with start(inputAudioFile:)

I'm trying to use the new Speech framework for streaming transcription on macOS 26.3, and I can reproduce a failure with SpeechAnalyzer.start(inputSequence:). What is working: SpeechAnalyzer + SpeechTranscriber offline path using start(inputAudioFile:finishAfterFile:) same Spanish WAV file transcribes successfully and returns a coherent final result What is not working: SpeechAnalyzer + SpeechTranscriber stream path using start(inputSequence:) same WAV, replayed as AnalyzerInput(buffer:bufferStartTime:) fails once replay starts with: _GenericObjCError domain=Foundation._GenericObjCError code=0 detail=nilError I also tried: DictationTranscriber instead of SpeechTranscriber no realtime pacing during replay Both still fail in stream mode with the same error. So this does not currently look like a ScreenCaptureKit issue or a Python integration issue. I reduced it to a pure Swift CLI repro. Environment: macOS 26.3 (25D122) Xcode 26.3 Swift 6.2.4 Apple Silicon Mac Has anyone here gotten SpeechAnalyzer.start(inputSequence:) working reliably on macOS 26.x? If so, I'd be interested in any workaround or any detail that differs from the obvious setup: prepareToAnalyze(in:) bestAvailableAudioFormat(...) AnalyzerInput(buffer:bufferStartTime:) replaying a known-good WAV in chunks I already filed Feedback Assistant: FB22149971

Media Technologies Audio Swift Speech Audio

1

0

526

Mar ’26

Building Real-Time Voice Input on macOS 26 with SpeechAnalyzer + ScreenCaptureKit

We built an open-source macOS menu bar app that turns speech into text and pastes it into the active app — using SpeechAnalyzer for on-device transcription, ScreenCaptureKit + Vision for screen-aware context, and FluidAudio for speaker diarization in meeting mode. Here's what we learned shipping it on macOS 26. GitHub: github.com/Marvinngg/ambient-voice Architecture The app has two modes: hotkey dictation (press to talk, release to inject) and meeting recording (continuous transcription with a floating panel). Dictation Mode Audio capture uses AVCaptureSession (more on why below). The captured audio feeds into SpeechAnalyzer via an AsyncStream: let transcriber = SpeechTranscriber( locale: locale, transcriptionOptions: [], reportingOptions: [.volatileResults, .alternativeTranscriptions], attributeOptions: [.audioTimeRange, .transcriptionConfidence] ) let analyzer = SpeechAnalyzer(modules: [transcriber]) let (inputSequence, inputBuilder) = AsyncStream.makeStream() try await analyzer.start(inputSequence: inputSequence) While recording, we capture a screenshot of the focused window using ScreenCaptureKit, run Vision OCR (VNRecognizeTextRequest), extract keywords, and inject them into SpeechAnalyzer as contextual bias: let context = AnalysisContext() context.contextualStrings[.general] = ocrKeywords try await analyzer.setContext(context) This improves accuracy for technical terms and proper nouns visible on screen. If your screen shows "SpeechAnalyzer", saying it out loud is more likely to be transcribed correctly. After transcription, an optional L2 step sends the text through a local LLM (ollama) for spoken-to-written cleanup, then CGEvent simulates Cmd+V to paste into the active app. Meeting Mode Meeting mode forks the same audio stream to two consumers: SpeechAnalyzer — real-time streaming transcription, displayed in a floating NSPanel FluidAudio buffer — accumulates 16kHz Float32 mono samples for batch speaker diarization after recording stops When the user ends the meeting, FluidAudio's performCompleteDiarization() runs on the accumulated audio. We align transcription segments with speaker segments using audioTimeRange overlap matching — each transcription segment gets assigned the speaker ID with the most time overlap. Results export to Markdown. Pitfalls We Hit on macOS 26 1. AVAudioEngine installTap doesn't fire with Bluetooth devices We started with AVAudioEngine.inputNode.installTap() for audio capture. It worked fine with built-in mics but the tap callback never fired with Bluetooth devices (tested with vivo TWS 4 Hi-Fi). Fix: switched to AVCaptureSession. The delegate callback captureOutput(_:didOutput:from:) fires reliably regardless of audio device. The tradeoff is you get CMSampleBuffer instead of AVAudioPCMBuffer, so you need a conversion step. 2. NSEvent addGlobalMonitorForEvents crashes Our global hotkey listener used NSEvent.addGlobalMonitorForEvents. On macOS 26, this crashes with a Bus error inside GlobalObserverHandler — appears to be a Swift actor runtime issue. Fix: switched to CGEventTap. Works reliably, but the callback runs on a CFRunLoop context, which Swift doesn't recognize as MainActor. 3. CGEventTap callbacks aren't on MainActor If your CGEventTap callback touches any @MainActor state, you'll get concurrency violations. The callback runs on whatever thread owns the CFRunLoop. Fix: bridge with DispatchQueue.main.async {} inside the tap callback before touching any MainActor state. 4. CGPreflightScreenCaptureAccess doesn't request permission We used CGPreflightScreenCaptureAccess() as a guard before calling ScreenCaptureKit. If it returned false, we'd bail out. The problem: this function only checks — it never triggers macOS to add your app to the Screen Recording permission list. Chicken-and-egg: you can't get permission because you never ask for it. Fix: call CGRequestScreenCaptureAccess() at app startup. This adds your app to System Settings → Screen Recording. Then let ScreenCaptureKit calls proceed without the preflight guard — SCShareableContent will also trigger the permission prompt on first use. 5. Ad-hoc signing breaks TCC permissions on every rebuild During development, codesign --sign - (ad-hoc) generates a different code directory hash on every build. macOS TCC tracks permissions by this hash, so every rebuild = new app identity = all permissions reset. Fix: sign with a stable certificate. If you have an Apple Development certificate, use that. The TeamIdentifier stays constant across rebuilds, so TCC permissions persist. We also discovered that launching via open WE.app (LaunchServices) instead of directly executing the binary is required — otherwise macOS attributes TCC permissions to Terminal, not your app. Benchmarks We ran end-to-end benchmarks on public datasets (Mac Mini M4 16GB, macOS 26): Transcription (SpeechAnalyzer, AliMeeting Chinese): • Near-field CER 34% (excluding outliers ~25%) • Far-field CER 40% (single channel, no beamforming, >30% overlap) • Processing speed 74-89x real-time Speaker diarization (FluidAudio offline): • AMI English 16 meetings: avg DER 23.2% (collar=0.25s, ignoreOverlap=True) • AliMeeting Chinese 8 meetings: DER 48.5% (including overlap regions) • Memory: RSS ~500MB, peak 730-930MB Full evaluation methodology, scripts, and raw results are in the repo. Open Source The project is MIT licensed: github.com/Marvinngg/ambient-voice It includes the macOS client (Swift 6.2, SPM), server-side distillation/training scripts (Python), and a complete evaluation framework with reproducible benchmarks. Feedback and contributions welcome.

Machine Learning & AI Apple Intelligence macOS Speech ScreenCaptureKit Apple Intelligence

0

729

Mar ’26

Building Real-Time Voice Input on macOS 26 with SpeechAnalyzer + ScreenCaptureKit

We built an open-source macOS menu bar app that turns speech into text and pastes it into the active app — using SpeechAnalyzer for on-device transcription, ScreenCaptureKit + Vision for screen-aware context, and FluidAudio for speaker diarization in meeting mode. Here's what we learned shipping it on macOS 26. GitHub: github.com/Marvinngg/ambient-voice Architecture The app has two modes: hotkey dictation (press to talk, release to inject) and meeting recording (continuous transcription with a floating panel). Dictation Mode Audio capture uses AVCaptureSession (more on why below). The captured audio feeds into SpeechAnalyzer via an AsyncStream: let transcriber = SpeechTranscriber( locale: locale, transcriptionOptions: [], reportingOptions: [.volatileResults, .alternativeTranscriptions], attributeOptions: [.audioTimeRange, .transcriptionConfidence] ) let analyzer = SpeechAnalyzer(modules: [transcriber]) let (inputSequence, inputBuilder) = AsyncStream.makeStream() try await analyzer.start(inputSequence: inputSequence) While recording, we capture a screenshot of the focused window using ScreenCaptureKit, run Vision OCR (VNRecognizeTextRequest), extract keywords, and inject them into SpeechAnalyzer as contextual bias: let context = AnalysisContext() context.contextualStrings[.general] = ocrKeywords try await analyzer.setContext(context) This improves accuracy for technical terms and proper nouns visible on screen. If your screen shows "SpeechAnalyzer", saying it out loud is more likely to be transcribed correctly. After transcription, an optional L2 step sends the text through a local LLM (ollama) for spoken-to-written cleanup, then CGEvent simulates Cmd+V to paste into the active app. Meeting Mode Meeting mode forks the same audio stream to two consumers: SpeechAnalyzer — real-time streaming transcription, displayed in a floating NSPanel FluidAudio buffer — accumulates 16kHz Float32 mono samples for batch speaker diarization after recording stops When the user ends the meeting, FluidAudio's performCompleteDiarization() runs on the accumulated audio. We align transcription segments with speaker segments using audioTimeRange overlap matching — each transcription segment gets assigned the speaker ID with the most time overlap. Results export to Markdown. Pitfalls We Hit on macOS 26 1. AVAudioEngine installTap doesn't fire with Bluetooth devices We started with AVAudioEngine.inputNode.installTap() for audio capture. It worked fine with built-in mics but the tap callback never fired with Bluetooth devices (tested with vivo TWS 4 Hi-Fi). Fix: switched to AVCaptureSession. The delegate callback captureOutput(_:didOutput:from:) fires reliably regardless of audio device. The tradeoff is you get CMSampleBuffer instead of AVAudioPCMBuffer, so you need a conversion step. 2. NSEvent addGlobalMonitorForEvents crashes Our global hotkey listener used NSEvent.addGlobalMonitorForEvents. On macOS 26, this crashes with a Bus error inside GlobalObserverHandler — appears to be a Swift actor runtime issue. Fix: switched to CGEventTap. Works reliably, but the callback runs on a CFRunLoop context, which Swift doesn't recognize as MainActor. 3. CGEventTap callbacks aren't on MainActor If your CGEventTap callback touches any @MainActor state, you'll get concurrency violations. The callback runs on whatever thread owns the CFRunLoop. Fix: bridge with DispatchQueue.main.async {} inside the tap callback before touching any MainActor state. 4. CGPreflightScreenCaptureAccess doesn't request permission We used CGPreflightScreenCaptureAccess() as a guard before calling ScreenCaptureKit. If it returned false, we'd bail out. The problem: this function only checks — it never triggers macOS to add your app to the Screen Recording permission list. Chicken-and-egg: you can't get permission because you never ask for it. Fix: call CGRequestScreenCaptureAccess() at app startup. This adds your app to System Settings → Screen Recording. Then let ScreenCaptureKit calls proceed without the preflight guard — SCShareableContent will also trigger the permission prompt on first use. 5. Ad-hoc signing breaks TCC permissions on every rebuild During development, codesign --sign - (ad-hoc) generates a different code directory hash on every build. macOS TCC tracks permissions by this hash, so every rebuild = new app identity = all permissions reset. Fix: sign with a stable certificate. If you have an Apple Development certificate, use that. The TeamIdentifier stays constant across rebuilds, so TCC permissions persist. We also discovered that launching via open WE.app (LaunchServices) instead of directly executing the binary is required — otherwise macOS attributes TCC permissions to Terminal, not your app. Benchmarks We ran end-to-end benchmarks on public datasets (Mac Mini M4 16GB, macOS 26): Transcription (SpeechAnalyzer, AliMeeting Chinese): • Near-field CER 34% (excluding outliers ~25%) • Far-field CER 40% (single channel, no beamforming, >30% overlap) • Processing speed 74-89x real-time Speaker diarization (FluidAudio offline): • AMI English 16 meetings: avg DER 23.2% (collar=0.25s, ignoreOverlap=True) • AliMeeting Chinese 8 meetings: DER 48.5% (including overlap regions) • Memory: RSS ~500MB, peak 730-930MB Full evaluation methodology, scripts, and raw results are in the repo. Open Source The project is MIT licensed: github.com/Marvinngg/ambient-voice It includes the macOS client (Swift 6.2, SPM), server-side distillation/training scripts (Python), and a complete evaluation framework with reproducible benchmarks. Feedback and contributions welcome.

Machine Learning & AI General macOS Speech Core ML ScreenCaptureKit

0

616

Mar ’26

CarPlay: Voice Conversational Entitlement Details

With the Voice Conversational Entitlement, can a CarPlay app establish a turn-based audio interface that operates in two modes: Speaking mode: Audio Session configured for playback Buffered audio Listening mode: Switch Audio Session to .record or .playAndRecord Activate SFSpeechRecognizer And continue toggling back and forth. The app should listen for responses to questions or other audio cues, and assuming those answers are correct (based on analysis of results from SFSpeechRecognizer), continue this pattern of mode 1 and 2 alternating. This appears to be a valid use of this entitlement. Does this also require the Audio App Entitlement, or is the Voice Conversational Entitlement sufficient? Are there other obstacles to this type of app that I'm not seeing? Or perhaps this is technically possible, but unlikely to pass app store review?

Media Technologies Audio CarPlay Speech

0

311

Apr ’26

I have the same, iOS 26.3.0

open FB22712056

Media Technologies General Speech Debugging

1

0

231

3w

iOS feasibility question: user-initiated wake-word detection during active session

Hi all, Technical architecture question for those experienced with iOS background audio / microphone constraints. I’m exploring an app concept where: the user explicitly starts a temporary active session during that session, on-device wake-word / keyword detection runs locally no audio is stored or transmitted during passive monitoring monitoring stops when the user ends the session The intended UX is that the user may then lock the phone or place it away while the active session remains in progress. Question: Is there any App Store-compliant architecture that would allow local keyword / wake-word detection to continue while the device is locked or the app is backgrounded during that active session? Or would iOS lifecycle / background execution rules make this infeasible for custom wake-word detection? Interested in practical experience around: AVAudioSession background audio modes on-device speech processing App Review acceptability Thanks in advance.

App & System Services Processes & Concurrency iOS Speech Background Tasks AVFoundation

0

311

2w

iOS 26.4 — How to return from main app to host app after a keyboard-extension dictation round-trip, without private APIs?

I'm building a custom keyboard extension that offers voice dictation. Because keyboard extensions are constrained (memory cap ~30–48 MB, restricted audio session access), I delegate recording to my container app: User in a host app (e.g., Safari) taps the mic in my keyboard extension. The keyboard calls extensionContext.open(URL("myapp://dictation")) to launch the container app. The container app records audio via AVAudioEngine + SFSpeechRecognizer, writes the final transcript to the App Group, and signals completion via a Darwin notification. 4. The user is expected to be returned to the original host app (Safari) automatically so they can keep typing. The problem (step 4): On iOS 26.4 I can no longer identify which app was the host. Every previously-known path returns nil for the keyboard extension's host: parent.value(forKey: "_hostBundleID") → returns the literal string parent.value(forKey: "_hostApplicationBundleIdentifier") → returns NSNull xpc_connection_copy_bundle_id on the underlying XPC connection (via PKService.defaultService.personalities[…]) → returns NULL NSXPCConnection.processBundleIdentifier on extensionContext._extensionHostProxy._connection → returns nil proc_pidpath(hostPID, …) → EPERM from the keyboard sandbox LSApplicationWorkspace.frontmostApplication → selector unavailable from the extension RBSProcessHandle.handleForIdentifier:error: → returns an RBSServiceErrorDomain error Without the host's bundle ID, the container app has no way to call LSApplicationWorkspace.openApplicationWithBundleID: (the technique that worked on iOS 25 and earlier). UIApplication.suspend() correctly sends the container to background, but iOS treats us as a "fresh launch" — it returns the user to the Home Screen instead of Safari, because the container app was launched by an extension, not directly by Safari. KeyboardKit's maintainer reached the same conclusion (issue #1014) and shipped 10.4 without the feature. My questions: Is there a public, App-Store-safe API in iOS 26+ for a custom keyboard extension to identify its host application, or for the container app (launched via the extension's openURL) to identify which app initially hosted the extension that opened it? UIOpenURLContext.options.sourceApplication reports the extension's own container, not the actual host. 2. Is there a public mechanism for "return to source app" when the container app was launched by an extension's openURL? Equivalent to the ← Source affordance iOS shows for normal inter-app openURL, but triggered programmatically by the launched app. 3. Some popular keyboards (e.g., 微信输入法 / WeChat Keyboard) still appear to round-trip through their container app on iOS 26.4 and return the user to the original host — including the iOS ← WeChat back affordance in the host's status bar afterward. What's the recommended approach to achieve this? If it requires a specific scene-activation flow, NSUserActivity pattern, or extension-context configuration, please point at the relevant docs. 4. If there is no public path today, is FB22247647 (or a related radar) the right place to track this? Should developers in this position migrate to in-extension audio capture (which has its own significant constraints in keyboard extensions)? I'd much rather not rely on private APIs. Concrete guidance — or even an acknowledgment of which direction Apple intends — would help thousands of custom-keyboard developers who currently have a degraded voice-input experience on iOS 26.4+. Tested on iPhone 12 Pro Max running iOS 26.4.2 (build 23E261), Xcode 26.x, Swift 5. Thanks!

App & System Services General Frameworks App Review Speech

0

171

1w

Post

Replies

Boosts

Views

Activity

Speech

Posts under Speech tag

Post

Replies

Boosts

Views

Activity