I am a graduate student conducting research in speech/audio signal processing and multimodal interaction.
Apple Vision Pro is widely recognized as a multimodal interactive system supporting voice, eye, and gesture inputs. However, I could not find detailed specifications or documentation about the audio input sampling rate used by the device’s built-in microphone array when capturing user audio.
Specifically, I would like to understand:
- What is the default audio input sampling rate (e.g., 16 kHz, 44.1 kHz, 48 kHz, etc.) for the Vision Pro’s microphones?
- When developing with visionOS / AVAudioSession / AVAudioEngine, is there a documented or recommended sampling rate for audio capture?
- Are there any best practices or settings for enabling high-quality voice capture on Vision Pro (especially for voice research tasks)?
For context, my work involves voice processing, analysis, and possibly on-device real-time speech recognition. Any pointers to relevant APIs, documentation or examples (especially regarding audio capture buffer size or available formats on visionOS) would be very helpful.
Thank you in advance! Best regards.