This post is from the WWDC26 Audio Q&A.
I am designing an educational app. I notice that the current system text-to-speech (such as AVSpeechSynthesizer) often sounds too mechanical: the time intervals between characters are strictly equal, so it lacks natural human prosody, phrasing, and warmth, which is a huge dealbreaker for sensitive users like children.
How can we customize text-to-speech to break this uniform word spacing, manage prosody dynamically, and make the AI voice sound emotionally engaging and natural rather than like a cold robot?
I really want to create an elegant listening experience that feels like a real human telling a story, not just a machine reading aloud.