VibeVoice: A Frontier Open-Source Text-to-Speech Model

VibeVoice is an innovative framework designed to generate expressive, long-form, multi-speaker conversational audio from text, ideal for formats like podcasts.

---

Key Features

- Expressive & Multi-Speaker: Supports up to 4 distinct speakers in a single synthesis, surpassing typical 1-2 speaker limitations.
- Long-Form Speech: Can synthesize speech sequences as long as 90 minutes, maintaining quality and speaker consistency.
- Context-Aware Expression: Uses a Large Language Model (LLM) to capture fine-grained textual context and dialogue flow for natural expression, including emotion and singing.
- High Fidelity & Efficiency: Employs continuous speech tokenizers (Acoustic and Semantic) working at an ultra-low frame rate of 7.5 Hz, balancing audio quality and computational efficiency.
- Next-Token Diffusion Framework: Combines LLM-based context understanding with a diffusion head to generate detailed acoustic features.

---

Technical Innovations

- Continuous Speech Tokenizers: Preserve audio fidelity while efficiently handling long audio sequences.
- Multi-Speaker & Turn-Taking: Manages natural conversational turn-taking and maintains speaker identity over extended speech durations.
- Cross-Lingual Capabilities: Supports speech synthesis across languages, demonstrated by Mandarin-English and English-Mandarin examples.

---

Demonstrations & Use Cases

Context-Aware Expression

- Spontaneous Emotion: Realistic emotional speech in conversational context.
- Spontaneous Singing: Speech synthesis that includes singing segments.

Podcast with Background Music

- Multi-speaker podcast-style audio with integrated background music.

Cross-Lingual

- Speech synthesized from Mandarin text to English audio.
- Speech synthesized from English text to Mandarin audio.

Long Conversational Speech

- Examples include 45-minute and 100-minute multi-party conversations demonstrating stable speaker consistency and long-duration synthesis.
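
To make the 7.5 Hz frame rate mentioned under Key Features concrete, the arithmetic below estimates how many tokenizer frames a given audio duration would produce. This is only a back-of-the-envelope sketch: the function name `frames_for_duration` is hypothetical, and the 50 Hz comparison rate is an illustrative assumption, not a figure from the VibeVoice materials.

```python
# Back-of-the-envelope frame counts for a speech tokenizer.
# FRAME_RATE_HZ comes from the text above; COMPARISON_RATE_HZ is an
# assumed value chosen only to illustrate the effect of a higher rate.

FRAME_RATE_HZ = 7.5        # tokenizer frame rate stated in the text
COMPARISON_RATE_HZ = 50.0  # hypothetical higher-rate tokenizer (assumption)


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `minutes` of audio."""
    seconds = minutes * 60
    return round(seconds * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (1, 45, 90):
        low = frames_for_duration(minutes)
        high = frames_for_duration(minutes, COMPARISON_RATE_HZ)
        print(f"{minutes:>3} min: {low:>6} frames at 7.5 Hz, {high:>7} frames at 50 Hz")
```

Under these assumptions, 90 minutes of audio corresponds to 40,500 frames at 7.5 Hz, compared with 270,000 at the assumed 50 Hz rate. That difference suggests why a low frame rate helps keep long sequences tractable.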
---

Resources

- Research Report: arXiv Paper
- Source Code: GitHub Repository
- Model Collection: Hugging Face
- Demo: Live Demo

---

VibeVoice represents a major advancement in TTS technology, enabling natural, long, expressive conversations among multiple speakers with superior audio quality and computational efficiency.