Introduction to VoXtream
VoXtream is an open-source text-to-speech (TTS) model developed by KTH’s Speech, Music and Hearing group. It targets a common problem in real-time applications such as live dubbing and simultaneous translation: latency. Traditional TTS systems often wait for a full block of text before they start speaking, which delays the first audio. VoXtream instead begins speaking from the very first word.
Understanding Full-Stream TTS
Full-stream TTS goes a step beyond traditional output streaming. Instead of waiting for a complete sentence, it consumes text as it arrives and generates audio in real time. Audio frames are generated continuously, which removes the need for input-side buffering. The goal is the earliest possible onset of speech.
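The difference in onset can be illustrated with a toy sketch. Nothing here is VoXtream code: the two synthesizer functions and the `audio<...>` placeholder frames are hypothetical stand-ins, and the only thing being measured is how many input words each approach consumes before it can emit its first frame.

```python
from typing import Callable, Iterable, Iterator, List


def output_streaming(words: Iterable[str]) -> Iterator[str]:
    """Toy output-streaming synthesizer: buffers all input, then streams audio."""
    buffered: List[str] = list(words)  # blocks until the input stream ends
    for word in buffered:
        yield f"audio<{word}>"  # placeholder for real audio frames


def full_streaming(words: Iterable[str]) -> Iterator[str]:
    """Toy full-stream synthesizer: emits audio as each word arrives."""
    for word in words:
        yield f"audio<{word}>"  # placeholder for real audio frames


def words_consumed_before_first_frame(
    synth: Callable[[Iterable[str]], Iterator[str]], text: str
) -> int:
    """Count how many input words a synthesizer reads before its first frame."""
    consumed = 0

    def source() -> Iterator[str]:
        nonlocal consumed
        for word in text.split():
            consumed += 1
            yield word

    next(synth(source()))  # pull only the first output frame
    return consumed
```

For a four-word input, the buffering variant reads all four words before its first frame, while the full-stream variant reads one.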
How VoXtream Works
VoXtream’s immediate speech output comes from a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). The system can start generating audio as soon as the first word enters the buffer, rather than waiting for a look-ahead window to fill, which avoids the delay typically associated with fixed look-ahead windows.
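A minimal sketch of the idea, assuming the look-ahead is simply "whatever future phonemes are already buffered, up to a cap". The function names are hypothetical, the cap of 10 is the figure this article gives for the PT, and the fixed-window variant is included only for contrast (it ignores end-of-input handling).

```python
from typing import List, Optional

MAX_LOOKAHEAD = 10  # upper bound on look-ahead phonemes quoted in this article


def dynamic_window(
    buffer: List[str], pos: int, max_lookahead: int = MAX_LOOKAHEAD
) -> List[str]:
    """Current phoneme plus any already-buffered future phonemes, capped."""
    return buffer[pos : pos + 1 + max_lookahead]


def fixed_window(
    buffer: List[str], pos: int, lookahead: int = MAX_LOOKAHEAD
) -> Optional[List[str]]:
    """Return None (meaning: keep waiting) until the full window is buffered."""
    if len(buffer) < pos + 1 + lookahead:
        return None
    return buffer[pos : pos + 1 + lookahead]
```

With only the four phonemes of a first word in the buffer, `dynamic_window` already returns usable context, whereas `fixed_window` returns `None` until eleven phonemes are available.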
Technical Architecture
VoXtream’s architecture is a single, fully autoregressive (AR) pipeline built from three transformers:
- Phoneme Transformer (PT): A decoder-only, incremental transformer with a dynamic look-ahead of up to 10 phonemes. Text is converted to phonemes at the word level.
- Temporal Transformer (TT): An AR predictor over semantic tokens plus a duration token, which provides the phoneme-to-audio alignment.
- Depth Transformer (DT): A generator for the remaining acoustic codebooks, conditioned on TT outputs and a speaker embedding for zero-shot voice prompting.
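The data flow between the three components can be sketched as follows. This is a toy under loud assumptions, not the real model: every function is a hypothetical stand-in that produces placeholder integers, the number of acoustic codebooks is invented, the speaker embedding is reduced to a single integer, and the duration token is modelled as "how many phonemes this frame advances", which is only one possible reading of the alignment mechanism.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_ACOUSTIC_CODEBOOKS = 3  # hypothetical; the article does not state the count


@dataclass
class Frame:
    semantic: int               # semantic token predicted by the TT stand-in
    advance: int                # duration token (assumed: phonemes advanced)
    acoustic: Tuple[int, ...]   # remaining codebooks from the DT stand-in


def phoneme_transformer(phonemes: List[str]) -> List[int]:
    """Stand-in for the PT: one placeholder feature per phoneme."""
    return [sum(map(ord, p)) % 100 for p in phonemes]


def temporal_transformer(feature: int, history: List[Frame]) -> Tuple[int, int]:
    """Stand-in for the TT: depends on the previous frame (autoregressive)."""
    previous = history[-1].semantic if history else 0
    return (feature + previous) % 100, 1  # always advance one phoneme here


def depth_transformer(semantic: int, speaker: int) -> Tuple[int, ...]:
    """Stand-in for the DT: fills the remaining codebooks for one frame."""
    return tuple(
        (semantic + speaker + k) % 100
        for k in range(1, NUM_ACOUSTIC_CODEBOOKS + 1)
    )


def synthesize(phonemes: List[str], speaker: int) -> List[Frame]:
    """Run the toy PT -> TT -> DT loop, one frame at a time."""
    features = phoneme_transformer(phonemes)
    frames: List[Frame] = []
    pos = 0
    while pos < len(features):
        semantic, advance = temporal_transformer(features[pos], frames)
        acoustic = depth_transformer(semantic, speaker)
        frames.append(Frame(semantic, advance, acoustic))
        pos += advance
    return frames
```

The point of the sketch is the ordering: each frame's semantic and duration tokens are produced first, and the acoustic codebooks for that frame are filled in afterwards, so a frame can be emitted without waiting for later text.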
Performance Metrics
On an A100 GPU, the compiled model achieves a first-packet latency (FPL) of 102 ms and a real-time factor (RTF) of 0.17. On an RTX 3090, the FPL is 123 ms with an RTF of 0.19. An RTF below 1 means audio is generated faster than it plays back, so both configurations are fast enough for real-time applications.
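To make the RTF figures concrete, the sketch below uses the usual definition of RTF as synthesis time divided by audio duration. The helper names are hypothetical; the RTF values are the ones quoted above.

```python
REPORTED_RTF = {"A100": 0.17, "RTX 3090": 0.19}  # values quoted in this article


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize a given duration of audio."""
    return audio_seconds * rtf


for gpu, rtf in REPORTED_RTF.items():
    print(f"{gpu}: {speedup_over_real_time(rtf):.2f}x real time, "
          f"{synthesis_seconds(10.0, rtf):.1f} s to synthesize 10 s of audio")
```

An RTF of 0.17 corresponds to roughly 5.9 times real time, and 0.19 to roughly 5.3 times.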
Comparative Analysis
When evaluated against popular streaming TTS systems, VoXtream reports a word error rate (WER) of 3.24%, compared with 6.11% for CosyVoice2. In listener studies, participants preferred the naturalness of VoXtream’s output, although CosyVoice2 has an edge in speaker similarity. In compiled mode, VoXtream runs more than five times faster than real time.
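For readers unfamiliar with WER, it is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script behind the figures above; the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In TTS evaluation the hypothesis is typically the output of an ASR system run on the synthesized audio, so a lower WER indicates more intelligible speech.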
Data Utilization
VoXtream was trained on approximately 9,000 hours of speech, with around 4,500 hours each from Emilia and HiFiTTS-2. Data preparation included a diarization step to remove multi-speaker clips and filtering of transcripts using automatic speech recognition (ASR) to keep the training data clean.
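The two filtering steps can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the `Clip` fields, the idea of comparing the dataset transcript with an ASR transcript by word-sequence similarity, and the 0.9 threshold are hypothetical and are not taken from the VoXtream training recipe.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List

MIN_TRANSCRIPT_MATCH = 0.9  # hypothetical threshold, not from the source


@dataclass
class Clip:
    clip_id: str
    num_speakers: int    # assumed output of a diarization step
    transcript: str      # transcript shipped with the dataset
    asr_transcript: str  # assumed output of an ASR system on the audio


def transcript_match(a: str, b: str) -> float:
    """Similarity in [0, 1] between two transcripts, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def keep_clip(clip: Clip) -> bool:
    """Keep single-speaker clips whose transcript agrees with the ASR output."""
    if clip.num_speakers != 1:
        return False
    score = transcript_match(clip.transcript, clip.asr_transcript)
    return score >= MIN_TRANSCRIPT_MATCH


def filter_clips(clips: List[Clip]) -> List[Clip]:
    """Apply both filters to a list of clips."""
    return [clip for clip in clips if keep_clip(clip)]
```

A clip with two detected speakers is dropped regardless of its transcript, and a single-speaker clip is dropped when the ASR output disagrees with its transcript.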
Quality Metrics
The model is evaluated on several metrics, including WER, UTMOS (a predictor of Mean Opinion Score), and speaker similarity. An ablation study indicated that adding the CSM Depth Transformer and a speaker encoder improves speaker similarity without adversely affecting WER.
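Speaker similarity is commonly reported as the cosine similarity between speaker embeddings of the reference voice and the synthesized audio. The sketch below shows only that final computation on plain lists of floats; how the embeddings are extracted is outside its scope, and the article does not specify which similarity function was used, so treating it as cosine similarity is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Values closer to 1 mean the synthesized voice sits closer to the prompt speaker in embedding space.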
Positioning in the TTS Landscape
VoXtream’s primary contribution is its latency-focused AR design with duration-token alignment, which enables input-side streaming. The design is a trade-off: speaker similarity may be slightly lower, but FPL is much lower than with chunked non-autoregressive vocoders, which suits applications where time to first audio matters most.
Conclusion
VoXtream is a notable step forward for TTS in applications that require immediate audio output. Its fully autoregressive, input-streaming architecture and low first-packet latency make it a strong candidate for real-time use.
Frequently Asked Questions (FAQ)
- What is VoXtream? VoXtream is an open-source TTS model designed to start speaking as soon as the first word of text arrives, addressing latency in real-time applications.
- How does VoXtream differ from traditional TTS systems? Unlike traditional systems that wait for a chunk of text, VoXtream generates audio from the first word, significantly reducing delays.
- What are the key components of VoXtream’s architecture? VoXtream consists of three transformers: the Phoneme Transformer, Temporal Transformer, and Depth Transformer, each serving a unique function in audio generation.
- What performance metrics does VoXtream achieve? On an A100 GPU in compiled mode, VoXtream achieves a first-packet latency of 102 ms and an RTF of 0.17, which is more than five times faster than real time.
- How was VoXtream trained? It was trained on approximately 9,000 hours of speech from Emilia and HiFiTTS-2, after diarization to remove multi-speaker clips and ASR-based transcript filtering.