VoXtream: Revolutionizing Real-Time TTS with Zero-Delay Audio Output

Introduction to VoXtream

VoXtream is an open-source Text-to-Speech (TTS) model developed by KTH’s Speech, Music and Hearing group. It targets a common problem in real-time applications such as live dubbing and simultaneous translation: latency. Many TTS systems wait for a full block of text before they start to speak, which introduces an audible delay. VoXtream instead begins speaking once the first word has arrived.

Understanding Full-Stream TTS

Full-stream TTS goes a step beyond traditional output streaming. Instead of waiting for a complete sentence, it consumes text incrementally as it arrives and generates audio frames continuously, so no input-side buffering is needed. The goal is to minimize the time between the first word arriving and the onset of speech.
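To make the difference concrete, here is a minimal Python sketch. It is not VoXtream code: the function names are illustrative assumptions, and the "synthesizer" is reduced to counting how many input words must arrive before the first audio chunk could be produced.

```python
from typing import Iterable, Iterator, List


def sentence_buffered(words: Iterable[str]) -> Iterator[List[str]]:
    """Baseline: hold incoming words until a sentence-final mark appears."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith((".", "!", "?")):
            yield buffer
            buffer = []
    if buffer:  # flush a trailing fragment that has no final punctuation
        yield buffer


def full_stream(words: Iterable[str]) -> Iterator[List[str]]:
    """Full-stream behaviour: pass each word on as soon as it arrives."""
    for word in words:
        yield [word]


def words_before_first_audio(chunks: Iterator[List[str]]) -> int:
    """Number of input words consumed before the first chunk is available."""
    return len(next(chunks))


incoming = "The quick brown fox jumps.".split()
buffered_wait = words_before_first_audio(sentence_buffered(incoming))
streamed_wait = words_before_first_audio(full_stream(incoming))
print(buffered_wait, streamed_wait)  # prints: 5 1
```

With a five-word sentence, the buffered variant cannot start until all five words are in, while the full-stream variant can start after one.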

How VoXtream Works

VoXtream’s immediate speech output comes from a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). Rather than waiting for a fixed look-ahead window to fill, the PT works with whatever upcoming phonemes are already available, so audio generation can start as soon as the first word enters the buffer.
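The contrast between a dynamic and a fixed look-ahead can be sketched in a few lines of Python. This is an illustration under assumptions, not the model's implementation: the function names are hypothetical, and the only value taken from the article is the upper bound of 10 phonemes.

```python
from typing import List

MAX_LOOKAHEAD = 10  # upper bound stated in the article


def lookahead_window(buffered: List[str], position: int,
                     max_lookahead: int = MAX_LOOKAHEAD) -> List[str]:
    """Phonemes visible when generating for buffered[position].

    Dynamic: uses however many future phonemes have already arrived,
    capped at max_lookahead, and never waits for more.
    """
    future = buffered[position + 1: position + 1 + max_lookahead]
    return [buffered[position]] + future


def fixed_window_ready(buffered: List[str], position: int,
                       lookahead: int) -> bool:
    """A fixed window can only proceed once all future slots are filled."""
    return len(buffered) - position - 1 >= lookahead


first_word = ["HH", "AH", "L", "OW"]  # ARPAbet phonemes for "hello"
print(lookahead_window(first_word, 0))        # ['HH', 'AH', 'L', 'OW']
print(fixed_window_ready(first_word, 0, 10))  # False
```

With only the first word buffered, the dynamic window already has something to work with, while a fixed 10-phoneme window would still be waiting for more input.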

Technical Architecture

VoXtream’s architecture is built around a single, fully-autoregressive (AR) pipeline that includes three key transformers:

  • Phoneme Transformer (PT): A decoder-only, incremental transformer that uses a dynamic look-ahead of up to 10 phonemes. Text is converted to phonemes word by word.
  • Temporal Transformer (TT): An AR predictor that emits semantic tokens together with a duration token, which keeps phonemes and audio aligned.
  • Depth Transformer (DT): A generator that produces the remaining acoustic codebooks, conditioned on TT outputs and on a speaker embedding used for zero-shot voice prompting.
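The order in which the three stages feed each other can be shown with a toy Python sketch. Everything here is a stand-in: there are no neural networks, the arithmetic is arbitrary placeholder logic, the number of acoustic codebooks is a hypothetical value, and emitting one frame per phoneme is a simplification. Only the data flow (PT to TT to DT, with the speaker embedding entering at the DT) follows the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_ACOUSTIC_CODEBOOKS = 3  # hypothetical; the article gives no count


@dataclass
class Frame:
    semantic_token: int
    duration_token: int
    acoustic_tokens: List[int]


def phoneme_transformer(phonemes: List[str]) -> List[int]:
    """Stand-in for the PT: one placeholder integer state per phoneme."""
    return [sum(ord(c) for c in p) % 100 for p in phonemes]


def temporal_transformer(state: int, previous_semantic: int) -> Tuple[int, int]:
    """Stand-in for the TT: one autoregressive step that returns a
    (semantic token, duration token) pair."""
    semantic = (state + previous_semantic) % 100
    duration = 1 + state % 2
    return semantic, duration


def depth_transformer(semantic: int, speaker_embedding: List[float]) -> List[int]:
    """Stand-in for the DT: fills the remaining codebooks from the TT
    output and the speaker embedding."""
    bias = int(sum(speaker_embedding) * 10)
    return [(semantic + bias + k) % 100 for k in range(NUM_ACOUSTIC_CODEBOOKS)]


def synthesize(phonemes: List[str], speaker_embedding: List[float]) -> List[Frame]:
    frames: List[Frame] = []
    previous = 0  # the TT conditions on its own previous output
    for state in phoneme_transformer(phonemes):
        semantic, duration = temporal_transformer(state, previous)
        acoustic = depth_transformer(semantic, speaker_embedding)
        frames.append(Frame(semantic, duration, acoustic))
        previous = semantic
    return frames


frames = synthesize(["HH", "AH", "L", "OW"], speaker_embedding=[0.1, 0.2])
print(len(frames), len(frames[0].acoustic_tokens))  # prints: 4 3
```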

Performance Metrics

On an A100 GPU, the compiled model achieves a first-packet latency (FPL) of 102 ms and a real-time factor (RTF) of 0.17. On an RTX 3090, the FPL is 123 ms with an RTF of 0.19. An RTF below 1 means audio is generated faster than it plays back, so both figures leave headroom for real-time applications.
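RTF is synthesis time divided by the duration of the audio produced, so its reciprocal gives the speed-up over real time. The short Python sketch below works through the two reported RTF values; the function names are illustrative, not part of any VoXtream API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# RTF values reported in the article
for gpu, rtf in {"A100": 0.17, "RTX 3090": 0.19}.items():
    print(f"{gpu}: RTF {rtf} -> {speedup_over_real_time(rtf):.2f}x real time")
# A100: RTF 0.17 -> 5.88x real time
# RTX 3090: RTF 0.19 -> 5.26x real time
```

Both values are above 5, which matches the "over five times faster than real-time" figure quoted for compiled mode.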

Comparative Analysis

When evaluated against popular streaming TTS systems, VoXtream shows a word error rate (WER) of 3.24%, roughly half of CosyVoice2’s 6.11%. In listener studies, participants preferred the naturalness of VoXtream’s output, although CosyVoice2 has an edge in speaker similarity. In compiled mode, VoXtream also runs more than five times faster than real time.
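For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The Python sketch below shows the standard computation; it is a generic illustration, not the evaluation script used for the reported numbers, and the lowercase-and-split normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing
            insertion = current[j - 1] + 1  # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.2%}")  # prints: 16.67%
```

In TTS evaluation the hypothesis is typically produced by running an ASR model over the synthesized audio, so a lower WER indicates more intelligible speech.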

Data Utilization

VoXtream was trained on approximately 9,000 hours of speech, with around 4,500 hours each from Emilia and HiFiTTS-2. Data preparation included a diarization step to remove multi-speaker clips and an Automatic Speech Recognition (ASR) pass to filter transcripts, so that the model learns from clean audio paired with reliable text.
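A filtering pass of this kind might look like the Python sketch below. It is an assumption-laden illustration rather than the authors' pipeline: the `Clip` fields, the use of `difflib.SequenceMatcher` as the agreement measure, and the 0.9 threshold are all hypothetical choices, and the diarization and ASR outputs are taken as given inputs.

```python
import string
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List

MIN_TRANSCRIPT_MATCH = 0.9  # hypothetical threshold, not from the article


@dataclass
class Clip:
    clip_id: str
    num_speakers: int  # assumed output of a diarization step
    transcript: str    # transcript shipped with the dataset
    asr_text: str      # assumed output of an ASR model on the same audio


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def keep_clip(clip: Clip) -> bool:
    """Keep single-speaker clips whose transcript agrees with the ASR text."""
    if clip.num_speakers != 1:
        return False
    match = SequenceMatcher(
        None, normalize(clip.transcript), normalize(clip.asr_text)
    ).ratio()
    return match >= MIN_TRANSCRIPT_MATCH


clips = [
    Clip("a", 1, "Hello, world.", "hello world"),
    Clip("b", 2, "Nice to meet you.", "nice to meet you"),
    Clip("c", 1, "Good morning everyone.", "could warning every one"),
]
kept = [clip.clip_id for clip in clips if keep_clip(clip)]
print(kept)  # prints: ['a']
```

Clip "b" is dropped because it has two speakers, and clip "c" because its transcript and the ASR output share no words.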

Quality Metrics

The model’s performance is validated across various metrics, including WER, UTMOS (a Mean Opinion Score predictor), and speaker similarity. An ablation study indicated that incorporating the CSM Depth Transformer and speaker encoder enhances speaker similarity without adversely affecting WER.
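The article does not say how speaker similarity is scored. A common approach, assumed here, is the cosine similarity between speaker embeddings of the voice prompt and of the synthesized audio. The Python sketch below shows only that final comparison; the embeddings themselves would come from a speaker verification model, which is outside this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration; real embeddings have far more dimensions.
prompt_embedding = [1.0, 2.0, 3.0]
output_embedding = [2.0, 4.0, 6.0]
similarity = cosine_similarity(prompt_embedding, output_embedding)
print(round(similarity, 6))  # prints: 1.0
```

Because the metric depends only on direction, the two vectors above score 1.0 even though their magnitudes differ.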

Positioning in the TTS Landscape

VoXtream’s primary contribution is its latency-focused AR arrangement combined with duration-token alignment, which together make input-side streaming practical. The design is a trade-off: speaker similarity may be slightly lower than with chunked non-autoregressive vocoders, but the reduction in FPL is substantial, which is what matters most in real-time applications.

Conclusion

VoXtream is a notable step forward for TTS applications that need immediate audio output. Its full-stream architecture, low first-packet latency, and competitive WER make it a strong candidate wherever speech has to begin before the full text is known.

Frequently Asked Questions (FAQ)

  • What is VoXtream? VoXtream is an open-source TTS model designed to start speaking as soon as the first word of text input arrives, addressing latency issues in real-time applications.
  • How does VoXtream differ from traditional TTS systems? Unlike traditional systems that wait for a chunk of text, VoXtream generates audio from the first word, significantly reducing delays.
  • What are the key components of VoXtream’s architecture? VoXtream consists of three transformers: the Phoneme Transformer, Temporal Transformer, and Depth Transformer, each serving a unique function in audio generation.
  • What performance metrics does VoXtream achieve? On an A100 GPU, VoXtream achieves a first-packet latency of 102 ms and, with an RTF of 0.17, runs more than five times faster than real time in compiled mode.
  • How was VoXtream trained? It was trained on a dataset of approximately 9,000 hours, ensuring high-quality audio through careful data processing and filtering.

Vladimir Dyachkov, Ph.D
Editor-in-Chief itinai.com
