<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://tts.wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ttswikiadmin</id>
	<title>TTS Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://tts.wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ttswikiadmin"/>
	<link rel="alternate" type="text/html" href="https://tts.wiki/index.php/Special:Contributions/Ttswikiadmin"/>
	<updated>2026-04-03T17:29:00Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.5</generator>
	<entry>
		<id>https://tts.wiki/index.php?title=SNAC&amp;diff=59</id>
		<title>SNAC</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=SNAC&amp;diff=59"/>
		<updated>2025-12-23T03:44:01Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;SNAC&amp;#039;&amp;#039;&amp;#039; (Multi-Scale Neural Audio Codec) is a neural audio codec that introduces multi-scale temporal quantization for efficient audio compression. It was presented at the NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation by researchers from Papla Media and ETH Zurich.  === Overview === Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;SNAC&#039;&#039;&#039; (Multi-Scale Neural Audio Codec) is a neural audio codec that introduces multi-scale temporal quantization for efficient audio compression. It was presented at the NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation by researchers from [[Papla Media]] and ETH Zurich.&lt;br /&gt;
&lt;br /&gt;
=== Overview ===&lt;br /&gt;
Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. While Residual Vector Quantization (RVQ) has become the standard technique for neural audio compression using a cascade of VQ codebooks, SNAC proposes a simple extension of RVQ where the quantizers can operate at different temporal resolutions.&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
SNAC encodes audio into hierarchical tokens similarly to SoundStream, EnCodec, and [[DAC]]. However, SNAC introduces a simple change where coarse tokens are sampled less frequently, covering a broader time span.&lt;br /&gt;
&lt;br /&gt;
The architecture includes several key innovations:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Multi-Scale Quantization&#039;&#039;&#039;: A hierarchy of quantizers applied at variable frame rates lets the codec adapt to audio structure across multiple timescales.&lt;br /&gt;
* &#039;&#039;&#039;Noise Blocks&#039;&#039;&#039;: Blocks that inject input-dependent Gaussian noise for enhanced expressiveness.&lt;br /&gt;
* &#039;&#039;&#039;Depthwise Convolutions&#039;&#039;&#039;: Depthwise convolutions improve computational efficiency and training stability.&lt;br /&gt;
* &#039;&#039;&#039;Local Windowed Attention&#039;&#039;&#039;: Attention layers at the lowest temporal resolution capture contextual relationships.&lt;br /&gt;
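A minimal NumPy sketch of the multi-scale idea (illustrative only: the codebook sizes, strides, and nearest-neighbour quantizer below are placeholders, not the actual SNAC implementation, which uses learned convolutional down- and upsampling):&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(frames, codebook):
    # assign each frame to its nearest codebook entry (L2 distance)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)
    return codebook[idx], idx

def multi_scale_rvq(x, codebooks, strides):
    # Residual VQ where level i average-pools the residual by strides[i]
    # before quantizing, then upsamples back to full rate -- coarse levels
    # therefore emit fewer tokens, each covering a broader time span.
    residual = x.copy()
    tokens = []
    for cb, s in zip(codebooks, strides):
        t = residual.shape[0] // s
        pooled = residual[: t * s].reshape(t, s, -1).mean(1)  # downsample
        q, idx = nearest_code(pooled, cb)
        tokens.append(idx)
        residual = residual - np.repeat(q, s, axis=0)         # upsample, subtract
    return tokens, residual

x = rng.normal(size=(48, 8))                   # 48 frames, 8-dim latents
codebooks = [rng.normal(size=(4096, 8)) for _ in range(3)]
tokens, _ = multi_scale_rvq(x, codebooks, strides=[4, 2, 1])
print([t.shape[0] for t in tokens])            # [12, 24, 48]: coarse to fine
```

Coarser levels emit proportionally fewer tokens, which is what allows language models operating on SNAC tokens to cover longer time spans cheaply.&lt;br /&gt;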
&lt;br /&gt;
=== Model Variants ===&lt;br /&gt;
SNAC offers several pretrained models optimized for different use cases:&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!Model&lt;br /&gt;
!Sample Rate&lt;br /&gt;
!Bitrate&lt;br /&gt;
!RVQ Levels&lt;br /&gt;
!Token Rates&lt;br /&gt;
!Parameters&lt;br /&gt;
!Use Case&lt;br /&gt;
|-&lt;br /&gt;
|snac_24khz&lt;br /&gt;
|24 kHz&lt;br /&gt;
|0.98 kbps&lt;br /&gt;
|3&lt;br /&gt;
|12, 23, and 47 Hz&lt;br /&gt;
|~20M&lt;br /&gt;
|Speech&lt;br /&gt;
|-&lt;br /&gt;
|snac_32khz&lt;br /&gt;
|32 kHz&lt;br /&gt;
|1.9 kbps&lt;br /&gt;
|4&lt;br /&gt;
|10, 21, 42, and 83 Hz&lt;br /&gt;
|~55M&lt;br /&gt;
|General audio&lt;br /&gt;
|-&lt;br /&gt;
|snac_44khz&lt;br /&gt;
|44 kHz&lt;br /&gt;
|2.6 kbps&lt;br /&gt;
|4&lt;br /&gt;
|14, 29, 57, and 115 Hz&lt;br /&gt;
|~55M&lt;br /&gt;
|Music/SFX&lt;br /&gt;
|}&lt;br /&gt;
Each codebook holds 4096 entries (12 bits per token). The general audio model comprises 16M parameters in the encoder and 38.3M in the decoder, totaling roughly 54.5M parameters.&lt;br /&gt;
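Since each token indexes a 4096-entry (12-bit) codebook, the table&#039;s bitrates follow directly from the listed token rates (a quick sanity check; small differences are rounding):&lt;br /&gt;

```python
import math

bits_per_token = math.log2(4096)     # 12-bit codebooks
token_rates = {
    "snac_24khz": [12, 23, 47],
    "snac_32khz": [10, 21, 42, 83],
    "snac_44khz": [14, 29, 57, 115],
}
for name, rates in token_rates.items():
    kbps = sum(rates) * bits_per_token / 1000
    print(name, round(kbps, 2))
# snac_24khz 0.98, snac_32khz 1.87, snac_44khz 2.58
```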
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
In the paper&#039;s evaluations on speech, SNAC consistently outperformed the other codecs tested. Notably, even at bitrates below 1 kbit/s, SNAC maintains audio quality that closely approaches the reference signal. It also outperformed competing codecs such as EnCodec and DAC at comparable bitrates, in some cases matching the quality of systems operating at twice its bitrate.&lt;br /&gt;
&lt;br /&gt;
=== Applications ===&lt;br /&gt;
SNAC has been adopted in several text-to-speech systems:&lt;br /&gt;
&lt;br /&gt;
* [[Orpheus TTS|&#039;&#039;&#039;Orpheus TTS&#039;&#039;&#039;]]: Orpheus uses SNAC, which creates tokens at four levels of hierarchy. The SNAC model is relatively lightweight and fast, making it suitable for real-time decoding.&lt;br /&gt;
&lt;br /&gt;
With coarse tokens at ~10 Hz and a 2048-token context window, a language model can capture the consistent structure of an audio track for roughly 3 minutes.&lt;br /&gt;
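The estimate is simple division over the coarse token stream (assuming the ~10 Hz coarse rate of the 32 kHz model):&lt;br /&gt;

```python
coarse_rate_hz = 10        # coarsest token stream of snac_32khz
context_tokens = 2048
seconds = context_tokens / coarse_rate_hz
print(seconds, seconds / 60)   # 204.8 seconds, i.e. about 3.4 minutes
```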
&lt;br /&gt;
=== Comparison with Other Codecs ===&lt;br /&gt;
The SNAC configuration used by Orpheus produces 83 tokens per second, compared to 50 tokens/s for [[X-Codec|X-Codec 2.0]] and 25 tokens/s for [[CosyVoice]]&#039;s codec. SNAC uses a single codebook per level, with tokens emitted at each level of downsampling, in contrast to codecs like [[Mimi]], which use multiple separate codebooks.&lt;br /&gt;
&lt;br /&gt;
[[Category:Neural audio codecs]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=X-Codec&amp;diff=58</id>
		<title>X-Codec</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=X-Codec&amp;diff=58"/>
		<updated>2025-12-23T02:33:59Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;X-Codec&#039;&#039;&#039; is a neural audio codec designed to enhance semantic understanding in audio large language models (LLMs). It was introduced in the paper &amp;quot;Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model,&amp;quot; published at AAAI 2025.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
Traditional audio codecs like EnCodec were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLMs. Research has found that methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretation of acoustic tokens, resulting in word skipping and errors.&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
X-Codec addresses these limitations through a dual-encoder design that incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ.&lt;br /&gt;
&lt;br /&gt;
The architecture consists of:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Acoustic Encoder/Decoder&#039;&#039;&#039;: Convolutional encoder and decoder with a Residual Vector Quantizer (RVQ)&lt;br /&gt;
* &#039;&#039;&#039;Semantic Module&#039;&#039;&#039;: A pre-trained self-supervised model such as HuBERT or WavLM&lt;br /&gt;
* &#039;&#039;&#039;Projectors&#039;&#039;&#039;: Linear layers that combine and process the acoustic and semantic features&lt;br /&gt;
&lt;br /&gt;
The acoustic and semantic features are concatenated, transformed, and then quantized together. After quantization, separate post-processing layers reconstruct both semantic and acoustic representations.&lt;br /&gt;
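A schematic NumPy sketch of this fusion step (shapes and the plain linear projectors are illustrative assumptions; the real model uses learned encoders and an RVQ stage, omitted here):&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_ac, d_sem, d_model = 50, 256, 768, 256    # 50 frames (1 s at 50 Hz)

acoustic = rng.normal(size=(T, d_ac))          # stand-in encoder outputs
semantic = rng.normal(size=(T, d_sem))

# Projector: concatenate frame-aligned features, map to the quantizer space.
W_in = 0.02 * rng.normal(size=(d_ac + d_sem, d_model))
fused = np.concatenate([acoustic, semantic], axis=1) @ W_in

# (RVQ of the fused features omitted.)  After quantization, separate
# projectors reconstruct each stream so a semantic reconstruction loss
# can supervise the codes alongside the acoustic losses.
W_sem = 0.02 * rng.normal(size=(d_model, d_sem))
W_ac = 0.02 * rng.normal(size=(d_model, d_ac))
sem_recon = fused @ W_sem
ac_recon = fused @ W_ac
print(fused.shape, sem_recon.shape, ac_recon.shape)   # (50, 256) (50, 768) (50, 256)
```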
&lt;br /&gt;
=== Applications ===&lt;br /&gt;
X-Codec demonstrated improvements across multiple audio generation tasks including text-to-speech synthesis, music continuation, and general audio classification tasks. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation.&lt;br /&gt;
&lt;br /&gt;
== X-Codec 2.0 ==&lt;br /&gt;
&#039;&#039;&#039;X-Codec 2.0&#039;&#039;&#039; (also written as XCodec2) is a successor to X-Codec, introduced alongside the LLaSA text-to-speech system in the paper &amp;quot;LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Key Differences from X-Codec ===&lt;br /&gt;
X-Codec2 extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized. Major architectural changes include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Unified Semantic-Acoustic Tokenization&#039;&#039;&#039;: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).&lt;br /&gt;
* &#039;&#039;&#039;Single-Stage Vector Quantization&#039;&#039;&#039;: Unlike the multi-layer residual VQ in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer Finite Scalar Quantization (FSQ) for stability and compatibility with causal language models.&lt;br /&gt;
* &#039;&#039;&#039;Large Codebook&#039;&#039;&#039;: A 65,536-entry codebook using Finite Scalar Quantization that achieves 99% codebook utilization, a vocabulary size comparable to text tokenizers (LLaMA3 uses 128,256).&lt;br /&gt;
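Finite Scalar Quantization reaches such a large implicit codebook by rounding each latent dimension to a small number of levels; the codebook size is the product of the per-dimension level counts. A sketch with a hypothetical level configuration whose product is 65,536 (not necessarily X-Codec2&#039;s actual setting):&lt;br /&gt;

```python
import math
import numpy as np

levels = [4] * 8                      # hypothetical: 4 levels in each of 8 dims
codebook_size = math.prod(levels)     # implicit codebook, no learned entries
print(codebook_size)                  # 65536 (= 16 bits per token)

def fsq(z, levels):
    # bound each dimension, scale it to its level grid, and round --
    # no codebook lookup and no commitment loss required
    z = np.tanh(z)
    out = []
    for zi, n in zip(z, levels):
        half = (n - 1) / 2
        out.append(round(float(zi) * half) / half)
    return out

print(fsq(np.array([0.3, -2.0, 0.0, 1.5, -0.1, 0.7, 2.5, -0.9]), levels))
```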
&lt;br /&gt;
=== Technical Specifications ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Semantic Encoder&#039;&#039;&#039;: Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.&lt;br /&gt;
* &#039;&#039;&#039;Training Data&#039;&#039;&#039;: Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).&lt;br /&gt;
* &#039;&#039;&#039;Quantization&#039;&#039;&#039;: Finite Scalar Quantization (FSQ), which does not require an explicit VQ objective term (e.g., codebook commitment loss), simplifying optimization during training.&lt;br /&gt;
&lt;br /&gt;
=== Derivatives ===&lt;br /&gt;
X-Codec 2.0 has been extended in several ways:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;NeuCodec&#039;&#039;&#039;: Neuphonic&#039;s codec for on-device TTS, which is largely based on extending X-Codec 2.0&lt;br /&gt;
* &#039;&#039;&#039;XCodec2-Streaming&#039;&#039;&#039;: A streaming variant that adopts a causal decoder to focus solely on historical context, enabling streaming waveform reconstruction.&lt;br /&gt;
&lt;br /&gt;
=== Availability ===&lt;br /&gt;
X-Codec is available on GitHub and integrated into Hugging Face&#039;s Transformers library. X-Codec 2.0 is available via the &amp;lt;code&amp;gt;xcodec2&amp;lt;/code&amp;gt; Python package and on Hugging Face.&lt;br /&gt;
[[Category:Neural audio codecs]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=NeuCodec&amp;diff=57</id>
		<title>NeuCodec</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=NeuCodec&amp;diff=57"/>
		<updated>2025-12-23T02:33:27Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;NeuCodec&amp;#039;&amp;#039;&amp;#039; is a neural audio codec developed by Neuphonic, designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates.  === Technical Specifications ===  * &amp;#039;&amp;#039;&amp;#039;Bitrate:&amp;#039;&amp;#039;&amp;#039; 0.8 kbps * &amp;#039;&amp;#039;&amp;#039;Output sample rate:&amp;#039;&amp;#039;&amp;#039; 24 kHz * &amp;#039;&amp;#039;&amp;#039;Frame rate:&amp;#039;&amp;#039;&amp;#039; 50 Hz * &amp;#039;&amp;#039;&amp;#039;Quantization:&amp;#039;&amp;#039;&amp;#039; Finite Scalar Quantization (FSQ) with a single codebook  === Architecture === NeuCodec is largely based on extending the work of X-Codec 2.0. It e...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;NeuCodec&#039;&#039;&#039; is a neural audio codec developed by [[Neuphonic]], designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates.&lt;br /&gt;
&lt;br /&gt;
=== Technical Specifications ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Bitrate:&#039;&#039;&#039; 0.8 kbps&lt;br /&gt;
* &#039;&#039;&#039;Output sample rate:&#039;&#039;&#039; 24 kHz&lt;br /&gt;
* &#039;&#039;&#039;Frame rate:&#039;&#039;&#039; 50 Hz&lt;br /&gt;
* &#039;&#039;&#039;Quantization:&#039;&#039;&#039; Finite Scalar Quantization (FSQ) with a single codebook&lt;br /&gt;
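These numbers are mutually consistent: a single 16-bit codebook at 50 Hz yields exactly 0.8 kbps (a back-of-the-envelope check; the 65,536-entry codebook is an assumption carried over from X-Codec 2.0):&lt;br /&gt;

```python
frame_rate_hz = 50
bits_per_frame = 16     # log2(65,536), if the X-Codec 2.0 codebook carries over
kbps = frame_rate_hz * bits_per_frame / 1000
print(kbps)             # 0.8
```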
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
NeuCodec is largely based on extending the work of [[X-Codec|X-Codec 2.0]]. It employs a dual-encoder approach, using both audio ([[BigCodec]]) and semantic (Wav2Vec2-BERT) encoders. The FSQ-based design produces a single quantized vector output, making it well-suited for downstream Speech Language Model (SpeechLM) training.&lt;br /&gt;
&lt;br /&gt;
=== Features ===&lt;br /&gt;
&lt;br /&gt;
* Compresses and reconstructs audio with near-inaudible reconstruction loss&lt;br /&gt;
* Upsamples from 16 kHz to 24 kHz&lt;br /&gt;
* Commercial use permitted&lt;br /&gt;
* Pre-encoded datasets available (Emilia-YODAS compressed from 1.7 TB to 41 GB)&lt;br /&gt;
&lt;br /&gt;
=== Applications ===&lt;br /&gt;
NeuCodec serves as the audio codec for [[NeuTTS Air]], Neuphonic&#039;s on-device text-to-speech model with voice cloning capabilities. It&#039;s intended for researchers and developers building text-to-speech systems who need efficient speech tokenization without developing their own codec.&lt;br /&gt;
&lt;br /&gt;
=== Availability ===&lt;br /&gt;
Available on Hugging Face and GitHub under the &amp;lt;code&amp;gt;neuphonic/neucodec&amp;lt;/code&amp;gt; repository, installable via pip.&lt;br /&gt;
&lt;br /&gt;
[[Category:Neural audio codecs]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=X-Codec&amp;diff=56</id>
		<title>X-Codec</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=X-Codec&amp;diff=56"/>
		<updated>2025-12-23T02:30:16Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;NeuCodec&amp;#039;&amp;#039;&amp;#039; is a neural audio codec developed by Neuphonic, designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates.  === Technical Specifications ===  * &amp;#039;&amp;#039;&amp;#039;Bitrate:&amp;#039;&amp;#039;&amp;#039; 0.8 kbps * &amp;#039;&amp;#039;&amp;#039;Output sample rate:&amp;#039;&amp;#039;&amp;#039; 24 kHz * &amp;#039;&amp;#039;&amp;#039;Frame rate:&amp;#039;&amp;#039;&amp;#039; 50 Hz * &amp;#039;&amp;#039;&amp;#039;Quantization:&amp;#039;&amp;#039;&amp;#039; Finite Scalar Quantization (FSQ) with a single codebook  === Architecture === NeuCodec is largely based on extending the work of X-Codec 2.0. It e...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;NeuCodec&#039;&#039;&#039; is a neural audio codec developed by [[Neuphonic]], designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates.&lt;br /&gt;
&lt;br /&gt;
=== Technical Specifications ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Bitrate:&#039;&#039;&#039; 0.8 kbps&lt;br /&gt;
* &#039;&#039;&#039;Output sample rate:&#039;&#039;&#039; 24 kHz&lt;br /&gt;
* &#039;&#039;&#039;Frame rate:&#039;&#039;&#039; 50 Hz&lt;br /&gt;
* &#039;&#039;&#039;Quantization:&#039;&#039;&#039; Finite Scalar Quantization (FSQ) with a single codebook&lt;br /&gt;
&lt;br /&gt;
=== Architecture ===&lt;br /&gt;
NeuCodec is largely based on extending the work of [[X-Codec|X-Codec 2.0]]. It employs a dual-encoder approach, using both audio ([[BigCodec]]) and semantic (Wav2Vec2-BERT) encoders. The FSQ-based design produces a single quantized vector output, making it well-suited for downstream Speech Language Model (SpeechLM) training.&lt;br /&gt;
&lt;br /&gt;
=== Features ===&lt;br /&gt;
&lt;br /&gt;
* Compresses and reconstructs audio with near-inaudible reconstruction loss&lt;br /&gt;
* Upsamples from 16 kHz to 24 kHz&lt;br /&gt;
* Commercial use permitted&lt;br /&gt;
* Pre-encoded datasets available (Emilia-YODAS compressed from 1.7 TB to 41 GB)&lt;br /&gt;
&lt;br /&gt;
=== Applications ===&lt;br /&gt;
NeuCodec serves as the audio codec for [[NeuTTS Air]], Neuphonic&#039;s on-device text-to-speech model with voice cloning capabilities. It&#039;s intended for researchers and developers building text-to-speech systems who need efficient speech tokenization without developing their own codec.&lt;br /&gt;
&lt;br /&gt;
=== Availability ===&lt;br /&gt;
Available on Hugging Face and GitHub under the &amp;lt;code&amp;gt;neuphonic/neucodec&amp;lt;/code&amp;gt; repository, installable via pip.&lt;br /&gt;
&lt;br /&gt;
[[Category:Neural audio codecs]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=ElevenLabs&amp;diff=54</id>
		<title>ElevenLabs</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=ElevenLabs&amp;diff=54"/>
		<updated>2025-09-25T03:01:23Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = ElevenLabs&lt;br /&gt;
| developer = ElevenLabs Inc.&lt;br /&gt;
| release_date = January 2023 (beta)&lt;br /&gt;
| latest_version = Eleven v3 (alpha)&lt;br /&gt;
| languages = 32+ languages&lt;br /&gt;
| voices = 1000+ voices&lt;br /&gt;
| voice_cloning = Yes (professional &amp;amp; instant)&lt;br /&gt;
| emotion_control = Yes (via audio tags)&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| latency = ~135ms (Flash models)&lt;br /&gt;
| open_source = No&lt;br /&gt;
| website = https://elevenlabs.io&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;ElevenLabs&#039;&#039;&#039; is a commercial artificial intelligence company specializing in text-to-speech synthesis and voice cloning technology. Founded in 2022 by Piotr Dąbkowski and Mateusz Staniszewski, the company has gained prominence for its AI-generated voices that can replicate human speech patterns, emotions, and intonation across multiple languages.&lt;br /&gt;
&lt;br /&gt;
== History and Founding ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs was co-founded in 2022 by Piotr Dąbkowski, a former Google machine learning engineer, and Mateusz Staniszewski, an ex-Palantir deployment strategist. Both founders, originally from Poland, reportedly drew inspiration from the poor quality of film dubbing they experienced while watching American movies in their home country.&amp;lt;ref&amp;gt;https://venturebeat.com/ai/now-hear-this-voice-cloning-ai-startup-elevenlabs-nabs-19m-from-a16z-and-other-heavy-hitters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The founders first met as teenagers at Copernicus High School in Warsaw before pursuing separate academic paths—Dąbkowski studying at Oxford and Cambridge, while Staniszewski studied mathematics in London. Their shared vision of making quality content accessible across all languages led to the creation of ElevenLabs as a research-first company.&amp;lt;ref&amp;gt;https://research.contrary.com/company/elevenlabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The company launched its beta platform in January 2023, quickly gaining traction with over one million users within five months. This rapid adoption demonstrated market demand for high-quality AI voice synthesis technology.&amp;lt;ref&amp;gt;https://research.contrary.com/company/elevenlabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Funding and Valuation ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs has experienced rapid growth in both user adoption and valuation:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pre-seed (January 2023)&#039;&#039;&#039;: $2 million led by Credo Ventures and Concept Ventures&lt;br /&gt;
* &#039;&#039;&#039;Series A (June 2023)&#039;&#039;&#039;: $19 million at $100 million valuation, co-led by Andreessen Horowitz, Nat Friedman, and Daniel Gross&lt;br /&gt;
* &#039;&#039;&#039;Series B (January 2024)&#039;&#039;&#039;: $80 million at $1.1 billion valuation, achieving unicorn status&lt;br /&gt;
* &#039;&#039;&#039;Series C (January 2025)&#039;&#039;&#039;: $180 million at $3.3 billion valuation, led by Andreessen Horowitz and ICONIQ Growth&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/ElevenLabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The company reportedly achieved $200 million in annual recurring revenue (ARR) by August 2025, demonstrating significant commercial traction.&amp;lt;ref&amp;gt;https://sacra.com/c/elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technology and Products ==&lt;br /&gt;
&lt;br /&gt;
=== Core Technology ===&lt;br /&gt;
&lt;br /&gt;
ElevenLabs&#039;s architecture is proprietary and remains undisclosed, with little information about it publicly available. Some have speculated that early versions of ElevenLabs were based on Tortoise TTS; however, these rumors remain unverified.&amp;lt;ref&amp;gt;https://github.com/neonbjb/tortoise-tts/discussions/277&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Product Portfolio ===&lt;br /&gt;
&lt;br /&gt;
==== Text-to-Speech Models ====&lt;br /&gt;
&lt;br /&gt;
ElevenLabs offers several model variants optimized for different use cases:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Multilingual v2&#039;&#039;&#039;: High-quality model supporting 29+ languages, optimized for audiobooks and professional content&lt;br /&gt;
* &#039;&#039;&#039;Flash v2.5&#039;&#039;&#039;: Ultra-low latency model (75ms) designed for real-time conversational applications&lt;br /&gt;
* &#039;&#039;&#039;Turbo v2.5&#039;&#039;&#039;: Balanced quality and speed model for general-purpose applications&lt;br /&gt;
* &#039;&#039;&#039;Eleven v3 (alpha)&#039;&#039;&#039;: Latest model featuring advanced emotion control via audio tags&lt;br /&gt;
* &#039;&#039;&#039;Eleven Scribe v1&#039;&#039;&#039;: SoTA automatic speech recognition model&lt;br /&gt;
* &#039;&#039;&#039;Eleven Music v1&#039;&#039;&#039;: Text-to-music model trained on licensed data&amp;lt;ref&amp;gt;https://elevenlabs.io/docs/models&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://elevenlabs.io/music&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Voice Cloning ====&lt;br /&gt;
&lt;br /&gt;
The platform provides two voice cloning approaches:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Instant Voice Cloning&#039;&#039;&#039;: Creates voice replicas from short audio samples (1-5 minutes)&lt;br /&gt;
* &#039;&#039;&#039;Professional Voice Cloning&#039;&#039;&#039;: Higher-fidelity, fine-tuning-based cloning requiring longer training samples&lt;br /&gt;
&lt;br /&gt;
==== Additional Features ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;AI Dubbing&#039;&#039;&#039;: Translates and dubs content while preserving original voice characteristics and emotions&lt;br /&gt;
* &#039;&#039;&#039;Voice Design&#039;&#039;&#039;: Tool for creating entirely synthetic voices from text descriptions&lt;br /&gt;
* &#039;&#039;&#039;Speech Classifier&#039;&#039;&#039;: Detection tool to identify AI-generated audio from ElevenLabs&#039; technology&lt;br /&gt;
* &#039;&#039;&#039;Projects&#039;&#039;&#039;: Long-form content creation tool for audiobooks and extended narration&lt;br /&gt;
&lt;br /&gt;
== Business Model and Pricing ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs operates on a freemium subscription model with usage-based pricing:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Free Tier&#039;&#039;&#039;: 10,000 characters per month with basic voices&lt;br /&gt;
* &#039;&#039;&#039;Starter&#039;&#039;&#039;: $5/month with commercial licensing&lt;br /&gt;
* &#039;&#039;&#039;Creator&#039;&#039;&#039;: $11/month with enhanced features&lt;br /&gt;
* &#039;&#039;&#039;Pro&#039;&#039;&#039;: $99/month for professional use&lt;br /&gt;
* &#039;&#039;&#039;Enterprise&#039;&#039;&#039;: Custom pricing with SLAs and dedicated support&lt;br /&gt;
&lt;br /&gt;
The company has evolved its pricing structure multiple times, transitioning from simple character-based billing to more complex model-aware systems and back to unified credit systems as it scaled.&amp;lt;ref&amp;gt;https://flexprice.io/blog/elevenlabs-pricing-breakdown&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance and Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Independent evaluations have provided mixed results regarding ElevenLabs&#039; performance relative to competitors:&lt;br /&gt;
&lt;br /&gt;
=== Competitive Analysis ===&lt;br /&gt;
&lt;br /&gt;
According to third-party benchmarks:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Voice Quality&#039;&#039;&#039;: ElevenLabs demonstrates superior Mean Opinion Scores (MOS) compared to Google Cloud Text-to-Speech across fiction, non-fiction, and conversational content&amp;lt;ref&amp;gt;https://unrealspeech.com/compare/elevenlabs-vs-google-text-to-speech&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Latency&#039;&#039;&#039;: Flash models achieve approximately 135ms Time to First Audio (TTFA), competitive with major cloud providers&amp;lt;ref&amp;gt;https://cartesia.ai/vs/elevenlabs-vs-microsoft-azure-text-to-speech&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Accuracy&#039;&#039;&#039;: Word Error Rates vary but generally maintain competitive performance with established providers&lt;br /&gt;
&lt;br /&gt;
However, these evaluations should be interpreted cautiously as they often come from companies with commercial interests in the TTS space, and standardized, independent benchmarking in the industry remains limited.&lt;br /&gt;
&lt;br /&gt;
== Controversies and Ethical Concerns ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs has faced significant criticism regarding the misuse of its technology:&lt;br /&gt;
&lt;br /&gt;
=== Early Misuse Incidents ===&lt;br /&gt;
&lt;br /&gt;
Shortly after the beta launch in January 2023, the platform was exploited by users on 4chan and other forums to create fake audio content. Notable incidents included:&lt;br /&gt;
&lt;br /&gt;
* Creation of celebrity deepfakes, including voices of Emma Watson, Alexandria Ocasio-Cortez, and Ben Shapiro making statements they never made&lt;br /&gt;
* Generation of racist, sexist, and homophobic content using cloned voices&amp;lt;ref&amp;gt;https://www.vice.com/en/article/ai-voice-firm-4chan-celebrity-voices-emma-watson-joe-rogan-elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Political Deepfakes ===&lt;br /&gt;
&lt;br /&gt;
In January 2024, ElevenLabs&#039; technology was used to create a robocall impersonating President Joe Biden, urging New Hampshire voters not to participate in the Democratic primary. The incident prompted investigation by the New Hampshire Attorney General&#039;s office and led to the suspension of the responsible user account.&amp;lt;ref&amp;gt;https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Legal Challenges ===&lt;br /&gt;
&lt;br /&gt;
The company faces ongoing legal challenges, including:&lt;br /&gt;
&lt;br /&gt;
* A lawsuit from voice actors Mark Boyett and Karissa Vacker, alleging unauthorized use of their voices to create the &amp;quot;Adam&amp;quot; and &amp;quot;Bella&amp;quot; default voices&lt;br /&gt;
* Claims of copyright infringement related to the use of audiobook recordings for training&amp;lt;ref&amp;gt;https://www.thevoicerealm.com/blog/a-look-into-the-elevenlabs-lawsuit/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Safety Measures ===&lt;br /&gt;
&lt;br /&gt;
In response to misuse concerns, ElevenLabs has implemented several safeguards:&lt;br /&gt;
&lt;br /&gt;
* Verification requirements for voice cloning features&lt;br /&gt;
* AI Speech Classifier for detecting ElevenLabs-generated content&lt;br /&gt;
* Partnership with Reality Defender for deepfake detection&lt;br /&gt;
* Mandatory credit card information for certain features&amp;lt;ref&amp;gt;https://www.bloomberg.com/news/articles/2024-07-18/elevenlabs-partners-with-reality-defender-to-combat-deepfake-audio&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Applications and Use Cases ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs technology is utilized across various industries:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Media and Entertainment&#039;&#039;&#039;: Audiobook production, podcast creation, film dubbing&lt;br /&gt;
* &#039;&#039;&#039;Gaming&#039;&#039;&#039;: Character voice generation for video games&lt;br /&gt;
* &#039;&#039;&#039;Education&#039;&#039;&#039;: Educational content narration and language learning&lt;br /&gt;
* &#039;&#039;&#039;Enterprise&#039;&#039;&#039;: Customer service automation, training materials&lt;br /&gt;
* &#039;&#039;&#039;Accessibility&#039;&#039;&#039;: Tools for visually impaired users&lt;br /&gt;
&lt;br /&gt;
The company reports that 41% of Fortune 500 companies use its platform, with notable customers including The Washington Post, TIME magazine, and HarperCollins Publishers.&amp;lt;ref&amp;gt;https://sacra.com/c/elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://elevenlabs.io/ Official ElevenLabs website]&lt;br /&gt;
* [https://elevenlabs.io/docs/ ElevenLabs Documentation]&lt;br /&gt;
* [https://elevenlabs.io/text-to-speech Text-to-Speech Demo]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=VibeVoice&amp;diff=53</id>
		<title>VibeVoice</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=VibeVoice&amp;diff=53"/>
		<updated>2025-09-23T02:53:54Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add note about fine-tuning&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = VibeVoice&lt;br /&gt;
| developer = [[Microsoft Research]]&lt;br /&gt;
| release_date = August 26, 2025&lt;br /&gt;
| latest_version = 7B&lt;br /&gt;
| architecture = [[Qwen]] 2.5 + Diffusion&lt;br /&gt;
| parameters = 1.5B / 7B&lt;br /&gt;
| training_data = Proprietary dataset&lt;br /&gt;
| languages = English, Chinese&lt;br /&gt;
| voices = 4 speakers maximum&lt;br /&gt;
| voice_cloning = Yes&lt;br /&gt;
| emotion_control = Limited&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| license = MIT&lt;br /&gt;
| open_source = Limited (code removed)&lt;br /&gt;
| code_repository = [https://github.com/vibevoice-community/VibeVoice Community fork]&lt;br /&gt;
| model_weights = [https://huggingface.co/vibevoice Community backup]&lt;br /&gt;
| website = [https://aka.ms/VibeVoice Microsoft]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;VibeVoice&#039;&#039;&#039; is an experimental [[text-to-speech]] (TTS) framework developed by [[Microsoft Research]] for generating long-form, multi-speaker conversational audio. Released in August 2025, it is designed to synthesize long-form speech content such as podcasts and audiobooks, with up to four speakers and support for voice cloning.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice was developed by a team at Microsoft Research led by Zhiliang Peng, Jianwei Yu, and others, with the technical report published on [[arXiv]] in August 2025.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt; The project was initially released as open-source software on GitHub and [[Hugging Face]], with model weights made publicly available under the MIT license.&lt;br /&gt;
&lt;br /&gt;
However, the release was disrupted in September 2025 when Microsoft removed the official repository and model weights from public access. According to Microsoft&#039;s statement, this action was taken after discovering &amp;quot;instances where the tool was used in ways inconsistent with the stated intent&amp;quot; and concerns about responsible AI use.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt; The repository was later restored without implementation code, while community-maintained forks preserved the original materials. The 1.5B pretrained model remains available on Microsoft&#039;s Hugging Face page, but the 7B model was taken down.&lt;br /&gt;
&lt;br /&gt;
Community forks have preserved backups of the code and model weights, including both the 1.5B and 7B models.&amp;lt;ref&amp;gt;https://github.com/vibevoice-community/VibeVoice&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://huggingface.co/aoi-ot/VibeVoice-Large&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice uses a hybrid architecture combining large language models with diffusion-based audio generation. The system uses two specialized tokenizers operating at an ultra-low 7.5 Hz frame rate:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Acoustic Tokenizer&#039;&#039;&#039;: A [[variational autoencoder]] (VAE) based encoder-decoder that compresses audio signals while preserving fidelity&lt;br /&gt;
* &#039;&#039;&#039;Semantic Tokenizer&#039;&#039;&#039;: A content-focused encoder trained using [[automatic speech recognition]] as a proxy task&lt;br /&gt;
&lt;br /&gt;
The core model utilizes [[Qwen2.5]] as its base large language model (available in 1.5B and 7B parameter variants), integrated with a lightweight diffusion head for generating acoustic features. This design achieves what the researchers claim is an 80-fold improvement in data compression compared to the [[Encodec]] model while maintaining audio quality.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt;&lt;br /&gt;
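As a rough back-of-the-envelope illustration (not a calculation from the technical report), the ultra-low frame rate is what keeps long sessions tractable as token sequences; the 50 Hz comparison rate below is an assumed figure for a typical neural audio codec, used only for scale:

```python
# Token-count estimate for a long generation session. The 7.5 Hz figure is
# VibeVoice's reported tokenizer frame rate; 50 Hz is an assumed rate for a
# typical neural codec, included purely for comparison.
def frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)

vibevoice_frames = frames(90, 7.5)  # a 90-minute session at 7.5 Hz
codec_frames = frames(90, 50.0)     # the same session at an assumed 50 Hz
```

At 7.5 Hz, a 90-minute session fits in roughly 40,500 frames per tokenizer stream, versus hundreds of thousands at typical codec rates, which is why the low frame rate matters for the claimed 90-minute generation length.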
&lt;br /&gt;
== Capabilities and Limitations ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice can generate speech sequences up to 90 minutes in length with support for up to four distinct speakers. The model demonstrates several emergent capabilities not explicitly trained for, including:&lt;br /&gt;
&lt;br /&gt;
* Cross-lingual speech synthesis&lt;br /&gt;
* Spontaneous singing (though often off-key)&lt;br /&gt;
* Contextual background music generation&lt;br /&gt;
* Voice cloning from short prompts&lt;br /&gt;
&lt;br /&gt;
However, the system has notable limitations:&lt;br /&gt;
&lt;br /&gt;
* Language support restricted to English and Chinese&lt;br /&gt;
* No explicit modeling of overlapping speech&lt;br /&gt;
* Occasional instability, particularly with Chinese text synthesis&lt;br /&gt;
* Uncontrolled generation of background sounds and music&lt;br /&gt;
* Limited commercial viability due to various technical constraints&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
In comparative evaluations against contemporary TTS systems including [[ElevenLabs]], [[Google]]&#039;s Gemini 2.5 Pro TTS, and others, VibeVoice reportedly achieved superior scores in subjective metrics of realism, richness, and user preference. The model also demonstrated competitive [[word error rate]]s when evaluated using speech recognition systems.&amp;lt;ref&amp;gt;https://huggingface.co/microsoft/VibeVoice-1.5B&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these evaluations were conducted on a limited test set of eight conversational transcripts totaling approximately one hour of audio, raising questions about the generalizability of the results to broader use cases.&lt;br /&gt;
&lt;br /&gt;
== Controversies and Concerns ==&lt;br /&gt;
&lt;br /&gt;
The temporary removal of VibeVoice from public access highlighted ongoing concerns about the potential misuse of high-quality synthetic speech technology. Microsoft explicitly warned about the potential for creating [[deepfake]] audio content for impersonation, fraud, or disinformation purposes.&lt;br /&gt;
&lt;br /&gt;
The model&#039;s ability to generate convincing speech from minimal voice prompts, combined with its long-form generation capabilities, raised particular concerns among AI safety researchers about potential misuse for creating fake audio content at scale.&lt;br /&gt;
&lt;br /&gt;
== Fine-Tuning ==&lt;br /&gt;
Following the release of the community backup, a community member released fine-tuning scripts on GitHub. LoRA fine-tuning is fully supported; however, full fine-tuning is not yet supported.&lt;br /&gt;
&lt;br /&gt;
Members of the Discord community have reported success both extending VibeVoice to new languages and training on specific voices for better voice cloning; however, there are few LoRA adapters publicly available.&lt;br /&gt;
&lt;br /&gt;
== Community Response ==&lt;br /&gt;
&lt;br /&gt;
Following Microsoft&#039;s temporary withdrawal of the official release, the open-source community created several preservation efforts:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/vibevoice-community/VibeVoice vibevoice-community/VibeVoice]&#039;&#039;&#039;: A community-maintained fork preserving the original codebase and model weights&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/voicepowered-ai/VibeVoice-finetuning VibeVoice-finetuning]&#039;&#039;&#039;: Unofficial tools for fine-tuning the models using [[Low-Rank Adaptation]] (LoRA) techniques&lt;br /&gt;
&lt;br /&gt;
These community efforts have enabled continued research and development despite the official restrictions, though they operate independently of Microsoft&#039;s oversight.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/vibevoice-community/VibeVoice Community-maintained VibeVoice repository]&lt;br /&gt;
* [https://discord.com/invite/ZDEYTTRxWG VibeVoice Discord server (unofficial)]&lt;br /&gt;
* [https://arxiv.org/abs/2508.19205 Original technical report on arXiv]&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Microsoft Research]]&lt;br /&gt;
[[Category:Open-source software]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Mean_Opinion_Score&amp;diff=52</id>
		<title>Mean Opinion Score</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Mean_Opinion_Score&amp;diff=52"/>
		<updated>2025-09-22T02:52:24Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: fix list styling&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Mean Opinion Score&#039;&#039;&#039; (&#039;&#039;&#039;MOS&#039;&#039;&#039;) is a numerical measure used in telecommunications and multimedia engineering to represent the overall quality of a stimulus or system as perceived by human evaluators. It is calculated as the arithmetic mean of individual ratings given by test subjects on a predefined scale, typically ranging from 1 (lowest perceived quality) to 5 (highest perceived quality). MOS is widely used for evaluating voice, video, and audiovisual quality in applications ranging from traditional telephony to modern text-to-speech systems and streaming media.&lt;br /&gt;
&lt;br /&gt;
The methodology originated in the telecommunications industry for assessing telephone call quality and was formally standardized by the International Telecommunication Union (ITU-T) in Recommendation P.800 in 1996. Since then, it has become the gold standard for subjective quality assessment across various domains where human perception of quality is critical.&lt;br /&gt;
&lt;br /&gt;
== History and Development ==&lt;br /&gt;
&lt;br /&gt;
=== Early Origins ===&lt;br /&gt;
&lt;br /&gt;
The concept of Mean Opinion Score emerged in the telecommunications industry during the 1970s as telephone networks became more complex and digital transmission methods were introduced. Initially developed by the ITU, MOS provided a standardized way to assess voice transmission quality over telephone networks by aggregating human judgments of call quality.&amp;lt;ref&amp;gt;https://www.telecomtrainer.com/mos-mean-opinion-score/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The early methodology involved having listeners sit in controlled &amp;quot;quiet rooms&amp;quot; and score telephone call quality as they perceived it. This subjective testing approach had been in use in the telephony industry for decades before formal standardization, reflecting the industry&#039;s recognition that technical measurements alone could not capture the human experience of communication quality.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Mean_opinion_score&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Standardization ===&lt;br /&gt;
&lt;br /&gt;
The methodology was formally standardized in ITU-T Recommendation P.800, &amp;quot;Methods for subjective determination of transmission quality,&amp;quot; approved on August 30, 1996.&amp;lt;ref&amp;gt;https://www.itu.int/rec/T-REC-P.800-199608-I/en&amp;lt;/ref&amp;gt; This recommendation established rigorous protocols for conducting subjective quality tests, including specific requirements for test environments and procedures.&lt;br /&gt;
&lt;br /&gt;
The standardization specified that test subjects should be seated in quiet rooms with volumes between 30 and 120 cubic meters, reverberation times less than 500 milliseconds (preferably 200-300 ms), and room noise levels below 30 dBA with no dominant spectral peaks. These environmental controls ensured consistency and reliability in MOS evaluations across different testing facilities and organizations.&lt;br /&gt;
&lt;br /&gt;
=== Evolution and Extensions ===&lt;br /&gt;
&lt;br /&gt;
Following the success of P.800, the ITU-T developed additional recommendations to clarify and extend MOS methodology:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.800.1&#039;&#039;&#039; (2003, updated 2016): Established terminology for different types of MOS scores, distinguishing between listening quality subjective (MOS-LQS), listening quality objective (MOS-LQO), and listening quality estimated (MOS-LQE) to avoid confusion about the source and nature of scores.&amp;lt;ref&amp;gt;https://studylib.net/doc/8277727/itu-t-rec.-p.800.1--03-2003--mean-opinion-score--mos--ter&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.800.2&#039;&#039;&#039;: Prescribed how MOS values should be reported, emphasizing that MOS scores from separate experiments cannot be directly compared unless explicitly designed for comparison and statistically validated.&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.808&#039;&#039;&#039; (2021): Addressed crowdsourcing methods for conducting subjective evaluations, recognizing the need for scalable approaches to MOS testing in the digital age.&amp;lt;ref&amp;gt;https://www.itu.int/rec/dologin_pub.asp?lang=e&amp;amp;id=T-REC-P.808-202106-I!!PDF-E&amp;amp;type=items&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Methodology ==&lt;br /&gt;
&lt;br /&gt;
=== Rating Scales ===&lt;br /&gt;
&lt;br /&gt;
The most commonly used rating scale is the &#039;&#039;&#039;Absolute Category Rating&#039;&#039;&#039; (ACR) scale, which maps subjective quality ratings to numerical values:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Score !! Quality Level !! Description&lt;br /&gt;
|-&lt;br /&gt;
| 5 || Excellent || Completely natural speech; imperceptible artifacts&lt;br /&gt;
|-&lt;br /&gt;
| 4 || Good || Mostly natural speech; just perceptible but not annoying&lt;br /&gt;
|-&lt;br /&gt;
| 3 || Fair || Equally natural and unnatural; perceptible and slightly annoying&lt;br /&gt;
|-&lt;br /&gt;
| 2 || Poor || Mostly unnatural speech; annoying but not objectionable&lt;br /&gt;
|-&lt;br /&gt;
| 1 || Bad || Completely unnatural speech; very annoying and objectionable&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Alternative scales may use different ranges (e.g., 1-100) or different qualitative descriptors, depending on the specific application and testing requirements.&amp;lt;ref&amp;gt;https://www.twilio.com/docs/glossary/what-is-mean-opinion-score-mos&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Testing Procedures ===&lt;br /&gt;
&lt;br /&gt;
Traditional MOS testing involves several key steps:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Subject Selection&#039;&#039;&#039;: Recruiting appropriate test participants, typically naive listeners without specialized training in audio quality assessment&lt;br /&gt;
# &#039;&#039;&#039;Environment Control&#039;&#039;&#039;: Conducting tests in acoustically controlled environments meeting ITU-T specifications&lt;br /&gt;
# &#039;&#039;&#039;Stimulus Presentation&#039;&#039;&#039;: Playing audio samples to subjects in randomized order to minimize bias&lt;br /&gt;
# &#039;&#039;&#039;Rating Collection&#039;&#039;&#039;: Having subjects rate each stimulus on the chosen scale&lt;br /&gt;
# &#039;&#039;&#039;Statistical Analysis&#039;&#039;&#039;: Calculating the arithmetic mean and confidence intervals for each stimulus&lt;br /&gt;
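The final step above can be sketched in a few lines; this is a minimal example (not prescribed by any ITU-T recommendation), using the common normal-approximation 95% confidence interval:

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Arithmetic-mean MOS over listener ratings for one stimulus, plus the
    half-width of an approximate 95% confidence interval (normal approx.)."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ACR scores from eight listeners for a single stimulus
mos, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])  # mos = 4.125
```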
&lt;br /&gt;
Modern extensions include comparative methods such as:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Degradation Category Rating&#039;&#039;&#039; (DCR): Subjects compare processed audio to a reference&lt;br /&gt;
* &#039;&#039;&#039;Comparison Category Rating&#039;&#039;&#039; (CCR): Direct comparison between two stimuli&lt;br /&gt;
&lt;br /&gt;
=== Objective Estimation ===&lt;br /&gt;
&lt;br /&gt;
While traditional MOS relies on human evaluation, objective models have been developed to predict MOS scores automatically. Key standardized methods include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;PESQ&#039;&#039;&#039; (ITU-T P.862): Perceptual Evaluation of Speech Quality, introduced in 2001&lt;br /&gt;
* &#039;&#039;&#039;POLQA&#039;&#039;&#039; (ITU-T P.863): Perceptual Objective Listening Quality Assessment, approved in 2011&lt;br /&gt;
* &#039;&#039;&#039;PSQM&#039;&#039;&#039; (ITU-T P.861): Perceptual Speech Quality Measure, the first standardized method from 1997&lt;br /&gt;
&lt;br /&gt;
These algorithms analyze acoustic properties of audio signals to estimate human perceptual quality, enabling automated quality monitoring and real-time assessment.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
=== Telecommunications ===&lt;br /&gt;
&lt;br /&gt;
MOS remains fundamental in telecommunications for evaluating voice and video call quality. In Voice over IP (VoIP) systems, MOS scores help assess the impact of network impairments such as packet loss, jitter, and latency on user experience. The G.711 codec, commonly used in VoIP, has a maximum theoretical MOS of 4.4, serving as a benchmark for quality comparisons.&amp;lt;ref&amp;gt;https://obkio.com/blog/measuring-voip-quality-with-mos-score-mean-opinion-score/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Telecommunications companies use MOS for:&lt;br /&gt;
&lt;br /&gt;
* Network planning and optimization&lt;br /&gt;
* Codec evaluation and selection&lt;br /&gt;
* Service level agreement monitoring&lt;br /&gt;
* Competitive benchmarking&lt;br /&gt;
&lt;br /&gt;
=== Speech Synthesis and AI ===&lt;br /&gt;
&lt;br /&gt;
In modern artificial intelligence applications, MOS has become critical for evaluating text-to-speech (TTS) systems. As synthetic speech quality has improved dramatically with neural approaches like WaveNet, Tacotron, and VITS, MOS remains the primary method for assessing how natural and human-like synthesized speech sounds to listeners.&lt;br /&gt;
&lt;br /&gt;
However, recent research has highlighted limitations of MOS for evaluating state-of-the-art speech synthesis systems. Studies have shown that as synthetic speech approaches human quality, MOS becomes less sensitive to remaining differences, leading researchers to explore complementary evaluation methods.&amp;lt;ref&amp;gt;https://www.sciencedirect.com/science/article/abs/pii/S0885230823000967&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Multimedia and Streaming ===&lt;br /&gt;
&lt;br /&gt;
MOS is extensively used in multimedia applications for evaluating:&lt;br /&gt;
&lt;br /&gt;
* Video streaming quality&lt;br /&gt;
* Audio codec performance&lt;br /&gt;
* Compression artifact assessment&lt;br /&gt;
* Real-time communication platforms&lt;br /&gt;
&lt;br /&gt;
Streaming services use MOS to optimize their delivery pipelines, balancing bandwidth efficiency with perceptual quality to ensure user satisfaction across diverse network conditions.&lt;br /&gt;
&lt;br /&gt;
== Modern Challenges and Limitations ==&lt;br /&gt;
&lt;br /&gt;
=== Statistical and Methodological Issues ===&lt;br /&gt;
&lt;br /&gt;
MOS faces several inherent limitations that researchers and practitioners must consider:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Ordinal Scale Problems&#039;&#039;&#039;: MOS ratings are based on ordinal scales where the ranking of items is known but intervals between ratings are not necessarily equal. Mathematically, calculating an arithmetic mean from ordinal data is problematic, and median values would be more appropriate. However, the practice of using arithmetic means is widely accepted and standardized.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Mean_opinion_score&amp;lt;/ref&amp;gt;&lt;br /&gt;
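A two-line illustration of the point, on hypothetical ratings: the arithmetic mean treats the ordinal 1-5 steps as equal intervals, while the median does not:

```python
import statistics

# Hypothetical ACR ratings: one outlier listener rated the stimulus 1
ratings = [1, 4, 4, 4, 4]
mean_mos = statistics.mean(ratings)      # 3.4 -- pulled down by the outlier
median_mos = statistics.median(ratings)  # 4   -- the typical judgment
```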
&lt;br /&gt;
&#039;&#039;&#039;Range-Equalization Bias&#039;&#039;&#039;: Test subjects tend to use the full rating scale during an experiment, making scores relative to the range of quality present in the test rather than absolute measures of quality. This prevents direct comparison of MOS scores from different experiments.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contextual Dependence&#039;&#039;&#039;: MOS values are influenced by the testing context, participant demographics, and the presence of anchor stimuli (very high or low quality samples that influence perception of other stimuli).&lt;br /&gt;
&lt;br /&gt;
=== Scalability and Cost ===&lt;br /&gt;
&lt;br /&gt;
Traditional MOS testing is time-consuming and expensive, requiring recruitment of human evaluators and controlled testing environments. This has led to increased interest in:&lt;br /&gt;
&lt;br /&gt;
* Crowdsourcing platforms for distributed evaluation&lt;br /&gt;
* Objective quality models that predict MOS&lt;br /&gt;
* Automated evaluation metrics that correlate with human perception&lt;br /&gt;
&lt;br /&gt;
=== Limitations in Advanced Applications ===&lt;br /&gt;
&lt;br /&gt;
As technology advances, particularly in AI-generated content, traditional MOS evaluation faces new challenges:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Ceiling Effects&#039;&#039;&#039;: When synthetic speech approaches human quality, MOS becomes less discriminative, with most systems scoring in the 4.0-4.5 range where small differences may not be statistically significant.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Missing Dimensions&#039;&#039;&#039;: MOS provides only an overall quality rating and may miss specific aspects like speaker similarity, emotional expression, or intelligibility of specific linguistic phenomena.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cultural and Linguistic Bias&#039;&#039;&#039;: MOS scores can vary based on evaluator demographics, language background, and cultural factors, potentially limiting generalizability across diverse user populations.&lt;br /&gt;
&lt;br /&gt;
== Contemporary Developments ==&lt;br /&gt;
&lt;br /&gt;
=== Crowdsourcing and Remote Testing ===&lt;br /&gt;
&lt;br /&gt;
ITU-T P.808 (2021) established guidelines for conducting MOS evaluations using crowdsourcing platforms, recognizing the need for scalable testing methods. This approach enables larger-scale evaluations but introduces new challenges in quality control and participant screening.&amp;lt;ref&amp;gt;https://www.itu.int/rec/dologin_pub.asp?lang=e&amp;amp;id=T-REC-P.808-202106-I!!PDF-E&amp;amp;type=items&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Deep Learning Integration ===&lt;br /&gt;
&lt;br /&gt;
Recent research explores using deep learning models for automatic MOS prediction, potentially enabling real-time quality assessment. Some approaches integrate MOS predictions into other tasks, such as fake audio detection, where predicted MOS scores help identify synthetic speech.&amp;lt;ref&amp;gt;https://arxiv.org/html/2401.13249v2&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Multi-Modal Assessment ===&lt;br /&gt;
&lt;br /&gt;
Modern applications increasingly require evaluation beyond audio quality alone. Research is extending MOS concepts to multi-modal scenarios, including audio-visual quality assessment and the evaluation of text-to-speech avatars that combine voice and visual synthesis.&lt;br /&gt;
&lt;br /&gt;
== Related Standards and Metrics ==&lt;br /&gt;
&lt;br /&gt;
Several ITU-T recommendations work in conjunction with MOS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;ITU-T G.107&#039;&#039;&#039;: The E-model for objective quality assessment that can be mapped to MOS scales&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.910&#039;&#039;&#039;: Subjective video quality assessment methods&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.863&#039;&#039;&#039;: POLQA objective quality measurement&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.862&#039;&#039;&#039;: PESQ objective speech quality assessment&lt;br /&gt;
&lt;br /&gt;
Other quality metrics used alongside MOS include technical measurements like signal-to-noise ratio, mean squared error, and mel-cepstral distortion, though these objective measures often correlate poorly with human perception.&lt;br /&gt;
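The G.107 E-model mapping mentioned above is a closed-form conversion from the transmission rating factor R to an estimated MOS; the sketch below transcribes the standard formula, clamping R to the 0-100 range where the formula itself yields MOS 1.0 and 4.5 at the endpoints:

```python
def r_to_mos(r: float) -> float:
    """Map the E-model rating factor R (ITU-T G.107) to an estimated MOS.
    Outside 0..100 the MOS saturates at 1.0 and 4.5 respectively."""
    r = max(0.0, min(100.0, r))
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# With G.107 default parameters a narrowband call scores R = 93.2,
# i.e. about MOS 4.41, consistent with the ~4.4 ceiling quoted for G.711.
default_mos = r_to_mos(93.2)
```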
&lt;br /&gt;
== Criticism and Future Directions ==&lt;br /&gt;
&lt;br /&gt;
The speech synthesis research community has increasingly questioned whether MOS alone is sufficient for evaluating modern high-quality systems. Critics argue that the field may have reached &amp;quot;the end of a cul-de-sac by only evaluating the overall quality with MOS&amp;quot; and advocate for developing new evaluation protocols better suited to analyzing advanced speech synthesis technologies.&amp;lt;ref&amp;gt;https://www.research.ed.ac.uk/en/publications/the-limits-of-the-mean-opinion-score-for-speech-synthesis-evaluat&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Proposed alternatives and extensions include:&lt;br /&gt;
&lt;br /&gt;
* Fine-grained evaluation of specific quality dimensions&lt;br /&gt;
* Task-specific intelligibility testing&lt;br /&gt;
* Comparative ranking methods that avoid absolute scaling issues&lt;br /&gt;
* Objective metrics that better correlate with human perception&lt;br /&gt;
&lt;br /&gt;
Despite these limitations, MOS remains the most widely accepted method for subjective quality assessment and continues to evolve to meet the needs of advancing technology.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.800/en ITU-T Recommendation P.800]&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.800.1/en ITU-T Recommendation P.800.1]&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.808/en ITU-T Recommendation P.808]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=VITS&amp;diff=51</id>
		<title>VITS</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=VITS&amp;diff=51"/>
		<updated>2025-09-22T02:51:59Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: relink&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;VITS&#039;&#039;&#039; (&#039;&#039;&#039;Variational Inference with adversarial learning for end-to-end Text-to-Speech&#039;&#039;&#039;) and &#039;&#039;&#039;VITS2&#039;&#039;&#039; are neural text-to-speech synthesis models that generate speech directly from text input using end-to-end training. VITS was first introduced by researchers at Kakao Enterprise in June 2021, while VITS2 was developed by SK Telecom and published in July 2023 as an improvement over the original model.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Traditional text-to-speech systems typically employ a two-stage pipeline: first converting text to intermediate representations like mel-spectrograms, then generating audio waveforms from these representations. VITS introduced a single-stage approach that generates natural-sounding audio directly from text using variational inference augmented with normalizing flows and adversarial training.&lt;br /&gt;
&lt;br /&gt;
The models are notable for achieving quality comparable to human speech while maintaining parallel generation capabilities, making them significantly faster than autoregressive alternatives. Human evaluation on the LJ Speech dataset showed that VITS outperformed the best publicly available TTS systems at the time and achieved a [[Mean Opinion Score|mean opinion score (MOS)]] comparable to ground truth recordings.&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
=== VITS (2021) ===&lt;br /&gt;
&lt;br /&gt;
VITS employs a conditional variational autoencoder (VAE) framework combined with several advanced techniques:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core Components:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Posterior Encoder:&#039;&#039;&#039; Processes linear-scale spectrograms during training to learn latent representations&lt;br /&gt;
* &#039;&#039;&#039;Prior Encoder:&#039;&#039;&#039; Contains a text encoder and normalizing flows to model the prior distribution of latent variables&lt;br /&gt;
* &#039;&#039;&#039;Decoder:&#039;&#039;&#039; Based on HiFi-GAN V1 generator, converts latent variables to raw waveforms&lt;br /&gt;
* &#039;&#039;&#039;Discriminator:&#039;&#039;&#039; Multi-period discriminator from HiFi-GAN for adversarial training&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Key Innovations:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Monotonic Alignment Search (MAS):&#039;&#039;&#039; Automatically learns alignments between text and speech without external annotations by finding alignments that maximize the likelihood of target speech.&lt;br /&gt;
* &#039;&#039;&#039;Stochastic Duration Predictor:&#039;&#039;&#039; Uses normalizing flows to model the distribution of phoneme durations, enabling synthesis of speech with diverse rhythms from the same text input.&lt;br /&gt;
* &#039;&#039;&#039;Adversarial Training:&#039;&#039;&#039; Improves waveform quality through generator-discriminator competition&lt;br /&gt;
&lt;br /&gt;
The model addresses the one-to-many relationship in speech synthesis, where a single text input can be spoken in multiple ways with different pitches, rhythms, and prosodic patterns.&lt;br /&gt;
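MAS itself is a Viterbi-style dynamic program; the sketch below (a simplified list-based version, not the official implementation) finds the monotonic alignment between N text tokens and T acoustic frames that maximizes the summed log-likelihood:

```python
import math

def monotonic_alignment_search(log_p):
    """log_p[i][j]: log-likelihood of frame j under text token i.
    Returns align, where align[j] is the (non-decreasing) token index
    assigned to frame j."""
    N, T = len(log_p), len(log_p[0])
    NEG = -math.inf
    Q = [[NEG] * T for _ in range(N)]
    Q[0][0] = log_p[0][0]
    for j in range(1, T):
        for i in range(min(j + 1, N)):  # token i needs at least i prior frames
            stay = Q[i][j - 1]                       # keep emitting token i
            advance = Q[i - 1][j - 1] if i else NEG  # move on to token i
            Q[i][j] = log_p[i][j] + max(stay, advance)
    # Backtrack from (N-1, T-1): the last frame aligns to the last token
    align, i = [0] * T, N - 1
    for j in range(T - 1, 0, -1):
        align[j] = i
        if i and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
    align[0] = i
    return align
```

In VITS the score matrix comes from the prior distribution over latent variables, so the alignment is re-estimated every step without any external duration labels.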
&lt;br /&gt;
=== VITS2 (2023) ===&lt;br /&gt;
&lt;br /&gt;
VITS2 introduced several improvements over the original model to address issues including intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Major Improvements:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Adversarial Duration Predictor:&#039;&#039;&#039; Replaced the flow-based stochastic duration predictor with one trained through adversarial learning, using a time step-wise conditional discriminator to improve efficiency and naturalness.&lt;br /&gt;
* &#039;&#039;&#039;Enhanced Normalizing Flows:&#039;&#039;&#039; Added transformer blocks to normalizing flows to capture long-term dependencies when transforming distributions, addressing limitations of convolution-only approaches.&lt;br /&gt;
* &#039;&#039;&#039;Improved Alignment Search:&#039;&#039;&#039; Modified Monotonic Alignment Search by adding Gaussian noise to calculated probabilities, giving the model additional opportunities to explore alternative alignments during early training.&lt;br /&gt;
* &#039;&#039;&#039;Speaker-Conditioned Text Encoder:&#039;&#039;&#039; For multi-speaker models, conditioned the speaker vector on the third transformer block of the text encoder to better capture speaker-specific pronunciation and intonation characteristics.&lt;br /&gt;
&lt;br /&gt;
== Performance and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
=== VITS Results ===&lt;br /&gt;
&lt;br /&gt;
On the LJ Speech dataset, VITS achieved a MOS of 4.43 (±0.06), compared to 4.46 (±0.06) for ground truth recordings. This outperformed Tacotron 2 + HiFi-GAN at 4.25 (±0.07) and Glow-TTS + HiFi-GAN at 4.32 (±0.07). The model also demonstrated significant improvements in synthesis speed, achieving 67.12× real-time factor compared to 27.48× for Glow-TTS + HiFi-GAN.&lt;br /&gt;
&lt;br /&gt;
=== VITS2 Results ===&lt;br /&gt;
&lt;br /&gt;
VITS2 showed further improvements with a MOS of 4.47 (±0.06) on LJ Speech, representing a 0.04-point increase over VITS. In comparative evaluations, VITS2 achieved a CMOS of 0.201 (±0.105) when compared directly to VITS. The model also improved synthesis speed to 97.25× real-time and reduced training time by approximately 22.7%.&lt;br /&gt;
&lt;br /&gt;
=== Multi-Speaker Capabilities ===&lt;br /&gt;
&lt;br /&gt;
Both models support multi-speaker synthesis. On the VCTK dataset containing 109 speakers, VITS achieved 4.38 (±0.06) MOS compared to 3.82 (±0.07) for Glow-TTS + HiFi-GAN. VITS2 further improved speaker similarity with a MOS of 3.99 (±0.08) compared to VITS&#039;s 3.79 (±0.09) on similarity evaluations.&lt;br /&gt;
&lt;br /&gt;
== End-to-End Capabilities ==&lt;br /&gt;
&lt;br /&gt;
A significant contribution of VITS2 was reducing dependence on phoneme conversion. Using character error rate (CER) evaluation with automatic speech recognition, VITS2 achieved 4.01% CER when using normalized text input compared to 3.92% with phoneme sequences, demonstrating the possibility of fully end-to-end training without explicit phoneme preprocessing.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/jaywalnut310/vits Official VITS repository]&lt;br /&gt;
* [https://github.com/daniilrobnikov/vits2 VITS2 implementation]&lt;br /&gt;
* [https://jaywalnut310.github.io/vits-demo/index.html VITS demo page]&lt;br /&gt;
* [https://vits-2.github.io/demo/ VITS2 demo page]&lt;br /&gt;
* [https://huggingface.co/docs/transformers/model_doc/vits VITS on Hugging Face]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Mean_Opinion_Score&amp;diff=50</id>
		<title>Mean Opinion Score</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Mean_Opinion_Score&amp;diff=50"/>
		<updated>2025-09-22T02:51:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add MOS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Mean Opinion Score&#039;&#039;&#039; (&#039;&#039;&#039;MOS&#039;&#039;&#039;) is a numerical measure used in telecommunications and multimedia engineering to represent the overall quality of a stimulus or system as perceived by human evaluators. It is calculated as the arithmetic mean of individual ratings given by test subjects on a predefined scale, typically ranging from 1 (lowest perceived quality) to 5 (highest perceived quality). MOS is widely used for evaluating voice, video, and audiovisual quality in applications ranging from traditional telephony to modern text-to-speech systems and streaming media.&lt;br /&gt;
&lt;br /&gt;
The methodology originated in the telecommunications industry for assessing telephone call quality and was formally standardized by the International Telecommunication Union (ITU-T) in Recommendation P.800 in 1996. Since then, it has become the gold standard for subjective quality assessment across various domains where human perception of quality is critical.&lt;br /&gt;
&lt;br /&gt;
== History and Development ==&lt;br /&gt;
&lt;br /&gt;
=== Early Origins ===&lt;br /&gt;
&lt;br /&gt;
The concept of Mean Opinion Score emerged in the telecommunications industry during the 1970s as telephone networks became more complex and digital transmission methods were introduced. Initially developed by the ITU, MOS provided a standardized way to assess voice transmission quality over telephone networks by aggregating human judgments of call quality.&amp;lt;ref&amp;gt;https://www.telecomtrainer.com/mos-mean-opinion-score/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The early methodology involved having listeners sit in controlled &amp;quot;quiet rooms&amp;quot; and score telephone call quality as they perceived it. This subjective testing approach had been in use in the telephony industry for decades before formal standardization, reflecting the industry&#039;s recognition that technical measurements alone could not capture the human experience of communication quality.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Mean_opinion_score&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Standardization ===&lt;br /&gt;
&lt;br /&gt;
The methodology was formally standardized in ITU-T Recommendation P.800, &amp;quot;Methods for subjective determination of transmission quality,&amp;quot; approved on August 30, 1996.&amp;lt;ref&amp;gt;https://www.itu.int/rec/T-REC-P.800-199608-I/en&amp;lt;/ref&amp;gt; This recommendation established rigorous protocols for conducting subjective quality tests, including specific requirements for test environments and procedures.&lt;br /&gt;
&lt;br /&gt;
The standardization specified that test subjects should be seated in quiet rooms with volumes between 30 and 120 cubic meters, reverberation times less than 500 milliseconds (preferably 200-300 ms), and room noise levels below 30 dBA with no dominant spectral peaks. These environmental controls ensured consistency and reliability in MOS evaluations across different testing facilities and organizations.&lt;br /&gt;
&lt;br /&gt;
=== Evolution and Extensions ===&lt;br /&gt;
&lt;br /&gt;
Following the success of P.800, the ITU-T developed additional recommendations to clarify and extend MOS methodology:&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;ITU-T P.800.1&#039;&#039;&#039; (2003, updated 2016): Established terminology for different types of MOS scores, distinguishing between listening quality subjective (MOS-LQS), listening quality objective (MOS-LQO), and listening quality estimated (MOS-LQE) to avoid confusion about the source and nature of scores.&amp;lt;ref&amp;gt;https://studylib.net/doc/8277727/itu-t-rec.-p.800.1--03-2003--mean-opinion-score--mos--ter&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;ITU-T P.800.2&#039;&#039;&#039;: Prescribed how MOS values should be reported, emphasizing that MOS scores from separate experiments cannot be directly compared unless explicitly designed for comparison and statistically validated.&lt;br /&gt;
&lt;br /&gt;
- &#039;&#039;&#039;ITU-T P.808&#039;&#039;&#039; (2021): Addressed crowdsourcing methods for conducting subjective evaluations, recognizing the need for scalable approaches to MOS testing in the digital age.&amp;lt;ref&amp;gt;https://www.itu.int/rec/dologin_pub.asp?lang=e&amp;amp;id=T-REC-P.808-202106-I!!PDF-E&amp;amp;type=items&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Methodology ==&lt;br /&gt;
&lt;br /&gt;
=== Rating Scales ===&lt;br /&gt;
&lt;br /&gt;
The most commonly used rating scale is the &#039;&#039;&#039;Absolute Category Rating&#039;&#039;&#039; (ACR) scale, which maps subjective quality ratings to numerical values:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Score !! Quality Level !! Description&lt;br /&gt;
|-&lt;br /&gt;
| 5 || Excellent || Completely natural speech; imperceptible artifacts&lt;br /&gt;
|-&lt;br /&gt;
| 4 || Good || Mostly natural speech; just perceptible but not annoying&lt;br /&gt;
|-&lt;br /&gt;
| 3 || Fair || Equally natural and unnatural; perceptible and slightly annoying&lt;br /&gt;
|-&lt;br /&gt;
| 2 || Poor || Mostly unnatural speech; annoying but not objectionable&lt;br /&gt;
|-&lt;br /&gt;
| 1 || Bad || Completely unnatural speech; very annoying and objectionable&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Alternative scales may use different ranges (e.g., 1-100) or different qualitative descriptors, depending on the specific application and testing requirements.&amp;lt;ref&amp;gt;https://www.twilio.com/docs/glossary/what-is-mean-opinion-score-mos&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Testing Procedures ===&lt;br /&gt;
&lt;br /&gt;
Traditional MOS testing involves several key steps:&lt;br /&gt;
&lt;br /&gt;
# &#039;&#039;&#039;Subject Selection&#039;&#039;&#039;: Recruiting appropriate test participants, typically naive listeners without specialized training in audio quality assessment&lt;br /&gt;
# &#039;&#039;&#039;Environment Control&#039;&#039;&#039;: Conducting tests in acoustically controlled environments meeting ITU-T specifications&lt;br /&gt;
# &#039;&#039;&#039;Stimulus Presentation&#039;&#039;&#039;: Playing audio samples to subjects in randomized order to minimize bias&lt;br /&gt;
# &#039;&#039;&#039;Rating Collection&#039;&#039;&#039;: Having subjects rate each stimulus on the chosen scale&lt;br /&gt;
# &#039;&#039;&#039;Statistical Analysis&#039;&#039;&#039;: Calculating the arithmetic mean and confidence intervals for each stimulus&lt;br /&gt;
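The arithmetic in the final step can be sketched in a few lines of Python. This is a minimal illustration with made-up ratings; it uses a normal-approximation 95% confidence interval rather than the full ITU-T procedure:&lt;br /&gt;

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Arithmetic mean of ACR ratings with a normal-approximation 95% CI."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width

# Illustrative example: 8 listeners rate one stimulus on the 1-5 ACR scale.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} (±{ci:.2f})")
```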
&lt;br /&gt;
Modern extensions include comparative methods such as:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Degradation Category Rating&#039;&#039;&#039; (DCR): Subjects compare processed audio to a reference&lt;br /&gt;
* &#039;&#039;&#039;Comparison Category Rating&#039;&#039;&#039; (CCR): Direct comparison between two stimuli&lt;br /&gt;
&lt;br /&gt;
=== Objective Estimation ===&lt;br /&gt;
&lt;br /&gt;
While traditional MOS relies on human evaluation, objective models have been developed to predict MOS scores automatically. Key standardized methods include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;PESQ&#039;&#039;&#039; (ITU-T P.862): Perceptual Evaluation of Speech Quality, introduced in 2001&lt;br /&gt;
* &#039;&#039;&#039;POLQA&#039;&#039;&#039; (ITU-T P.863): Perceptual Objective Listening Quality Assessment, approved in 2011&lt;br /&gt;
* &#039;&#039;&#039;PSQM&#039;&#039;&#039; (ITU-T P.861): Perceptual Speech Quality Measure, the first standardized method, approved in 1996&lt;br /&gt;
&lt;br /&gt;
These algorithms analyze acoustic properties of audio signals to estimate human perceptual quality, enabling automated quality monitoring and real-time assessment.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Applications ==&lt;br /&gt;
&lt;br /&gt;
=== Telecommunications ===&lt;br /&gt;
&lt;br /&gt;
MOS remains fundamental in telecommunications for evaluating voice and video call quality. In Voice over IP (VoIP) systems, MOS scores help assess the impact of network impairments such as packet loss, jitter, and latency on user experience. The G.711 codec, commonly used in VoIP, has a maximum theoretical MOS of 4.4, serving as a benchmark for quality comparisons.&amp;lt;ref&amp;gt;https://obkio.com/blog/measuring-voip-quality-with-mos-score-mean-opinion-score/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Telecommunications companies use MOS for:&lt;br /&gt;
&lt;br /&gt;
* Network planning and optimization&lt;br /&gt;
* Codec evaluation and selection&lt;br /&gt;
* Service level agreement monitoring&lt;br /&gt;
* Competitive benchmarking&lt;br /&gt;
&lt;br /&gt;
=== Speech Synthesis and AI ===&lt;br /&gt;
&lt;br /&gt;
In modern artificial intelligence applications, MOS has become critical for evaluating text-to-speech (TTS) systems. As synthetic speech quality has improved dramatically with neural approaches like WaveNet, Tacotron, and VITS, MOS remains the primary method for assessing how natural and human-like synthesized speech sounds to listeners.&lt;br /&gt;
&lt;br /&gt;
However, recent research has highlighted limitations of MOS for evaluating state-of-the-art speech synthesis systems. Studies have shown that as synthetic speech approaches human quality, MOS becomes less sensitive to remaining differences, leading researchers to explore complementary evaluation methods.&amp;lt;ref&amp;gt;https://www.sciencedirect.com/science/article/abs/pii/S0885230823000967&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Multimedia and Streaming ===&lt;br /&gt;
&lt;br /&gt;
MOS is extensively used in multimedia applications for evaluating:&lt;br /&gt;
&lt;br /&gt;
* Video streaming quality&lt;br /&gt;
* Audio codec performance&lt;br /&gt;
* Compression artifact assessment&lt;br /&gt;
* Real-time communication platforms&lt;br /&gt;
&lt;br /&gt;
Streaming services use MOS to optimize their delivery pipelines, balancing bandwidth efficiency with perceptual quality to ensure user satisfaction across diverse network conditions.&lt;br /&gt;
&lt;br /&gt;
== Modern Challenges and Limitations ==&lt;br /&gt;
&lt;br /&gt;
=== Statistical and Methodological Issues ===&lt;br /&gt;
&lt;br /&gt;
MOS faces several inherent limitations that researchers and practitioners must consider:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Ordinal Scale Problems&#039;&#039;&#039;: MOS ratings are based on ordinal scales where the ranking of items is known but intervals between ratings are not necessarily equal. Mathematically, calculating an arithmetic mean from ordinal data is problematic, and median values would be more appropriate. However, the practice of using arithmetic means is widely accepted and standardized.&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/Mean_opinion_score&amp;lt;/ref&amp;gt;&lt;br /&gt;
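A toy example makes the problem concrete: a polarizing stimulus and a uniformly rated stimulus can share the same arithmetic mean even though their rating distributions are very different, while the median tells them apart (the numbers here are illustrative only):&lt;br /&gt;

```python
import statistics

# Two stimuli rated on the 1-5 ACR scale (illustrative data).
polarizing = [5, 5, 5, 1]  # most listeners love it, one rejects it
consistent = [4, 4, 4, 4]  # everyone rates it "Good"

# Same arithmetic mean, so MOS cannot distinguish the two...
print(statistics.mean(polarizing), statistics.mean(consistent))
# ...but the median reflects the different distributions.
print(statistics.median(polarizing), statistics.median(consistent))
```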
&lt;br /&gt;
&#039;&#039;&#039;Range-Equalization Bias&#039;&#039;&#039;: Test subjects tend to use the full rating scale during an experiment, making scores relative to the range of quality present in the test rather than absolute measures of quality. This prevents direct comparison of MOS scores from different experiments.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contextual Dependence&#039;&#039;&#039;: MOS values are influenced by the testing context, participant demographics, and the presence of anchor stimuli (very high or low quality samples that influence perception of other stimuli).&lt;br /&gt;
&lt;br /&gt;
=== Scalability and Cost ===&lt;br /&gt;
&lt;br /&gt;
Traditional MOS testing is time-consuming and expensive, requiring recruitment of human evaluators and controlled testing environments. This has led to increased interest in:&lt;br /&gt;
&lt;br /&gt;
* Crowdsourcing platforms for distributed evaluation&lt;br /&gt;
* Objective quality models that predict MOS&lt;br /&gt;
* Automated evaluation metrics that correlate with human perception&lt;br /&gt;
&lt;br /&gt;
=== Limitations in Advanced Applications ===&lt;br /&gt;
&lt;br /&gt;
As technology advances, particularly in AI-generated content, traditional MOS evaluation faces new challenges:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Ceiling Effects&#039;&#039;&#039;: When synthetic speech approaches human quality, MOS becomes less discriminative, with most systems scoring in the 4.0-4.5 range where small differences may not be statistically significant.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Missing Dimensions&#039;&#039;&#039;: MOS provides only an overall quality rating and may miss specific aspects like speaker similarity, emotional expression, or intelligibility of specific linguistic phenomena.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cultural and Linguistic Bias&#039;&#039;&#039;: MOS scores can vary based on evaluator demographics, language background, and cultural factors, potentially limiting generalizability across diverse user populations.&lt;br /&gt;
&lt;br /&gt;
== Contemporary Developments ==&lt;br /&gt;
&lt;br /&gt;
=== Crowdsourcing and Remote Testing ===&lt;br /&gt;
&lt;br /&gt;
ITU-T P.808 (2021) established guidelines for conducting MOS evaluations using crowdsourcing platforms, recognizing the need for scalable testing methods. This approach enables larger-scale evaluations but introduces new challenges in quality control and participant screening.&amp;lt;ref&amp;gt;https://www.itu.int/rec/dologin_pub.asp?lang=e&amp;amp;id=T-REC-P.808-202106-I!!PDF-E&amp;amp;type=items&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Deep Learning Integration ===&lt;br /&gt;
&lt;br /&gt;
Recent research explores using deep learning models for automatic MOS prediction, potentially enabling real-time quality assessment. Some approaches integrate MOS predictions into other tasks, such as fake audio detection, where predicted MOS scores help identify synthetic speech.&amp;lt;ref&amp;gt;https://arxiv.org/html/2401.13249v2&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Multi-Modal Assessment ===&lt;br /&gt;
&lt;br /&gt;
Modern applications increasingly require evaluation beyond audio quality alone. Research is extending MOS concepts to multi-modal scenarios, including audio-visual quality assessment and the evaluation of text-to-speech avatars that combine voice and visual synthesis.&lt;br /&gt;
&lt;br /&gt;
== Related Standards and Metrics ==&lt;br /&gt;
&lt;br /&gt;
Several ITU-T recommendations work in conjunction with MOS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;ITU-T G.107&#039;&#039;&#039;: The E-model for objective quality assessment that can be mapped to MOS scales&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.910&#039;&#039;&#039;: Subjective video quality assessment methods&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.863&#039;&#039;&#039;: POLQA objective quality measurement&lt;br /&gt;
* &#039;&#039;&#039;ITU-T P.862&#039;&#039;&#039;: PESQ objective speech quality assessment&lt;br /&gt;
&lt;br /&gt;
Other quality metrics used alongside MOS include technical measurements like signal-to-noise ratio, mean squared error, and mel-cepstral distortion, though these objective measures often correlate poorly with human perception.&lt;br /&gt;
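ITU-T G.107 publishes a direct conversion from the E-model transmission rating factor R to an estimated MOS, which can be transcribed in a few lines (here clamping R to the 0..100 range covered by the formula):&lt;br /&gt;

```python
def r_to_mos(r):
    """Map an E-model R-factor (ITU-T G.107) to an estimated MOS.

    G.107 defines MOS = 1 + 0.035 R + R (R - 60) (100 - R) * 7e-6
    for R between 0 and 100; values outside that range clamp to
    MOS 1.0 and 4.5 respectively.
    """
    r = max(0.0, min(100.0, r))
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# An R-factor of about 80 (traditional "toll quality") maps to MOS ~4.0.
print(round(r_to_mos(80.0), 2))
```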
&lt;br /&gt;
== Criticism and Future Directions ==&lt;br /&gt;
&lt;br /&gt;
The speech synthesis research community has increasingly questioned whether MOS alone is sufficient for evaluating modern high-quality systems. Critics argue that the field may have reached &amp;quot;the end of a cul-de-sac by only evaluating the overall quality with MOS&amp;quot; and advocate for developing new evaluation protocols better suited to analyzing advanced speech synthesis technologies.&amp;lt;ref&amp;gt;https://www.research.ed.ac.uk/en/publications/the-limits-of-the-mean-opinion-score-for-speech-synthesis-evaluat&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Proposed alternatives and extensions include:&lt;br /&gt;
&lt;br /&gt;
* Fine-grained evaluation of specific quality dimensions&lt;br /&gt;
* Task-specific intelligibility testing&lt;br /&gt;
* Comparative ranking methods that avoid absolute scaling issues&lt;br /&gt;
* Objective metrics that better correlate with human perception&lt;br /&gt;
&lt;br /&gt;
Despite these limitations, MOS remains the most widely accepted method for subjective quality assessment and continues to evolve to meet the needs of advancing technology.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.800/en ITU-T Recommendation P.800]&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.800.1/en ITU-T Recommendation P.800.1]&lt;br /&gt;
* [https://www.itu.int/rec/T-REC-P.808/en ITU-T Recommendation P.808]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=VITS&amp;diff=49</id>
		<title>VITS</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=VITS&amp;diff=49"/>
		<updated>2025-09-22T02:39:06Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;VITS&amp;#039;&amp;#039;&amp;#039; (&amp;#039;&amp;#039;&amp;#039;Variational Inference with adversarial learning for end-to-end Text-to-Speech&amp;#039;&amp;#039;&amp;#039;) and &amp;#039;&amp;#039;&amp;#039;VITS2&amp;#039;&amp;#039;&amp;#039; are neural text-to-speech synthesis models that generate speech directly from text input using end-to-end training. VITS was first introduced by researchers at Kakao Enterprise in June 2021, while VITS2 was developed by SK Telecom and published in July 2023 as an improvement over the original model.  == Overview ==  Traditional text-to-speech systems typically...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;VITS&#039;&#039;&#039; (&#039;&#039;&#039;Variational Inference with adversarial learning for end-to-end Text-to-Speech&#039;&#039;&#039;) and &#039;&#039;&#039;VITS2&#039;&#039;&#039; are neural text-to-speech synthesis models that generate speech directly from text input using end-to-end training. VITS was first introduced by researchers at Kakao Enterprise in June 2021, while VITS2 was developed by SK Telecom and published in July 2023 as an improvement over the original model.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Traditional text-to-speech systems typically employ a two-stage pipeline: first converting text to intermediate representations like mel-spectrograms, then generating audio waveforms from these representations. VITS introduced a single-stage approach that generates natural-sounding audio directly from text using variational inference augmented with normalizing flows and adversarial training.&lt;br /&gt;
&lt;br /&gt;
The models are notable for achieving quality comparable to human speech while maintaining parallel generation capabilities, making them significantly faster than autoregressive alternatives. Human evaluation on the LJ Speech dataset showed that VITS outperformed the best publicly available TTS systems at the time and achieved a [[Mean opinion score|mean opinion score (MOS)]] comparable to ground truth recordings.&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
=== VITS (2021) ===&lt;br /&gt;
&lt;br /&gt;
VITS employs a conditional variational autoencoder (VAE) framework combined with several advanced techniques:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Core Components:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Posterior Encoder:&#039;&#039;&#039; Processes linear-scale spectrograms during training to learn latent representations&lt;br /&gt;
* &#039;&#039;&#039;Prior Encoder:&#039;&#039;&#039; Contains a text encoder and normalizing flows to model the prior distribution of latent variables&lt;br /&gt;
* &#039;&#039;&#039;Decoder:&#039;&#039;&#039; Based on HiFi-GAN V1 generator, converts latent variables to raw waveforms&lt;br /&gt;
* &#039;&#039;&#039;Discriminator:&#039;&#039;&#039; Multi-period discriminator from HiFi-GAN for adversarial training&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Key Innovations:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Monotonic Alignment Search (MAS):&#039;&#039;&#039; Automatically learns alignments between text and speech without external annotations by finding alignments that maximize the likelihood of target speech.&lt;br /&gt;
* &#039;&#039;&#039;Stochastic Duration Predictor:&#039;&#039;&#039; Uses normalizing flows to model the distribution of phoneme durations, enabling synthesis of speech with diverse rhythms from the same text input.&lt;br /&gt;
* &#039;&#039;&#039;Adversarial Training:&#039;&#039;&#039; Improves waveform quality through generator-discriminator competition&lt;br /&gt;
&lt;br /&gt;
The model addresses the one-to-many relationship in speech synthesis, where a single text input can be spoken in multiple ways with different pitches, rhythms, and prosodic patterns.&lt;br /&gt;
&lt;br /&gt;
=== VITS2 (2023) ===&lt;br /&gt;
&lt;br /&gt;
VITS2 introduced several improvements over the original model to address issues including intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Major Improvements:&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Adversarial Duration Predictor:&#039;&#039;&#039; Replaced the flow-based stochastic duration predictor with one trained through adversarial learning, using a time step-wise conditional discriminator to improve efficiency and naturalness.&lt;br /&gt;
* &#039;&#039;&#039;Enhanced Normalizing Flows:&#039;&#039;&#039; Added transformer blocks to normalizing flows to capture long-term dependencies when transforming distributions, addressing limitations of convolution-only approaches.&lt;br /&gt;
* &#039;&#039;&#039;Improved Alignment Search:&#039;&#039;&#039; Modified Monotonic Alignment Search by adding Gaussian noise to calculated probabilities, giving the model additional opportunities to explore alternative alignments during early training.&lt;br /&gt;
* &#039;&#039;&#039;Speaker-Conditioned Text Encoder:&#039;&#039;&#039; For multi-speaker models, conditioned the speaker vector on the third transformer block of the text encoder to better capture speaker-specific pronunciation and intonation characteristics.&lt;br /&gt;
&lt;br /&gt;
== Performance and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
=== VITS Results ===&lt;br /&gt;
&lt;br /&gt;
On the LJ Speech dataset, VITS achieved a MOS of 4.43 (±0.06), compared to 4.46 (±0.06) for ground truth recordings. This outperformed Tacotron 2 + HiFi-GAN at 4.25 (±0.07) and Glow-TTS + HiFi-GAN at 4.32 (±0.07). The model also demonstrated significant improvements in synthesis speed, running 67.12× faster than real time compared to 27.48× for Glow-TTS + HiFi-GAN.&lt;br /&gt;
&lt;br /&gt;
=== VITS2 Results ===&lt;br /&gt;
&lt;br /&gt;
VITS2 showed further improvements with a MOS of 4.47 (±0.06) on LJ Speech, an increase over the 4.43 reported for VITS. In comparative evaluations, VITS2 achieved a CMOS of 0.201 (±0.105) when compared directly to VITS. The model also improved synthesis speed to 97.25× faster than real time and reduced training time by approximately 22.7%.&lt;br /&gt;
&lt;br /&gt;
=== Multi-Speaker Capabilities ===&lt;br /&gt;
&lt;br /&gt;
Both models support multi-speaker synthesis. On the VCTK dataset containing 109 speakers, VITS achieved 4.38 (±0.06) MOS compared to 3.82 (±0.07) for Glow-TTS + HiFi-GAN. VITS2 further improved speaker similarity with a MOS of 3.99 (±0.08) compared to VITS&#039;s 3.79 (±0.09) on similarity evaluations.&lt;br /&gt;
&lt;br /&gt;
== End-to-End Capabilities ==&lt;br /&gt;
&lt;br /&gt;
A significant contribution of VITS2 was reducing dependence on phoneme conversion. Using character error rate (CER) evaluation with automatic speech recognition, VITS2 achieved 4.01% CER when using normalized text input compared to 3.92% with phoneme sequences, demonstrating the possibility of fully end-to-end training without explicit phoneme preprocessing.&lt;br /&gt;
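Character error rate itself is simple to compute: the Levenshtein edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal sketch (not the exact tooling used in the paper):&lt;br /&gt;

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One substituted character out of 16 gives a CER of 6.25%.
print(cer("speech synthesis", "speech syntheses"))
```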
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/jaywalnut310/vits Official VITS repository]&lt;br /&gt;
* [https://github.com/daniilrobnikov/vits2 VITS2 implementation]&lt;br /&gt;
* [https://jaywalnut310.github.io/vits-demo/index.html VITS demo page]&lt;br /&gt;
* [https://vits-2.github.io/demo/ VITS2 demo page]&lt;br /&gt;
* [https://huggingface.co/docs/transformers/model_doc/vits VITS on Hugging Face]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=48</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=48"/>
		<updated>2025-09-22T02:27:23Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: /* Open Source Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
|[[VoxCPM]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English, Chinese&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[IndexTTS2]]&lt;br /&gt;
|Custom (restrictive)&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ✅ || ✅ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Chatterbox]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[MegaTTS 3]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English, Chinese&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Orpheus TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CSM-1B]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Kokoro-82M]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CosyVoice 2.0]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|Chinese, English, Japanese, Korean&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2024&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
|[[XTTSv2]]&lt;br /&gt;
|CPML (restrictive)&lt;br /&gt;
|English, +16&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2023&lt;br /&gt;
|-&lt;br /&gt;
|[[Tortoise TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2022&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=IndexTTS2&amp;diff=47</id>
		<title>IndexTTS2</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=IndexTTS2&amp;diff=47"/>
		<updated>2025-09-21T20:46:02Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add IndexTTS 2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model|name=IndexTTS2|developer=Bilibili AI Platform Department|release_date=September 2025 (paper)|latest_version=2.0|architecture=Autoregressive Transformer|parameters=Undisclosed|training_data=55,000 hours multilingual|languages=Chinese, English, Japanese|voices=Zero-shot voice cloning|voice_cloning=Yes (emotion-timbre disentangled)|emotion_control=Yes (multimodal input)|streaming=Yes|latency=Not specified|license=Custom/restrictive (commercial license available)|open_source=Yes|code_repository=https://github.com/index-tts/index-tts|model_weights=https://huggingface.co/IndexTeam/IndexTTS-2|demo=https://index-tts.github.io/index-tts2.github.io/|website=https://indextts2.org}}&#039;&#039;&#039;IndexTTS2&#039;&#039;&#039; is an open-source text-to-speech model developed by Bilibili&#039;s AI Platform Department loosely based on [[Tortoise TTS]]. Released in September 2025, it addresses key limitations of traditional TTS models by introducing precise duration control and advanced emotional expression capabilities while maintaining the naturalness advantages of autoregressive generation.&lt;br /&gt;
&lt;br /&gt;
== Development and Background ==&lt;br /&gt;
IndexTTS2 was developed by a team led by Siyi Zhou at Bilibili&#039;s Artificial Intelligence Platform Department in China. The project builds upon the earlier IndexTTS model, incorporating substantial improvements in duration control, emotional modeling, and speech stability. The research was published on arXiv in September 2025, with the paper titled &amp;quot;IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech.&amp;quot;&amp;lt;ref&amp;gt;https://arxiv.org/abs/2506.21619&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The development was motivated by specific limitations in existing autoregressive TTS models, particularly their inability to precisely control speech duration - a critical requirement for applications such as video dubbing that demand strict audio-visual synchronization. Additionally, the team sought to address the limited emotional expressiveness of existing systems, which are often constrained by scarce high-quality emotional training data.&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
IndexTTS2 employs a three-module cascaded architecture:&lt;br /&gt;
&lt;br /&gt;
=== Text-to-Semantic (T2S) Module ===&lt;br /&gt;
The T2S module serves as the core component, utilizing an autoregressive Transformer framework to generate semantic tokens from input text, timbre prompts, style prompts, and optional speech token counts. Key innovations include:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Duration Control Mechanism&#039;&#039;&#039;: A novel duration encoding system where duration information p is computed from target semantic token length T using the formula p = W_num h(T), where W_num represents an embedding table and h(T) returns a one-hot vector&lt;br /&gt;
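Because h(T) is one-hot, the product W_num h(T) reduces to selecting a single column of the embedding table. A NumPy sketch of this encoding follows; the dimensions (a maximum of 2048 tokens, 512-dimensional embeddings) and the column-wise orientation of W_num are illustrative assumptions, not values from the paper:&lt;br /&gt;

```python
import numpy as np

T_MAX, D = 2048, 512  # assumed max token count and embedding size
rng = np.random.default_rng(0)
W_num = rng.standard_normal((D, T_MAX))  # embedding table (stand-in values)

def h(T):
    """One-hot vector selecting the target semantic token count T."""
    v = np.zeros(T_MAX)
    v[T] = 1.0
    return v

T = 150                  # desired number of semantic tokens
p = W_num @ h(T)         # duration embedding fed to the T2S module
# The matrix-vector product with a one-hot vector is just a column lookup.
assert np.allclose(p, W_num[:, T])
```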
&lt;br /&gt;
&#039;&#039;&#039;Emotion-Speaker Disentanglement&#039;&#039;&#039;: Implementation of a Gradient Reversal Layer (GRL) to separate emotional features from speaker-specific characteristics, enabling independent control over timbre and emotion&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Three-Stage Training Strategy&#039;&#039;&#039;: A progressive training approach designed to overcome the scarcity of high-quality emotional data while maintaining model stability&lt;br /&gt;
&lt;br /&gt;
=== Semantic-to-Mel (S2M) Module ===&lt;br /&gt;
The S2M module employs a non-autoregressive architecture based on flow matching to convert semantic tokens into mel-spectrograms. Notable features include:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;GPT Latent Enhancement&#039;&#039;&#039;: Integration of latent features from the T2S module&#039;s final transformer layer to improve speech clarity, particularly during emotionally expressive synthesis&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Speaker Embedding Integration&#039;&#039;&#039;: Concatenation of speaker embeddings with semantic features to ensure timbre consistency&lt;br /&gt;
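&lt;br /&gt;
Flow matching trains a network to regress a velocity field along an interpolation path from noise to data. A common linear-path conditional flow-matching target (an assumption here; the paper&#039;s exact formulation may differ) looks like:&lt;br /&gt;

```python
import numpy as np

def cfm_training_pair(x1, t, rng):
    # Linear-path conditional flow matching:
    # sample noise x0, interpolate x_t = (1 - t) * x0 + t * x1,
    # and regress the network's output at (x_t, t) onto the velocity x1 - x0.
    x0 = rng.standard_normal(x1.shape)
    xt = (1 - t) * x0 + t * x1
    velocity_target = x1 - x0
    return xt, velocity_target
```

At inference, integrating the learned velocity field from t = 0 to t = 1 transports noise into a mel-spectrogram, conditioned here on the semantic tokens, speaker embedding, and T2S latents.&lt;br /&gt;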
&lt;br /&gt;
=== Vocoder ===&lt;br /&gt;
The system utilizes [[BigVGANv2]] as its vocoder to convert mel-spectrograms into final audio waveforms, chosen for its superior audio quality and stability compared to previous vocoders.&lt;br /&gt;
&lt;br /&gt;
== Key Features and Capabilities ==&lt;br /&gt;
&lt;br /&gt;
=== Precise Duration Control ===&lt;br /&gt;
IndexTTS2 is claimed to be the first autoregressive zero-shot TTS model to achieve precise duration control. The model supports two generation modes:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specified Duration Mode&#039;&#039;&#039;: Users can explicitly specify the number of generated tokens to control speech duration with millisecond-level precision&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Natural Duration Mode&#039;&#039;&#039;: Free-form generation that faithfully reproduces prosodic features from input prompts without duration constraints&lt;br /&gt;
&lt;br /&gt;
This capability addresses critical requirements for applications like video dubbing, where precise synchronization between audio and visual content is essential.&lt;br /&gt;
&lt;br /&gt;
=== Fine-Grained Emotional Control ===&lt;br /&gt;
The system offers multiple methods for emotional control:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference Audio Emotion&#039;&#039;&#039;: Extraction of emotional characteristics from style prompt audio&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Natural Language Descriptions&#039;&#039;&#039;: Text-based emotion control using a specialized Text-to-Emotion (T2E) module&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Emotion Vector Input&#039;&#039;&#039;: Direct specification of emotional states through numerical vectors&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Cross-Speaker Emotion Transfer&#039;&#039;&#039;: Ability to apply emotional characteristics from one speaker to the voice of another&lt;br /&gt;
&lt;br /&gt;
=== Text-to-Emotion (T2E) Module ===&lt;br /&gt;
A specialized component that enables natural language-based emotion control through:&lt;br /&gt;
&lt;br /&gt;
* Knowledge distillation from DeepSeek-R1 to Qwen3-1.7B&lt;br /&gt;
* Support for seven basic emotions: Anger, Happiness, Fear, Disgust, Sadness, Surprise, and Neutral&lt;br /&gt;
* Generation of emotion probability distributions that are combined with precomputed emotion embeddings&lt;br /&gt;
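&lt;br /&gt;
The last step, combining a predicted distribution over the seven basic emotions with precomputed emotion embeddings, amounts to a probability-weighted sum; the embedding dimension and values below are placeholders:&lt;br /&gt;

```python
import numpy as np

EMOTIONS = ["Anger", "Happiness", "Fear", "Disgust", "Sadness", "Surprise", "Neutral"]
rng = np.random.default_rng(0)
# Placeholder precomputed embedding per basic emotion (8-dimensional here)
emotion_table = {e: rng.standard_normal(8) for e in EMOTIONS}

def mix_emotions(probs):
    # Weighted sum of the emotion embeddings under the predicted distribution
    assert np.isclose(sum(probs.values()), 1.0)
    return sum(p * emotion_table[e] for e, p in probs.items())

vec = mix_emotions({"Happiness": 0.7, "Surprise": 0.3})
```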
&lt;br /&gt;
== Training and Dataset ==&lt;br /&gt;
IndexTTS2 was trained on a substantial multilingual corpus:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Total Training Data&#039;&#039;&#039;: 55,000 hours comprising 30,000 hours of Chinese data and 25,000 hours of English data&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Emotional Data&#039;&#039;&#039;: 135 hours of specialized emotional speech from 361 speakers&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Training Infrastructure&#039;&#039;&#039;: 8 NVIDIA A100 80GB GPUs, using the AdamW optimizer with a learning rate of 2e-4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Training Duration&#039;&#039;&#039;: Three weeks total training time&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Data Sources&#039;&#039;&#039;: Primarily from the Emilia dataset, supplemented with audiobooks and commercial data&lt;br /&gt;
&lt;br /&gt;
The three-stage training methodology includes:&lt;br /&gt;
&lt;br /&gt;
# Foundation training on the full dataset with duration control capabilities&lt;br /&gt;
# Emotional control refinement using curated emotional data with GRL-based disentanglement&lt;br /&gt;
# Robustness improvement through fine-tuning on the complete dataset&lt;br /&gt;
&lt;br /&gt;
== Performance Evaluation ==&lt;br /&gt;
&lt;br /&gt;
=== Objective Metrics ===&lt;br /&gt;
Based on evaluation across multiple datasets (LibriSpeech-test-clean, SeedTTS test-zh/en, AIShell-1), IndexTTS2 demonstrates:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Superior Word Error Rates&#039;&#039;&#039;: Outperforms baseline models including MaskGCT, F5-TTS, CosyVoice2, and SparkTTS across most test sets&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Strong Speaker Similarity&#039;&#039;&#039;: Achieves competitive speaker similarity scores while maintaining improved speech clarity&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Emotional Fidelity&#039;&#039;&#039;: Highest emotion similarity scores among evaluated models&lt;br /&gt;
&lt;br /&gt;
=== Subjective Evaluation ===&lt;br /&gt;
Human evaluation using Mean Opinion Scores (MOS) across multiple dimensions shows:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quality MOS&#039;&#039;&#039;: Consistent superiority in perceived audio quality&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Similarity MOS&#039;&#039;&#039;: Strong performance in perceived speaker similarity&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Prosody MOS&#039;&#039;&#039;: Enhanced prosodic naturalness compared to baseline models&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Emotion MOS&#039;&#039;&#039;: Significant improvements in emotional expressiveness&lt;br /&gt;
&lt;br /&gt;
=== Duration Control Accuracy ===&lt;br /&gt;
Precision testing reveals minimal token number error rates:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Original duration&#039;&#039;&#039;: &amp;lt;0.02% error rate&lt;br /&gt;
* &#039;&#039;&#039;Scaled durations (0.875×-1.125×)&#039;&#039;&#039;: &amp;lt;0.03% error rate&lt;br /&gt;
* &#039;&#039;&#039;Larger scaling factors&#039;&#039;&#039;: maximum 0.067% error rate&lt;br /&gt;
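&lt;br /&gt;
Interpreting these figures as the relative deviation between the requested and generated token counts (an assumption about how the metric is defined), the computation is simply:&lt;br /&gt;

```python
def token_count_error(target_tokens, generated_tokens):
    # Relative token-count deviation, as a fraction of the requested length
    return abs(generated_tokens - target_tokens) / target_tokens

# e.g. generating 3002 tokens when 3000 were requested is roughly a 0.067% error
err = token_count_error(3000, 3002)
```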
&lt;br /&gt;
== Comparison with Existing Systems ==&lt;br /&gt;
IndexTTS2 distinguishes itself from contemporary TTS systems through:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Versus ElevenLabs&#039;&#039;&#039;: Open-source nature and precise duration control capabilities&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Versus Traditional TTS&#039;&#039;&#039;: Enhanced emotional expressiveness and zero-shot voice cloning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Versus Other Open-Source Systems&#039;&#039;&#039;: First autoregressive model with precise duration control&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Versus Non-Autoregressive Models&#039;&#039;&#039;: Maintains naturalness advantages while adding duration precision&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
[https://github.com/index-tts/index-tts Official IndexTTS Repository]&lt;br /&gt;
&lt;br /&gt;
[https://index-tts.github.io/index-tts2.github.io/ IndexTTS2 Demo Page]&lt;br /&gt;
&lt;br /&gt;
[https://huggingface.co/IndexTeam IndexTTS2 Models on Hugging Face]&lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/abs/2506.21619 IndexTTS2 Research Paper]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=46</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=46"/>
		<updated>2025-09-21T20:36:07Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: add some new models&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
|[[IndexTTS2]]&lt;br /&gt;
|Custom (restrictive)&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ✅ || ✅ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Chatterbox]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[MegaTTS 3]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English, Chinese&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Orpheus TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CSM-1B]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CosyVoice 2.0]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|Chinese, English, Japanese, Korean&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2024&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=ElevenLabs&amp;diff=45</id>
		<title>ElevenLabs</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=ElevenLabs&amp;diff=45"/>
		<updated>2025-09-20T20:48:57Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add ElevenLabs page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = ElevenLabs&lt;br /&gt;
| developer = ElevenLabs Inc.&lt;br /&gt;
| release_date = January 2023 (beta)&lt;br /&gt;
| latest_version = Eleven v3 (alpha)&lt;br /&gt;
| languages = 32+ languages&lt;br /&gt;
| voices = 1000+ voices&lt;br /&gt;
| voice_cloning = Yes (professional &amp;amp; instant)&lt;br /&gt;
| emotion_control = Yes (via audio tags)&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| latency = ~135ms (Flash models)&lt;br /&gt;
| open_source = No&lt;br /&gt;
| website = https://elevenlabs.io&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;ElevenLabs&#039;&#039;&#039; is a commercial artificial intelligence company specializing in text-to-speech synthesis and voice cloning technology. Founded in 2022 by Piotr Dąbkowski and Mateusz Staniszewski, the company has gained prominence for its AI-generated voices that can replicate human speech patterns, emotions, and intonation across multiple languages.&lt;br /&gt;
&lt;br /&gt;
== History and Founding ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs was co-founded in 2022 by Piotr Dąbkowski, a former Google machine learning engineer, and Mateusz Staniszewski, an ex-Palantir deployment strategist. Both founders, originally from Poland, reportedly drew inspiration from the poor quality of film dubbing they experienced while watching American movies in their home country.&amp;lt;ref&amp;gt;https://venturebeat.com/ai/now-hear-this-voice-cloning-ai-startup-elevenlabs-nabs-19m-from-a16z-and-other-heavy-hitters&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The founders first met as teenagers at Copernicus High School in Warsaw before pursuing separate academic paths—Dąbkowski studying at Oxford and Cambridge, while Staniszewski studied mathematics in London. Their shared vision of making quality content accessible across all languages led to the creation of ElevenLabs as a research-first company.&amp;lt;ref&amp;gt;https://research.contrary.com/company/elevenlabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The company launched its beta platform in January 2023, quickly gaining traction with over one million users within five months. This rapid adoption demonstrated market demand for high-quality AI voice synthesis technology.&amp;lt;ref&amp;gt;https://research.contrary.com/company/elevenlabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Funding and Valuation ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs has experienced rapid growth in both user adoption and valuation:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pre-seed (January 2023)&#039;&#039;&#039;: $2 million led by Credo Ventures and Concept Ventures&lt;br /&gt;
* &#039;&#039;&#039;Series A (June 2023)&#039;&#039;&#039;: $19 million at $100 million valuation, co-led by Andreessen Horowitz, Nat Friedman, and Daniel Gross&lt;br /&gt;
* &#039;&#039;&#039;Series B (January 2024)&#039;&#039;&#039;: $80 million at $1.1 billion valuation, achieving unicorn status&lt;br /&gt;
* &#039;&#039;&#039;Series C (January 2025)&#039;&#039;&#039;: $180 million at $3.3 billion valuation, led by Andreessen Horowitz and ICONIQ Growth&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/ElevenLabs&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The company reportedly achieved $200 million in annual recurring revenue (ARR) by August 2025, demonstrating significant commercial traction.&amp;lt;ref&amp;gt;https://sacra.com/c/elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technology and Products ==&lt;br /&gt;
&lt;br /&gt;
=== Core Technology ===&lt;br /&gt;
&lt;br /&gt;
ElevenLabs&#039;s architecture is proprietary and remains undisclosed, with little information about it publicly available. Some have speculated that early versions of ElevenLabs were based on Tortoise TTS; however, these rumors remain unverified.&amp;lt;ref&amp;gt;https://github.com/neonbjb/tortoise-tts/discussions/277&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Product Portfolio ===&lt;br /&gt;
&lt;br /&gt;
==== Text-to-Speech Models ====&lt;br /&gt;
&lt;br /&gt;
ElevenLabs offers several model variants optimized for different use cases:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Multilingual v2&#039;&#039;&#039;: High-quality model supporting 29+ languages, optimized for audiobooks and professional content&lt;br /&gt;
* &#039;&#039;&#039;Flash v2.5&#039;&#039;&#039;: Ultra-low latency model (75ms) designed for real-time conversational applications&lt;br /&gt;
* &#039;&#039;&#039;Turbo v2.5&#039;&#039;&#039;: Balanced quality and speed model for general-purpose applications&lt;br /&gt;
* &#039;&#039;&#039;Eleven v3 (alpha)&#039;&#039;&#039;: Latest model featuring advanced emotion control via audio tags&lt;br /&gt;
* &#039;&#039;&#039;Eleven Scribe v1&#039;&#039;&#039;: State-of-the-art automatic speech recognition model&lt;br /&gt;
* &#039;&#039;&#039;Eleven Music v1&#039;&#039;&#039;: Text-to-music model trained on licensed data&amp;lt;ref&amp;gt;https://elevenlabs.io/docs/models&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://elevenlabs.io/music&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Voice Cloning ====&lt;br /&gt;
&lt;br /&gt;
The platform provides two voice cloning approaches:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Instant Voice Cloning&#039;&#039;&#039;: Creates voice replicas from short audio samples (1-5 minutes)&lt;br /&gt;
* &#039;&#039;&#039;Professional Voice Cloning&#039;&#039;&#039;: Higher-fidelity, fine-tuning-based cloning requiring longer training samples&lt;br /&gt;
&lt;br /&gt;
==== Additional Features ====&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;AI Dubbing&#039;&#039;&#039;: Translates and dubs content while preserving original voice characteristics and emotions&lt;br /&gt;
* &#039;&#039;&#039;Voice Design&#039;&#039;&#039;: Tool for creating entirely synthetic voices from text descriptions&lt;br /&gt;
* &#039;&#039;&#039;Speech Classifier&#039;&#039;&#039;: Detection tool to identify AI-generated audio from ElevenLabs&#039; technology&lt;br /&gt;
* &#039;&#039;&#039;Projects&#039;&#039;&#039;: Long-form content creation tool for audiobooks and extended narration&lt;br /&gt;
&lt;br /&gt;
== Business Model and Pricing ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs operates on a freemium subscription model with usage-based pricing:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Free Tier&#039;&#039;&#039;: 10,000 characters per month with basic voices&lt;br /&gt;
* &#039;&#039;&#039;Starter&#039;&#039;&#039;: $5/month with commercial licensing&lt;br /&gt;
* &#039;&#039;&#039;Creator&#039;&#039;&#039;: $11/month with enhanced features&lt;br /&gt;
* &#039;&#039;&#039;Pro&#039;&#039;&#039;: $99/month for professional use&lt;br /&gt;
* &#039;&#039;&#039;Enterprise&#039;&#039;&#039;: Custom pricing with SLAs and dedicated support&lt;br /&gt;
&lt;br /&gt;
The company has evolved its pricing structure multiple times, transitioning from simple character-based billing to more complex model-aware systems and back to unified credit systems as it scaled.&amp;lt;ref&amp;gt;https://flexprice.io/blog/elevenlabs-pricing-breakdown&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance and Benchmarks ==&lt;br /&gt;
&lt;br /&gt;
Independent evaluations have provided mixed results regarding ElevenLabs&#039; performance relative to competitors:&lt;br /&gt;
&lt;br /&gt;
=== Competitive Analysis ===&lt;br /&gt;
&lt;br /&gt;
According to third-party benchmarks:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Voice Quality&#039;&#039;&#039;: ElevenLabs demonstrates superior Mean Opinion Scores (MOS) compared to Google Cloud Text-to-Speech across fiction, non-fiction, and conversational content&amp;lt;ref&amp;gt;https://unrealspeech.com/compare/elevenlabs-vs-google-text-to-speech&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Latency&#039;&#039;&#039;: Flash models achieve approximately 135ms Time to First Audio (TTFA), competitive with major cloud providers&amp;lt;ref&amp;gt;https://cartesia.ai/vs/elevenlabs-vs-microsoft-azure-text-to-speech&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;Accuracy&#039;&#039;&#039;: Word Error Rates vary but generally maintain competitive performance with established providers&lt;br /&gt;
&lt;br /&gt;
However, these evaluations should be interpreted cautiously as they often come from companies with commercial interests in the TTS space, and standardized, independent benchmarking in the industry remains limited.&lt;br /&gt;
&lt;br /&gt;
== Controversies and Ethical Concerns ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs has faced significant criticism regarding the misuse of its technology:&lt;br /&gt;
&lt;br /&gt;
=== Early Misuse Incidents ===&lt;br /&gt;
&lt;br /&gt;
Shortly after the beta launch in January 2023, the platform was exploited by users on 4chan and other forums to create fake audio content. Notable incidents included:&lt;br /&gt;
&lt;br /&gt;
* Creation of celebrity deepfakes, including voices of Emma Watson, Alexandria Ocasio-Cortez, and Ben Shapiro making statements they never made&lt;br /&gt;
* Generation of racist, sexist, and homophobic content using cloned voices&amp;lt;ref&amp;gt;https://www.vice.com/en/article/ai-voice-firm-4chan-celebrity-voices-emma-watson-joe-rogan-elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Political Deepfakes ===&lt;br /&gt;
&lt;br /&gt;
In January 2024, ElevenLabs&#039; technology was used to create a robocall impersonating President Joe Biden, urging New Hampshire voters not to participate in the Democratic primary. The incident prompted investigation by the New Hampshire Attorney General&#039;s office and led to the suspension of the responsible user account.&amp;lt;ref&amp;gt;https://www.bloomberg.com/news/articles/2024-01-26/ai-startup-elevenlabs-bans-account-blamed-for-biden-audio-deepfake&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Legal Challenges ===&lt;br /&gt;
&lt;br /&gt;
The company faces ongoing legal challenges, including:&lt;br /&gt;
&lt;br /&gt;
* A lawsuit from voice actors Mark Boyett and Karissa Vacker, alleging unauthorized use of their voices to create the &amp;quot;Adam&amp;quot; and &amp;quot;Bella&amp;quot; default voices&lt;br /&gt;
* Claims of copyright infringement related to the use of audiobook recordings for training&amp;lt;ref&amp;gt;https://www.thevoicerealm.com/blog/a-look-into-the-elevenlabs-lawsuit/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Safety Measures ===&lt;br /&gt;
&lt;br /&gt;
In response to misuse concerns, ElevenLabs has implemented several safeguards:&lt;br /&gt;
&lt;br /&gt;
* Verification requirements for voice cloning features&lt;br /&gt;
* AI Speech Classifier for detecting ElevenLabs-generated content&lt;br /&gt;
* Partnership with Reality Defender for deepfake detection&lt;br /&gt;
* Mandatory credit card information for certain features&amp;lt;ref&amp;gt;https://www.bloomberg.com/news/articles/2024-07-18/elevenlabs-partners-with-reality-defender-to-combat-deepfake-audio&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Applications and Use Cases ==&lt;br /&gt;
&lt;br /&gt;
ElevenLabs technology is utilized across various industries:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Media and Entertainment&#039;&#039;&#039;: Audiobook production, podcast creation, film dubbing&lt;br /&gt;
* &#039;&#039;&#039;Gaming&#039;&#039;&#039;: Character voice generation for video games&lt;br /&gt;
* &#039;&#039;&#039;Education&#039;&#039;&#039;: Educational content narration and language learning&lt;br /&gt;
* &#039;&#039;&#039;Enterprise&#039;&#039;&#039;: Customer service automation, training materials&lt;br /&gt;
* &#039;&#039;&#039;Accessibility&#039;&#039;&#039;: Tools for visually impaired users&lt;br /&gt;
&lt;br /&gt;
The company reports that 41% of Fortune 500 companies use its platform, with notable customers including The Washington Post, TIME magazine, and HarperCollins Publishers.&amp;lt;ref&amp;gt;https://sacra.com/c/elevenlabs/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://elevenlabs.io/ Official ElevenLabs website]&lt;br /&gt;
* [https://elevenlabs.io/docs/ ElevenLabs Documentation]&lt;br /&gt;
* [https://elevenlabs.io/text-to-speech Text-to-Speech Demo]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=MIT_License&amp;diff=44</id>
		<title>MIT License</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=MIT_License&amp;diff=44"/>
		<updated>2025-09-20T20:10:13Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;MIT License&amp;#039;&amp;#039;&amp;#039; is a permissive free software license originally developed at the Massachusetts Institute of Technology (MIT). It is one of the most popular open-source licenses used in software development and is commonly used for licensing both code and model weights.  == Overview ==  The MIT License is characterized by its simplicity and permissive nature. It allows users to do almost anything with the licensed software, including using, copying, modifying, merg...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;MIT License&#039;&#039;&#039; is a permissive free software license originally developed at the Massachusetts Institute of Technology (MIT). It is one of the most popular open-source licenses used in software development and is commonly used for licensing both code and [[model weights]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
The MIT License is characterized by its simplicity and permissive nature. It allows users to do almost anything with the licensed software, including using, copying, modifying, merging, publishing, distributing, sublicensing, and selling copies, with minimal restrictions.&lt;br /&gt;
&lt;br /&gt;
The license requires only that the original copyright notice and license text be included in all copies or substantial portions of the software. Unlike more restrictive licenses such as the [[GNU General Public License]] (GPL), the MIT License does not require derivative works to be released under the same license terms.&lt;br /&gt;
&lt;br /&gt;
== Key Permissions ==&lt;br /&gt;
&lt;br /&gt;
Under the MIT License, users are granted the following rights:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Commercial use&#039;&#039;&#039;: The software can be used for commercial purposes&lt;br /&gt;
* &#039;&#039;&#039;Modification&#039;&#039;&#039;: The software can be modified and adapted&lt;br /&gt;
* &#039;&#039;&#039;Distribution&#039;&#039;&#039;: The software can be distributed to others&lt;br /&gt;
* &#039;&#039;&#039;Private use&#039;&#039;&#039;: The software can be used privately without disclosure&lt;br /&gt;
* &#039;&#039;&#039;Sublicensing&#039;&#039;&#039;: The software can be incorporated into projects with different licenses&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
The MIT License has minimal requirements:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;License and copyright notice&#039;&#039;&#039;: The original license text and copyright notice must be included with any distribution of the software&lt;br /&gt;
* &#039;&#039;&#039;No trademark use&#039;&#039;&#039;: The license does not grant permission to use the licensor&#039;s trademarks or trade names&lt;br /&gt;
&lt;br /&gt;
== Limitations ==&lt;br /&gt;
&lt;br /&gt;
The MIT License provides no warranty and limits liability:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;No warranty&#039;&#039;&#039;: The software is provided &amp;quot;as is&amp;quot; without any express or implied warranties&lt;br /&gt;
* &#039;&#039;&#039;Limited liability&#039;&#039;&#039;: The license disclaims liability for damages arising from the use of the software&lt;br /&gt;
* &#039;&#039;&#039;No patent protection&#039;&#039;&#039;: Unlike some other licenses, the MIT License does not explicitly grant patent rights&lt;br /&gt;
&lt;br /&gt;
== Comparison with Other Licenses ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! License !! Commercial Use !! Modification !! Distribution !! Copyleft !! Patent Grant&lt;br /&gt;
|-&lt;br /&gt;
| MIT || Yes || Yes || Yes || No || No&lt;br /&gt;
|-&lt;br /&gt;
| [[Apache 2.0]] || Yes || Yes || Yes || No || Yes&lt;br /&gt;
|-&lt;br /&gt;
| [[GPL v3]] || Yes || Yes || Yes || Yes || Yes&lt;br /&gt;
|-&lt;br /&gt;
| [[BSD-3-Clause]] || Yes || Yes || Yes || No || No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== License Text ==&lt;br /&gt;
&lt;br /&gt;
The complete MIT License text is relatively short:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
MIT License&lt;br /&gt;
&lt;br /&gt;
Copyright (c) [year] [fullname]&lt;br /&gt;
&lt;br /&gt;
Permission is hereby granted, free of charge, to any person obtaining a copy&lt;br /&gt;
of this software and associated documentation files (the &amp;quot;Software&amp;quot;), to deal&lt;br /&gt;
in the Software without restriction, including without limitation the rights&lt;br /&gt;
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell&lt;br /&gt;
copies of the Software, and to permit persons to whom the Software is&lt;br /&gt;
furnished to do so, subject to the following conditions:&lt;br /&gt;
&lt;br /&gt;
The above copyright notice and this permission notice shall be included in all&lt;br /&gt;
copies or substantial portions of the Software.&lt;br /&gt;
&lt;br /&gt;
THE SOFTWARE IS PROVIDED &amp;quot;AS IS&amp;quot;, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR&lt;br /&gt;
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,&lt;br /&gt;
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE&lt;br /&gt;
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER&lt;br /&gt;
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,&lt;br /&gt;
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE&lt;br /&gt;
SOFTWARE.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* [https://opensource.org/licenses/MIT Open Source Initiative - MIT License]&lt;br /&gt;
* [https://choosealicense.com/licenses/mit/ Choose a License - MIT]&lt;br /&gt;
&lt;br /&gt;
[[Category:Licenses]]&lt;br /&gt;
[[Category:Open Source]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=43</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=43"/>
		<updated>2025-09-20T16:28:18Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add year of release&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ✅ || ✅ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Chatterbox]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Orpheus TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CSM-1B]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CosyVoice 2.0]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|Chinese, English, Japanese, Korean&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2024&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ &lt;br /&gt;
|❌|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Chatterbox&amp;diff=42</id>
		<title>Chatterbox</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Chatterbox&amp;diff=42"/>
		<updated>2025-09-20T16:27:59Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add Chatterbox&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = Chatterbox&lt;br /&gt;
| developer = [[Resemble AI]]&lt;br /&gt;
| release_date = May 2025&lt;br /&gt;
| latest_version = Multilingual 2.0&lt;br /&gt;
| architecture = [[CosyVoice 2.0]]-based&lt;br /&gt;
| parameters = 500 million&lt;br /&gt;
| training_data = 500,000 hours cleaned data&lt;br /&gt;
| languages = 23 languages (multilingual version)&lt;br /&gt;
| voices = Zero-shot voice cloning&lt;br /&gt;
| voice_cloning = Yes (5-second reference)&lt;br /&gt;
| emotion_control = Yes (exaggeration parameter)&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| latency = Sub-200ms&lt;br /&gt;
| license = [[MIT License]]&lt;br /&gt;
| open_source = Yes&lt;br /&gt;
| code_repository = [https://github.com/resemble-ai/chatterbox GitHub]&lt;br /&gt;
| model_weights = [https://huggingface.co/ResembleAI/chatterbox Hugging Face]&lt;br /&gt;
| demo = [https://huggingface.co/spaces/ResembleAI/Chatterbox HF Spaces]&lt;br /&gt;
| website = [https://www.resemble.ai/chatterbox/ resemble.ai/chatterbox]&lt;br /&gt;
}}&lt;br /&gt;
&#039;&#039;&#039;Chatterbox&#039;&#039;&#039; is an open-source [[text-to-speech]] (TTS) model developed by [[Resemble AI]] and released in May 2025. Built on a [[CosyVoice|CosyVoice 2.0]]-style modified [[Llama]] architecture with 500M parameters, it is marketed as the first open-source TTS model to include controllable emotion exaggeration, and it has gained attention for claiming to outperform established commercial systems in user preference evaluations.&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
Chatterbox was developed by a three-person team at Resemble AI, a voice technology company founded by Zohaib Ahmed and Saqib Muhammad.&amp;lt;ref&amp;gt;https://www.digitalocean.com/community/tutorials/resemble-chatterbox-tts-text-to-speech&amp;lt;/ref&amp;gt; The initial English-only version was released in May 2025 under the [[MIT License]], followed by a multilingual version supporting 23 languages in September 2025.&amp;lt;ref&amp;gt;https://www.resemble.ai/introducing-chatterbox-multilingual-open-source-tts-for-23-languages/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The project quickly gained popularity in the open-source community, accumulating over 1 million downloads on [[Hugging Face]] and more than 11,000 stars on [[GitHub]] within weeks of release.&amp;lt;ref name=&amp;quot;multilingual&amp;quot;&amp;gt;https://www.resemble.ai/introducing-chatterbox-multilingual-open-source-tts-for-23-languages/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
Chatterbox utilizes a 500-million parameter model based on a CosyVoice-style modified Llama architecture, significantly smaller than many contemporary TTS systems. The model was trained on approximately 500,000 hours of cleaned audio data and employs what the developers term &amp;quot;alignment-informed inference&amp;quot; for improved stability during generation.&lt;br /&gt;
&lt;br /&gt;
Key technical features include:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Zero-shot voice cloning&#039;&#039;&#039;: Ability to clone voices using as little as 5 seconds of reference audio&lt;br /&gt;
* &#039;&#039;&#039;Emotion exaggeration control&#039;&#039;&#039;: A novel parameter allowing users to adjust emotional intensity from monotone to dramatically expressive&lt;br /&gt;
* &#039;&#039;&#039;Fast inference&#039;&#039;&#039;: Sub-200ms latency for real-time applications&lt;br /&gt;
* &#039;&#039;&#039;Multilingual support&#039;&#039;&#039;: The updated version supports 23 languages including Arabic, Chinese, Hindi, and major European languages&lt;br /&gt;
&lt;br /&gt;
== Performance Claims and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
Resemble AI conducted a comparative evaluation through [[Podonos]], a third-party evaluation service, testing Chatterbox against [[ElevenLabs]], a leading commercial TTS system. In blind A/B testing, 63.75% of evaluators reportedly preferred Chatterbox&#039;s output over ElevenLabs&#039;.&amp;lt;ref&amp;gt;https://www.podonos.com/blog/chatterbox&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://www.resemble.ai/chatterbox/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these results should be interpreted with caution, as the evaluation was limited in scope and conducted by a single third-party service. The testing methodology, sample size, and demographic composition of evaluators have not been independently verified. Additionally, the comparison was limited to a single competitor rather than a comprehensive benchmark against multiple state-of-the-art systems.&lt;br /&gt;
&lt;br /&gt;
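To make the caution above concrete, the sketch below shows how much the uncertainty around a 63.75% preference rate depends on the number of evaluators. The source does not publish a sample size, so every value of n here is hypothetical, chosen purely for illustration.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% margin of error for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.6375  # reported preference rate for Chatterbox
for n in (40, 100, 400):  # hypothetical evaluator counts, not from the source
    print(n, round(margin_of_error(p, n), 3))
# n=40  -> ±0.149 (the 50/50 line is inside the interval's reach)
# n=100 -> ±0.094
# n=400 -> ±0.047
```

With a few dozen raters the interval is wide enough that the result is weak evidence; with several hundred it would be fairly decisive, which is why the unpublished sample size matters.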
== Commercial and Research Impact ==&lt;br /&gt;
&lt;br /&gt;
The release of Chatterbox has been significant for the open-source TTS community, representing one of the first production-grade systems to be freely available under a permissive license. This has enabled developers to integrate high-quality TTS capabilities into applications without licensing costs or vendor dependencies.&lt;br /&gt;
&lt;br /&gt;
The system has found applications in various domains including:&lt;br /&gt;
&lt;br /&gt;
* Audiobook generation and voice narration&lt;br /&gt;
* Game development for non-player character dialogue&lt;br /&gt;
* Educational content creation&lt;br /&gt;
* Accessibility tools for visually impaired users&lt;br /&gt;
* Research and development in speech synthesis&lt;br /&gt;
&lt;br /&gt;
Resemble AI also offers a commercial &amp;quot;Pro&amp;quot; version with enhanced features, service-level agreements, and custom fine-tuning capabilities for enterprise customers requiring guaranteed performance and support. This version is available through their inference partners, such as FAL.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/resemble-ai/chatterbox Official Chatterbox repository]&lt;br /&gt;
* [https://huggingface.co/ResembleAI/chatterbox Model on Hugging Face]&lt;br /&gt;
* [https://huggingface.co/spaces/ResembleAI/Chatterbox Interactive demo]&lt;br /&gt;
* [https://resemble-ai.github.io/chatterbox_demopage/ Demo page with audio samples]&lt;br /&gt;
&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Open-source software]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Voice technology]]&lt;br /&gt;
[[Category:MIT License software]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=41</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=41"/>
		<updated>2025-09-20T16:26:52Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: /* Open Source Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational &lt;br /&gt;
!Fine-Tuning!! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ✅ || ✅ &lt;br /&gt;
|✅|| 2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Chatterbox]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Orpheus TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CSM-1B]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|✅&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[[CosyVoice 2.0]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|Chinese, English, Japanese, Korean&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2024&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ &lt;br /&gt;
|❌|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=40</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=40"/>
		<updated>2025-09-20T16:22:49Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: cosyvoice&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational &lt;br /&gt;
!Fine-Tuning!! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ❌ || ✅ &lt;br /&gt;
|✅|| 2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Chatterbox]]&lt;br /&gt;
|MIT&lt;br /&gt;
|English&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[Orpheus TTS]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|English&lt;br /&gt;
|❌&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2025&lt;br /&gt;
|-&lt;br /&gt;
|[[CosyVoice 2.0]]&lt;br /&gt;
|Apache-2.0&lt;br /&gt;
|Chinese, English, Japanese, Korean&lt;br /&gt;
|✅&lt;br /&gt;
|❌&lt;br /&gt;
|✅&lt;br /&gt;
|2024&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ &lt;br /&gt;
|❌|| 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ &lt;br /&gt;
|✅|| 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=VibeVoice&amp;diff=39</id>
		<title>VibeVoice</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=VibeVoice&amp;diff=39"/>
		<updated>2025-09-20T16:18:30Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add VibeVoice infobox&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = VibeVoice&lt;br /&gt;
| developer = [[Microsoft Research]]&lt;br /&gt;
| release_date = August 26, 2025&lt;br /&gt;
| latest_version = 7B&lt;br /&gt;
| architecture = [[Qwen]] 2.5 + Diffusion&lt;br /&gt;
| parameters = 1.5B / 7B&lt;br /&gt;
| training_data = Proprietary dataset&lt;br /&gt;
| languages = English, Chinese&lt;br /&gt;
| voices = 4 speakers maximum&lt;br /&gt;
| voice_cloning = Yes&lt;br /&gt;
| emotion_control = Limited&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| license = MIT&lt;br /&gt;
| open_source = Limited (code removed)&lt;br /&gt;
| code_repository = [https://github.com/vibevoice-community/VibeVoice Community fork]&lt;br /&gt;
| model_weights = [https://huggingface.co/vibevoice Community backup]&lt;br /&gt;
| website = [https://aka.ms/VibeVoice Microsoft]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;VibeVoice&#039;&#039;&#039; is an experimental [[text-to-speech]] (TTS) framework developed by [[Microsoft Research]] for generating long-form, multi-speaker conversational audio. Released in August 2025, it is designed to synthesize long-form content such as podcasts and audiobooks with up to four speakers and support for voice cloning.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice was developed by a team at Microsoft Research led by Zhiliang Peng, Jianwei Yu, and others, with the technical report published on [[arXiv]] in August 2025.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt; The project was initially released as open-source software on GitHub and [[Hugging Face]], with model weights made publicly available under the MIT license.&lt;br /&gt;
&lt;br /&gt;
However, the release was disrupted in September 2025 when Microsoft removed the official repository and model weights from public access. According to Microsoft&#039;s statement, this action was taken after discovering &amp;quot;instances where the tool was used in ways inconsistent with the stated intent&amp;quot; and concerns about responsible AI use.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt; The repository was later restored without implementation code, while community-maintained forks preserved the original materials. The 1.5B pretrained model remains available on Microsoft&#039;s Hugging Face page, but the 7B model was taken down.&lt;br /&gt;
&lt;br /&gt;
Community forks have preserved backups of the code and model weights, including both the 1.5B and 7B models.&amp;lt;ref&amp;gt;https://github.com/vibevoice-community/VibeVoice&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://huggingface.co/aoi-ot/VibeVoice-Large&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice uses a hybrid architecture combining large language models with diffusion-based audio generation. It relies on two specialized tokenizers operating at an ultra-low 7.5 Hz frame rate:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Acoustic Tokenizer&#039;&#039;&#039;: A [[variational autoencoder]] (VAE) based encoder-decoder that compresses audio signals while preserving fidelity&lt;br /&gt;
* &#039;&#039;&#039;Semantic Tokenizer&#039;&#039;&#039;: A content-focused encoder trained using [[automatic speech recognition]] as a proxy task&lt;br /&gt;
&lt;br /&gt;
The core model utilizes [[Qwen2.5]] as its base large language model (available in 1.5B and 7B parameter variants), integrated with a lightweight diffusion head for generating acoustic features. This design achieves what the researchers claim is an 80-fold improvement in data compression compared to the [[Encodec]] model while maintaining audio quality.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
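The 80-fold compression claim can be sanity-checked with simple token-rate arithmetic. The VibeVoice figures (7.5 Hz frame rate, 90-minute maximum length) come from the report above; Encodec's 75 Hz frame rate and 8 residual codebooks are typical published settings for its 24 kHz model, assumed here only for comparison.

```python
def token_count(duration_s: float, frame_hz: float, codebooks: int = 1) -> int:
    """Total discrete tokens needed for a clip at a given frame rate."""
    return round(duration_s * frame_hz * codebooks)

ninety_min = 90 * 60  # seconds, VibeVoice's stated maximum length

vibevoice_tokens = token_count(ninety_min, 7.5)            # single 7.5 Hz stream
encodec_tokens = token_count(ninety_min, 75, codebooks=8)  # assumed Encodec setup

print(vibevoice_tokens)                    # 40500
print(encodec_tokens)                      # 3240000
print(encodec_tokens / vibevoice_tokens)   # 80.0
```

Under these assumptions a 90-minute session fits in about 40k acoustic tokens per stream, which is what makes LLM-style modeling of such long sequences tractable, and the ratio works out to exactly the claimed 80x.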
== Capabilities and Limitations ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice can generate speech sequences up to 90 minutes in length with support for up to four distinct speakers. The model demonstrates several emergent capabilities not explicitly trained for, including:&lt;br /&gt;
&lt;br /&gt;
* Cross-lingual speech synthesis&lt;br /&gt;
* Spontaneous singing (though often off-key)&lt;br /&gt;
* Contextual background music generation&lt;br /&gt;
* Voice cloning from short prompts&lt;br /&gt;
&lt;br /&gt;
However, the system has notable limitations:&lt;br /&gt;
&lt;br /&gt;
* Language support restricted to English and Chinese&lt;br /&gt;
* No explicit modeling of overlapping speech&lt;br /&gt;
* Occasional instability, particularly with Chinese text synthesis&lt;br /&gt;
* Uncontrolled generation of background sounds and music&lt;br /&gt;
* Limited commercial viability due to various technical constraints&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
In comparative evaluations against contemporary TTS systems including [[ElevenLabs]], [[Google]]&#039;s Gemini 2.5 Pro TTS, and others, VibeVoice reportedly achieved superior scores in subjective metrics of realism, richness, and user preference. The model also demonstrated competitive [[word error rate]]s when evaluated using speech recognition systems.&amp;lt;ref&amp;gt;https://huggingface.co/microsoft/VibeVoice-1.5B&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these evaluations were conducted on a limited test set of eight conversational transcripts totaling approximately one hour of audio, raising questions about the generalizability of the results to broader use cases.&lt;br /&gt;
&lt;br /&gt;
== Controversies and Concerns ==&lt;br /&gt;
&lt;br /&gt;
The temporary removal of VibeVoice from public access highlighted ongoing concerns about the potential misuse of high-quality synthetic speech technology. Microsoft explicitly warned about the potential for creating [[deepfake]] audio content for impersonation, fraud, or disinformation purposes.&lt;br /&gt;
&lt;br /&gt;
The model&#039;s ability to generate convincing speech from minimal voice prompts, combined with its long-form generation capabilities, raised particular concerns among AI safety researchers about potential misuse for creating fake audio content at scale.&lt;br /&gt;
&lt;br /&gt;
== Community Response ==&lt;br /&gt;
&lt;br /&gt;
Following Microsoft&#039;s temporary withdrawal of the official release, the open-source community created several preservation efforts:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/vibevoice-community/VibeVoice vibevoice-community/VibeVoice]&#039;&#039;&#039;: A community-maintained fork preserving the original codebase and model weights&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/voicepowered-ai/VibeVoice-finetuning VibeVoice-finetuning]&#039;&#039;&#039;: Unofficial tools for fine-tuning the models using [[Low-Rank Adaptation]] (LoRA) techniques&lt;br /&gt;
&lt;br /&gt;
These community efforts have enabled continued research and development despite the official restrictions, though they operate independently of Microsoft&#039;s oversight.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/vibevoice-community/VibeVoice Community-maintained VibeVoice repository]&lt;br /&gt;
* [https://arxiv.org/abs/2508.19205 Original technical report on arXiv]&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Microsoft Research]]&lt;br /&gt;
[[Category:Open-source software]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Nobold&amp;diff=38</id>
		<title>Template:Nobold</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Nobold&amp;diff=38"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;templatestyles src=&amp;quot;Nobold/styles.css&amp;quot;/&amp;gt;&amp;lt;span class=&amp;quot;nobold&amp;quot;&amp;gt;{{{1}}}&amp;lt;/span&amp;gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
{{documentation}}&lt;br /&gt;
&amp;lt;!-- PLEASE ADD CATEGORIES AND INTERWIKIS TO THE /doc SUBPAGE, THANKS --&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Main_other&amp;diff=36</id>
		<title>Template:Main other</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Main_other&amp;diff=36"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{safesubst:&amp;lt;noinclude/&amp;gt;#switch:&lt;br /&gt;
  &amp;lt;noinclude&amp;gt;&amp;lt;!-- If no or empty &amp;quot;demospace&amp;quot; parameter then detect namespace --&amp;gt;&amp;lt;/noinclude&amp;gt;&lt;br /&gt;
  {{safesubst:&amp;lt;noinclude/&amp;gt;#if:{{{demospace|}}} &lt;br /&gt;
  | {{safesubst:&amp;lt;noinclude/&amp;gt;lc: {{{demospace}}} }}    &amp;lt;noinclude&amp;gt;&amp;lt;!-- Use lower case &amp;quot;demospace&amp;quot; --&amp;gt;&amp;lt;/noinclude&amp;gt;&lt;br /&gt;
  | {{safesubst:&amp;lt;noinclude/&amp;gt;#ifeq:{{safesubst:&amp;lt;noinclude/&amp;gt;NAMESPACE}}|{{safesubst:&amp;lt;noinclude/&amp;gt;ns:0}}&lt;br /&gt;
    | main&lt;br /&gt;
    | other&lt;br /&gt;
    }} &lt;br /&gt;
  }}&lt;br /&gt;
| main     = {{{1|}}}&lt;br /&gt;
| other&lt;br /&gt;
| #default = {{{2|}}}&lt;br /&gt;
}}&amp;lt;noinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{documentation}}&lt;br /&gt;
&amp;lt;!-- Add categories to the /doc subpage; interwikis go to Wikidata, thank you! --&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Yesno&amp;diff=34</id>
		<title>Template:Yesno</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Yesno&amp;diff=34"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{&amp;lt;includeonly&amp;gt;safesubst:&amp;lt;/includeonly&amp;gt;#switch: {{&amp;lt;includeonly&amp;gt;safesubst:&amp;lt;/includeonly&amp;gt;lc: {{{1|¬}}} }}&lt;br /&gt;
 |no&lt;br /&gt;
 |n&lt;br /&gt;
 |f&lt;br /&gt;
 |false&lt;br /&gt;
 |off&lt;br /&gt;
 |0        = {{{no|&amp;lt;!-- null --&amp;gt;}}}&lt;br /&gt;
 |         = {{{blank|{{{no|&amp;lt;!-- null --&amp;gt;}}}}}}&lt;br /&gt;
 |¬        = {{{¬|}}}&lt;br /&gt;
 |yes&lt;br /&gt;
 |y&lt;br /&gt;
 |t&lt;br /&gt;
 |true&lt;br /&gt;
 |on&lt;br /&gt;
 |1        = {{{yes|yes}}}&lt;br /&gt;
 |#default = {{{def|{{{yes|yes}}}}}}&lt;br /&gt;
}}&amp;lt;noinclude&amp;gt;&lt;br /&gt;
{{Documentation}}&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Template_link&amp;diff=32</id>
		<title>Template:Template link</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Template_link&amp;diff=32"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;span class=&amp;quot;nowrap&amp;quot;&amp;gt;&amp;amp;#123;&amp;amp;#123;&amp;lt;/span&amp;gt;[[Template:{{{1}}}|{{{1}}}]]&amp;lt;span class=&amp;quot;nowrap&amp;quot;&amp;gt;&amp;amp;#125;&amp;amp;#125;&amp;lt;/span&amp;gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
{{documentation}}&lt;br /&gt;
&amp;lt;!-- Categories go on the /doc subpage and interwikis go on Wikidata. --&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Tl&amp;diff=30</id>
		<title>Template:Tl</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Tl&amp;diff=30"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Template:Template link]]&lt;br /&gt;
&lt;br /&gt;
{{Redirect category shell|&lt;br /&gt;
{{R from move}}&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Infobox&amp;diff=28</id>
		<title>Template:Infobox</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Infobox&amp;diff=28"/>
		<updated>2025-09-20T16:12:14Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: 1 revision imported: Add infobox template&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{#invoke:Infobox|infobox}}&amp;lt;includeonly&amp;gt;{{template other|{{#ifeq:{{PAGENAME}}|Infobox||{{#ifeq:{{str left|{{SUBPAGENAME}}|7}}|Infobox|[[Category:Infobox templates|{{remove first word|{{SUBPAGENAME}}}}]]}}}}|}}&amp;lt;/includeonly&amp;gt;&amp;lt;noinclude&amp;gt;&lt;br /&gt;
{{documentation}}&lt;br /&gt;
&amp;lt;!-- Categories go in the /doc subpage, and interwikis go in Wikidata. --&amp;gt;&lt;br /&gt;
&amp;lt;/noinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=26</id>
		<title>MediaWiki:Common.css</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=26"/>
		<updated>2025-09-20T16:07:17Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;/* CSS placed here will be applied to all skins */&lt;br /&gt;
&lt;br /&gt;
.portal-column-left {&lt;br /&gt;
    float: left;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
.portal-column-right {&lt;br /&gt;
    float: right;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Two column layout on wider screens */&lt;br /&gt;
@media (min-width: 720px) {&lt;br /&gt;
    .portal-column-left {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-right: 2%;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .portal-column-right {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-left: 2%;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
/*infobox*/&lt;br /&gt;
&lt;br /&gt;
/* Base infobox styling */&lt;br /&gt;
.infobox {&lt;br /&gt;
    border: 1px solid #a2a9b1;&lt;br /&gt;
    border-spacing: 3px;&lt;br /&gt;
    background-color: #f8f9fa;&lt;br /&gt;
    color: black;&lt;br /&gt;
    margin: 0.5em 0 0.5em 1em;&lt;br /&gt;
    padding: 0.2em;&lt;br /&gt;
    float: right;&lt;br /&gt;
    clear: right;&lt;br /&gt;
    font-size: 88%;&lt;br /&gt;
    line-height: 1.5em;&lt;br /&gt;
    width: 22em;&lt;br /&gt;
    box-sizing: border-box;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
.infobox th {&lt;br /&gt;
    background-color: #eaecf0;&lt;br /&gt;
    text-align: center;&lt;br /&gt;
    font-weight: bold;&lt;br /&gt;
    padding: 4px 8px;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
.infobox td {&lt;br /&gt;
    padding: 4px 8px;&lt;br /&gt;
    word-wrap: break-word;&lt;br /&gt;
    overflow-wrap: break-word;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* TTS-specific styling */&lt;br /&gt;
.tts-infobox {&lt;br /&gt;
    max-width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Mobile responsiveness */&lt;br /&gt;
@media screen and (max-width: 640px) {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        float: none !important;&lt;br /&gt;
        margin: 1em 0 !important;&lt;br /&gt;
        width: 100% !important;&lt;br /&gt;
        max-width: none !important;&lt;br /&gt;
        box-sizing: border-box;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .tts-infobox {&lt;br /&gt;
        width: 100% !important;&lt;br /&gt;
        margin-left: 0 !important;&lt;br /&gt;
        margin-right: 0 !important;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .infobox th,&lt;br /&gt;
    .infobox td {&lt;br /&gt;
        font-size: 14px;&lt;br /&gt;
        padding: 6px 8px;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    /* Make links more touch-friendly on mobile */&lt;br /&gt;
    .infobox a {&lt;br /&gt;
        min-height: 44px;&lt;br /&gt;
        display: inline-block;&lt;br /&gt;
        line-height: 1.4;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Tablet responsiveness */&lt;br /&gt;
@media screen and (max-width: 900px) and (min-width: 641px) {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        width: 18em;&lt;br /&gt;
        font-size: 85%;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Very small screens */&lt;br /&gt;
@media screen and (max-width: 480px) {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        font-size: 90% !important;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .infobox th {&lt;br /&gt;
        font-size: 110% !important;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    /* Stack label and value on very small screens */&lt;br /&gt;
    .infobox tr td:first-child {&lt;br /&gt;
        display: block;&lt;br /&gt;
        width: 100%;&lt;br /&gt;
        font-weight: bold;&lt;br /&gt;
        border-bottom: none;&lt;br /&gt;
        padding-bottom: 2px;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .infobox tr td:last-child {&lt;br /&gt;
        display: block;&lt;br /&gt;
        width: 100%;&lt;br /&gt;
        padding-top: 2px;&lt;br /&gt;
        padding-left: 12px;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* High DPI displays */&lt;br /&gt;
@media screen and (-webkit-min-device-pixel-ratio: 2), &lt;br /&gt;
       screen and (min-resolution: 192dpi) {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        border-width: 0.5px;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Print styles */&lt;br /&gt;
@media print {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        float: none !important;&lt;br /&gt;
        width: 100% !important;&lt;br /&gt;
        margin: 1em 0 !important;&lt;br /&gt;
        page-break-inside: avoid;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Dark mode support (optional) */&lt;br /&gt;
@media (prefers-color-scheme: dark) {&lt;br /&gt;
    .infobox {&lt;br /&gt;
        background-color: #2a2a2a;&lt;br /&gt;
        color: #e6e6e6;&lt;br /&gt;
        border-color: #555;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .infobox th {&lt;br /&gt;
        background-color: #404040;&lt;br /&gt;
        color: #e6e6e6;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=25</id>
		<title>Template:Infobox TTS model</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=25"/>
		<updated>2025-09-20T16:06:57Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;includeonly&amp;gt;&lt;br /&gt;
{| class=&amp;quot;infobox tts-infobox&amp;quot; style=&amp;quot;width: 22em; font-size: 90%; text-align: left; border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; float: right; clear: right;&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-size: 125%; font-weight: bold; background-color: #ccccff; color: black;&amp;quot; | {{{name|{{PAGENAME}}}}}&lt;br /&gt;
{{#if:{{{image|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; padding: 0.5em;&amp;quot; {{!}} {{{image}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{caption|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-style: italic; padding: 0.2em;&amp;quot; {{!}} {{{caption}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Model Information&lt;br /&gt;
{{#if:{{{developer|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; width: 30%; vertical-align: top;&amp;quot; {{!}} Developer:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{developer}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{release_date|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Release date:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{release_date}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latest_version|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Latest version:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{latest_version}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{architecture|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Architecture:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{architecture}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{parameters|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Parameters:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{parameters}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{training_data|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Training data:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{training_data}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Capabilities&lt;br /&gt;
{{#if:{{{languages|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Languages:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{languages}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voices|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Voices:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{voices}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voice_cloning|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Voice cloning:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{voice_cloning}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{emotion_control|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Emotion control:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{emotion_control}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{streaming|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Streaming:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{streaming}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latency|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Latency:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{latency}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Availability&lt;br /&gt;
{{#if:{{{license|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} License:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{license}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{open_source|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Open source:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{open_source}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{code_repository|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Repository:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{code_repository}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{model_weights|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Model weights:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{model_weights}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{demo|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Demo:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{demo}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{website|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; vertical-align: top;&amp;quot; {{!}} Website:&lt;br /&gt;
{{!}} style=&amp;quot;word-wrap: break-word;&amp;quot; {{!}} {{{website}}}&lt;br /&gt;
}}&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=24</id>
		<title>Orpheus TTS</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=24"/>
		<updated>2025-09-20T16:06:28Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: add infobox&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox TTS model&lt;br /&gt;
| name = Orpheus TTS&lt;br /&gt;
| developer = [[Canopy Labs]]&lt;br /&gt;
| release_date = March 18, 2025&lt;br /&gt;
| latest_version = 3B 0.1&lt;br /&gt;
| architecture = [[LLM-Based]]&lt;br /&gt;
| parameters = 3 billion&lt;br /&gt;
| training_data = 100k+ hours (English)&lt;br /&gt;
| languages = English (multilingual in preview)&lt;br /&gt;
| voices = 8 distinct voices&lt;br /&gt;
| voice_cloning = Yes (zero-shot)&lt;br /&gt;
| emotion_control = Yes (tag-based)&lt;br /&gt;
| streaming = Yes&lt;br /&gt;
| latency = ~200ms&lt;br /&gt;
| license = [[Apache License 2.0]]&lt;br /&gt;
| open_source = Yes&lt;br /&gt;
| code_repository = [https://github.com/canopyai/Orpheus-TTS GitHub]&lt;br /&gt;
| model_weights = [https://huggingface.co/canopylabs/orpheus-3b-0.1-ft Hugging Face]&lt;br /&gt;
| demo = [https://huggingface.co/spaces/MohamedRashad/Orpheus-TTS HF Spaces]&lt;br /&gt;
| website = [https://canopylabs.ai/releases/towards_human_sounding_tts Canopy Labs]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Orpheus TTS&#039;&#039;&#039; is an open-source [[text-to-speech]] (TTS) system developed by Canopy Labs and released in March 2025. Built on the [[Llama (language model)|Llama-3.2-3B]] architecture, it takes the novel approach of applying a large language model to audio tokens instead of a traditional TTS-specific architecture.&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS was developed by Canopy Labs, an artificial intelligence startup founded with the stated mission of creating &amp;quot;digital humans that are indistinguishable from real humans.&amp;quot;&amp;lt;ref&amp;gt;https://canopylabs.ai/&amp;lt;/ref&amp;gt; The code and model weights were publicly released on March 18, 2025, under the [[Apache License]] 2.0, making both the model weights and training code freely available.&amp;lt;ref&amp;gt;https://github.com/canopyai/Orpheus-TTS&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta&#039;s Llama-3.2-3B language model as its foundation. It takes in a text prompt and generates audio tokens using the 24 kHz variant of the [[SNAC]] audio tokenizer. The system was trained on a dataset comprising over 100,000 hours of English speech data combined with billions of tokens of textual QA pairs, a hybrid approach designed to maintain linguistic understanding while adding speech synthesis capabilities.&lt;br /&gt;
&lt;br /&gt;
The model supports streaming out-of-the-box and can achieve ~200 milliseconds of streaming latency.&amp;lt;ref&amp;gt;https://www.baseten.co/blog/canopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance Claims and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
Canopy Labs claims that Orpheus TTS delivers &amp;quot;natural intonation, emotion, and rhythm that is superior to SOTA closed source models,&amp;quot; positioning it as competitive with established commercial systems such as [[ElevenLabs]] and other proprietary text-to-speech services.&amp;lt;ref&amp;gt;https://huggingface.co/canopylabs/orpheus-3b-0.1-ft&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these performance assertions are based primarily on internal evaluations and subjective assessments rather than standardized benchmarks or peer-reviewed studies. The lack of comprehensive comparative analysis with established TTS systems has led to some skepticism within the research community about the extent of its claimed superiority.&lt;br /&gt;
&lt;br /&gt;
== Model Variants ==&lt;br /&gt;
&lt;br /&gt;
The model is available in three variants:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pretrained model&#039;&#039;&#039;: The base model trained on the full dataset, suitable for research and custom fine-tuning&lt;br /&gt;
* &#039;&#039;&#039;Fine-tuned production model&#039;&#039;&#039;: An optimized version designed for everyday TTS applications, fine-tuned on several voices&lt;br /&gt;
* &#039;&#039;&#039;Multilingual model&#039;&#039;&#039;: A family of fine-tuned models with support for other languages&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/canopyai/Orpheus-TTS Official Orpheus TTS repository]&lt;br /&gt;
* [https://huggingface.co/canopylabs/orpheus-3b-0.1-ft Model on Hugging Face]&lt;br /&gt;
* [https://www.baseten.co/library/orpheus-tts/ Orpheus TTS on Baseten]&lt;br /&gt;
&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Open-source software]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Large language models]]&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=23</id>
		<title>Orpheus TTS</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=23"/>
		<updated>2025-09-20T16:04:20Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Orpheus TTS&#039;&#039;&#039; is an open-source [[text-to-speech]] (TTS) system developed by Canopy Labs and released in March 2025. Built on the [[Llama (language model)|Llama-3.2-3B]] architecture, it takes the novel approach of applying a large language model to audio tokens instead of a traditional TTS-specific architecture.&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS was developed by Canopy Labs, an artificial intelligence startup founded with the stated mission of creating &amp;quot;digital humans that are indistinguishable from real humans.&amp;quot;&amp;lt;ref&amp;gt;https://canopylabs.ai/&amp;lt;/ref&amp;gt; The code and model weights were publicly released on March 18, 2025, under the [[Apache License]] 2.0, making both the model weights and training code freely available.&amp;lt;ref&amp;gt;https://github.com/canopyai/Orpheus-TTS&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta&#039;s Llama-3.2-3B language model as its foundation. It takes in a text prompt and generates audio tokens using the 24 kHz variant of the [[SNAC]] audio tokenizer. The system was trained on a dataset comprising over 100,000 hours of English speech data combined with billions of tokens of textual QA pairs, a hybrid approach designed to maintain linguistic understanding while adding speech synthesis capabilities.&lt;br /&gt;
&lt;br /&gt;
The model supports streaming out-of-the-box and can achieve ~200 milliseconds of streaming latency.&amp;lt;ref&amp;gt;https://www.baseten.co/blog/canopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance Claims and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
Canopy Labs claims that Orpheus TTS delivers &amp;quot;natural intonation, emotion, and rhythm that is superior to SOTA closed source models,&amp;quot; positioning it as competitive with established commercial systems such as [[ElevenLabs]] and other proprietary text-to-speech services.&amp;lt;ref&amp;gt;https://huggingface.co/canopylabs/orpheus-3b-0.1-ft&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these performance assertions are based primarily on internal evaluations and subjective assessments rather than standardized benchmarks or peer-reviewed studies. The lack of comprehensive comparative analysis with established TTS systems has led to some skepticism within the research community about the extent of its claimed superiority.&lt;br /&gt;
&lt;br /&gt;
== Model Variants ==&lt;br /&gt;
&lt;br /&gt;
The model is available in three variants:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pretrained model&#039;&#039;&#039;: The base model trained on the full dataset, suitable for research and custom fine-tuning&lt;br /&gt;
* &#039;&#039;&#039;Fine-tuned production model&#039;&#039;&#039;: An optimized version designed for everyday TTS applications, fine-tuned on several voices&lt;br /&gt;
* &#039;&#039;&#039;Multilingual model&#039;&#039;&#039;: A family of fine-tuned models with support for other languages&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/canopyai/Orpheus-TTS Official Orpheus TTS repository]&lt;br /&gt;
* [https://huggingface.co/canopylabs/orpheus-3b-0.1-ft Model on Hugging Face]&lt;br /&gt;
* [https://www.baseten.co/library/orpheus-tts/ Orpheus TTS on Baseten]&lt;br /&gt;
&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Open-source software]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Large language models]]&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=22</id>
		<title>Orpheus TTS</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Orpheus_TTS&amp;diff=22"/>
		<updated>2025-09-20T16:04:01Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add Orpheus TTS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Orpheus TTS&#039;&#039;&#039; is an open-source [[text-to-speech]] (TTS) system developed by Canopy Labs and released in March 2025. Built on the [[Llama (language model)|Llama-3.2-3B]] architecture, it takes the novel approach of applying a large language model to audio tokens instead of a traditional TTS-specific architecture.&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS was developed by Canopy Labs, an artificial intelligence startup founded with the stated mission of creating &amp;quot;digital humans that are indistinguishable from real humans.&amp;quot;&amp;lt;ref&amp;gt;https://canopylabs.ai/&amp;lt;/ref&amp;gt; The code and model weights were publicly released on March 18, 2025, under the [[Apache License]] 2.0, making both the model weights and training code freely available.&amp;lt;ref&amp;gt;https://github.com/canopyai/Orpheus-TTS&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
Orpheus TTS differs from conventional text-to-speech systems by using a modified version of Meta&#039;s Llama-3.2-3B language model as its foundation. It takes in a text prompt and generates audio tokens using the 24 kHz variant of the [[SNAC]] audio tokenizer. The system was trained on a dataset comprising over 100,000 hours of English speech data combined with billions of tokens of textual QA pairs, a hybrid approach designed to maintain linguistic understanding while adding speech synthesis capabilities.&lt;br /&gt;
&lt;br /&gt;
The model supports streaming out-of-the-box and can achieve ~200 milliseconds of streaming latency.&amp;lt;ref&amp;gt;https://www.baseten.co/blog/canopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance Claims and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
Canopy Labs claims that Orpheus TTS delivers &amp;quot;natural intonation, emotion, and rhythm that is superior to SOTA closed source models,&amp;quot; positioning it as competitive with established commercial systems such as [[ElevenLabs]] and other proprietary text-to-speech services.&amp;lt;ref&amp;gt;https://huggingface.co/canopylabs/orpheus-3b-0.1-ft&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these performance assertions are based primarily on internal evaluations and subjective assessments rather than standardized benchmarks or peer-reviewed studies. The lack of comprehensive comparative analysis with established TTS systems has led to some skepticism within the research community about the extent of its claimed superiority.&lt;br /&gt;
&lt;br /&gt;
== Model Variants ==&lt;br /&gt;
&lt;br /&gt;
The model is available in three variants:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Pretrained model&#039;&#039;&#039;: The base model trained on the full dataset, suitable for research and custom fine-tuning&lt;br /&gt;
* &#039;&#039;&#039;Fine-tuned production model&#039;&#039;&#039;: An optimized version designed for everyday TTS applications, fine-tuned on several voices&lt;br /&gt;
* &#039;&#039;&#039;Multilingual model&#039;&#039;&#039;: A family of fine-tuned models with support for other languages&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/canopyai/Orpheus-TTS Official Orpheus TTS repository]&lt;br /&gt;
* [https://huggingface.co/canopylabs/orpheus-3b-0.1-ft Model on Hugging Face]&lt;br /&gt;
* [https://www.baseten.co/library/orpheus-tts/ Orpheus TTS on Baseten]&lt;br /&gt;
&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Open-source software]]&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Large language models]]&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=20</id>
		<title>Template:Infobox TTS model</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=20"/>
		<updated>2025-09-20T16:02:15Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Protected &amp;quot;Template:Infobox TTS model&amp;quot; ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;includeonly&amp;gt;&lt;br /&gt;
{| class=&amp;quot;infobox&amp;quot; style=&amp;quot;width: 22em; font-size: 90%; text-align: left; border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; float: right; clear: right;&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-size: 125%; font-weight: bold; background-color: #ccccff; color: black;&amp;quot; | {{{name|{{PAGENAME}}}}}&lt;br /&gt;
{{#if:{{{image|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; padding: 0.5em;&amp;quot; {{!}} {{{image}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{caption|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-style: italic; padding: 0.2em;&amp;quot; {{!}} {{{caption}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Model Information&lt;br /&gt;
{{#if:{{{developer|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; width: 30%;&amp;quot; {{!}} Developer:&lt;br /&gt;
{{!}} {{{developer}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{release_date|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Release date:&lt;br /&gt;
{{!}} {{{release_date}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latest_version|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Latest version:&lt;br /&gt;
{{!}} {{{latest_version}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{architecture|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Architecture:&lt;br /&gt;
{{!}} {{{architecture}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{parameters|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Parameters:&lt;br /&gt;
{{!}} {{{parameters}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{training_data|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Training data:&lt;br /&gt;
{{!}} {{{training_data}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Capabilities&lt;br /&gt;
{{#if:{{{languages|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Languages:&lt;br /&gt;
{{!}} {{{languages}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voices|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Voices:&lt;br /&gt;
{{!}} {{{voices}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voice_cloning|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Voice cloning:&lt;br /&gt;
{{!}} {{{voice_cloning}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{emotion_control|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Emotion control:&lt;br /&gt;
{{!}} {{{emotion_control}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{streaming|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Streaming:&lt;br /&gt;
{{!}} {{{streaming}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latency|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Latency:&lt;br /&gt;
{{!}} {{{latency}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Availability&lt;br /&gt;
{{#if:{{{license|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} License:&lt;br /&gt;
{{!}} {{{license}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{open_source|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Open source:&lt;br /&gt;
{{!}} {{{open_source}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{code_repository|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Repository:&lt;br /&gt;
{{!}} {{{code_repository}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{model_weights|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Model weights:&lt;br /&gt;
{{!}} {{{model_weights}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{demo|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Demo:&lt;br /&gt;
{{!}} {{{demo}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{website|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Website:&lt;br /&gt;
{{!}} {{{website}}}&lt;br /&gt;
}}&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=19</id>
		<title>Template:Infobox TTS model</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Infobox_TTS_model&amp;diff=19"/>
		<updated>2025-09-20T16:02:09Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;lt;includeonly&amp;gt; {| class=&amp;quot;infobox&amp;quot; style=&amp;quot;width: 22em; font-size: 90%; text-align: left; border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; float: right; clear: right;&amp;quot; ! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-size: 125%; font-weight: bold; background-color: #ccccff; color: black;&amp;quot; | {{{name|{{PAGENAME}}}}} {{#if:{{{image|}}}| {{!}}-  {{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; padding: 0.5em;&amp;quot; {{!}} {{{image}}...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;includeonly&amp;gt;&lt;br /&gt;
{| class=&amp;quot;infobox&amp;quot; style=&amp;quot;width: 22em; font-size: 90%; text-align: left; border: 1px solid #aaa; background-color: #f9f9f9; color: black; margin: 0.5em 0 0.5em 1em; padding: 0.2em; float: right; clear: right;&amp;quot;&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-size: 125%; font-weight: bold; background-color: #ccccff; color: black;&amp;quot; | {{{name|{{PAGENAME}}}}}&lt;br /&gt;
{{#if:{{{image|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; padding: 0.5em;&amp;quot; {{!}} {{{image}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{caption|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; font-style: italic; padding: 0.2em;&amp;quot; {{!}} {{{caption}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Model Information&lt;br /&gt;
{{#if:{{{developer|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold; width: 30%;&amp;quot; {{!}} Developer:&lt;br /&gt;
{{!}} {{{developer}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{release_date|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Release date:&lt;br /&gt;
{{!}} {{{release_date}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latest_version|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Latest version:&lt;br /&gt;
{{!}} {{{latest_version}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{architecture|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Architecture:&lt;br /&gt;
{{!}} {{{architecture}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{parameters|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Parameters:&lt;br /&gt;
{{!}} {{{parameters}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{training_data|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Training data:&lt;br /&gt;
{{!}} {{{training_data}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Capabilities&lt;br /&gt;
{{#if:{{{languages|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Languages:&lt;br /&gt;
{{!}} {{{languages}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voices|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Voices:&lt;br /&gt;
{{!}} {{{voices}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{voice_cloning|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Voice cloning:&lt;br /&gt;
{{!}} {{{voice_cloning}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{emotion_control|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Emotion control:&lt;br /&gt;
{{!}} {{{emotion_control}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{streaming|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Streaming:&lt;br /&gt;
{{!}} {{{streaming}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{latency|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Latency:&lt;br /&gt;
{{!}} {{{latency}}}&lt;br /&gt;
}}&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;2&amp;quot; style=&amp;quot;text-align: center; background-color: #ddddff; font-weight: bold;&amp;quot; | Availability&lt;br /&gt;
{{#if:{{{license|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} License:&lt;br /&gt;
{{!}} {{{license}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{open_source|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Open source:&lt;br /&gt;
{{!}} {{{open_source}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{code_repository|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Repository:&lt;br /&gt;
{{!}} {{{code_repository}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{model_weights|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Model weights:&lt;br /&gt;
{{!}} {{{model_weights}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{demo|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Demo:&lt;br /&gt;
{{!}} {{{demo}}}&lt;br /&gt;
}}&lt;br /&gt;
{{#if:{{{website|}}}|&lt;br /&gt;
{{!}}- &lt;br /&gt;
{{!}} style=&amp;quot;font-weight: bold;&amp;quot; {{!}} Website:&lt;br /&gt;
{{!}} {{{website}}}&lt;br /&gt;
}}&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;/includeonly&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=18</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=18"/>
		<updated>2025-09-20T15:59:04Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: add chatterbox&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
| [[Chatterbox]] || MIT || English || ✅ || ❌ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
| [[Orpheus TTS]] || Apache-2.0 || English || ❌ || ❌ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=17</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=17"/>
		<updated>2025-09-20T15:49:25Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: /* Open Source Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
| [[Orpheus TTS]] || Apache-2.0 || English || ❌ || ❌ || ✅ || 2025&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=16</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=16"/>
		<updated>2025-09-20T01:49:22Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: /* Open Source Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| [[F5-TTS]] || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[MaskGCT]] || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[StyleTTS 2]] || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| [[VibeVoice]] || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=VibeVoice&amp;diff=15</id>
		<title>VibeVoice</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=VibeVoice&amp;diff=15"/>
		<updated>2025-09-20T01:48:44Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;VibeVoice&amp;#039;&amp;#039;&amp;#039; is an experimental text-to-speech (TTS) framework developed by Microsoft Research for generating long-form, multi-speaker conversational audio. It was released in August 2025 and is designed to synthesize long-form speech content such as podcasts and audiobooks with up to 4 speakers and with support for voice cloning.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;  == Development and Release ==  VibeVoice was developed by a team at Microsoft Res...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;VibeVoice&#039;&#039;&#039; is an experimental [[text-to-speech]] (TTS) framework developed by [[Microsoft Research]] for generating long-form, multi-speaker conversational audio. It was released in August 2025 and is designed to synthesize long-form speech content such as podcasts and audiobooks with up to four speakers and support for voice cloning.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Development and Release ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice was developed by a team at Microsoft Research led by Zhiliang Peng, Jianwei Yu, and others, with the technical report published on [[arXiv]] in August 2025.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt; The project was initially released as open-source software on GitHub and [[Hugging Face]], with model weights made publicly available under the MIT license.&lt;br /&gt;
&lt;br /&gt;
However, the release was disrupted in September 2025 when Microsoft removed the official repository and model weights from public access. According to Microsoft&#039;s statement, this action was taken after discovering &amp;quot;instances where the tool was used in ways inconsistent with the stated intent&amp;quot; and concerns about responsible AI use.&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt; The repository was later restored without implementation code, while community-maintained forks preserved the original materials. The 1.5B pretrained model remains available on Microsoft&#039;s Hugging Face page, but the 7B model was taken down.&lt;br /&gt;
&lt;br /&gt;
Community forks have preserved backups of the code and model weights, including both the 1.5B and 7B models.&amp;lt;ref&amp;gt;https://github.com/vibevoice-community/VibeVoice&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;https://huggingface.co/aoi-ot/VibeVoice-Large&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical Architecture ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice uses a hybrid architecture that combines a large language model with diffusion-based audio generation. The system uses two specialized tokenizers operating at an ultra-low 7.5 Hz frame rate:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Acoustic Tokenizer&#039;&#039;&#039;: A [[variational autoencoder]] (VAE) based encoder-decoder that compresses audio signals while preserving fidelity&lt;br /&gt;
* &#039;&#039;&#039;Semantic Tokenizer&#039;&#039;&#039;: A content-focused encoder trained using [[automatic speech recognition]] as a proxy task&lt;br /&gt;
&lt;br /&gt;
The core model utilizes [[Qwen2.5]] as its base large language model (available in 1.5B and 7B parameter variants), integrated with a lightweight diffusion head for generating acoustic features. This design achieves what the researchers claim is an 80-fold improvement in data compression compared to the [[Encodec]] model while maintaining audio quality.&amp;lt;ref&amp;gt;https://arxiv.org/abs/2508.19205&amp;lt;/ref&amp;gt;&lt;br /&gt;
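The practical consequence of the 7.5 Hz frame rate can be checked with quick arithmetic. The figures below follow directly from the numbers stated above and are illustrative only, not taken from any VibeVoice release:

```python
# Illustrative arithmetic (not from the VibeVoice codebase): at the
# stated 7.5 Hz frame rate, even a maximum-length 90-minute session
# stays within a modest sequence length for the base LLM.
FRAME_RATE_HZ = 7.5
MAX_MINUTES = 90

frames = int(FRAME_RATE_HZ * MAX_MINUTES * 60)
print(frames)  # 40500 acoustic frames for 90 minutes of audio
```

By comparison, a hypothetical codec running at 75 Hz would need ten times as many frames for the same audio, which is why the low frame rate matters for long-form generation.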
&lt;br /&gt;
== Capabilities and Limitations ==&lt;br /&gt;
&lt;br /&gt;
VibeVoice can generate speech sequences up to 90 minutes in length with support for up to four distinct speakers. The model demonstrates several emergent capabilities it was not explicitly trained for, including:&lt;br /&gt;
&lt;br /&gt;
* Cross-lingual speech synthesis&lt;br /&gt;
* Spontaneous singing (though often off-key)&lt;br /&gt;
* Contextual background music generation&lt;br /&gt;
* Voice cloning from short prompts&lt;br /&gt;
&lt;br /&gt;
However, the system has notable limitations:&lt;br /&gt;
&lt;br /&gt;
* Language support restricted to English and Chinese&lt;br /&gt;
* No explicit modeling of overlapping speech&lt;br /&gt;
* Occasional instability, particularly with Chinese text synthesis&lt;br /&gt;
* Uncontrolled generation of background sounds and music&lt;br /&gt;
* Limited commercial viability due to various technical constraints&amp;lt;ref&amp;gt;https://github.com/microsoft/VibeVoice&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Performance and Evaluation ==&lt;br /&gt;
&lt;br /&gt;
In comparative evaluations against contemporary TTS systems including [[ElevenLabs]], [[Google]]&#039;s Gemini 2.5 Pro TTS, and others, VibeVoice reportedly achieved superior scores in subjective metrics of realism, richness, and user preference. The model also demonstrated competitive [[word error rate]]s when evaluated using speech recognition systems.&amp;lt;ref&amp;gt;https://huggingface.co/microsoft/VibeVoice-1.5B&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, these evaluations were conducted on a limited test set of eight conversational transcripts totaling approximately one hour of audio, raising questions about the generalizability of the results to broader use cases.&lt;br /&gt;
&lt;br /&gt;
== Controversies and Concerns ==&lt;br /&gt;
&lt;br /&gt;
The temporary removal of VibeVoice from public access highlighted ongoing concerns about the potential misuse of high-quality synthetic speech technology. Microsoft explicitly warned about the potential for creating [[deepfake]] audio content for impersonation, fraud, or disinformation purposes.&lt;br /&gt;
&lt;br /&gt;
The model&#039;s ability to generate convincing speech from minimal voice prompts, combined with its long-form generation capabilities, raised particular concerns among AI safety researchers about potential misuse for creating fake audio content at scale.&lt;br /&gt;
&lt;br /&gt;
== Community Response ==&lt;br /&gt;
&lt;br /&gt;
Following Microsoft&#039;s temporary withdrawal of the official release, the open-source community created several preservation efforts:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/vibevoice-community/VibeVoice vibevoice-community/VibeVoice]&#039;&#039;&#039;: A community-maintained fork preserving the original codebase and model weights&lt;br /&gt;
* &#039;&#039;&#039;[https://github.com/voicepowered-ai/VibeVoice-finetuning VibeVoice-finetuning]&#039;&#039;&#039;: Unofficial tools for fine-tuning the models using [[Low-Rank Adaptation]] (LoRA) techniques&lt;br /&gt;
&lt;br /&gt;
These community efforts have enabled continued research and development despite the official restrictions, though they operate independently of Microsoft&#039;s oversight.&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/vibevoice-community/VibeVoice Community-maintained VibeVoice repository]&lt;br /&gt;
* [https://arxiv.org/abs/2508.19205 Original technical report on arXiv]&lt;br /&gt;
&lt;br /&gt;
[[Category:Artificial intelligence]]&lt;br /&gt;
[[Category:Speech synthesis]]&lt;br /&gt;
[[Category:Microsoft Research]]&lt;br /&gt;
[[Category:Open-source software]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=14</id>
		<title>MediaWiki:Common.css</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=14"/>
		<updated>2025-09-20T01:33:24Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Protected &amp;quot;MediaWiki:Common.css&amp;quot; ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;/* CSS placed here will be applied to all skins */&lt;br /&gt;
&lt;br /&gt;
.portal-column-left {&lt;br /&gt;
    float: left;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
.portal-column-right {&lt;br /&gt;
    float: right;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Two column layout on wider screens */&lt;br /&gt;
@media (min-width: 720px) {&lt;br /&gt;
    .portal-column-left {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-right: 2%;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .portal-column-right {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-left: 2%;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=13</id>
		<title>MediaWiki:Common.css</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=MediaWiki:Common.css&amp;diff=13"/>
		<updated>2025-09-20T01:33:18Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;/* CSS placed here will be applied to all skins */  .portal-column-left {     float: left;     width: 100%; }  .portal-column-right {     float: right;     width: 100%; }  /* Two column layout on wider screens */ @media (min-width: 720px) {     .portal-column-left {         width: 48%;         margin-right: 2%;     }          .portal-column-right {         width: 48%;         margin-left: 2%;     } }&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;/* CSS placed here will be applied to all skins */&lt;br /&gt;
&lt;br /&gt;
.portal-column-left {&lt;br /&gt;
    float: left;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
.portal-column-right {&lt;br /&gt;
    float: right;&lt;br /&gt;
    width: 100%;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
/* Two column layout on wider screens */&lt;br /&gt;
@media (min-width: 720px) {&lt;br /&gt;
    .portal-column-left {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-right: 2%;&lt;br /&gt;
    }&lt;br /&gt;
    &lt;br /&gt;
    .portal-column-right {&lt;br /&gt;
        width: 48%;&lt;br /&gt;
        margin-left: 2%;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Box_footer&amp;diff=12</id>
		<title>Template:Box footer</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Box_footer&amp;diff=12"/>
		<updated>2025-09-20T01:32:53Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Protected &amp;quot;Template:Box footer&amp;quot; ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;|}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Box_footer&amp;diff=11</id>
		<title>Template:Box footer</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Box_footer&amp;diff=11"/>
		<updated>2025-09-20T01:32:47Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;|}&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;|}&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Box_header&amp;diff=10</id>
		<title>Template:Box header</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Box_header&amp;diff=10"/>
		<updated>2025-09-20T01:32:32Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Protected &amp;quot;Template:Box header&amp;quot; ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width:100%; background:{{{3|#f8f9fa}}}; border:1px solid #a2a9b1; margin-bottom:5px;&amp;quot;&lt;br /&gt;
! style=&amp;quot;background:{{{2|#eaecf0}}}; padding:5px; text-align:left; font-weight:bold;&amp;quot; | {{{1}}}&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:8px;&amp;quot; |&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Template:Box_header&amp;diff=9</id>
		<title>Template:Box header</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Template:Box_header&amp;diff=9"/>
		<updated>2025-09-20T01:32:24Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Created page with &amp;quot;{| style=&amp;quot;width:100%; background:{{{3|#f8f9fa}}}; border:1px solid #a2a9b1; margin-bottom:5px;&amp;quot; ! style=&amp;quot;background:{{{2|#eaecf0}}}; padding:5px; text-align:left; font-weight:bold;&amp;quot; | {{{1}}} |- | style=&amp;quot;padding:8px;&amp;quot; |&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width:100%; background:{{{3|#f8f9fa}}}; border:1px solid #a2a9b1; margin-bottom:5px;&amp;quot;&lt;br /&gt;
! style=&amp;quot;background:{{{2|#eaecf0}}}; padding:5px; text-align:left; font-weight:bold;&amp;quot; | {{{1}}}&lt;br /&gt;
|-&lt;br /&gt;
| style=&amp;quot;padding:8px;&amp;quot; |&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=8</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=8"/>
		<updated>2025-09-19T03:53:20Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| F5-TTS || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| MaskGCT || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| StyleTTS 2 || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| VibeVoice || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Emilia_Dataset&amp;diff=7</id>
		<title>Emilia Dataset</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Emilia_Dataset&amp;diff=7"/>
		<updated>2025-09-19T03:43:07Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Remove related links&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;Emilia Dataset&#039;&#039;&#039; is a large-scale, multilingual, and diverse speech generation dataset derived from in-the-wild speech data. Emilia starts with over 101k hours of speech across six languages, including a wide range of speaking styles for more natural and spontaneous speech generation.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
The Emilia dataset is constructed from a large collection of publicly-available audio on the Internet, such as podcasts, debates, and audiobooks. The dataset was created using &#039;&#039;&#039;Emilia-Pipe&#039;&#039;&#039;, an open-source preprocessing pipeline that processes, transcribes, and filters the raw audio.&lt;br /&gt;
&lt;br /&gt;
== Dataset Statistics ==&lt;br /&gt;
&lt;br /&gt;
=== Original Emilia Dataset ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Language !! Code !! Duration (Hours) &lt;br /&gt;
|-&lt;br /&gt;
| English || EN || 46.8k &lt;br /&gt;
|-&lt;br /&gt;
| Chinese || ZH || 49.9k &lt;br /&gt;
|-&lt;br /&gt;
| German || DE || 1.6k &lt;br /&gt;
|-&lt;br /&gt;
| French || FR || 1.4k &lt;br /&gt;
|-&lt;br /&gt;
| Japanese || JA || 1.7k &lt;br /&gt;
|-&lt;br /&gt;
| Korean || KO || 0.2k &lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Total&#039;&#039;&#039; || - || &#039;&#039;&#039;101.7k&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Emilia-Large Dataset ===&lt;br /&gt;
The dataset has been expanded to Emilia-Large, a dataset with over 216k hours of speech, making it one of the largest openly-available speech datasets. Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the new Emilia-YODAS dataset (licensed under CC BY 4.0).&lt;br /&gt;
&lt;br /&gt;
The Emilia-YODAS dataset is based on the YODAS2 dataset, sourced from publicly-available YouTube videos licensed under the Creative Commons license.&lt;br /&gt;
&lt;br /&gt;
== Technical Specifications ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Sampling Rate&#039;&#039;&#039;: 24 kHz&lt;br /&gt;
* &#039;&#039;&#039;Audio Format&#039;&#039;&#039;: WAV files, mono channel&lt;br /&gt;
* &#039;&#039;&#039;Sample Width&#039;&#039;&#039;: 16-bit&lt;br /&gt;
* &#039;&#039;&#039;Audio Quality&#039;&#039;&#039;: DNSMOS P.835 OVRL score of 2.50&lt;br /&gt;
* &#039;&#039;&#039;Languages Supported&#039;&#039;&#039;: 6 (English, Chinese, German, French, Japanese, Korean)&lt;br /&gt;
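As a rough cross-check of these specifications, the uncompressed size of the full 101.7k hours at 24 kHz, 16-bit, mono works out to roughly 17.6 TB (a back-of-envelope estimate only; the on-disk size of the released tar shards will differ):&lt;br /&gt;

```python
# Back-of-envelope: raw PCM size of 101.7k hours of 24 kHz, 16-bit, mono audio.
HOURS = 101_700
SAMPLE_RATE = 24_000   # Hz
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # mono

size_bytes = HOURS * 3600 * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
print(f"{size_bytes / 1e12:.1f} TB")  # -> 17.6 TB
```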
&lt;br /&gt;
== Emilia-Pipe Processing Pipeline ==&lt;br /&gt;
&lt;br /&gt;
Emilia-Pipe consists of six steps: Standardization, Source Separation, Speaker Diarization, Segmentation by VAD, ASR, and Filtering.&lt;br /&gt;
&lt;br /&gt;
=== Processing Steps ===&lt;br /&gt;
&lt;br /&gt;
==== 1. Standardization ====&lt;br /&gt;
Audio files are converted to WAV format, resampled to 24 kHz, and downmixed to mono. Amplitudes are normalized to the range -1 to 1, targeting a standard decibel level to minimize distortion.&lt;br /&gt;
&lt;br /&gt;
==== 2. Source Separation ====&lt;br /&gt;
This step extracts clean vocal tracks from audio that may contain background noise or music. The authors employ the pretrained Ultimate Vocal Remover model to isolate the vocal elements.&lt;br /&gt;
&lt;br /&gt;
==== 3. Speaker Diarization ====&lt;br /&gt;
Speaker diarization partitions long-form speech into per-speaker utterances, using the PyAnnote speaker diarization 3.1 pipeline.&lt;br /&gt;
&lt;br /&gt;
==== 4. Segmentation (VAD) ====&lt;br /&gt;
Voice Activity Detection is used to further segment the audio into smaller, manageable chunks suitable for training.&lt;br /&gt;
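To illustrate the segmentation idea only (Emilia-Pipe uses a dedicated VAD model, not this toy version), an energy-threshold segmenter might look like:&lt;br /&gt;

```python
# Toy energy-threshold VAD: illustrative only, not Emilia-Pipe's actual VAD.
def segment_by_vad(frames, threshold=0.1):
    """Group consecutive above-threshold frames into (start, end) segments."""
    segments, start = [], None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i                    # speech begins
        elif energy < threshold and start is not None:
            segments.append((start, i))  # speech ends
            start = None
    if start is not None:                # speech runs to the end
        segments.append((start, len(frames)))
    return segments

frames = [0.0, 0.3, 0.5, 0.05, 0.0, 0.4, 0.6, 0.2]
print(segment_by_vad(frames))  # -> [(1, 3), (5, 8)]
```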
&lt;br /&gt;
==== 5. Automatic Speech Recognition (ASR) ====&lt;br /&gt;
ASR techniques transcribe the segmented speech data. The medium version of the Whisper model is employed, with batched inference for parallel processing.&lt;br /&gt;
&lt;br /&gt;
==== 6. Filtering ====&lt;br /&gt;
Segments that fail predetermined quality thresholds (e.g., a minimum DNSMOS score) or language-identification confidence checks are discarded, yielding a refined dataset.&lt;br /&gt;
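The filtering step amounts to a threshold pass over per-segment metadata; a minimal sketch (field names and cutoff values here are illustrative, not the official ones):&lt;br /&gt;

```python
# Sketch of the final filtering pass; thresholds and keys are hypothetical.
def filter_segments(segments, min_dnsmos=2.0, min_lang_conf=0.8):
    """Keep only segments meeting quality and language-confidence cutoffs."""
    return [
        s for s in segments
        if s["dnsmos"] >= min_dnsmos and s["lang_conf"] >= min_lang_conf
    ]

segments = [
    {"id": "a", "dnsmos": 2.8, "lang_conf": 0.95},  # kept
    {"id": "b", "dnsmos": 1.4, "lang_conf": 0.99},  # fails DNSMOS cutoff
    {"id": "c", "dnsmos": 3.1, "lang_conf": 0.40},  # fails language confidence
]
kept = filter_segments(segments)
print([s["id"] for s in kept])  # -> ['a']
```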
&lt;br /&gt;
== Licensing and Access ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Emilia Dataset&#039;&#039;&#039;: CC BY-NC 4.0 (Non-commercial use only)&lt;br /&gt;
* &#039;&#039;&#039;Emilia-YODAS Dataset&#039;&#039;&#039;: CC BY 4.0&lt;br /&gt;
* &#039;&#039;&#039;Emilia-Pipe Pipeline&#039;&#039;&#039;: Open-source&lt;br /&gt;
&lt;br /&gt;
Users are permitted to use the Emilia dataset only for non-commercial purposes under the CC BY-NC 4.0 license. Emilia does not own the copyright to the audio files; copyright remains with the original owners of the videos or audio.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&lt;br /&gt;
=== Loading the Dataset ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datasets import load_dataset&lt;br /&gt;
dataset = load_dataset(&amp;quot;amphion/Emilia-Dataset&amp;quot;)&lt;br /&gt;
print(dataset)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading Specific Languages ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datasets import load_dataset&lt;br /&gt;
path = &amp;quot;Emilia/DE/*.tar&amp;quot;&lt;br /&gt;
dataset = load_dataset(&amp;quot;amphion/Emilia-Dataset&amp;quot;, &lt;br /&gt;
                      data_files={&amp;quot;de&amp;quot;: path}, &lt;br /&gt;
                      split=&amp;quot;de&amp;quot;, &lt;br /&gt;
                      streaming=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://huggingface.co/datasets/amphion/Emilia-Dataset Hugging Face Dataset Page]&lt;br /&gt;
* [https://emilia-dataset.github.io/Emilia-Demo-Page/ Demo Page]&lt;br /&gt;
* [https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia Emilia-Pipe Source Code]&lt;br /&gt;
* [https://arxiv.org/abs/2407.05361 Original Research Paper]&lt;br /&gt;
* [https://arxiv.org/abs/2501.15907 Extended Research Paper]&lt;br /&gt;
&lt;br /&gt;
[[Category:Datasets]]&lt;br /&gt;
[[Category:Speech Datasets]]&lt;br /&gt;
[[Category:Open Source]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Emilia_Dataset&amp;diff=6</id>
		<title>Emilia Dataset</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Emilia_Dataset&amp;diff=6"/>
		<updated>2025-09-19T03:42:43Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add Emilia dataset&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;Emilia Dataset&#039;&#039;&#039; is a large-scale, multilingual, and diverse speech generation dataset derived from in-the-wild speech data. Emilia starts with over 101k hours of speech across six languages, including a wide range of speaking styles for more natural and spontaneous speech generation.&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
The Emilia dataset is constructed from a large collection of publicly-available audio on the Internet, such as podcasts, debates, and audiobooks. The dataset was created using &#039;&#039;&#039;Emilia-Pipe&#039;&#039;&#039;, an open-source preprocessing pipeline used to process, transcribe, and filter the dataset.&lt;br /&gt;
&lt;br /&gt;
== Dataset Statistics ==&lt;br /&gt;
&lt;br /&gt;
=== Original Emilia Dataset ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Language !! Code !! Duration (Hours) &lt;br /&gt;
|-&lt;br /&gt;
| English || EN || 46.8k &lt;br /&gt;
|-&lt;br /&gt;
| Chinese || ZH || 49.9k &lt;br /&gt;
|-&lt;br /&gt;
| German || DE || 1.6k &lt;br /&gt;
|-&lt;br /&gt;
| French || FR || 1.4k &lt;br /&gt;
|-&lt;br /&gt;
| Japanese || JA || 1.7k &lt;br /&gt;
|-&lt;br /&gt;
| Korean || KO || 0.2k &lt;br /&gt;
|-&lt;br /&gt;
| &#039;&#039;&#039;Total&#039;&#039;&#039; || - || &#039;&#039;&#039;101.7k&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Emilia-Large Dataset ===&lt;br /&gt;
The dataset has been expanded to Emilia-Large, a dataset with over 216k hours of speech, making it one of the largest openly-available speech datasets. Emilia-Large combines the original 101k-hour Emilia dataset (licensed under CC BY-NC 4.0) with the new Emilia-YODAS dataset (licensed under CC BY 4.0).&lt;br /&gt;
&lt;br /&gt;
The Emilia-YODAS dataset is based on the YODAS2 dataset, sourced from publicly-available YouTube videos licensed under the Creative Commons license.&lt;br /&gt;
&lt;br /&gt;
== Technical Specifications ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Sampling Rate&#039;&#039;&#039;: 24 kHz&lt;br /&gt;
* &#039;&#039;&#039;Audio Format&#039;&#039;&#039;: WAV files, mono channel&lt;br /&gt;
* &#039;&#039;&#039;Sample Width&#039;&#039;&#039;: 16-bit&lt;br /&gt;
* &#039;&#039;&#039;Audio Quality&#039;&#039;&#039;: DNSMOS P.835 OVRL score of 2.50&lt;br /&gt;
* &#039;&#039;&#039;Languages Supported&#039;&#039;&#039;: 6 (English, Chinese, German, French, Japanese, Korean)&lt;br /&gt;
&lt;br /&gt;
== Emilia-Pipe Processing Pipeline ==&lt;br /&gt;
&lt;br /&gt;
Emilia-Pipe consists of six steps: Standardization, Source Separation, Speaker Diarization, Segmentation by VAD, ASR, and Filtering.&lt;br /&gt;
&lt;br /&gt;
=== Processing Steps ===&lt;br /&gt;
&lt;br /&gt;
==== 1. Standardization ====&lt;br /&gt;
Audio files are converted to WAV format, resampled to 24 kHz, and downmixed to mono. Amplitudes are normalized to the range -1 to 1, targeting a standard decibel level to minimize distortion.&lt;br /&gt;
&lt;br /&gt;
==== 2. Source Separation ====&lt;br /&gt;
This step extracts clean vocal tracks from audio that may contain background noise or music. The authors employ the pretrained Ultimate Vocal Remover model to isolate the vocal elements.&lt;br /&gt;
&lt;br /&gt;
==== 3. Speaker Diarization ====&lt;br /&gt;
Speaker diarization partitions long-form speech into per-speaker utterances, using the PyAnnote speaker diarization 3.1 pipeline.&lt;br /&gt;
&lt;br /&gt;
==== 4. Segmentation (VAD) ====&lt;br /&gt;
Voice Activity Detection is used to further segment the audio into smaller, manageable chunks suitable for training.&lt;br /&gt;
&lt;br /&gt;
==== 5. Automatic Speech Recognition (ASR) ====&lt;br /&gt;
ASR techniques transcribe the segmented speech data. The medium version of the Whisper model is employed, with batched inference for parallel processing.&lt;br /&gt;
&lt;br /&gt;
==== 6. Filtering ====&lt;br /&gt;
Segments that fail predetermined quality thresholds (e.g., a minimum DNSMOS score) or language-identification confidence checks are discarded, yielding a refined dataset.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Access ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Emilia Dataset&#039;&#039;&#039;: CC BY-NC 4.0 (Non-commercial use only)&lt;br /&gt;
* &#039;&#039;&#039;Emilia-YODAS Dataset&#039;&#039;&#039;: CC BY 4.0&lt;br /&gt;
* &#039;&#039;&#039;Emilia-Pipe Pipeline&#039;&#039;&#039;: Open-source&lt;br /&gt;
&lt;br /&gt;
Users are permitted to use the Emilia dataset only for non-commercial purposes under the CC BY-NC 4.0 license. Emilia does not own the copyright to the audio files; copyright remains with the original owners of the videos or audio.&lt;br /&gt;
&lt;br /&gt;
== Usage ==&lt;br /&gt;
&lt;br /&gt;
=== Loading the Dataset ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datasets import load_dataset&lt;br /&gt;
dataset = load_dataset(&amp;quot;amphion/Emilia-Dataset&amp;quot;)&lt;br /&gt;
print(dataset)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Loading Specific Languages ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datasets import load_dataset&lt;br /&gt;
path = &amp;quot;Emilia/DE/*.tar&amp;quot;&lt;br /&gt;
dataset = load_dataset(&amp;quot;amphion/Emilia-Dataset&amp;quot;, &lt;br /&gt;
                      data_files={&amp;quot;de&amp;quot;: path}, &lt;br /&gt;
                      split=&amp;quot;de&amp;quot;, &lt;br /&gt;
                      streaming=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* [https://huggingface.co/datasets/amphion/Emilia-Dataset Hugging Face Dataset Page]&lt;br /&gt;
* [https://emilia-dataset.github.io/Emilia-Demo-Page/ Demo Page]&lt;br /&gt;
* [https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia Emilia-Pipe Source Code]&lt;br /&gt;
* [https://arxiv.org/abs/2407.05361 Original Research Paper]&lt;br /&gt;
* [https://arxiv.org/abs/2501.15907 Extended Research Paper]&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* [[F5-TTS]]&lt;br /&gt;
* [[Text-to-Speech]]&lt;br /&gt;
* [[Speech Generation]]&lt;br /&gt;
* [[Multilingual Datasets]]&lt;br /&gt;
* [[Voice Cloning]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Datasets]]&lt;br /&gt;
[[Category:Speech Datasets]]&lt;br /&gt;
[[Category:Open Source]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=5</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=5"/>
		<updated>2025-09-19T03:28:52Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Update styling&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Welcome to TTS Wiki =&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
Note: This Wiki is still a work-in-progress. Contributions are welcome!&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| F5-TTS || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| MaskGCT || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| StyleTTS 2 || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| VibeVoice || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=4</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=4"/>
		<updated>2025-09-19T03:26:42Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:TTS Wiki}}&lt;br /&gt;
&lt;br /&gt;
= Welcome to TTS Wiki =&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| F5-TTS || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| MaskGCT || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| StyleTTS 2 || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| VibeVoice || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=3</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=3"/>
		<updated>2025-09-19T03:25:24Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Add home page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:TTS Wiki}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&lt;br /&gt;
= Welcome to TTS Wiki =&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TTS Wiki&#039;&#039;&#039; is a collaborative knowledge base dedicated to documenting and comparing the latest &#039;&#039;&#039;Text-to-Speech (TTS)&#039;&#039;&#039; models and technologies. Our mission is to provide comprehensive, up-to-date information about the rapidly evolving landscape of speech synthesis.&lt;br /&gt;
&lt;br /&gt;
== Getting Started ==&lt;br /&gt;
&lt;br /&gt;
=== For Developers ===&lt;br /&gt;
* [[Installation Guides]] - Setup instructions for various TTS models&lt;br /&gt;
* [[Installation Guides|Finetuning Guides]] - Walkthroughs for fine-tuning TTS models&lt;br /&gt;
* [[Licensing Overview]] - Commercial usage rights and restrictions&lt;br /&gt;
&lt;br /&gt;
== Model Categories ==&lt;br /&gt;
&lt;br /&gt;
=== Open Source Models ===&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; style=&amp;quot;width: 100%;&amp;quot;&lt;br /&gt;
! Model !! License !! Languages !! Voice Cloning !! Conversational !! Fine-Tuning !! Date Released&lt;br /&gt;
|-&lt;br /&gt;
| F5-TTS || MIT, CC-BY-NC || English, Chinese || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| MaskGCT || MIT, CC-BY-NC || English, Chinese, Korean, Japanese, French, German || ✅ || ❌ || ❌ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| StyleTTS 2 || MIT || English || ✅ || ❌ || ✅ || 2024&lt;br /&gt;
|-&lt;br /&gt;
| VibeVoice || MIT || English, Chinese || ❌ || ✅ || ✅ || 2025&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Commercial Services ===&lt;br /&gt;
Coming soon&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Contributions are welcome! Here&#039;s how you can help:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Add new models&#039;&#039;&#039; - Document recently released TTS systems&lt;br /&gt;
* &#039;&#039;&#039;Update comparisons&#039;&#039;&#039; - Share performance benchmarks and quality tests&lt;br /&gt;
* &#039;&#039;&#039;Write tutorials and guides&#039;&#039;&#039; - Help others learn to use different TTS tools&lt;br /&gt;
* &#039;&#039;&#039;Upload samples&#039;&#039;&#039; - Provide audio examples for model comparisons (please do not upload copyrighted content!)&lt;br /&gt;
* &#039;&#039;&#039;Fix information&#039;&#039;&#039; - Correct outdated or inaccurate details&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Disclaimer:&#039;&#039;&#039; This wiki is maintained by the community and information may not always be current. Always verify details with official sources before making production decisions. Voice cloning should only be used with proper consent and for ethical purposes.&lt;br /&gt;
&lt;br /&gt;
[[Category:Main]]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
	<entry>
		<id>https://tts.wiki/index.php?title=Main_Page&amp;diff=2</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://tts.wiki/index.php?title=Main_Page&amp;diff=2"/>
		<updated>2025-09-19T03:10:56Z</updated>

		<summary type="html">&lt;p&gt;Ttswikiadmin: Protected &amp;quot;Main Page&amp;quot;: High traffic page ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;strong&amp;gt;MediaWiki has been installed.&amp;lt;/strong&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User&#039;s Guide] for information on using the wiki software.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]&lt;br /&gt;
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]&lt;br /&gt;
* [https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ MediaWiki release mailing list]&lt;br /&gt;
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]&lt;br /&gt;
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]&lt;/div&gt;</summary>
		<author><name>Ttswikiadmin</name></author>
	</entry>
</feed>