Ttswikiadmin at 02:33, 23 December 2025

2025-12-23T02:33:59Z

← Older revision		Revision as of 02:33, 23 December 2025
Line 1:		Line 1:
	'''~~NeuCodec~~''' is a neural audio codec ~~developed by [[Neuphonic]],~~ designed for ~~efficient speech tokenization and high-quality audio compression~~ at ~~relatively low bitrates~~.		'''X-Codec''' is a neural audio codec designed to enhance semantic understanding in audio large language models (LLMs). It was introduced in the paper "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model," published at AAAI 2025.

	=== ~~Technical Specifications~~ ===		=== Background ===
			Traditional audio codecs such as EnCodec were originally designed for audio compression, which can lead to suboptimal performance in the context of audio LLMs. The paper reports that methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) because acoustic tokens are semantically misinterpreted, resulting in skipped and incorrect words.

	* '''Bitrate:''' 0.8 kbps		=== Architecture ===
	* '''Output sample rate:''' 24 kHz		X-Codec addresses these limitations through a dual-encoder design that incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ.
	* '''Frame rate:''' 50 Hz
	* '''Quantization:''' Finite Scalar Quantization (~~FSQ~~) ~~with~~ a ~~single codebook~~

	~~=== Architecture ===~~		The architecture consists of:
	~~NeuCodec is largely based on extending the work~~ of [[X-Codec|X-Codec 2.0]]. It employs a dual-encoder approach, using both audio ([[BigCodec]]) and semantic (Wav2Vec2-BERT) encoders. The FSQ-based design produces a single quantized vector output, making it well-suited for downstream Speech Language Model (SpeechLM) training.

	~~=== Features ===~~		* '''Acoustic Encoder/Decoder''': Convolutional encoder and decoder with a Residual Vector Quantizer (RVQ)
			* '''Semantic Module''': A pre-trained self-supervised model such as HuBERT or WavLM
			* '''Projectors''': Linear layers that combine and process the acoustic and semantic features

	* Compresses and ~~reconstructs audio with near~~-~~inaudible reconstruction loss~~		The acoustic and semantic features are concatenated, transformed, and then quantized together. After quantization, separate post-processing layers reconstruct both semantic and acoustic representations.
	* Upsamples from 16 kHz to 24 kHz
	* Commercial use permitted
	* Pre-encoded datasets available (Emilia-YODAS compressed from 1.~~7 TB to 41 GB)~~
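The X-Codec pipeline described above (concatenate acoustic and semantic features, project them, quantize with RVQ, then reconstruct both streams) can be illustrated with a minimal NumPy sketch. All dimensions, the random codebooks, and the single linear projectors are assumptions made for illustration; they are not the sizes or trained weights used by X-Codec, and the mean-squared error here merely stands in for the semantic reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous stage."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        # Squared distance from every frame to every codebook entry.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen
        residual -= chosen
        codes.append(idx)
    return np.stack(codes, axis=1), quantized

# Hypothetical sizes, chosen only to keep the example small.
frames, acoustic_dim, semantic_dim, latent_dim = 50, 32, 48, 16
acoustic = rng.normal(size=(frames, acoustic_dim))  # stand-in for acoustic encoder output
semantic = rng.normal(size=(frames, semantic_dim))  # stand-in for HuBERT/WavLM features

# Projector: one linear layer applied to the concatenated features.
w_in = rng.normal(size=(acoustic_dim + semantic_dim, latent_dim)) * 0.1
fused = np.concatenate([acoustic, semantic], axis=-1) @ w_in

# Four RVQ stages with 64 entries each (random here, learned in a real codec).
codebooks = [rng.normal(size=(64, latent_dim)) for _ in range(4)]
codes, quantized = rvq_encode(fused, codebooks)

# Separate post-quantization projectors for the semantic and acoustic branches.
w_sem = rng.normal(size=(latent_dim, semantic_dim)) * 0.1
w_ac = rng.normal(size=(latent_dim, acoustic_dim)) * 0.1
semantic_hat = quantized @ w_sem
acoustic_hat = quantized @ w_ac

# Stand-in for the semantic reconstruction loss computed after RVQ.
semantic_loss = float(np.mean((semantic_hat - semantic) ** 2))
```

In this sketch every frame yields one token per RVQ stage, so <code>codes</code> has shape (frames, stages). A language model built on such a codec has to predict several tokens per frame.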

	=== Applications ===		=== Applications ===
	~~NeuCodec serves as~~ the ~~audio~~ codec ~~for [[NeuTTS Air]]~~, ~~Neuphonic~~'~~s on~~-~~device~~ text-to-speech ~~model with voice cloning capabilities~~. It'~~s intended~~ for ~~researchers~~ and ~~developers building~~ text-to-speech ~~systems who need efficient speech tokenization without developing their own~~ codec.		X-Codec demonstrated improvements across multiple audio generation tasks including text-to-speech synthesis, music continuation, and general audio classification tasks. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation.

			== X-Codec 2.0 ==
			'''X-Codec 2.0''' (also written as XCodec2) is a successor to X-Codec, introduced alongside the LLaSA text-to-speech system in the paper "LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis."

			=== Key Differences from X-Codec ===
			X-Codec 2.0 extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized. Major architectural changes include:

			* '''Unified Semantic-Acoustic Tokenization''': X-Codec 2.0 fuses the outputs of a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding that captures both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
			* '''Single-Stage Vector Quantization''': Unlike the multi-layer residual VQ used in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec 2.0 uses a single layer of Finite Scalar Quantization (FSQ) for stability and compatibility with causal language models.
			* '''Large Codebook''': A codebook of 65,536 entries, implemented with Finite Scalar Quantization, reaches 99% codebook usage. This size is comparable to the vocabularies of text tokenizers (LLaMA 3 uses 128,256).

			=== Technical Specifications ===

			* '''Semantic Encoder''': Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
			* '''Training Data''': Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
			* '''Quantization''': Finite Scalar Quantization (FSQ), which does not require an explicit VQ objective term (e.g., codebook commitment loss), simplifying optimization during training.
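A minimal sketch of the idea behind FSQ may help: each latent dimension is bounded and rounded to a small number of levels, and the combination of levels forms the token index. There are no learned codebook vectors, which is why no commitment loss is needed. The level layout below (8 dimensions with 4 levels each) is an assumption chosen because 4<sup>8</sup> = 65,536; the bounding function is also simplified and is not taken from the X-Codec 2.0 implementation.

```python
import numpy as np

# Assumption for illustration: 8 latent dimensions, 4 levels each (4**8 = 65,536).
LEVELS = np.array([4] * 8)
# Mixed-radix basis used to turn per-dimension levels into one integer token.
BASIS = np.cumprod(np.concatenate(([1], LEVELS[:-1])))

def fsq_quantize(z):
    """Bound each dimension to [0, 1] and round it to an integer level in [0, L-1]."""
    bounded = (np.tanh(z) + 1.0) / 2.0
    return np.round(bounded * (LEVELS - 1)).astype(np.int64)

def levels_to_index(q):
    """Combine per-dimension levels into a single token id."""
    return (q * BASIS).sum(axis=-1)

def index_to_levels(idx):
    """Invert levels_to_index, recovering the per-dimension levels."""
    return (idx[..., None] // BASIS) % LEVELS

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 8))        # stand-in for 50 frames of fused encoder output
q = fsq_quantize(z)
tokens = levels_to_index(q)         # one token per frame
codebook_size = int(np.prod(LEVELS))
```

Because the rounding step is not differentiable, training implementations of FSQ typically pass gradients through it with a straight-through estimator; that part is omitted here.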

			=== Derivatives ===
			X-Codec 2.0 has been extended in several ways:

			* '''NeuCodec''': Neuphonic's codec for on-device TTS, which is largely based on extending X-Codec 2.0
			* '''XCodec2-Streaming''': A streaming variant that adopts a causal decoder to focus solely on historical context, enabling streaming waveform reconstruction.

	=== Availability ===		=== Availability ===
	~~Available~~ on Hugging Face ~~and GitHub under~~ the <code>~~neuphonic/neucodec~~</code> ~~repository, installable via pip~~.		X-Codec is available on GitHub and integrated into Hugging Face's Transformers library. X-Codec 2.0 is available via the <code>xcodec2</code> Python package and on Hugging Face.

	[[Category:Neural audio codecs]]		[[Category:Neural audio codecs]]

Ttswikiadmin: Created page with "'''NeuCodec''' is a neural audio codec developed by Neuphonic, designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates. === Technical Specifications === * '''Bitrate:''' 0.8 kbps * '''Output sample rate:''' 24 kHz * '''Frame rate:''' 50 Hz * '''Quantization:''' Finite Scalar Quantization (FSQ) with a single codebook === Architecture === NeuCodec is largely based on extending the work of X-Codec 2.0. It e..."

2025-12-23T02:30:16Z

New page

'''NeuCodec''' is a neural audio codec developed by [[Neuphonic]], designed for efficient speech tokenization and high-quality audio compression at relatively low bitrates.

=== Technical Specifications ===

* '''Bitrate:''' 0.8 kbps
* '''Output sample rate:''' 24 kHz
* '''Frame rate:''' 50 Hz
* '''Quantization:''' Finite Scalar Quantization (FSQ) with a single codebook
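The listed bitrate and frame rate can be related by simple arithmetic: with a single codebook, the bitrate is the frame rate multiplied by the number of bits per token. The sketch below is a consistency check rather than a statement from Neuphonic; the codebook size of 65,536 is inferred from the figures (0.8 kbps at 50 Hz implies 16 bits per frame), and the helper function name is hypothetical.

```python
import math

frame_rate_hz = 50     # from the specification list
bitrate_bps = 800      # 0.8 kbps from the specification list

# Bits available for each frame's single token.
bits_per_frame = bitrate_bps / frame_rate_hz
# Codebook size implied by that many bits (an inference, not a stated figure).
implied_codebook_size = 2 ** int(bits_per_frame)

def codec_bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bitrate of a token-based codec: frames/s x codebooks x bits per token."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)
```

For example, <code>codec_bitrate_bps(50, 65536)</code> reproduces the 800 bps figure.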

=== Architecture ===
NeuCodec is largely based on extending the work of [[X-Codec|X-Codec 2.0]]. It employs a dual-encoder approach, using both audio ([[BigCodec]]) and semantic (Wav2Vec2-BERT) encoders. The FSQ-based design produces a single quantized vector output, making it well-suited for downstream Speech Language Model (SpeechLM) training.

=== Features ===

* Compresses and reconstructs audio with near-inaudible reconstruction loss
* Upsamples from 16 kHz to 24 kHz
* Commercial use permitted
* Pre-encoded datasets available (Emilia-YODAS compressed from 1.7 TB to 41 GB)

=== Applications ===
NeuCodec serves as the audio codec for [[NeuTTS Air]], Neuphonic's on-device text-to-speech model with voice cloning capabilities. It's intended for researchers and developers building text-to-speech systems who need efficient speech tokenization without developing their own codec.

=== Availability ===
Available on Hugging Face and GitHub under the <code>neuphonic/neucodec</code> repository, installable via pip.

[[Category:Neural audio codecs]]

X-Codec - Revision history
