Ttswikiadmin: Created page with "'''SNAC''' (Multi-Scale Neural Audio Codec) is a neural audio codec that introduces multi-scale temporal quantization for efficient audio compression. It was presented at the NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation by researchers from Papla Media and ETH Zurich. === Overview === Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use..."

2025-12-23T03:44:01Z

'''SNAC''' (Multi-Scale Neural Audio Codec) is a neural audio codec that introduces multi-scale temporal quantization for efficient audio compression. It was presented at the NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation by researchers from [[Papla Media]] and ETH Zurich.

=== Overview ===
Neural audio codecs have recently gained popularity because they can represent audio signals with high fidelity at very low bitrates, making it feasible to use language modeling approaches for audio generation and understanding. Residual Vector Quantization (RVQ), which quantizes a signal with a cascade of VQ codebooks, has become the standard technique for neural audio compression. SNAC proposes a simple extension of RVQ in which the quantizers can operate at different temporal resolutions.

=== Architecture ===
SNAC encodes audio into hierarchical tokens, similarly to SoundStream, EnCodec, and [[DAC]]. The difference is that SNAC samples coarse tokens less frequently, so each coarse token covers a broader time span.

The architecture includes several key innovations:

* '''Multi-Scale Quantization''': A hierarchy of quantizers at variable frame rates lets the codec adapt to the audio structure across multiple timescales.
* '''Noise Blocks''': Inject input-dependent Gaussian noise for enhanced expressiveness.
* '''Depthwise Convolutions''': Provide efficient computation and training stability.
* '''Local Windowed Attention''': Attention layers at the lowest temporal resolution capture contextual relationships.
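
The multi-scale quantization idea can be illustrated with a toy sketch. This is not SNAC's implementation: it assumes scalar latents instead of vectors, tiny hand-written codebooks, average pooling for downsampling, and repetition for upsampling. The function names and the strides <code>[4, 2, 1]</code> are illustrative assumptions.

```python
def avg_pool(frames, stride):
    """Downsample by averaging each group of `stride` consecutive frames."""
    return [sum(frames[i:i + stride]) / stride
            for i in range(0, len(frames), stride)]


def nearest(value, codebook):
    """Return the index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def multiscale_rvq(latents, codebooks, strides):
    """Toy residual quantization where each level runs at its own frame rate.

    Assumes len(latents) is divisible by every stride.
    """
    residual = list(latents)
    tokens = []
    for codebook, stride in zip(codebooks, strides):
        pooled = avg_pool(residual, stride)            # coarser time grid
        indices = [nearest(v, codebook) for v in pooled]
        tokens.append(indices)
        # Upsample the quantized values back to the full rate by repetition,
        # then pass what is still unexplained on to the next level.
        upsampled = [codebook[i] for i in indices for _ in range(stride)]
        residual = [r - q for r, q in zip(residual, upsampled)]
    return tokens, residual


latents = [0.9, 1.1, 1.0, 1.2, -1.0, -0.8, -1.1, -0.9]
codebooks = [[-1.0, 1.0], [-0.1, 0.1], [-0.05, 0.05]]  # hypothetical values
strides = [4, 2, 1]                                     # coarse to fine

tokens, residual = multiscale_rvq(latents, codebooks, strides)
for level, idx in enumerate(tokens):
    print(f"level {level}: {len(idx)} tokens")
```

For eight input frames the three levels emit 2, 4, and 8 tokens, so the coarsest level covers four times the time span per token of the finest one.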

=== Model Variants ===
SNAC offers several pretrained models optimized for different use cases:
{| class="wikitable"
!Model
!Sample Rate
!Bitrate
!RVQ Levels
!Token Rates
!Parameters
!Use Case
|-
|snac_24khz
|24 kHz
|0.98 kbps
|3
|12, 23, and 47 Hz
|~20M
|Speech
|-
|snac_32khz
|32 kHz
|1.9 kbps
|4
|10, 21, 42, and 83 Hz
|~55M
|General audio
|-
|snac_44khz
|44 kHz
|2.6 kbps
|4
|14, 29, 57, and 115 Hz
|~55M
|Music/SFX
|}
Each codebook holds 4096 entries (12 bits per token). The general audio model consists of 16M parameters in the encoder and 38.3M in the decoder, totaling 54.5M parameters.
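
The bitrates in the table follow from the token rates and the codebook size. As a rough check, assuming every token costs log2(4096) = 12 bits and using the rounded token rates listed above (so the results are approximate), the helper below is an illustrative sketch rather than part of any SNAC API:

```python
import math


def bitrate_kbps(token_rates_hz, codebook_size=4096):
    """Approximate bitrate: total tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)  # 12 bits for 4096 entries
    return sum(token_rates_hz) * bits_per_token / 1000


# Token rates copied from the table above (rounded values).
variants = {
    "snac_24khz": [12, 23, 47],
    "snac_32khz": [10, 21, 42, 83],
    "snac_44khz": [14, 29, 57, 115],
}

for name, rates in variants.items():
    print(f"{name}: {bitrate_kbps(rates):.2f} kbps")
```

The results (about 0.98, 1.87, and 2.58 kbps) agree with the table's 0.98, 1.9, and 2.6 kbps after rounding.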

=== Performance ===
In the reported speech evaluations, SNAC outperformed the other codecs tested. Even at bitrates below 1 kbit/s, its output stayed close to the reference signal in quality. SNAC also outperformed competing codecs such as EnCodec and DAC at comparable bitrates, and matched the quality of systems operating at twice its bitrate.

=== Applications ===
SNAC has been adopted in several text-to-speech systems:

* [[Orpheus TTS|'''Orpheus TTS''']]: Orpheus uses SNAC's 24 kHz speech model, which creates tokens at three levels of hierarchy. The SNAC model is relatively lightweight and fast, making it suitable for real-time decoding.

With coarse tokens at ~10 Hz and a context window of 2048 tokens, a language model can capture the consistent structure of an audio track for roughly 3 minutes (2048 / 10 Hz ≈ 205 s).
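
The arithmetic behind that estimate can be written out as a short sketch. It assumes the context window is filled with coarse tokens only; the function name is illustrative.

```python
def context_span_seconds(context_tokens, coarse_rate_hz):
    """Seconds of audio covered when every context slot holds a coarse token."""
    return context_tokens / coarse_rate_hz


span = context_span_seconds(2048, 10)
print(f"{span:.1f} s = {span / 60:.1f} min")
```

If finer-level tokens share the same context window, the covered duration shrinks accordingly.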

=== Comparison with Other Codecs ===
The SNAC configuration used by Orpheus produces about 83 tokens per second in total, compared to 50 tokens per second for [[X-Codec|X-Codec 2.0]] and 25 tokens per second for [[CosyVoice]]'s codec. SNAC emits a separate token stream for each level of temporal downsampling, each at its own rate, in contrast to codecs like [[Mimi]], which use multiple codebooks at a single frame rate.

[[Category:Neural audio codecs]]
