ESDCodec: High-Fidelity Neural Speech Codec via Thoroughly Enhanced Semantic Quantizer and Decoder
Abstract
Despite recent advances in neural speech codecs, achieving high-fidelity speech reconstruction at low bitrates remains a formidable challenge. To address this limitation, we propose ESDCodec, a speech codec that integrates a thoroughly enhanced semantic quantizer and a conditioned decoder network. Specifically, we employ a randomly initialized and frozen codebook, followed by a lightweight projector, to encode semantic details entirely within a linear space while enhancing codebook utilization. To further improve perceptual quality, we design a condition network that injects prior subband knowledge into the upsampling decoder. Taking the de-quantized feature as input, this network predicts subband signals, thereby providing fine-grained guidance for waveform reconstruction. Extensive experiments show that ESDCodec achieves superior reconstruction performance at a low bitrate of 0.85kbps. For LLM-based speech generation task, ESDCodec also consistently outperforms existing codec models.
Installation
pip install esdcodec
News
- 2026-02-24: Release ESDCodec training and inference codes.
Model List
| Model | Frame Rate | Training Dataset | Discription |
|---|---|---|---|
| esdcodec_25hz_16384_1024 | 25Hz | Emilia(English and Chinese) | Adopt enhanced semantic quantizer and conditioned decoder network |
Inference
- First, download checkpoint and config to local:
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download vspeech/ESDCodec esdcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir esdcodec_ckpts
- To run example inference:
python infer.py
Training
- Clone and install
pip install "esdcodec[tts]"
git clone https://anonymous.4open.science/r/ESDCodec.git
cd ESDCodec
- Prepare the training_file in config, e.g., Emilia dataset list data.list
/path/to/your/xxx.tar
/path/to/your/yyy.tar
...
- To run example training on Emilia dataset
accelerate launch train.py --config-name=esdcodec_train \
trainer.batch_size=3 \
data.segment_speech.segment_length=96000
Acknowledgement
This repo is directly based on the following excellent projects:
