ESDCodec: High-Fidelity Neural Speech Codec via Thoroughly Enhanced Semantic Quantizer and Decoder

Abstract

Despite recent advances in neural speech codecs, achieving high-fidelity speech reconstruction at low bitrates remains a formidable challenge. To address this challenge, we propose ESDCodec, a speech codec that integrates a thoroughly enhanced semantic quantizer and a conditioned decoder network. Specifically, we employ a randomly initialized, frozen codebook followed by a lightweight projector, which encodes semantic details entirely within a linear space while improving codebook utilization. To further improve perceptual quality, we design a condition network that injects prior subband knowledge into the upsampling decoder. Taking the de-quantized feature as input, this network predicts subband signals, thereby providing fine-grained guidance for waveform reconstruction. Extensive experiments show that ESDCodec achieves superior reconstruction performance at a low bitrate of 0.85 kbps. On LLM-based speech generation tasks, ESDCodec also consistently outperforms existing codec models.
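To make the quantizer idea in the abstract concrete, here is a toy sketch of nearest-neighbour lookup against a frozen random codebook seen through a linear projector. It is not the ESDCodec implementation: the function names, the single-matrix projector, the Gaussian initialization, and the squared-distance lookup are all assumptions chosen for illustration, and training of the projector is omitted.

```python
import random


def make_frozen_codebook(num_codes, dim, seed=0):
    """Draw a random codebook once; it is never updated afterwards (assumed Gaussian init)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def project(vec, weight):
    """Apply a linear map: vec (dim_in) times weight (dim_in x dim_out)."""
    dim_out = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(dim_out)]


def quantize(feature, codebook, weight):
    """Return (index, vector) of the projected code closest to the feature."""
    projected = [project(code, weight) for code in codebook]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(projected)), key=lambda i: sq_dist(feature, projected[i]))
    return index, projected[index]


# Hypothetical usage: with an identity projector, a code quantizes to itself.
codebook = make_frozen_codebook(num_codes=8, dim=4)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
index, vector = quantize(codebook[3], codebook, identity)
```

In a setup like this only the projector weights would be learned, so every code moves together under the same linear map, which is one plausible reading of how a frozen codebook can help utilization.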

ESDCodec

Installation

pip install esdcodec

News

  • 2026-02-24: Released the ESDCodec training and inference code.

Model List

Model | Frame Rate | Training Dataset | Description
esdcodec_25hz_16384_1024 | 25 Hz | Emilia (English and Chinese) | Adopts the enhanced semantic quantizer and conditioned decoder network
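As a rough aid for reading the frame rate in the table, the bitrate of a token-based codec is usually the frame rate times the number of bits emitted per frame. The helper below is a generic sketch of that arithmetic, not code from this repository; the function name is hypothetical, and treating 16384 in the model name as a codebook size is an assumption that this README does not confirm.

```python
import math


def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Bits per second when each frame emits one index per codebook."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


# Assumption: a single 16384-entry codebook at 25 frames per second.
print(bitrate_bps(25, [16384]))  # 350.0
```

Under that assumption a single such codebook accounts for 350 bps, so the 0.85 kbps quoted in the abstract presumably reflects additional codebooks or settings that are not described here.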

Inference

  1. Download the checkpoints and config to a local directory:
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download vspeech/ESDCodec esdcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir esdcodec_ckpts
  2. Run the example inference:
python infer.py

Training

  1. Clone the repository and install the package:
pip install "esdcodec[tts]"
git clone https://anonymous.4open.science/r/ESDCodec.git
cd ESDCodec
  2. Prepare the training_file referenced in the config, e.g., an Emilia dataset list data.list with one shard path per line:
/path/to/your/xxx.tar
/path/to/your/yyy.tar
...
  3. Run the example training on the Emilia dataset:
accelerate launch train.py --config-name=esdcodec_train \
trainer.batch_size=3 \
data.segment_speech.segment_length=96000
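The data.list file from the training steps is a plain text file with one .tar path per line. If the shards sit in a single directory, a small script can generate it. This is a minimal sketch using only the standard library; the function name and the flat, single-directory layout are assumptions, not part of the repository.

```python
from pathlib import Path


def write_shard_list(shard_dir, list_path="data.list"):
    """Write the absolute path of every .tar file in shard_dir, one per line."""
    shards = sorted(Path(shard_dir).resolve().glob("*.tar"))
    Path(list_path).write_text("".join(f"{shard}\n" for shard in shards))
    return len(shards)
```

For example, `write_shard_list("/path/to/your")` would write data.list in the current directory and return the number of shards found; the resulting file is what training_file in the config should point to.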

Acknowledgement

This repo is directly based on the following excellent projects:
