SCNet: Enhancing GAN-based Speech Generation with Subband Condition Network and Magnitude-aware Phase Loss

Recent speech generation has been predominantly driven by GAN-based networks that aim at high-quality waveform synthesis from mel-spectrograms. However, these methods often operate as black-box models, which leads to the loss of inherent spectral information. In this work, we propose SCNet, a GAN-based vocoder augmented with a Subband Condition Network to address this issue. Specifically, SCNet uses a subband signal, predicted by a lightweight condition network, as prior knowledge. This subband signal is transformed via STFT into Fourier coefficients, which are integrated into the backbone for enhanced reconstruction. Additionally, to mitigate phase wrapping, we introduce a magnitude-aware phase loss that computes instantaneous phase errors weighted by the corresponding magnitude, emphasizing regions with higher energy. Experimental results demonstrate that SCNet achieves superior performance in both objective and subjective evaluations for high-quality speech generation.
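The idea of a magnitude-aware phase loss can be illustrated with a minimal sketch. This is not the SCNet implementation: the function names `anti_wrap` and `magnitude_aware_phase_loss`, the use of the reference magnitude as the weight, and the normalization by the total magnitude are all assumptions made for illustration. It only shows how a wrapped phase error can be weighted so that high-energy bins dominate.

```python
import cmath
import math


def anti_wrap(delta: float) -> float:
    """Return the absolute phase difference folded into [0, pi].

    Phases that differ by a multiple of 2*pi are treated as identical,
    which avoids penalizing the wrap-around at +/- pi.
    """
    return abs(delta - 2.0 * math.pi * round(delta / (2.0 * math.pi)))


def magnitude_aware_phase_loss(reference, estimate, eps=1e-8):
    """Magnitude-weighted mean of wrapped phase errors (illustrative only).

    `reference` and `estimate` are equal-length sequences of complex
    Fourier coefficients. Weighting by the reference magnitude and
    normalizing by its sum are assumptions, not the paper's exact formula.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    weighted_error = 0.0
    total_weight = 0.0
    for ref_bin, est_bin in zip(reference, estimate):
        weight = abs(ref_bin)  # bins with more energy count more
        error = anti_wrap(cmath.phase(ref_bin) - cmath.phase(est_bin))
        weighted_error += weight * error
        total_weight += weight
    return weighted_error / (total_weight + eps)
```

With this weighting, a phase error in a near-silent bin contributes almost nothing, while the same error in a high-energy bin is penalized strongly.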

Pre-requisites

Python >= 3.10
Clone this repository:

git clone https://github.com/vspeech/SCNet.git
cd SCNet

Install python requirements:

pip install -r requirements.txt

Pre-Trained Models

You can download the pre-trained LibriTTS model here and copy it to the cp_scnet directory.

Alternatively, download it from Hugging Face:

huggingface-cli download vspeech/SCNet g_02005000 config.json --local-dir cp_scnet

Inference

See inference.py for details.

python inference.py \
    --input_wavs_dir /path/to/your/input_wav \
    --checkpoint_file /path/to/your/cp_scnet/model \
    --output_dir /path/to/your/output_wav
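To call the same command from a Python script, one option is to build the argument list programmatically. The helper `build_inference_command` below is hypothetical and not part of the repository; it only assembles the three flags shown above and assumes it is run from the SCNet directory.

```python
import sys
from pathlib import Path


def build_inference_command(input_wavs_dir, checkpoint_file, output_dir):
    """Assemble the inference.py command line as an argument list.

    Hypothetical helper: it uses only the flags documented in this README
    and the current Python interpreter.
    """
    return [
        sys.executable,
        "inference.py",
        "--input_wavs_dir", str(Path(input_wavs_dir)),
        "--checkpoint_file", str(Path(checkpoint_file)),
        "--output_dir", str(Path(output_dir)),
    ]
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues when paths contain spaces.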
