SCOREQ-PyTorch
About
This is an unofficial, fairseq-free implementation of the SCOREQ speech quality assessment system proposed in the paper "SCOREQ: Speech Quality Assessment with Contrastive Regression".
The original implementation provides a fairseq-based PyTorch model and an ONNX variant. In practice, the fairseq dependency can be difficult to install with recent Python, PyTorch, and dependency versions. The ONNX variant avoids fairseq, but it can be less convenient for PyTorch-based research workflows and may be difficult to run with GPU acceleration on ARM/aarch64 systems.
A recent study from ICASSP 2026 reports a high correlation between SCOREQ and subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install SCOREQ implementation.
We provide a fairseq-free implementation written directly in PyTorch that matches the original system using converted weights and reimplemented modules.
We also provide a TorchScript variant that can be loaded with only PyTorch, without installing this package.
The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.
In contrast to the original implementation, we support batched audio assessment. However, we recommend running SCOREQ with batch size 1 to avoid metric shifts caused by padding. Batching can be used for faster evaluation when small padding-related score differences are acceptable.
Model Types
As in the original system, we support four SCOREQ model types: two audio domains combined with two modes.
Data domain (what kind of audio is evaluated):
- natural: for audio derived from genuine human speech (audio codecs, VoIP, telephony, speech enhancement, audio restoration).
- synthetic: for audio synthesized by a machine (text-to-speech (TTS), voice conversion (VC), generative speech models).
Mode (whether there is a reference audio to compare with):
- nr: no-reference mode. Assesses the quality of the audio without relying on any reference; higher is better.
- ref: reference mode. Computes the distance between the embeddings of the provided audio and the reference audio; lower is better.
We refer the user to the original repository and paper for more details on model types.
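As a rough illustration of how the two choices combine, the sketch below maps a use case to a data domain and picks the mode from whether a reference is available. The task names, `DOMAIN_BY_TASK`, `pick_model_type`, and `is_better` are hypothetical helpers introduced here for illustration; they are not part of the package API.

```python
from typing import Tuple

# Hypothetical lookup table (not part of scoreq-pytorch): task names are
# illustrative labels for the use cases listed under each data domain.
DOMAIN_BY_TASK = {
    "codec": "natural",
    "voip": "natural",
    "telephony": "natural",
    "speech_enhancement": "natural",
    "audio_restoration": "natural",
    "tts": "synthetic",
    "voice_conversion": "synthetic",
    "generative_speech": "synthetic",
}


def pick_model_type(task: str, has_reference: bool) -> Tuple[str, str]:
    """Return (data_domain, mode) for a task; hypothetical helper."""
    if task not in DOMAIN_BY_TASK:
        raise ValueError(f"unknown task: {task!r}")
    mode = "ref" if has_reference else "nr"
    return DOMAIN_BY_TASK[task], mode


def is_better(mode: str, score_a: float, score_b: float) -> bool:
    """True if score_a indicates better quality than score_b."""
    if mode == "nr":
        return score_a > score_b  # nr: higher quality score is better
    if mode == "ref":
        return score_a < score_b  # ref: lower embedding distance is better
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned pair can be passed as `data_domain` and `mode` when constructing the model. Note that the score direction flips between the two modes, which matters when ranking systems.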
Usage
You can install the repo as a package:
```bash
pip install scoreq-pytorch
```
Or from source:
```bash
git clone https://github.com/Blinorot/scoreq-pytorch.git
cd scoreq-pytorch
pip install -e .
```
The code requires:
| Package | Version |
|---|---|
| Python | >=3.9 |
| PyTorch | >=2.2.0 |
| HuggingFace Hub | >=0.20 |
The TorchScript checkpoint was scripted with PyTorch 2.5.1. We have tested that it works with PyTorch 2.2.0, but PyTorch >=2.5.1 is recommended for the TorchScript variant.
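If it helps, the version guidance above can be checked programmatically. The following is a minimal sketch; `parse_version` and `torchscript_version_status` are hypothetical helper names, and the thresholds are simply the versions stated above.

```python
import re
from typing import Tuple

MIN_SUPPORTED = (2, 2, 0)                # minimum PyTorch version listed above
RECOMMENDED_FOR_TORCHSCRIPT = (2, 5, 1)  # version the checkpoint was scripted with


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the numeric release part, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad '2.6' to (2, 6, 0)
    return tuple(parts)


def torchscript_version_status(version: str) -> str:
    """Classify a PyTorch version string for the TorchScript variant."""
    parsed = parse_version(version)
    if parsed < MIN_SUPPORTED:
        return "unsupported"
    if parsed < RECOMMENDED_FOR_TORCHSCRIPT:
        return "supported"
    return "recommended"
```

In practice, one would call `torchscript_version_status(torch.__version__)` after importing `torch`.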
Then, you can run the model as follows:
```python
import torchaudio
from scoreq_pytorch import SCOREQScoreTorch

device = "cpu"  # set to "cuda" to run on GPU
data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"

scoreq = SCOREQScoreTorch(
    data_domain=data_domain,
    mode=mode,
    device=device,
)  # already in eval mode

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
scoreq_score = scoreq.score(test_wav, ref_wav)  # tensor of shape (batch_size,)
```
You can replace SCOREQScoreTorch with SCOREQScoreScripted to use the TorchScript variant instead. On first use, the package downloads the converted SCOREQ weights from Hugging Face Hub and stores them in the local Hugging Face cache.
For TorchScript, you can skip installing the package and load the model directly:
```python
import torch
import torchaudio
import wget  # third-party package: pip install wget

data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"

# download scripted checkpoint, e.g. using wget
checkpoint_url = f"https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main/scoreq_{data_domain}_{mode}_scripted.pt"
checkpoint_path = ...  # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu"  # set to "cuda" to run on GPU
scoreq = torch.jit.load(checkpoint_path, map_location=device)
scoreq.eval()

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    scoreq_score = scoreq(test_wav, ref_wav)  # tensor of shape (batch_size,)
```
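If adding the third-party `wget` package is undesirable, the checkpoint can likely be fetched with the standard library alone. The sketch below builds the URL from the same pattern used above and downloads it with `urllib.request.urlretrieve`; `scripted_checkpoint_url` and `download_scripted_checkpoint` are hypothetical helper names, not part of the package.

```python
import urllib.request

# Base URL taken from the checkpoint URL pattern shown above.
REPO_URL = "https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main"
DATA_DOMAINS = ("natural", "synthetic")
MODES = ("nr", "ref")


def scripted_checkpoint_url(data_domain: str, mode: str) -> str:
    """Build the URL of a scripted checkpoint; hypothetical helper."""
    if data_domain not in DATA_DOMAINS:
        raise ValueError(f"data_domain must be one of {DATA_DOMAINS}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    return f"{REPO_URL}/scoreq_{data_domain}_{mode}_scripted.pt"


def download_scripted_checkpoint(data_domain: str, mode: str, checkpoint_path: str) -> str:
    """Download the checkpoint to checkpoint_path using only the standard library."""
    url = scripted_checkpoint_url(data_domain, mode)
    urllib.request.urlretrieve(url, checkpoint_path)
    return checkpoint_path
```

The returned path can then be passed to `torch.jit.load` as in the example above.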
Notes
The model expects audio sampled at 16 kHz.
Accepted tensor shapes:
| Shape | Meaning |
|---|---|
| (T,) | single mono test waveform |
| (1, T) | single mono test waveform with channel dimension |
| (B, 1, T) | batch of mono test waveforms |
The input should be a floating point PyTorch tensor. Stereo audio should be converted to mono before scoring. scoreq.score(test_wav) returns a tensor of shape (batch_size,), where each value is a predicted quality score.
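To catch shape mistakes such as unconverted stereo input before scoring, a small check along the following lines may help. This is a sketch only: `describe_input_shape` is a hypothetical helper that mirrors the accepted shapes listed above, and it does not reflect how the package validates its inputs internally.

```python
from typing import Sequence


def describe_input_shape(shape: Sequence[int]) -> str:
    """Classify a waveform shape as (T,), (1, T), or (B, 1, T); hypothetical helper."""
    shape = tuple(shape)
    if len(shape) == 1:
        return "single mono waveform"
    if len(shape) == 2:
        if shape[0] != 1:
            raise ValueError(f"expected (1, T), got {shape}; convert to mono first")
        return "single mono waveform with channel dimension"
    if len(shape) == 3:
        if shape[1] != 1:
            raise ValueError(f"expected (B, 1, T), got {shape}; convert to mono first")
        return f"batch of {shape[0]} mono waveforms"
    raise ValueError(f"unsupported number of dimensions: {len(shape)}")
```

Because `torch.Size` behaves like a tuple, the helper can be called as `describe_input_shape(test_wav.shape)`.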
In reference (ref) mode, a reference audio ref_wav must also be provided: scoreq.score(test_wav, ref_wav).
Note that score() and forward() return the same values. The only difference is that score() is decorated with torch.no_grad() for convenient inference. Since the raw TorchScript module exposes forward(), it is called directly as scoreq(test_wav, ref_wav) rather than through the package wrapper's scoreq.score(test_wav, ref_wav).
Batch size 1 is recommended to avoid padding-related score shifts.
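One way to follow this recommendation for a set of variable-length files is to score each waveform separately, so that no padding is ever introduced. The sketch below assumes no-reference mode and a scoring callable that returns one value per batch item; `score_one_at_a_time` is a hypothetical helper, not part of the package.

```python
from typing import Any, Callable, Iterable, List


def score_one_at_a_time(score_fn: Callable[[Any], Any], waveforms: Iterable[Any]) -> List[float]:
    """Score each waveform on its own (batch size 1), avoiding padding entirely.

    score_fn is assumed to return an indexable result with one score per
    batch item, e.g. a tensor of shape (1,) for a single waveform.
    """
    scores = []
    for wav in waveforms:
        result = score_fn(wav)
        scores.append(float(result[0]))  # single item, so take the first score
    return scores
```

With the package wrapper, this could be used as `score_one_at_a_time(scoreq.score, list_of_wavs)`, where `list_of_wavs` is an assumed list of pre-processed mono 16 kHz tensors. This trades throughput for scores that are unaffected by padding.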
API classes:
| Class | Description |
|---|---|
| SCOREQScoreTorch | PyTorch implementation using converted weights. |
| SCOREQScoreScripted | Wrapper around the TorchScript checkpoint. |
Citation
If you use this package, please cite the original SCOREQ paper:
@article{ragano2024scoreq,
title={SCOREQ: Speech quality assessment with contrastive regression},
author={Ragano, Alessandro and Skoglund, Jan and Hines, Andrew},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={105702--105729},
year={2024}
}