Paper: Fish Audio S2 Technical Report (arXiv:2603.08823)
Comprehensive multi-phase quantization of Fish Audio S2 Pro (4.56B params) with voice cloning samples.
| Component | Layers | Dim | Heads | Params | Size |
|---|---|---|---|---|---|
| Slow AR (LLM backbone) | 36 | 2560 | 32 (GQA, 8 local) | ~4.0B | ~8.5 GB |
| Fast AR (acoustic decoder) | 4 | 2560 | 32 (GQA, 8 local) | ~0.4B | ~0.8 GB |
| DAC Codec (RVQ) | - | - | - | - | 1.7 GB |
| Total | - | - | - | 4.56B | ~10.8 GB |

Quantization phases:

| ID | Method | Target | Expected Size | Compression | Status |
|---|---|---|---|---|---|
| 1a | FP8 (per-row symmetric) | Slow AR | ~6.8 GB | 1.60x | Proven (drbaph/s2-pro-fp8) |
| 1b | INT4 (group=128) | Slow AR | ~4.8 GB | 2.24x | Proven (baicai1145/s2-pro-w4a16) |
| 2a | INT4 (group=128) | All | ~4.9 GB | 2.19x | Experimental |
| 2b | INT8 (per-row) | Slow AR | ~6.8 GB | 1.60x | Safe |
| 2c | INT3 (group=128) | Slow AR | ~4.3 GB | 2.52x | Risky |
| 3a | INT2 (group=64) | Slow AR | ~3.8 GB | 2.88x | Likely degraded |
| 3b | INT2 (group=64) | All | ~3.8 GB | 2.88x | Likely degraded |
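
To make the `group=128` and `group=64` entries concrete, here is a minimal pure-Python sketch of group-wise integer quantization. It assumes a symmetric round-to-nearest scheme with one 16-bit scale per group. The repository's scripts may use a different scheme (for example asymmetric with zero points), and the names `quantize_groups`, `dequantize_groups`, and `bits_per_weight` are hypothetical.

```python
import math


def quantize_groups(weights, bits=4, group_size=128):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 3 for INT3, 1 for INT2
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid dividing by zero for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-qmax, min(qmax, q)))
    return codes, scales


def dequantize_groups(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def bits_per_weight(bits, group_size, scale_bits=16):
    """Effective storage cost per weight, counting the shared scale."""
    return bits + scale_bits / group_size


if __name__ == "__main__":
    demo = [math.sin(0.1 * i) for i in range(256)]
    codes, scales = quantize_groups(demo, bits=4, group_size=128)
    restored = dequantize_groups(codes, scales, group_size=128)
    worst = max(abs(a - b) for a, b in zip(demo, restored))
    print("max abs error:", worst)
    print("INT4, group=128:", bits_per_weight(4, 128), "bits per weight")
```

Under these assumptions, smaller groups improve accuracy but raise the per-weight overhead: INT4 with `group=128` costs 4.125 bits per weight, while INT2 with `group=64` costs 2.25.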
```bash
# Clone fish-speech and this experiment repo
git clone https://github.com/fishaudio/fish-speech.git
# Clone this experiment repo as well (its URL is not listed here), then enter it
cd fish-speech-experiments

# Install dependencies
pip install torch einops loguru ormsgpack hydra-core omegaconf safetensors torchaudio soundfile

# Run all phases
python scripts/quantize.py --phase all --output ./output

# Or run individual phases
python scripts/quantize.py --phase 1a  # FP8 only
python scripts/quantize.py --phase 1b  # INT4 only
python scripts/quantize.py --phase 2c  # INT3 only

# Upload results to the Hub (requires HF write token)
huggingface-cli login
python scripts/upload_to_hub.py --output ./output
```
Each phase generates two audio samples:

- `{phase}_tts.wav`: text-to-speech without reference
- `{phase}_clone.wav`: voice cloning from celebrity reference

The reference audio is generated from the base model using a Morgan Freeman-style deep narration:

*"Good morning. I want to tell you something about the universe. Every atom in your body came from a star that exploded. We are all made of star stuff."*
| Model | Method | Size | Link |
|---|---|---|---|
| fishaudio/s2-pro | BF16 (original) | 10.8 GB | Link |
| drbaph/s2-pro-fp8 | FP8 | 6.2 GB | Link |
| baicai1145/s2-pro-w4a16 | GPTQ INT4 | ~5.5 GB | Link |
| rodrigomt/s2-pro-gguf | GGUF (q2-q8) | 2.4-9.2 GB | Link |

GGUF quantization levels (sizes match the 2.4-9.2 GB range listed for rodrigomt/s2-pro-gguf):

| Quant | Size | Notes |
|---|---|---|
| f16 | 9.2 GB | Lossless |
| q8_0 | 5.2 GB | Near-lossless |
| q6_k | 4.2 GB | Minimal loss |
| q5_k_m | 3.8 GB | Slight loss |
| q4_k_m | 3.3 GB | Good tradeoff |
| q3_k | 2.8 GB | Noticeable loss |
| q2_k | 2.4 GB | Significant loss |
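
As a small illustration of how the size column can be used, the sketch below picks the largest GGUF quant that fits a given size budget. The sizes are copied from the table. The helper name `largest_quant_within` is hypothetical, and the figures are file sizes, so actual runtime memory use will be higher.

```python
# File sizes in GB, copied from the GGUF table.
GGUF_SIZES_GB = {
    "f16": 9.2,
    "q8_0": 5.2,
    "q6_k": 4.2,
    "q5_k_m": 3.8,
    "q4_k_m": 3.3,
    "q3_k": 2.8,
    "q2_k": 2.4,
}


def largest_quant_within(budget_gb, sizes=GGUF_SIZES_GB):
    """Return the name of the largest quant whose file fits the budget.

    Returns None when even the smallest file exceeds the budget.
    """
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (2.0, 4.0, 6.0):
        print(budget, "GB ->", largest_quant_within(budget))
```

Choosing by size alone ignores the quality notes in the table, so treat the result as a starting point and listen to the generated samples before settling on a level.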
For FP8, the quantized tensors are the `nn.Linear` weights in Slow AR, and dequantization is `W_bf16 = W_fp8.to(bfloat16) * scale`.

```
fish-speech-experiments/
├── scripts/
│   ├── quantize.py          # Main quantization + sample generation script
│   ├── run_all_phases.py    # Alternative all-in-one script (for HF Jobs)
│   └── upload_to_hub.py     # Upload results to HuggingFace Hub
├── output/                  # Generated quantized models + samples
│   ├── samples/             # Audio samples from each phase
│   ├── phase1a/             # FP8 quantized model
│   ├── phase1b/             # INT4 quantized model
│   ├── phase2a/             # INT4 all layers
│   ├── phase2b/             # INT8 quantized model
│   ├── phase2c/             # INT3 quantized model
│   ├── phase3a/             # INT2 quantized model
│   ├── phase3b/             # INT2 all layers
│   └── all_results.json     # Combined results
├── size_analysis.json       # Theoretical size analysis
└── README.md                # This file
```
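
The FP8 rule `W_bf16 = W_fp8.to(bfloat16) * scale` can be illustrated with a pure-Python sketch of per-row symmetric quantization. It assumes the E4M3 variant of FP8 (3 mantissa bits, maximum finite value 448) and simulates the rounding with ordinary floats. The actual script presumably casts tensors with PyTorch instead, and the helper names here are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 format


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value (simulation).

    Magnitudes above 448 saturate; this sketch does not model NaN handling.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)           # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)      # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)   # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_row_fp8(row):
    """Per-row symmetric scaling: the largest |w| in the row maps to 448."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / E4M3_MAX if max_abs > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in row], scale


def dequantize_row(w_fp8, scale):
    """Mirror of W_bf16 = W_fp8.to(bfloat16) * scale for a single row."""
    return [q * scale for q in w_fp8]


if __name__ == "__main__":
    row = [0.3, -1.0, 0.25, 2.0]
    w_fp8, scale = quantize_row_fp8(row)
    print("scaled FP8 values:", w_fp8)
    print("restored row:", dequantize_row(w_fp8, scale))
```

Because each row has its own scale, a row with small weights keeps the same relative precision as a row with large ones. For values in the normal range, the relative rounding error in this simulation stays within 1/16.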
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and others},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
}
Quantized models inherit the Fish Audio Research License. Research and non-commercial use only.