🐟 Fish Speech S2 Pro — Quantization Experiments

Multi-phase quantization experiments on Fish Audio S2 Pro (4.56B parameters), with text-to-speech and voice cloning samples generated for each phase.

Model Architecture

Component	Layers	Dim	Heads	Params	Size
Slow AR (LLM backbone)	36	2560	32 (GQA, 8 local)	~4.0B	~8.5 GB
Fast AR (acoustic decoder)	4	2560	32 (GQA, 8 local)	~0.4B	~0.8 GB
DAC Codec (RVQ)	—	—	—	—	1.7 GB
Total	—	—	—	4.56B	~10.8 GB

Quantization Experiments

Phase 1: Proven Approaches (Zero/Near-Zero Quality Loss)

ID	Method	Target	Expected Size	Compression	Status
1a	FP8 (per-row symmetric)	Slow AR	~6.8 GB	1.60x	✅ Proven (drbaph/s2-pro-fp8)
1b	INT4 (group=128)	Slow AR	~4.8 GB	2.24x	✅ Proven (baicai1145/s2-pro-w4a16)

Phase 2: Aggressive Approaches (Potential Quality Tradeoffs)

ID	Method	Target	Expected Size	Compression	Status
2a	INT4 (group=128)	All	~4.9 GB	2.19x	🔬 Experimental
2b	INT8 (per-row)	Slow AR	~6.8 GB	1.60x	✅ Safe
2c	INT3 (group=128)	Slow AR	~4.3 GB	2.52x	⚠️ Risky

Phase 3: Extreme Approaches (Quality Degradation Expected)

ID	Method	Target	Expected Size	Compression	Status
3a	INT2 (group=64)	Slow AR	~3.8 GB	2.88x	❌ Likely degraded
3b	INT2 (group=64)	All	~3.8 GB	2.88x	❌ Likely degraded
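
The Expected Size figures in the phase tables can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an approximation under stated assumptions, not the calculation used by scripts/quantize.py: it assumes one float32 scale per group, takes the ~4.0B Slow AR parameter count and the ~0.8 GB and 1.7 GB bf16 components from the architecture table, and ignores embeddings and layer norms that stay in bf16. For that reason its results should come out somewhat below the table values. The function name `estimate_total_gb` and its parameters are hypothetical.

```python
def estimate_total_gb(bits, group_size, quantized_params=4.0e9, bf16_rest_gb=0.8 + 1.7):
    """Rough checkpoint size in GB when only the Slow AR weights are quantized."""
    # Each weight costs `bits`, plus its share of one float32 (32-bit) scale per group.
    bits_per_weight = bits + 32 / group_size
    quantized_gb = quantized_params * bits_per_weight / 8 / 1e9
    # Fast AR and codec are assumed to stay in bf16 at their listed sizes.
    return quantized_gb + bf16_rest_gb


for phase, bits, group in [("1b", 4, 128), ("2c", 3, 128), ("3a", 2, 64)]:
    print(f"phase {phase}: ~{estimate_total_gb(bits, group):.1f} GB")
```

Smaller groups raise the per-weight overhead of the scales, which is why INT2 with group=64 saves less than the drop from 3 to 2 bits alone would suggest.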

Quick Start

Prerequisites

CUDA GPU with ≥24 GB VRAM (A100 40/80 GB recommended)
Python 3.10+

Run All Phases

# Clone fish-speech, then change into a local checkout of this experiment repo
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech-experiments    # assumes this repo is already checked out here

# Install dependencies
pip install torch einops loguru ormsgpack hydra-core omegaconf safetensors torchaudio soundfile

# Run all phases
python scripts/quantize.py --phase all --output ./output

# Or run individual phases
python scripts/quantize.py --phase 1a    # FP8 only
python scripts/quantize.py --phase 1b    # INT4 only
python scripts/quantize.py --phase 2c    # INT3 only

Upload to Hub

# Requires HF write token
huggingface-cli login
python scripts/upload_to_hub.py --output ./output

Voice Cloning

Each phase generates two audio samples:

{phase}_tts.wav — Text-to-speech without reference
{phase}_clone.wav — Voice cloning from celebrity reference

The reference audio is synthesized by the base model in a deep, Morgan Freeman-style narration voice, reading:

"Good morning. I want to tell you something about the universe. Every atom in your body came from a star that exploded. We are all made of star stuff."

Existing Quantized Models on HuggingFace

Model	Method	Size
fishaudio/s2-pro	BF16 (original)	10.8 GB
drbaph/s2-pro-fp8	FP8	6.2 GB
baicai1145/s2-pro-w4a16	GPTQ INT4	~5.5 GB
rodrigomt/s2-pro-gguf	GGUF (q2-q8)	2.4-9.2 GB

GGUF Sizes (from rodrigomt/s2-pro-gguf)

Quant	Size	Notes
f16	9.2 GB	Lossless
q8_0	5.2 GB	Near-lossless
q6_k	4.2 GB	Minimal loss
q5_k_m	3.8 GB	Slight loss
q4_k_m	3.3 GB	Good tradeoff
q3_k	2.8 GB	Noticeable loss
q2_k	2.4 GB	Significant loss

Quantization Details

FP8 (Phase 1a)

Method: Per-row symmetric FP8 (float8_e4m3fn)
What's quantized: All nn.Linear weights in Slow AR
What's kept in bf16: Embeddings, layer norms, Fast AR, codec
Scale: Per-row float32 (captures per-channel variation)
Dequant: W_bf16 = W_fp8.to(bfloat16) * scale
Quality: Zero perceptible loss

INT4 (Phase 1b)

Method: Group-wise symmetric INT4 (group_size=128)
Range: [-7, 7] per weight
Scale: Per-group float32
Target: Slow AR only (Fast AR + codec in bf16)
Quality: Near-zero loss with group_size=128

INT3 (Phase 2c)

Method: Group-wise symmetric INT3 (group_size=128)
Range: [-3, 3] per weight
Expected: Some quality loss, especially on prosody

INT2 (Phase 3)

Method: Group-wise symmetric INT2 (group_size=64)
Range: [-1, 1] per weight, i.e. only the values {-1, 0, 1} (effectively ternary)
Expected: Significant quality degradation
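
The INT4, INT3 and INT2 ranges listed above all follow from the same rule for symmetric quantization: the largest code is 2^(bits-1) - 1, which gives 7, 3 and 1 respectively. The sketch below illustrates group-wise symmetric quantization on a flat list of weights. It is written in plain Python so the arithmetic is visible, and it is an assumption-based illustration rather than the implementation in scripts/quantize.py; the names `quantize_groupwise`, `dequantize_groupwise` and `mean_abs_error` are hypothetical.

```python
def quantize_groupwise(weights, bits, group_size):
    """Symmetric group-wise quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 3 for INT3, 1 for INT2
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto qmax; guard all-zero groups.
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Multiply each integer code by the scale of the group it belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


def mean_abs_error(weights, bits, group_size):
    codes, scales = quantize_groupwise(weights, bits, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# Deterministic toy weights in roughly [-2, 2], for illustration only.
toy = [((i * 37) % 101 - 50) / 25 for i in range(512)]
for bits, group in [(4, 128), (3, 128), (2, 64)]:
    print(f"INT{bits} (group={group}): mean abs error {mean_abs_error(toy, bits, group):.4f}")
```

With only three levels available at INT2, every weight smaller than half the group maximum rounds to zero, which is consistent with the degradation expected in Phase 3.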

Files

fish-speech-experiments/
├── scripts/
│   ├── quantize.py           # Main quantization + sample generation script
│   ├── run_all_phases.py     # Alternative all-in-one script (for HF Jobs)
│   └── upload_to_hub.py      # Upload results to HuggingFace Hub
├── output/                   # Generated quantized models + samples
│   ├── samples/              # Audio samples from each phase
│   ├── phase1a/              # FP8 quantized model
│   ├── phase1b/              # INT4 quantized model
│   ├── phase2a/              # INT4 all layers
│   ├── phase2b/              # INT8 quantized model
│   ├── phase2c/              # INT3 quantized model
│   ├── phase3a/              # INT2 quantized model
│   ├── phase3b/              # INT2 all layers
│   └── all_results.json      # Combined results
├── size_analysis.json        # Theoretical size analysis
└── README.md                 # This file

Citation

@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report}, 
      author={Shijia Liao and Yuxuan Wang and others},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
}

License

Quantized models inherit the Fish Audio Research License. Research and non-commercial use only.
