🐟 Fish Speech S2 Pro: Quantization Experiments

Comprehensive multi-phase quantization of Fish Audio S2 Pro (4.56B params) with voice cloning samples.

Model Architecture

| Component | Layers | Dim | Heads | Params | Size |
|---|---|---|---|---|---|
| Slow AR (LLM backbone) | 36 | 2560 | 32 (GQA, 8 local) | ~4.0B | ~8.5 GB |
| Fast AR (acoustic decoder) | 4 | 2560 | 32 (GQA, 8 local) | ~0.4B | ~0.8 GB |
| DAC Codec (RVQ) | - | - | - | - | 1.7 GB |
| Total | - | - | - | 4.56B | ~10.8 GB |

Quantization Experiments

Phase 1: Proven Approaches (Zero/Near-Zero Quality Loss)

| ID | Method | Target | Expected Size | Compression | Status |
|---|---|---|---|---|---|
| 1a | FP8 (per-row symmetric) | Slow AR | ~6.8 GB | 1.60x | ✅ Proven (drbaph/s2-pro-fp8) |
| 1b | INT4 (group=128) | Slow AR | ~4.8 GB | 2.24x | ✅ Proven (baicai1145/s2-pro-w4a16) |

Phase 2: Aggressive Approaches (Potential Quality Tradeoffs)

| ID | Method | Target | Expected Size | Compression | Status |
|---|---|---|---|---|---|
| 2a | INT4 (group=128) | All | ~4.9 GB | 2.19x | 🔬 Experimental |
| 2b | INT8 (per-row) | Slow AR | ~6.8 GB | 1.60x | ✅ Safe |
| 2c | INT3 (group=128) | Slow AR | ~4.3 GB | 2.52x | ⚠️ Risky |

Phase 3: Extreme Approaches (Quality Degradation Expected)

| ID | Method | Target | Expected Size | Compression | Status |
|---|---|---|---|---|---|
| 3a | INT2 (group=64) | Slow AR | ~3.8 GB | 2.88x | ❌ Likely degraded |
| 3b | INT2 (group=64) | All | ~3.8 GB | 2.88x | ❌ Likely degraded |
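
The expected sizes for the Slow AR rows can be sanity-checked with simple bits-per-weight arithmetic. The sketch below is a rough estimate, not the method used by scripts/quantize.py: `quantized_size_gb` is a hypothetical helper, it assumes one float32 scale per group, and it treats every Slow AR weight as quantized, even though embeddings and layer norms stay in bf16.

```python
def quantized_size_gb(bf16_size_gb, bits, group_size, scale_bits=32):
    """Estimate quantized size: each weight costs `bits` plus its share of one scale per group."""
    bits_per_weight = bits + scale_bits / group_size
    return bf16_size_gb * bits_per_weight / 16  # bf16 stores 16 bits per weight


# Component sizes in GB, taken from the Model Architecture table
SLOW_AR_GB, FAST_AR_GB, CODEC_GB = 8.5, 0.8, 1.7

# Phase 1b: INT4 (group=128) on Slow AR only; Fast AR and codec stay in bf16
phase_1b = quantized_size_gb(SLOW_AR_GB, bits=4, group_size=128) + FAST_AR_GB + CODEC_GB
print(f"Phase 1b estimate: {phase_1b:.2f} GB")  # Phase 1b estimate: 4.76 GB
```

Under these assumptions the Phase 1b estimate lands close to the ~4.8 GB listed in the table.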

Quick Start

Prerequisites

  • CUDA GPU with ≥24 GB VRAM (A100 40/80 GB recommended)
  • Python 3.10+

Run All Phases

# Clone fish-speech, then change into your checkout of this experiment repo
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech-experiments

# Install dependencies
pip install torch einops loguru ormsgpack hydra-core omegaconf safetensors torchaudio soundfile

# Run all phases
python scripts/quantize.py --phase all --output ./output

# Or run individual phases
python scripts/quantize.py --phase 1a    # FP8 only
python scripts/quantize.py --phase 1b    # INT4 only
python scripts/quantize.py --phase 2c    # INT3 only

Upload to Hub

# Requires HF write token
huggingface-cli login
python scripts/upload_to_hub.py --output ./output

Voice Cloning

Each phase generates two audio samples:

  1. {phase}_tts.wav: Text-to-speech without reference
  2. {phase}_clone.wav: Voice cloning from celebrity reference

The reference audio is generated from the base model using a Morgan Freeman-style deep narration:

"Good morning. I want to tell you something about the universe. Every atom in your body came from a star that exploded. We are all made of star stuff."

Existing Quantized Models on HuggingFace

| Model | Method | Size | Link |
|---|---|---|---|
| fishaudio/s2-pro | BF16 (original) | 10.8 GB | Link |
| drbaph/s2-pro-fp8 | FP8 | 6.2 GB | Link |
| baicai1145/s2-pro-w4a16 | GPTQ INT4 | ~5.5 GB | Link |
| rodrigomt/s2-pro-gguf | GGUF (q2-q8) | 2.4-9.2 GB | Link |

GGUF Sizes (from rodrigomt/s2-pro-gguf)

| Quant | Size | Notes |
|---|---|---|
| f16 | 9.2 GB | Lossless |
| q8_0 | 5.2 GB | Near-lossless |
| q6_k | 4.2 GB | Minimal loss |
| q5_k_m | 3.8 GB | Slight loss |
| q4_k_m | 3.3 GB | Good tradeoff |
| q3_k | 2.8 GB | Noticeable loss |
| q2_k | 2.4 GB | Significant loss |

Quantization Details

FP8 (Phase 1a)

  • Method: Per-row symmetric FP8 (float8_e4m3fn)
  • What's quantized: All nn.Linear weights in Slow AR
  • What's kept in bf16: Embeddings, layer norms, Fast AR, codec
  • Scale: Per-row float32 (captures per-channel variation)
  • Dequant: W_bf16 = W_fp8.to(bfloat16) * scale
  • Quality: Zero perceptible loss
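
The per-row scheme above can be illustrated with a minimal sketch that simulates float8_e4m3fn rounding in pure Python, so it runs without PyTorch or a GPU. The helper names (`round_to_e4m3`, `quantize_row_fp8`, `dequantize_row`) are hypothetical and are not taken from scripts/quantize.py; in PyTorch, the rounding step would correspond to casting the scaled row to `torch.float8_e4m3fn`.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3(x):
    """Round x to the nearest float8_e4m3fn value (simulated, saturating at +/-448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                # spacing with 3 mantissa bits
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_row_fp8(row):
    """Per-row symmetric FP8: one scale maps the row's largest magnitude to 448."""
    amax = max(abs(v) for v in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(codes, scale):
    """Mirror of W_bf16 = W_fp8.to(bfloat16) * scale, for a single row."""
    return [c * scale for c in codes]


row = [0.5, -0.25, 0.125, 0.03]  # illustrative values, not real model weights
codes, scale = quantize_row_fp8(row)
restored = dequantize_row(codes, scale)
```

Because each row gets its own scale, a row with small weights is not forced onto the coarse spacing needed by a row with large weights.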

INT4 (Phase 1b)

  • Method: Group-wise symmetric INT4 (group_size=128)
  • Range: [-7, 7] per weight
  • Scale: Per-group float32
  • Target: Slow AR only (Fast AR + codec in bf16)
  • Quality: Near-zero loss with group_size=128
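
As an illustration of the group-wise scheme described above, here is a minimal pure-Python sketch. The function names are hypothetical, it works on a flat list instead of a weight tensor, and it omits the packing of two 4-bit codes per byte that a real checkpoint would need.

```python
import math


def quantize_int4_groupwise(weights, group_size=128):
    """Symmetric INT4: each group shares one scale and codes lie in [-7, 7]."""
    qmax = 7
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=128):
    """Multiply each code by the scale of the group it belongs to."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Two groups with different magnitudes (synthetic values, not real model weights)
weights = [math.sin(0.1 * i) * (1 + i // 128) for i in range(256)]
codes, scales = quantize_int4_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
```

The second group has roughly twice the magnitude of the first, so it gets a larger scale; with a single per-tensor scale, the first group would lose resolution.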

INT3 (Phase 2c)

  • Method: Group-wise symmetric INT3 (group_size=128)
  • Range: [-3, 3] per weight
  • Expected: Some quality loss, especially on prosody

INT2 (Phase 3)

  • Method: Group-wise symmetric INT2 (group_size=64)
  • Values: {-1, 0, 1} per weight (ternary)
  • Expected: Significant quality degradation
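
To see why quality is expected to drop as the bit width shrinks, the sketch below compares the round-trip error of the INT4, INT3, and INT2 settings on synthetic data. `roundtrip_error` is a hypothetical helper, and the sine-wave input only stands in for real weights, so the numbers say nothing about actual audio quality.

```python
import math


def roundtrip_error(weights, bits, group_size):
    """Mean absolute error after symmetric group-wise quantize and dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 3 for INT3, 1 for INT2
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # fall back to 1.0 for all-zero groups
        for w in group:
            code = max(-qmax, min(qmax, round(w / scale)))
            total += abs(w - code * scale)
    return total / len(weights)


weights = [math.sin(i) for i in range(512)]  # synthetic stand-in, not real model weights
errors = {}
for bits, group_size in [(4, 128), (3, 128), (2, 64)]:
    errors[bits] = roundtrip_error(weights, bits, group_size)
    print(f"INT{bits} (group={group_size}): mean abs error = {errors[bits]:.4f}")
```

Each bit removed roughly halves the number of representable levels, and the smaller group size used for INT2 only partly offsets that.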

Files

fish-speech-experiments/
├── scripts/
│   ├── quantize.py           # Main quantization + sample generation script
│   ├── run_all_phases.py     # Alternative all-in-one script (for HF Jobs)
│   └── upload_to_hub.py      # Upload results to HuggingFace Hub
├── output/                   # Generated quantized models + samples
│   ├── samples/              # Audio samples from each phase
│   ├── phase1a/              # FP8 quantized model
│   ├── phase1b/              # INT4 quantized model
│   ├── phase2a/              # INT4 all layers
│   ├── phase2b/              # INT8 quantized model
│   ├── phase2c/              # INT3 quantized model
│   ├── phase3a/              # INT2 quantized model
│   ├── phase3b/              # INT2 all layers
│   └── all_results.json      # Combined results
├── size_analysis.json        # Theoretical size analysis
└── README.md                 # This file

Citation

@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report}, 
      author={Shijia Liao and Yuxuan Wang and others},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
}

License

Quantized models inherit the Fish Audio Research License. Research and non-commercial use only.
