REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Paper • 2510.13999
Support this work: donate.sybilsolutions.ai
REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection
Expert-pruned GLM-5 (744B -> ~372B params, 256 -> 128 routed experts) in bf16 GGUF format for llama.cpp inference. This is the full-precision intermediate used to produce quantized GGUFs.
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Format | bf16 GGUF (full precision, no quantization loss) |
| Size | ~711 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + Mixture of Experts + DSA indexer) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts selected) |
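The 50% expert removal above can be illustrated with a toy version of router-weighted saliency scoring, in the spirit of REAP. This is a hedged sketch only: the shapes, the random calibration data, and the exact score formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy saliency pruning: score each expert by the mean of
# (router gate weight * expert output norm) over calibration tokens,
# then keep the top half. All values here are synthetic stand-ins.
rng = np.random.default_rng(0)
n_tokens, n_experts, n_keep = 1024, 256, 128

gates = rng.random((n_tokens, n_experts))
gates /= gates.sum(axis=1, keepdims=True)      # normalize like router probabilities
out_norms = rng.random((n_tokens, n_experts))  # stand-ins for ||expert_i(x_t)||

saliency = (gates * out_norms).mean(axis=0)    # one score per expert
keep = np.argsort(saliency)[-n_keep:]          # retain 128 of 256 experts
print(len(keep))  # 128
```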
This bf16 GGUF serves as the source for all quantized variants. Use it with `llama-quantize` to produce Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.

| Variant | BPW | Size | Repo |
|---|---|---|---|
| Q3_K_M | 3.82 | ~170 GB | 0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF |
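As a sanity check on the sizes above, a GGUF file's size can be roughly estimated from its bits-per-weight: bytes ≈ params × BPW / 8. The helper below is illustrative; real files add metadata and mix tensor quantizations, so it only lands in the right ballpark.

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Rough decimal-GB size of a GGUF file: params * bits-per-weight / 8."""
    return n_params * bpw / 8 / 1e9

# 372e9 pruned params is taken from the table above.
print(f"{gguf_size_gb(372e9, 3.82):.0f} GB")  # 178 GB -- consistent with ~170 GB for Q3_K_M
print(f"{gguf_size_gb(372e9, 16.0):.0f} GB")  # 744 GB -- same ballpark as the ~711 GB bf16
```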
```bash
# Download the bf16 source
huggingface-cli download 0xSero/GLM-5-REAP-50pct-BF16-GGUF --local-dir GLM-5-REAP-50pct-BF16-GGUF

# Quantize to any format
llama-quantize GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-Q4_K_M.gguf Q4_K_M

# With an importance matrix for better quality
llama-quantize --imatrix imatrix.dat GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-IQ3_M.gguf IQ3_M
```
Converted with llama.cpp's `convert_hf_to_gguf.py`, which auto-dequants the FP8 weights, splits the fused `gate_up_proj` tensor, and handles the 3D expert tensors.

The original FP8 safetensors model has a known KV-cache NaN bug in HuggingFace Transformers' `GlmMoeDsaAttention` implementation. llama.cpp bypasses this entirely with its own inference engine, producing correct output with a working KV cache.
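The fused `gate_up_proj` split mentioned above can be sketched as follows. The layout (gate rows stacked on top of up rows) and the tiny dimensions are assumptions for illustration; the converter's actual tensor layout may differ.

```python
import numpy as np

# Toy fused [2*d_ff, d_model] projection; real models store gate and up
# projections concatenated along the first axis.
d_model, d_ff = 8, 16
fused = np.arange(2 * d_ff * d_model, dtype=np.float32).reshape(2 * d_ff, d_model)

gate_proj, up_proj = np.split(fused, 2, axis=0)  # top half = gate, bottom half = up
assert gate_proj.shape == up_proj.shape == (d_ff, d_model)
print(gate_proj.shape)  # (16, 8)
```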