REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Paper • 2510.13999
Support this work: donate.sybilsolutions.ai
REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection
Expert-pruned GLM-5 (744B -> ~372B params, 256 -> 128 routed experts) in bf16 GGUF format for llama.cpp inference. This is the full-precision intermediate used to produce quantized GGUFs.
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Format | bf16 GGUF (full precision, no quantization loss) |
| Size | ~711 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + Mixture of Experts + DSA indexer) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts selected) |
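The 50% expert removal above can be illustrated with a toy version of router-weighted saliency scoring, in the spirit of REAP. This is a hedged sketch only: the shapes, the random calibration data, and the exact score formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy saliency pruning: score each expert by the mean of
# (router gate weight * expert output norm) over calibration tokens,
# then keep the top half. All values here are synthetic stand-ins.
rng = np.random.default_rng(0)
n_tokens, n_experts, n_keep = 1024, 256, 128

gates = rng.random((n_tokens, n_experts))
gates /= gates.sum(axis=1, keepdims=True)      # normalize like router probabilities
out_norms = rng.random((n_tokens, n_experts))  # stand-ins for ||expert_i(x_t)||

saliency = (gates * out_norms).mean(axis=0)    # one score per expert
keep = np.argsort(saliency)[-n_keep:]          # retain 128 of 256 experts
print(len(keep))  # 128
```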
This bf16 GGUF serves as the source for all quantized variants. Use it with `llama-quantize` to produce Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.

| Variant | BPW | Size | Repo |
|---|---|---|---|
| Q3_K_M | 3.82 | ~170 GB | 0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF |
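As a sanity check on the sizes above, a GGUF file's size can be roughly estimated from its bits-per-weight: bytes ≈ params × BPW / 8. The helper below is illustrative; real files add metadata and mix tensor quantizations, so it only lands in the right ballpark.

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Rough decimal-GB size of a GGUF file: params * bits-per-weight / 8."""
    return n_params * bpw / 8 / 1e9

# 372e9 pruned params is taken from the table above.
print(f"{gguf_size_gb(372e9, 3.82):.0f} GB")  # 178 GB -- consistent with ~170 GB for Q3_K_M
print(f"{gguf_size_gb(372e9, 16.0):.0f} GB")  # 744 GB -- same ballpark as the ~711 GB bf16
```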
```bash
# Download the bf16 source
huggingface-cli download 0xSero/GLM-5-REAP-50pct-BF16-GGUF --local-dir GLM-5-REAP-50pct-BF16-GGUF

# Quantize to any format
llama-quantize GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-Q4_K_M.gguf Q4_K_M

# With an importance matrix for better quality
llama-quantize --imatrix imatrix.dat GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-IQ3_M.gguf IQ3_M
```
Converted with llama.cpp's `convert_hf_to_gguf.py`, which auto-dequants the FP8 weights, splits the fused `gate_up_proj` tensor, and handles the 3D expert tensors.

The original FP8 safetensors model has a known KV-cache NaN bug in HuggingFace Transformers' `GlmMoeDsaAttention` implementation. llama.cpp bypasses this entirely with its own inference engine, producing correct output with a working KV cache.
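The fused `gate_up_proj` split mentioned above can be sketched as follows. The layout (gate rows stacked on top of up rows) and the tiny dimensions are assumptions for illustration; the converter's actual tensor layout may differ.

```python
import numpy as np

# Toy fused [2*d_ff, d_model] projection; real models store gate and up
# projections concatenated along the first axis.
d_model, d_ff = 8, 16
fused = np.arange(2 * d_ff * d_model, dtype=np.float32).reshape(2 * d_ff, d_model)

gate_proj, up_proj = np.split(fused, 2, axis=0)  # top half = gate, bottom half = up
assert gate_proj.shape == up_proj.shape == (d_ff, d_model)
print(gate_proj.shape)  # (16, 8)
```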