Support this work: donate.sybilsolutions.ai

REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

GLM-5 REAP-50% BF16 GGUF

Expert-pruned GLM-5 (744B -> ~372B params, 256 -> 128 routed experts) in bf16 GGUF format for llama.cpp inference. This is the full-precision intermediate used to produce quantized GGUFs.

Model Details

Property Value
Base model zai-org/GLM-5 (744B, 256 routed experts)
Pruning REAP saliency pruning, 50% expert removal (256 -> 128 experts)
Format bf16 GGUF (full precision, no quantization loss)
Size ~711 GB
Architecture GlmMoeDsaForCausalLM (MLA + Mixture of Experts + DSA indexer)
Context 202,752 tokens
Active params ~20B per token (8 of 128 experts selected)

Purpose

This bf16 GGUF serves as the source for all quantized variants. Use it to:

  • Produce custom GGUF quantizations with llama-quantize (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.)
  • Generate importance-matrix (imatrix) calibrated quantizations for higher quality at low bits per weight
  • Run full-precision inference if you have sufficient VRAM/RAM (~711 GB)

Quantized Variants

Variant BPW Size Repo
Q3_K_M 3.82 ~170 GB 0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF
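As a sanity check on the sizes in this card, average bits-per-weight times parameter count gives the file size directly. A quick sketch, assuming a ~381B post-pruning parameter count (an assumption, not a figure from the table above) and reading the listed "GB" sizes as GiB:

```python
def gguf_size_gib(params, bpw):
    # File size in GiB: parameters x average bits-per-weight / 8 bits per byte.
    return params * bpw / 8 / 2**30

PARAMS = 381e9  # assumed total parameter count after pruning

bf16_gib = gguf_size_gib(PARAMS, 16)    # ~710 GiB, near the listed ~711 GB
q3km_gib = gguf_size_gib(PARAMS, 3.82)  # ~169 GiB, near the listed ~170 GB
```

Under that assumption both listed sizes line up, which is a useful check when planning which quantization fits your hardware.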

Usage

# Download
huggingface-cli download 0xSero/GLM-5-REAP-50pct-BF16-GGUF --local-dir GLM-5-REAP-50pct-BF16-GGUF

# Quantize to any format
llama-quantize GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-Q4_K_M.gguf Q4_K_M

# With importance matrix for better quality
llama-quantize --imatrix imatrix.dat GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-IQ3_M.gguf IQ3_M
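The imatrix.dat file itself comes from llama.cpp's llama-imatrix tool run over a calibration text. Conceptually, it accumulates squared input activations per weight column so the quantizer can weight rounding error toward the channels that matter most. A minimal numpy sketch of that accumulation (illustrative only; the function and variable names here are hypothetical, not llama.cpp's):

```python
import numpy as np

def accumulate_imatrix(imatrix, activations):
    # Add the squared input activations per column (input channel); summed
    # over a calibration corpus, this approximates each channel's importance.
    return imatrix + (activations ** 2).sum(axis=0)

rng = np.random.default_rng(0)
imatrix = np.zeros(16)               # one entry per input channel
for _ in range(10):                  # ten calibration batches of 4 tokens each
    imatrix = accumulate_imatrix(imatrix, rng.normal(size=(4, 16)))
# A quantizer can then minimize sum(imatrix[j] * (w[j] - dequant(q[j]))**2)
# instead of treating every weight column as equally important.
```

This is why imatrix-calibrated quantizations hold up better at low bits per weight: error is pushed away from high-activation channels.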

Pipeline

  1. REAP pruning of GLM-5 FP8 (256 -> 128 experts) using layerwise saliency scores
  2. FP8 scale repair (block-quantized scale tensors corrected from 256-expert to 128-expert layout)
  3. FP8 -> bf16 GGUF conversion via patched convert_hf_to_gguf.py (auto-dequants FP8, splits fused gate_up_proj, handles 3D expert tensors)
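Step 1 above amounts to ranking experts by a saliency score and keeping the top half. A toy sketch of that selection (shapes and scores are hypothetical; the actual REAP code derives layerwise saliency from calibration data):

```python
import numpy as np

def prune_experts(expert_weights, saliency, keep=128):
    # Keep the `keep` highest-saliency experts, preserving their original
    # order so router indices can be remapped consistently.
    kept = np.sort(np.argsort(saliency)[-keep:])
    return expert_weights[kept], kept

experts = np.random.default_rng(1).normal(size=(256, 8, 4))  # toy 3D expert tensors
scores = np.abs(experts).mean(axis=(1, 2))                   # toy stand-in for saliency
pruned, kept = prune_experts(experts, scores)                # 128 experts remain
```

The same kept-index set would also be used to re-slice the router weights and, in step 2, the block-quantized FP8 scale tensors into the 128-expert layout.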

Why GGUF?

The original FP8 safetensors model has a known KV cache NaN bug in HuggingFace Transformers' GlmMoeDsaAttention implementation. llama.cpp bypasses this entirely with its own inference engine, producing correct output with working KV cache.
