Support this work: donate.sybilsolutions.ai
REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection
# Qwen3.5-122B-A10B-REAP-20 GGUF
GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.
## Available Quantizations

| File | Quant | BPW | Size | Description |
|---|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. |
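The listed sizes follow roughly from bits-per-weight times parameter count. A quick back-of-envelope check, using the 99B total-parameter figure from the Model Details table (estimates land within a couple of GiB of the listed sizes because different tensor types quantize at different widths):

```python
# Rough GGUF size estimate: total params x bits-per-weight / 8 bytes, in GiB.
def est_size_gib(params: float, bpw: float) -> float:
    return params * bpw / 8 / 2**30

# 99e9 parameters is the pruned model's total count from the model card.
for quant, bpw, listed in [("Q4_K_M", 4.86, 57), ("Q6_K", 6.57, 76), ("Q8_0", 8.51, 99)]:
    print(f"{quant}: ~{est_size_gib(99e9, bpw):.0f} GiB (listed {listed} GB)")
```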
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B |
| Pruned Model | 0xSero/Qwen3.5-122B-A10B-REAP-20 |
| Architecture | Qwen3.5 MoE (GDN + Full Attention hybrid) |
| Total Parameters | 99B (205 experts/layer, down from 256) |
| Active Parameters | ~10B per token (8 experts selected) |
| Context Length | 262,144 tokens |
| Thinking Mode | Yes (reasoning_content in chat completions) |
| Pruning Method | REAP: 20% expert removal with super-expert protection |
| Quantization Tool | llama.cpp (llama-quantize) |
| Converted From | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |
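Thinking mode means chat-completion responses carry the reasoning trace in a separate `reasoning_content` field alongside the final `content`. A minimal sketch of reading both from an OpenAI-style response dict (the response shown here is illustrative, not real model output):

```python
# Illustrative response shape from a thinking-mode chat completion.
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "Compute 127*43: 127*40=5080, 127*3=381, total 5461.",
            "content": "127 * 43 = 5461",
        }
    }]
}

msg = response["choices"][0]["message"]
thinking = msg.get("reasoning_content", "")  # chain of thought; may be absent
answer = msg["content"]                      # user-facing final answer
print(answer)
```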
## Speed Benchmarks

Tested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.

### llama-bench (pp512 / tg128)
| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 49/49 (full) | 295.74 | 27.56 |
| Q6_K | 35/49 (partial) | 121.35 | 15.74 |
| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |
### API Speed (llama-server, real chat completions)
| Quant | Prefill (short) | Prefill (long) | Token Gen |
|---|---|---|---|
| Q4_K_M | 141.8 t/s | 62.3 t/s | 28.4 t/s |
| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |
| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |
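These throughputs translate directly into end-to-end latency: time ≈ prompt_tokens / prefill_rate + output_tokens / generation_rate. A sketch for Q4_K_M with a long prompt, using the rates from the table above (the token counts are made-up example values):

```python
# Estimated wall-clock time for one request at the Q4_K_M API rates above.
prompt_tokens, output_tokens = 4000, 500
prefill_tps, gen_tps = 62.3, 28.4  # long-prompt prefill and token generation

latency_s = prompt_tokens / prefill_tps + output_tokens / gen_tps
print(f"~{latency_s:.0f} s")  # ~82 s
```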
Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.
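One way to raise the GTT ceiling on Linux is via TTM module options; a sketch for a 120 GiB target at 4 KiB pages (the parameter values here are assumptions for this specific target, exact option names vary by kernel version, and the BIOS GART/UMA carve-out must be raised separately):

```shell
# /etc/modprobe.d/amdgpu-gtt.conf
# 120 GiB / 4 KiB pages = 31457280 pages; adjust to your memory budget.
options ttm pages_limit=31457280 page_pool_size=31457280

# Rebuild the initramfs and reboot for the options to take effect
# (command varies by distro; this is the Debian/Ubuntu form):
#   sudo update-initramfs -u && sudo reboot
```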
## Quality Benchmarks

Tested via llama-server API with thinking mode enabled.

### Reasoning (5 questions: math, calculus, logic, code comprehension, knowledge)
| Quant | Score |
|---|---|
| Q4_K_M | 5/5 |
| Q6_K | 5/5 |
| Q8_0 | 5/5 |
All quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.
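Two of those checks are easy to verify mechanically; a quick sketch confirming the arithmetic answer and the calculus answer (the derivative of x^3 + 2x^2 - 5x + 7 is 3x^2 + 4x - 5) via central finite differences:

```python
# Verify the arithmetic item and the calculus item from the reasoning set.
assert 127 * 43 == 5461

f = lambda x: x**3 + 2*x**2 - 5*x + 7
df = lambda x: 3*x**2 + 4*x - 5  # expected derivative
h = 1e-6
for x in (-2.0, 0.0, 1.5, 3.0):
    numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
    assert abs(numeric - df(x)) < 1e-4
print("both checks pass")
```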
### Code Generation (HumanEval subset: 5 problems, executed and tested)
| Quant | Passed |
|---|---|
| Q4_K_M | 4/5 |
| Q6_K | 4/5 |
| Q8_0 | 3/5 |
The model generates correct code for all five problems; the missed points come from parsing the code out of the thinking-format output, not from model quality.
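The extraction step that costs those points looks roughly like this: strip the thinking block, then pull the last fenced code block out of the visible answer. A sketch (the `<think>` tag format is an assumption; the exact delimiters depend on the chat template):

```python
import re

FENCE = "`" * 3  # literal triple backtick, built up to keep this example readable

def extract_code(raw: str) -> str:
    """Drop any <think>...</think> block, then return the last fenced code block."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    pattern = FENCE + r"(?:\w+)?\n(.*?)" + FENCE
    blocks = re.findall(pattern, visible, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else visible.strip()

raw = ("<think>plan the loop...</think>Answer:\n"
       + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE)
print(extract_code(raw))
```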
## Full Benchmarks (safetensors, from base model card)
| Benchmark | Score |
|---|---|
| HumanEval | 81.1% |
| HumanEval+ | 76.8% |
| MBPP | 86.2% |
| MBPP+ | 73.0% |
| ARC Challenge | 63.7% |
| HellaSwag | 84.1% |
| TruthfulQA MC2 | 52.4% |
| Winogrande | 75.5% |
See the full model card for complete benchmark results and methodology.
## How to Run

### llama-server (recommended)

```sh
# Q4_K_M: fits in 64 GB, fastest
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --port 8080 --host 0.0.0.0

# With speculative decoding for faster generation
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --spec-type ngram-mod --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64 \
  --port 8080 --host 0.0.0.0
```
### Ollama

```sh
# Create a Modelfile
echo 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile
ollama create reap20 -f Modelfile
ollama run reap20
```
### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    flash_attn=True,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
```
## Which Quant Should I Use?
| Your Setup | Recommended |
|---|---|
| 64 GB VRAM/GTT (e.g., Strix Halo default) | Q4_K_M: full GPU offload, 28 t/s |
| 80-96 GB VRAM/GTT | Q6_K: higher quality, full GPU offload |
| 128+ GB VRAM (e.g., 2x Strix Halo cluster, A100) | Q8_0: near-lossless quality |
| RTX 4090 (24 GB) | Model too large. Use a smaller model. |
## Hardware Notes
This model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:
- Strix Halo (64 GB GTT default): Q4_K_M fits fully, Q6_K/Q8_0 partial offload
- Strix Halo (120 GB GTT increased): All quants fit fully
- 2x Strix Halo cluster (RPC): All quants at full speed
- NVIDIA A100 80GB: Q4_K_M and Q6_K fit fully
- Apple M-series (128 GB): All quants should work via Metal
## What is REAP?

REAP (Router-weighted Expert Activation Pruning) removes the least-salient experts from Mixture-of-Experts models while preserving critical capabilities. This model has 20% of experts removed (256 -> 205 per layer), retaining 97.9% average capability across standard benchmarks.
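A toy sketch of the idea: score each expert by its router-weighted activation over a calibration set, protect the highest-scoring "super experts", and drop the lowest 20% of the rest. The scoring and the super-expert count here are simplified assumptions, not the paper's exact saliency criterion:

```python
# Toy expert pruning: keep super experts, drop the lowest-saliency 20%.
def reap_prune(saliency: list[float], prune_frac: float = 0.20, n_super: int = 8) -> list[int]:
    """Return indices of experts to KEEP, given one saliency score per expert."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i])  # ascending
    protected = set(order[-n_super:])  # super experts are never pruned
    n_prune = int(len(saliency) * prune_frac)
    pruned = [i for i in order if i not in protected][:n_prune]
    return sorted(set(range(len(saliency))) - set(pruned))

# 256 experts with made-up scores -> 205 survivors, matching the card's numbers.
scores = [(i * 37 % 256) / 255 for i in range(256)]  # deterministic fake saliency
kept = reap_prune(scores)
print(len(kept))  # 205
```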
## Credits
- Pruning: 0xSero / Sybil Solutions
- Base Model: Qwen Team
- REAP Method: arxiv:2510.13999
- Quantization: llama.cpp
## License
Same license as the base model. See Qwen3.5-122B-A10B license.
## Model Tree

Base model: Qwen/Qwen3.5-122B-A10B