Support this work: donate.sybilsolutions.ai
REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection
# Qwen3.5-122B-A10B-REAP-20 GGUF
GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.
## Available Quantizations

| File | Quant | BPW | Size | Description |
|---|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. |
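The listed sizes follow roughly from bits-per-weight times parameter count. A quick back-of-envelope check, using the 99B total-parameter figure from the Model Details table (estimates land within a couple of GiB of the listed sizes because different tensor types quantize at different widths):

```python
# Rough GGUF size estimate: total params x bits-per-weight / 8 bytes, in GiB.
def est_size_gib(params: float, bpw: float) -> float:
    return params * bpw / 8 / 2**30

# 99e9 parameters is the pruned model's total count from the model card.
for quant, bpw, listed in [("Q4_K_M", 4.86, 57), ("Q6_K", 6.57, 76), ("Q8_0", 8.51, 99)]:
    print(f"{quant}: ~{est_size_gib(99e9, bpw):.0f} GiB (listed {listed} GB)")
```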
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B |
| Pruned Model | 0xSero/Qwen3.5-122B-A10B-REAP-20 |
| Architecture | Qwen3.5 MoE (GDN + Full Attention hybrid) |
| Total Parameters | 99B (205 experts/layer, down from 256) |
| Active Parameters | ~10B per token (8 experts selected) |
| Context Length | 262,144 tokens |
| Thinking Mode | Yes (reasoning_content in chat completions) |
| Pruning Method | REAP: 20% expert removal with super-expert protection |
| Quantization Tool | llama.cpp (llama-quantize) |
| Converted From | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |
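Thinking mode means chat-completion responses carry the reasoning trace in a separate `reasoning_content` field alongside the final `content`. A minimal sketch of reading both from an OpenAI-style response dict (the response shown here is illustrative, not real model output):

```python
# Illustrative response shape from a thinking-mode chat completion.
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "Compute 127*43: 127*40=5080, 127*3=381, total 5461.",
            "content": "127 * 43 = 5461",
        }
    }]
}

msg = response["choices"][0]["message"]
thinking = msg.get("reasoning_content", "")  # chain of thought; may be absent
answer = msg["content"]                      # user-facing final answer
print(answer)
```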
## Speed Benchmarks

Tested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.

### llama-bench (pp512 / tg128)
| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 49/49 (full) | 295.74 | 27.56 |
| Q6_K | 35/49 (partial) | 121.35 | 15.74 |
| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |
### API Speed (llama-server, real chat completions)
| Quant | Prefill (short) | Prefill (long) | Token Gen |
|---|---|---|---|
| Q4_K_M | 141.8 t/s | 62.3 t/s | 28.4 t/s |
| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |
| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |
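These throughputs translate directly into end-to-end latency: time ≈ prompt_tokens / prefill_rate + output_tokens / generation_rate. A sketch for Q4_K_M with a long prompt, using the rates from the table above (the token counts are made-up example values):

```python
# Estimated wall-clock time for one request at the Q4_K_M API rates above.
prompt_tokens, output_tokens = 4000, 500
prefill_tps, gen_tps = 62.3, 28.4  # long-prompt prefill and token generation

latency_s = prompt_tokens / prefill_tps + output_tokens / gen_tps
print(f"~{latency_s:.0f} s")  # ~82 s
```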
Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.
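One way to raise the GTT ceiling on Linux is via TTM module options; a sketch for a 120 GiB target at 4 KiB pages (the parameter values here are assumptions for this specific target, exact option names vary by kernel version, and the BIOS GART/UMA carve-out must be raised separately):

```shell
# /etc/modprobe.d/amdgpu-gtt.conf
# 120 GiB / 4 KiB pages = 31457280 pages; adjust to your memory budget.
options ttm pages_limit=31457280 page_pool_size=31457280

# Rebuild the initramfs and reboot for the options to take effect
# (command varies by distro; this is the Debian/Ubuntu form):
#   sudo update-initramfs -u && sudo reboot
```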
## Quality Benchmarks

Tested via llama-server API with thinking mode enabled.

### Reasoning (5 questions: math, calculus, logic, code comprehension, knowledge)
| Quant | Score |
|---|---|
| Q4_K_M | 5/5 |
| Q6_K | 5/5 |
| Q8_0 | 5/5 |
All quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.
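Two of those checks are easy to verify mechanically; a quick sketch confirming the arithmetic answer and the calculus answer (the derivative of x^3 + 2x^2 - 5x + 7 is 3x^2 + 4x - 5) via central finite differences:

```python
# Verify the arithmetic item and the calculus item from the reasoning set.
assert 127 * 43 == 5461

f = lambda x: x**3 + 2*x**2 - 5*x + 7
df = lambda x: 3*x**2 + 4*x - 5  # expected derivative
h = 1e-6
for x in (-2.0, 0.0, 1.5, 3.0):
    numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
    assert abs(numeric - df(x)) < 1e-4
print("both checks pass")
```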
### Code Generation (HumanEval subset: 5 problems, executed and tested)
| Quant | Passed |
|---|---|
| Q4_K_M | 4/5 |
| Q6_K | 4/5 |
| Q8_0 | 3/5 |
The model generates correct code for all five problems; the missed points come from parsing the code out of the thinking-format output, not from model quality.
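The extraction step that costs those points looks roughly like this: strip the thinking block, then pull the last fenced code block out of the visible answer. A sketch (the `<think>` tag format is an assumption; the exact delimiters depend on the chat template):

```python
import re

FENCE = "`" * 3  # literal triple backtick, built up to keep this example readable

def extract_code(raw: str) -> str:
    """Drop any <think>...</think> block, then return the last fenced code block."""
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    pattern = FENCE + r"(?:\w+)?\n(.*?)" + FENCE
    blocks = re.findall(pattern, visible, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else visible.strip()

raw = ("<think>plan the loop...</think>Answer:\n"
       + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE)
print(extract_code(raw))
```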
## Full Benchmarks (safetensors, from base model card)
| Benchmark | Score |
|---|---|
| HumanEval | 81.1% |
| HumanEval+ | 76.8% |
| MBPP | 86.2% |
| MBPP+ | 73.0% |
| ARC Challenge | 63.7% |
| HellaSwag | 84.1% |
| TruthfulQA MC2 | 52.4% |
| Winogrande | 75.5% |
See the full model card for complete benchmark results and methodology.
## How to Run

### llama-server (recommended)

```sh
# Q4_K_M: fits in 64 GB, fastest
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --port 8080 --host 0.0.0.0

# With speculative decoding for faster generation
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --spec-type ngram-mod --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64 \
  --port 8080 --host 0.0.0.0
```
### Ollama

```sh
# Create a Modelfile
echo 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile
ollama create reap20 -f Modelfile
ollama run reap20
```
### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    flash_attn=True,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
```
## Which Quant Should I Use?
| Your Setup | Recommended |
|---|---|
| 64 GB VRAM/GTT (e.g., Strix Halo default) | Q4_K_M: full GPU offload, 28 t/s |
| 80-96 GB VRAM/GTT | Q6_K: higher quality, full GPU offload |
| 128+ GB VRAM (e.g., 2x Strix Halo cluster, A100) | Q8_0: near-lossless quality |
| RTX 4090 (24 GB) | Model too large. Use a smaller model. |
## Hardware Notes
This model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:
- Strix Halo (64 GB GTT default): Q4_K_M fits fully, Q6_K/Q8_0 partial offload
- Strix Halo (120 GB GTT increased): All quants fit fully
- 2x Strix Halo cluster (RPC): All quants at full speed
- NVIDIA A100 80GB: Q4_K_M and Q6_K fit fully
- Apple M-series (128 GB): All quants should work via Metal
## What is REAP?

REAP (Router-weighted Expert Activation Pruning) removes the least-salient experts from Mixture-of-Experts models while preserving critical capabilities. This model has 20% of experts removed (256 -> 205 per layer), retaining 97.9% average capability across standard benchmarks.
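A toy sketch of the idea: score each expert by its router-weighted activation over a calibration set, protect the highest-scoring "super experts", and drop the lowest 20% of the rest. The scoring and the super-expert count here are simplified assumptions, not the paper's exact saliency criterion:

```python
# Toy expert pruning: keep super experts, drop the lowest-saliency 20%.
def reap_prune(saliency: list[float], prune_frac: float = 0.20, n_super: int = 8) -> list[int]:
    """Return indices of experts to KEEP, given one saliency score per expert."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i])  # ascending
    protected = set(order[-n_super:])  # super experts are never pruned
    n_prune = int(len(saliency) * prune_frac)
    pruned = [i for i in order if i not in protected][:n_prune]
    return sorted(set(range(len(saliency))) - set(pruned))

# 256 experts with made-up scores -> 205 survivors, matching the card's numbers.
scores = [(i * 37 % 256) / 255 for i in range(256)]  # deterministic fake saliency
kept = reap_prune(scores)
print(len(kept))  # 205
```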
## Credits
- Pruning: 0xSero / Sybil Solutions
- Base Model: Qwen Team
- REAP Method: arxiv:2510.13999
- Quantization: llama.cpp
## License
Same license as the base model. See Qwen3.5-122B-A10B license.
## Model Tree

Base model: Qwen/Qwen3.5-122B-A10B