Support this work: donate.sybilsolutions.ai

REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

Qwen3.5-122B-A10B-REAP-20 – GGUF

GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.

Available Quantizations

| File | Quant | BPW | Size | Description |
|---|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. |
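The listed sizes follow directly from the bits-per-weight figures. A rough sanity check, assuming all 99B parameters are stored at the average BPW (real files mix tensor precisions for embeddings and norms, so this is a ballpark only):

```python
# Rough GGUF file-size estimate from bits-per-weight (BPW).
def estimate_size_gib(params_billions: float, bpw: float) -> float:
    total_bits = params_billions * 1e9 * bpw
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB

for quant, bpw in [("Q4_K_M", 4.86), ("Q6_K", 6.57), ("Q8_0", 8.51)]:
    print(f"{quant}: ~{estimate_size_gib(99, bpw):.0f} GiB")
# Q4_K_M: ~56 GiB, Q6_K: ~76 GiB, Q8_0: ~98 GiB
```

These estimates land within about 1 GB of the listed file sizes.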

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B |
| Pruned Model | 0xSero/Qwen3.5-122B-A10B-REAP-20 |
| Architecture | Qwen3.5 MoE (GDN + Full Attention hybrid) |
| Total Parameters | 99B (205 experts/layer, down from 256) |
| Active Parameters | ~10B per token (8 experts selected) |
| Context Length | 262,144 tokens |
| Thinking Mode | Yes (`reasoning_content` in chat completions) |
| Pruning Method | REAP – 20% expert removal with super-expert protection |
| Quantization Tool | llama.cpp (llama-quantize) |
| Converted From | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |
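With thinking mode on, the reasoning arrives in a separate field of the chat-completion message. A minimal sketch of separating it from the final answer, assuming an OpenAI-style response shape with the `reasoning_content` field named in the table above:

```python
# Split the model's "thinking" from its final answer in an
# OpenAI-style chat completion dict. The `reasoning_content` field
# name follows this model card; adjust if your server differs.
def split_thinking(response: dict) -> tuple[str, str]:
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content", ""), msg.get("content", "")

# Example payload shaped like a llama-server response:
resp = {
    "choices": [{
        "message": {
            "reasoning_content": "127 * 43 = 127*40 + 127*3 = 5461",
            "content": "The answer is 5461.",
        }
    }]
}
thinking, answer = split_thinking(resp)
print(answer)  # The answer is 5461.
```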

Speed Benchmarks

Tested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.

llama-bench (pp512 / tg128)

| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 49/49 (full) | 295.74 | 27.56 |
| Q6_K | 35/49 (partial) | 121.35 | 15.74 |
| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |
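The pp512/tg128 numbers above can be reproduced with llama.cpp's bundled `llama-bench` tool; a sketch of the invocation (model path is whichever quant you downloaded):

```shell
# pp512 / tg128 benchmark with full GPU offload and flash attention
llama-bench -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -p 512 -n 128 -ngl 99 -fa 1
```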

API Speed (llama-server, real chat completions)

| Quant | Prefill, short (t/s) | Prefill, long (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 141.8 | 62.3 | 28.4 |
| Q6_K | 48.8 | 21.7 | 15.4 |
| Q8_0 | 25.8 | 14.2 | 9.0 |

Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.
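One possible modprobe sketch for raising the GTT limit to roughly 120 GB on a 128 GB Strix Halo system. The `ttm` parameter names and the pages-based units (120 GiB / 4 KiB pages = 31457280) are kernel-version dependent; verify against your distro's amdgpu documentation before applying:

```shell
# Raise the GTT/TTM limit (values are in 4 KiB pages: 120 GiB -> 31457280).
cat <<'EOF' | sudo tee /etc/modprobe.d/amdgpu-gtt.conf
options ttm pages_limit=31457280 page_pool_size=31457280
EOF
# Rebuild the initramfs and reboot for the change to take effect.
```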

Quality Benchmarks

Tested via llama-server API with thinking mode enabled.

Reasoning (5 questions: math, calculus, logic, code comprehension, knowledge)

| Quant | Score |
|---|---|
| Q4_K_M | 5/5 |
| Q6_K | 5/5 |
| Q8_0 | 5/5 |

All quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.

Code Generation (HumanEval subset: 5 problems, executed and tested)

| Quant | Passed |
|---|---|
| Q4_K_M | 4/5 |
| Q6_K | 4/5 |
| Q8_0 | 3/5 |

The model generates functionally correct code for all five problems; the score differences come from failures in extracting the code out of the thinking-format output, not from model quality.
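A hypothetical sketch of the extraction step that causes this kind of score gap: strip any thinking span, then take the last fenced code block (the function name, tag format, and fallback behavior here are illustrative assumptions, not the actual harness):

```python
import re

# Strip <think>...</think> spans, then pull the last fenced code block.
# If no fence is found, fall back to the raw (de-thinked) text.
def extract_code(output: str) -> str:
    no_think = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    blocks = re.findall(r"```(?:python)?\n(.*?)```", no_think, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else no_think.strip()

sample = "<think>plan the loop...</think>Here you go:\n```python\ndef add(a, b):\n    return a + b\n```"
print(extract_code(sample))  # prints just the function body
```

When the model nests fences inside its reasoning or omits the closing fence, a simple extractor like this silently returns the wrong span, which is exactly how a correct completion can score as a failure.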

Full Benchmarks (safetensors, from base model card)

| Benchmark | Score |
|---|---|
| HumanEval | 81.1% |
| HumanEval+ | 76.8% |
| MBPP | 86.2% |
| MBPP+ | 73.0% |
| ARC Challenge | 63.7% |
| HellaSwag | 84.1% |
| TruthfulQA MC2 | 52.4% |
| Winogrande | 75.5% |

See the full model card for complete benchmark results and methodology.

How to Run

llama-server (recommended)

```bash
# Q4_K_M – fits in 64 GB, fastest
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --port 8080 --host 0.0.0.0

# With speculative decoding for faster generation
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --spec-type ngram-mod --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64 \
  --port 8080 --host 0.0.0.0
```
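Once the server is up, it exposes an OpenAI-compatible API. A minimal request sketch, assuming the default `/v1/chat/completions` route on the port configured above:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 127 * 43?"}],
    "max_tokens": 512
  }'
```

With thinking mode enabled, the reply carries the reasoning in `reasoning_content` alongside the final `content`.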

Ollama

```bash
# Create a Modelfile pointing at the downloaded GGUF
echo 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile
ollama create reap20 -f Modelfile
ollama run reap20
```

Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    flash_attn=True,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
```

Which Quant Should I Use?

| Your Setup | Recommended |
|---|---|
| 64 GB VRAM/GTT (e.g., Strix Halo default) | Q4_K_M – full GPU offload, 28 t/s |
| 80-96 GB VRAM/GTT | Q6_K – higher quality, full GPU offload |
| 128+ GB VRAM (e.g., 2x Strix Halo cluster, A100) | Q8_0 – near-lossless quality |
| RTX 4090 (24 GB) | Model too large. Use a smaller model. |

Hardware Notes

This model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:

  • Strix Halo (64 GB GTT default): Q4_K_M fits fully, Q6_K/Q8_0 partial offload
  • Strix Halo (120 GB GTT increased): All quants fit fully
  • 2x Strix Halo cluster (RPC): All quants at full speed
  • NVIDIA A100 80GB: Q4_K_M and Q6_K fit fully
  • Apple M-series (128 GB): All quants should work via Metal

What is REAP?

REAP (Router-weighted Expert Activation Pruning) removes the least-salient experts from Mixture-of-Experts models while preserving critical capabilities. This model has 20% of experts removed (256 -> 205 per layer), retaining 97.9% average capability across standard benchmarks.
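A toy illustration of the idea, loosely following REAP's router-weighted saliency: score each expert by its gate-weighted activation norm averaged over tokens, then drop the lowest-scoring 20%. The real method also protects "super experts"; this sketch omits that, and all data here is synthetic:

```python
import random

random.seed(0)
num_experts, num_tokens = 10, 1000

# Fake per-token router probabilities and expert output norms.
gates = [[random.random() for _ in range(num_experts)] for _ in range(num_tokens)]
norms = [[random.random() for _ in range(num_experts)] for _ in range(num_tokens)]

# Saliency proxy: mean router-weighted activation norm per expert.
saliency = [
    sum(gates[t][e] * norms[t][e] for t in range(num_tokens)) / num_tokens
    for e in range(num_experts)
]

keep = int(num_experts * 0.8)  # prune 20% -> keep 8 of 10
kept = sorted(range(num_experts), key=lambda e: saliency[e], reverse=True)[:keep]
print(f"kept {len(kept)} of {num_experts} experts")  # kept 8 of 10 experts
```

For this 122B model the same ratio applied per layer takes 256 experts down to 205, which is where the 99B total parameter count comes from.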

License

Same license as the base model. See Qwen3.5-122B-A10B license.
