fraQtl Diagnostic
Fingerprint compression + estimate KV savings.
KV cache compression, inference optimization, model compression
KV-cache compression. Weight quantization. Runtime memory optimization. And the diagnostics to measure whether it actually helps your stack, before you commit.
Run larger models, longer contexts, more concurrency, at the same quality.
| Layer | What | Where |
|---|---|---|
| KV cache | live-VRAM reduction at long context via llama.cpp integration | runtime |
| Weights | higher-fidelity GGUF quants (fraQtl calibration) | artifact |
| Runtime memory | arena allocators, lifetime control, paged layouts | runtime |
| Diagnostic | drop-in measurement: projected savings, readiness, spectral fingerprint | free, open (Apache 2.0) |
Most "compression" projects ship one of the above. fraQtl ships all four, with reproducible receipts at every layer.
Higher-fidelity GGUF quants. Same file size as a standard Q4_K_M; measurably closer to the full-precision teacher across code, math, chat, tool calling, and long-form text.
Measured on Qwen 3.6 35B-A3B (symmetric top-20 KLD vs the Q8 teacher, 400-record held-out slices):
| Lane | KLD vs Q8 ↓ | Top-1 vs Q8 ↑ |
|---|---|---|
| Code + math | 0.0203 | 97.2% |
| General (chat + tools + long-form text) | 0.0485 | 93.2% |
At the same file size as a standard Q4_K_M, ~30% lower KLD across both slices. Reproducibility drift across three independent runs: 0.00000.
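As a rough illustration of the two metrics in the table, here is a minimal sketch of a symmetric top-k KL divergence and a top-1 agreement rate between teacher and student logits. The exact definition behind "symmetric top-20 KLD" is not stated here, so this sketch assumes one plausible reading: take the teacher's top-k tokens at each position, renormalize both distributions over that set, and average the two KL directions. The function names are hypothetical and are not part of any fraQtl package.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def symmetric_topk_kld(teacher_logits, student_logits, k=20):
    """Mean of 0.5 * (KL(T||S) + KL(S||T)), restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the teacher's top-k set.
    """
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    # Indices of the teacher's k most likely tokens at each position.
    idx = np.argsort(-log_t, axis=-1)[..., :k]
    # Gather and renormalize (log-softmax of log-probs renormalizes them).
    t = log_softmax(np.take_along_axis(log_t, idx, axis=-1))
    s = log_softmax(np.take_along_axis(log_s, idx, axis=-1))
    kl_ts = (np.exp(t) * (t - s)).sum(axis=-1)
    kl_st = (np.exp(s) * (s - t)).sum(axis=-1)
    return float((0.5 * (kl_ts + kl_st)).mean())

def top1_agreement(teacher_logits, student_logits):
    """Fraction of positions where both models pick the same argmax token."""
    return float((teacher_logits.argmax(-1) == student_logits.argmax(-1)).mean())

# Synthetic logits standing in for real model outputs (4 positions, 100-token vocab).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 100))
student = teacher + 0.1 * rng.normal(size=(4, 100))
print(symmetric_topk_kld(teacher, student), top1_agreement(teacher, student))
```

In a real evaluation, the logits would come from running the Q8 teacher and the quantized model over the same held-out records; the synthetic arrays above only exercise the arithmetic.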
→ Qwen 3.6 35B-A3B (Q4_K_M): Q4_K_M · 21.4 GB · drop-in for llama.cpp / Ollama / LM Studio / koboldcpp / Jan
llama.cpp runtime

Measured live-VRAM reduction on Mistral-7B-Instruct-v0.3 Q4_K_M vs a true fp16-KV baseline:
| Context | Live VRAM Δ vs fp16 | % of fp16 KV saved | Quality |
|---|---|---|---|
| 8K | −302 MiB | small | PPL parity |
| 64K | −2,610 MiB | ~32% | PPL parity |
| 128K | −9,422 MiB | ~42% | PPL drift ≤ 0.004 |
The benefit grows with context, exactly where fp16 KV becomes the bottleneck. Reproducibility drift across independent runs: ≤ 0.004 PPL.
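To see why fp16 KV becomes the bottleneck at long context, a back-of-the-envelope sketch of the uncompressed baseline can help. The architecture values below (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the publicly released Mistral-7B configuration, not from this page, and `kv_cache_bytes` is a hypothetical helper. It estimates only the fp16 baseline; it says nothing about how fraQtl compresses the cache.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Size of an uncompressed KV cache for a grouped-query-attention model.

    Defaults are assumed Mistral-7B values; bytes_per_elem=2 corresponds to fp16.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

MIB = 1024 ** 2
for ctx in (8 * 1024, 64 * 1024, 128 * 1024):
    print(f"{ctx // 1024}K tokens: {kv_cache_bytes(ctx) / MIB:,.0f} MiB")
# 8K tokens: 1,024 MiB
# 64K tokens: 8,192 MiB
# 128K tokens: 16,384 MiB
```

The cache grows linearly with context length, so under these assumptions each token costs 128 KiB of fp16 KV.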
`pip install fraqtl-diagnostic` gives you three tools in one package:
→ Run in your browser · PyPI · GitHub
llama.cpp kernel path: no patched runtime, no custom flags.

Reproducibility drift: 0.00000 across independent runs.

The thesis: compression should be calibration-aware and workload-aware.
Patent pending.