Organization Card

fraQtl

Inference efficiency for transformer LLMs — end to end

KV-cache compression. Weight quantization. Runtime memory optimization. And the diagnostics to measure whether it actually helps your stack, before you commit.

Run larger models, longer contexts, and more concurrency at the same quality.

The stack

| Layer | What | Where |
| --- | --- | --- |
| KV cache | live-VRAM reduction at long context via `llama.cpp` integration | runtime |
| Weights | higher-fidelity GGUF quants (fraQtl calibration) | artifact |
| Runtime memory | arena allocators, lifetime control, paged layouts | runtime |
| Diagnostic | drop-in measurement: projected savings, readiness, spectral fingerprint | free, open (Apache 2.0) |

Most "compression" projects ship one of the above. fraQtl ships all four, with reproducible receipts at every layer.

What we ship

🧩 Weight quantization — fraQtl calibration

Higher-fidelity GGUF quants. Same file size as a standard Q4_K_M; measurably closer to the higher-precision teacher across code, math, chat, tool calling, and long-form text.

Measured on Qwen 3.6 35B-A3B (symmetric top-20 KLD vs the Q8 teacher, 400-record held-out slices):

| Lane | KLD vs Q8 ↓ | Top-1 vs Q8 ↑ |
| --- | --- | --- |
| Code + math | 0.0203 | 97.2% |
| General (chat + tools + long-form text) | 0.0485 | 93.2% |

At the same file size, that is ~30% lower KLD than a standard Q4_K_M on both slices. Reproducibility drift across three independent runs: 0.00000.
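The card does not spell out how "symmetric top-20 KLD" and "Top-1 vs Q8" are computed, so the following is only a minimal sketch of one plausible reading: restrict both distributions to the teacher's top-k tokens, renormalize, and average the two KL directions. The function names, the renormalization step, and the choice of the teacher's top-k set are assumptions for illustration, not fraQtl's evaluation code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def symmetric_topk_kld(teacher_logits, student_logits, k=20, eps=1e-12):
    """Average of KL(p||q) and KL(q||p) over the teacher's top-k tokens.

    Assumption: both distributions are renormalized on the top-k set.
    """
    p_full = softmax(teacher_logits)
    q_full = softmax(student_logits)
    # Indices of the k most likely tokens under the teacher.
    top = sorted(range(len(p_full)), key=lambda i: p_full[i], reverse=True)[:k]
    # Floor probabilities at eps so the logarithms stay finite.
    p = [max(p_full[i], eps) for i in top]
    q = [max(q_full[i], eps) for i in top]
    p_mass, q_mass = sum(p), sum(q)
    p = [x / p_mass for x in p]
    q = [x / q_mass for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)

def top1_agreement(teacher_rows, student_rows):
    """Fraction of positions where both models pick the same argmax token."""
    def argmax(row):
        return max(range(len(row)), key=lambda i: row[i])
    hits = sum(argmax(t) == argmax(s) for t, s in zip(teacher_rows, student_rows))
    return hits / len(teacher_rows)

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.5, -1.0]
    student = [1.8, 1.1, 0.4, -0.9]
    print(symmetric_topk_kld(teacher, student, k=3))
    print(top1_agreement([teacher], [student]))
```

In practice the per-token values would be averaged over every position in the held-out records; the toy logits above only exercise the functions.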

→ Qwen 3.6 35B-A3B (Q4_K_M) · 21.4 GB · drop-in for llama.cpp / Ollama / LM Studio / koboldcpp / Jan

🛠 KV-cache compression — `llama.cpp` runtime

Measured live-VRAM reduction on Mistral-7B-Instruct-v0.3 Q4_K_M vs a true fp16-KV baseline:

| Context | Live VRAM Δ vs fp16 | % of fp16 KV saved | Quality |
| --- | --- | --- | --- |
| 8K | −302 MiB | small | PPL parity |
| 64K | −2,610 MiB | ~32% | PPL parity |
| 128K | −9,422 MiB | ~42% | PPL drift ≤ 0.004 |

The benefit grows with context — exactly where fp16 KV becomes the bottleneck. Reproducibility drift across independent runs: ≤ 0.004 PPL.
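To see why fp16 KV dominates at long context, the baseline cache size can be estimated from the model shape alone. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per token). The Mistral-7B shape values (32 layers, 8 KV heads, head dimension 128) are taken from the public model config and are an assumption here, as is reading "8K/64K/128K" as powers of two; the helper name is hypothetical.

```python
def fp16_kv_cache_mib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Size of an uncompressed KV cache in MiB.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / (1024 ** 2)

# Assumed shape for Mistral-7B-Instruct-v0.3 (grouped-query attention).
MISTRAL_7B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for n_tokens in (8 * 1024, 64 * 1024, 128 * 1024):
        size = fp16_kv_cache_mib(n_tokens=n_tokens, **MISTRAL_7B)
        print(f"{n_tokens:>7} tokens: {size:,.0f} MiB fp16 KV")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly to 8,192 MiB at 64K tokens; 2,610 MiB of that is roughly 32%, in line with the 64K row. Live VRAM also includes buffers other than the KV cache, so measured deltas need not track this arithmetic exactly at other context lengths.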

🔍 fraQtl Diagnostic — free + open (Apache 2.0)

`pip install fraqtl-diagnostic` installs three tools in one package:

KV Savings Estimate — drop in any HF model id → projected memory freed, GPU-tier impact, max-context extension, relative cost-per-token. Instant, no GPU.
Inference Readiness Scan — config-level: KV memory, YaRN status, backend support, benchmark checklist. Instant, no GPU.
Compression Fingerprint — per-layer spectral analysis (γ, k95, regime tags, Shannon ceiling).
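The card does not define the fingerprint quantities. As a rough illustration only, one common reading of a "k95"-style statistic is the smallest number of singular values that captures 95% of a matrix's spectral energy (sum of squared singular values). The sketch below implements that reading on a precomputed spectrum; the function name and the definition are assumptions and may differ from what `fraqtl-diagnostic` reports.

```python
def k_energy(singular_values, fraction=0.95):
    """Smallest k such that the top-k squared singular values
    hold at least `fraction` of the total spectral energy.

    Assumed definition for illustration; not taken from fraqtl-diagnostic.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return k
    return len(energies)

if __name__ == "__main__":
    # A sharply decaying spectrum concentrates energy in few directions...
    print(k_energy([10.0, 1.0, 1.0, 1.0]))
    # ...while a flat spectrum needs every direction.
    print(k_energy([1.0, 1.0, 1.0, 1.0]))
```

A low value relative to the matrix rank suggests more low-rank headroom in that layer; a value near full rank suggests little.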

→ Run in your browser · PyPI · GitHub

Live demos

🪶 fraqtl-diagnostic — measure your model's compression headroom + projected KV savings in seconds
🔥 fraQtl-demo — fraQtl-compressed Mistral-7B running live with KV-cache compression

Approach

Per-tensor protection policy — same total bit budget, smarter allocation
Calibration tuned to measured optima (not "more is better")
Standard llama.cpp kernel path — no patched runtime, no custom flags
Deterministic builds — reproducibility drift 0.00000 across independent runs

The thesis: compression should be calibration-aware and workload-aware.

Links

🌐 Website: fraqtl.ai
📦 PyPI: pypi.org/project/fraqtl-diagnostic
📄 Paper: arxiv.org/abs/2604.11501
📬 Contact: contact@fraqtl.ai

Patent pending.

spaces 3

fraQtl Diagnostic

Fingerprint compression + estimate KV savings.

fraQtl — Compressed LLM Demo

Generate text and test retrieval with a compressed Mistral‑7B

datasets 0

None public yet

models 6

fraQtl/Qwen3.6-35B-A3B-compressed

fraQtl/Qwen3.6-35B-A3B-GGUF

fraQtl/Mistral-7B-fraQtl

fraQtl/Llama-3.2-3B-fraQtl-kv

fraQtl/TinyLlama-1.1B-fraQtl-kv

fraQtl/Qwen2.5-3B-fraQtl-kv
