AI & ML interests

KV cache compression, inference optimization, model compression


Organization Card

fraQtl

Inference efficiency for transformer LLMs, end to end

KV-cache compression. Weight quantization. Runtime memory optimization. And the diagnostics to measure whether it actually helps your stack, before you commit.

Run larger models, longer contexts, and more concurrent requests at the same quality.


The stack

Layer | What | Where
KV cache | live-VRAM reduction at long context via llama.cpp integration | runtime
Weights | higher-fidelity GGUF quants (fraQtl calibration) | artifact
Runtime memory | arena allocators, lifetime control, paged layouts | runtime
Diagnostic | drop-in measurement: projected savings, readiness, spectral fingerprint | free, open (Apache 2.0)

Most "compression" projects ship one of the above. fraQtl ships all four, with reproducible receipts at every layer.


What we ship

🧩 Weight quantization: fraQtl calibration

Higher-fidelity GGUF quants. Same file size as a standard Q4_K_M; measurably closer to the full-precision teacher across code, math, chat, tool calling, and long-form text.

Measured on Qwen 3.6 35B-A3B (symmetric top-20 KLD vs the Q8 teacher, 400-record held-out slices):

Lane | KLD vs Q8 ↓ | Top-1 vs Q8 ↑
Code + math | 0.0203 | 97.2%
General (chat + tools + long-form text) | 0.0485 | 93.2%

At the same file size as a standard Q4_K_M, KLD is ~30% lower on both slices. Reproducibility drift across three independent runs: 0.00000.
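For readers who want to run a comparable check on their own quants, here is a minimal sketch of a symmetric top-k KL divergence and a top-1 agreement rate. It rests on assumptions, since the card does not spell out the definition: "top-20" is taken to mean the 20 most likely tokens under the teacher, both distributions are renormalized over that set, and "symmetric" is taken to mean the average of the two KL directions. The function names are illustrative and are not part of any fraQtl package.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_topk_kld(teacher_logits, student_logits, k=20, eps=1e-12):
    """Average of KL(p||q) and KL(q||p) over the teacher's top-k tokens.

    Assumption: the support is the teacher's top-k set and both
    distributions are renormalized over it. Other definitions exist.
    """
    p_full = softmax(teacher_logits)
    q_full = softmax(student_logits)
    top = sorted(range(len(p_full)), key=lambda i: p_full[i], reverse=True)[:k]

    # Clamp to eps so the logarithms stay finite, then renormalize.
    p = [max(p_full[i], eps) for i in top]
    q = [max(q_full[i], eps) for i in top]
    p_mass, q_mass = sum(p), sum(q)
    p = [x / p_mass for x in p]
    q = [x / q_mass for x in q]

    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def top1_agreement(teacher_rows, student_rows):
    """Fraction of positions where both models pick the same argmax token."""
    def argmax(row):
        return max(range(len(row)), key=lambda i: row[i])

    hits = sum(argmax(t) == argmax(s) for t, s in zip(teacher_rows, student_rows))
    return hits / len(teacher_rows)
```

Averaging `symmetric_topk_kld` over every token position in a held-out slice gives one number per lane, comparable in spirit (not necessarily in exact value) to the KLD column above.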

→ Qwen 3.6 35B-A3B (Q4_K_M) · 21.4 GB · drop-in for llama.cpp / Ollama / LM Studio / koboldcpp / Jan

🛠 KV-cache compression: llama.cpp runtime

Measured live-VRAM reduction on Mistral-7B-Instruct-v0.3 Q4_K_M vs a true fp16-KV baseline:

Context | Live VRAM Δ vs fp16 | % of fp16 KV saved | Quality
8K | −302 MiB | small | PPL parity
64K | −2,610 MiB | ~32% | PPL parity
128K | −9,422 MiB | ~42% | PPL drift ≤ 0.004

The benefit grows with context, exactly where fp16 KV becomes the bottleneck. Reproducibility drift across independent runs: ≤ 0.004 PPL.
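To see why fp16 KV dominates at long context, the baseline cache size can be estimated from the model shape alone. The sketch below assumes the publicly listed Mistral-7B-Instruct-v0.3 configuration (32 layers, 8 KV heads, head dimension 128), takes "8K" to mean 8 × 1024 tokens, and counts only the K and V tensors; the helper names are illustrative.

```python
MIB = 1024 ** 2

# Assumed model shape, taken from the public Mistral-7B-Instruct-v0.3 config.
MISTRAL_7B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_element=2):
    """Size of an uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


def fp16_kv_mib(context_tokens):
    """fp16 KV cache size in MiB for the assumed Mistral-7B shape."""
    return kv_cache_bytes(context_tokens=context_tokens, **MISTRAL_7B) / MIB


for ctx in (8 * 1024, 64 * 1024, 128 * 1024):
    print(f"{ctx // 1024}K: {fp16_kv_mib(ctx):,.0f} MiB")
# 8K: 1,024 MiB
# 64K: 8,192 MiB
# 128K: 16,384 MiB
```

The cache grows linearly with context at 128 KiB per token under these assumptions. The table reports live VRAM deltas, which may cover more than the KV tensors alone, so these baselines are a reference point and not a reconstruction of the measurements.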

🔍 fraQtl Diagnostic: free + open (Apache 2.0)

pip install fraqtl-diagnostic installs three tools in one package:

  • KV Savings Estimate: drop in any HF model id → projected memory freed, GPU-tier impact, max-context extension, relative cost-per-token. Instant, no GPU.
  • Inference Readiness Scan: config-level checks covering KV memory, YaRN status, backend support, benchmark checklist. Instant, no GPU.
  • Compression Fingerprint: per-layer spectral analysis (γ, k95, regime tags, Shannon ceiling).

→ Run in your browser · PyPI · GitHub


Live demos

  • 🪶 fraqtl-diagnostic: measure your model's compression headroom + projected KV savings in seconds
  • 🔥 fraQtl-demo: fraQtl-compressed Mistral-7B running live with KV-cache compression

Approach

  • Per-tensor protection policy: same total bit budget, smarter allocation
  • Calibration tuned to measured optima (not "more is better")
  • Standard llama.cpp kernel path: no patched runtime, no custom flags
  • Deterministic builds: reproducibility drift 0.00000 across independent runs

The thesis: compression should be calibration-aware and workload-aware.


Links

Patent pending.
