Alpha Wave Systems
Open Source · Machine Learning · Inference

TQAI: 80% smaller KV cache for local LLM inference

5 min read · Alpha Wave Systems

Running capable language models locally is mostly a memory problem. The KV cache (the stored attention keys and values) grows linearly with context length and quickly dominates VRAM. TQAI attacks exactly that.
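
To put rough numbers on that, here is a back-of-the-envelope sketch of uncompressed KV cache size. The function name and the model shape (32 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions for an 8B-class model, not values taken from TQAI.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Assumed 8B-class shape in fp16 (2 bytes per element); adjust for your model.
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions each token costs 128 KiB, so a 32,768-token context needs about 4 GiB for the cache alone, before weights and activations.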

What TurboQuant does

TQAI applies TurboQuant compression to the KV cache, cutting its footprint by roughly 80% with near-zero quality loss on 8B-parameter models and larger. That means longer context windows on the same hardware, or the same context on smaller machines.
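
As a quick illustration of what an 80% reduction buys, the sketch below computes how many tokens fit in a fixed cache budget before and after. The 8 GiB budget, the per-token size, and the helper name are assumptions chosen for the example; only the 80% figure comes from the text above.

```python
def max_context_tokens(budget_bytes, bytes_per_token, kept_percent=100):
    """Tokens that fit in `budget_bytes` when the cache shrinks to `kept_percent` of its size."""
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    return budget_bytes * 100 // (bytes_per_token * kept_percent)


# Assumed fp16 shape: keys + values, 32 layers, 8 KV heads, head_dim 128, 2 bytes each.
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2
BUDGET = 8 * 2**30  # assumed 8 GiB of memory left over for the cache

baseline = max_context_tokens(BUDGET, BYTES_PER_TOKEN)
compressed = max_context_tokens(BUDGET, BYTES_PER_TOKEN, kept_percent=20)  # 80% smaller
print(baseline, compressed, compressed / baseline)
```

A cache that is 80% smaller keeps one fifth of the bytes, so the same budget holds five times the context. If the baseline is fp16, that works out to an average of about 3.2 bits per cached element.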

  1. Quantize keys and values as they are written to the cache.
  2. Keep a small, high-precision residual for the most recent tokens.
  3. Reconstruct on read with negligible overhead.
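
The three steps above can be sketched in a few lines of plain Python. This is not the TurboQuant algorithm or the TQAI API: it swaps in a simple uniform min-max quantizer, and every name here (`quantize`, `dequantize`, `QuantizedCache`, `recent_window`) is hypothetical, chosen only to show the write, residual, and read flow.

```python
from collections import deque


def quantize(vec, bits=4):
    """Uniform min-max quantization of one vector to integer codes (stand-in quantizer)."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


class QuantizedCache:
    """Toy cache for one stream of vectors (keys or values)."""

    def __init__(self, recent_window=2, bits=4):
        self.bits = bits
        self.quantized = []  # (codes, lo, scale) per token; real code would bit-pack
        self.recent = deque(maxlen=recent_window)  # full-precision newest tokens

    def write(self, vec):
        self.quantized.append(quantize(vec, self.bits))  # step 1: quantize on write
        self.recent.append(list(vec))  # step 2: the deque drops the oldest copy itself

    def read(self):
        n_old = len(self.quantized) - len(self.recent)
        old = [dequantize(*q) for q in self.quantized[:n_old]]  # step 3: reconstruct
        return old + list(self.recent)


cache = QuantizedCache(recent_window=2)
vectors = [[0.0, 0.5, 1.0, -1.0], [0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]]
for v in vectors:
    cache.write(v)
restored = cache.read()  # first vector is approximate, last two are exact
```

With this stand-in quantizer, each reconstructed value is off by at most half a quantization step (`scale / 2`), while the newest `recent_window` tokens come back exactly. A production implementation would also pack the codes into actual 4-bit storage rather than Python integers.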

Runtimes

TQAI ships for both PyTorch and MLX, so it runs on NVIDIA GPUs and on Apple Silicon. The MLX path is what makes it practical to run long-context models on a MacBook.

TQAI is based on research targeting ICLR 2026 and is available on GitHub.
