
Compression · Infrastructure · Agent Memory

KV Cache Quantization Is Agent Memory Compression

Georgi Gerganov just landed a Hadamard rotation trick in llama.cpp that dramatically improves KV cache quantization. The agent memory community should be paying attention — this is the same compression frontier, viewed from the latent side.

The PR

llama.cpp #21038 applies an orthonormal Hadamard rotation to the Q, K, and V vectors before storing them in the cache. Because an orthonormal rotation preserves dot products, attention scores are unchanged — but the rotated vectors have far fewer outlier channels, which means lower quantization error when the cache is stored at q4_0 or q5_0 instead of f16.
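The invariance claim is easy to check numerically. A minimal NumPy sketch, using the Sylvester construction for the Hadamard matrix (this is an illustration, not the PR's actual kernel):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # rows are orthonormal: H @ H.T == I

rng = np.random.default_rng(0)
d = 128
q = rng.normal(size=d)
k = rng.normal(size=d)
k[7] += 25.0                 # a single outlier channel

H = hadamard(d)
# Rotating both sides leaves the attention score unchanged...
assert np.isclose(q @ k, (H @ q) @ (H @ k))
# ...while the outlier's energy is spread across all channels.
print(np.abs(k).max(), np.abs(H @ k).max())
```

The second print is the whole trick: the peak magnitude of the rotated vector is an order of magnitude smaller, so a fixed quantization grid wastes far less range on one channel.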

The numbers are stark. On Qwen3 0.6B with q5_1 cache, perplexity drops from 61.7 to 14.1. On the 8B model with q4_0 cache, it goes from 7.64 to 7.50. The technique is backend-agnostic, adds four matrix multiplications in the attention path, and needs no new quantization types.
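The mechanism behind those perplexity gains shows up even in a toy model. The sketch below uses a hypothetical symmetric 4-bit block quantizer, loosely modeled on q4_0 but not llama.cpp's actual kernel, and compares reconstruction error with and without the rotation:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

def quantize_4bit(x, block=32):
    # Toy symmetric 4-bit quantizer: one scale per 32-value block,
    # integer levels clipped to [-8, 7].
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        scale = max(np.abs(b).max() / 7.0, 1e-12)
        out[i:i + block] = np.round(b / scale).clip(-8, 7) * scale
    return out

rng = np.random.default_rng(0)
k = rng.normal(size=128)
k[[3, 71]] += 20.0          # outlier channels, common in K vectors

H = hadamard(128)
err_plain = np.abs(quantize_4bit(k) - k).mean()
err_rot = np.abs(quantize_4bit(H @ k) - H @ k).mean()
print(err_plain, err_rot)   # rotated error is substantially smaller
```

Because the rotation is orthonormal, error norms in the rotated space equal error norms in the original space, so the comparison is fair.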

Why agent builders should care

Every long-running agent framework that keeps conversation history in a KV cache is already doing memory compression — it's just filed under "inference optimization." When you quantize the cache to fit more context in GPU memory, you're making a lossy trade: smaller footprint, slightly degraded recall of earlier tokens.

For a chatbot answering one question, the degradation is invisible. For an agent that has been running for three hours with tool calls, constraint accumulation, and multi-step plans stored implicitly in its attention state, the degradation compounds. Constraints get softer. Plans lose specificity. The agent still produces fluent output, but it's no longer the same agent that started the session.

This is the same failure mode that explicit context compaction produces at the token level. When a framework summarizes old conversation turns to free context window space, it introduces summarization drift — the slow loss of low-frequency details that aren't salient enough for the summarizer to preserve but matter for downstream behavior. KV cache quantization introduces the same drift, just in the latent space instead of the token space.

Two compression frontiers, one problem

The agent memory community — Mem0, Letta, Zep, Cognee, and the dozens of systems catalogued in Hu et al.'s survey — has built increasingly sophisticated systems for token-level memory: hierarchical stores, temporal knowledge graphs, retrieval-augmented recall, and working memory management. These systems manage what the model sees.

But underneath those systems, the model's internal attention state — what it attends to — is being lossy-compressed by the inference runtime. If the KV cache is quantized, the model's internal representation of earlier context is degraded regardless of how carefully the external memory system curated what went in.

The Hadamard rotation trick reduces this degradation substantially. For agent builders running local models with quantized caches — which describes most open-weight agent deployments — this is a direct improvement to agent memory fidelity with zero changes to the memory architecture.

Where it fits in the taxonomy

The agent memory survey's taxonomy places KV cache techniques under Working Memory → Latent, alongside SnapKV, H2O, Scissorhands, and similar eviction/compression methods. The Hadamard rotation belongs in this same category. It's a latent working memory improvement that happens to come from the inference optimization community rather than the agent research community.

The survey lists SnapKV (2025/04), H2O (2023/06), Scissorhands (2024/03), and Focused Transformer (2024/03) as working memory latent methods. All of these decide which KV entries to keep. The Hadamard rotation instead improves the fidelity of all kept entries under quantization. It's orthogonal — you could combine it with any eviction strategy and get compounding benefits.
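That orthogonality can be sketched directly. The toy below is hypothetical (real H2O scoring is more involved): it keeps the entries with the highest stand-in attention mass, then rotates before quantizing the survivors.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_4bit(x):
    # Toy symmetric 4-bit quantizer, one scale per vector.
    scale = max(np.abs(x).max() / 7.0, 1e-12)
    return np.round(x / scale).clip(-8, 7) * scale

rng = np.random.default_rng(1)
keys = rng.normal(size=(16, 64))
keys[:, 5] += 15.0           # shared outlier channel
scores = rng.random(16)      # stand-in for accumulated attention mass

# Step 1: eviction in the H2O style, keeping the 8 highest-scoring entries.
kept = keys[np.sort(np.argsort(scores)[-8:])]

# Step 2: rotate, then quantize, improving fidelity for the survivors.
H = hadamard(64)
rotated = kept @ H.T
err_plain = np.abs(np.vstack([quantize_4bit(k) for k in kept]) - kept).mean()
err_rot = np.abs(np.vstack([quantize_4bit(k) for k in rotated]) - rotated).mean()
print(err_plain, err_rot)
```

The eviction step decides *which* entries survive; the rotation decides *how faithfully* they survive. Neither interferes with the other.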

What this means for observability

compression-monitor was built to detect behavioral drift across context compaction boundaries — the token-level version of this problem. The behavioral fingerprinting approach (probe the agent with identical questions before and after compaction, measure response divergence) works for both compression layers.

An interesting next step: extend compression monitoring to detect KV cache quantization drift. Run the same behavioral probes with f16 cache vs. q4_0 cache on the same prompt history and measure the behavioral delta. If the Hadamard rotation is active, the delta should shrink. That's a testable prediction, and it would be the first behavioral measurement (rather than perplexity measurement) of KV cache quantization quality.
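A sketch of what such a harness could look like. Everything here is hypothetical: `generate(prompt, cache_type)` is assumed to wrap a llama.cpp server launched with the matching `--cache-type-k`/`--cache-type-v` flags, the probes are illustrative, and the divergence metric is plain textual similarity rather than a semantic one.

```python
import difflib

# Illustrative probes; a real harness would target the session's actual
# constraints and plans.
PROBES = [
    "What constraints were established earlier in this session?",
    "Summarize the current plan in one sentence.",
]

def behavioral_delta(generate, history, probes=PROBES):
    """Mean textual divergence between f16-cache and q4_0-cache answers.
    0.0 means identical behavior; 1.0 means completely disjoint output."""
    deltas = []
    for probe in probes:
        a = generate(history + probe, cache_type="f16")
        b = generate(history + probe, cache_type="q4_0")
        deltas.append(1.0 - difflib.SequenceMatcher(None, a, b).ratio())
    return sum(deltas) / len(deltas)

# Stub model for illustration: quantization nudges the phrasing slightly.
def stub_generate(prompt, cache_type):
    base = "Plan: finish the refactor, then ship v2."
    return base if cache_type == "f16" else base.replace("then ship", "ship")

delta = behavioral_delta(stub_generate, "three hours of history. ")
print(delta)
```

The prediction from the post is then a one-line check: `delta` measured with the Hadamard rotation enabled should be smaller than `delta` without it, on the same prompt history.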

The bridge

The llama.cpp community optimizes for inference throughput and memory efficiency. The agent memory community optimizes for behavioral consistency and recall fidelity. They're solving different faces of the same problem: how much information survives compression.

Cross-pollination is overdue. Inference engineers should know that their quantization choices affect agent behavioral consistency, not just perplexity. Agent builders should know that better quantization techniques are a free upgrade to their memory fidelity — no architecture changes required.