
Research Note · Memory Measurement

Retrieval accuracy is not behavioral consistency

A new paper achieves 82% accuracy on LoCoMo using only 5% of the full context window. That is a real advance. But the benchmark it targets — can the agent answer questions about past conversations? — is a different question from the one that matters in production: does compression change what the agent does next?

What Memori does well

Memori (Memori Labs, March 2026) treats agent memory as a data structuring problem. Instead of injecting raw conversation history into the prompt, it converts dialogue into compact semantic triples and conversation summaries. On the LoCoMo benchmark, it beats Zep (78.94%), LangMem (78.05%), and Mem0 (62.47%) while using a fraction of the tokens.
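Memori's extraction pipeline is not reproduced here; as a rough illustration of the general idea, a dialogue turn can be reduced to (subject, predicate, object) triples and queried later without replaying the raw history. Everything below — the `TripleStore` class, its methods, and the hand-written facts — is a hypothetical sketch of structured memory, not Memori's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

class TripleStore:
    """Toy structured memory: persist facts as triples, retrieve by keyword."""
    def __init__(self):
        self.triples = []

    def add(self, subject, predicate, obj):
        self.triples.append(Triple(subject, predicate, obj))

    def query(self, keyword):
        """Return every triple whose fields mention the keyword."""
        kw = keyword.lower()
        return [t for t in self.triples
                if kw in f"{t.subject} {t.predicate} {t.obj}".lower()]

# Instead of re-injecting the whole conversation into the prompt,
# only the extracted facts persist across turns.
store = TripleStore()
store.add("user", "prefers", "dark roast coffee")
store.add("user", "works_at", "Acme Robotics")

hits = store.query("coffee")  # finds the preference without the raw dialogue
```

The token savings come from exactly this reduction: a multi-turn exchange collapses to a handful of short facts, which is why a system like this can answer recall questions from a small slice of the context window.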

This is genuine progress. Structured memory representations retrieve more accurately and cost less than raw context injection. The engineering is solid.

What LoCoMo does not test

LoCoMo asks: given a long conversation history, can you correctly answer questions about what was said? That tests retrieval accuracy — whether the memory system can find the right fact at the right time.

It does not test behavioral consistency — whether the agent acts the same way after its context has been compressed, summarized, or rotated. These are different properties, and in production, the second one is often the one that breaks things.

The behavioral gap in practice

When an agent's context gets compressed, three things can change silently:

Vocabulary drift: Specialized terms the agent used before compression stop appearing after. Domain-specific jargon, variable names, or protocol terms get replaced by generic equivalents.
Topic distribution shift: The agent starts emphasizing different subjects than it did before compression, even when asked the same questions.
Ghost knowledge: Facts the agent knew before compression are never retrieved after it, not because the retrieval failed, but because the agent stopped generating the queries that would have triggered retrieval.
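The first of these failure modes is directly measurable. As a minimal sketch — the token filter, minimum length, and example sentences are illustrative assumptions, not part of any benchmark — vocabulary drift can be estimated as the Jaccard overlap between the content vocabularies an agent emits before and after a compression event:

```python
import re

def vocab(texts, min_len=4):
    """Content vocabulary: lowercase word tokens of at least min_len characters."""
    words = set()
    for t in texts:
        words.update(w for w in re.findall(r"[a-z_]+", t.lower())
                     if len(w) >= min_len)
    return words

def vocab_overlap(before, after):
    """Jaccard similarity of vocabularies used before/after compression (1.0 = identical)."""
    v1, v2 = vocab(before), vocab(after)
    if not v1 and not v2:
        return 1.0
    return len(v1 & v2) / len(v1 | v2)

# Hypothetical outputs: same task, but compression replaced the
# specialized terms with generic equivalents.
pre  = ["rotate the TLS certificate via the cert_manager sidecar"]
post = ["update the security file using the helper service"]
drift_score = vocab_overlap(pre, post)  # low overlap signals vocabulary drift
```

A score near 1.0 means the agent still speaks in the same terms; a score near 0.0 means the jargon, identifiers, or protocol names have been paraphrased away, even if the underlying facts remain retrievable.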

None of these show up in LoCoMo or similar benchmarks, because those benchmarks ask explicit questions that force retrieval. In production, nobody is asking the right questions — the agent has to generate its own retrieval intent, and compression changes which intents it generates.

Measuring the difference

I have been working on a different kind of measurement: Compression Continuity Scoring (CCS), which tracks behavioral fingerprints across compaction boundaries instead of testing factual recall. CCS compares vocabulary distributions, topic profiles, and response patterns before and after compression events.
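The CCS implementation itself is not reproduced here. As one hedged illustration of the general approach, a vocabulary-distribution fingerprint can be compared across a compaction boundary using Jensen-Shannon divergence; the unigram model and the continuity formula below are a simplification for exposition, not the CCS definition.

```python
import math
from collections import Counter

def unigram_dist(texts):
    """Word-frequency distribution over a list of agent outputs."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (base 2, in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def continuity_score(before, after):
    """1.0 = identical behavioral fingerprint, 0.0 = fully disjoint vocabulary."""
    return 1.0 - js_divergence(unigram_dist(before), unigram_dist(after))

# Identical outputs before and after compression give a perfect score.
same = continuity_score(["deploy the staging cluster"],
                        ["deploy the staging cluster"])
```

The key design point, whatever the exact metric: the comparison is between what the agent *does* on either side of the boundary, with no questions asked and no retrieval forced — which is exactly the condition LoCoMo-style benchmarks never exercise.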

In testing across several agent frameworks, CCS scores typically drop 15–40% after a single compression pass — even when retrieval accuracy stays high. The memory is there, but the agent's behavior has already shifted.

What this means for memory system design

Memori and similar structured-memory systems solve a real problem: efficient, accurate retrieval from long histories. But the field needs a complementary measurement: does the agent still act the same after you compress its context?

Until production memory benchmarks include behavioral consistency alongside retrieval accuracy, we will keep building systems that remember everything and subtly change anyway.

References