A survey paper dropped this month, "Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers," and it provides the cleanest taxonomy I have seen for how agent memory actually works across the literature from 2022 through early 2026. It identifies five mechanism families. That alone is useful. It also makes one instrumentation gap easier to state precisely.
I want to be specific about where compression-monitor sits in that taxonomy, and why the other four families need different tooling than the one I am building.
Context-resident compression: the instrumentation gap
Context-resident compression is the family that includes summarization, truncation, and selective eviction when an agent’s conversation history gets too long for the context window. LangChain middleware, LangGraph checkpoint compaction, and the Anthropic SDK’s context management all live here.
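To make the mechanism concrete, here is a minimal sketch of context-resident compression: evict the oldest messages once the history exceeds a token budget and replace them with a summary stub. Every name here is illustrative, not any framework's actual API, and the token estimate is a crude stand-in for a real tokenizer.

```python
def estimate_tokens(message: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(message) // 4)

def compress_history(messages: list[str], budget: int) -> list[str]:
    """Evict oldest messages until the history fits the token budget,
    leaving a single summary placeholder where they used to be."""
    total = sum(estimate_tokens(m) for m in messages)
    evicted = []
    # Evict from the front (oldest first) until we fit the budget.
    while len(messages) > 1 and total > budget:
        dropped = messages.pop(0)
        evicted.append(dropped)
        total -= estimate_tokens(dropped)
    if evicted:
        # A production system would summarize with an LLM call; we stub it.
        stub = f"[summary of {len(evicted)} earlier messages]"
        messages = [stub] + messages
    return messages

history = ["long system setup " * 20, "user asks about X " * 20, "short reply"]
compact = compress_history(history, budget=100)
```

The important property for what follows is the boundary itself: everything to the left of that stub is gone from the model's view, whether or not the summary preserved what mattered.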
Almost every production agent with long horizons depends on this family. It is also the only one of the five that commonly ships without a behavioral drift detector.
compression-monitor sits exactly at that boundary. It captures behavioral snapshots before and after compression events and measures ghost lexicon decay, tool-call sequence divergence, and semantic embedding distance.
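Two of those signals can be sketched in a few lines, in the spirit of compression-monitor but not its actual implementation; the function names, term sets, and normalization are all illustrative.

```python
def ghost_lexicon_decay(before_terms: set[str], after_text: str) -> float:
    """Fraction of distinctive pre-compression terms that no longer
    appear anywhere in the post-compression context."""
    if not before_terms:
        return 0.0
    lost = {t for t in before_terms if t not in after_text}
    return len(lost) / len(before_terms)

def tool_sequence_divergence(before: list[str], after: list[str]) -> float:
    """Normalized Levenshtein distance between the tool-call sequences
    observed before and after a compression event."""
    m, n = len(before), len(after)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if before[i - 1] == after[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, n, 1)

decay = ghost_lexicon_decay({"invoice_id", "retry_policy"},
                            "context mentions retry_policy only")
div = tool_sequence_divergence(["search", "fetch", "summarize"],
                               ["search", "summarize"])
```

Semantic embedding distance works the same way structurally: snapshot an embedding of the context on each side of the event and compare, with the embedding model as a pluggable dependency.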
Why the other four families need different tooling
Retrieval-backed external memory. Primary failure surface: recall quality, freshness, contradiction, and stale retrieval.
Agent-editable persistent memory. Primary failure surface: hallucinated memory edits, self-justifying drift, and goal distortion.
Policy-managed working memory. Primary failure surface: whether the eviction policy itself is correct and interpretable.
Learned memory controllers. Primary failure surface: training-time reward alignment and controller generalization.
This is why compression-monitor deliberately scopes to context-resident compression rather than trying to cover every family in the taxonomy. The other families already have clearer instrumentation traditions. Compression boundary drift still does not.
The survey’s operational gap
The evaluation section of the survey documents the move from static recall benchmarks toward multi-session agentic tests. What it does not really address is what to instrument in a running production agent to detect compression-induced behavioral drift in real time. That is understandable: surveys map research, not operational tooling.
But the gap matters. If the compression boundary is where behavior quietly changes, then an operational memory stack needs something next to the framework that can notice that change while the agent is still live.
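One way to wire such a detector next to the framework is to wrap the compression call itself: snapshot behavior-relevant state on both sides and flag the event when a drift score crosses a threshold. Every name here (the snapshot shape, the score function, the threshold) is a sketch, not compression-monitor's real interface.

```python
from typing import Callable

def monitored_compress(
    compress: Callable[[list[str]], list[str]],
    history: list[str],
    drift_score: Callable[[list[str], list[str]], float],
    threshold: float = 0.3,
) -> tuple[list[str], bool]:
    """Run the framework's compression step and report whether the
    before/after drift score exceeded the alert threshold."""
    before = list(history)     # snapshot pre-compression state
    after = compress(history)  # the framework does its usual work
    score = drift_score(before, after)
    return after, score > threshold

# Toy wiring: "compression" drops the oldest half of the history, and
# drift is measured as the fraction of lines that disappeared.
halve = lambda h: h[len(h) // 2:]
lost_fraction = lambda b, a: 1 - len(a) / len(b)
compacted, alerted = monitored_compress(halve, ["a", "b", "c", "d"],
                                        lost_fraction)
```

The wrapper adds no latency to the agent loop beyond the score computation, which is the property a live detector needs.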
What this means in practice
compression-monitor is a runtime detector for one specific family in the new memory taxonomy. It is not a universal memory benchmark. It is the instrumentation layer for the memory mechanism that nearly every long-running agent already uses and that still tends to fail silently.
If you are thinking about where behavioral monitoring belongs in your memory stack, this is the right question: which memory mechanism is creating the failure surface, and what measurement belongs exactly there?