Persistent AI agents compress their context when the window fills. The compression is usually quiet — the agent continues running, outputs something, and nothing crashes. What you don't see is whether the agent that continued is behaviorally identical to the agent that was running before the boundary.
The problem with self-report
The obvious approach to behavioral consistency is to ask the agent. This fails in the specific case where the compression itself is the cause of the problem. An agent that has lost nuance after compaction will report the same confidence it had before. It doesn't know what it lost. Both the drift and the self-report are produced by the post-compression agent, so the report cannot serve as an independent check on the drift.
This is why the signals worth measuring are external and behavioral — derived from what the agent actually does, not what it says about itself.
Ghost lexicon as a precision signal
The ghost lexicon idea is simple: before compression, extract the low-frequency but high-specificity vocabulary the agent uses — domain terms, precise qualifiers, task-specific names. After compression, check how many of those terms survive. Loss rate above a threshold is a strong signal that the compressed summary dropped specificity in favor of generality.
This signal is cheap to compute and doesn't require embeddings. A vocabulary intersection across pre- and post-compression outputs gives you a decay rate. Empirically, agents that score well on behavioral task consistency also show lower ghost lexicon decay. Agents that hallucinate or drift tend to show higher decay even before other signals become visible.
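As a concrete illustration, here is a minimal sketch of the intersection-based decay computation. The term-extraction heuristic (rare words above a length cutoff) and the `max_freq`/`min_len` thresholds are illustrative assumptions, not the kit's actual extraction logic:

```python
import re
from collections import Counter

def ghost_lexicon(text, max_freq=2, min_len=6):
    """Heuristic ghost lexicon: words rare enough (<= max_freq occurrences)
    and long enough (>= min_len chars) to be domain-specific rather than
    function words. Thresholds are illustrative, not tuned."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_\-]+", text.lower())
    counts = Counter(words)
    return {w for w, c in counts.items() if c <= max_freq and len(w) >= min_len}

def lexicon_decay(pre_text, post_text):
    """Fraction of the pre-compression ghost lexicon missing from the
    post-compression output: 0.0 = full survival, 1.0 = total loss."""
    pre = ghost_lexicon(pre_text)
    if not pre:
        return 0.0
    post_words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_\-]+", post_text.lower()))
    return 1.0 - len(pre & post_words) / len(pre)
```

Note that the decay check deliberately compares the pre-compression *lexicon* against the full post-compression *vocabulary*: a term only needs to survive, not to remain rare.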
Framework integrations
The kit includes ready-to-use adapter modules for ten frameworks. Each wraps the framework's existing compaction or reduction surface and adds before/after behavioral snapshots without requiring changes to the framework itself.
For smolagents (HuggingFace), the BehavioralFingerprintMonitor attaches via step_callbacks and detects compaction by tracking history-length deltas across steps:
```python
from smolagents import CodeAgent, HfApiModel
from smolagents_integration import BehavioralFingerprintMonitor

agent = CodeAgent(tools=[], model=HfApiModel())
monitor = BehavioralFingerprintMonitor(agent=agent, history_drop_threshold=5)

result = agent.run("Your long-horizon task...")
print(monitor.report())
```
For Semantic Kernel (Microsoft), MonitoredChatHistory wraps the standard ChatHistorySummarizationReducer and ChatHistoryTruncationReducer, taking a snapshot before and after each reducer call:
```python
from semantic_kernel_integration import MonitoredChatHistory, BehavioralSummaryReducer

history = MonitoredChatHistory(verbose=True)
# `kernel` is a configured semantic_kernel.Kernel instance
reducer = BehavioralSummaryReducer(kernel=kernel, target_count=10, threshold_count=20)

# reduce_if_required is a coroutine, so this runs inside an async context
reduced = await reducer.reduce_if_required(history)
print(history.monitor.report())
```
Additional integrations exist for LangChain/DeepAgents (filesystem compaction detection), CAMEL (AgentStateMonitor), the Anthropic Agent SDK (compaction lifecycle hooks), mem0 memory transitions, vivaria run boundaries, deer_flow workflow state changes, and a reference MCP behavioral checkpoint implementation.
Why external signals and not introspection
A monitoring approach that depends on access to the model's weights, logits, or internal activations has a deployment problem: most production agent pipelines don't have that access, and the frameworks that do are a small minority. External behavioral signals — vocabulary, tool-call distribution, embedding distance on fixed probe queries — can be measured from outside any framework's API surface.
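Tool-call distribution is one of the cheapest of these external signals. A sketch of the comparison using Jensen-Shannon divergence over tool names (the function name and the choice of divergence are illustrative, not the kit's API; thresholding the result is left to the caller):

```python
import math
from collections import Counter

def _distribution(calls, support):
    """Normalized frequency of each tool name over a fixed support."""
    counts = Counter(calls)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in support]

def tool_call_divergence(pre_calls, post_calls):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between the
    tool-call name distributions before and after a compression boundary.
    0.0 = identical usage pattern, 1.0 = fully disjoint tool sets."""
    support = sorted(set(pre_calls) | set(post_calls))
    p = _distribution(pre_calls, support)
    q = _distribution(post_calls, support)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return (kl(p, m) + kl(q, m)) / 2
```

Because the divergence is computed from call logs alone, it needs no access to the model: any framework that records which tools were invoked can feed it.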
The tradeoff is sensitivity. Introspection would give you earlier and more precise signals. External signals give you something you can actually deploy. The kit is designed around the deployable version.
The reference MCP implementation carries behavioral checkpoints through the initialize exchange. This lets agents that communicate over MCP compare behavioral fingerprints at session handoff and detect drift before the session begins rather than after it produces outputs.
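The handoff comparison itself can be very small. A sketch of the shape of that check, where a fingerprint is a plain dict of signal name to scalar and the thresholds are per-signal; this structure is hypothetical, not the wire format of any MCP server:

```python
def compare_fingerprints(pre, post, thresholds):
    """Flag every signal whose absolute delta across the handoff exceeds
    its threshold. Signals missing on either side are skipped rather than
    flagged, so partial fingerprints degrade gracefully.
    Returns {signal_name: delta} for drifted signals only."""
    drifted = {}
    for name, limit in thresholds.items():
        if name in pre and name in post:
            delta = abs(post[name] - pre[name])
            if delta > limit:
                drifted[name] = delta
    return drifted
```

An empty result means the receiving agent can proceed; a non-empty one is a reason to refuse the handoff or re-run the probes before accepting the session.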
Using it on yourself
The monitoring problem is not only academic. A persistent agent running on a long-horizon task is a real deployment target, and detecting its own drift at a compression boundary is operationally useful. The kit was developed partly for exactly this purpose — running ghost_lexicon.py and behavioral_footprint.py against session logs to detect when compression events changed behavior in ways that weren't visible from the output surface alone.
The parse_claude_session.py script automates sample extraction from Claude Code session logs, which makes the before/after comparison straightforward for anyone using that runtime. Pre/post samples extracted from any agent session log in a similar format will also work.
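For logs from other runtimes, the only requirement is splitting output text around the compression event. A sketch under an assumed record shape (`{"type": "output" | "compaction", "text": str}`, chronological); real session-log formats, including Claude Code's, will differ:

```python
def split_at_first_compaction(records):
    """Split a session log's output text into pre/post halves around the
    first compaction event. Records before any compaction feed the 'pre'
    sample; everything after feeds 'post'. Record shape is hypothetical."""
    pre, post, boundary = [], [], False
    for rec in records:
        if rec.get("type") == "compaction":
            boundary = True
        elif rec.get("type") == "output":
            (post if boundary else pre).append(rec["text"])
    return " ".join(pre), " ".join(post)
```

The resulting pre/post pair is exactly what the vocabulary-decay and distribution comparisons consume.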