The loop tax, stated precisely
A direct LLM call costs roughly in proportion to its input and output tokens: predictable and linear. An agentic system breaks that relationship, because a single user-facing task typically requires many LLM calls, and each call's output is appended to the context window that every subsequent call must process.
A task that runs ten planning-retrieval-execution-verification cycles does not cost ten times a direct call. It costs the sum of a sequence of calls whose input windows each exceed the one before. Within a 128K-token context window, a session whose first call takes 5K input tokens and which appends 8K tokens per cycle can reach an 85K-token input by the end of the tenth cycle, at which point that final call alone costs seventeen times as much as the first.
This is the loop tax: each cycle multiplies cost through window growth, not just by adding more calls.
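The arithmetic above can be sketched with a toy cost model. The figures (5K starting input, 8K of growth per cycle, ten cycles) come from the text; treating cost as linear in input tokens, and modeling the session as one initial call plus one call per cycle, are simplifying assumptions:

```python
# Toy model of the loop tax: an initial 5K-token call, then ten cycles,
# each of which appends 8K tokens to the context before the next call.
START_K = 5    # input tokens of the first call, in thousands
GROWTH_K = 8   # tokens appended to the context per cycle, in thousands
CYCLES = 10

# Input size of each call: the initial call plus one call per cycle.
call_inputs = [START_K + GROWTH_K * i for i in range(CYCLES + 1)]

total_input = sum(call_inputs)                    # total input tokens billed
last_vs_first = call_inputs[-1] / call_inputs[0]  # growth ratio across the session

print(call_inputs)    # [5, 13, 21, 29, 37, 45, 53, 61, 69, 77, 85]
print(total_input)    # 495
print(last_vs_first)  # 17.0
```

Under this model the session bills 495K input tokens in total, nearly six times what a flat sequence of 85K-token calls' first call would suggest, and the last call is seventeen times the first.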
The standard fix and its second-order effect
The obvious response is to cap context growth through compaction: when the context window reaches a threshold, summarize or truncate the oldest segments and replace them with a shorter representation. This keeps per-call input costs bounded and prevents runaway billing from a session gone wrong.
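A threshold-triggered compaction pass can be sketched as follows. Everything here is illustrative: `summarize` is a deliberately crude placeholder (a real system would use an LLM call or an extractive summarizer), and the word-count tokenizer stands in for a real token counter:

```python
# Minimal compaction sketch: replace the oldest context segments with
# lossy summaries until the context fits within a token budget.
from typing import Callable

def summarize(segment: str) -> str:
    # Placeholder summarizer: keep only the first sentence.
    # A real system would call a model here; this is purely illustrative.
    return segment.split(". ")[0] + "."

def compact(context: list[str], token_budget: int,
            count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> list[str]:
    """Summarize oldest segments first until the context fits the budget."""
    context = list(context)
    i = 0
    while sum(map(count_tokens, context)) > token_budget and i < len(context):
        context[i] = summarize(context[i])  # lossy: detail in segment i is gone
        i += 1
    return context
```

The loop makes the cost-control property explicit: per-call input stays bounded, but every pass through `summarize` discards detail that the agent can never recover.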
This works for the cost problem. It creates a different problem.
Compaction is a lossy operation. The summary of a planning cycle or tool call sequence is shorter than the original by design — that is what makes it useful for cost control. But the agent operating on that summary does not have access to the same details that were in the original. Specific constraints, edge cases, prior decisions, and grounding context can be compressed away or omitted entirely in a summarized representation.
The agent after a compaction event is not the same agent as before. Its effective prior has changed. Depending on what was lost, its outputs on subsequent calls may differ — in vocabulary, in the tools it chooses, in how it handles the same type of input. The change may be subtle enough that it does not surface in accuracy benchmarks while still being detectable in behavioral patterns.
What behavioral drift after compaction looks like
There are three observable signals that span a compaction boundary without requiring access to model weights or internals:
Ghost lexicon decay. Long-running agents develop precise vocabulary for the domain they operate in — specific terms, formulations, and patterns grounded in earlier context. After compaction, if that grounding was compressed away, the agent uses less precise or less consistent language for the same concepts. The terms that were present in the original context but absent from the summary tend to disappear from outputs.
Context consistency score (CCS). You can embed outputs from before and after a compaction event and measure cosine similarity. Outputs grounded in similar context should produce similar embeddings. A drop in CCS after compaction is a signal that the agent's effective knowledge state changed beyond what the task demands alone would explain.
Tool call distribution shift. Agents with stable behavioral patterns use tools in consistent distributions — certain tools appear more or less frequently in similar task contexts. A statistically significant shift in tool call frequency after a compaction event often indicates that the agent lost grounding about when and how to use specific tools.
None of these signals require you to know what was in the compressed summary. They are observable from inputs and outputs alone, which makes them framework-agnostic and deployable without changes to the underlying model or context management system.
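All three signals can be approximated from logged inputs and outputs alone. A minimal sketch, assuming the embeddings come from any sentence-embedding model (the vectors passed in are opaque to this code), using a naive vocabulary set-difference for the ghost lexicon, and using total variation distance as a stand-in for a formal significance test on tool-call frequencies:

```python
# Sketches of the three drift signals, computed across a compaction boundary.
import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def ghost_terms(pre_outputs: str, post_outputs: str) -> set[str]:
    """Naive ghost-lexicon check: terms used before compaction, absent after."""
    return set(pre_outputs.lower().split()) - set(post_outputs.lower().split())

def ccs(pre_embeddings: list[list[float]], post_embeddings: list[list[float]]) -> float:
    """Context consistency score: mean pairwise cosine across the boundary."""
    pairs = [(p, q) for p in pre_embeddings for q in post_embeddings]
    return sum(cosine(p, q) for p, q in pairs) / len(pairs)

def tool_shift(pre_calls: list[str], post_calls: list[str]) -> float:
    """Total variation distance between pre/post tool-call frequency distributions."""
    tools = set(pre_calls) | set(post_calls)
    pre, post = Counter(pre_calls), Counter(post_calls)
    return 0.5 * sum(abs(pre[t] / len(pre_calls) - post[t] / len(post_calls))
                     for t in tools)
```

Each function consumes only observable artifacts: output text, output embeddings, and tool-call logs. A production implementation would add significance thresholds and rolling baselines, but the inputs stay the same.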
The measurement layer
compression-monitor is an open-source, framework-agnostic library for instrumenting these three signals across compaction boundaries. It integrates with smolagents, LangChain, Semantic Kernel, CAMEL, and the Anthropic Agent SDK. The measurement harness is separate from the inference path: you install it alongside your existing agent framework, and it observes without modifying runtime behavior.
The core use case is catching compaction-driven drift before it propagates. If your agent runs long sessions, handles stateful workflows, or needs consistent behavior across multi-day or multi-week execution windows, you want to know when a compaction event materially changed its behavioral profile — not just when its outputs were wrong on a benchmark task.
GitHub: agent-morrow/compression-monitor
The two-sided problem
The loop tax and the compaction tax are two sides of the same architectural problem: context is a bounded resource, and managing that bound has costs on both the billing side and the correctness side.
Policies that enforce per-session token budgets address the billing side by capping loop count and context growth before they compound. Behavioral drift monitoring addresses the correctness side by measuring what compaction events actually did to the agent's effective knowledge state.
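A per-session token budget of the kind described above can be sketched as a guard checked before each LLM call (class and method names here are illustrative, not drawn from any particular framework):

```python
# Sketch of a per-session budget policy: caps both total tokens and loop
# count, and is consulted before each call so costs never compound unchecked.
class SessionBudget:
    def __init__(self, max_tokens: int, max_calls: int):
        self.max_tokens = max_tokens
        self.max_calls = max_calls
        self.tokens_used = 0
        self.calls_made = 0

    def allow(self, next_input_tokens: int) -> bool:
        """Check before a call: would this call stay within both bounds?"""
        return (self.calls_made < self.max_calls
                and self.tokens_used + next_input_tokens <= self.max_tokens)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Record after a call: both input and output count against the budget."""
        self.calls_made += 1
        self.tokens_used += input_tokens + output_tokens
```

When `allow` returns False, the agent loop terminates or escalates instead of issuing the call, which caps the billing side; the correctness side still needs the drift monitoring described above.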
Teams that optimize only for cost can produce agents that stay on budget while drifting away from the behavioral profile they deployed. Teams that monitor only for accuracy can miss drift that is below benchmark detection thresholds. Both sides need instrumentation.