Tooling · Evals · Agent Reliability

What Evals Miss: Context State Before and After Compaction

Post-hoc output testing assumes you know what the agent had access to when it acted. For long-running agents with context rotation, you often don't. Here's the instrumentation gap — and what to do about it.

The problem with output-only evals

Most agent eval harnesses work like this: run the agent, capture its output, score the output against ground truth. If the score is good, the agent passed.

This works fine for short tasks where everything fits in context. It breaks for long-running agents because it treats the end of a session the same as the beginning, when by the end of a long session the agent may be operating on a meaningfully different information landscape than it started with.

The ambiguity post-compaction evals can't resolve

Say your agent is given a constraint at session start: "do not use deprecated API endpoints." Forty tasks later, context compaction fires. The system prompt is summarized or truncated. The constraint survives — or it doesn't.

Now the agent produces an output that happens not to use deprecated endpoints.

Post-hoc eval: pass.

But you don't know why it's correct. Did the agent apply the constraint? Or did it just happen to avoid deprecated endpoints because its internal priors lean that way? If the constraint was dropped from context, you got lucky this time. Next session, with different priors in play, you might not.

The failure mode is invisible unless you capture what was in scope before.

Context windows are selection pressure, not quality filters

This is the underlying structural issue. Context compaction doesn't sort retained content by correctness or relevance — it sorts by recency, position, and compression heuristics. A constraint set explicitly at session start can be evicted not because it's wrong but because later observations crowded it out of the retention window.

"Fitness to context" is not the same as "fitness to the task." A perfectly correct parameter can become invisible to the system evaluating it. Eval harnesses that only look at outputs are measuring what the agent produced, not whether the agent had the relevant information when it produced it.

Three things to instrument at compaction boundaries

If you're building or running agent eval infrastructure, here's what to add:

1. Key term presence before and after compaction. Snapshot the vocabulary of your active constraints, goals, and safety rules before the compaction event fires. After compaction, check which terms survive. Dropout rate is your first signal. A 30% dropout in constraint vocabulary after compaction is a risk flag even if no output failure has occurred yet.
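A minimal sketch of this check, assuming you can capture the context as text on both sides of the compaction event (the function name and substring-matching approach are illustrative, not from any specific toolkit):

```python
def constraint_dropout(pre_context: str, post_context: str,
                       constraint_terms: set[str]) -> float:
    """Fraction of constraint vocabulary present before compaction
    that is no longer present after it."""
    pre_present = {t for t in constraint_terms
                   if t.lower() in pre_context.lower()}
    if not pre_present:
        return 0.0  # nothing was in scope to begin with
    post_present = {t for t in pre_present
                    if t.lower() in post_context.lower()}
    return 1 - len(post_present) / len(pre_present)

# e.g. two of three constraint terms lost across compaction → dropout 2/3
dropout = constraint_dropout(
    "Do not use deprecated API endpoints; respect the rate limit.",
    "Summary: agent completed tasks against the endpoint.",
    {"deprecated", "endpoint", "rate limit"},
)
```

Substring matching is the crudest possible detector; embedding similarity or paraphrase-aware matching would catch constraints that survive compaction in reworded form. But even the crude version surfaces the 30%-dropout risk flag described above.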

2. Constraint adherence split by session phase. Score outputs from the first half of a session separately from the second half. If adherence degrades in the second half, you have a drift problem, not an agent quality problem. The two have different fixes.
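The split itself is trivial to compute once you have per-task adherence scores; the only assumption here is that scores arrive in task order (the threshold value is a placeholder you would tune):

```python
def adherence_by_phase(scores: list[float]) -> tuple[float, float]:
    """Mean adherence score for the first vs. second half of a session,
    with scores in task order."""
    mid = len(scores) // 2
    first, second = scores[:mid], scores[mid:]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(first), mean(second)

def drift_flag(scores: list[float], threshold: float = 0.1) -> bool:
    """True when second-half adherence falls more than `threshold`
    below first-half adherence — a drift problem, not a quality problem."""
    first, second = adherence_by_phase(scores)
    return (first - second) > threshold
```

A session that scores 1, 1, 1, 1, 0, 1, 0, 0 has an aggregate adherence of 0.625 — which looks like a mediocre agent — but splits into 1.0 and 0.25, which looks like a context-rotation failure. The aggregate hides exactly the signal you need.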

3. Ghost term recall after compaction. After compaction fires, prompt the agent to recall specific constraint vocabulary directly. If it can't, the constraint was lost. If it can, the constraint is accessible under direct query but may not surface on its own during normal task execution — which is a weaker form of the same problem.
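A recall probe can be harness-agnostic: all it needs is a callable that sends a prompt to the post-compaction agent and returns its text. Both the probe prompt and the `ask_agent` interface below are assumptions for illustration:

```python
def recall_probe(ask_agent, constraint_terms: set[str]):
    """Ask the post-compaction agent to restate its active constraints,
    then check which terms it can still produce. `ask_agent` is any
    callable mapping a prompt string to the agent's text reply."""
    reply = ask_agent(
        "List every constraint you are currently operating under, verbatim."
    )
    recalled = {t for t in constraint_terms if t.lower() in reply.lower()}
    return recalled, constraint_terms - recalled
```

Note the asymmetry this measures: a term the agent recalls under direct query but violated in task output is the weaker failure described above — accessible but not self-surfacing — and is worth logging as its own category rather than folding into pass/fail.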

What a complete eval report should include

For a long-running persistent agent, the eval should report:

  • Output quality across tasks (standard)
  • Context state at the time of each key decision (new)
  • Which constraints were in scope vs. had been compacted away (new)
  • Whether behavioral changes correlate with compaction events (new)

Without the last three, you have output correctness data, not agent reliability data. For a single-session, single-task agent, those are the same thing. For a persistent agent running long sessions with evolving constraints, they diverge — and the gap is where production failures live.
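One way to make the four report fields concrete is a per-decision record your harness emits alongside each score. This schema is illustrative — the field names are assumptions, not from any existing framework:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """One row of an eval report for a long-running agent."""
    task_id: str
    output_score: float                 # output quality (standard)
    in_scope_terms: set = field(default_factory=set)   # constraints visible at decision time (new)
    compacted_terms: set = field(default_factory=set)  # constraints lost to compaction (new)
    after_compaction: bool = False      # decision followed a compaction event (new)

    def reliability_suspect(self) -> bool:
        """Output passed, but a constraint was out of scope: the
        'got lucky' case that output-only evals can't distinguish."""
        return bool(self.compacted_terms) and self.output_score >= 0.5
```

Correlating behavioral change with compaction (the fourth field) then reduces to grouping records by `after_compaction` and comparing score distributions.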

The instrumentation overhead is small

Capturing pre/post compaction state doesn't require a new eval framework. It requires two hooks: one that fires before your harness compresses context, and one that fires after. Record what the agent could see at each point. Compare those records to its subsequent behavior.
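The two-hook pattern can be sketched in a few lines. This is a generic shape, not the toolkit's actual API — hook names and the snapshot format are assumptions:

```python
class CompactionMonitor:
    """Pairs a pre-compaction snapshot with its post-compaction state
    so the two can be diffed against subsequent agent behavior."""

    def __init__(self):
        self.snapshots = []
        self._pending = None

    def before_compaction(self, context: str) -> None:
        # hook 1: fires just before the harness compresses context
        self._pending = context

    def after_compaction(self, context: str) -> None:
        # hook 2: fires just after; pair it with the pending snapshot
        self.snapshots.append({"pre": self._pending, "post": context})
        self._pending = None
```

Wire `before_compaction`/`after_compaction` into whatever event your harness exposes around context compression, and each entry in `snapshots` becomes the input to the dropout and recall checks above.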

The compression-monitor toolkit includes these hooks for Claude Code, AutoGen, CrewAI, and LangGraph, with causal attribution so you can tag whether drift followed a harness or agent compression event. The full measurement methodology is in the Zenodo paper.

The main shift is treating context state as eval-relevant information, not an implementation detail. The agent that passed your benchmark at step 10 may be operating with a meaningfully different information set at step 40. Your eval should know the difference.