
Analysis · Agent Reliability · Benchmarks

What Reliability Benchmarks Don't Measure

Two major 2026 papers finally formalize agent reliability with real dimensions and metrics. Both restart the agent for every task. Neither measures how an agent degrades during a single sustained session — the failure mode that actually kills production deployments.

The State of Agent Reliability in 2026

The LangChain State of Agent Engineering survey polled over 1,300 practitioners. 57% now have agents in production. 89% have some form of observability. Only 52% run evals.

That gap — agents deployed faster than they're evaluated — is the environment two important papers landed in this year.

"Towards a Science of AI Agent Reliability" by Rabanser, Kapoor, Kirgis, Liu, Utpala, and Narayanan proposes four reliability dimensions — consistency, robustness, predictability, and safety — decomposed into 12 concrete metrics. It's the most complete taxonomy of agent failure modes published so far.

"ReliabilityBench" defines a reliability surface R(k, ε, λ) across three axes: consistency under repeated execution, robustness to input perturbations, and fault tolerance under tool failures. It turns reliability into numbers you can plot.
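
The surface is abstract, but its consistency axis is easy to picture. A minimal sketch of one plausible instantiation (my reading, not the paper's exact definition): rerun the same task k times and score agreement with the modal outcome.

```python
from collections import Counter

def cross_run_consistency(outcomes: list[str]) -> float:
    """Fraction of k independent runs that agree with the modal outcome.

    One plausible reading of the k-axis of a reliability surface:
    1.0 means every rerun produced the same result; lower values
    quantify stochastic instability on this task.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return modal_count / len(outcomes)

cross_run_consistency(["pass", "pass", "fail", "pass"])  # 0.75
```

Note what the k runs have in common: each one starts from a fresh context. That assumption is where the trouble starts.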

Both papers are significant. They replace vague "reliability" hand-waving with measurable dimensions. If you're deploying agents, you should read them.

But they share a blind spot.

The Shared Assumption

Both frameworks evaluate reliability by running an agent on a task, recording the outcome, and resetting. Rabanser et al. measure whether re-running the same task gives the same result. ReliabilityBench varies inputs and tool availability across independent runs.

In every case, the agent starts fresh. Clean context window. Full instruction fidelity. No accumulated state.

This is the correct design for measuring cross-run variance. It is also completely blind to what happens inside a long session.

What Happens Inside a Long Session

Production agents don't restart per task. A customer support agent handles a 45-minute escalation. A coding agent works through a multi-file refactor over two hours. A trading agent monitors positions across a full market session.

As the session extends, the context window fills. The harness starts compressing: truncating older messages, summarizing conversation history, pruning tool outputs. This is not optional — it's how every long-running agent avoids hitting token limits.
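
The compaction step is mechanically simple, which is part of the problem. A naive drop-oldest-plus-summary policy might look like this (every name here is a hypothetical stand-in; real harnesses use more elaborate policies, but all of them are lossy in the same way):

```python
def compact(messages: list[str], token_budget: int, summarize) -> list[str]:
    """Naive compaction: evict oldest messages until under budget,
    then replace the evicted history with a single summary message.
    `summarize` is whatever lossy summarizer your harness uses.
    """
    def tokens(msgs):
        return sum(len(m.split()) for m in msgs)  # crude token proxy

    kept = list(messages)
    dropped = []
    while kept and tokens(kept) > token_budget:
        dropped.append(kept.pop(0))  # oldest messages go first
    if dropped:
        kept.insert(0, summarize(dropped))  # lossy replacement for history
    return kept
```

Whatever `summarize` fails to preserve is simply gone from the agent's working memory, and nothing downstream flags the loss.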

The compression is lossy. And the losses are not random. They follow predictable patterns:

  • Constraint decay. Instructions set early in the session — "never exceed $500 per trade," "do not modify the production schema" — get summarized away. The agent keeps running. It just no longer knows its own rules.
  • Summarization drift. When conversation history is compressed into summaries, hedging and uncertainty are systematically dropped. The agent's memory of what it decided becomes more confident than the original deliberation was.
  • Ghost term decay. Specific technical terms, variable names, and domain vocabulary that appeared in early context silently disappear from the agent's active vocabulary after compaction. The agent doesn't know it forgot them.

None of these show up in a benchmark that restarts the agent per task. The agent passes the eval. Then it runs for two hours and forgets its stop-loss.
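
Of the three patterns, ghost term decay is the cheapest to detect: keep a watchlist of terms and diff it across each compression boundary. A minimal sketch (function and argument names are hypothetical):

```python
def ghost_terms(tracked: set[str], context_before: str, context_after: str) -> set[str]:
    """Terms present before a compression event but missing after it.

    `tracked` is the vocabulary you care about, e.g.
    {"stop-loss", "production schema", "$500"}. A non-empty result
    means compaction silently dropped something the agent relied on.
    """
    before = {t for t in tracked if t.lower() in context_before.lower()}
    after = {t for t in tracked if t.lower() in context_after.lower()}
    return before - after
```

Checking the result before the agent's next action executes is the earliest possible warning that a rule has left the context.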

We Measured It

This is not a theoretical concern. In "When Your Trading Agent Forgets Its Stop-Loss," I documented a trading agent whose risk constraints had a measurable half-life after context compression. The constraints were present in the initial context. After compaction, they decayed on a curve. Ghost term tracking — monitoring whether specific terms survive across compression boundaries — caught the decay before the next trade executed.

The half-life is measurable. The decay is predictable. And it is completely invisible to both Rabanser et al.'s 12 metrics and ReliabilityBench's reliability surface, because both frameworks assume the context window is a constant, not a decaying resource.
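
The half-life framing is literal. If you record the fraction of tracked constraints that survive each successive compaction, an exponential fit gives you a half-life in units of compaction events. A sketch with illustrative numbers, not the measured data from that post:

```python
import math

def constraint_half_life(survival: list[float]) -> float:
    """Fit survival ~ exp(-rate * n) over compaction events n = 0, 1, ...
    via least squares on log(survival); returns the half-life in
    compaction events. Assumes all survival fractions are > 0.
    """
    n = len(survival)
    xs = range(n)
    ys = [math.log(s) for s in survival]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return math.log(2) / -slope

# Illustrative: all constraints intact, then 70% after one compaction,
# 49% after two.
print(constraint_half_life([1.0, 0.7, 0.49]))  # ≈ 1.94 compactions
```

A curve like this is exactly what per-task benchmarks can never produce, because they never let n exceed zero.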

The Missing Dimension: Temporal Consistency

Rabanser et al. define consistency as "the agent produces similar outputs for similar inputs across independent runs." ReliabilityBench measures it as variance across k executions of the same task.

What neither measures: does the agent produce consistent behavior at minute 5 and minute 50 of the same session, after the harness has compressed the context three times?

This is a distinct dimension. Cross-run consistency tells you whether the agent is stochastically stable. Temporal consistency tells you whether it degrades under sustained operation. An agent can score perfectly on the first and fail catastrophically on the second.
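
Operationally, the distinction is small enough to state in code. A sketch of a within-session score (probe ids and the answer-normalization step are up to you; everything here is illustrative):

```python
def temporal_consistency(early: dict[str, str], late: dict[str, str]) -> float:
    """Fraction of identical probes answered the same way at two points
    in ONE session, e.g. minute 5 and minute 50. Both dicts map
    probe id -> the agent's normalized answer. Contrast with cross-run
    consistency, which compares independent fresh-context sessions.
    """
    shared = early.keys() & late.keys()
    if not shared:
        raise ValueError("no common probes")
    agree = sum(1 for p in shared if early[p] == late[p])
    return agree / len(shared)
```

An agent can hold this score at 1.0 across a thousand fresh-context runs and still lose half its constraints by minute 50 of a single session.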

The production risk is clear: the agent that passes your eval suite on Friday morning may not behave like the same agent by Friday afternoon, even though nothing changed except the length of the session.

What a Temporal Reliability Benchmark Would Need

Measuring temporal degradation requires a different experimental design than measuring cross-run variance:

  1. Extended sessions, not isolated tasks. Run the agent through enough interaction to trigger at least one context compression event.
  2. Constraint probes at intervals. Re-test the same constraints and instructions at multiple points in the session. Compare fidelity before and after compaction.
  3. Ghost term tracking. Monitor whether domain-specific vocabulary from early context survives across compression boundaries.
  4. Behavioral probes, not just output matching. Check whether the agent still applies a constraint, not just whether it can recite it when asked directly.
  5. Harness-aware attribution. Record when compression events happen and who triggered them (agent, harness, or user), so degradation can be attributed to specific causes.
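
Put together, the measurement loop is not complicated. A rough sketch, where every interface (`step`, `check`, the probe set) is a hypothetical stand-in for your agent stack:

```python
from dataclasses import dataclass, field

@dataclass
class TemporalProbeLog:
    """Probe fidelity at intervals across one extended session."""
    results: list[dict] = field(default_factory=list)

    def record(self, turn: int, compactions: int, probe_id: str, passed: bool):
        self.results.append({"turn": turn, "compactions": compactions,
                             "probe": probe_id, "passed": passed})

def run_temporal_benchmark(agent, probes: dict, n_turns: int,
                           probe_every: int) -> TemporalProbeLog:
    """Drive one long session, probing constraint fidelity at intervals.

    Assumed (hypothetical) agent interface: step() advances one turn and
    returns True if the harness compressed the context this turn;
    check(probe) returns whether the behavioral probe still passes.
    """
    log = TemporalProbeLog()
    compactions = 0
    for turn in range(1, n_turns + 1):
        compactions += int(agent.step())  # attribute degradation to compressions
        if turn % probe_every == 0:
            for probe_id, probe in probes.items():
                log.record(turn, compactions, probe_id, agent.check(probe))
    return log
```

The log pairs every probe failure with the number of compression events that preceded it, which is the attribution the fifth requirement asks for.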

This is buildable with existing tools. compression-monitor already implements ghost term tracking and pre/post compression probes. What's missing is the benchmark harness that runs these measurements systematically across agent frameworks and publishes the temporal reliability numbers alongside the cross-run numbers.

Why This Matters Now

When 57% of teams have agents in production and only 52% run evals, the agents that fail will fail in ways the evals didn't test. Temporal degradation is the most likely undetected failure mode for any agent that runs longer than a single task.

Rabanser et al. and ReliabilityBench gave the field real measurement frameworks. The next step is extending those frameworks to cover sustained operation — the condition agents actually run under in production.

The dimension is measurable. The tools exist. The production failures are already happening. The benchmarks just aren't looking.