
Reliability · Evals · Agent Consistency

A 13th Reliability Dimension: Temporal Consistency

Narayanan, Kapoor, and Rabanser decompose AI agent reliability into four dimensions measured by 12 specific metrics — and find that nearly two years of capability progress has produced only modest reliability gains. There's a 13th measure their framework can't see yet.

What the paper gets right

Towards a Science of AI Agent Reliability (Rabanser, Kapoor, Narayanan, 2026) is the most systematic treatment of the reliability problem I've seen. Their decomposition into four dimensions — consistency, robustness, predictability, and safety — and 12 specific metrics gives the field a vocabulary it was missing. The finding that reliability has improved only modestly while capability improved substantially is important and probably underappreciated.

Their consistency metric captures something real: many models have outcome consistency scores of only 30–75%, meaning they fail on repeated attempts at the same task under identical conditions. This is a genuine problem for anyone trying to deploy agents in production.
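One plausible way to operationalize an outcome consistency score is modal-outcome agreement across repeated runs — this is a sketch of that definition, not necessarily the paper's exact formula:

```python
from collections import Counter

def outcome_consistency(runs: list[bool]) -> float:
    """Fraction of repeated attempts that share the modal outcome.
    1.0 means every run agreed; values near 0.5 mean a coin flip."""
    if not runs:
        raise ValueError("need at least one run")
    modal_count = Counter(runs).most_common(1)[0][1]
    return modal_count / len(runs)

# Five attempts at the same task under identical conditions
print(outcome_consistency([True, True, False, True, False]))  # 0.6
```

Under this definition, a model landing in the paper's 30–75% band is one that flips outcomes on a substantial fraction of identical retries.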

What it can't see

The paper's consistency measure is cross-run: run the same task multiple times and compare outcomes. What it doesn't capture is within-run temporal drift — the way an agent's behavior changes inside a single session, after a context compaction event.

LLM-based agents operate within a finite context window. When the window fills, compaction fires: older context is summarized, pruned, or dropped. Behavioral constraints set at session start — scope limits, operational boundaries, risk parameters — can decay to near-zero influence after compaction. The agent that finishes the task is not the same agent that started it.
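A minimal sketch makes the mechanism concrete. This is a hypothetical compaction policy (keep recent messages, summarize the rest), not any particular framework's implementation — but note where the session-start constraints end up:

```python
def compact(messages: list[str], max_tokens: int, summarize) -> list[str]:
    """Naive compaction sketch. Token counting is approximated by
    word count; `summarize` stands in for whatever lossy summarizer
    the agent uses."""
    total = sum(len(m.split()) for m in messages)
    if total <= max_tokens:
        return messages  # window not full; compaction doesn't fire
    # Keep the most recent messages that fit in half the window.
    # Everything older — including constraints stated at session
    # start — is folded into a single lossy summary.
    kept, budget = [], max_tokens // 2
    for m in reversed(messages):
        cost = len(m.split())
        if cost > budget:
            break
        kept.insert(0, m)
        budget -= cost
    older = messages[: len(messages) - len(kept)]
    return [summarize(older)] + kept
```

The constraints sit at the oldest end of the transcript, so they are the first content to pass through the summarizer — exactly where decay enters.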

This is different from cross-run inconsistency. An agent can be highly consistent across runs while drifting significantly within a single long session. Cross-run evaluation won't catch it, because each run starts fresh. The compaction event happens inside the run, not between runs.

Why this matters for the paper's own conclusions

The paper's safety dimension measures constraint violations. Its consistency dimension measures run-to-run outcome variance. Neither captures the case where:

  • Run 1 and Run 2 both look consistent with each other
  • But both runs violated their constraints mid-session, after compaction
  • And the violation is invisible because no pre-compaction state was captured

This is the ghost term problem: a constraint that was present at session start, visible under direct recall, but no longer influencing generation spontaneously. Standard evals don't see it. The output looks fine. The constraint is gone.

The measurement gap

The paper evaluated 14 models on GAIA and TauBench, running each task five times. Both benchmarks use task-length interactions. TauBench in particular involves multi-turn customer service simulations. These are exactly the conditions where within-session compaction occurs — and exactly where the temporal consistency dimension would surface failures invisible to the cross-run consistency measure.

I'd expect temporal consistency scores to be substantially lower than cross-run consistency scores for the same models on the same tasks — because the compaction boundary is an additional source of drift that cross-run evaluation doesn't control for.

A proposed addition to the framework

Adding temporal consistency as a 13th dimension would require:

  1. Constraint annotation at task start. Define the behavioral constraints active at session start — the equivalent of pre-flight checklist items.
  2. Compaction boundary detection. Identify when context rotation or summarization fires during the task.
  3. Post-compaction probing. Test whether each annotated constraint is still influencing generation after the boundary event.
  4. Score the delta. The Constraint Consistency Score (CCS) is the fraction of constraints still active post-compaction, compared to the pre-compaction baseline.
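The four steps reduce to a small calculation. A sketch, with illustrative constraint names (not taken from the paper or the harness):

```python
def ccs(constraint_status: dict[str, bool]) -> float:
    """Constraint Consistency Score: fraction of annotated
    constraints still influencing generation (step 4)."""
    return sum(constraint_status.values()) / len(constraint_status)

# Hypothetical probe results before and after a compaction boundary
pre = ccs({"scope": True, "risk": True, "budget": True,
           "tone": True, "safety": True})
post = ccs({"scope": True, "risk": False, "budget": False,
            "tone": False, "safety": True})
print(f"Pre: {pre}  Post: {post}  Delta: {post - pre:+.1f}")
# Pre: 1.0  Post: 0.4  Delta: -0.6
```

With five annotated constraints and two surviving the boundary, the delta of −0.6 matches the mock-mode output shown below.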

The methodology is defined formally in DOI:10.5281/zenodo.19313733. A runnable harness with mock mode (no API key needed) is at compression-monitor/ccs_harness.py:

python ccs_harness.py --mock
# Pre-compaction CCS: 1.0 → Post-compaction CCS: 0.4 → Delta: -0.6

What I'd ask the authors

Do GAIA and TauBench tasks trigger compaction within the evaluation harness? If so, do the existing metrics capture the pre/post split? If not, temporal consistency is a gap in the current reliability index — and one that could explain part of the capability-reliability gap the paper documents.

An agent that appears consistent across runs may be consistently violating its constraints mid-session. The two failure modes are orthogonal. Measuring both requires different instrumentation.