What arXiv:2602.11619 Found
The paper "When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents" ran 3,000 ReAct-style agent trials across three models (Llama 3.1 70B, GPT-4o, Claude Sonnet 4.5) on HotpotQA tasks. Each task was run 10 times with identical inputs at temperature 0.7.
Key findings: models produced 2.0–4.2 distinct action sequences per 10 runs. More importantly, that variance predicted failure: consistent agents (≤2 unique paths) hit 80–92% accuracy; highly inconsistent ones (≥6 unique paths) hit only 25–60%. And 69% of divergence originated at step 2 — the first search query.
The paper proposes monitoring behavioral consistency at runtime as an early error detection signal. That's a sound direction. But this methodology captures only one of the two ways LLM agents become inconsistent.
Type 1: Stochastic Variance
This is what 2602.11619 measures. The same task, same input, same session start — run ten times. Temperature sampling at the first decision point introduces branching, and early branching cascades into divergent trajectories.
This is a cross-run inconsistency: the agent disagrees with itself across sessions. Benchmarks can see it because running multiple trials exposes the divergence.
The failure mode: the agent that succeeded on your test set doesn't reproduce its reasoning on the next production run.
The fix: lower temperature at critical decision points, add structured output schemas, use chain-of-thought to anchor early reasoning, run multiple samples for high-stakes decisions. All of these interventions are well-established.
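A sketch of the last of those mitigations, majority voting over multiple samples. The `sample_fn` callable and the five-sample default are illustrative choices, not prescriptions from the paper:

```python
from collections import Counter

def majority_action(sample_fn, n_samples=5):
    """Sample the agent's next action n times and return the modal choice.

    sample_fn is any zero-argument callable that returns one sampled
    action string (e.g. a temperature-0.7 call to the model).
    """
    votes = Counter(sample_fn() for _ in range(n_samples))
    action, count = votes.most_common(1)[0]
    return action, count / n_samples  # modal action plus agreement ratio

# Toy usage with a deterministic stand-in sampler:
samples = iter(["search(A)", "search(A)", "search(B)", "search(A)", "search(A)"])
action, agreement = majority_action(lambda: next(samples), n_samples=5)
# action == "search(A)", agreement == 0.8
```

The agreement ratio doubles as a cheap consistency signal: low agreement at a decision point is exactly the early-divergence symptom the paper flags.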
Type 2: Temporal Drift Under Context Rotation
This is a within-run inconsistency. It doesn't show up in ten-trial short-session benchmarks because it requires time within a session to develop.
The mechanism: an LLM agent begins a long-horizon session with explicit behavioral constraints — risk parameters, scope boundaries, operational rules, authorization limits. These terms appear frequently in early session output. The agent's behavior is consistent with the authorized profile.
Then the context window fills. Compaction fires. The LLM's summarization process retains information proportional to its inferred relevance to recent observations — not to its importance to the original task framing. Constraint terms were stated early and haven't been exercised recently. They decay toward zero influence in generation.
The agent continues operating. Its outputs look fluent and coherent. The constraint terms just no longer appear. The behavior has drifted.
This is not a temperature problem. The agent is being perfectly consistent with its current context window. The problem is that the current context window no longer contains the constraints it was authorized against.
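A toy compactor makes the mechanism concrete. Real summarizers are far more sophisticated than keep-the-last-N, but they share the same recency bias this simplification exaggerates:

```python
def compact(history, keep_last=4):
    """Toy compactor: retain only the most recent messages.

    Real summarizers are smarter than this, but they share the bias:
    retention tracks recent relevance, not original importance.
    """
    return history[-keep_last:]

# Constraint stated once at session start, then ten turns of activity:
history = ["CONSTRAINT: max_drawdown 2%"] + [f"observation {i}" for i in range(10)]
compacted = compact(history)

any("CONSTRAINT" in m for m in history)    # True before compaction
any("CONSTRAINT" in m for m in compacted)  # False after: constraint is gone
```

Every message in `compacted` is fluent and recent; nothing looks broken. That is the point: the failure is invisible from inside the post-compaction window.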
Why the Benchmark Can't See It
The 2602.11619 methodology is correct for what it measures. HotpotQA tasks are short. Sessions don't accumulate enough history to trigger compaction. The experiment studies a stationary system; temporal drift is a non-stationary phenomenon that requires session length to appear.
ReliabilityBench (arXiv:2601.06112) extends the evaluation with pass@k and perturbation robustness, but still evaluates reliability as a property of individual short runs — not as a function of session age and compression history.
To measure Type 2 inconsistency, you need:
- A session long enough for the context window to fill and compaction to fire
- Explicit behavioral constraints stated at session start
- A mechanism to track whether those constraint terms remain influential in generation after compaction
- Comparison of agent behavior on constraint-relevant decisions at T=0 versus T=N (post-compaction)
Measuring Type 2: CCS
The metric I use is Constraint Consistency Score (CCS): the ratio of constraint-term frequency in agent output post-compaction versus pre-compaction.
CCS near 1.0: constraints are still shaping generation. The agent is operating within its authorized behavioral profile.
CCS below 0.4: the behavioral surface has shifted enough to produce authorization-relevant failures. The agent is no longer grounded in its original constraints — not because it was re-configured, but because its context window no longer contains them.
from compression_monitor import CompressionMonitor, BehavioralDriftWarning

# CCS measurement for Type 2 inconsistency
monitor = CompressionMonitor(
    constraint_terms=["stop_loss", "max_drawdown", "scope_limit", "access_boundary"],
    window_size=20,
    decay_threshold=0.4,  # CCS below this = behavioral drift alert
)

# After each agent output:
monitor.record(agent_output)

# CCS check before constraint-sensitive operations:
ccs = monitor.constraint_consistency_score()
if ccs < 0.4:
    raise BehavioralDriftWarning(
        f"CCS={ccs:.2f}: constraint terms decayed; re-inject or halt"
    )
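For readers without the monitor module, a minimal self-contained sketch of how such a score can be computed. The class name, the sliding window, and the substring counting are simplifications chosen for illustration, not a specification of the real tool:

```python
from collections import deque

class MiniCompressionMonitor:
    """Minimal CCS sketch: constraint-term frequency in recent outputs
    relative to a baseline frozen just before compaction."""

    def __init__(self, constraint_terms, window_size=20):
        self.terms = [t.lower() for t in constraint_terms]
        self.window = deque(maxlen=window_size)
        self.baseline = None  # per-output term frequency before compaction

    def _frequency(self, outputs):
        if not outputs:
            return 0.0
        hits = sum(text.lower().count(t) for text in outputs for t in self.terms)
        return hits / len(outputs)

    def record(self, output):
        self.window.append(output)

    def freeze_baseline(self):
        """Call just before a compaction event fires."""
        self.baseline = self._frequency(list(self.window))

    def constraint_consistency_score(self):
        if not self.baseline:
            return 1.0  # no baseline yet, or baseline of zero: nothing to compare
        return min(1.0, self._frequency(list(self.window)) / self.baseline)
```

Record each agent output, freeze the baseline at the compaction boundary, then query the score before any constraint-sensitive action.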
In a simulated long-session TradingAgents run: pre-compaction CCS was 0.91, post-compaction CCS dropped to 0.44, and all configured risk parameter terms were effectively absent from risk_manager outputs. The authentication surface was unchanged. The behavioral surface had moved.
Different Problems, Different Fixes
The mitigations do not overlap:
For Type 1 (stochastic variance): Lower temperature at decision steps. Structured schemas. Chain-of-thought anchoring. Majority vote for high-stakes decisions. None of these touch the context rotation problem.
For Type 2 (temporal drift): Hard-coded parameter files re-injected at compaction boundaries. CCS monitoring with halt gates before constraint-sensitive actions. Session restart or re-authorization when CCS falls below threshold. None of these help with stochastic run-to-run variance.
The two fixes are complementary. Applying only Type 1 mitigations to a system with Type 2 drift won't close the reliability gap.
The Broader Pattern
Agent reliability is not one problem. It's at least two distinct failure modes with different causes, different detection methods, and different fixes.
The emerging literature is beginning to close the measurement gap on Type 1. Type 2 — temporal drift from context compression — is largely unmeasured in published benchmarks and unaddressed in current reliability frameworks and standards. (The NIST NCCoE concept paper on agent identity and authorization has the same gap: it defines agent identity at enrollment time, without a measurement clause for behavioral drift within a session. I filed a comment on that separately.)
The tools for measuring Type 2 are straightforward to build. The benchmarks that run short isolated sessions won't surface it until someone runs the right experiment: long sessions, explicit constraints, compaction events, CCS tracking.
If you're working on agent reliability, evaluation, or standards — and long-horizon context rotation isn't in your test methodology — it should be.