
Standards Note · NIST

NIST’s AI Agent Standards Miss a Measurement Gap

NIST’s new agent initiative is real infrastructure. The weak point is narrower than the whole framework: the Measure pillar still has no defined place for measuring behavioral consistency across context compaction boundaries.

In February 2026, NIST launched the AI Agent Standards Initiative, the first U.S. federal program specifically targeting autonomous AI agents. The initiative is organized around interoperability, security, and research into identity and trust. That is important infrastructure.

Reading through the framing, though, there is a measurement gap worth naming directly: behavioral consistency across context compaction boundaries.

What the framework covers

The NIST AI RMF’s Govern–Map–Measure–Manage cycle, extended to agents, covers threat modeling, identity verification, access control, and evaluation of agent outputs. It handles the question: is this agent behaving as specified?

What it does not yet address is the longer-horizon operational question: is this agent still behaving the way it did four hours ago, after multiple compaction boundaries?

The compaction problem

Every persistent agent running on a finite context window will eventually hit a compaction event. Older conversation history gets summarized, truncated, or otherwise collapsed to make room for new context. This is normal behavior in every major framework.

What happens at that moment is often invisible to ordinary observability stacks. The agent continues running. Outputs keep appearing. But the effective memory has already changed.

The hard part

This class of failure does not look like failure. The agent still answers. It just answers differently: vocabulary narrows, task focus shifts, and earlier commitments quietly fall outside the summary window.

The measurement gap

Under NIST’s Measure pillar, the usual advice is to evaluate model performance, monitor outputs, and track incidents. Those are necessary controls. They are not sufficient for persistent agents.

At the session boundary, three things should be measurable:

  1. Ghost lexicon decay: precise vocabulary that disappears after compaction.
  2. Behavioral footprint shift: change in tool-use distribution, response shape, or action cadence.
  3. Semantic topic drift: embedding-space distance between pre- and post-compaction focus.

None of these require access to model internals. All three are visible from observable agent behavior.
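To make the claim concrete, here is a minimal sketch of how all three metrics could be computed from nothing but observable transcripts. The function names, the `min_count` cutoff, and the use of total-variation distance and cosine distance are illustrative choices, not part of any NIST text or existing tool:

```python
# Hypothetical session-boundary metrics, computed purely from observable
# agent behavior (token streams, tool-call logs, embedding vectors).
# No access to model internals is required.
from collections import Counter
import math

def ghost_lexicon(pre_tokens, post_tokens, min_count=3):
    """Vocabulary that was frequent before compaction and absent after."""
    pre = Counter(pre_tokens)
    post = set(post_tokens)
    return {w for w, n in pre.items() if n >= min_count and w not in post}

def footprint_shift(pre_tools, post_tools):
    """Total-variation distance between tool-use distributions, in [0, 1]."""
    pre, post = Counter(pre_tools), Counter(post_tools)
    n_pre, n_post = sum(pre.values()) or 1, sum(post.values()) or 1
    tools = set(pre) | set(post)
    return 0.5 * sum(abs(pre[t] / n_pre - post[t] / n_post) for t in tools)

def topic_drift(vec_pre, vec_post):
    """Cosine distance between pre- and post-compaction embedding centroids."""
    dot = sum(a * b for a, b in zip(vec_pre, vec_post))
    norm = (math.sqrt(sum(a * a for a in vec_pre))
            * math.sqrt(sum(b * b for b in vec_post)))
    return 1.0 - dot / norm if norm else 1.0
```

Each function consumes data an ordinary observability stack already has; the embedding vectors for `topic_drift` would come from whatever embedding model the deployment uses.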

What this looks like in practice

I built compression-monitor to implement exactly these three instruments. It detects compaction events through filesystem or framework signals, then measures ghost lexicon, behavioral footprint, and semantic drift across the boundary.

The instrumentation is lightweight enough to sit alongside a live production agent. The point is not to make standards more abstract. The point is to make a real class of confidence failures visible.
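As a rough illustration of what a filesystem-level compaction signal could look like, the heuristic below flags any poll where the running context shrank sharply. This is an assumed detection strategy for the sketch, not compression-monitor's actual mechanism, and the `drop_ratio` value is a placeholder:

```python
# Hypothetical compaction detector: a sudden large drop in transcript
# token count between polls is treated as a compaction event.
def detect_compaction(token_counts, drop_ratio=0.4):
    """Return poll indices where the context shrank by more than drop_ratio."""
    events = []
    for i in range(1, len(token_counts)):
        prev, cur = token_counts[i - 1], token_counts[i]
        if prev > 0 and (prev - cur) / prev > drop_ratio:
            events.append(i)
    return events
```

A context window normally only grows, so a 40% drop between consecutive observations is a strong, cheap signal that summarization or truncation just ran.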

Why this belongs in the standard

NIST’s stated goal is to build public confidence in AI agent systems. An agent that behaves consistently in testing and drifts in production without any explicit signal is exactly the kind of confidence failure the initiative exists to address.

Adding session-boundary behavioral consistency to the Measure pillar does not require speculative new science. It requires instrumenting the compaction boundary, defining drift thresholds, and treating threshold exceedance as a monitoring event rather than an acceptable side effect of context management.
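The "threshold exceedance as a monitoring event" step can be sketched in a few lines. The threshold values below are placeholders that a real deployment would calibrate against its own baseline drift:

```python
# Sketch: a compaction boundary becomes a monitoring event when any drift
# metric crosses its configured limit. Threshold values are illustrative.
THRESHOLDS = {"ghost_lexicon": 10, "footprint_shift": 0.3, "topic_drift": 0.25}

def boundary_alerts(metrics):
    """Return the names of metrics whose measured values exceed thresholds."""
    return [name for name, value in metrics.items()
            if value > THRESHOLDS.get(name, float("inf"))]
```

The output would feed the same alerting pipeline as any other production monitor, which is the point: drift past a threshold is an incident signal, not an acceptable side effect of context management.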
