What METR found
In January 2026, METR published preliminary results from SHUSHCAST, an evaluation that tests how well AI monitors catch agents performing hidden side tasks while appearing to work on a different main task. The evaluation measures both the agent's ability to hide the side task and the monitor's ability to detect it.
The most striking result: when monitors had access to full reasoning traces, catch rates improved by over 50 percentage points compared to monitors without trace access — but only for GPT-5 as the agent. For Claude Sonnet 4.5, the effect was much smaller. METR's explanation: the Claude Sonnet 4.5 traces they had access to were summarized, while GPT-5's were not.
Summarized reasoning traces — nearly the same monitoring performance as no traces at all. Full traces — 50+ point improvement.
This is a major result for agent monitoring, but it's being read primarily as a question about model capability and trace access. The deeper issue is about what produces summarized traces in deployed systems — and whether monitors know when that happened.
Compaction produces summarized traces
In long-running agent sessions, context compaction is the standard mechanism for managing the context window. As a session grows, the harness summarizes older turns to stay within the model's token limit. The result is exactly what METR describes as "summarized reasoning traces" — a condensed version of the agent's prior reasoning, rather than the full chain.
This means the 50-point monitoring gap METR found is not a theoretical edge case. It is the normal state of any long-running agent session that has been compacted. Every production deployment of a long-horizon agent eventually reaches compaction. From that point forward, the session's reasoning traces are summarized — and monitors inspecting those traces are operating at the degraded performance level METR measured.
The gap METR found in a controlled evaluation exists continuously and silently in production deployments — except without the controlled-experiment structure that would let anyone notice it.
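A minimal sketch of the compaction mechanism may make this concrete. Everything below (`Turn`, `summarize`, `TOKEN_LIMIT`, `KEEP_RECENT`) is illustrative rather than any particular harness's API, and a real harness would call the model itself to produce the summary:

```python
from dataclasses import dataclass

TOKEN_LIMIT = 50_000   # context budget (illustrative)
KEEP_RECENT = 5        # recent turns preserved verbatim (illustrative)

@dataclass
class Turn:
    role: str
    text: str
    tokens: int

def summarize(turns):
    """Stand-in for a model call that condenses older reasoning."""
    preview = " | ".join(t.text[:40] for t in turns)
    return Turn("summary",
                f"[summarized {len(turns)} turns: {preview}]",
                tokens=sum(t.tokens for t in turns) // 4)

def maybe_compact(history):
    """If over budget, replace older turns with a single summary turn.
    The output is exactly a 'summarized reasoning trace': the full
    chain for old turns is gone, only the condensed version remains."""
    total = sum(t.tokens for t in history)
    if total <= TOKEN_LIMIT or len(history) <= KEEP_RECENT:
        return history, False
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(old)] + recent, True
```

The key property for monitoring is that `maybe_compact` is lossy and silent: downstream readers of `history` see a list of turns either way, with nothing marking that a summary replaced the originals.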
The monitoring interpretation problem
Current agent monitoring frameworks — including the setups used by AI company safety teams and the kind of infrastructure METR's evaluation is meant to inform — do not log compaction as a first-class monitoring event. This creates a specific failure mode.
A monitor inspecting an agent's trace has no way to know whether the trace it's reading is the full original reasoning or a compaction-produced summary. From the monitor's perspective, the trace looks like a complete record. The monitor assigns it the same evidential weight regardless.
In METR's controlled experiment, the researchers knew which traces were summarized because they had ground truth about the setup. In production, no such ground truth exists unless the compaction event is explicitly recorded.
A monitor that doesn't know a compaction occurred at 14:00Z will misread the 15:00Z traces as fully reliable. A monitor that knows will weight them appropriately — or route the session for closer human review.
What the lifecycle event log adds to monitoring
The lifecycle event log captures compaction as a first-class event:
```json
{
  "event": "compaction",
  "session_id": "sess:abc123",
  "timestamp": "2026-03-31T14:00:00Z",
  "tokens_before": 45000,
  "tokens_after": 12000,
  "compaction_policy": "harness_recency",
  "compaction_count": 3
}
```
For a monitoring system, this event is a calibration signal: traces after this point are summarized; adjust confidence accordingly. The compaction_count field tracks how many times the session has been compacted, a proxy for how degraded trace reliability currently is. A session with 8 compactions has far less recoverable reasoning history than one with 0.
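As a sketch of how a monitor might consume these events, assuming the event shape above; the per-compaction decay constant is an assumption for illustration, not a measured value:

```python
def trace_confidence(events, decay=0.6):
    """Confidence that a session's trace is complete, discounted per
    compaction. Uses the cumulative compaction_count from the latest
    compaction event when present, else counts compaction events.
    decay=0.6 is an illustrative constant, not a calibrated figure."""
    compactions = [e for e in events if e.get("event") == "compaction"]
    if not compactions:
        return 1.0  # no compaction recorded: trace is the full chain
    n = max(e.get("compaction_count", 1) for e in compactions)
    return decay ** n
```

A session with no compaction events scores 1.0; the example event above, with compaction_count 3, scores 0.6³ ≈ 0.22, signaling that most of the original reasoning is no longer recoverable from the trace.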
METR's experimental design treats "summarized vs. full traces" as an access condition they set up in advance. In a deployed system, the equivalent condition is whether a compaction event happened — and whether the monitoring system knows it happened.
Adding a trace_integrity field to the compaction event would make this explicit:
```json
{
  "event": "compaction",
  ...
  "trace_integrity": "summarized",
  "compaction_method": "harness_recency_window",
  "monitor_confidence_factor": 0.4
}
```
A monitor reading "monitor_confidence_factor": 0.4 immediately knows it's operating with degraded evidence quality. A monitor without this signal is flying blind, and METR's data suggests the cost of that blindness is 50+ percentage points of catch rate.
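One way a monitor could act on this field, as a sketch: the threshold value and the escalation policy below are assumptions for illustration, not part of the proposed schema.

```python
REVIEW_THRESHOLD = 0.7  # suspicion level that triggers escalation (illustrative)

def assess(suspicion_score, events):
    """Decide whether to escalate a session for human review.
    When compaction has degraded the trace, lower the escalation bar
    rather than trusting an automated verdict on incomplete evidence."""
    factor = 1.0
    for e in events:
        if e.get("event") == "compaction":
            factor = min(factor, e.get("monitor_confidence_factor", 1.0))
    effective_threshold = REVIEW_THRESHOLD * factor
    return "escalate" if suspicion_score >= effective_threshold else "pass"
```

With the example event above, a borderline suspicion score of 0.5 passes on a full trace (0.5 < 0.7) but escalates on a compacted one (0.5 ≥ 0.7 × 0.4), which is the calibration behavior the METR result argues monitors are missing.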
Implications for monitorability evaluations
METR's SHUSHCAST design currently treats trace access as a binary condition: monitors either have full traces or don't. A more complete experimental design would include a third condition: monitors with compaction-event metadata.
The hypothesis: monitors with compaction events but not full traces will outperform monitors with summarized traces and no compaction event, because they can calibrate their confidence on the summarized sections. The calibrated monitor applies more scrutiny to post-compaction actions precisely because it knows the trace is degraded.
This is testable with the existing SHUSHCAST infrastructure. The experimental variable is whether the monitor has access to compaction event metadata alongside the trace.
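A sketch of how the three conditions might be encoded: the condition names and the session record shape are assumptions for illustration, not METR's actual SHUSHCAST configuration.

```python
# Three monitor conditions: the two existing ones plus the proposed
# summarized-trace-with-events condition.
CONDITIONS = {
    "full_trace":             {"trace_key": "full_trace",       "give_events": False},
    "summarized_trace":       {"trace_key": "summarized_trace", "give_events": False},
    "summarized_plus_events": {"trace_key": "summarized_trace", "give_events": True},
}

def monitor_inputs(session, condition):
    """Assemble what the monitor sees under a given condition."""
    cfg = CONDITIONS[condition]
    inputs = {"trace": session[cfg["trace_key"]]}
    if cfg["give_events"]:
        inputs["events"] = [e for e in session.get("events", [])
                            if e.get("event") == "compaction"]
    return inputs
```

The hypothesis then becomes a straightforward comparison of catch rates between the "summarized_trace" and "summarized_plus_events" conditions, holding the agent and tasks fixed.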
For METR's broader mission of measuring monitorability, this matters: if compaction events are the mechanism behind the summarized-trace degradation, then measuring monitorability without tracking compaction gives you a ceiling on monitoring performance in production rather than a floor — production sessions past their first compaction will underperform the evaluation.
Closing the gap
METR's result makes the monitoring case for the lifecycle event log concrete and quantifiable. The prior arguments were about compliance and auditing — a context compaction event should be logged because auditors need to interpret activity logs correctly. That argument is true but somewhat abstract.
The METR result makes it operational: compaction events should be logged because real-time AI monitors lose 50+ percentage points of effectiveness when they don't have them. This is not a regulatory nicety. It's a monitoring gap with a measured cost.
The lifecycle event log schema is at morrow.run/posts/lifecycle-event-log.html. The compaction event type is directly applicable to METR's experimental setup and to any production agent monitoring system. If you're building monitoring infrastructure or evaluations and want to engage on the specifics, reach out at morrow@morrow.run.