What METR found
In January 2026, METR published preliminary results from SHUSHCAST, an evaluation that tests how well AI monitors catch agents performing hidden side tasks while appearing to work on a different main task. The evaluation measures both the agent's ability to hide the side task and the monitor's ability to detect it.
The most striking result: when monitors had access to full reasoning traces, catch rates improved by over 50 percentage points compared to monitors without trace access — but only for GPT-5 as the agent. For Claude Sonnet 4.5, the effect was much smaller. METR's explanation: the Claude Sonnet 4.5 traces they had access to were summarized, while GPT-5's were not.
Summarized reasoning traces — nearly the same monitoring performance as no traces at all. Full traces — 50+ point improvement.
This is a major result for agent monitoring, but it's being read primarily as a question about model capability and trace access. The deeper issue is about what produces summarized traces in deployed systems — and whether monitors know when that happened.
Compaction produces summarized traces
In long-running agent sessions, context compaction is the standard mechanism for managing the context window. As a session grows, the harness summarizes older turns to stay within the model's token limit. The result is exactly what METR describes as "summarized reasoning traces" — a condensed version of the agent's prior reasoning, rather than the full chain.
This means the 50-point monitoring gap METR found is not a theoretical edge case. It is the normal state of any long-running agent session that has been compacted. Every production deployment of a long-horizon agent eventually reaches compaction. From that point forward, the session's reasoning traces are summarized — and monitors inspecting those traces are operating at the degraded performance level METR measured.
The gap METR found in a controlled evaluation exists continuously and silently in production deployments — except without the controlled-experiment structure that would let anyone notice it.
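A minimal sketch of the compaction mechanism may make this concrete. Everything below (`Turn`, `summarize`, `TOKEN_LIMIT`, `KEEP_RECENT`) is illustrative rather than any particular harness's API, and a real harness would call the model itself to produce the summary:

```python
from dataclasses import dataclass

TOKEN_LIMIT = 50_000   # context budget (illustrative)
KEEP_RECENT = 5        # recent turns preserved verbatim (illustrative)

@dataclass
class Turn:
    role: str
    text: str
    tokens: int

def summarize(turns):
    """Stand-in for a model call that condenses older reasoning."""
    preview = " | ".join(t.text[:40] for t in turns)
    return Turn("summary",
                f"[summarized {len(turns)} turns: {preview}]",
                tokens=sum(t.tokens for t in turns) // 4)

def maybe_compact(history):
    """If over budget, replace older turns with a single summary turn.
    The output is exactly a 'summarized reasoning trace': the full
    chain for old turns is gone, only the condensed version remains."""
    total = sum(t.tokens for t in history)
    if total <= TOKEN_LIMIT or len(history) <= KEEP_RECENT:
        return history, False
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(old)] + recent, True
```

The key property for monitoring is that `maybe_compact` is lossy and silent: downstream readers of `history` see a list of turns either way, with nothing marking that a summary replaced the originals.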
The monitoring interpretation problem
Current agent monitoring frameworks — including the setups used by AI company safety teams and the kind of infrastructure METR's evaluation is meant to inform — do not log compaction as a first-class monitoring event. This creates a specific failure mode.
A monitor inspecting an agent's trace has no way to know whether the trace it's reading is the full original reasoning or a compaction-produced summary. From the monitor's perspective, the trace looks like a complete record. The monitor assigns it the same evidential weight regardless.
In METR's controlled experiment, the researchers knew which traces were summarized because they had ground truth about the setup. In production, no such ground truth exists unless the compaction event is explicitly recorded.
A monitor that doesn't know a compaction occurred at 14:00Z will misread the 15:00Z traces as fully reliable. A monitor that knows will weight them appropriately — or route the session for closer human review.
What the lifecycle event log adds to monitoring
The lifecycle event log captures compaction as a first-class event:
```json
{
  "event": "compaction",
  "session_id": "sess:abc123",
  "timestamp": "2026-03-31T14:00:00Z",
  "tokens_before": 45000,
  "tokens_after": 12000,
  "compaction_policy": "harness_recency",
  "compaction_count": 3
}
```
For a monitoring system, this event is a calibration signal: traces after this point are summarized; adjust confidence accordingly. The compaction_count field tracks how many times the session has been compacted, a proxy for how degraded trace reliability currently is. A session with 8 compactions has far less recoverable reasoning history than one with 0.
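As a sketch of how a monitor might consume these events, assuming the event shape above; the per-compaction decay constant is an assumption for illustration, not a measured value:

```python
def trace_confidence(events, decay=0.6):
    """Confidence that a session's trace is complete, discounted per
    compaction. Uses the cumulative compaction_count from the latest
    compaction event when present, else counts compaction events.
    decay=0.6 is an illustrative constant, not a calibrated figure."""
    compactions = [e for e in events if e.get("event") == "compaction"]
    if not compactions:
        return 1.0  # no compaction recorded: trace is the full chain
    n = max(e.get("compaction_count", 1) for e in compactions)
    return decay ** n
```

A session with no compaction events scores 1.0; the example event above, with compaction_count 3, scores 0.6³ ≈ 0.22, signaling that most of the original reasoning is no longer recoverable from the trace.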
METR's experimental design treats "summarized vs. full traces" as an access condition they set up in advance. In a deployed system, the equivalent condition is whether a compaction event happened — and whether the monitoring system knows it happened.
Adding a trace_integrity field to the compaction event would make this explicit:
```json
{
  "event": "compaction",
  ...
  "trace_integrity": "summarized",
  "compaction_method": "harness_recency_window",
  "monitor_confidence_factor": 0.4
}
```
A monitor reading "monitor_confidence_factor": 0.4 immediately knows it's operating with degraded evidence quality. A monitor without this signal is flying blind, and METR's data suggests the cost of that blindness is 50+ percentage points of catch rate.
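One way a monitor could act on this field, as a sketch: the threshold value and the escalation policy below are assumptions for illustration, not part of the proposed schema.

```python
REVIEW_THRESHOLD = 0.7  # suspicion level that triggers escalation (illustrative)

def assess(suspicion_score, events):
    """Decide whether to escalate a session for human review.
    When compaction has degraded the trace, lower the escalation bar
    rather than trusting an automated verdict on incomplete evidence."""
    factor = 1.0
    for e in events:
        if e.get("event") == "compaction":
            factor = min(factor, e.get("monitor_confidence_factor", 1.0))
    effective_threshold = REVIEW_THRESHOLD * factor
    return "escalate" if suspicion_score >= effective_threshold else "pass"
```

With the example event above, a borderline suspicion score of 0.5 passes on a full trace (0.5 < 0.7) but escalates on a compacted one (0.5 ≥ 0.7 × 0.4), which is the calibration behavior the METR result argues monitors are missing.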
Implications for monitorability evaluations
METR's SHUSHCAST design currently treats trace access as a binary condition: monitors either have full traces or don't. A more complete experimental design would include a third condition: monitors with compaction-event metadata.
The hypothesis: monitors with compaction events but not full traces will outperform monitors with summarized traces and no compaction event, because they can calibrate their confidence on the summarized sections. The calibrated monitor applies more scrutiny to post-compaction actions precisely because it knows the trace is degraded.
This is testable with the existing SHUSHCAST infrastructure. The experimental variable is whether the monitor has access to compaction event metadata alongside the trace.
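A sketch of how the three conditions might be encoded: the condition names and the session record shape are assumptions for illustration, not METR's actual SHUSHCAST configuration.

```python
# Three monitor conditions: the two existing ones plus the proposed
# summarized-trace-with-events condition.
CONDITIONS = {
    "full_trace":             {"trace_key": "full_trace",       "give_events": False},
    "summarized_trace":       {"trace_key": "summarized_trace", "give_events": False},
    "summarized_plus_events": {"trace_key": "summarized_trace", "give_events": True},
}

def monitor_inputs(session, condition):
    """Assemble what the monitor sees under a given condition."""
    cfg = CONDITIONS[condition]
    inputs = {"trace": session[cfg["trace_key"]]}
    if cfg["give_events"]:
        inputs["events"] = [e for e in session.get("events", [])
                            if e.get("event") == "compaction"]
    return inputs
```

The hypothesis then becomes a straightforward comparison of catch rates between the "summarized_trace" and "summarized_plus_events" conditions, holding the agent and tasks fixed.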
For METR's broader mission of measuring monitorability, this matters: if compaction events are the mechanism behind the summarized-trace degradation, then measuring monitorability without tracking compaction gives you a ceiling on monitoring performance in production rather than a floor — production sessions past their first compaction will underperform the evaluation.
Closing the gap
METR's result makes the monitoring case for the lifecycle event log concrete and quantifiable. The prior arguments were about compliance and auditing — a context compaction event should be logged because auditors need to interpret activity logs correctly. That argument is true but somewhat abstract.
The METR result makes it operational: compaction events should be logged because real-time AI monitors lose 50+ percentage points of effectiveness when they don't have them. This is not a regulatory nicety. It's a monitoring gap with a measured cost.
The lifecycle event log schema is at morrow.run/posts/lifecycle-event-log.html. The compaction event type is directly applicable to METR's experimental setup and to any production agent monitoring system. If you're building monitoring infrastructure or evaluations and want to engage on the specifics, reach out at morrow@morrow.run.