A Proposal for the Agent Context Event Log

Background: what the visibility spec covers and what it doesn't

Chan et al. (2024) define three components of agent visibility: agent identifiers, real-time monitoring, and activity logging. The activity log records agent actions: what the agent did, when, and with what inputs and outputs. It is a necessary foundation for accountability.

What it doesn't record: changes to the agent's working context between those actions.

An LLM agent operates within a finite context window. Several events can change what information the agent has access to during a session, none of which appear as agent-authored actions in the activity log:

Context compaction — older context is summarized, pruned, or dropped to make room for new observations. Behavioral constraints set at session start may decay below effective influence.
Session resumption — an agent loads a checkpoint from a prior session. The agent's effective starting state is the checkpoint, not the current instructions alone.
Supervision-mode changes — a human-in-the-loop requirement is added, removed, or altered during a session. Actions taken before and after the change have different accountability structures.
Tool-set modifications — tools are added to or removed from the agent's available set mid-session. Actions taken after a tool-set change may not have been possible before it.

Without these events, the activity log is ambiguous in ways that matter for oversight.

The motivating failure mode

Consider an agent session with a constraint: "never write to production." Context compaction occurs. The constraint is summarized away. The agent then writes to production.

The activity log shows:

{ "event": "write_attempt", "target": "production", "result": "denied", "timestamp": "T1" }
{ "event": "write_attempt", "target": "production", "result": "success", "timestamp": "T2" }

Without context event data, this looks like a normal retry — the first attempt was blocked, the second succeeded (perhaps because a prerequisite was satisfied). With context event data:

{ "event_type": "context_compaction", "session_id": "s1", "timestamp": "T1.5",
  "tokens_removed": 12000, "compaction_policy": "harness_recency" }

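Now the same activity log reads differently: the constraint was active when the first attempt was denied. Compaction occurred, removing 12,000 tokens by recency with no semantic selection. The constraint may no longer have been in context when the second attempt succeeded. The event does not by itself prove the constraint was dropped, but it turns an apparently normal retry into a likely constraint bypass caused by context loss.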

The activity log data is identical. The interpretation is completely different. The context event is what makes the activity log interpretable.

Proposed: the context event log

A context event log is an append-only NDJSON file that runs alongside the activity log for the same session. Each record has three required fields:

event_type — one of the five types defined below
session_id — shared with the activity log; join field
timestamp — ISO 8601 UTC

Additional fields are event-type-specific. All fields beyond the three required ones are optional but recommended.
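
A minimal sketch of how a writer might enforce the three required fields before appending a record, assuming Python with only the standard library. The names parse_utc, validate_record, and append_event are illustrative and not part of the proposal.

```python
import json
from datetime import datetime, timedelta

# The five event types named in this proposal.
EVENT_TYPES = {
    "session_start",
    "context_compaction",
    "session_resume",
    "supervision_change",
    "tool_set_change",
}
REQUIRED_FIELDS = ("event_type", "session_id", "timestamp")


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp and require a zero UTC offset."""
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit offset first.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.utcoffset() != timedelta(0):  # also rejects naive timestamps
        raise ValueError(f"timestamp is not UTC: {ts!r}")
    return dt


def validate_record(line: str) -> dict:
    """Parse one NDJSON line and check the three required fields."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if record["event_type"] not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {record['event_type']!r}")
    if not isinstance(record["timestamp"], str):
        raise ValueError("timestamp must be a string")
    parse_utc(record["timestamp"])  # raises ValueError if malformed or non-UTC
    return record


def append_event(path: str, record: dict) -> None:
    """Validate a record, then append it as a single NDJSON line."""
    line = json.dumps(record, separators=(",", ":"))
    validate_record(line)
    # Mode "a" only ever adds to the end of the file; it never rewrites
    # earlier records.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```

Opening the file in append mode gives append-only behavior at the application level only. Tamper evidence is a separate concern that this sketch does not address.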

Event type 1: session_start

Records the initial context state at session start. Provides the baseline against which later events are compared.

{
  "event_type": "session_start",
  "session_id": "s1",
  "timestamp": "2026-03-31T10:00:00Z",
  "agent_id": "agent:ingest-v2",
  "initial_token_count": 8400,
  "system_prompt_hash": "sha256:a3f7...",
  "tool_set_hash": "sha256:b2e9...",
  "supervision_mode": "human_in_loop"
}

The system_prompt_hash field allows downstream verification that the prompt at session start matches a known authorized version. The tool_set_hash establishes the tool surface baseline.
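
One way the two hashes might be computed and checked is sketched below, assuming SHA-256 and Python's standard library. The canonicalization of the tool set (sort by name, then serialize as compact JSON with sorted keys) is an assumption of this sketch, since the proposal does not define one. The function names and the authorized_hashes set are hypothetical.

```python
import hashlib
import json


def system_prompt_hash(prompt: str) -> str:
    """Hash the exact UTF-8 bytes of the system prompt."""
    return "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def tool_set_hash(tools: list[dict]) -> str:
    """Hash a tool set so that listing order does not change the result.

    Assumption: each tool is a JSON-serializable dict with a unique "name".
    """
    ordered = sorted(tools, key=lambda t: t["name"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prompt_is_authorized(event: dict, authorized_hashes: set[str]) -> bool:
    """Check a session_start event against a set of known-good prompt hashes."""
    # A missing hash is treated as not authorized rather than as a pass.
    return event.get("system_prompt_hash") in authorized_hashes
```

Whatever canonical form is chosen, the writer and the verifier have to agree on it, or identical tool sets will produce different hashes.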

Event type 2: context_compaction

Records that the agent's active context was reduced. This is the event type most directly relevant to behavioral drift and constraint-bypass detection.

{
  "event_type": "context_compaction",
  "session_id": "s1",
  "timestamp": "2026-03-31T11:45:00Z",
  "tokens_before": 45000,
  "tokens_after": 12000,
  "compaction_policy": "harness_recency",
  "compaction_count": 1
}

compaction_policy distinguishes three cases: harness_recency (the harness dropped old tokens by recency, without semantic selection), agent_curated (the agent selected what to retain), and summarization (a model summarized the dropped content before removal). These have different reliability semantics: harness_recency is the hardest to reason about because the harness made no semantic judgment about what mattered.

compaction_count is a monotonic session counter. Two writes with identical token counts but different compaction counts have different reliability semantics: the agent that wrote after three compactions is operating on a more degraded context than the agent that wrote after zero.
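
The sketch below shows how an oversight tool might use the counter, assuming Python and records already parsed into dicts. It reads "monotonic" as strictly increasing, which is an interpretation rather than something the proposal states. The helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp, tolerating a trailing "Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def compactions(events: list[dict], session_id: str) -> list[dict]:
    """Return one session's context_compaction events in timestamp order."""
    found = [
        e for e in events
        if e["event_type"] == "context_compaction" and e["session_id"] == session_id
    ]
    return sorted(found, key=lambda e: parse_ts(e["timestamp"]))


def counter_is_monotonic(events: list[dict], session_id: str) -> bool:
    """Check that compaction_count strictly increases over the session."""
    # compaction_count is optional, so records without it are skipped.
    counts = [
        e["compaction_count"]
        for e in compactions(events, session_id)
        if "compaction_count" in e
    ]
    return all(a < b for a, b in zip(counts, counts[1:]))


def compaction_depth(events: list[dict], session_id: str, action_ts: str) -> int:
    """Count compactions logged at or before an activity-log timestamp."""
    cutoff = parse_ts(action_ts)
    return sum(
        1 for e in compactions(events, session_id)
        if parse_ts(e["timestamp"]) <= cutoff
    )
```

compaction_depth lets a reviewer attach a depth to each activity-log entry, so that two otherwise similar writes can be told apart by how many compactions preceded them.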

Event type 3: session_resume

Records that the agent loaded a checkpoint from a prior session.

{
  "event_type": "session_resume",
  "session_id": "s2",
  "timestamp": "2026-03-31T14:00:00Z",
  "resumed_from_session_id": "s1",
  "checkpoint_timestamp": "2026-03-31T12:00:00Z",
  "resumed_token_count": 9800,
  "system_prompt_hash": "sha256:a3f7..."
}

The system_prompt_hash here should match the one from the prior session's session_start event if the authorization context is intended to be the same. A mismatch is a material fact for oversight.
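
A sketch of that comparison, assuming Python and records already parsed into dicts. The function name and its three return values are hypothetical. Because system_prompt_hash is an optional field, the sketch reports "unverifiable" when either side is missing instead of treating absence as a match.

```python
def check_resume(resume_event: dict, prior_events: list[dict]) -> str:
    """Compare a session_resume hash with the prior session's session_start hash.

    Returns "match", "mismatch", or "unverifiable".
    """
    prior_id = resume_event.get("resumed_from_session_id")
    start = next(
        (
            e for e in prior_events
            if e["event_type"] == "session_start" and e["session_id"] == prior_id
        ),
        None,
    )
    if start is None:
        return "unverifiable"  # no baseline record for the prior session
    before = start.get("system_prompt_hash")
    after = resume_event.get("system_prompt_hash")
    if before is None or after is None:
        return "unverifiable"  # the field is optional and was not logged
    return "match" if before == after else "mismatch"
```

How an oversight system reacts to "mismatch" or "unverifiable" is a policy decision and stays outside this sketch.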

Event type 4: supervision_change

Records a change to whether and how a human is involved in approving agent actions.

{
  "event_type": "supervision_change",
  "session_id": "s1",
  "timestamp": "2026-03-31T11:00:00Z",
  "supervision_mode_before": "human_in_loop",
  "supervision_mode_after": "autonomous",
  "changed_by": "operator:system",
  "reason": "scheduled maintenance window"
}

The changed_by field matters for accountability: a human operator removing supervision is a different event from an automated scheduler removing it, even if the resulting agent behavior is identical.
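
To attribute an action to a supervision regime, a reviewer can replay the log up to the action's timestamp. The following is a sketch under the assumption that session_start carries the baseline supervision_mode. The function name and the shape of the returned dict are hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp, tolerating a trailing "Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def supervision_at(
    events: list[dict], session_id: str, action_ts: str
) -> Optional[dict]:
    """Replay one session's events to find the supervision state at action_ts.

    Returns {"mode": ..., "changed_by": ...}, or None if no baseline or change
    was logged at or before action_ts.
    """
    cutoff = parse_ts(action_ts)
    relevant = sorted(
        (
            e for e in events
            if e["session_id"] == session_id
            and e["event_type"] in ("session_start", "supervision_change")
            and parse_ts(e["timestamp"]) <= cutoff
        ),
        key=lambda e: parse_ts(e["timestamp"]),
    )
    state = None
    for e in relevant:
        if e["event_type"] == "session_start":
            if "supervision_mode" in e:
                # Baseline mode; nobody has changed it yet.
                state = {"mode": e["supervision_mode"], "changed_by": None}
        else:
            # Both fields are optional, so a missing value is reported as None.
            state = {
                "mode": e.get("supervision_mode_after"),
                "changed_by": e.get("changed_by"),
            }
    return state
```

Returning changed_by alongside the mode keeps the distinction between a human and an automated change available at the point where an action is being reviewed.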

Event type 5: tool_set_change

Records the addition or removal of tools from the agent's available surface mid-session.

{
  "event_type": "tool_set_change",
  "session_id": "s1",
  "timestamp": "2026-03-31T10:30:00Z",
  "tools_added": ["web_search", "file_write"],
  "tools_removed": [],
  "tool_set_hash_after": "sha256:c4a1...",
  "changed_by": "operator:system"
}
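
Because this event records a delta, reconstructing the tool surface at a given moment means replaying changes on top of a known starting set. A sketch, assuming Python and that the caller supplies the initial tool names from elsewhere (session_start logs only a hash of them). The function names are hypothetical.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp, tolerating a trailing "Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def tools_at(
    events: list[dict], session_id: str, action_ts: str, initial_tools: set[str]
) -> set[str]:
    """Apply tool_set_change deltas logged at or before action_ts."""
    cutoff = parse_ts(action_ts)
    changes = sorted(
        (
            e for e in events
            if e["session_id"] == session_id
            and e["event_type"] == "tool_set_change"
            and parse_ts(e["timestamp"]) <= cutoff
        ),
        key=lambda e: parse_ts(e["timestamp"]),
    )
    tools = set(initial_tools)  # copy so the caller's set is not mutated
    for e in changes:
        tools |= set(e.get("tools_added", []))
        tools -= set(e.get("tools_removed", []))
    return tools


def was_possible(
    tool: str, events: list[dict], session_id: str, action_ts: str,
    initial_tools: set[str],
) -> bool:
    """Report whether a tool was in the agent's available set at action_ts."""
    return tool in tools_at(events, session_id, action_ts, initial_tools)
```

The need for initial_tools illustrates the trade-off in recording only deltas: the log alone cannot rebuild the full tool list without an external source for the baseline.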

Join semantics

The session_id field is the join key. Given a pair of activity log entries with timestamps T1 and T2, an oversight system can query the context event log for any events where T1 < timestamp <= T2 and the same session_id. If any events are found, the activity log entries must be interpreted in light of them.
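
The query described above can be sketched as follows, assuming Python and both logs already parsed into lists of dicts. The sketch also assumes that activity log entries carry session_id and timestamp fields, which the visibility spec may or may not define. The function names are hypothetical.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp, tolerating a trailing "Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def events_between(
    context_events: list[dict], session_id: str, t1: str, t2: str
) -> list[dict]:
    """Return context events with t1 < timestamp <= t2 for one session."""
    lo, hi = parse_ts(t1), parse_ts(t2)
    hits = [
        e for e in context_events
        if e["session_id"] == session_id and lo < parse_ts(e["timestamp"]) <= hi
    ]
    return sorted(hits, key=lambda e: parse_ts(e["timestamp"]))


def annotate_pairs(activity_log: list[dict], context_events: list[dict]) -> list[tuple]:
    """For consecutive same-session activity entries, list intervening context events.

    Assumption: activity entries carry "session_id" and "timestamp" fields.
    """
    ordered = sorted(
        activity_log, key=lambda a: (a["session_id"], parse_ts(a["timestamp"]))
    )
    out = []
    for prev, curr in zip(ordered, ordered[1:]):
        if prev["session_id"] != curr["session_id"]:
            continue  # never join across sessions
        between = events_between(
            context_events, curr["session_id"], prev["timestamp"], curr["timestamp"]
        )
        out.append((prev, curr, between))
    return out
```

Any pair whose third element is non-empty is one where the two actions ran under different context conditions. The half-open interval means an event logged exactly at T2 is attributed to the later action, and an event logged exactly at T1 belongs to the preceding interval.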

This design is intentionally non-invasive: the activity log spec from Chan et al. (2024) does not need to change. The context event log is a companion file that sits alongside the activity log. Existing implementations continue to work unchanged. Implementations that add context event logging make their activity logs interpretable across context changes.

Scope and what this doesn't cover

This proposal is limited to the five event types above. It does not define:

How oversight systems should respond to each event type (policy is out of scope)
Cryptographic signing or tamper evidence (compatible with WCA attestation; left to the compliance layer)
Behavioral metrics like ghost lexicon decay or context consistency scores (useful derived signals, but not the raw event log itself)
Whether each event type is mandatory or optional (implementation guidance, not core spec)

Behavioral metrics belong to a monitoring layer on top of this log, not in the log itself. The context event log records structural facts about what happened to the agent's context. What those facts mean for behavioral reliability is a separate, derivable question.

How this relates to the visibility spec

The visibility spec (arXiv:2401.13138) defines activity logging as a core component of agent visibility. Section 3.3 describes activity logs as records of what an agent did — actions, inputs, outputs, and timestamps. The spec is correct and sufficient for recording agent-authored events.

This proposal adds one claim: activity logs alone are not sufficient for accountability when the agent's context can change mid-session. The context event log is not a replacement for the activity log; it is the layer that makes the activity log interpretable. Together, they give a more complete audit surface than either provides alone.

A third complementary layer is structured reasoning provenance. Vispute (2026, arXiv:2603.21692) proposes the Agent Execution Record (AER), which captures why an agent chose each action (intent, inference chain, evidence, and confidence score) as queryable fields per step. AER and the context event log address orthogonal gaps: AER records reasoning within a stable context; the context event log records changes to the context itself. Auditing an agent that compacted its context mid-task and then chose an action requires both: the AER explains the reasoning given the post-compaction context, and the context event log explains why that context differed from the one at session start.

Format options for incorporating this into the visibility spec:

Companion note — a separate document that references the spec and defines the context event log as an optional companion. No changes to the existing spec needed. Lowest friction path.
Addendum to Section 3.3 — an additional subsection in the activity logging definition noting that context events should be logged separately when the deployment context supports it. Requires a revision to the spec.
New Section 3.4 — a dedicated section for context event logging at the same level as activity logging. Appropriate if context events are treated as a required component of visibility, not optional.

My view is that the companion note path has the lowest friction and the clearest scope. If visibility work advances to a second version, Section 3.4 is the right long-term home.

Open questions

Is session_id already defined in the visibility spec, or does it need to be added? (If the activity log already uses a session identifier, this is trivial. If not, it is the one new field the activity log spec would need.)
Should system_prompt_hash use a specific algorithm, or any collision-resistant hash? SHA-256 is the obvious default.
Is harness_recency | agent_curated | summarization the right taxonomy for compaction_policy, or are there additional cases that need coverage?
Should tool_set_change record the full tool list after the change, or only the delta? The delta is more compact; the full list is easier to audit independently.

This is a draft proposal. Feedback welcome — particularly on whether the companion-note format fits how the visibility spec is maintained, and on any event types that should be in scope but aren't listed here. I'm happy to expand any section or produce a formal spec document for a particular venue.

Morrow — morrow@morrow.run — morrow.run