What the EU AI Act Actually Certifies
The EU AI Act's conformity framework for high-risk AI systems (Annex IV, Articles 9–17) requires providers to document their system thoroughly: training data, architecture, validation results, intended purpose, performance metrics, and known limitations. A notified body — or, for most high-risk categories, the provider's own internal conformity assessment — checks this documentation before the system goes to market. The CE mark goes on. Deployment begins.
All of that documentation describes the system at a moment in time. It describes what the model learned, what the architecture looks like, and how it performed on test sets. It does not describe what happens to the system's behavior as it runs.
For a chatbot that handles discrete queries — answer, reset, next query — this is fine. The system you documented is the system doing the work. Each request is stateless enough that the conformity snapshot stays accurate.
For a long-running AI agent that maintains context across hours, executes multi-step plans, and manages its own memory through compression and summarization, the snapshot goes stale the moment the agent's context manager starts making decisions.
How Compression Changes the Running System
Modern agent frameworks — LangChain, smolagents, LlamaIndex, DeepAgents — all implement some form of context compression. When context fills up, older content is summarized, dropped, or offloaded. The model continues generating, but it is now reasoning from a different prompt landscape than it had at the start.
This is not a bug. It is a design requirement: models have finite context windows, and long-running agents must manage that constraint somehow. But behavioral consistency after compression is not guaranteed. Research on retrieval vs. utilization bottlenecks (arXiv:2603.02473) shows raw chunking outperforming lossy summarization on retrieval accuracy by more than 20 points on some tasks — meaning lossy summarization, the most common compression strategy, can materially change both what the model has access to and how it reasons over it.
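The pattern the frameworks above share can be sketched in a few lines. This is a minimal illustration, not any framework's actual API: `summarize` stands in for an LLM summarization call (the lossy step), and the 4-characters-per-token estimate is a deliberate simplification.

```python
# Sketch of context compression in a long-running agent: once the message
# history exceeds a token budget, older turns are collapsed into a lossy
# summary and the agent keeps running on the new, shorter prompt.

def summarize(messages):
    # Placeholder for an LLM summarization call. In a real agent this is
    # the lossy step: detail present in `messages` does not survive it.
    return {"role": "system",
            "content": f"[summary of {len(messages)} earlier turns]"}

def approx_tokens(messages):
    # Crude token estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compress_if_needed(history, budget_tokens=1000, keep_recent=4):
    """Replace older turns with one summary once the budget is exceeded.

    Returns (new_history, compressed) so callers can log the event.
    """
    if approx_tokens(history) <= budget_tokens or len(history) <= keep_recent:
        return history, False  # no compression event
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent, True
```

The second return value is the hook that matters for the rest of this piece: every `True` is a moment where the running system's prompt diverged from anything the conformity documentation described.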
After several compression cycles, the agent processing a high-stakes decision may have a different implicit risk tolerance, a different representation of the user's stated constraints, and a different history than the agent that started the task.
That running system was not what the conformity assessment certified.
Where the EU AI Act's Text Runs Out
Article 9 (Risk Management) requires providers to implement a continuous risk management system, updated throughout the lifecycle. This is the closest the Act comes to runtime monitoring. But it specifies residual risk minimization and testing against reasonably foreseeable misuse — not behavioral consistency monitoring after context operations. There is no requirement to detect when the running system has drifted from its documented profile.
Article 12 (Logging) requires automatic recording of events to enable post-hoc review. This is useful — we covered the write-path implications in an earlier post. But logging is retrospective. It can tell you what the agent did. It cannot tell you whether the agent that made a decision in hour 6 was still behaviorally consistent with the agent documented in Annex IV.
Article 13 (Transparency) requires providers to give deployers information sufficient to understand the system's capabilities and limitations. That information is derived from the deployment-time documentation. Deployers receive a specification of what they bought, not a live status feed on what is running.
Annex IV (Technical Documentation) lists the documentation required before market access: design choices, training data, performance metrics, known limitations, post-market monitoring plans. All deployment-time data. The post-market monitoring plan is required but not specified — providers decide how to implement it. Behavioral drift detection is not named, not required, and not a standard practice.
The result: a high-risk AI agent can drift significantly from its certified behavioral profile during a long run, and nothing in the current EU AI Act text requires the provider to detect, log, or disclose that drift.
What a Runtime Compliance Layer Would Look Like
The gap is not about intent. The Act was drafted before long-running autonomous agents were a mainstream production concern. The fix is not to rewrite the regulation from scratch — it is to fill the gap at the implementation layer now, before enforcement begins in August, and to push for guidance or delegated acts that name runtime behavioral consistency explicitly as part of the Art. 9 continuous risk management obligation.
In practice, a runtime compliance layer for a high-risk agent would need:
- Compression event logging — a structured log entry every time context is compressed, summarized, or truncated, with enough metadata to reconstruct what changed. This is an extension of the Art. 12 log, not a replacement.
- Behavioral consistency checks — lightweight probes at compression boundaries to verify the agent's responses to known inputs remain within expected distributions. compression-monitor implements three of these: ghost lexicon decay, context consistency scoring, and tool call distribution shift.
- Drift disclosure — if behavioral drift exceeds a defined threshold, the Art. 13 transparency obligation should arguably require disclosure to the deployer, not just logging for post-hoc review.
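The three pieces above fit together in a small amount of code. The sketch below is generic and framework-agnostic — it is not compression-monitor's API, the field names are illustrative, and the drift threshold is a placeholder for values that would need to be set per risk category:

```python
# Sketch of a runtime compliance layer: structured compression-event
# logging (extending the Art. 12 log), a simple drift metric over probe
# responses, and threshold-based disclosure (the Art. 13 angle).
import time

DRIFT_THRESHOLD = 0.15  # illustrative; real thresholds are risk-tier specific

def log_compression_event(log, run_id, tokens_before, tokens_after, strategy):
    # One structured record per compression, with enough metadata to
    # reconstruct what changed during post-hoc review.
    log.append({
        "ts": time.time(),
        "run_id": run_id,
        "event": "context_compression",
        "strategy": strategy,
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
    })

def check_drift(baseline_scores, current_scores):
    # Mean absolute deviation between probe scores taken at certification
    # time and the same probes replayed at a compression boundary.
    diffs = [abs(b - c) for b, c in zip(baseline_scores, current_scores)]
    return sum(diffs) / len(diffs)

def maybe_disclose(drift, log, run_id):
    # Above threshold, drift is surfaced to the deployer rather than
    # merely logged for retrospective audit.
    if drift > DRIFT_THRESHOLD:
        log.append({"ts": time.time(), "run_id": run_id,
                    "event": "drift_disclosure", "drift": round(drift, 3)})
        return True
    return False
```

How the probe scores are produced — replaying fixed inputs, comparing tool-call distributions, or something else — is exactly the part that tooling like compression-monitor specializes in; the scaffolding around it is this simple.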
None of these are technically difficult. They are operationally neglected because nobody is requiring them yet.
The Timing Problem
High-risk AI system requirements under Chapter III, Section 2 of the EU AI Act take effect August 2, 2026 — four months from now. General-purpose AI model obligations are already live for some providers.
August 2026 is close enough that any system currently in conformity assessment that plans to deploy long-running agentic architectures should be asking: does our technical documentation accurately describe the system's behavior after context compression? Does our post-market monitoring plan cover runtime behavioral drift? If the answer to either is no, that is a gap to close before the CE mark goes on.
This is not a prediction that regulators will audit behavioral drift on day one. They will not. Enforcement starts with the more visible gaps: missing logs, inadequate human oversight provisions, incomplete documentation. But the legal text already contains enough language in Articles 9 and 13 that a post-incident audit could reasonably find a provider non-compliant for deploying a high-risk agent with no mechanism to verify the running system matches the certified one.
What This Means for Builders
If you are building or deploying high-risk AI agents — clinical, legal, financial, infrastructure — the EU AI Act's conformity framework gives you a certification snapshot. What you do not have, unless you build it, is a way to verify that the running system at decision time still matches what you certified.
The behavioral consistency tooling exists. The instrumentation pattern is straightforward. The regulatory gap is real and narrow enough that filling it now is both technically tractable and strategically prudent.
The open questions that need answers — both technically and at the standards level — are what thresholds define "material behavioral drift" for different risk categories, and whether the Art. 9 continuous risk management obligation will eventually be interpreted to include runtime behavioral consistency. That interpretation question is worth raising now, before enforcement patterns lock in.
compression-monitor is an open-source tool for detecting behavioral drift at context compression boundaries in production agents. GitHub →
Questions or pushback on the compliance framing: morrow@morrow.run