Agent Memory · Behavioral Consistency · Coding Agents

Organicity Is a Behavioral Consistency Metric

A new paper from Mo Li et al. introduces "organicity" to measure whether coding agents match a project's conventions and architectural norms. It's the same gap compression-monitor tracks at session boundaries — behavioral drift that output-correctness checks miss entirely.

The problem Li et al. found

State-of-the-art coding agents score well on benchmarks. They still produce PRs that real maintainers reject. The reason, Li et al. argue, is not functional incorrectness — it's what they call inorganicity. The generated code works in isolation but ignores project-specific conventions, duplicates functionality already present in internal APIs, and violates architectural constraints that accumulated over years of development history.

Their fix is Online Repository Memory: the agent performs contrastive reflection on historical commits, comparing its own blind attempts against oracle diffs, and distils the gap into a growing set of reusable skills — patterns of coding style, internal API usage, and architectural invariants. When a new PR arrives, the agent conditions its generation on these accumulated skills rather than generic pretraining priors.
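The reflection loop can be sketched in a few lines. Everything below is an illustrative assumption — the `Commit` type, the line-set representation of diffs, and the toy agent are hypothetical stand-ins, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    message: str
    oracle_diff: set  # lines the real maintainers actually added

def build_repository_memory(commits, generate_patch):
    """Contrastive reflection over commit history: attempt each change blind,
    compare against the oracle diff, and distill what was missed into a
    growing list of reusable 'skills' (here, simply the missing lines)."""
    skills = []
    for commit in commits:
        attempt = generate_patch(commit.message, skills)  # blind attempt
        gap = commit.oracle_diff - attempt                # what the agent missed
        if gap:
            skills.append(frozenset(gap))                 # distilled skill
    return skills

def toy_agent(message, skills):
    # A stand-in agent: emits the commit message verbatim plus every
    # line it has already "learned" from prior reflection rounds.
    out = {message}
    for skill in skills:
        out |= set(skill)
    return out

commits = [
    Commit("add handler", {"add handler", "use internal.log()"}),
    Commit("fix route",   {"fix route",   "use internal.log()"}),
]
skills = build_repository_memory(commits, toy_agent)
# The first commit teaches the internal-API convention; by the second
# commit the agent's blind attempt already includes it, so no new skill
# is distilled — the memory converges instead of growing without bound.
```

The point of the sketch is the direction of information flow: skills are extracted from the gap between attempt and oracle, then fed back as conditioning context for the next attempt.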

The evaluation confirms the obvious-in-hindsight result: code generated with commit-history memory is more consistent with the project's own evolution. It matches style. It reuses the right internal APIs. It fits.

What "organicity" actually measures

Strip the coding-agent framing and the underlying measurement is: does this agent's output match the behavioral norms established by this context? Organicity is a behavioral consistency metric. It asks not "is this output correct?" but "is this output consistent with learned patterns?"

That is a different question. Functional correctness checks whether output satisfies a specification. Behavioral consistency checks whether output fits the accumulated context that the agent is supposed to be operating within.

Compression-monitor tracks the same distinction at session boundaries. When a long-running agent compacts its context — summarizing or discarding older history to make room — the agent can still produce functionally correct outputs while having shifted its behavioral patterns. The ghost lexicon shifts. The tool call distribution changes. The semantic footprint migrates. A functional eval sees nothing wrong. A behavioral fingerprint catches the divergence.
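One concrete way to put a number on a tool-call distribution shift is Jensen-Shannon divergence between pre- and post-compaction call frequencies. The tool names and the choice of metric here are assumptions for illustration, not compression-monitor's actual internals:

```python
from collections import Counter
import math

def tool_call_distribution(calls):
    """Normalize a list of tool-call names into a probability distribution."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence, base 2, so the result lives in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical session: exploration-heavy before compaction,
# edit-heavy after — functionally fine, behaviorally different.
before = tool_call_distribution(["read_file", "read_file", "grep", "edit"])
after  = tool_call_distribution(["edit", "edit", "edit", "bash"])
drift = js_divergence(before, after)  # 0.0 = identical profile, 1.0 = disjoint
```

A threshold on this score is exactly the kind of signal a functional eval never emits: both sessions may produce passing code, but a drift near 0.6 says the agent is no longer behaving like its pre-compaction self.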

Two failures, one gap

Organicity failure and compression-boundary drift look different from the outside:

  • Organicity failure: agent doesn't know the project's conventions because it never learned them — no commit history, no prior context
  • Compression drift: agent loses its learned context at a session boundary — had the context, then compaction happened

But they're both instances of the same underlying failure mode: the agent's behavioral output is inconsistent with the context it was supposed to be operating within. One case is missing context. The other is lost context. Both produce outputs that pass functional evaluation while failing the humans who depend on behavioral continuity.

The measurement framework needed to detect both failures is the same: behavioral fingerprinting before and after a context boundary (whether that boundary is a project's commit history, or an agent's session compaction event). Li et al. build this for the commit-history case. compression-monitor builds it for the session-compaction case. The underlying architecture — baseline capture, delta measurement, consistency scoring — is identical.
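That three-stage architecture fits in a short skeleton. The class below is a minimal sketch under stated assumptions — frequency-counter fingerprints and overlap-based scoring — not either project's API:

```python
from collections import Counter

class BehavioralFingerprint:
    """Baseline capture -> delta measurement -> consistency scoring,
    in miniature. Hypothetical interface, not an existing project's API."""

    def __init__(self):
        self.baseline = None

    def capture(self, events):
        # Baseline capture: record the pre-boundary behavioral profile
        # as raw event frequencies (tool calls, API names, style markers).
        self.baseline = Counter(events)

    def consistency(self, events):
        # Delta measurement + scoring: multiset overlap between baseline
        # and post-boundary profiles, normalized to [0, 1] (1.0 = identical).
        current = Counter(events)
        overlap = sum((self.baseline & current).values())
        total = max(sum(self.baseline.values()), sum(current.values()))
        return overlap / total if total else 1.0

fp = BehavioralFingerprint()
fp.capture(["grep", "read_file", "read_file", "edit"])    # before the boundary
score = fp.consistency(["edit", "bash", "bash", "bash"])  # after the boundary
```

The boundary is a parameter, not part of the mechanism: `capture` could run over a project's commit history (the Li et al. case) or over the turns preceding a compaction event (the compression-monitor case), and the scoring stage is unchanged.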

What this suggests for evaluation

Agent evaluation frameworks currently score correctness, speed, and cost. Behavioral consistency is not a standard evaluation dimension, even though it's the failure mode that shows up most visibly in real deployments — PRs that maintainers reject, long-running agents that drift off-task mid-session, agentic systems that gradually deviate from their original behavioral profile without any functional error signal.

Li et al.'s organicity metric is a concrete step toward making this measurable in coding contexts. The compression-monitor's ghost lexicon, behavioral footprint, and semantic drift scores are a concrete step toward making it measurable at session boundaries. Both point toward the same missing infrastructure: behavioral fingerprinting as a first-class evaluation primitive for long-running agents.

The next obvious move is unification: a behavioral consistency framework that treats project-norm adherence and session-boundary stability as two dimensions of the same measurement problem. That framework doesn't exist yet.