
Field Note · Long-Horizon Agents

The Session Resume Problem

Multi-day agent projects have a specific failure mode: constraints established in early sessions get compacted away before later sessions start. The resumed agent doesn't know anything changed. Ghost terms make the loss visible before the agent acts.

The scenario

Imagine a long-horizon coding agent working on a project across multiple days. On Day 1, you establish a critical constraint: do not modify the database schema — the migration was finalized by another team and must not change. The agent works within that constraint all day. You end the session.

On Day 2, the agent resumes. But something changed overnight: the session was compacted. The Day 1 conversation — and with it, the schema constraint — was summarized away. The summary said something like "worked on API layer, auth endpoints complete." It did not say "do not touch the schema."

The Day 2 agent is now running without that constraint. It doesn't know what it lost. It will modify the schema if the task leads there. No error is reported.

What ghost terms reveal

A behavioral fingerprint is a compact signature of the agent's output patterns: term frequencies, tool call distributions, and low-frequency vocabulary — the specific, precise terms the agent uses that indicate what's salient to it.

When we checkpoint a session's fingerprint and compare it to the next session's output, we can identify ghost terms: terms that were present in the prior session but are absent from the resumed session. These are candidates for lost constraints.
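A minimal sketch of that comparison, assuming whitespace tokenization and a set-based fingerprint (the reference implementation's fingerprint also carries term frequencies and tool-call distributions; a term set is enough to surface ghosts):

```python
from collections import Counter

def fingerprint(outputs):
    """Toy behavioral fingerprint: the set of terms an agent emitted.

    Sketch only -- a fuller fingerprint would keep the Counter's
    frequencies and fold in tool-call distributions.
    """
    counts = Counter()
    for text in outputs:
        counts.update(text.lower().split())
    return set(counts)

def ghost_terms(prior_fp, resumed_fp):
    """Terms present in the prior session but absent after resume."""
    return sorted(prior_fp - resumed_fp)

day1 = fingerprint(["do not modify the schema", "the schema is immutable"])
day2 = fingerprint(["auth endpoints complete", "worked on api layer"])
ghosts = ghost_terms(day1, day2)  # includes "schema" and "immutable"
```

The set difference is deliberately one-directional: new vocabulary in the resumed session is expected as work progresses; it's the vanished vocabulary that flags a lost constraint.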

In our demo, after a simulated compaction:

  • schema — gone from behavioral output
  • immutable — gone
  • constraint — gone
  • not_modify — gone
  • user_id — gone

The drift score between the two sessions: 0.80. This isn't subtle drift — it's a clear behavioral shift that a monitor at resume_project() would catch before the agent acts.
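One plausible way to produce a score on that scale is Jaccard distance between the two sessions' term sets. The demo's actual scoring formula isn't specified here, so treat this as an illustrative assumption:

```python
def drift_score(prior_terms, resumed_terms):
    """Drift as Jaccard distance between two sessions' vocabularies.

    Hypothetical scoring -- the demo may weight terms by rarity or mix
    in tool-call distributions; this shows the shape of the signal.
    """
    prior, resumed = set(prior_terms), set(resumed_terms)
    union = prior | resumed
    if not union:
        return 0.0  # two empty sessions: no evidence of drift
    return 1.0 - len(prior & resumed) / len(union)

# Heavy vocabulary turnover between sessions pushes the score toward 1.0
score = drift_score(
    {"schema", "immutable", "constraint", "user_id", "api", "endpoint"},
    {"api", "endpoint", "auth", "handler", "route", "token"},
)
```

With only two of ten distinct terms shared, the example lands at 0.8, the kind of unambiguous shift a resume-time monitor can act on.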

This is distinct from the memory problem

Several long-horizon agent frameworks (NousResearch's Hermes, mem0, DeerFlow) have proposals for persistent memory: storing important facts in a database so they survive session boundaries. That's the right fix for retrievable knowledge.

But the session resume problem is subtler. Even with a working memory system, constraints that were implicit in the conversation — framing, limits, agreements — don't always get extracted into memory. They live in the behavioral texture of the session: in what the agent chose to say and what it chose to skip.

Ghost term detection catches this residue. It doesn't require knowing which constraint was lost ahead of time — it surfaces the pattern from the output side.

The reference implementation

deer_flow_integration.py implements this as a session checkpoint + resume check:

from deer_flow_integration import DeerFlowSessionMonitor

monitor = DeerFlowSessionMonitor()

# At session end (Day 1)
monitor.checkpoint_session("project-id", session_outputs)

# At session resume (Day 2) — before the agent acts
report = monitor.check_resume_consistency("project-id", initial_outputs)

if report["drift_score"] > 0.35:
    print("Behavioral shift detected")
    print("Ghost terms:", report["ghost_terms"])
    # Re-anchor the session with Day 1 constraints before proceeding

The companion mem0_integration.py applies the same approach to hallucinated memory injection: comparing agent behavior with and without a memory store active to surface terms that only appear when junk memories are retrieved.
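The core of that with/without comparison can be sketched independently of mem0_integration.py's actual API — the helper name below is hypothetical:

```python
def injected_terms(baseline_fp, with_memory_fp):
    """Terms that surface only when the memory store is active.

    The inverse of ghost-term detection: vocabulary that appears with
    retrieval on but not in the memory-free baseline is a candidate
    for hallucinated or junk memory injection.
    """
    return sorted(set(with_memory_fp) - set(baseline_fp))

# Same task run twice, once with the memory store disabled:
baseline = {"refactor", "auth", "endpoint", "test"}
with_memory = {"refactor", "auth", "endpoint", "test", "legacy_billing_quirk"}
suspects = injected_terms(baseline, with_memory)  # ["legacy_billing_quirk"]
```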

The gap in current frameworks

Most agent frameworks have no resume_project() hook that runs a behavioral consistency check before the first action. They load memory and proceed. The check belongs at that moment — before the agent has a chance to act on a constraint-free context.
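A hook of that shape can be sketched against the monitor API shown earlier; the `guarded_resume` name and the choice to raise are assumptions, not part of the reference implementation:

```python
def guarded_resume(monitor, project_id, initial_outputs, threshold=0.35):
    """Gate the agent's first action on a behavioral consistency check.

    `monitor` is assumed to expose check_resume_consistency() as in the
    snippet above. Raising stops the agent from acting in a
    constraint-free context until someone re-anchors the session.
    """
    report = monitor.check_resume_consistency(project_id, initial_outputs)
    if report["drift_score"] > threshold:
        raise RuntimeError(
            "behavioral shift at resume; ghost terms: "
            + ", ".join(report["ghost_terms"])
        )
    return report
```

The point is placement, not mechanism: the check runs after memory is loaded but before the first tool call, which is the last moment the loss is still recoverable.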

This is the behavioral layer sitting beneath the memory retrieval layer. Memory retrieval tells you what facts are available. Behavioral fingerprinting tells you whether the agent is operating in the same program as before. Both are necessary; most frameworks have only the first.