[Bug]: Context overflow / compaction can orphan agent:main:main and silently rotate WebChat to a new session #70472

@rjdjohnston

Summary

Long-running WebChat main sessions can silently lose the live agent:main:main mapping after context overflow / compaction-path failures, causing the next visible user message to start a new session id even though the prior transcript still exists on disk.

From the user's perspective, this looks like the session "disappeared" or got wiped:

  • the visible main WebChat chat appears to restart
  • prior checkpoints/history are no longer attached to the active session row
  • the agent behaves like it lost context
  • old transcripts are still present on disk, but the active session store now points agent:main:main at a new sessionId

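This seems related to #70330, but the trigger here is different: context overflow during heavy tool use, followed by a compaction checkpoint that could not find the session entry.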

Environment

  • OpenClaw CLI: 2026.4.21 (f788c88)
  • Channel/surface: WebChat direct
  • Session key: agent:main:main
  • Model: openai-codex/gpt-5.4
  • Host OS: Darwin 25.4.0 arm64

What happened

On Wednesday, April 22, 2026 (America/New_York), the main WebChat session hit context-overflow conditions multiple times during heavy tool use.

Relevant observed timeline:

  • 10:19:21 PM EDT — log recorded a context overflow for agent:main:main tied to session file b8a5c854-4632-4954-ae09-fd507ada1e8a.jsonl
  • 10:21:14 PM EDT — log recorded skipping compaction checkpoint persist: session not found for agent:main:main
  • 10:30:51 PM EDT — another context overflow for agent:main:main, now tied to df065d9f-9445-4401-a94b-61042c7eff40.jsonl
  • 11:17:42 PM EDT — another context overflow for that same later session file
  • after that, visible user messages landed in a fresh active session mapping with session id f36b8f4d-bdca-41a2-b13c-a6ec3016218a

Important detail: the earlier transcripts were still on disk. They were not actually deleted from the transcript folder. What changed was the active sessions.json entry for agent:main:main, which ended up pointing at a new session id.

Why this looks like a bug

The sequence suggests the session store entry for the active main session becomes missing/unavailable during overflow/compaction handling:

  1. main session grows very large
  2. provider/tool loop hits context overflow
  3. compaction checkpoint code tries to persist and logs session not found
  4. active agent:main:main mapping is no longer the old session
  5. next user message creates or uses a new session id
  6. user experiences this as spontaneous session loss

The log line below is especially suspicious because it shows the compaction/checkpoint path could not find the active session entry it expected:

skipping compaction checkpoint persist: session not found

Sanitized evidence

Observed log lines from /tmp/openclaw/openclaw-2026-04-22.log:

2026-04-22 22:19:21 EDT  [context-overflow-diag] sessionKey=agent:main:main ... sessionFile=/Users/.../b8a5c854-4632-4954-ae09-fd507ada1e8a.jsonl
2026-04-22 22:21:14 EDT  skipping compaction checkpoint persist: session not found  { sessionKey: agent:main:main }
2026-04-22 22:30:51 EDT  [context-overflow-diag] sessionKey=agent:main:main ... sessionFile=/Users/.../df065d9f-9445-4401-a94b-61042c7eff40.jsonl
2026-04-22 23:17:42 EDT  [context-overflow-diag] sessionKey=agent:main:main ... sessionFile=/Users/.../df065d9f-9445-4401-a94b-61042c7eff40.jsonl
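
For anyone trying to confirm the same pattern in their own logs, here is a minimal sketch that turns lines like the ones above into a per-key timeline and flags the point where the session file changes. The regular expressions are written against the sanitized lines quoted in this issue, so the real log format may need adjustments; session_timeline and the event names are hypothetical helpers, not part of OpenClaw.

```python
import re
from typing import Iterable, List, Optional, Tuple

# ASSUMPTION: patterns match the sanitized lines quoted in this issue.
OVERFLOW_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+\s+\[context-overflow-diag\]"
    r".*?sessionKey=(?P<key>\S+).*?sessionFile=\S*?(?P<sid>[0-9a-f-]{36})\.jsonl"
)
NOT_FOUND_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+\s+"
    r"skipping compaction checkpoint persist: session not found"
    r".*?sessionKey: (?P<key>[^\s}]+)"
)

Event = Tuple[str, str, Optional[str]]  # (timestamp, event name, session id)


def session_timeline(lines: Iterable[str], session_key: str) -> List[Event]:
    """Build an ordered event list for one session key."""
    events: List[Event] = []
    last_sid: Optional[str] = None
    for line in lines:
        m = OVERFLOW_RE.match(line)
        if m and m["key"] == session_key:
            sid = m["sid"]
            # Same key, different transcript file: the mapping moved.
            if last_sid is not None and sid != last_sid:
                events.append((m["ts"], "session-file-changed", sid))
            events.append((m["ts"], "overflow", sid))
            last_sid = sid
            continue
        m = NOT_FOUND_RE.match(line)
        if m and m["key"] == session_key:
            # The log line carries no session id; report the last one seen.
            events.append((m["ts"], "checkpoint-session-not-found", last_sid))
    return events
```

Feeding the four lines above through session_timeline(lines, "agent:main:main") should place a session-file-changed event at 22:30:51, between the checkpoint failure and the later overflows.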

Session files observed afterward:

old transcript still present:
  ~/.openclaw/agents/main/sessions/df065d9f-9445-4401-a94b-61042c7eff40.jsonl

new active transcript:
  ~/.openclaw/agents/main/sessions/f36b8f4d-bdca-41a2-b13c-a6ec3016218a.jsonl

Current sessions.json afterward pointed agent:main:main at the new session id instead of the prior active transcript.
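
A quick way to check for this state is to compare the transcript files on disk with the session ids the store still references. The sketch below assumes sessions.json is a JSON object mapping session keys to entries that carry a "sessionId" field; that schema is my guess from the observed behavior, and find_unmapped_transcripts is a hypothetical helper, so adjust both to the real layout.

```python
import json
from pathlib import Path
from typing import List


def find_unmapped_transcripts(store_path: Path, sessions_dir: Path) -> List[str]:
    """Return session ids that have a transcript on disk but no store entry.

    ASSUMPTION: the store is {"<session key>": {"sessionId": "<id>", ...}}.
    The real sessions.json schema may differ.
    """
    store = json.loads(store_path.read_text(encoding="utf-8"))
    mapped = {
        entry.get("sessionId")
        for entry in store.values()
        if isinstance(entry, dict)
    }
    # Transcript files are named <session id>.jsonl in the evidence above.
    on_disk = sorted(p.stem for p in sessions_dir.glob("*.jsonl"))
    return [sid for sid in on_disk if sid not in mapped]
```

Transcripts from sessions the user reset on purpose will also show up as unmapped, so the output is a list of candidates to inspect rather than proof of this bug.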

Expected behavior

When a main WebChat session overflows context or compaction/checkpoint persistence has trouble:

  • OpenClaw should not silently orphan the active agent:main:main mapping
  • the existing session should remain the active logical session unless the user explicitly resets it
  • checkpoint persistence failure should not cause a hidden session rotation
  • if recovery/new-session behavior does happen, it should be explicit and auditable in the UI and store

Actual behavior

  • overflow happened
  • compaction/checkpoint path logged session not found
  • active WebChat main mapping later pointed to a different session id
  • user-visible effect was "we lost the session again"
  • prior transcript remained on disk, but continuity in the active UI/session mapping was broken

Possible root-cause area

This log pair looks like the key clue:

  • context overflow in the embedded runner
  • compaction checkpoint persistence cannot find the current session entry

The checkpoint persistence code already logs this exact case:

skipping compaction checkpoint persist: session not found

So one plausible failure mode is:

  • overflow/compaction or related recovery mutates/removes the active session-store entry unexpectedly
  • checkpoint persistence races or arrives after the store no longer contains the expected canonical key
  • later inbound WebChat traffic reinitializes agent:main:main onto a new session id
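
To make the suspected ordering concrete, here is a toy in-memory model of the three steps above. It is not OpenClaw code and says nothing about what actually removes the entry; ToySessionStore, drop, and simulate are hypothetical names used only to show how a missing entry plus a create-if-missing lookup produces a silent new session id.

```python
import uuid
from typing import Dict, List, Tuple


class ToySessionStore:
    """Toy stand-in for the session store (illustration only)."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}  # session key -> session id
        self.log: List[str] = []

    def resolve_or_create(self, key: str) -> str:
        # Inbound-message path: silently creates a mapping when none exists.
        if key not in self.entries:
            self.entries[key] = str(uuid.uuid4())
        return self.entries[key]

    def persist_checkpoint(self, key: str) -> bool:
        # Checkpoint path: skips when the entry is gone, as in the quoted log.
        if key not in self.entries:
            self.log.append("skipping compaction checkpoint persist: session not found")
            return False
        return True

    def drop(self, key: str) -> None:
        # HYPOTHETICAL: stands in for whatever removes the entry during
        # overflow/compaction recovery. The real cause is unknown.
        self.entries.pop(key, None)


def simulate() -> Tuple[str, bool, str, List[str]]:
    store = ToySessionStore()
    key = "agent:main:main"
    old_id = store.resolve_or_create(key)      # long-running session
    store.drop(key)                            # entry disappears
    persisted = store.persist_checkpoint(key)  # logs "session not found"
    new_id = store.resolve_or_create(key)      # next message gets a new id
    return old_id, persisted, new_id, store.log
```

In this model nothing fails loudly: the checkpoint is skipped, the next lookup succeeds, and the only trace of the rotation is that old_id and new_id differ.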

Suggested fixes

  1. Treat agent:main:main disappearance during overflow/compaction as a high-severity invariant violation and log the old/new session ids plus store path.
  2. Prevent checkpoint persistence failure from leaving the active main session unmapped.
  3. Add an explicit recovery path that preserves the active session id unless the user intentionally resets.
  4. If a fallback/new session must be created, emit a visible/auditable session-rotation event in the transcript/store/UI.
  5. Add regression coverage for:
    • large WebChat direct session
    • context overflow during tool-heavy turn
    • post-overflow compaction/checkpoint handling
    • subsequent user message should continue same active session id
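
As a rough illustration of fixes 1 through 4, the sketch below shows one possible shape: the checkpoint path restores a missing mapping instead of skipping, refuses to proceed if the key points somewhere unexpected, and every mapping change goes through a single method that records an event. All names here (GuardedSessionStore, RotationEvent, SessionMappingChanged) are hypothetical and the real store API will differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class SessionMappingChanged(RuntimeError):
    """Raised when a key points at a different session than expected."""


@dataclass
class RotationEvent:
    key: str
    old_session_id: Optional[str]
    new_session_id: str
    reason: str


@dataclass
class GuardedSessionStore:
    entries: Dict[str, str] = field(default_factory=dict)
    events: List[RotationEvent] = field(default_factory=list)

    def persist_checkpoint(self, key: str, session_id: str) -> None:
        """Restore a missing mapping and surface an unexpected one."""
        current = self.entries.get(key)
        if current is None:
            # Fixes 2 and 3: keep the active session id instead of skipping.
            self.entries[key] = session_id
            self.events.append(
                RotationEvent(key, None, session_id, "restored-missing-mapping")
            )
        elif current != session_id:
            # Fix 1: treat as an invariant violation, report both ids.
            raise SessionMappingChanged(
                f"{key}: expected {session_id}, store has {current}"
            )
        # Actual checkpoint persistence would happen here.

    def rotate(self, key: str, new_session_id: str, reason: str) -> RotationEvent:
        """Fix 4: the only way to change a mapping, always recorded."""
        event = RotationEvent(key, self.entries.get(key), new_session_id, reason)
        self.entries[key] = new_session_id
        self.events.append(event)
        return event
```

A regression test along the lines of item 5 could then delete the entry mid-turn, call persist_checkpoint with the original session id, and assert that the key still resolves to that id and that an event was recorded.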

Severity / impact

This is risky for long-running operational sessions because the user can believe they are continuing the same stateful conversation when the agent has actually been remapped onto a fresh session.

That is especially dangerous for write-capable local-admin workflows because the agent may continue from incomplete or reconstructed context while the user thinks continuity was preserved.

Metadata

Assignees

No one assigned

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:data-loss: Can lose, corrupt, or silently drop user/session/config data.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
