[Bug]: Telegram DM amnesia — cliSessionBindings stores claude-cli sessionId with no backing transcript; --resume silently starts a fresh session every turn #70177

@jarvis-drakon

Summary

In a Telegram DM bound to a main agent on the claude-cli backend, the stored cliSessionBindings["claude-cli"].sessionId points to a Claude CLI session that has no matching transcript file under ~/.claude/projects/<slug>/<sessionId>.jsonl. On every turn the gateway invokes claude --resume <sessionId> with that phantom UUID, Claude Code treats it as a fresh session (parentUuid: null), and the user experiences amnesia: nothing carries over from one turn to the next.

Unlike #69118 / #64386, the session-reuse gate (resolveCliSessionReuse) does not invalidate in this failure mode — all four keys (authProfileId, authEpoch, extraSystemPromptHash, mcpConfigHash) match the stored binding, so there is no cli session reset reason=… line in the gateway log. The bug is silent: OpenClaw thinks it is resuming; Claude Code has nothing to resume.

Environment

  • openclaw 2026.4.21 (f788c88), upgraded last night from 2026.4.20
  • claude-cli backend, OAuth auth profile, Opus
  • Channel: Telegram DM (chat_type: direct), binding key agent:main:direct:michael
  • Linux (Oracle ARM), Node 22, systemd-managed user unit openclaw-gateway

Evidence

1. Binding points at a Claude-CLI sessionId whose transcript does not exist

~/.openclaw/agents/main/sessions/sessions.json:

"agent:main:direct:michael": {
  "sessionId": "94b88552-b02c-4d0a-bca8-d3873226537d",
  "cliSessionBindings": {
    "claude-cli": {
      "sessionId": "3171f8f7-efb4-433d-81be-071a5d0630ea",
      "authProfileId": "anthropic:claude-cli",
      "authEpoch": "e4807207b45487…",
      "extraSystemPromptHash": "2ce382856b9bc2…",
      "mcpConfigHash": "6cba25a87f1904…"
    }
  }
}

Neither UUID has a backing JSONL:

$ find ~/.openclaw ~/.claude -name "3171f8f7-*.jsonl"
(nothing)
$ find ~/.openclaw ~/.claude -name "94b88552-*.jsonl"
(nothing)
$ ls ~/.claude/session-env/3171f8f7-*
/home/ubuntu/.claude/session-env/3171f8f7-efb4-433d-81be-071a5d0630ea   # directory only, no transcript

Expected: ~/.claude/projects/<slug>/3171f8f7-efb4-433d-81be-071a5d0630ea.jsonl exists.
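The manual find commands above can be turned into a repeatable check. The following is a minimal sketch, assuming the sessions.json shape shown in the excerpt and assuming transcripts live one level below ~/.claude/projects as <slug>/<sessionId>.jsonl; findTranscript and findPhantomBindings are hypothetical helper names, not OpenClaw or Claude Code APIs.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed location of Claude Code transcripts, per the path quoted in this issue.
const defaultProjectsRoot = path.join(os.homedir(), ".claude", "projects");

// Assumed store shape, inferred from the sessions.json excerpt above.
type SessionStore = Record<
  string,
  { cliSessionBindings?: Record<string, { sessionId?: string }> }
>;

// Hypothetical helper: returns the transcript path for a session id,
// or null if no <sessionId>.jsonl exists under any project slug.
function findTranscript(projectsRoot: string, sessionId: string): string | null {
  if (!fs.existsSync(projectsRoot)) return null;
  for (const entry of fs.readdirSync(projectsRoot, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const candidate = path.join(projectsRoot, entry.name, `${sessionId}.jsonl`);
    if (fs.existsSync(candidate)) return candidate;
  }
  return null;
}

// Lists every binding key whose claude-cli sessionId has no transcript on disk.
function findPhantomBindings(
  store: SessionStore,
  projectsRoot: string = defaultProjectsRoot,
): string[] {
  const phantoms: string[] = [];
  for (const [key, entry] of Object.entries(store)) {
    const id = entry.cliSessionBindings?.["claude-cli"]?.sessionId;
    if (id && findTranscript(projectsRoot, id) === null) {
      phantoms.push(`${key} -> ${id}`);
    }
  }
  return phantoms;
}
```

Run against a parsed sessions.json, this would flag agent:main:direct:michael in the state described here, since no 3171f8f7-….jsonl exists under any slug.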

2. The prior working binding was hard-reset, not migrated

The transcript for the preceding Michael-direct binding, 011f5e08-70a5-42c9-b2e8-693917c5d557, was renamed:

011f5e08-70a5-42c9-b2e8-693917c5d557.jsonl.reset.2026-04-20T21-06-08.449Z   (8.1 MB)

That rename happened at 2026-04-20 21:06 UTC, before the 2026.4.21 upgrade and without a user-issued /reset. The new binding (94b88552 / 3171f8f7) was written fresh on the next turn, but the code path that allocated it never produced a corresponding ~/.claude/projects/.../*.jsonl for the claude-cli sessionId it chose.

3. Aggressive pruning in 2026.4.20 amplified the surface area

sessions.json dropped from ~3.7 MB → ~1.7 MB after the 2026.4.20 upgrade (59 → 27 keys). The 2026.4.20 changelog:

enforce the built-in entry cap and age prune by default, and prune oversized stores at load time

Presumably intentional, but the pruner evicted still-live bindings for infrequently-used DMs (the TUI is the hot path; Telegram DMs went a day without traffic). When the user came back via Telegram, a brand new binding was allocated and the missing-transcript code path was taken.

4. Gateway log is silent — no reset reason is logged

Two hours of journalctl --user -u openclaw-gateway:

12:12:56 cli exec: provider=claude-cli model=opus promptChars=505
12:22:18 cli exec: provider=claude-cli model=opus promptChars=416
12:22:20 cli exec: provider=claude-cli model=opus promptChars=416
12:45:50 cli exec: provider=claude-cli model=opus promptChars=782
12:45:51 cli exec: provider=claude-cli model=opus promptChars=782
13:00:32 cli exec: provider=claude-cli model=opus promptChars=1203

promptChars is tiny per turn (inbound envelope only) — confirming no conversation history is being carried across turns. But there are zero cli session reset reason=… lines for agent:main:direct:michael in this window. The reuse gate happily returns "reuse" because the binding fields all match; Claude Code receives --resume 3171f8f7-… and silently starts fresh.

(For contrast, this morning's log does show reason=mcp and reason=auth-epoch resets on other bindings — those invalidations fire as designed; this one does not.)

Impact

  • Any channel that gets pruned from sessions.json and later re-binds is at risk of the same silent amnesia.
  • Users see degraded context without any log signal pointing at session plumbing.
  • Particularly bad for low-frequency DMs, which are exactly what the age-based pruner targets.

Suspected root cause (needs maintainer confirmation)

Something in the rebind path is writing a claude-cli sessionId before or without a turn that actually produces a ~/.claude/projects/<slug>/*.jsonl. Likely candidates:

  • The sessionId is generated optimistically from an allocator (or re-read from a stale field), the first claude -p invocation fails or is short-circuited before Claude Code writes its transcript, but the binding is persisted regardless.
  • Or the sessionId is being captured from a parent process whose transcript is written under a different project slug than the one --resume is later asked to load from.

Either way, the invariant worth enforcing is: never persist cliSessionBindings[provider].sessionId unless a transcript for that sessionId exists on disk at write time.
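As a rough illustration of that invariant, here is a minimal sketch of a guarded write. It assumes transcripts live at <projectsRoot>/<slug>/<sessionId>.jsonl with an unknown slug; persistBindingIfBacked, transcriptExists, and the in-memory bindings map are hypothetical stand-ins, not the real setCliSessionBinding code path.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Binding fields as they appear in the sessions.json excerpt.
type CliSessionBinding = {
  sessionId: string;
  authProfileId?: string;
  authEpoch?: string;
  extraSystemPromptHash?: string;
  mcpConfigHash?: string;
};

type PersistResult =
  | { persisted: true }
  | { persisted: false; reason: "transcript-missing" };

// True if any project slug directory contains <sessionId>.jsonl.
function transcriptExists(projectsRoot: string, sessionId: string): boolean {
  let slugs: string[];
  try {
    slugs = fs.readdirSync(projectsRoot);
  } catch {
    return false; // projects root missing or unreadable
  }
  return slugs.some((slug) =>
    fs.existsSync(path.join(projectsRoot, slug, `${sessionId}.jsonl`)),
  );
}

// Hypothetical guarded write: only record the binding when a transcript backs it.
function persistBindingIfBacked(
  bindings: Record<string, CliSessionBinding>,
  provider: string,
  binding: CliSessionBinding,
  projectsRoot: string,
): PersistResult {
  if (!transcriptExists(projectsRoot, binding.sessionId)) {
    console.warn(
      `not persisting ${provider} binding ${binding.sessionId}: no transcript on disk`,
    );
    return { persisted: false, reason: "transcript-missing" };
  }
  bindings[provider] = binding;
  return { persisted: true };
}
```

Refusing the write leaves the store without a claude-cli binding, so the next turn would allocate a new session instead of resuming a phantom one.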

Suggested fixes

  1. Post-write verification: after setCliSessionBinding records a claude-cli sessionId, stat the expected ~/.claude/projects/<slug>/<sessionId>.jsonl. If it is absent, remove the binding again (or skip persisting it in the first place), log a warning, and let the next turn allocate fresh.

  2. Pre-resume verification: in resolveCliSessionReuse, add one more check on top of the four key comparisons: if the binding references claude-cli but the transcript file is missing, return invalidatedReason: "transcript-missing" and fall through to claude -p. This at least makes the bug visible in the log and stops handing phantom --resume UUIDs to Claude Code.

  3. Pruner guardrails: the 2026.4.20 age-prune should either:

    • not evict bindings whose underlying transcript is still present, or
    • when it does evict, also delete the transcript file and any session-env/<sessionId> directory, so downstream code cannot be fooled into thinking there is something to resume.

  4. Telemetry: emit a gateway log line whenever --resume <sessionId> is passed to claude-cli but the transcript cannot be stat-ed. Today this entire failure is invisible.
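Fixes 2 and 4 could be combined roughly as follows. This is a sketch under stated assumptions, not the actual resolveCliSessionReuse: the function name, its signature, and the "auth-profile" and "system-prompt" reason strings are invented for illustration (only mcp and auth-epoch appear in the logs quoted above), and the transcript lookup assumes the <slug>/<sessionId>.jsonl layout.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type BindingKeys = {
  authProfileId: string;
  authEpoch: string;
  extraSystemPromptHash: string;
  mcpConfigHash: string;
};
type StoredBinding = BindingKeys & { sessionId: string };

type ReuseDecision =
  | { reuse: true; sessionId: string }
  | { reuse: false; invalidatedReason: string };

// True if any project slug directory contains <sessionId>.jsonl.
function hasTranscript(projectsRoot: string, sessionId: string): boolean {
  let slugs: string[];
  try {
    slugs = fs.readdirSync(projectsRoot);
  } catch {
    return false;
  }
  return slugs.some((slug) =>
    fs.existsSync(path.join(projectsRoot, slug, `${sessionId}.jsonl`)),
  );
}

// Hypothetical reuse gate: the four key comparisons plus a transcript check.
function resolveReuseSketch(
  stored: StoredBinding,
  current: BindingKeys,
  projectsRoot: string,
): ReuseDecision {
  if (stored.authProfileId !== current.authProfileId) {
    return { reuse: false, invalidatedReason: "auth-profile" }; // assumed name
  }
  if (stored.authEpoch !== current.authEpoch) {
    return { reuse: false, invalidatedReason: "auth-epoch" };
  }
  if (stored.extraSystemPromptHash !== current.extraSystemPromptHash) {
    return { reuse: false, invalidatedReason: "system-prompt" }; // assumed name
  }
  if (stored.mcpConfigHash !== current.mcpConfigHash) {
    return { reuse: false, invalidatedReason: "mcp" };
  }
  if (!hasTranscript(projectsRoot, stored.sessionId)) {
    // Makes the otherwise silent failure visible in the gateway log.
    console.warn("cli session reset reason=transcript-missing");
    return { reuse: false, invalidatedReason: "transcript-missing" };
  }
  return { reuse: true, sessionId: stored.sessionId };
}
```

Placing the transcript check last keeps the existing invalidation reasons unchanged and only adds a new outcome for bindings that would otherwise pass the gate.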

Related

Happy to provide a stripped sessions.json snippet and journalctl excerpts on request, or open a PR that adds the post-write / pre-resume stat check + regression test.
