Summary
In a Telegram DM bound to a main agent on the claude-cli backend, the stored cliSessionBindings["claude-cli"].sessionId points to a Claude CLI session that has no matching transcript file under ~/.claude/projects/<slug>/<sessionId>.jsonl. On every turn the gateway invokes claude --resume <sessionId> with that phantom UUID, Claude Code treats it as a fresh session (parentUuid: null), and the user gets no memory continuity across turns.
Unlike #69118 / #64386, the session-reuse gate (resolveCliSessionReuse) does not invalidate in this failure mode — all four keys (authProfileId, authEpoch, extraSystemPromptHash, mcpConfigHash) match the stored binding, so there is no cli session reset reason=… line in the gateway log. The bug is silent: OpenClaw thinks it is resuming; Claude Code has nothing to resume.
Environment
- openclaw 2026.4.21 (f788c88), upgraded last night from 2026.4.20
- claude-cli backend, OAuth auth profile, Opus
- Channel: Telegram DM (chat_type: direct), binding key agent:main:direct:michael
- Linux (Oracle ARM), Node 22, systemd-managed user unit openclaw-gateway
Evidence
1. Binding points at a Claude-CLI sessionId whose transcript does not exist
~/.openclaw/agents/main/sessions/sessions.json:
"agent:main:direct:michael": {
  "sessionId": "94b88552-b02c-4d0a-bca8-d3873226537d",
  "cliSessionBindings": {
    "claude-cli": {
      "sessionId": "3171f8f7-efb4-433d-81be-071a5d0630ea",
      "authProfileId": "anthropic:claude-cli",
      "authEpoch": "e4807207b45487…",
      "extraSystemPromptHash": "2ce382856b9bc2…",
      "mcpConfigHash": "6cba25a87f1904…"
    }
  }
}
Neither UUID has a backing JSONL:
$ find ~/.openclaw ~/.claude -name "3171f8f7-*.jsonl"
(nothing)
$ find ~/.openclaw ~/.claude -name "94b88552-*.jsonl"
(nothing)
$ ls ~/.claude/session-env/3171f8f7-*
/home/ubuntu/.claude/session-env/3171f8f7-efb4-433d-81be-071a5d0630ea # directory only, no transcript
Expected: ~/.claude/projects/<slug>/3171f8f7-efb4-433d-81be-071a5d0630ea.jsonl exists.
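The manual find check above can be automated across every binding in the store. The following is a minimal sketch, not OpenClaw code: it assumes sessions.json parses to a map from binding key to entry (as the excerpt suggests), and that transcripts live at ~/.claude/projects/<slug>/<sessionId>.jsonl. Because the slug is not recorded in the binding, it scans every slug directory. The names findTranscript and findPhantomBindings are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed minimal shape of one sessions.json entry (only the fields used here).
interface StoreEntry {
  cliSessionBindings?: Record<string, { sessionId?: string }>;
}

// Names of the sub-directories of root; an unreadable or missing root yields [].
function listDirs(root: string): string[] {
  try {
    return fs
      .readdirSync(root, { withFileTypes: true })
      .filter((d) => d.isDirectory())
      .map((d) => d.name);
  } catch {
    return [];
  }
}

// Returns the transcript path for sessionId under any project slug, or null.
function findTranscript(
  sessionId: string,
  projectsRoot: string = path.join(os.homedir(), ".claude", "projects"),
): string | null {
  for (const slug of listDirs(projectsRoot)) {
    const candidate = path.join(projectsRoot, slug, `${sessionId}.jsonl`);
    if (fs.existsSync(candidate)) return candidate;
  }
  return null;
}

// Lists binding keys whose claude-cli sessionId has no transcript on disk.
function findPhantomBindings(
  store: Record<string, StoreEntry>,
  projectsRoot?: string,
): string[] {
  const phantoms: string[] = [];
  for (const [key, entry] of Object.entries(store)) {
    const sessionId = entry.cliSessionBindings?.["claude-cli"]?.sessionId;
    if (sessionId && findTranscript(sessionId, projectsRoot) === null) {
      phantoms.push(key);
    }
  }
  return phantoms;
}
```

Running findPhantomBindings(JSON.parse(fs.readFileSync(storePath, "utf8"))) against the store in this report would be expected to list agent:main:direct:michael, since no slug directory contains 3171f8f7-efb4-433d-81be-071a5d0630ea.jsonl.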
2. The prior working binding was hard-reset, not migrated
The preceding Michael-direct binding 011f5e08-70a5-42c9-b2e8-693917c5d557 was renamed:
011f5e08-70a5-42c9-b2e8-693917c5d557.jsonl.reset.2026-04-20T21-06-08.449Z (8.1 MB)
That rename happened at 2026-04-20 21:06 UTC, before the 2026.4.21 upgrade and without a user /reset. The new binding (94b88552 / 3171f8f7) was written fresh on the next turn, but the code path that allocated it never produced a corresponding ~/.claude/projects/.../*.jsonl for the claude-cli sessionId it chose.
3. Aggressive pruning in 2026.4.20 amplified the surface area
sessions.json dropped from ~3.7 MB → ~1.7 MB after the 2026.4.20 upgrade (59 → 27 keys). The 2026.4.20 changelog:
enforce the built-in entry cap and age prune by default, and prune oversized stores at load time
Presumably intentional, but the pruner evicted still-live bindings for infrequently-used DMs (the TUI is the hot path; Telegram DMs went a day without traffic). When the user came back via Telegram, a brand new binding was allocated and the missing-transcript code path was taken.
4. Gateway log is silent — no reset reason is logged
Two hours of journalctl --user -u openclaw-gateway:
12:12:56 cli exec: provider=claude-cli model=opus promptChars=505
12:22:18 cli exec: provider=claude-cli model=opus promptChars=416
12:22:20 cli exec: provider=claude-cli model=opus promptChars=416
12:45:50 cli exec: provider=claude-cli model=opus promptChars=782
12:45:51 cli exec: provider=claude-cli model=opus promptChars=782
13:00:32 cli exec: provider=claude-cli model=opus promptChars=1203
promptChars is tiny per turn (inbound envelope only) — confirming no conversation history is being carried across turns. But there are zero cli session reset reason=… lines for agent:main:direct:michael in this window. The reuse gate happily returns "reuse" because the binding fields all match; Claude Code receives --resume 3171f8f7-… and silently starts fresh.
(For contrast, this morning's log does show reason=mcp and reason=auth-epoch resets on other bindings — those invalidations fire as designed; this one does not.)
Impact
- Any channel that gets pruned from sessions.json and later re-binds is at risk of the same silent amnesia.
- Users see degraded context without any log signal pointing at session plumbing.
- Particularly bad for low-frequency DMs, which are exactly what the age-based pruner targets.
Suspected root cause (needs maintainer confirmation)
Something in the rebind path is writing a claude-cli sessionId before or without a turn that actually produces a ~/.claude/projects/<slug>/*.jsonl. Likely candidates:
- The sessionId is generated optimistically from an allocator (or re-read from a stale field); the first claude -p invocation then fails or is short-circuited before Claude Code writes its transcript, but the binding is persisted regardless.
- Or the sessionId is being captured from a parent process whose transcript is written under a different project slug than the one --resume is later asked to load from.
Either way, the invariant worth enforcing is: never persist cliSessionBindings[provider].sessionId unless a transcript for that sessionId exists on disk at write time.
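A minimal sketch of that invariant at write time follows. It is illustrative only: the Map-based store, the persistBindingIfTranscriptExists name, and the warning text are assumptions, not OpenClaw's actual setCliSessionBinding. The binding fields mirror the sessions.json excerpt above, and the caller is assumed to have already worked out the expected transcript path.

```typescript
import * as fs from "node:fs";

// Field names mirror the sessions.json excerpt; the Map store is a stand-in.
interface CliSessionBinding {
  sessionId: string;
  authProfileId?: string;
  authEpoch?: string;
  extraSystemPromptHash?: string;
  mcpConfigHash?: string;
}

// Writes the binding only when its transcript is already on disk.
// Returns true if the binding was stored, false if it was refused.
function persistBindingIfTranscriptExists(
  store: Map<string, CliSessionBinding>,
  bindingKey: string,
  binding: CliSessionBinding,
  transcriptPath: string,
  warn: (line: string) => void = console.warn,
): boolean {
  if (!fs.existsSync(transcriptPath)) {
    // Hypothetical log wording; the point is that the refusal is visible.
    warn(
      `not persisting cli session binding key=${bindingKey} ` +
        `sessionId=${binding.sessionId}: transcript missing`,
    );
    return false;
  }
  store.set(bindingKey, binding);
  return true;
}
```

Refusing the write leaves the key unbound, so the next turn takes the fresh-allocation path instead of resuming a UUID that Claude Code has never seen.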
Suggested fixes
- Post-write verification: after setCliSessionBinding persists a claude-cli sessionId, stat the expected ~/.claude/projects/<slug>/<sessionId>.jsonl. If it is absent, discard the binding rather than keep it, log a warning, and let the next turn allocate fresh.
- Pre-resume verification: in resolveCliSessionReuse, add a further check: if the binding references claude-cli but the transcript file is missing, return invalidatedReason: "transcript-missing" and fall through to claude -p. This at least makes the bug visible in the log and stops handing phantom --resume UUIDs to Claude Code.
- Pruner guardrails: the 2026.4.20 age-prune should either:
  - not evict bindings whose underlying transcript is still present, or
  - when it does evict, also delete the transcript file and any session-env/<sessionId> directory, so downstream code cannot be fooled into thinking there is something to resume.
- Telemetry: emit a gateway log line whenever --resume <sessionId> is passed to claude-cli but the transcript cannot be stat-ed. Today this entire failure is invisible.
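The pre-resume check and the telemetry line could be combined roughly as below. This is a sketch under stated assumptions: the ReuseDecision shape and the checkTranscriptBeforeResume name are hypothetical, since the real return type of resolveCliSessionReuse is not shown in this report. Only the "cli session reset reason=…" prefix and the "transcript-missing" reason come from the text above; the remaining log fields are illustrative.

```typescript
import * as fs from "node:fs";

// Hypothetical result shape; the real resolveCliSessionReuse type may differ.
type ReuseDecision =
  | { reuse: true; sessionId: string }
  | { reuse: false; invalidatedReason: string };

// True only when the path can be stat-ed and is a regular file.
function transcriptIsPresent(transcriptPath: string): boolean {
  try {
    return fs.statSync(transcriptPath).isFile();
  } catch {
    return false; // any stat failure means there is nothing to resume from
  }
}

// Extra gate to run after the existing key comparisons have all matched.
function checkTranscriptBeforeResume(
  provider: string,
  sessionId: string,
  transcriptPath: string,
  log: (line: string) => void = console.warn,
): ReuseDecision {
  // Other providers are passed through untouched in this sketch.
  if (provider !== "claude-cli" || transcriptIsPresent(transcriptPath)) {
    return { reuse: true, sessionId };
  }
  log(
    `cli session reset reason=transcript-missing ` +
      `provider=${provider} sessionId=${sessionId}`,
  );
  return { reuse: false, invalidatedReason: "transcript-missing" };
}
```

With a gate like this, the window in section 4 would contain a reset line for agent:main:direct:michael instead of a run of silent cli exec entries.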
Related
Happy to provide a stripped sessions.json snippet and journalctl excerpts on request, or open a PR that adds the post-write / pre-resume stat check + regression test.