[Bug]: Newly-created Telegram group-topic sessions wedge indefinitely on first inbound — claude-cli --resume hangs against UUID with no prior project transcript (2026.5.20 regression) #86095
Backend: cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
Channel: Telegram (polling mode), single Telegram supergroup with multiple forum topics
Agent: single agents.list[main] with model claude-cli/claude-opus-4-7 (legacy form) and fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
After upgrading to 2026.5.20, every newly-created Telegram group-topic session wedges on its first inbound and stays wedged. The gateway mints a fresh session UUID, invokes claude -p --resume <uuid>, and claude-cli hangs for 195+ seconds because no ~/.claude/projects/<workspace>/<uuid>.jsonl exists for that UUID. Watchdog aborts the embedded run after ~6 minutes. Next inbound to the same lane creates a fresh UUID with the same fate — the lane is now permanently degraded. Telegram DM lanes are unaffected.
Symptom
For each affected lane:
User posts a message to a Telegram supergroup topic.
The watchdog abort then has one of two outcomes, depending on the drained flag:
drained=false if the abort fired before any tokens were emitted → in-flight content is discarded silently; the user never sees a reply.
drained=true if the abort fired after streaming started → the abort handler calls Telegram's deleteMessage on the in-flight partial message → the user briefly sees a partial post that then disappears. The gateway journal does NOT log the deleteMessage HTTP call separately; the cleanup is hidden inside the abort.
The same session UUID is reused across multiple aborts on the same lane. Next inbound to that lane stalls again the same way until the OpenClaw session record is manually purged.
Sub-agents spawned from a wedged parent complete their own work but can't deliver back:
[warn] Subagent announce give up (retry-limit) run=<x> child=<y>
requester=agent:main:telegram:group:<chat>:topic:<id>
retries=3 endedAgo=Ns
deliveryError="completion agent did not deliver through the message tool;
direct-primary: completion agent did not deliver through the message tool"
When all parent-side delivery retries exhaust, the direct-primary fallback path routes the sub-agent's output to the user's DM with the originating bot instead. Result: messages composed in the context of a group topic appear in the user's DM. This is the cross-topic-to-DM "message jumping" symptom users report.
Root cause hypothesis
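The gateway's OpenClaw session record and claude-cli's local project transcript at ~/.claude/projects/<workspace>/<uuid>.jsonl share the same UUID but are stored in two separate locations. When a NEW group-topic session is created post-2026.5.20, the gateway-side record is present (agents/<id>/sessions/sessions.json plus the *-topic-<id>.jsonl file), but no matching claude-cli transcript exists at ~/.claude/projects/<workspace>/<uuid>.jsonl.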
The next time the gateway invokes claude -p --resume <uuid>, claude-cli finds no matching transcript and hangs trying to resume a session that doesn't exist on its side, instead of failing fast or auto-creating fresh state. The two failover attempts (Opus then Sonnet, ~195s + ~183s) both hang the same way before the watchdog fires.
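To illustrate what "failing fast" could look like from the caller's side, here is a minimal sketch of a bounded subprocess call. The helper name run_with_deadline and the deadline values are hypothetical and are not OpenClaw or Claude Code CLI APIs; a sleeping child process stands in for the hanging resume.

```python
import subprocess
import sys


def run_with_deadline(argv, deadline_s):
    """Run argv and return (status, returncode).

    status is "ok", "failed", or "timeout". Hypothetical helper for
    illustration only; it is not part of OpenClaw or claude-cli.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=deadline_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so the
        # caller is not left with a lingering process.
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)


if __name__ == "__main__":
    # Stand-in for a hanging resume: a child that sleeps past the deadline.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hanging, deadline_s=0.5))
```

With a short per-attempt deadline, two hanging failover attempts would cost seconds rather than the ~195s + ~183s observed here, though choosing a safe deadline for real model calls is a separate question.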
The documented safety net — "Stored session ids are verified against an existing readable project transcript before resume; phantom bindings are cleared with reason=transcript-missing instead of silently starting a fresh Claude CLI session under --resume" (per docs.openclaw.ai/gateway/cli-backends) — does not appear to be firing for group-topic lanes on 2026.5.20. Either:
The check is wired only for lane=main, not for agent:main:telegram:group:*:topic:* lanes, or
The check was bypassed by the new code path added in #19328 ("preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates", shipped in 2026.5.20).
Notably, Telegram DM lanes are unaffected because the DM session predates 2026.5.20 and has long-standing ~/.claude/projects/<workspace>/<uuid>.jsonl state. Only newly-minted sessions hit the wedge.
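The documented safety net quoted above can be sketched roughly as follows. This is an illustration of the described behavior, not the gateway's actual code: the function names, the return shape, and the projects_root parameter are assumptions; only the transcript path layout and the reason string transcript-missing come from the report and the quoted docs.

```python
from pathlib import Path
from typing import Optional, Tuple


def transcript_path(projects_root: Path, workspace: str, session_id: str) -> Path:
    # Layout taken from the report: <projects_root>/<workspace>/<uuid>.jsonl
    return projects_root / workspace / f"{session_id}.jsonl"


def resolve_resume_id(
    projects_root: Path, workspace: str, stored_id: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (resume_id, cleared_reason).

    resume_id is None when the caller should start a fresh session
    instead of passing --resume. Hypothetical helper for illustration.
    """
    if stored_id is None:
        return (None, None)
    path = transcript_path(projects_root, workspace, stored_id)
    try:
        # "Existing and readable": opening and reading one byte covers both.
        with path.open("rb") as fh:
            fh.read(1)
    except OSError:
        return (None, "transcript-missing")
    return (stored_id, None)
```

If a check of this shape ran for every lane shape, a phantom binding on a group-topic lane would be cleared before the resume attempt rather than after a watchdog abort.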
Reproducer
Run OpenClaw 2026.5.20 with cliBackends.claude-cli pointing at Claude Code CLI.
Configure a Telegram channel with a supergroup that has forum topics enabled; allow at least one user to post.
Have the user post a message to a topic that has no prior OpenClaw session for that lane (i.e. agents/<id>/sessions/ contains no <uuid>-topic-<topicId>.jsonl for that topic).
Observe the gateway journal: the lane enters processing state, embedded_run:started, no progress, and is aborted by the watchdog at ~6 minutes.
Inspect storage: the new session UUID exists in agents/<id>/sessions/sessions.json and as <uuid>-topic-<topicId>.jsonl, but no corresponding file exists at ~/.claude/projects/<workspace>/<uuid>.jsonl.
User posts again to the same topic. Same outcome.
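The storage inspection in the steps above can be scripted. The sketch below assumes only the file naming described in this report (<uuid>-topic-<topicId>.jsonl on the gateway side, <uuid>.jsonl on the claude-cli side); it does not parse sessions.json, whose schema is not shown here, and the function name find_orphans is hypothetical.

```python
import re
from pathlib import Path

# Matches "<uuid>-topic-<topicId>.jsonl" as described in the report.
TOPIC_FILE = re.compile(
    r"^(?P<uuid>[0-9a-fA-F-]{36})-topic-(?P<topic>[^.]+)\.jsonl$"
)


def find_orphans(sessions_dir, transcripts_dir):
    """List (uuid, topic_id) pairs that have a gateway-side topic file
    but no claude-cli transcript named <uuid>.jsonl."""
    orphans = []
    for entry in sorted(Path(sessions_dir).iterdir()):
        match = TOPIC_FILE.match(entry.name)
        if match is None:
            continue
        transcript = Path(transcripts_dir) / f"{match['uuid']}.jsonl"
        if not transcript.is_file():
            orphans.append((match["uuid"], match["topic"]))
    return orphans
```

Pointing sessions_dir at agents/<id>/sessions/ and transcripts_dir at ~/.claude/projects/<workspace>/ should list the session UUIDs that a resume attempt would hang on, if the hypothesis in this report holds.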
Workaround in use
Manual quarantine and store cleanup, per the unsustainable workaround documented in #44687:
Confirmed: this unblocks the affected topic, the next inbound creates a fresh session UUID, and replies start flowing again. But within hours, new sessions wedge the same way — first run on the new UUID hits the same hang, watchdog aborts, lane re-enters the wedge state. A 24-hour cycle requires repeated manual cleanup.
Gateway restart alone does NOT fix this — verified. The OpenClaw session record is rehydrated from disk with the orphan UUID still present, and the resume hang reproduces on the first post-restart inbound.
Related upstream issues
#44687 (closed, fixed in 2026.3.x): "Stale session resume at gateway startup blocks lane=main indefinitely". Same symptom family, but only for lane=main, only for sessions inherited from a prior gateway lifetime. Our case is for group-topic lanes and for newly-minted sessions.
#71127 (closed, fixed in 2026.4.x): "Stuck processing sessions are detected but never aborted". Our case has detection + abort working; the underlying resume hang re-establishes immediately after each abort.
#19328 (shipped in 2026.5.20): "preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates". Suspected source of this regression.
#82964 (shipped in 2026.5.22-beta.1): "skip stale embedded-run wake probes for dormant completion requesters". May address the sub-agent delivery cascade (the cross-topic-to-DM routing symptom).
#84949 (shipped in 2026.5.22-beta.1): "bound embedded auto-compaction session write-lock watchdogs to the compaction timeout". Related lane-state cleanup work.
#81191 (closed): event-loop starvation from startAccount Telegram polling. Different root cause, but symptom magnitude (400+ second event-loop delays) overlaps with our 195s+183s timeouts.
Suggested investigation paths
Audit the new-session creation path for the agent:main:telegram:group:*:topic:* lane shape. Does it go through the same write-claude-cli-state step as agent:main:main and agent:main:telegram:direct:*? If not, that's the gap.
Re-verify the "verify-transcript-before-resume" code path still fires for all lane shapes in 2026.5.20. If it was a lane=main only check, it needs to be generalized.
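As an aid for the audit suggested above, lane keys can be grouped by shape so that any shape where the transcript check never fired stands out. The patterns are the lane shapes named in this report; the shape labels, function names, and the observation format are hypothetical.

```python
from fnmatch import fnmatchcase

# Lane shapes named in this report; the labels are made up for illustration.
LANE_SHAPES = [
    ("group-topic", "agent:main:telegram:group:*:topic:*"),
    ("direct", "agent:main:telegram:direct:*"),
    ("main", "agent:main:main"),
]


def lane_shape(lane_key):
    """Return the label of the first pattern matching lane_key."""
    for label, pattern in LANE_SHAPES:
        if fnmatchcase(lane_key, pattern):
            return label
    return "unknown"


def shapes_without_check(observations):
    """observations maps lane_key -> True if the transcript check was
    seen firing for that lane. Returns the sorted shape labels for
    which no lane of that shape was ever seen firing the check."""
    fired = {}
    for lane_key, check_fired in observations.items():
        label = lane_shape(lane_key)
        fired[label] = fired.get(label, False) or check_fired
    return sorted(label for label, ok in fired.items() if not ok)
```

Feeding this with per-lane observations from the gateway journal would show whether the check is missing only for the group-topic shape, which is what the first hypothesis predicts.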
Gateway-side sequence after the user's post on an affected lane:
Gateway logs [telegram] Inbound message telegram:group:<chat>:topic:<id> -> @<bot> (group, N chars).
Embedded run starts. Diagnostic warnings begin ~2 minutes later. sessionId=unknown and lastProgress=embedded_run:started persist; the run never advances past start.
Underlying claude-cli calls fail with very long timeouts. Both Opus (~195s) and Sonnet (~183s) failover candidates time out the same way.
Watchdog escalates to recovery at ~6 minutes.
What we did NOT include
Sanitized for privacy:
If maintainers need additional unredacted log excerpts or sessions.json snippets to reproduce, those can be shared privately on request.