[Bug]: Newly-created Telegram group-topic sessions wedge indefinitely on first inbound — claude-cli --resume hangs against UUID with no prior project transcript (2026.5.20 regression) #86095

@BryceMurray

Environment

  • OpenClaw: 2026.5.20 (commit e510042)
  • Node.js: bundled with the published npm package
  • Backend: cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
  • Channel: Telegram (polling mode), single Telegram supergroup with multiple forum topics
  • Agent: single agents.list[main] with model claude-cli/claude-opus-4-7 (legacy form) and fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
  • Other plugins enabled: anthropic, browser, canvas, device-pair, file-transfer, memory-core, phone-control, slack, talk-voice, telegram (10 total)
  • Host: single-VPS deployment, no container

TL;DR

After upgrading to 2026.5.20, every newly created Telegram group-topic session wedges on its first inbound and stays wedged. The gateway mints a fresh session UUID and invokes claude -p --resume <uuid>; claude-cli then hangs for roughly 183 to 195 seconds per failover attempt, apparently because no ~/.claude/projects/<workspace>/<uuid>.jsonl exists for that UUID. The watchdog aborts the embedded run after ~6 minutes. The next inbound to the same lane reuses the same session UUID and stalls the same way, and a UUID minted after a manual purge eventually wedges too, so the lane stays degraded. Telegram DM lanes are unaffected.

Symptom

For each affected lane:

  1. User posts a message to a Telegram supergroup topic.

  2. Gateway logs [telegram] Inbound message telegram:group:<chat>:topic:<id> -> @<bot> (group, N chars).

  3. Embedded run starts. Diagnostic warnings begin ~2 minutes later:

    [diagnostic] stalled session: sessionId=unknown
      sessionKey=agent:main:telegram:group:<chat>:topic:<id>
      state=processing age=Ns queueDepth=1
      reason=active_work_without_progress
      classification=stalled_agent_run
      activeWorkKind=embedded_run
      lastProgress=embedded_run:started
      lastProgressAge=Ns recovery=none
    

    sessionId=unknown and lastProgress=embedded_run:started persist — the run never advances past start.

  4. Underlying claude-cli calls fail with very long timeouts:

    [agent/cli-backend] claude live session turn failed:
      provider=claude-cli model=claude-opus-4-7
      durationMs=195366 error=FailoverError
    [model-fallback/decision] model fallback decision:
      decision=candidate_failed
      requested=claude-cli/claude-opus-4-7
      candidate=claude-cli/claude-opus-4-7
      reason=unknown next=claude-cli/claude-sonnet-4-6
      detail=Claude CLI failed.
    [agent/cli-backend] cli exec:
      provider=claude-cli model=sonnet promptChars=N
      trigger=user useResume=true session=present
      resumeSession=<short> reuse=reusable historyPrompt=present
    [agent/cli-backend] claude live session turn failed:
      provider=claude-cli model=claude-sonnet-4-6
      durationMs=183725 error=AbortError
    

    Both Opus (~195s) and Sonnet (~183s) failover candidates time out the same way.

  5. Watchdog escalates to recovery at ~6 minutes:

    [diagnostic] stuck session recovery:
      sessionId=<uuid> sessionKey=agent:main:telegram:group:<chat>:topic:<id>
      age=N action=abort_embedded_run aborted=true drained=true|false released=0
    [diagnostic] stuck session recovery outcome:
      status=aborted action=abort_embedded_run
      sessionId=<uuid> ... activeWorkKind=embedded_run
      lane=session:agent:main:telegram:group:<chat>:topic:<id>
      aborted=true drained=true|false forceCleared=false released=0
    
    • drained=false if abort fired before any tokens emitted → in-flight content discarded silently; user sees no reply ever.
    • drained=true if abort fired after streaming started → the abort handler calls Telegram's deleteMessage on the in-flight partial message → user briefly sees a partial post that then disappears. The gateway journal does NOT log the deleteMessage HTTP call separately; the cleanup is hidden inside the abort.
  6. The same session UUID is reused across multiple aborts on the same lane. Next inbound to that lane stalls again the same way until the OpenClaw session record is manually purged.

  7. Sub-agents spawned from a wedged parent complete their own work but can't deliver back:

    [warn] Subagent announce give up (retry-limit) run=<x> child=<y>
      requester=agent:main:telegram:group:<chat>:topic:<id>
      retries=3 endedAgo=Ns
      deliveryError="completion agent did not deliver through the message tool;
                     direct-primary: completion agent did not deliver through the message tool"
    

    When all parent-side delivery retries exhaust, the direct-primary fallback path routes the sub-agent's output to the user's DM with the originating bot instead. Result: messages composed in the context of a group topic appear in the user's DM. This is the cross-topic-to-DM "message jumping" symptom users report.
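To triage a gateway journal for this symptom, the diagnostic entries above can be reduced to key=value fields and filtered for runs that never advanced past start. This is a minimal sketch, not OpenClaw tooling: the function names are hypothetical, the chat and topic IDs in the sample are made-up placeholders, and it assumes values contain no whitespace (so quoted fields such as deliveryError="..." are not handled).

```python
from __future__ import annotations

import re

# Matches key=value tokens such as "state=processing" or "age=131s".
# Assumption: values contain no whitespace, as in the stalled-session excerpt.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(entry: str) -> dict[str, str]:
    """Collect key=value pairs from a (possibly multi-line) log entry."""
    return dict(FIELD_RE.findall(entry))


def is_stuck_at_start(fields: dict[str, str]) -> bool:
    """True when a stalled-session entry never advanced past run start."""
    return (
        fields.get("classification") == "stalled_agent_run"
        and fields.get("lastProgress") == "embedded_run:started"
    )


# Sample entry shaped like the excerpt in step 3; IDs are placeholders.
sample = """[diagnostic] stalled session: sessionId=unknown
  sessionKey=agent:main:telegram:group:-100123:topic:7
  state=processing age=131s queueDepth=1
  reason=active_work_without_progress
  classification=stalled_agent_run
  activeWorkKind=embedded_run
  lastProgress=embedded_run:started
  lastProgressAge=131s recovery=none"""

fields = parse_fields(sample)
print(fields["sessionKey"], is_stuck_at_start(fields))
```

Grouping the matching entries by sessionKey gives a quick list of wedged lanes without reading the journal by hand.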

Root cause hypothesis

The gateway's OpenClaw session record and claude-cli's local project transcript at ~/.claude/projects/<workspace>/<uuid>.jsonl share the same UUID but are stored in two separate locations. When a NEW group-topic session is created post-2026.5.20:

| Store | State for a newly-minted lane |
| --- | --- |
| OpenClaw: agents/<id>/sessions/sessions.json + *-topic-<id>.jsonl | Created, contains the user's first turn |
| claude-cli: ~/.claude/projects/<workspace>/<uuid>.jsonl | Never written |

The next time the gateway invokes claude -p --resume <uuid>, claude-cli finds no matching transcript and hangs trying to resume a session that doesn't exist on its side, instead of failing fast or auto-creating fresh state. The two failover attempts (Opus then Sonnet, ~195s + ~183s) both hang the same way before the watchdog fires.

The documented safety net — "Stored session ids are verified against an existing readable project transcript before resume; phantom bindings are cleared with reason=transcript-missing instead of silently starting a fresh Claude CLI session under --resume" (per docs.openclaw.ai/gateway/cli-backends) — does not appear to be firing for group-topic lanes on 2026.5.20. Either:

  1. The check is wired only for lane=main, not for agent:main:telegram:group:*:topic:* lanes, or
  2. The check was bypassed by the new code path added in #19328 ("preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates", shipped in 2026.5.20).

Notably, Telegram DM lanes are unaffected because the DM session predates 2026.5.20 and has long-standing ~/.claude/projects/<workspace>/<uuid>.jsonl state. Only newly-minted sessions hit the wedge.
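For reference, the verify-before-resume behavior quoted from the docs could look roughly like the sketch below. It is an illustration of the expected guard, not OpenClaw's actual implementation: build_claude_argv is a hypothetical name, treating an empty file as unreadable is an assumption, and the only CLI flags used are the -p and --resume forms already shown in this report.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def build_claude_argv(session_uuid: str, project_dir: Path) -> tuple[list[str], str | None]:
    """Return (argv, cleared_reason).

    Resume only when a readable transcript exists for the UUID. Otherwise
    drop --resume and report the reason the binding was cleared.
    """
    transcript = project_dir / f"{session_uuid}.jsonl"
    try:
        # Assumption: an empty transcript counts as missing.
        readable = transcript.is_file() and transcript.stat().st_size > 0
        if readable:
            with transcript.open("rb") as fh:
                fh.read(1)  # confirm the file can actually be read
    except OSError:
        readable = False

    if readable:
        return ["claude", "-p", "--resume", session_uuid], None
    return ["claude", "-p"], "transcript-missing"


# Demo against a temporary directory standing in for
# ~/.claude/projects/<workspace>/ (the UUID is a placeholder).
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp)
    uuid = "11111111-1111-4111-8111-111111111111"
    missing_argv, missing_reason = build_claude_argv(uuid, project_dir)
    (project_dir / f"{uuid}.jsonl").write_text('{"type":"user"}\n')
    present_argv, present_reason = build_claude_argv(uuid, project_dir)

print(missing_argv, missing_reason)
print(present_argv, present_reason)
```

If a guard of this shape ran for group-topic lanes, a newly minted UUID would start a fresh CLI session instead of reaching --resume with no transcript behind it.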

Reproducer

  1. Run OpenClaw 2026.5.20 with cliBackends.claude-cli pointing at Claude Code CLI.
  2. Configure a Telegram channel with a supergroup that has forum topics enabled; allow at least one user to post.
  3. Have the user post a message to a topic that has no prior OpenClaw session for that lane (i.e. agents/<id>/sessions/ contains no <uuid>-topic-<topicId>.jsonl for that topic).
  4. Observe the gateway journal: the lane enters processing state, embedded_run:started, no progress, and is aborted by the watchdog at ~6 minutes.
  5. Inspect storage: the new session UUID exists in agents/<id>/sessions/sessions.json and as <uuid>-topic-<topicId>.jsonl, but no corresponding file exists at ~/.claude/projects/<workspace>/<uuid>.jsonl.
  6. User posts again to the same topic. Same outcome.
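The storage inspection in step 5 can be scripted. The sketch below lists session UUIDs that have an OpenClaw topic transcript but no matching claude-cli transcript. It assumes the <uuid>-topic-<topicId>.jsonl naming described above and a standard lowercase UUID format; find_orphans and the placeholder UUIDs are hypothetical, and the demo runs against temporary directories rather than a real deployment.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

# Assumed file-name shape, taken from this report: <uuid>-topic-<topicId>.jsonl
SESSION_FILE_RE = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"-topic-(?P<topic>[^.]+)\.jsonl$"
)


def find_orphans(sessions_dir: Path, project_dir: Path) -> list[tuple[str, str]]:
    """Return (uuid, topic_id) pairs lacking a claude-cli transcript."""
    orphans = []
    for path in sorted(sessions_dir.iterdir()):
        match = SESSION_FILE_RE.match(path.name)
        if match and not (project_dir / f"{match['uuid']}.jsonl").is_file():
            orphans.append((match["uuid"], match["topic"]))
    return orphans


# Demo: two topic sessions, only the first has a claude-cli transcript.
uuid_a = "11111111-1111-4111-8111-111111111111"
uuid_b = "22222222-2222-4222-8222-222222222222"
with tempfile.TemporaryDirectory() as tmp:
    sessions_dir = Path(tmp) / "sessions"
    project_dir = Path(tmp) / "projects"
    sessions_dir.mkdir()
    project_dir.mkdir()
    (sessions_dir / f"{uuid_a}-topic-7.jsonl").write_text("{}\n")
    (sessions_dir / f"{uuid_b}-topic-12.jsonl").write_text("{}\n")
    (project_dir / f"{uuid_a}.jsonl").write_text("{}\n")
    orphans = find_orphans(sessions_dir, project_dir)

print(orphans)
```

On an affected host, sessions_dir would point at agents/<id>/sessions/ and project_dir at ~/.claude/projects/<workspace>/.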

Workaround in use

Manual quarantine and store cleanup, per the unsustainable workaround documented in #44687:

    mv agents/<id>/sessions/<uuid>-topic-<topicId>.* /quarantine/
    openclaw sessions cleanup --fix-missing --enforce --active-key "agent:<id>:telegram:direct:<user>"

Confirmed: this unblocks the affected topic, the next inbound creates a fresh session UUID, and replies start flowing again. But within hours, new sessions wedge the same way: the first run on the new UUID hits the same hang, the watchdog aborts, and the lane re-enters the wedge state. Keeping lanes responsive over a 24-hour period requires repeated manual cleanup.

Gateway restart alone does NOT fix this — verified. The OpenClaw session record is rehydrated from disk with the orphan UUID still present, and the resume hang reproduces on the first post-restart inbound.

Related upstream issues

  • #44687 (closed, fixed in 2026.3.x): "Stale session resume at gateway startup blocks lane=main indefinitely". Same symptom family, but only for lane=main, only for sessions inherited from a prior gateway lifetime. Our case is for group-topic lanes and for newly-minted sessions.
  • #71127 (closed, fixed in 2026.4.x): "Stuck processing sessions are detected but never aborted". Our case has detection + abort working; the underlying resume hang re-establishes immediately after each abort.
  • #19328 (shipped in 2026.5.20): "preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates". Suspected source of this regression.
  • #82964 (shipped in 2026.5.22-beta.1): "skip stale embedded-run wake probes for dormant completion requesters". May address the sub-agent delivery cascade (the cross-topic-to-DM routing symptom).
  • #84949 (shipped in 2026.5.22-beta.1): "bound embedded auto-compaction session write-lock watchdogs to the compaction timeout". Related lane-state cleanup work.
  • #81191 (closed): event-loop starvation from startAccount Telegram polling. Different root cause, but symptom magnitude (400+ second event-loop delays) overlaps with our 195s+183s timeouts.

Suggested investigation paths

  1. Audit the new-session creation path for the agent:main:telegram:group:*:topic:* lane shape. Does it go through the same write-claude-cli-state step as agent:main:main and agent:main:telegram:direct:*? If not, that's the gap.
  2. Re-verify the "verify-transcript-before-resume" code path still fires for all lane shapes in 2026.5.20. If it was a lane=main only check, it needs to be generalized.
  3. Check whether #19328 introduced an early return / state-reuse path that bypasses the safety net for new sessions.
  4. Confirm whether 2026.5.22-beta.1 fixes this; we have not upgraded yet.
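When auditing which lane shapes are affected, it may help to bucket the sessionKey values from stalled-session diagnostics by shape. The sketch below is an assumption-laden helper: the three patterns are inferred from the keys quoted in this report (agent:main:main, agent:main:telegram:direct:*, agent:main:telegram:group:*:topic:*), the shape names are hypothetical labels, and the sample keys use placeholder IDs.

```python
from __future__ import annotations

import re
from collections import Counter

# Lane shapes inferred from the session keys quoted in this report.
LANE_SHAPES = [
    ("telegram-group-topic", re.compile(r"^agent:[^:]+:telegram:group:[^:]+:topic:[^:]+$")),
    ("telegram-direct", re.compile(r"^agent:[^:]+:telegram:direct:[^:]+$")),
    ("main", re.compile(r"^agent:[^:]+:main$")),
]


def lane_shape(session_key: str) -> str:
    """Map a session key to a coarse lane shape, or "other" if unknown."""
    for name, pattern in LANE_SHAPES:
        if pattern.match(session_key):
            return name
    return "other"


def tally_by_shape(session_keys: list[str]) -> Counter:
    """Count stalled sessions per lane shape."""
    return Counter(lane_shape(key) for key in session_keys)


# Placeholder keys standing in for values pulled from the journal.
stalled_keys = [
    "agent:main:telegram:group:-100123:topic:7",
    "agent:main:telegram:group:-100123:topic:12",
    "agent:main:telegram:direct:42",
]
counts = tally_by_shape(stalled_keys)
print(counts)
```

A tally dominated by the group-topic shape would support the first hypothesis, that the transcript check or state-write step is missing for that lane shape specifically.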

What we did NOT include

Sanitized for privacy:

  • Specific user IDs, Telegram chat IDs, topic names, bot username
  • Workspace contents, third-party PII referenced in any wedged session
  • Per-skill names that reveal the deployment's business use

If maintainers need additional unredacted log excerpts or sessions.json snippets to reproduce, those can be shared privately on request.

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.