[Bug]: Newly-created Telegram group-topic sessions wedge indefinitely on first inbound — claude-cli --resume hangs against UUID with no prior project transcript (2026.5.20 regression)

## Environment

- **OpenClaw**: 2026.5.20 (commit `e510042`)
- **Node.js**: bundled with the published npm package
- **Backend**: `cliBackends.claude-cli` pointing at Anthropic's Claude Code CLI binary
- **Channel**: Telegram (polling mode), single Telegram supergroup with multiple forum topics
- **Agent**: single `agents.list[main]` with model `claude-cli/claude-opus-4-7` (legacy form) and fallbacks `claude-cli/claude-sonnet-4-6`, `claude-cli/claude-haiku-4-5`
- **Other plugins enabled**: anthropic, browser, canvas, device-pair, file-transfer, memory-core, phone-control, slack, talk-voice, telegram (10 total)
- **Host**: single-VPS deployment, no container

## TL;DR

After upgrading to 2026.5.20, every newly-created Telegram group-topic session wedges on its first inbound and stays wedged. The gateway mints a fresh session UUID, invokes `claude -p --resume <uuid>`, and `claude-cli` hangs for roughly three minutes per attempt (195 s on Opus, 183 s on the Sonnet fallback) because no `~/.claude/projects/<workspace>/<uuid>.jsonl` exists for that UUID. The watchdog aborts the embedded run after ~6 minutes. The next inbound to the same lane reuses the orphaned UUID and stalls the same way, so the lane stays degraded until the session record is purged manually; sessions created after a purge wedge again. Telegram DM lanes are unaffected.

## Symptom

For each affected lane:

1. User posts a message to a Telegram supergroup topic.
2. Gateway logs `[telegram] Inbound message telegram:group:<chat>:topic:<id> -> @<bot> (group, N chars)`.
3. Embedded run starts. Diagnostic warnings begin ~2 minutes later:

   ```
   [diagnostic] stalled session: sessionId=unknown
     sessionKey=agent:main:telegram:group:<chat>:topic:<id>
     state=processing age=Ns queueDepth=1
     reason=active_work_without_progress
     classification=stalled_agent_run
     activeWorkKind=embedded_run
     lastProgress=embedded_run:started
     lastProgressAge=Ns recovery=none
   ```

   `sessionId=unknown` and `lastProgress=embedded_run:started` persist — the run never advances past start.

4. Underlying `claude-cli` calls fail with very long timeouts:

   ```
   [agent/cli-backend] claude live session turn failed:
     provider=claude-cli model=claude-opus-4-7
     durationMs=195366 error=FailoverError
   [model-fallback/decision] model fallback decision:
     decision=candidate_failed
     requested=claude-cli/claude-opus-4-7
     candidate=claude-cli/claude-opus-4-7
     reason=unknown next=claude-cli/claude-sonnet-4-6
     detail=Claude CLI failed.
   [agent/cli-backend] cli exec:
     provider=claude-cli model=sonnet promptChars=N
     trigger=user useResume=true session=present
     resumeSession=<short> reuse=reusable historyPrompt=present
   [agent/cli-backend] claude live session turn failed:
     provider=claude-cli model=claude-sonnet-4-6
     durationMs=183725 error=AbortError
   ```

   Both Opus (~195s) and Sonnet (~183s) failover candidates time out the same way.

5. Watchdog escalates to recovery at ~6 minutes:

   ```
   [diagnostic] stuck session recovery:
     sessionId=<uuid> sessionKey=agent:main:telegram:group:<chat>:topic:<id>
     age=N action=abort_embedded_run aborted=true drained=true|false released=0
   [diagnostic] stuck session recovery outcome:
     status=aborted action=abort_embedded_run
     sessionId=<uuid> ... activeWorkKind=embedded_run
     lane=session:agent:main:telegram:group:<chat>:topic:<id>
     aborted=true drained=true|false forceCleared=false released=0
   ```

   - `drained=false` if the abort fired before any tokens were emitted: in-flight content is discarded silently and the user never sees a reply.
   - `drained=true` if the abort fired after streaming started: the abort handler calls Telegram's `deleteMessage` on the in-flight partial message, so the user briefly sees a partial post that then disappears. The gateway journal does NOT log the `deleteMessage` HTTP call separately; the cleanup is hidden inside the abort.

6. The same session UUID is reused across multiple aborts on the same lane. The next inbound to that lane stalls the same way, and this repeats until the OpenClaw session record is manually purged.

7. Sub-agents spawned from a wedged parent complete their own work but can't deliver back:

   ```
   [warn] Subagent announce give up (retry-limit) run=<x> child=<y>
     requester=agent:main:telegram:group:<chat>:topic:<id>
     retries=3 endedAgo=Ns
     deliveryError="completion agent did not deliver through the message tool;
                    direct-primary: completion agent did not deliver through the message tool"
   ```

   Once all parent-side delivery retries are exhausted, the `direct-primary` fallback path routes the sub-agent's output to the user's DM with the originating bot instead. As a result, messages composed in the context of a group topic appear in the user's DM. This is the cross-topic-to-DM "message jumping" symptom users report.
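
For triage, the stalled-session entries from step 3 can be picked out of a journal dump mechanically. The following is a minimal sketch, not part of OpenClaw: `parse_fields` and `looks_wedged` are hypothetical helper names, the sample entry uses placeholder values (`CHAT`, `42`, `130s`) in place of the redacted ones, and the parser assumes unquoted `key=value` tokens, so quoted fields such as `deliveryError="..."` would be truncated at the first space.

```python
import re

# Matches unquoted key=value tokens such as "state=processing".
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(entry: str) -> dict[str, str]:
    """Collect key=value pairs from one (possibly line-wrapped) log entry."""
    return dict(FIELD_RE.findall(entry))


def looks_wedged(fields: dict[str, str]) -> bool:
    """True when the entry shows the signature described in this report:
    a stalled agent run that never advanced past embedded_run:started."""
    return (
        fields.get("classification") == "stalled_agent_run"
        and fields.get("lastProgress") == "embedded_run:started"
        and fields.get("sessionId") == "unknown"
    )


# Placeholder values stand in for the redacted chat id, topic id, and ages.
SAMPLE = """[diagnostic] stalled session: sessionId=unknown
  sessionKey=agent:main:telegram:group:CHAT:topic:42
  state=processing age=130s queueDepth=1
  reason=active_work_without_progress
  classification=stalled_agent_run
  activeWorkKind=embedded_run
  lastProgress=embedded_run:started
  lastProgressAge=130s recovery=none"""

if __name__ == "__main__":
    fields = parse_fields(SAMPLE)
    print(fields["sessionKey"], looks_wedged(fields))
```

Grouping the matches by `sessionKey` gives a quick list of the lanes that are currently wedged.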

## Root cause hypothesis

The gateway's OpenClaw session record and `claude-cli`'s local project transcript at `~/.claude/projects/<workspace>/<uuid>.jsonl` share the same UUID but are stored in two separate locations. When a NEW group-topic session is created post-2026.5.20:

| Store | State for a newly-minted lane |
|---|---|
| OpenClaw `agents/<id>/sessions/sessions.json` + `*-topic-<id>.jsonl` | Created, contains the user's first turn |
| `~/.claude/projects/<workspace>/<uuid>.jsonl` | **Never written** |

When the gateway then invokes `claude -p --resume <uuid>`, `claude-cli` finds no matching transcript and hangs trying to resume a session that doesn't exist on its side, instead of failing fast or auto-creating fresh state. The two failover attempts (Opus, then Sonnet, at ~195 s and ~183 s) both hang the same way before the watchdog fires.

The documented safety net — *"Stored session ids are verified against an existing readable project transcript before resume; phantom bindings are cleared with reason=transcript-missing instead of silently starting a fresh Claude CLI session under --resume"* (per `docs.openclaw.ai/gateway/cli-backends`) — does **not** appear to be firing for group-topic lanes on 2026.5.20. Either:

1. The check is wired only for `lane=main`, not for `agent:main:telegram:group:*:topic:*` lanes, or
2. The check was bypassed by the new code path added in #19328 ("preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates", shipped in 2026.5.20).

Notably, **Telegram DM lanes are unaffected**, most likely because the DM session predates 2026.5.20 and has long-standing `~/.claude/projects/<workspace>/<uuid>.jsonl` state. Only newly-minted sessions hit the wedge.
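
To make the expected behaviour concrete, here is a minimal sketch of the verify-before-resume guard as the quoted documentation describes it. It is an illustration under stated assumptions, not OpenClaw's implementation: `build_cli_args`, `has_readable_transcript`, and `transcript_path` are hypothetical names, `<workspace>` is treated as an opaque directory name, and the prompt argument is omitted from the argv.

```python
from pathlib import Path
from typing import Optional


def transcript_path(projects_dir: Path, workspace: str, session_id: str) -> Path:
    """Location of the project transcript, per the layout quoted in this report."""
    return projects_dir / workspace / f"{session_id}.jsonl"


def has_readable_transcript(path: Path) -> bool:
    """True only if the transcript exists and can actually be opened for reading."""
    try:
        with path.open("rb") as fh:
            fh.read(1)
        return True
    except OSError:
        return False


def build_cli_args(
    projects_dir: Path, workspace: str, session_id: str
) -> tuple[list[str], Optional[str]]:
    """Return (argv, cleared_reason).

    Resume only when the transcript is readable. Otherwise drop the phantom
    binding and report reason "transcript-missing" so the caller starts a
    fresh session instead of passing --resume for an unknown UUID.
    """
    base = ["claude", "-p"]
    if has_readable_transcript(transcript_path(projects_dir, workspace, session_id)):
        return base + ["--resume", session_id], None
    return base, "transcript-missing"
```

If the guard ran for group-topic lanes, the first turn on a newly-minted session would take the second branch, and the `--resume` invocation seen in the logs above would not occur.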

## Reproducer

1. Run OpenClaw 2026.5.20 with `cliBackends.claude-cli` pointing at Claude Code CLI.
2. Configure a Telegram channel with a supergroup that has forum topics enabled; allow at least one user to post.
3. Have the user post a message to a topic that has no prior OpenClaw session for that lane (i.e. `agents/<id>/sessions/` contains no `<uuid>-topic-<topicId>.jsonl` for that topic).
4. Observe the gateway journal: the lane enters `processing` state, `embedded_run:started`, no progress, and is aborted by the watchdog at ~6 minutes.
5. Inspect storage: the new session UUID exists in `agents/<id>/sessions/sessions.json` and as `<uuid>-topic-<topicId>.jsonl`, but no corresponding file exists at `~/.claude/projects/<workspace>/<uuid>.jsonl`.
6. User posts again to the same topic. Same outcome.
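
The storage check in step 5 can be scripted. This is a rough sketch under assumptions: `find_orphans` is a hypothetical helper, session UUIDs are assumed to be in the canonical 36-character form, and the caller supplies both directories (the OpenClaw `agents/<id>/sessions/` directory and the `~/.claude/projects/<workspace>/` directory) because their exact locations depend on the deployment.

```python
import re
from pathlib import Path

# File names follow the <uuid>-topic-<topicId>.jsonl shape described above.
TOPIC_FILE_RE = re.compile(
    r"^(?P<uuid>[0-9a-fA-F-]{36})-topic-(?P<topic>[^.]+)\.jsonl$"
)


def find_orphans(sessions_dir: Path, claude_project_dir: Path) -> list[tuple[str, str]]:
    """Return (uuid, topic_id) pairs that have an OpenClaw topic session file
    but no matching <uuid>.jsonl transcript on the claude-cli side."""
    orphans = []
    for entry in sorted(sessions_dir.glob("*-topic-*.jsonl")):
        match = TOPIC_FILE_RE.match(entry.name)
        if match is None:
            continue  # not a topic session file in the expected shape
        transcript = claude_project_dir / f"{match['uuid']}.jsonl"
        if not transcript.is_file():
            orphans.append((match["uuid"], match["topic"]))
    return orphans
```

On an affected host, every wedged topic lane should show up in the returned list, while lanes that still reply should not.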

## Workaround in use

Manual quarantine and store cleanup, per the unsustainable workaround documented in #44687:

```
mv agents/<id>/sessions/<uuid>-topic-<topicId>.* /quarantine/
openclaw sessions cleanup --fix-missing --enforce --active-key "agent:<id>:telegram:direct:<user>"
```

Confirmed: this unblocks the affected topic, the next inbound creates a fresh session UUID, and replies start flowing again. **But within hours, new sessions wedge the same way**: the first run on a new UUID hits the same hang, the watchdog aborts it, and the lane re-enters the wedge state. Keeping the lanes usable over a 24-hour period requires repeated manual cleanup.

A gateway restart alone does NOT fix this (verified). The OpenClaw session record is rehydrated from disk with the orphan UUID still present, and the resume hang reproduces on the first post-restart inbound.
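
Because the cleanup has to be repeated, the quarantine half of the workaround (the `mv` step) can be wrapped in a small helper. A sketch only: `quarantine_session` is a hypothetical name, the directories are passed in by the caller, and the `openclaw sessions cleanup` command shown above still has to be run afterwards.

```python
import shutil
from pathlib import Path


def quarantine_session(
    sessions_dir: Path, quarantine_dir: Path, session_id: str, topic_id: str
) -> list[Path]:
    """Move every <session_id>-topic-<topic_id>.* file into quarantine_dir.

    Mirrors the manual `mv` step; returns the new locations of the moved files.
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(sessions_dir.glob(f"{session_id}-topic-{topic_id}.*")):
        dest = quarantine_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Moving rather than deleting keeps the wedged session files available for later inspection.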

## Related upstream issues

- [#44687](https://github.com/openclaw/openclaw/issues/44687) (closed, fixed in 2026.3.x): "Stale session resume at gateway startup blocks lane=main indefinitely". Same symptom family, but only for `lane=main`, only for sessions inherited from a prior gateway lifetime. Our case is for group-topic lanes and for *newly-minted* sessions.
- [#71127](https://github.com/openclaw/openclaw/issues/71127) (closed, fixed in 2026.4.x): "Stuck processing sessions are detected but never aborted". Our case has detection + abort working; the underlying resume hang re-establishes immediately after each abort.
- [#19328](https://github.com/openclaw/openclaw/issues/19328) (shipped in 2026.5.20): "preserve fresh session overrides and metadata when stale cached agent-session entries race with store updates". **Suspected source of this regression**.
- [#82964](https://github.com/openclaw/openclaw/issues/82964) (shipped in 2026.5.22-beta.1): "skip stale embedded-run wake probes for dormant completion requesters". May address the sub-agent delivery cascade (the cross-topic-to-DM routing symptom).
- [#84949](https://github.com/openclaw/openclaw/issues/84949) (shipped in 2026.5.22-beta.1): "bound embedded auto-compaction session write-lock watchdogs to the compaction timeout". Related lane-state cleanup work.
- [#81191](https://github.com/openclaw/openclaw/issues/81191) (closed): event-loop starvation from `startAccount` Telegram polling. Different root cause, but the symptom magnitude (400+ second event-loop delays) is in the same range as our 195 s and 183 s timeouts.

## Suggested investigation paths

1. **Audit the new-session creation path** for the `agent:main:telegram:group:*:topic:*` lane shape. Does it go through the same write-claude-cli-state step as `agent:main:main` and `agent:main:telegram:direct:*`? If not, that's the gap.
2. **Re-verify the "verify-transcript-before-resume" code path** still fires for all lane shapes in 2026.5.20. If it was a `lane=main` only check, it needs to be generalized.
3. **Check whether #19328's fix** introduced an early return / state-reuse path that bypasses the safety net for new sessions.
4. **Confirm whether 2026.5.22-beta.1 fixes this.** We have not upgraded, so this is untested on our side.

## What we did NOT include

Sanitized for privacy:

- Specific user IDs, Telegram chat IDs, topic names, bot username
- Workspace contents, third-party PII referenced in any wedged session
- Per-skill names that reveal the deployment's business use

If maintainers need additional unredacted log excerpts or sessions.json snippets to reproduce, those can be shared privately on request.

[Bug]: Newly-created Telegram group-topic sessions wedge indefinitely on first inbound — claude-cli --resume hangs against UUID with no prior project transcript (2026.5.20 regression) #86095
