Webchat: substantive tool turns intermittently hang in post-tool-result generation → ~365s abort_embedded_run while the host event loop stays healthy, and the turn is silently lost — v2026.5.22 (a374c3a)

## Summary
On v2026.5.22 (`a374c3a`), a substantive webchat turn that uses tools occasionally
freezes **after** its `tool_result`s have already returned successfully. The embedded
Claude-CLI run then produces **zero progress for ~365s**; the gateway's no-progress
watchdog (`stuckSessionAbortMs`) fires `abort_embedded_run` → `AbortError`, and — because
the transcript is persisted only on success (#86592) — the turn vanishes with no trace.
To the user, webchat appears to "only answer trivial questions" (those are the fast turns
that complete and persist).

The freeze is **intermittent** (I could not reproduce it on demand), but I captured
strong direct evidence about *where* it occurs.

## 🔍 Direct evidence: the host gateway is healthy throughout the hang
During one captured 366s hang the gateway's own event loop stayed responsive the whole time:
- `eventLoopMax` spikes were logged at other times that day, but **none in the hang window**.
- The `diagnostic` `stuckSessionWarn` timer fired **exactly on schedule, every 30s**, for
  the full duration: `age=126s … 156 … 186 … 216 … 246 … 276 … 306 … 336 … 366s → abort`.
- Abort at `2026-05-26T20:03:01.799+09:00`:
  ```
  [agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-6 durationMs=365055 error=AbortError
  [diagnostic] stuck session recovery: sessionKey=agent:<redacted>:main age=366s action=abort_embedded_run
  [diagnostic] stuck session recovery outcome: status=aborted action=abort_embedded_run
  ```
➡️ The host event loop was healthy and *timing the hang correctly*. The freeze is **inside
the embedded CLI run's post-`tool_result` generation step** — not host-side event-loop
starvation, and not in any tool (the `tool_result`s returned `is_error:false` before the stall).
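
For anyone cross-checking their own logs, here is a minimal sketch of the two checks behind the claims above: pulling `key=value` fields out of the abort line, and confirming that the warning ages are evenly spaced. The helper names (`parseFields`, `warnIntervals`) are mine, not anything in the codebase; the inputs are the log line and age sequence quoted in this section.

```typescript
// Split a log line on whitespace and keep only the `key=value` tokens.
function parseFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const token of line.split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) {
      fields[token.slice(0, eq)] = token.slice(eq + 1);
    }
  }
  return fields;
}

// Differences between consecutive warning ages, in seconds.
function warnIntervals(agesSec: number[]): number[] {
  const deltas: number[] = [];
  let prev: number | undefined;
  for (const age of agesSec) {
    if (prev !== undefined) {
      deltas.push(age - prev);
    }
    prev = age;
  }
  return deltas;
}

// The abort line quoted above.
const failedLine =
  "[agent/cli-backend] claude live session turn failed: " +
  "provider=claude-cli model=claude-opus-4-6 durationMs=365055 error=AbortError";
const failed = parseFields(failedLine);
const durationSec = Number(failed["durationMs"]) / 1000;

// The `stuckSessionWarn` ages quoted above.
const warnAgesSec = [126, 156, 186, 216, 246, 276, 306, 336, 366];
const intervals = warnIntervals(warnAgesSec);
const onSchedule = intervals.every((delta) => delta === 30);
```

If the host loop had been starved, I would expect at least one interval noticeably larger than 30; here every interval is exactly 30.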

## 🎯 Most telling data point: identical prompt, opposite outcome
Same URL, same prompt string, same model (`claude-opus-4-6`), same build, same auth path —
the only difference is the session:

| Session | Result |
|---|---|
| long-lived `…:main` webchat (at hang) | **365s zero-progress → abort** |
| fresh isolated session (`--session-key`, repro) | **166s of steady progress → real reply ✅** |

A large-input control (fetch the full Wikipedia "Artificial intelligence" article and summarize
it, fresh session) also completed fine (`durationMs=55661`). So the hang does **not** appear to be
coupled to tool-result size or page content; the only axis that differed between the hang and
the successful runs is **session identity** (long-lived `main` vs a fresh session).

## Environment
- OpenClaw **v2026.5.22** (build `a374c3a`), macOS, launchd gateway (loopback :18789)
- Turns run through the local Claude CLI runtime (`runner:"cli"`, `winnerProvider:"claude-cli"`, `fallbackUsed:false`)
- Model `claude-opus-4-6`

## Observed sequence (the hang)
1. Webchat turn in the persistent `…:main` session: "read/summarize <a long-form web article>".
2. `ToolSearch` → success; `WebFetch` → success (`is_error:false`). Tools are done.
3. Post-`tool_result` generation emits **no progress events** (no reply/tool/status/block) for ~365s.
4. `stuckSessionWarn` fires every 30s; at `age=366s`, `stuckSessionAbortMs` → `abort_embedded_run` / `AbortError`.
5. Turn is discarded and **not persisted** (#86592) → the user never sees it happened.
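
To make steps 3 and 4 concrete, here is a rough model of a no-progress watchdog as I understand it from the logs. It is a sketch, not the gateway's implementation: the class, the config field names, and the threshold values (120s first warning, 30s cadence, 360s abort, timer offset of 6s) are all assumptions, chosen only so the simulated timeline resembles the logged one. The real thresholds are not stated in this report.

```typescript
type WatchdogAction = "none" | "warn" | "abort";

interface WatchdogConfig {
  warnAfterMs: number; // assumed: no-progress age at which warnings start
  warnEveryMs: number; // assumed: spacing between warnings
  abortAfterMs: number; // stands in for stuckSessionAbortMs
}

class NoProgressWatchdog {
  private readonly config: WatchdogConfig;
  private lastProgressAt: number;
  private lastWarnAt: number | undefined;

  constructor(config: WatchdogConfig, startedAt: number) {
    this.config = config;
    this.lastProgressAt = startedAt;
  }

  // Any reply/tool/status/block event resets the no-progress clock.
  recordProgress(now: number): void {
    this.lastProgressAt = now;
    this.lastWarnAt = undefined;
  }

  // Called from a periodic host timer.
  check(now: number): WatchdogAction {
    const age = now - this.lastProgressAt;
    if (age >= this.config.abortAfterMs) return "abort";
    if (age < this.config.warnAfterMs) return "none";
    if (this.lastWarnAt === undefined || now - this.lastWarnAt >= this.config.warnEveryMs) {
      this.lastWarnAt = now;
      return "warn";
    }
    return "none";
  }
}

// Simulate a run that never reports progress and collect the watchdog's actions.
function simulateSilentRun(config: WatchdogConfig, firstTickMs: number, tickMs: number): string[] {
  const watchdog = new NoProgressWatchdog(config, 0);
  const timeline: string[] = [];
  for (let now = firstTickMs; ; now += tickMs) {
    const action = watchdog.check(now);
    if (action !== "none") timeline.push(`${now / 1000}s:${action}`);
    if (action === "abort") return timeline;
  }
}

const assumedConfig: WatchdogConfig = { warnAfterMs: 120000, warnEveryMs: 30000, abortAfterMs: 360000 };
const timeline = simulateSilentRun(assumedConfig, 6000, 30000);
```

Under these assumed values the timeline runs from `126s:warn` through `336s:warn` to `366s:abort`. The point of the model is that the abort only fires when `recordProgress` is never called for the whole window, which is why a larger abort threshold would only delay the same outcome.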

## Expected vs actual
- **Expected:** generation makes progress (as it does in a fresh session, 166s), or — if it
  genuinely stalls — the user sees an error and the attempt is recorded.
- **Actual:** a silent ~365s wedge, then an abort with no transcript persisted; webchat looks as if it only answers trivial prompts.

## What this rules out
- **NOT #86239** — that is `MissingAgentHarnessError` on inbound dispatch under event-loop
  starvation (~17–28s, self-healing). Here the host loop is healthy and the wait is 365s of true zero progress.
- **NOT load / event-loop starvation** — no `eventLoopMax` spike in the hang window; warn timer on schedule.
- **NOT auth/token** — `fallbackUsed:false`, no credential error; the same auth path succeeds in a fresh session.
- **NOT a misconfigured threshold** — `stuckSessionAbortMs` is a no-progress timer; hitting the
  full ~365s means genuinely zero progress, so raising it would only lengthen the wedge.

## Relationship to #86592
#86592 (persist-only-on-success) is what makes this *invisible*: `persistTextTurnTranscript`
writes the user+assistant turn only after success, so the aborted turn leaves no trace and the
user concludes "webchat only answers trivially." The two compound: (1) the generation stall
loses the turn; (2) #86592 hides that it ever ran.
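
A minimal sketch of the difference, under my own assumptions: the types and function names below are hypothetical and do not mirror the real `persistTextTurnTranscript` signature, which I have not reproduced here. It only contrasts "write on success" with "write on every settled attempt".

```typescript
// Outcome of one turn, as seen by whatever persists the transcript (hypothetical shape).
type TurnOutcome =
  | { kind: "ok"; assistantText: string }
  | { kind: "error"; error: Error };

interface TurnRecord {
  userText: string;
  assistantText: string | null;
  status: "completed" | "aborted" | "failed";
}

function toRecord(userText: string, outcome: TurnOutcome): TurnRecord {
  if (outcome.kind === "ok") {
    return { userText, assistantText: outcome.assistantText, status: "completed" };
  }
  const status = outcome.error.name === "AbortError" ? "aborted" : "failed";
  return { userText, assistantText: null, status };
}

// Behaviour as described in #86592: nothing is written unless the turn succeeds.
function persistOnSuccessOnly(transcript: TurnRecord[], userText: string, outcome: TurnOutcome): void {
  if (outcome.kind === "ok") {
    transcript.push(toRecord(userText, outcome));
  }
}

// Alternative: every settled attempt leaves a record, including aborts.
function persistEveryAttempt(transcript: TurnRecord[], userText: string, outcome: TurnOutcome): void {
  transcript.push(toRecord(userText, outcome));
}

// An abort like the one in this report.
const abortError = new Error("no progress; run aborted");
abortError.name = "AbortError";
const abortedOutcome: TurnOutcome = { kind: "error", error: abortError };

const successOnly: TurnRecord[] = [];
const everyAttempt: TurnRecord[] = [];
persistOnSuccessOnly(successOnly, "summarize this article", abortedOutcome);
persistEveryAttempt(everyAttempt, "summarize this article", abortedOutcome);
```

After the aborted turn, `successOnly` is empty while `everyAttempt` holds one record with `status: "aborted"`, which is the trace a user or maintainer would need to notice the stall.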

## Candidate code locations (pointers, not a diagnosis)
- Symptom/abort path: `src/logging/diagnostic-stuck-session-recovery.runtime.ts` (`abortAndDrainEmbeddedPiRun`),
  `src/logging/diagnostic.ts` (`resolveStuckSessionAbortMs`).
- Suspected area (unconfirmed): post-`tool_result` generation in the cli-runner, in `src/agents/cli-runner/prepare.ts`
  and `…/session-history.ts` (missing-transcript reset + raw-history reseed on every turn).
- Compounding visibility: `src/agents/command/attempt-execution.ts` (`persistTextTurnTranscript`, #86592).

## Open questions
- The sample is small: one observed hang versus three successful runs, and I could not
  reproduce the hang on demand. The session-identity correlation suggests that state
  accumulated in the long-lived `main` session (history reseed?) may be implicated. Can a long
  `main` session's reseeded history put the cli-runner into a state where post-`tool_result`
  generation deadlocks?
- Is the embedded run waiting on the model stream, or on an internal queue/lock that never signals progress?

