Webchat: substantive tool turns intermittently hang in post-tool-result generation → ~365s abort_embedded_run while the host event loop stays healthy, and the turn is silently lost — v2026.5.22 (a374c3a) #86895

@ryota-murakami

Summary

On v2026.5.22 (a374c3a), a substantive webchat turn that uses tools occasionally
freezes after its tool_results have already returned successfully. The embedded
Claude-CLI run then makes no progress for ~365s, until the gateway's no-progress
watchdog (stuckSessionAbortMs) fires abort_embedded_run and the run fails with
AbortError. Because the transcript is persisted only on success (#86592), the turn
vanishes without a trace. To the user, webchat appears to "only answer trivial
questions", because only the fast turns complete and persist.
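
As a rough illustration of the watchdog behaviour described above, the following sketch classifies a session by how long it has gone without a progress event. The names (`classifyStall`, `WatchdogConfig`, `exampleConfig`) and the threshold values are assumptions made for this example; they are not taken from the OpenClaw source or its defaults.

```typescript
// Hypothetical sketch of a no-progress watchdog decision.
// Names and thresholds are illustrative, not OpenClaw's actual code or defaults.
type WatchdogAction = "ok" | "warn" | "abort";

interface WatchdogConfig {
  warnAfterMs: number;  // start logging warnings after this much silence
  abortAfterMs: number; // abort the embedded run after this much silence
}

// Decide what to do based on the time elapsed since the last progress event
// (reply, tool, status, or block event).
function classifyStall(
  lastProgressMs: number,
  nowMs: number,
  cfg: WatchdogConfig,
): WatchdogAction {
  const silentForMs = nowMs - lastProgressMs;
  if (silentForMs >= cfg.abortAfterMs) return "abort";
  if (silentForMs >= cfg.warnAfterMs) return "warn";
  return "ok";
}

// Illustrative thresholds only.
const exampleConfig: WatchdogConfig = { warnAfterMs: 120000, abortAfterMs: 360000 };
```

Under these assumed thresholds, a run that last made progress at t=0 is classified as "warn" at 126s and as "abort" at 366s, while any progress event resets the clock.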

The freeze is intermittent (I could not reproduce it on demand), but I captured
strong direct evidence about where it occurs.

🔍 Direct evidence: the host gateway is healthy throughout the hang

During one captured 366s hang the gateway's own event loop stayed responsive the whole time:

  • eventLoopMax spikes were logged elsewhere in the day, but none in the hang window.
  • The diagnostic stuckSessionWarn timer fired exactly on schedule, every 30s, for
    the full duration: age=126s … 156 … 186 … 216 … 246 … 276 … 306 … 336 … 366s → abort.
  • Abort at 2026-05-26T20:03:01.799+09:00:
    [agent/cli-backend] claude live session turn failed: provider=claude-cli model=claude-opus-4-6 durationMs=365055 error=AbortError
    [diagnostic] stuck session recovery: sessionKey=agent:<redacted>:main age=366s action=abort_embedded_run
    [diagnostic] stuck session recovery outcome: status=aborted action=abort_embedded_run
    

➡️ The host event loop was healthy and timed the hang correctly. The freeze is inside
the embedded CLI run's post-tool_result generation step, not host-side event-loop
starvation, and not in any tool (the tool_results returned is_error:false before the stall).
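
The on-schedule timer argument can be checked mechanically. The sketch below takes the stuckSessionWarn ages quoted above and computes the largest deviation from the expected 30s cadence; `maxTimerDrift` is a hypothetical helper written for this issue, not part of the gateway.

```typescript
// Ages (in seconds) reported by successive stuckSessionWarn lines during the hang.
const warnAges: readonly number[] = [126, 156, 186, 216, 246, 276, 306, 336, 366];

// Hypothetical helper: largest absolute deviation (in seconds) between
// consecutive warnings and the expected interval. A starved event loop
// would show up as late timers, i.e. a non-zero drift.
function maxTimerDrift(ages: readonly number[], expectedIntervalS: number): number {
  let worst = 0;
  let prev: number | undefined;
  for (const age of ages) {
    if (prev !== undefined) {
      worst = Math.max(worst, Math.abs(age - prev - expectedIntervalS));
    }
    prev = age;
  }
  return worst;
}

const driftS = maxTimerDrift(warnAges, 30);
```

The logged ages have one-second resolution, so this only rules out delays of roughly a second or more, which is still far below what event-loop starvation lasting minutes would produce.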

🎯 Most telling data point: identical prompt, opposite outcome

Same URL, same prompt string, same model (claude-opus-4-6), same build, same auth path —
the only difference is the session:

| Session | Result |
| --- | --- |
| long-lived …:main webchat (at hang) | 365s zero-progress → abort |
| fresh isolated session (--session-key, repro) | 166s of steady progress → real reply ✅ |

A large-input control (fetch full Wikipedia "Artificial intelligence" + summarize, fresh
session) also completed fine (durationMs=55661). So the hang is not coupled to
tool-result size or page content — the only uncontrolled axis is session identity
(long-lived main vs a fresh session).

Environment

  • OpenClaw v2026.5.22 (build a374c3a), macOS, launchd gateway (loopback :18789)
  • Turns run through the local Claude CLI runtime (runner:"cli", winnerProvider:"claude-cli", fallbackUsed:false)
  • Model claude-opus-4-6

Observed sequence (the hang)

  1. Webchat turn in the persistent …:main session: "read/summarize <URL>".
  2. ToolSearch → success; WebFetch → success (is_error:false). Tools are done.
  3. Post-tool_result generation emits no progress events (no reply/tool/status/block) for ~365s.
  4. stuckSessionWarn fires every 30s; at age=366s, stuckSessionAbortMs → abort_embedded_run / AbortError.
  5. Turn is discarded and not persisted (#86592: inbound user messages are not persisted to session JSONL when the agent attempt throws) → the user never sees that it happened.

Expected vs actual

  • Expected: generation makes progress (as it does in a fresh session, 166s), or — if it
    genuinely stalls — the user sees an error and the attempt is recorded.
  • Actual: silent ~365s wedge then abort, no transcript persisted; webchat looks like it "only does pong".

What this rules out

Relationship to #86592

#86592 (persist-only-on-success) is what makes this invisible: persistTextTurnTranscript
writes the user+assistant turn only after success, so the aborted turn leaves no trace and the
user concludes "webchat only answers trivially." The two compound: (1) the generation stall
loses the turn; (2) #86592 hides that it ever ran.
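
For contrast with persist-only-on-success, here is a minimal sketch of an attempt wrapper that records the inbound user message before the run starts and records the failure if the run throws. Everything in it (`TranscriptEntry`, `runTurnWithDurableTranscript`, the in-memory array standing in for the session JSONL) is hypothetical and synchronous for brevity; it is not how persistTextTurnTranscript is implemented.

```typescript
// Hypothetical transcript entry; the real session JSONL schema is not shown in this issue.
interface TranscriptEntry {
  role: "user" | "assistant" | "error";
  text: string;
}

// Sketch: persist the user message first, then record either the reply
// or the failure, so an aborted turn still leaves a trace.
function runTurnWithDurableTranscript(
  transcript: TranscriptEntry[],
  userText: string,
  generate: (prompt: string) => string, // stand-in for the embedded run
): TranscriptEntry[] {
  transcript.push({ role: "user", text: userText });
  try {
    transcript.push({ role: "assistant", text: generate(userText) });
  } catch (err) {
    const reason = err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    transcript.push({ role: "error", text: reason });
  }
  return transcript;
}

// Usage: a run that is aborted still leaves the user message and the error behind.
const aborted = runTurnWithDurableTranscript([], "read/summarize", () => {
  const e = new Error("no progress");
  e.name = "AbortError";
  throw e;
});
```

With a wrapper of this shape, the aborted turn would surface as a user entry followed by an error entry instead of disappearing.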

Candidate code locations (pointers, not a diagnosis)

  • Symptom/abort path: src/logging/diagnostic-stuck-session-recovery.runtime.ts (abortAndDrainEmbeddedPiRun),
    src/logging/diagnostic.ts (resolveStuckSessionAbortMs).
  • Likely root: post-tool_result generation in the cli-runner — src/agents/cli-runner/prepare.ts,
    …/session-history.ts (missing-transcript reset + raw-history reseed on every turn).
  • Compounding visibility: src/agents/command/attempt-execution.ts (persistTextTurnTranscript; see #86592, "Inbound user messages are not persisted to session JSONL when the agent attempt throws").

Open questions

  • Reproduced N=1 (hang) vs N=3 (success); could not reproduce on demand. The session-identity
    correlation suggests accumulated state in the long-lived main session (history reseed?) may be
    implicated — can a long main session's reseeded history put the cli-runner into a state where
    post-tool_result generation deadlocks?
  • Is the embedded run waiting on the model stream, or on an internal queue/lock that never signals progress?

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:data-loss: Can lose, corrupt, or silently drop user/session/config data.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
