Webchat: substantive tool turns intermittently hang in post-tool-result generation → ~365s abort_embedded_run while the host event loop stays healthy, and the turn is silently lost — v2026.5.22 (a374c3a) #86895
Closed
Labels
- P1: High-priority user-facing bug, regression, or broken workflow.
- clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
- clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
- clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- impact:data-loss: Can lose, corrupt, or silently drop user/session/config data.
- impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
- impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
- issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
Summary
On v2026.5.22 (`a374c3a`), a substantive webchat turn that uses tools occasionally freezes after its `tool_result`s have already returned successfully. The embedded Claude-CLI run then produces zero progress for ~365s; the gateway's no-progress watchdog (`stuckSessionAbortMs`) fires `abort_embedded_run` → `AbortError`, and, because the transcript is persisted only on success (#86592), the turn vanishes with no trace. To the user, webchat appears to "only answer trivial questions" (those are the fast turns that complete and persist).

The freeze is intermittent (I could not reproduce it on demand), but I captured strong direct evidence about where it occurs.
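For readers unfamiliar with the mechanism, a no-progress watchdog of the kind described above can be sketched as follows. This is a minimal illustration, not the gateway's code: the class name, the `warnAfterMs` and `warnEveryMs` fields, and the numeric values used with it are assumptions; only the idea of an abort threshold (the role played by `stuckSessionAbortMs`) and the 30s warn cadence come from this report.

```typescript
// Hypothetical sketch of a no-progress watchdog; names and shape are assumptions.
type WatchdogAction = "ok" | "warn" | "abort";

interface WatchdogConfig {
  warnAfterMs: number; // assumed: age at which warnings begin
  warnEveryMs: number; // assumed: warning cadence (30s in this report)
  abortAfterMs: number; // plays the role of stuckSessionAbortMs
}

class NoProgressWatchdog {
  private cfg: WatchdogConfig;
  private lastProgressAt: number;
  private lastWarnAt: number | null = null;

  constructor(cfg: WatchdogConfig, startedAt: number) {
    this.cfg = cfg;
    this.lastProgressAt = startedAt;
  }

  // Any reply/tool/status/block event counts as progress and resets the clock.
  recordProgress(now: number): void {
    this.lastProgressAt = now;
    this.lastWarnAt = null;
  }

  // Called on a periodic tick; decides what to do based on time since last progress.
  check(now: number): WatchdogAction {
    const age = now - this.lastProgressAt;
    if (age >= this.cfg.abortAfterMs) return "abort";
    const warnDue =
      this.lastWarnAt === null || now - this.lastWarnAt >= this.cfg.warnEveryMs;
    if (age >= this.cfg.warnAfterMs && warnDue) {
      this.lastWarnAt = now;
      return "warn";
    }
    return "ok";
  }
}
```

Because the clock resets on every progress event, reaching the abort threshold in a design like this implies that no progress event arrived for the entire window.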
🔍 Direct evidence: the host gateway is healthy throughout the hang
During one captured 366s hang the gateway's own event loop stayed responsive the whole time:
- `eventLoopMax` spikes were logged elsewhere in the day, but none in the hang window.
- The diagnostic `stuckSessionWarn` timer fired exactly on schedule, every 30s, for the full duration: `age=126s … 156 … 186 … 216 … 246 … 276 … 306 … 336 … 366s → abort`.
- `2026-05-26T20:03:01.799+09:00`:

➡️ The host event loop was healthy and timing the hang correctly. The freeze is inside the embedded CLI run's post-`tool_result` generation step, not host-side event-loop starvation, and not in any tool (the `tool_result`s returned `is_error:false` before the stall).

🎯 Most telling data point: identical prompt, opposite outcome
Same URL, same prompt string, same model (`claude-opus-4-6`), same build, same auth path; the only difference is the session:
- `…:main` webchat (at hang)
- fresh session (`--session-key`, repro)

A large-input control (fetch full Wikipedia "Artificial intelligence" + summarize, fresh session) also completed fine (`durationMs=55661`). So the hang is not coupled to tool-result size or page content; the only uncontrolled axis is session identity (long-lived `main` vs a fresh session).

Environment
- v2026.5.22 (`a374c3a`), macOS, launchd gateway (loopback :18789)
- `runner:"cli"`, `winnerProvider:"claude-cli"`, `fallbackUsed:false`
- Model: `claude-opus-4-6`

Observed sequence (the hang)
1. `…:main` session: "read/summarize ".
2. `ToolSearch` → success; `WebFetch` → success (`is_error:false`). Tools are done.
3. Post-`tool_result` generation emits no progress events (no reply/tool/status/block) for ~365s.
4. `stuckSessionWarn` fires every 30s; at `age=366s`, `stuckSessionAbortMs` → `abort_embedded_run` / `AbortError`.

Expected vs actual
genuinely stalls — the user sees an error and the attempt is recorded.
What this rules out
- `MissingAgentHarnessError` on inbound dispatch under event-loop starvation (~17–28s, self-healing). Here the host loop is healthy and the wait is 365s of true zero progress.
- No `eventLoopMax` spike in the hang window; warn timer on schedule.
- `fallbackUsed:false`, no credential error; the same auth path succeeds in a fresh session.
- `stuckSessionAbortMs` is a no-progress timer; hitting the full ~365s means genuinely zero progress, so raising it would only lengthen the wedge.
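The "warn timer on schedule" argument can be checked mechanically. The sketch below takes the `stuckSessionWarn` ages quoted earlier in this issue and measures how far any gap strays from the expected 30s cadence. The function names and the 2s tolerance are illustrative assumptions, and because the logged ages are whole seconds, sub-second drift would not show up here.

```typescript
// Ages (in seconds) of the stuckSessionWarn ticks quoted in this issue.
const warnAgesS: number[] = [126, 156, 186, 216, 246, 276, 306, 336, 366];

// Largest deviation of any gap between consecutive ticks from the expected interval.
function maxTimerDriftS(agesS: number[], expectedIntervalS: number): number {
  let worst = 0;
  let prev: number | undefined;
  for (const age of agesS) {
    if (prev !== undefined) {
      worst = Math.max(worst, Math.abs(age - prev - expectedIntervalS));
    }
    prev = age;
  }
  return worst;
}

// Hypothetical heuristic: a starved event loop would deliver timer ticks late,
// so a drift above the tolerance is treated as a sign of starvation.
function looksStarved(
  agesS: number[],
  expectedIntervalS: number,
  toleranceS: number,
): boolean {
  return maxTimerDriftS(agesS, expectedIntervalS) > toleranceS;
}

console.log(maxTimerDriftS(warnAgesS, 30)); // 0
console.log(looksStarved(warnAgesS, 30, 2)); // false
```

A starvation episode of the ~17–28s magnitude mentioned above would push at least one gap well past 30s, which is not what the captured ages show.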
Relationship to #86592
#86592 (persist-only-on-success) is what makes this invisible: `persistTextTurnTranscript` writes the user+assistant turn only after success, so the aborted turn leaves no trace and the user concludes "webchat only answers trivially." The two compound: (1) the generation stall loses the turn; (2) #86592 hides that it ever ran.
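To make the compounding effect concrete, here is a minimal sketch contrasting the persist-only-on-success pattern with one that records the user turn before the attempt and records the failure afterwards. Everything in it is hypothetical (the in-memory `Entry[]` log, the `Attempt` type, and both function names); it does not reproduce `persistTextTurnTranscript` or propose a specific fix for #86592.

```typescript
// Hypothetical in-memory transcript standing in for the real session store.
type Entry =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string }
  | { role: "error"; text: string };

type Attempt = (prompt: string) => Promise<string>;

// Pattern described in this issue: nothing is written unless the attempt succeeds.
async function runPersistOnSuccess(
  log: Entry[],
  prompt: string,
  attempt: Attempt,
): Promise<void> {
  const reply = await attempt(prompt); // a throw here skips both writes
  log.push({ role: "user", text: prompt }, { role: "assistant", text: reply });
}

// Alternative sketch: write the user turn first, then record the outcome either way.
async function runPersistAlways(
  log: Entry[],
  prompt: string,
  attempt: Attempt,
): Promise<void> {
  log.push({ role: "user", text: prompt });
  try {
    const reply = await attempt(prompt);
    log.push({ role: "assistant", text: reply });
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log.push({ role: "error", text: message });
    throw err; // still surface the failure to the caller
  }
}
```

With an attempt that rejects (as an aborted run would), the first function leaves the log empty, while the second leaves a user entry followed by an error entry, so the lost turn is at least visible afterwards.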
Candidate code locations (pointers, not a diagnosis)
- `src/logging/diagnostic-stuck-session-recovery.runtime.ts` (`abortAndDrainEmbeddedPiRun`), `src/logging/diagnostic.ts` (`resolveStuckSessionAbortMs`).
- Post-`tool_result` generation in the cli-runner: `src/agents/cli-runner/prepare.ts`, `…/session-history.ts` (missing-transcript reset + raw-history reseed on every turn).
- `src/agents/command/attempt-execution.ts` (`persistTextTurnTranscript`; "Inbound user messages are not persisted to session JSONL when the agent attempt throws", #86592).

Open questions
The correlation suggests accumulated state in the long-lived `main` session (history reseed?) may be implicated: can a long `main` session's reseeded history put the cli-runner into a state where post-`tool_result` generation deadlocks?