Fix stuck-session recovery aborts for active reply runs#88925
Conversation
|
Codex review: needs real behavior proof before merge. Reviewed June 11, 2026, 12:57 AM ET / 04:57 UTC. Summary PR surface: Source +129, Tests +153. Total +282 across 10 files. Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge
Review detailsBest possible solution: Retry the Codex review after fixing the execution failure. Do we have a high-confidence way to reproduce the issue? Unclear. The review failed before ClawSweeper could establish a reproduction path. Is this the best way to solve the issue? Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction. AGENTS.md: unclear because the file could not be read completely. Codex review notes: model internal, reasoning high; reviewed against 418d7e1e83c5. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +129, Tests +153. Total +282 across 10 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
Summary
stuck_recoverythrough reply-run session-id fallback and embedded-run cancel plumbing so recovery no longer degrades into a fake user abortTesting
node scripts/run-vitest.mjs run --config test/vitest/vitest.auto-reply-reply.config.ts src/auto-reply/reply/reply-run-registry.test.tsnode scripts/run-vitest.mjs run --config test/vitest/vitest.logging.config.ts src/logging/diagnostic.test.ts src/logging/diagnostic-stuck-session-recovery.integration.test.tsnode scripts/run-vitest.mjs run --config test/vitest/vitest.agents-embedded-agent.config.ts src/agents/embedded-agent-runner/run/attempt.model-diagnostic-events.test.tsReal behavior proof
stuck_recoverythrough the reply-run session-id fallback and does not abort a live embedded run that still has fresh tracked progress.159943a6c0.pnpm exec tsx -against the real repo modules in this branch with a runtime harness that (1) creates a reply operation and aborts it throughabortReplyRunBySessionId(..., { reason: "stuck_recovery" }), then (2) registers an active embedded run, records fresh progress withmarkDiagnosticRunProgressForTest(...), and callsrecoverStuckDiagnosticSession(...)withallowActiveAbort: true.[reply-registry]
{"aborted":true,"result":{"kind":"aborted","code":"aborted_by_system","reason":"stuck_recovery"},"abortSignalReason":"Reply operation aborted by system reason=stuck_recovery"}
[stuck-recovery]
{"snapshotBefore":{"activeWorkKind":"embedded_run","hasActiveEmbeddedRun":true,"lastProgressAgeMs":1,"lastProgressReason":"model chunk observed from runtime harness"},"outcome":{"status":"skipped","action":"observe_only","reason":"active_embedded_run","sessionId":"session-live-progress","sessionKey":"agent:main:main","activeSessionId":"session-live-progress","activeWorkKind":"embedded_run"},"abortCount":0,"cancelReasons":[],"snapshotAfter":{"activeWorkKind":"embedded_run","hasActiveEmbeddedRun":true,"lastProgressAgeMs":84,"lastProgressReason":"model chunk observed from runtime harness"}}
[diagnostic] stuck session recovery skipped: sessionId=session-live-progress sessionKey=agent:main:main age=180s queueDepth=1 activeSessionId=session-live-progress
[diagnostic] stuck session recovery outcome: status=skipped action=observe_only sessionId=session-live-progress sessionKey=agent:main:main activeSessionId=session-live-progress activeWorkKind=embedded_run reason=active_embedded_run
reason: "stuck_recovery"instead of degrading into a user-abort shape. The live-progress recovery path returnsstatus: "skipped", leaves the embedded run active, and does not call abort or cancel on the active handle.claude-clichannel repro in this environment, and I did not capture UI/video proof from a live chat surface.Notes
diagnostics.stuckSessionWarnMs/diagnostics.stuckSessionAbortMsand the defaultwarnMs x 3/ minimum 5 minute behavior are already documented in the current docs/config help, so this PR focuses on the runtime regressions from#88870.Closes #88870