fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race#90505
Conversation
e6fbce8 to
2792830
Compare
|
Codex review: needs maintainer review before merge. Reviewed June 5, 2026, 3:36 AM ET / 07:36 UTC. Summary PR surface: Source +75, Tests +248. Total +323 across 6 files. Reproducibility: yes. from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the maintenance-owned repair if maintainers accept the bounded session-state tradeoff and the focused task/subagent CI remains green. Do we have a high-confidence way to reproduce the issue? Yes from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run Is this the best way to solve the issue? Yes, with maintainer acceptance of the session-state tradeoff: maintenance-owned finalization avoids writing AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a3ce7c2a8da. Label changesLabel justifications:
Evidence reviewedPR surface: Source +75, Tests +248. Total +323 across 6 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
d259227 to
c5e658b
Compare
…oid kill-vs-complete race Kill path writes setDetachedTaskDeliveryStatusByRunId(not_applicable) immediately; task-row status finalization is deferred to maintenance via a new optional getSubagentRunEndedAt hook on TaskRegistryMaintenanceRuntime. When the runtime is authoritative and endedAt is set, shouldMarkLost returns true before the 5-minute grace window (convergence ≤60 s vs ≤6 min). shouldApplyRunScopedStatusUpdate now allows lost→succeeded so a late lifecycle COMPLETE can override the maintenance-inferred lost outcome. Regression tests cover: stuck-running→lost, fresh-kill timing (before grace expires), live-run retained, non-authoritative skipped, and kill→maintenance-lost→late-COMPLETE→succeeded. Fixes openclaw#90444
c5e658b to
f63c221
Compare
Summary
markSubagentRunTerminatedkills a subagent run, the mirrored task rows stay stuck inrunningindefinitely. The linked report describes both the parentruntime='subagent'row and the childruntime='cli'row for the same run. Thecancelpath returns early because the run is already gone; maintenance had no subagent-run lookup so neither row could be cleared without a manual DB fix. Issue Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444.endedAt,cleanupHandled,suppressAnnounceReason) but never calledfailTaskRunByRunIdto close the mirrored task rows. WritingfailTaskRunByRunIdimmediately in the kill path would fix the stuck rows but opens a kill-vs-complete race: if a lifecycleCOMPLETEevent arrives concurrently,completeTaskRunByRunIdcannot transitionfailed → succeededbecauseshouldApplyRunScopedStatusUpdateblocks that direction, leaving the row permanently wrong.setDetachedTaskDeliveryStatusByRunId(not_applicable)immediately — killed runs never reach the announce flow. Task-row status finalization is deferred to maintenance, which reads terminal run state from a merged SQLite+memory snapshot via a new optionalgetSubagentRunEndedAthook onTaskRegistryMaintenanceRuntime. WhenendedAtis set for a run,shouldMarkLostreturnstrueearly — before the 5-minuteTASK_RECONCILE_GRACE_MSwindow — for bothruntime='subagent'and same-runruntime='cli'peer rows, so maintenance marks themloston the next sweep (≤ 60 s). The snapshot is SQLite-backed so CLI maintenance (openclaw tasks maintenance --apply) also resolves stuck rows. AshouldApplyRunScopedStatusUpdatechange allowslost → succeeded, and thelost → succeededupgrade also clears the staleerrorand resetscleanupAfterto the succeeded retention window.src/agents/subagent-registry-run-manager.ts— kill loop: removefailTaskRunByRunId; keepsetDetachedTaskDeliveryStatusByRunId(not_applicable).src/tasks/task-registry.maintenance.ts— add optionalgetSubagentRunEndedAthook; populate a per-sweep SQLite+memory snapshot viagetSubagentRunsSnapshotForRead; terminal-run early-exit inshouldMarkLostandhasBackingSessionnow covers bothruntime='subagent'and same-runruntime='cli'peer rows.src/tasks/task-registry.ts— allowlost → succeededinshouldApplyRunScopedStatusUpdate; clear lost-derivederrorand resetcleanupAfteron the upgrade.src/agents/subagent-registry.test.ts— add mock forsetDetachedTaskDeliveryStatusByRunId; update killed-run test.src/tasks/task-registry.maintenance.issue-60299.test.ts— addgetSubagentRunEndedAtto harness; add five regression tests for#90444.src/tasks/task-registry.test.ts— add regression for kill → maintenance lost → late COMPLETE →succeededwith field-level assertions.safeFinalizeSubagentTaskRun,completeTaskRunByRunId) is unchanged.subagent-registry-lifecycle.tsis unchanged.Reproduction
registerSubagentRun) which creates mirrored running task rows.markSubagentRunTerminated— the kill path persists terminal state to SQLite and callspersist().cancelon the task — returns early because the subagent run is already gone.hasBackingSessionreturnedtrueand both rows stayedrunning.runningwith no automatic recovery path.deliveryStatus = not_applicableimmediately; the next maintenance sweep loads a SQLite+memory snapshot, findsendedAtset, and marks both the subagent row and any CLI peer rowlost. A late lifecycle COMPLETE then upgradeslost → succeededand clears the lost-derived error and cleanup window.Real behavior proof
Behavior addressed (#90444): killed subagent task rows (both parent
runtime='subagent'and childruntime='cli') no longer stay stuck inrunning— maintenance detects the terminal run record via a SQLite-backed snapshot and finalizes the rows on the next sweep, without the kill-vs-complete race that would occur iffailTaskRunByRunIdwere called directly in the kill path.Real environment tested (Linux, Node 22 — Vitest against the production kill path, maintenance runtime, and
hasBackingSessionlogic):markSubagentRunTerminateddriving the realsubagent-registry-run-manager;runTaskRegistryMaintenancedriving the realhasBackingSessionwith the newgetSubagentRunEndedAthook;markTaskLostById+markTaskTerminalByRunIdexercising thelost → succeededtransition.Exact steps or command run after this patch:
node scripts/run-vitest.mjs src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/agents/subagent-registry.test.ts src/agents/subagent-registry-lifecycle.test.tsEvidence after fix (Vitest output):
The six new
#90444cases cover: terminal run → subagent row markedlost; live run (noendedAt) → row retained; non-authoritative runtime → terminal-run check still works via SQLite snapshot; freshly killed task (within grace window) → markedlostwithout waiting for grace expiry; same-run CLI peer row → markedlostalongside the subagent row; maintenance-lost + late COMPLETE →succeededwith cleared error and correct cleanup window.Observed result after fix: kill path writes
deliveryStatus = not_applicableimmediately; task-rowstatusstaysrunninguntil the next maintenance sweep (≤ 60 s) detectsendedAtin the SQLite+memory snapshot and writesstatus = lostfor both the subagent row and any CLI peer row; a late lifecycle COMPLETE then upgradeslost → succeeded.What was not tested: live gateway kill with a concurrent lifecycle
COMPLETEevent arriving in the same sweep window.Repro confirmation: the new
#90444maintenance tests fail on the pre-patch code and pass with the fix; the CLI peer row test fails before thetask.runtime === "cli"extension and passes after; thelost → succeededfield test fails before thepatch.error = undefined / patch.cleanupAfter = undefinednormalization and passes after.Risk / Mitigation
TASK_SWEEP_INTERVAL_MS; the previous approach had no deadline at all (rows stuck permanently).markSubagentRunTerminatedcallsparams.persist()before returning, writing the entry (withendedAt) to SQLite; the per-sweepgetSubagentRunsSnapshotForReadcall reads that persisted data, sogetSubagentRunEndedAtreturns the value in the first post-restart sweep.lost → succeededopens a window for unintended task resurrection. Mitigation:updateTaskStateByRunIdis scoped byrunId; only acompleteTaskRunByRunIdcall for the samerunIdcan trigger this transition, and that call originates exclusively from the authoritative lifecycle COMPLETE handler.sourceIdwith a non-killed terminal subagent run. Mitigation:getSubagentRunEndedAtreturnsendedAtfor any terminal run (killed or completed). If the parent subagent completed normally, its own lifecycle event would have already transitioned the CLI peer row to a terminal state before maintenance runs; maintenance only sees the CLI row asrunningwhen the normal finalization path failed (the exact failure the issue reports).Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #90444