fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race by openperf · Pull Request #90505 · openclaw/openclaw

openperf · 2026-06-05T01:48:29Z

Summary

Problem: After markSubagentRunTerminated kills a subagent run, the mirrored task rows stay stuck in running indefinitely. The linked report describes both the parent runtime='subagent' row and the child runtime='cli' row for the same run. The cancel path returns early because the run is already gone; maintenance had no subagent-run lookup so neither row could be cleared without a manual DB fix. Issue Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444.
Root Cause: The kill path persisted terminal subagent-run state (endedAt, cleanupHandled, suppressAnnounceReason) but never called failTaskRunByRunId to close the mirrored task rows. Writing failTaskRunByRunId immediately in the kill path would fix the stuck rows but opens a kill-vs-complete race: if a lifecycle COMPLETE event arrives concurrently, completeTaskRunByRunId cannot transition failed → succeeded because shouldApplyRunScopedStatusUpdate blocks that direction, leaving the row permanently wrong.
Fix: Decouple the two concerns. Kill path writes setDetachedTaskDeliveryStatusByRunId(not_applicable) immediately — killed runs never reach the announce flow. Task-row status finalization is deferred to maintenance, which reads terminal run state from a merged SQLite+memory snapshot via a new optional getSubagentRunEndedAt hook on TaskRegistryMaintenanceRuntime. When endedAt is set for a run, shouldMarkLost returns true early — before the 5-minute TASK_RECONCILE_GRACE_MS window — for both runtime='subagent' and same-run runtime='cli' peer rows, so maintenance marks them lost on the next sweep (≤ 60 s). The snapshot is SQLite-backed so CLI maintenance (openclaw tasks maintenance --apply) also resolves stuck rows. A shouldApplyRunScopedStatusUpdate change allows lost → succeeded, and the lost → succeeded upgrade also clears the stale error and resets cleanupAfter to the succeeded retention window.
What changed:
- src/agents/subagent-registry-run-manager.ts — kill loop: remove failTaskRunByRunId; keep setDetachedTaskDeliveryStatusByRunId(not_applicable).
- src/tasks/task-registry.maintenance.ts — add optional getSubagentRunEndedAt hook; populate a per-sweep SQLite+memory snapshot via getSubagentRunsSnapshotForRead; terminal-run early-exit in shouldMarkLost and hasBackingSession now covers both runtime='subagent' and same-run runtime='cli' peer rows.
- src/tasks/task-registry.ts — allow lost → succeeded in shouldApplyRunScopedStatusUpdate; clear lost-derived error and reset cleanupAfter on the upgrade.
- src/agents/subagent-registry.test.ts — add mock for setDetachedTaskDeliveryStatusByRunId; update killed-run test.
- src/tasks/task-registry.maintenance.issue-60299.test.ts — add getSubagentRunEndedAt to harness; add five regression tests for #90444.
- src/tasks/task-registry.test.ts — add regression for kill → maintenance lost → late COMPLETE → succeeded with field-level assertions.
What did NOT change (scope boundary):
- The completion path (safeFinalizeSubagentTaskRun, completeTaskRunByRunId) is unchanged.
- The kill-vs-complete race handler in subagent-registry-lifecycle.ts is unchanged.
- Config surface unchanged. Plugin surface unchanged. No new config keys or deps.

Reproduction

Register a subagent run (registerSubagentRun) which creates mirrored running task rows.
Kill it via markSubagentRunTerminated — the kill path persists terminal state to SQLite and calls persist().
Attempt cancel on the task — returns early because the subagent run is already gone.
Run a maintenance sweep — before this PR, maintenance had no subagent-run lookup and saw the session entry still present, so hasBackingSession returned true and both rows stayed running.
Before this PR: both rows stay stuck in running with no automatic recovery path.
After this PR: kill path writes deliveryStatus = not_applicable immediately; the next maintenance sweep loads a SQLite+memory snapshot, finds endedAt set, and marks both the subagent row and any CLI peer row lost. A late lifecycle COMPLETE then upgrades lost → succeeded and clears the lost-derived error and cleanup window.

Real behavior proof

Behavior addressed (#90444): killed subagent task rows (both parent runtime='subagent' and child runtime='cli') no longer stay stuck in running — maintenance detects the terminal run record via a SQLite-backed snapshot and finalizes the rows on the next sweep, without the kill-vs-complete race that would occur if failTaskRunByRunId were called directly in the kill path.

Real environment tested (Linux, Node 22 — Vitest against the production kill path, maintenance runtime, and hasBackingSession logic): markSubagentRunTerminated driving the real subagent-registry-run-manager; runTaskRegistryMaintenance driving the real hasBackingSession with the new getSubagentRunEndedAt hook; markTaskLostById + markTaskTerminalByRunId exercising the lost → succeeded transition.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/agents/subagent-registry.test.ts src/agents/subagent-registry-lifecycle.test.ts

Evidence after fix (Vitest output):

src/agents/subagent-registry.test.ts                   Tests  61 passed (61)
src/agents/subagent-registry-lifecycle.test.ts         Tests  30 passed (30)
task-registry.maintenance.issue-60299.test.ts          Tests  24 passed (24)
task-registry.test.ts                                  Tests  75 passed (75)

The six new #90444 cases cover: terminal run → subagent row marked lost; live run (no endedAt) → row retained; non-authoritative runtime → terminal-run check still works via SQLite snapshot; freshly killed task (within grace window) → marked lost without waiting for grace expiry; same-run CLI peer row → marked lost alongside the subagent row; maintenance-lost + late COMPLETE → succeeded with cleared error and correct cleanup window.

Observed result after fix: kill path writes deliveryStatus = not_applicable immediately; task-row status stays running until the next maintenance sweep (≤ 60 s) detects endedAt in the SQLite+memory snapshot and writes status = lost for both the subagent row and any CLI peer row; a late lifecycle COMPLETE then upgrades lost → succeeded.

What was not tested: live gateway kill with a concurrent lifecycle COMPLETE event arriving in the same sweep window.

Repro confirmation: the new #90444 maintenance tests fail on the pre-patch code and pass with the fix; the CLI peer row test fails before the task.runtime === "cli" extension and passes after; the lost → succeeded field test fails before the patch.error = undefined / patch.cleanupAfter = undefined normalization and passes after.

Risk / Mitigation

Risk: Up to 60 s window between kill and task-row finalization. Mitigation: the window is bounded by TASK_SWEEP_INTERVAL_MS; the previous approach had no deadline at all (rows stuck permanently).
Risk: Killed entry absent from the SQLite snapshot after gateway restart. Mitigation: markSubagentRunTerminated calls params.persist() before returning, writing the entry (with endedAt) to SQLite; the per-sweep getSubagentRunsSnapshotForRead call reads that persisted data, so getSubagentRunEndedAt returns the value in the first post-restart sweep.
Risk: lost → succeeded opens a window for unintended task resurrection. Mitigation: updateTaskStateByRunId is scoped by runId; only a completeTaskRunByRunId call for the same runId can trigger this transition, and that call originates exclusively from the authoritative lifecycle COMPLETE handler.
Risk: CLI peer row false-positive if a CLI task shares a sourceId with a non-killed terminal subagent run. Mitigation: getSubagentRunEndedAt returns endedAt for any terminal run (killed or completed). If the parent subagent completed normally, its own lifecycle event would have already transitioned the CLI peer row to a terminal state before maintenance runs; maintenance only sees the CLI row as running when the normal finalization path failed (the exact failure the issue reports).

Change Type (select all)

Bug fix

Scope (select all touched areas)

Agents / subagents
Tasks / background jobs
Diagnostics / maintenance

Linked Issue/PR

Fixes #90444

clawsweeper · 2026-06-05T01:50:10Z

Codex review: needs maintainer review before merge. Reviewed June 5, 2026, 3:36 AM ET / 07:36 UTC.

Summary
The PR defers killed subagent task-row finalization to task maintenance, adds terminal subagent-run lookup for subagent and same-run CLI rows, and allows a maintenance-inferred lost task to become succeeded after a late same-run completion.

PR surface: Source +75, Tests +248. Total +323 across 6 files.

Reproducibility: yes. from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run endedAt lookup, and cancel can return before marking the task terminal. I did not run a live gateway repro because this review is read-only.

Review metrics: 1 noteworthy metric.

Terminal-state reconciliation: 1 changed transition rule. The PR adds a new lost -> succeeded run-scoped terminal transition, which is a deliberate session-state semantic change maintainers should notice before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Maintainer decides whether the focused unit/source proof is enough or whether to request a live gateway kill plus late-completion proof.

Risk before merge

[P1] Merging intentionally changes session-state reconciliation: maintenance can infer lost for killed subagent/CLI rows before the normal lost grace window, and a same-run completion can later rewrite lost to succeeded.
[P1] The PR body reports focused Vitest proof but not a live gateway kill with a concurrent lifecycle COMPLETE event, so the last race is covered by source-level/unit evidence rather than real transport proof.

Maintainer options:

Accept the bounded reconciliation change (recommended)
Maintainers can land this if they accept the targeted unit/source proof for the killed-run maintenance path and late same-run completion ordering.
Request live race proof first
Maintainers can ask for a live gateway or Crabbox scenario that kills a subagent while a completion event is still possible before merging.

Next step before merge

No automated repair is needed; the remaining action is maintainer acceptance of the session-state reconciliation risk and normal CI/merge handling.

Security
Cleared: No concrete security or supply-chain concern found; the diff is limited to task/subagent runtime reconciliation and tests, with no dependency, workflow, secret, or packaging changes.

Review details

Best possible solution:

Land the maintenance-owned repair if maintainers accept the bounded session-state tradeoff and the focused task/subagent CI remains green.

Do we have a high-confidence way to reproduce the issue?

Yes from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run endedAt lookup, and cancel can return before marking the task terminal. I did not run a live gateway repro because this review is read-only.

Is this the best way to solve the issue?

Yes, with maintainer acceptance of the session-state tradeoff: maintenance-owned finalization avoids writing failed in the kill path before a concurrent successful lifecycle completion can arrive. A direct kill-path failTaskRunByRunId fix is simpler but would preserve the race described in the PR body.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a3ce7c2a8da.

Label changes

Label justifications:

P2: The PR fixes a bounded but operator-visible stuck task/session-state bug with limited blast radius.
merge-risk: 🚨 session-state: The diff changes how task registry terminal state is inferred after killed subagent runs and late completion events.
rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The PR author is a repository member, so the external-contributor real behavior proof gate does not apply; the PR body still provides focused post-patch Vitest output as supplemental validation.

Evidence reviewed

PR surface:

Source +75, Tests +248. Total +323 across 6 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	3	194	119	+75
Tests	3	248	0	+248
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	6	442	119	+323

What I checked:

Repository policy: Root AGENTS.md was provided in full and src/agents/AGENTS.md was read; the review applied the PR-review depth, session-state risk, read-only, and scoped agent-path guidance. (src/agents/AGENTS.md:1, 1a3ce7c2a8da)
Current main kill path: Current main marks a subagent run terminal and persists it in markSubagentRunTerminated, but this path does not update mirrored task terminal status or delivery state. (src/agents/subagent-registry-run-manager.ts:738, 1a3ce7c2a8da)
Current main maintenance gap: Current main shouldMarkLost only waits for the lost grace window and checks backing session state; it does not consult terminal subagent run state. (src/tasks/task-registry.maintenance.ts:542, 1a3ce7c2a8da)
PR terminal-run maintenance path: The PR adds a SQLite+memory subagent-run snapshot and marks active subagent/CLI peer rows lost early when the matching subagent run has endedAt. (src/tasks/task-registry.maintenance.ts:569, f63c221e2978)
PR late-complete transition: The PR changes the run-scoped terminal guard so lost -> succeeded can apply, and clears lost-derived error/cleanup fields when that transition occurs. (src/tasks/task-registry.ts:496, f63c221e2978)
Linked issue evidence: The linked report and follow-up comments describe a live OpenClaw 2026.4.8 kill path where terminal subagent state was persisted while the task row stayed running; the reporter's local direct-finalization patch stopped new stuck rows, but the PR chooses maintenance finalization to avoid the kill-vs-complete race.

Likely related people:

@steipete: Current blame and GitHub file history show recent central work on task registry maintenance, task registry state, and subagent registry helpers that own the affected behavior. (role: recent area contributor; confidence: high; commits: 82710b4f1f10, 606e3d78669a, cca24cc78b63; files: src/tasks/task-registry.maintenance.ts, src/tasks/task-registry.ts, src/agents/subagent-registry-run-manager.ts)
@vincentkoc: GitHub history shows recent subagent delivery-state work adjacent to the delivery/status split this PR changes. (role: adjacent subagent delivery contributor; confidence: medium; commits: 7e0d275f7add; files: src/agents/subagent-registry-lifecycle.ts, src/agents/subagent-registry-run-manager.ts)
@openperf: Beyond this PR, GitHub history shows prior merged work on task maintenance zombie-run reclamation, which is adjacent to the maintenance-owned repair path here. (role: recent adjacent task-maintenance contributor; confidence: medium; commits: 02c7b5b82fc5; files: src/tasks/task-registry.maintenance.ts, src/tasks/task-registry.ts)
@Feelw00: GitHub history shows recent task persistence work that changed task/TaskFlow durability boundaries near the affected registry mutation paths. (role: adjacent task persistence contributor; confidence: medium; commits: 01193dea2637; files: src/tasks/task-registry.ts, src/tasks/task-executor.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

…oid kill-vs-complete race Kill path writes setDetachedTaskDeliveryStatusByRunId(not_applicable) immediately; task-row status finalization is deferred to maintenance via a new optional getSubagentRunEndedAt hook on TaskRegistryMaintenanceRuntime. When the runtime is authoritative and endedAt is set, shouldMarkLost returns true before the 5-minute grace window (convergence ≤60 s vs ≤6 min). shouldApplyRunScopedStatusUpdate now allows lost→succeeded so a late lifecycle COMPLETE can override the maintenance-inferred lost outcome. Regression tests cover: stuck-running→lost, fresh-kill timing (before grace expires), live-run retained, non-authoritative skipped, and kill→maintenance-lost→late-COMPLETE→succeeded. Fixes openclaw#90444

openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Jun 5, 2026

openperf force-pushed the fix/90444-killed-subagent-task-finalization branch from e6fbce8 to 2792830 Compare June 5, 2026 01:50

hxy91819 mentioned this pull request Jun 5, 2026

fix(ci): scope PR merge diff checks to first parent #90287

Merged

openclaw-barnacle Bot added size: M and removed size: S labels Jun 5, 2026

clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 5, 2026

openperf force-pushed the fix/90444-killed-subagent-task-finalization branch 6 times, most recently from d259227 to c5e658b Compare June 5, 2026 07:08

openperf force-pushed the fix/90444-killed-subagent-task-finalization branch from c5e658b to f63c221 Compare June 5, 2026 07:30

openclaw-barnacle Bot added size: L and removed size: M labels Jun 5, 2026

clawsweeper Bot mentioned this pull request Jun 5, 2026

[Bug]: orphan recovery can resurrect stale subagent runs after long downtime #90766

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race#90505

fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race#90505
openperf wants to merge 1 commit into
openclaw:mainfrom
openperf:fix/90444-killed-subagent-task-finalization

openperf commented Jun 5, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

openperf commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reproduction

Real behavior proof

Risk / Mitigation

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Uh oh!

clawsweeper Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

openperf commented Jun 5, 2026 •

edited

Loading

clawsweeper Bot commented Jun 5, 2026 •

edited

Loading