Skip to content

fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race#90505

Open
openperf wants to merge 1 commit into
openclaw:mainfrom
openperf:fix/90444-killed-subagent-task-finalization
Open

fix(agents): finalize killed subagent task rows via maintenance to avoid kill-vs-complete race#90505
openperf wants to merge 1 commit into
openclaw:mainfrom
openperf:fix/90444-killed-subagent-task-finalization

Conversation

@openperf

@openperf openperf commented Jun 5, 2026

Copy link
Copy Markdown
Member

Summary

  • Problem: After markSubagentRunTerminated kills a subagent run, the mirrored task rows stay stuck in running indefinitely. The linked report describes both the parent runtime='subagent' row and the child runtime='cli' row for the same run. The cancel path returns early because the run is already gone; maintenance had no subagent-run lookup so neither row could be cleared without a manual DB fix. Issue Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444.
  • Root Cause: The kill path persisted terminal subagent-run state (endedAt, cleanupHandled, suppressAnnounceReason) but never called failTaskRunByRunId to close the mirrored task rows. Writing failTaskRunByRunId immediately in the kill path would fix the stuck rows but opens a kill-vs-complete race: if a lifecycle COMPLETE event arrives concurrently, completeTaskRunByRunId cannot transition failed → succeeded because shouldApplyRunScopedStatusUpdate blocks that direction, leaving the row permanently wrong.
  • Fix: Decouple the two concerns. Kill path writes setDetachedTaskDeliveryStatusByRunId(not_applicable) immediately — killed runs never reach the announce flow. Task-row status finalization is deferred to maintenance, which reads terminal run state from a merged SQLite+memory snapshot via a new optional getSubagentRunEndedAt hook on TaskRegistryMaintenanceRuntime. When endedAt is set for a run, shouldMarkLost returns true early — before the 5-minute TASK_RECONCILE_GRACE_MS window — for both runtime='subagent' and same-run runtime='cli' peer rows, so maintenance marks them lost on the next sweep (≤ 60 s). The snapshot is SQLite-backed so CLI maintenance (openclaw tasks maintenance --apply) also resolves stuck rows. A shouldApplyRunScopedStatusUpdate change allows lost → succeeded, and the lost → succeeded upgrade also clears the stale error and resets cleanupAfter to the succeeded retention window.
  • What changed:
    • src/agents/subagent-registry-run-manager.ts — kill loop: remove failTaskRunByRunId; keep setDetachedTaskDeliveryStatusByRunId(not_applicable).
    • src/tasks/task-registry.maintenance.ts — add optional getSubagentRunEndedAt hook; populate a per-sweep SQLite+memory snapshot via getSubagentRunsSnapshotForRead; terminal-run early-exit in shouldMarkLost and hasBackingSession now covers both runtime='subagent' and same-run runtime='cli' peer rows.
    • src/tasks/task-registry.ts — allow lost → succeeded in shouldApplyRunScopedStatusUpdate; clear lost-derived error and reset cleanupAfter on the upgrade.
    • src/agents/subagent-registry.test.ts — add mock for setDetachedTaskDeliveryStatusByRunId; update killed-run test.
    • src/tasks/task-registry.maintenance.issue-60299.test.ts — add getSubagentRunEndedAt to harness; add five regression tests for #90444.
    • src/tasks/task-registry.test.ts — add regression for kill → maintenance lost → late COMPLETE → succeeded with field-level assertions.
  • What did NOT change (scope boundary):
    • The completion path (safeFinalizeSubagentTaskRun, completeTaskRunByRunId) is unchanged.
    • The kill-vs-complete race handler in subagent-registry-lifecycle.ts is unchanged.
    • Config surface unchanged. Plugin surface unchanged. No new config keys or deps.

Reproduction

  1. Register a subagent run (registerSubagentRun) which creates mirrored running task rows.
  2. Kill it via markSubagentRunTerminated — the kill path persists terminal state to SQLite and calls persist().
  3. Attempt cancel on the task — returns early because the subagent run is already gone.
  4. Run a maintenance sweep — before this PR, maintenance had no subagent-run lookup and saw the session entry still present, so hasBackingSession returned true and both rows stayed running.
  5. Before this PR: both rows stay stuck in running with no automatic recovery path.
  6. After this PR: kill path writes deliveryStatus = not_applicable immediately; the next maintenance sweep loads a SQLite+memory snapshot, finds endedAt set, and marks both the subagent row and any CLI peer row lost. A late lifecycle COMPLETE then upgrades lost → succeeded and clears the lost-derived error and cleanup window.

Real behavior proof

Behavior addressed (#90444): killed subagent task rows (both parent runtime='subagent' and child runtime='cli') no longer stay stuck in running — maintenance detects the terminal run record via a SQLite-backed snapshot and finalizes the rows on the next sweep, without the kill-vs-complete race that would occur if failTaskRunByRunId were called directly in the kill path.

Real environment tested (Linux, Node 22 — Vitest against the production kill path, maintenance runtime, and hasBackingSession logic): markSubagentRunTerminated driving the real subagent-registry-run-manager; runTaskRegistryMaintenance driving the real hasBackingSession with the new getSubagentRunEndedAt hook; markTaskLostById + markTaskTerminalByRunId exercising the lost → succeeded transition.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/agents/subagent-registry.test.ts src/agents/subagent-registry-lifecycle.test.ts

Evidence after fix (Vitest output):

src/agents/subagent-registry.test.ts                   Tests  61 passed (61)
src/agents/subagent-registry-lifecycle.test.ts         Tests  30 passed (30)
task-registry.maintenance.issue-60299.test.ts          Tests  24 passed (24)
task-registry.test.ts                                  Tests  75 passed (75)

The six new #90444 cases cover: terminal run → subagent row marked lost; live run (no endedAt) → row retained; non-authoritative runtime → terminal-run check still works via SQLite snapshot; freshly killed task (within grace window) → marked lost without waiting for grace expiry; same-run CLI peer row → marked lost alongside the subagent row; maintenance-lost + late COMPLETE → succeeded with cleared error and correct cleanup window.

Observed result after fix: kill path writes deliveryStatus = not_applicable immediately; task-row status stays running until the next maintenance sweep (≤ 60 s) detects endedAt in the SQLite+memory snapshot and writes status = lost for both the subagent row and any CLI peer row; a late lifecycle COMPLETE then upgrades lost → succeeded.

What was not tested: live gateway kill with a concurrent lifecycle COMPLETE event arriving in the same sweep window.

Repro confirmation: the new #90444 maintenance tests fail on the pre-patch code and pass with the fix; the CLI peer row test fails before the task.runtime === "cli" extension and passes after; the lost → succeeded field test fails before the patch.error = undefined / patch.cleanupAfter = undefined normalization and passes after.

Risk / Mitigation

  • Risk: Up to 60 s window between kill and task-row finalization. Mitigation: the window is bounded by TASK_SWEEP_INTERVAL_MS; the previous approach had no deadline at all (rows stuck permanently).
  • Risk: Killed entry absent from the SQLite snapshot after gateway restart. Mitigation: markSubagentRunTerminated calls params.persist() before returning, writing the entry (with endedAt) to SQLite; the per-sweep getSubagentRunsSnapshotForRead call reads that persisted data, so getSubagentRunEndedAt returns the value in the first post-restart sweep.
  • Risk: lost → succeeded opens a window for unintended task resurrection. Mitigation: updateTaskStateByRunId is scoped by runId; only a completeTaskRunByRunId call for the same runId can trigger this transition, and that call originates exclusively from the authoritative lifecycle COMPLETE handler.
  • Risk: CLI peer row false-positive if a CLI task shares a sourceId with a non-killed terminal subagent run. Mitigation: getSubagentRunEndedAt returns endedAt for any terminal run (killed or completed). If the parent subagent completed normally, its own lifecycle event would have already transitioned the CLI peer row to a terminal state before maintenance runs; maintenance only sees the CLI row as running when the normal finalization path failed (the exact failure the issue reports).

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents / subagents
  • Tasks / background jobs
  • Diagnostics / maintenance

Linked Issue/PR

Fixes #90444

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Jun 5, 2026
@openperf openperf force-pushed the fix/90444-killed-subagent-task-finalization branch from e6fbce8 to 2792830 Compare June 5, 2026 01:50
@clawsweeper

clawsweeper Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 5, 2026, 3:36 AM ET / 07:36 UTC.

Summary
The PR defers killed subagent task-row finalization to task maintenance, adds terminal subagent-run lookup for subagent and same-run CLI rows, and allows a maintenance-inferred lost task to become succeeded after a late same-run completion.

PR surface: Source +75, Tests +248. Total +323 across 6 files.

Reproducibility: yes. from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run endedAt lookup, and cancel can return before marking the task terminal. I did not run a live gateway repro because this review is read-only.

Review metrics: 1 noteworthy metric.

  • Terminal-state reconciliation: 1 changed transition rule. The PR adds a new lost -> succeeded run-scoped terminal transition, which is a deliberate session-state semantic change maintainers should notice before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Maintainer decides whether the focused unit/source proof is enough or whether to request a live gateway kill plus late-completion proof.

Risk before merge

  • [P1] Merging intentionally changes session-state reconciliation: maintenance can infer lost for killed subagent/CLI rows before the normal lost grace window, and a same-run completion can later rewrite lost to succeeded.
  • [P1] The PR body reports focused Vitest proof but not a live gateway kill with a concurrent lifecycle COMPLETE event, so the last race is covered by source-level/unit evidence rather than real transport proof.

Maintainer options:

  1. Accept the bounded reconciliation change (recommended)
    Maintainers can land this if they accept the targeted unit/source proof for the killed-run maintenance path and late same-run completion ordering.
  2. Request live race proof first
    Maintainers can ask for a live gateway or Crabbox scenario that kills a subagent while a completion event is still possible before merging.

Next step before merge

  • No automated repair is needed; the remaining action is maintainer acceptance of the session-state reconciliation risk and normal CI/merge handling.

Security
Cleared: No concrete security or supply-chain concern found; the diff is limited to task/subagent runtime reconciliation and tests, with no dependency, workflow, secret, or packaging changes.

Review details

Best possible solution:

Land the maintenance-owned repair if maintainers accept the bounded session-state tradeoff and the focused task/subagent CI remains green.

Do we have a high-confidence way to reproduce the issue?

Yes from source inspection: current main persists killed subagent terminal state without task-row finalization, task maintenance lacks a subagent-run endedAt lookup, and cancel can return before marking the task terminal. I did not run a live gateway repro because this review is read-only.

Is this the best way to solve the issue?

Yes, with maintainer acceptance of the session-state tradeoff: maintenance-owned finalization avoids writing failed in the kill path before a concurrent successful lifecycle completion can arrive. A direct kill-path failTaskRunByRunId fix is simpler but would preserve the race described in the PR body.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a3ce7c2a8da.

Label changes

Label justifications:

  • P2: The PR fixes a bounded but operator-visible stuck task/session-state bug with limited blast radius.
  • merge-risk: 🚨 session-state: The diff changes how task registry terminal state is inferred after killed subagent runs and late completion events.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The PR author is a repository member, so the external-contributor real behavior proof gate does not apply; the PR body still provides focused post-patch Vitest output as supplemental validation.
Evidence reviewed

PR surface:

Source +75, Tests +248. Total +323 across 6 files.

View PR surface stats
Area Files Added Removed Net
Source 3 194 119 +75
Tests 3 248 0 +248
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 6 442 119 +323

What I checked:

  • Repository policy: Root AGENTS.md was provided in full and src/agents/AGENTS.md was read; the review applied the PR-review depth, session-state risk, read-only, and scoped agent-path guidance. (src/agents/AGENTS.md:1, 1a3ce7c2a8da)
  • Current main kill path: Current main marks a subagent run terminal and persists it in markSubagentRunTerminated, but this path does not update mirrored task terminal status or delivery state. (src/agents/subagent-registry-run-manager.ts:738, 1a3ce7c2a8da)
  • Current main maintenance gap: Current main shouldMarkLost only waits for the lost grace window and checks backing session state; it does not consult terminal subagent run state. (src/tasks/task-registry.maintenance.ts:542, 1a3ce7c2a8da)
  • PR terminal-run maintenance path: The PR adds a SQLite+memory subagent-run snapshot and marks active subagent/CLI peer rows lost early when the matching subagent run has endedAt. (src/tasks/task-registry.maintenance.ts:569, f63c221e2978)
  • PR late-complete transition: The PR changes the run-scoped terminal guard so lost -> succeeded can apply, and clears lost-derived error/cleanup fields when that transition occurs. (src/tasks/task-registry.ts:496, f63c221e2978)
  • Linked issue evidence: The linked report and follow-up comments describe a live OpenClaw 2026.4.8 kill path where terminal subagent state was persisted while the task row stayed running; the reporter's local direct-finalization patch stopped new stuck rows, but the PR chooses maintenance finalization to avoid the kill-vs-complete race.

Likely related people:

  • @steipete: Current blame and GitHub file history show recent central work on task registry maintenance, task registry state, and subagent registry helpers that own the affected behavior. (role: recent area contributor; confidence: high; commits: 82710b4f1f10, 606e3d78669a, cca24cc78b63; files: src/tasks/task-registry.maintenance.ts, src/tasks/task-registry.ts, src/agents/subagent-registry-run-manager.ts)
  • @vincentkoc: GitHub history shows recent subagent delivery-state work adjacent to the delivery/status split this PR changes. (role: adjacent subagent delivery contributor; confidence: medium; commits: 7e0d275f7add; files: src/agents/subagent-registry-lifecycle.ts, src/agents/subagent-registry-run-manager.ts)
  • @openperf: Beyond this PR, GitHub history shows prior merged work on task maintenance zombie-run reclamation, which is adjacent to the maintenance-owned repair path here. (role: recent adjacent task-maintenance contributor; confidence: medium; commits: 02c7b5b82fc5; files: src/tasks/task-registry.maintenance.ts, src/tasks/task-registry.ts)
  • @Feelw00: GitHub history shows recent task persistence work that changed task/TaskFlow durability boundaries near the affected registry mutation paths. (role: adjacent task persistence contributor; confidence: medium; commits: 01193dea2637; files: src/tasks/task-registry.ts, src/tasks/task-executor.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels Jun 5, 2026
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 5, 2026
@openperf openperf force-pushed the fix/90444-killed-subagent-task-finalization branch 6 times, most recently from d259227 to c5e658b Compare June 5, 2026 07:08
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 5, 2026
…oid kill-vs-complete race

Kill path writes setDetachedTaskDeliveryStatusByRunId(not_applicable) immediately; task-row
status finalization is deferred to maintenance via a new optional getSubagentRunEndedAt hook
on TaskRegistryMaintenanceRuntime. When the runtime is authoritative and endedAt is set,
shouldMarkLost returns true before the 5-minute grace window (convergence ≤60 s vs ≤6 min).
shouldApplyRunScopedStatusUpdate now allows lost→succeeded so a late lifecycle COMPLETE can
override the maintenance-inferred lost outcome. Regression tests cover: stuck-running→lost,
fresh-kill timing (before grace expires), live-run retained, non-authoritative skipped, and
kill→maintenance-lost→late-COMPLETE→succeeded.

Fixes openclaw#90444
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: L status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it

1 participant