Summary
A subagent run can terminate and be recorded as terminal in subagents/runs.json, while the task registry still leaves the related task_runs rows in status='running'.
In the incident I investigated, this affected both:
- the top-level runtime='subagent' task row
- the child runtime='cli' task row for the same run
The stale rows were visible in openclaw tasks, but there was no live worker left.
What made this notable
The stale rows were not just old running tasks:
- the backing subagent run already had terminal state in subagents/runs.json
- the task rows still showed status='running'
- both rows had error set (killed / terminated shape)
- at least one affected incident had broken timestamps where ended_at < started_at
Because of that, normal cleanup paths did not resolve it.
Actual behavior
Observed behavior in a live self-hosted runtime:
- A subagent run ended and no active worker/session remained.
- subagents/runs.json contained terminal data such as:
  - endedAt
  - outcome.status: "error"
  - endedReason: "subagent-killed"
  - optional suppressAnnounceReason
- task_runs still contained related rows with:
  - status='running'
  - stale delivery_status
  - terminal-ish error values
  - in one case, ended_at < started_at
- openclaw tasks cancel <runId> returned a variant of "Subagent was not running."
- openclaw tasks maintenance --apply did not reconcile the rows.
- openclaw tasks flow cancel <flowId> also could not finish the flow while the linked task still looked active.
Expected behavior
When a subagent run reaches terminal state, the task registry should also transition all related task rows out of running.
At minimum:
- parent runtime='subagent' task row should become terminal
- child runtime='cli' task row should become terminal
- timestamps should remain monotonic
- any linked task flow should become cancellable/finalizable by supported commands
Impact
This creates a misleading and sticky failure mode:
- openclaw tasks shows a running task that is not real
- operators cannot clear it via supported commands
- maintenance leaves it behind
- task flow state can remain stuck because the child still appears active
- manual SQLite repair may be required even though authoritative terminal data already exists elsewhere
Why this seems distinct from age-based stale-running cleanup
This does not appear to be just another instance of the already-known “multi-day stale running task” case.
In this incident the bad rows were relatively fresh, and the stronger signal was that terminal run data already existed, but registry finalization was incomplete or internally inconsistent.
Specifically, the interesting combination was:
- terminal data present in subagents/runs.json
- task_runs.status='running'
- broken timestamps possible
- supported cancel/maintenance paths still unable to clean it up
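To make that combination concrete, here is a minimal detection sketch in Python. It is only an illustration under assumptions I have not confirmed against the codebase: that subagents/runs.json can be loaded as a mapping from run id to run record, and that each task_runs row can be linked to its run through a column I call run_id here. The names run_id, task_id, is_terminal_run and find_stale_rows are hypothetical; only endedAt, outcome.status and status='running' come from the incident data.

```python
from typing import Any


def is_terminal_run(run: dict[str, Any]) -> bool:
    """Treat a run as terminal once it carries endedAt or an outcome status."""
    outcome = run.get("outcome") or {}
    return run.get("endedAt") is not None or outcome.get("status") is not None


def find_stale_rows(
    runs: dict[str, dict[str, Any]],
    rows: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return task rows still marked running whose backing run is already terminal.

    `runs` is assumed to map run id -> run record (hypothetical layout).
    `rows` is assumed to hold task_runs rows as dicts with a hypothetical
    `run_id` key linking each row to its run.
    """
    stale = []
    for row in rows:
        run = runs.get(row["run_id"])
        if run is not None and row["status"] == "running" and is_terminal_run(run):
            stale.append(row)
    return stale
```

Note that the check does not look at row age at all, which is the point: the signal is the disagreement between the two stores, not how long the row has been running.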
Suggested fix shape
A likely robust fix would be one or more of:
- Finalization consistency on subagent termination paths
  - Ensure every killed/restarted/terminated subagent path updates all related task_runs rows to terminal state.
  - Do not leave status='running' once endedAt / terminal outcome is known.
- Registry self-heal using authoritative subagent run state
  - If subagents/runs.json already shows a run as terminal, tasks maintenance should reconcile matching task_runs rows even if they are not yet “old enough” for stale-running heuristics.
- Monotonic timestamp guardrails
  - Prevent ended_at < started_at from being written.
  - If encountered during audit/maintenance, treat it as repairable corruption instead of preserving running.
- Flow-level cancellation resilience
  - tasks flow cancel should be able to finish a flow when the linked task is only nominally active but its backing run is already terminal.
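As an illustration of the timestamp guardrail, a write-time check could look roughly like the sketch below. This is not taken from the OpenClaw code. The function name normalize_ended_at is hypothetical, and clamping to started_at is just one possible policy (rejecting the write or logging it as corruption would be equally valid choices).

```python
from typing import Optional


def normalize_ended_at(
    started_at: Optional[int],
    ended_at: Optional[int],
    now: int,
) -> int:
    """Return an ended_at value that is never earlier than started_at.

    Falls back to `now` when no ended_at was recorded, and clamps to
    started_at when the recorded value would break monotonic ordering.
    """
    candidate = ended_at if ended_at is not None else now
    if started_at is not None and candidate < started_at:
        return started_at
    return candidate
```

With a guard of this kind applied wherever ended_at is written, a row with started_at=100 and a proposed ended_at=50 would be stored with ended_at=100 instead of the inverted pair seen in the incident.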
Useful repro shape
I do not yet have a tiny public repro script, but the incident shape was approximately:
- Start a subagent run that creates a top-level subagent task and a child CLI task.
- Cause the run to terminate through a kill/restart/failure path during or near finalization.
- End up with:
  - terminal subagent run state recorded
  - no live child worker/session
  - task rows still in running
  - optionally broken ended_at ordering
- Try cleanup:
  - openclaw tasks cancel <runId>
  - openclaw tasks maintenance --apply
  - openclaw tasks flow cancel <flowId>
- Observe that none of them clear the stale running rows.
Workaround used
The only reliable repair I found was:
- back up both the task registry and flow registry SQLite files
- manually update stale task_runs rows to terminal state using the authoritative terminal run data
- re-run supported tasks flow cancel / maintenance afterward
That fixes the state locally, but it is obviously not a good operator workflow.
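For reference, the manual repair was roughly of the following shape. This is a sketch rather than the exact commands I ran, and it relies on assumptions: a run_id column linking task_runs rows to the run, and 'failed' as the terminal status value. Both are hypothetical and should be checked against the real schema before use. The status, started_at and ended_at columns are the ones described above.

```python
import sqlite3


def backup_db(src_path: str, dest_path: str) -> None:
    """Copy a SQLite file with the online backup API before touching it."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def repair_stale_rows(
    conn: sqlite3.Connection,
    run_id: str,
    ended_at: int,
    terminal_status: str = "failed",  # assumed value, not confirmed
) -> int:
    """Move running rows for one run to a terminal status.

    ended_at comes from the terminal run record and is clamped so it is
    never written earlier than the row's started_at.
    Returns the number of rows updated.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE task_runs
               SET status = ?,
                   ended_at = MAX(?, COALESCE(started_at, ?))
             WHERE run_id = ? AND status = 'running'
            """,
            (terminal_status, ended_at, ended_at, run_id),
        )
    return cur.rowcount
```

The update only touches rows that are still in running for the given run, so rows that were already finalized correctly are left alone.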
Environment
- OpenClaw version: 2026.4.8 (9ece252)
- Model in use: openai-codex/gpt-5.4
- Runtime type: self-hosted Linux gateway
- Channel/session type during incident: direct chat session spawning subagents
Related upstream context
This feels adjacent to, but not the same as, existing stale-task reports around maintenance/cancel behavior.
Potentially related:
- stale-running maintenance gaps
- tasks cancel not clearing stale rows
- missing operator-level subagent kill/reap tools
If helpful, I can also provide a sanitized before/after snapshot of the affected registry fields and the exact sequence of supported commands that failed before manual repair.