Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it

## Summary

A subagent run can terminate and be recorded as terminal in `subagents/runs.json`, while the task registry still leaves the related `task_runs` rows in `status='running'`.

In the incident I investigated, this affected both:
- the top-level `runtime='subagent'` task row
- the child `runtime='cli'` task row for the same run

The stale rows were visible in `openclaw tasks`, but there was no live worker left.

## What made this notable

The stale rows were not just old `running` tasks:
- the backing subagent run already had terminal state in `subagents/runs.json`
- the task rows still showed `status='running'`
- both rows had `error` set (`killed` / `terminated` shape)
- at least one affected incident had broken timestamps where `ended_at < started_at`

Because of that, normal cleanup paths did not resolve it.

## Actual behavior

Observed behavior in a live self-hosted runtime:

1. A subagent run ended and no active worker/session remained.
2. `subagents/runs.json` contained terminal data such as:
   - `endedAt`
   - `outcome.status: "error"`
   - `endedReason: "subagent-killed"`
   - optional `suppressAnnounceReason`
3. `task_runs` still contained related rows with:
   - `status='running'`
   - stale `delivery_status`
   - terminal-ish `error` values
   - in one case, `ended_at < started_at`
4. `openclaw tasks cancel <runId>` returned a variant of `Subagent was not running.`
5. `openclaw tasks maintenance --apply` did not reconcile the rows.
6. `openclaw tasks flow cancel <flowId>` also could not finish the flow while the linked task still looked active.

## Expected behavior

When a subagent run reaches terminal state, the task registry should also transition all related task rows out of `running`.

At minimum:
- parent `runtime='subagent'` task row should become terminal
- child `runtime='cli'` task row should become terminal
- timestamps should remain monotonic
- any linked task flow should become cancellable/finalizable by supported commands

## Impact

This creates a misleading and sticky failure mode:
- `openclaw tasks` shows a running task that is not real
- operators cannot clear it via supported commands
- `maintenance` leaves it behind
- task flow state can remain stuck because the child still appears active
- manual SQLite repair may be required even though authoritative terminal data already exists elsewhere

## Why this seems distinct from age-based stale-running cleanup

This does not look like only the already-known “multi-day stale running task” case.

In this incident the bad rows were relatively fresh, and the stronger signal was that terminal run data already existed, but registry finalization was incomplete or internally inconsistent.

Specifically, the interesting combination was:
- terminal data present in `subagents/runs.json`
- `task_runs.status='running'`
- broken timestamps possible
- supported cancel/maintenance paths still unable to clean it up

## Suggested fix shape

A likely robust fix would be one or more of:

1. **Finalization consistency on subagent termination paths**
   - Ensure every killed/restarted/terminated subagent path updates all related `task_runs` rows to terminal state.
   - Do not leave `status='running'` once `endedAt` / terminal outcome is known.

2. **Registry self-heal using authoritative subagent run state**
   - If `subagents/runs.json` already shows a run as terminal, `tasks maintenance` should reconcile matching `task_runs` rows even if they are not yet “old enough” for stale-running heuristics.

3. **Monotonic timestamp guardrails**
   - Prevent `ended_at < started_at` from being written.
   - If encountered during audit/maintenance, treat it as repairable corruption instead of preserving `running`.

4. **Flow-level cancellation resilience**
   - `tasks flow cancel` should be able to finish a flow when the linked task is only nominally active but its backing run is already terminal.

## Useful repro shape

I do not yet have a tiny public repro script, but the incident shape was approximately:

1. Start a subagent run that creates a top-level subagent task and a child CLI task.
2. Cause the run to terminate through a kill/restart/failure path during or near finalization.
3. End up with:
   - terminal subagent run state recorded
   - no live child worker/session
   - task rows still in `running`
   - optionally broken `ended_at` ordering
4. Try cleanup:
   - `openclaw tasks cancel <runId>`
   - `openclaw tasks maintenance --apply`
   - `openclaw tasks flow cancel <flowId>`
5. Observe that none of them clear the stale `running` rows.

## Workaround used

The only reliable repair I found was:
1. back up both the task registry and flow registry SQLite files
2. manually update stale `task_runs` rows to terminal state using the authoritative terminal run data
3. re-run supported `tasks flow cancel` / maintenance afterward

That fixes the state locally, but it is obviously not a good operator workflow.

## Environment

- OpenClaw version: `2026.4.8 (9ece252)`
- Model in use: `openai-codex/gpt-5.4`
- Runtime type: self-hosted Linux gateway
- Channel/session type during incident: direct chat session spawning subagents

## Related upstream context

This feels adjacent to, but not the same as, existing stale-task reports around maintenance/cancel behavior.

Potentially related:
- stale-running maintenance gaps
- tasks cancel not clearing stale rows
- missing operator-level subagent kill/reap tools

If helpful, I can also provide a sanitized before/after snapshot of the affected registry fields and the exact sequence of supported commands that failed before manual repair.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444

Summary

What made this notable

Actual behavior

Expected behavior

Impact

Why this seems distinct from age-based stale-running cleanup

Suggested fix shape

Useful repro shape

Workaround used

Environment

Related upstream context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444

Description

Summary

What made this notable

Actual behavior

Expected behavior

Impact

Why this seems distinct from age-based stale-running cleanup

Suggested fix shape

Useful repro shape

Workaround used

Environment

Related upstream context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions