Bug: killed subagent run can leave task_runs stuck in running, with tasks cancel and maintenance unable to clear it #90444

@Timofa

Description

Summary

A subagent run can terminate and be recorded as terminal in subagents/runs.json, while the task registry still leaves the related task_runs rows in status='running'.

In the incident I investigated, this affected both:

  • the top-level runtime='subagent' task row
  • the child runtime='cli' task row for the same run

The stale rows were visible in openclaw tasks, but there was no live worker left.

What made this notable

The stale rows were not just old running tasks:

  • the backing subagent run already had terminal state in subagents/runs.json
  • the task rows still showed status='running'
  • both rows had error set to a killed/terminated-style value
  • at least one affected incident had broken timestamps where ended_at < started_at

Because of that, normal cleanup paths did not resolve it.

Actual behavior

Observed behavior in a live self-hosted runtime:

  1. A subagent run ended and no active worker/session remained.
  2. subagents/runs.json contained terminal data such as:
    • endedAt
    • outcome.status: "error"
    • endedReason: "subagent-killed"
    • optional suppressAnnounceReason
  3. task_runs still contained related rows with:
    • status='running'
    • stale delivery_status
    • terminal-ish error values
    • in one case, ended_at < started_at
  4. openclaw tasks cancel <runId> returned a variant of "Subagent was not running."
  5. openclaw tasks maintenance --apply did not reconcile the rows.
  6. openclaw tasks flow cancel <flowId> also could not finish the flow while the linked task still looked active.
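
A minimal sketch of how the mismatch described above could be detected, using Python's standard sqlite3 module. The task_runs column names (task_id, run_id, runtime, status, error, started_at, ended_at) and the shape of the run records are assumptions inferred from the fields named in this report, not the actual OpenClaw schema.

```python
import sqlite3

# Hypothetical run records, modeled on the subagents/runs.json fields listed above.
runs = {
    "run-1": {"endedAt": 1700000100, "outcome": {"status": "error"},
              "endedReason": "subagent-killed"},
    "run-2": {},  # no terminal fields: treated as still live
}

def is_terminal(run):
    """A run counts as terminal once endedAt or an outcome status is recorded."""
    outcome = run.get("outcome") or {}
    return run.get("endedAt") is not None or "status" in outcome

def find_stale_rows(conn, runs):
    """Return task rows still marked running whose backing run is terminal."""
    terminal_ids = [rid for rid, run in runs.items() if is_terminal(run)]
    if not terminal_ids:
        return []
    placeholders = ",".join("?" for _ in terminal_ids)
    cur = conn.execute(
        "SELECT task_id, run_id, runtime FROM task_runs "
        f"WHERE status = 'running' AND run_id IN ({placeholders}) "
        "ORDER BY task_id",
        terminal_ids,
    )
    return cur.fetchall()

# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_runs (task_id TEXT PRIMARY KEY, run_id TEXT, runtime TEXT, "
    "status TEXT, error TEXT, started_at INTEGER, ended_at INTEGER)"
)
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("t1", "run-1", "subagent", "running", "killed", 1700000000, 1699999990),
        ("t2", "run-1", "cli", "running", "killed", 1700000005, None),
        ("t3", "run-2", "subagent", "running", None, 1700000050, None),
    ],
)

stale = find_stale_rows(conn, runs)
for task_id, run_id, runtime in stale:
    print(f"stale: task={task_id} run={run_id} runtime={runtime}")
```

Here both the subagent row and the cli row for run-1 are reported, while the row for the still-live run-2 is left alone.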

Expected behavior

When a subagent run reaches terminal state, the task registry should also transition all related task rows out of running.

At minimum:

  • parent runtime='subagent' task row should become terminal
  • child runtime='cli' task row should become terminal
  • timestamps should remain monotonic
  • any linked task flow should become cancellable/finalizable by supported commands
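
The first three expectations can be expressed as a small per-row invariant check. This is only a sketch: the row is represented as a plain dict, and the field names (status, started_at, ended_at) are assumed to mirror the task_runs columns mentioned in this report.

```python
def check_row(row, run_is_terminal):
    """Return the list of invariant violations for one task row.

    row is a dict with hypothetical keys: status, started_at, ended_at.
    run_is_terminal says whether the backing subagent run has ended.
    """
    problems = []
    # A row must not stay in running once its backing run is terminal.
    if run_is_terminal and row["status"] == "running":
        problems.append("running-after-terminal-run")
    # Timestamps must be monotonic whenever both are present.
    started, ended = row.get("started_at"), row.get("ended_at")
    if started is not None and ended is not None and ended < started:
        problems.append("ended-before-started")
    return problems

# Example: the worst case seen in the incident (stale status plus bad ordering).
broken = {"status": "running", "started_at": 10, "ended_at": 5}
print(check_row(broken, run_is_terminal=True))
```

A check like this could run both on the write path and during maintenance, so the same rule rejects bad writes and flags existing bad rows.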

Impact

This creates a misleading and sticky failure mode:

  • openclaw tasks shows a running task that is not real
  • operators cannot clear it via supported commands
  • maintenance leaves it behind
  • task flow state can remain stuck because the child still appears active
  • manual SQLite repair may be required even though authoritative terminal data already exists elsewhere

Why this seems distinct from age-based stale-running cleanup

This does not look like just the already-known “multi-day stale running task” case.

In this incident the bad rows were relatively fresh. The stronger signal was that terminal run data already existed while registry finalization was incomplete or internally inconsistent.

Specifically, the interesting combination was:

  • terminal data present in subagents/runs.json
  • task_runs.status='running'
  • broken timestamps possible
  • supported cancel/maintenance paths still unable to clean it up

Suggested fix shape

A robust fix would likely involve one or more of:

  1. Finalization consistency on subagent termination paths

    • Ensure every killed/restarted/terminated subagent path updates all related task_runs rows to terminal state.
    • Do not leave status='running' once endedAt / terminal outcome is known.
  2. Registry self-heal using authoritative subagent run state

    • If subagents/runs.json already shows a run as terminal, tasks maintenance should reconcile matching task_runs rows even if they are not yet “old enough” for stale-running heuristics.
  3. Monotonic timestamp guardrails

    • Prevent ended_at < started_at from being written.
    • If encountered during audit/maintenance, treat it as repairable corruption instead of preserving running.
  4. Flow-level cancellation resilience

    • tasks flow cancel should be able to finish a flow when the linked task is only nominally active but its backing run is already terminal.
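
Items 2 and 3 could be combined into a single reconcile step along the following lines. This is a hedged sketch, not the project's implementation: the table layout, the reconcile_run helper, and the terminal status name "failed" are all assumptions made for illustration.

```python
import sqlite3

def reconcile_run(conn, run_id, ended_at, terminal_status="failed"):
    """Move every still-running row for run_id to a terminal status.

    ended_at comes from the authoritative run record. It is clamped so the
    stored value is never earlier than the row's own started_at.
    Returns the number of rows changed.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE task_runs
               SET status = ?,
                   ended_at = MAX(COALESCE(started_at, ?), ?)
             WHERE run_id = ? AND status = 'running'
            """,
            (terminal_status, ended_at, ended_at, run_id),
        )
    return cur.rowcount

# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_runs (task_id TEXT PRIMARY KEY, run_id TEXT, runtime TEXT, "
    "status TEXT, started_at INTEGER, ended_at INTEGER)"
)
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("t1", "run-1", "subagent", "running", 200, 150),  # ended_at < started_at
        ("t2", "run-1", "cli", "running", 100, None),
        ("t3", "run-2", "subagent", "running", 300, None),  # unrelated live run
    ],
)

changed = reconcile_run(conn, "run-1", ended_at=150)
rows = conn.execute(
    "SELECT task_id, status, ended_at FROM task_runs ORDER BY task_id"
).fetchall()
print(changed, rows)
```

Both rows for run-1 leave running in one transaction, the broken timestamp on t1 is clamped up to its started_at, and the row for run-2 is untouched. Keying the update on terminal run state rather than row age is what lets it handle relatively fresh rows.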

Useful repro shape

I do not yet have a tiny public repro script, but the incident shape was approximately:

  1. Start a subagent run that creates a top-level subagent task and a child CLI task.
  2. Cause the run to terminate through a kill/restart/failure path during or near finalization.
  3. End up with:
    • terminal subagent run state recorded
    • no live child worker/session
    • task rows still in running
    • optionally broken ended_at ordering
  4. Try cleanup:
    • openclaw tasks cancel <runId>
    • openclaw tasks maintenance --apply
    • openclaw tasks flow cancel <flowId>
  5. Observe that none of them clear the stale running rows.

Workaround used

The only reliable repair I found was:

  1. back up both the task registry and flow registry SQLite files
  2. manually update stale task_runs rows to terminal state using the authoritative terminal run data
  3. re-run supported tasks flow cancel / maintenance afterward

That repairs the state locally, but it is not an acceptable operator workflow.
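
For step 1, one way to take a consistent copy of a SQLite file before touching it is the online backup API in Python's sqlite3 module. The file names below are placeholders, not the real registry paths, and the demo uses a throwaway database in a temporary directory.

```python
import os
import sqlite3
import tempfile

def backup_sqlite(src_path, dest_path):
    """Copy a SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies all pages of the main database
    finally:
        dest.close()
        src.close()

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file names; substitute the actual task and flow registry files.
    registry = os.path.join(tmp, "tasks.sqlite")
    backup = os.path.join(tmp, "tasks.sqlite.bak")

    # Build a tiny stand-in database so the example is self-contained.
    conn = sqlite3.connect(registry)
    conn.execute("CREATE TABLE task_runs (task_id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO task_runs VALUES ('t1', 'running')")
    conn.commit()
    conn.close()

    backup_sqlite(registry, backup)

    # Verify the copy before attempting any manual repair.
    check = sqlite3.connect(backup)
    copied = check.execute("SELECT COUNT(*) FROM task_runs").fetchone()[0]
    check.close()
    print("rows in backup:", copied)
```

The same helper would be called once per registry file, ideally while the gateway is stopped, so that nothing writes to the registries between the backup and the repair.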

Environment

  • OpenClaw version: 2026.4.8 (9ece252)
  • Model in use: openai-codex/gpt-5.4
  • Runtime type: self-hosted Linux gateway
  • Channel/session type during incident: direct chat session spawning subagents

Related upstream context

This feels adjacent to, but not the same as, existing stale-task reports around maintenance/cancel behavior.

Potentially related:

  • stale-running maintenance gaps
  • tasks cancel not clearing stale rows
  • missing operator-level subagent kill/reap tools

If helpful, I can also provide a sanitized before/after snapshot of the affected registry fields and the exact sequence of supported commands that failed before manual repair.

Metadata

    Labels

    • P2: Normal backlog priority with limited blast radius.
    • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
    • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
