ACP zombie runs block gateway restart/update after 27 days #88205

@subaochen

Bug Description

ACP runs that have been stuck in `running` status for 27 days block gateway restart and update. The gateway logs the warning `restart blocked by active background task run(s)`, and `openclaw update` cannot proceed.

Steps to Reproduce

  1. Start several ACP runs (e.g., via plugin that triggers many parallel sessions)
  2. ACP processes exit (crash or timeout) without proper cleanup
  3. Gateway restart or openclaw update is triggered
  4. Gateway refuses to restart: `restart blocked by active background task run(s)`

Expected Behavior

  • ACP runs should have a timeout/heartbeat mechanism to detect dead sessions
  • Zombie runs should be auto-terminated after a configurable timeout
  • Gateway restart should not be permanently blocked by dead sessions
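
The heartbeat check asked for above can be pictured as a pure function over run records. This is only a sketch: the `Run` shape, the epoch-seconds representation of `last_event_at`, and the 30-minute default are assumptions made for illustration, not OpenClaw's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical default; the issue only asks that the timeout be configurable.
DEFAULT_TTL_SECONDS = 30 * 60


@dataclass
class Run:
    # Hypothetical record shape, loosely modelled on the fields named in this issue.
    run_id: str
    status: str                      # "running", "completed", "failed", "cancelled"
    last_event_at: float             # assumed: Unix epoch seconds of the last event
    ended_at: Optional[float] = None


def is_zombie(run: Run, now: float, ttl: float = DEFAULT_TTL_SECONDS) -> bool:
    """True if the run claims to be running but has been silent for longer than ttl."""
    return run.status == "running" and (now - run.last_event_at) > ttl


def find_zombies(runs: List[Run], now: float, ttl: float = DEFAULT_TTL_SECONDS) -> List[Run]:
    """Return the runs that should be auto-terminated under the given ttl."""
    return [r for r in runs if is_zombie(r, now, ttl)]
```

With a check like this, a run whose last event is 27 days old would be flagged, while a run that emitted an event a minute ago would not.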

Actual Behavior

Five ACP task runs (three distinct runIds) created on 2026-05-03 between 06:28 and 06:32 GMT+8 remained in status=running for 27 days:

| runId | taskId | status | created_at |
| --- | --- | --- | --- |
| d01c671b-fff8-4b5c-8254-fadab81441ca | 482e18b0-45e3-4804-8b23-69c958e74d96 | running | 2026-05-03 |
| 7dbcf14c-9546-4680-8c13-5c74a58f2633 | 307669b0-46fb-4f1d-af42-0c14a97620a4 | running | 2026-05-03 |
| 7dbcf14c-9546-4680-8c13-5c74a58f2633 | 8ceaf3fb-bf8e-47ba-a59a-cba8a896c90e | running | 2026-05-03 |
| 67bc583e-6d7f-482a-8809-63ccaeb3bfc0 | 1c782714-73e5-4549-aefa-9751c1f7e0fc | running | 2026-05-03 |
| 67bc583e-6d7f-482a-8809-63ccaeb3bfc0 | 9cf9b4e2-6061-4030-8b44-0ed8ffdda6e2 | running | 2026-05-03 |

All had runtime=acp and no ended_at, and two of the three runIds were each referenced by two taskIds.

Log Evidence

2026-05-30T10:21:28.245+08:00 warn gateway {"subsystem":"gateway"} restart blocked by active background task run(s): taskId=482e18b0-45e3-4804-8b23-69c958e74d96 runId=d01c671b-fff8-4b5c-8254-fadab81441ca status=running runtime=acp title=[Sun 2026-05-03 06:32 GMT+8] ...

Environment

  • OpenClaw: 2026.5.18 (50a2481)
  • OS: Linux 6.16.4-061604-generic (x64)
  • Node: v22.22.2
  • Runtime: acp (opencode)

Root Cause Analysis

The task_runs SQLite database (~/.openclaw/tasks/runs.sqlite) stores run state. When ACP sessions crash or time out, their status is never transitioned from running to a terminal state (completed/failed/cancelled). The gateway restart logic blocks on any run with status=running, and it has no TTL or heartbeat check to tell live runs from dead ones.

Suggested Fix

  1. Add a max runtime TTL for ACP runs (e.g., configurable, default 30 min)
  2. Add a heartbeat timeout — if last_event_at is older than TTL, mark as failed
  3. On gateway restart, auto-clean runs where status=running AND last_event_at < threshold
  4. Consider a --force-restart flag that ignores blocked runs
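
To make items 3 and 4 concrete, here is a rough sketch of how a restart gate might separate live runs from stale ones. The `RunState` type, the `plan_restart` function, and the epoch-seconds timestamps are hypothetical names and assumptions for illustration; the real gateway logic may be structured quite differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunState:
    # Hypothetical in-memory view of a task_runs row.
    task_id: str
    status: str
    last_event_at: float  # assumed: Unix epoch seconds


def plan_restart(
    runs: List[RunState],
    now: float,
    ttl_seconds: float = 1800.0,
    force: bool = False,
) -> Tuple[List[RunState], List[RunState]]:
    """Return (blockers, to_reap).

    blockers: running runs that still look alive and should delay the restart.
    to_reap:  running runs silent for longer than ttl_seconds, to be marked failed.
    With force=True (the proposed --force-restart), live runs no longer block.
    """
    running = [r for r in runs if r.status == "running"]
    to_reap = [r for r in running if now - r.last_event_at > ttl_seconds]
    alive = [r for r in running if now - r.last_event_at <= ttl_seconds]
    blockers = [] if force else alive
    return blockers, to_reap
```

Under this design only runs with recent activity can block a restart, so a run that has been silent for 27 days would land in `to_reap` instead of holding the gateway hostage.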

Workaround (Manual Cleanup)

Manually update the SQLite database:
```sql
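-- :now and :cutoff are bind parameters; supply values in the timestamp format the table already uses.
UPDATE task_runs SET status='failed', ended_at=:now, error='zombie_terminated'
WHERE status='running' AND last_event_at < :cutoff;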
```

This is not a sustainable solution for end users.
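
Until a proper fix lands, the manual cleanup could be wrapped in a small script so the update runs in a transaction with bound parameters. This is a sketch only: it assumes the columns named in this issue, and it assumes `last_event_at` and `ended_at` hold epoch milliseconds, which should be checked against the real schema first. The function name `terminate_zombie_runs` is made up for illustration.

```python
import sqlite3
import time
from typing import Optional


def terminate_zombie_runs(
    conn: sqlite3.Connection,
    max_idle_ms: int,
    now_ms: Optional[int] = None,
) -> int:
    """Mark long-silent running rows as failed; returns the number of rows changed.

    Assumes task_runs has status, ended_at, error and last_event_at columns,
    with timestamps in epoch milliseconds (an assumption, not confirmed here).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff_ms = now_ms - max_idle_ms
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE task_runs "
            "SET status='failed', ended_at=:now, error='zombie_terminated' "
            "WHERE status='running' AND last_event_at < :cutoff",
            {"now": now_ms, "cutoff": cutoff_ms},
        )
    return cur.rowcount
```

It would be called with a connection opened on a copy-protected database, for example after copying ~/.openclaw/tasks/runs.sqlite to a backup file, and with `max_idle_ms` set to whatever idle threshold is considered safe.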

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
