
fix(tasks): reclaim ACP zombie runs blocking gateway restart#88281

Merged
steipete merged 3 commits into
openclaw:main from
openperf:fix/88205-acp-zombie-restart-blocker
May 31, 2026

Conversation

@openperf openperf commented May 30, 2026

Member

Summary

  • Problem: After a gateway crash (or a kill during openclaw update) in the middle of an ACP turn, the gateway can refuse to restart indefinitely. The reported symptom is a managed gateway "blocked from restart/update" with a stuck ACP run, surfacing after long uptimes (the reporter saw it at ~27 days).
  • Root Cause: The gateway blocks restart while any task record is still status=running with no endedAt (getInspectableActiveTaskRestartBlockers). Reconciliation reclaims such records once their backing session is gone, via hasBackingSession, which gates every runtime through a real liveness probe — cron through isCronJobActive, cli through an active run context — except ACP, which returned Boolean(acpEntry.entry). A persistent ACP session keeps a durable session-store entry for its entire lifetime, and that entry survives a gateway crash. So a crashed ACP turn leaves a running record that hasBackingSession treats as "backed" forever: it is never reclaimed, never times out of the restart-blocker set, and wedges restart until the operator clears state manually.
  • Fix (two parts, both behind the existing injected maintenance-runtime seam):
    • Liveness gate: gate ACP backing on the in-process live-turn registry instead of entry existence. AcpSessionManager already tracks an authoritative per-session live-turn map (set when a prompt turn starts, deleted in the turn's finally; it is even used internally as the idle-eviction guard). This adds a small public probe AcpSessionManager.hasActiveTurn(sessionKey) and routes the ACP branch of hasBackingSession through it. The per-turn invariant makes this exact: ACP task records are created when a prompt turn begins and marked terminal inside the same turn, before the live-turn entry is deleted — so status=running with no live turn is only a crash mid-turn, or the brief create→register startup window that the unchanged 5-minute reconcile grace already covers.
    • Authoritative-process gate: the live-turn map is process-local (only the gateway process owns it). The openclaw tasks maintenance CLI runs the same runTaskRegistryMaintenance in a separate process with an empty live-turn map, so a naive liveness probe there would read every genuinely-live gateway-owned ACP turn as dead and mark it lost. This reuses the existing cron pattern: the cron branch already gates its process-local isCronJobActive probe behind an authoritative-runtime flag set true only by the gateway at startup. The flag/function are generalized from cron-only (isCronRuntimeAuthoritative → isRuntimeAuthoritative, cronRuntimeAuthoritative → runtimeAuthoritative) and the same gate is applied to the ACP branch: a non-authoritative process stays conservative and never reclaims.
    • This is the generic liveness gate the issue asks for; it deliberately does not add the reporter's suggested max-runtime TTL, heartbeat-timeout, or --force-restart config knobs.
  • What changed:
    • src/acp/control-plane/manager.core.ts — add public hasActiveTurn(sessionKey) reading the existing in-process live-turn map.
    • src/tasks/task-registry.maintenance.ts — add hasActiveAcpTurn to the injected maintenance runtime (default delegates to getAcpSessionManager().hasActiveTurn); the ACP branch of hasBackingSession now returns it behind an authoritative-process gate. Generalize the cron-only authoritative flag/function to cover both process-local probes (cron job map + ACP turn map); the diagnostics retain-reason becomes runtime_not_authoritative.
    • src/gateway/server-startup-early.ts — the gateway's single configureTaskRegistryMaintenance call sets runtimeAuthoritative: true (renamed from cronRuntimeAuthoritative), marking the gateway the authoritative owner of both process-local probes.
    • src/tasks/task-registry.maintenance.issue-60299.test.ts — real-component regression tests (see below).
    • src/tasks/task-registry.test.ts — adopt the generalized flag name in the maintenance-runtime test helper and model the gateway as authoritative in the two real-runtime orphan-reconcile tests.
  • What did NOT change (scope boundary):
    • Config surface unchanged (no schema, defaults, doctor migrations, or docs/reference/config edits).
    • Plugin surface unchanged (no plugin SDK, manifest, extensions/api.ts / runtime-api.ts, registry, or loader edits).
    • Terminal-ACP-session cleanup (shouldCloseTerminalAcpSession) still reads the session-store entry; that path is untouched.
    • The 5-minute reconcile grace and all non-ACP backing probes (cli, subagent) are unchanged. The cron probe keeps identical semantics — only the flag/function name is generalized.
    • No dependency / lockfile changes.
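
The two gates described in the Fix bullets can be sketched roughly as follows. This is a simplified illustration, not the code in src/tasks/task-registry.maintenance.ts: the TaskRecord and MaintenanceRuntime shapes, the hasCliRunContext probe name, and keying every probe by sessionKey are assumptions made to keep the example self-contained. Only hasBackingSession, hasActiveAcpTurn, isCronJobActive, and runtimeAuthoritative are names taken from the PR description.

```typescript
// Simplified, hypothetical shapes: the real TaskRecord and maintenance
// runtime carry more fields than shown here.
type TaskRuntime = "acp" | "cron" | "cli";

interface TaskRecord {
  taskId: string;
  runtime: TaskRuntime;
  sessionKey: string;
}

// Injected seam: leaf liveness probes plus the authoritative-process flag.
interface MaintenanceRuntime {
  runtimeAuthoritative: boolean;
  hasActiveAcpTurn(sessionKey: string): boolean;
  isCronJobActive(sessionKey: string): boolean;
  hasCliRunContext(sessionKey: string): boolean; // assumed name for the cli probe
}

function hasBackingSession(task: TaskRecord, rt: MaintenanceRuntime): boolean {
  if (task.runtime === "cli") {
    return rt.hasCliRunContext(task.sessionKey);
  }
  // Both remaining probes read process-local maps. A process that does not
  // own those maps stays conservative and reports the task as backed.
  if (!rt.runtimeAuthoritative) {
    return true;
  }
  if (task.runtime === "cron") {
    return rt.isCronJobActive(task.sessionKey);
  }
  // ACP: liveness of the prompt turn, not existence of the session-store entry.
  return rt.hasActiveAcpTurn(task.sessionKey);
}

// After a crash the gateway restarts with an empty live-turn map.
const liveTurns = new Set<string>();
const gateway: MaintenanceRuntime = {
  runtimeAuthoritative: true,
  hasActiveAcpTurn: (key) => liveTurns.has(key),
  isCronJobActive: () => false,
  hasCliRunContext: () => false,
};
const maintenanceCli: MaintenanceRuntime = { ...gateway, runtimeAuthoritative: false };

const zombie: TaskRecord = { taskId: "t1", runtime: "acp", sessionKey: "acp:demo" };
const backedInGateway = hasBackingSession(zombie, gateway); // false: eligible for reclaim
const backedInCli = hasBackingSession(zombie, maintenanceCli); // true: never reclaimed
```

In this sketch the same zombie record is reclaimable only in the process that owns the live-turn map, which is the property the authoritative-process gate is meant to guarantee.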

Reproduction

  1. A persistent ACP session is running a prompt turn; a running task record exists for that turn, and the durable session-store entry exists.
  2. The gateway crashes (or is killed by openclaw update) mid-turn. The in-process live-turn registry is gone, but the persisted session-store entry survives on disk.
  3. Before this PR: on restart, the running task record has no endedAt; hasBackingSession sees the surviving entry (Boolean(acpEntry.entry) === true) and treats it as backed forever. The record never reconciles to lost, stays in getInspectableActiveTaskRestartBlockers(), and blocks restart/update until the operator clears state by hand.
  4. After this PR: in the gateway (authoritative) process, hasBackingSession consults hasActiveAcpTurn, which is false after a crash. Once the existing 5-minute reconcile grace expires, the record is marked lost, drops out of the restart-blocker set, and restart proceeds. A turn that is genuinely mid-flight keeps hasActiveTurn true and is never reclaimed. A non-authoritative openclaw tasks maintenance CLI process never reclaims any ACP record.
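
The before/after behavior in the steps above can be modeled with a small reconcile loop. This is a sketch under stated assumptions, not the production runTaskRegistryMaintenance: the Task shape, the reconcile and restartBlockers helpers, keying the grace off startedAt, and setting endedAt when a record is marked lost are all hypothetical simplifications. The 5-minute grace and the "running with no endedAt" blocker rule come from the PR description.

```typescript
const RECONCILE_GRACE_MS = 5 * 60 * 1000;

// Hypothetical minimal task record.
interface Task {
  id: string;
  status: "running" | "lost";
  startedAt: number;
  endedAt?: number;
}

// Marks grace-expired, unbacked running tasks as lost; returns how many changed.
function reconcile(tasks: Task[], isBacked: (t: Task) => boolean, now: number): number {
  let reconciled = 0;
  for (const t of tasks) {
    if (t.status !== "running" || t.endedAt !== undefined) continue;
    if (now - t.startedAt < RECONCILE_GRACE_MS) continue; // still inside the grace window
    if (isBacked(t)) continue; // a live turn keeps the record
    t.status = "lost";
    t.endedAt = now;
    reconciled += 1;
  }
  return reconciled;
}

// A task blocks restart while it is still running with no endedAt.
function restartBlockers(tasks: Task[]): Task[] {
  return tasks.filter((t) => t.status === "running" && t.endedAt === undefined);
}

const now = Date.now();
const tenMinutes = 10 * 60 * 1000;
const tasks: Task[] = [
  { id: "zombie", status: "running", startedAt: now - tenMinutes }, // crashed mid-turn
  { id: "live", status: "running", startedAt: now - tenMinutes }, // turn still in flight
  { id: "fresh", status: "running", startedAt: now - 60 * 1000 }, // turn not registered yet
];
const liveTurns = new Set<string>(["live"]);

const reconciledCount = reconcile(tasks, (t) => liveTurns.has(t.id), now);
const blockerIds = restartBlockers(tasks).map((t) => t.id);
```

Here only the crashed record is reclaimed. The in-flight turn and the record younger than the grace both remain restart blockers, matching the behavior described for the authoritative gateway process.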

Real behavior proof

Behavior addressed (#88205): a grace-expired status=running ACP task whose session-store entry survives a crash but whose prompt turn is no longer live is reclaimed (lost) and leaves the restart-blocker set; a running ACP task with a live in-flight turn is never reclaimed; and a non-authoritative maintenance process (empty live-turn map) never reclaims a gateway-owned ACP task.

Real environment tested (Linux, Node 22, real-component vitest harness driving production runTaskRegistryMaintenance + previewTaskRegistryMaintenance + getInspectableActiveTaskRestartBlockers + hasBackingSession): the harness builds real TaskRecords and exercises the actual reconcile / restart-blocker code paths; only the leaf liveness probes (hasActiveAcpTurn, readAcpSessionEntry, isCronJobActive, getAgentRunContext) and the authoritative flag are injected through the same taskRegistryMaintenanceRuntime seam production uses to wire getAcpSessionManager().hasActiveTurn and the gateway's runtimeAuthoritative: true.

Exact steps or command run after this patch:

  1. node scripts/run-vitest.mjs src/tasks/task-registry.maintenance.issue-60299.test.ts --reporter=verbose (AFTER fix)
  2. Temporarily revert the ACP branch of hasBackingSession to the shipped entry-existence check (return Boolean(acpEntry.entry)), rerun the same file (BEFORE — liveness gate), then restore
  3. Temporarily drop the authoritative-process gate from the ACP branch, rerun the same file (BEFORE — authoritative gate), then restore
  4. node scripts/run-vitest.mjs src/tasks/task-registry.test.ts
  5. node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts
  6. pnpm exec oxfmt --check --threads=1 src/acp/control-plane/manager.core.ts src/tasks/task-registry.maintenance.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/gateway/server-startup-early.ts

Evidence after fix (verbatim vitest BEFORE/AFTER on the regression tests):

--- AFTER FIX (both gates in place) ---
 ✓ reclaims a stale running ACP task with no live turn even though its session-store entry survives
 ✓ keeps a running ACP task live while a prompt turn is still in flight
 ✓ does not reclaim a running ACP task from a non-authoritative process even with an empty live-turn map
 Test Files  1 passed (1)
      Tests  19 passed (19)

--- BEFORE (liveness gate reverted to entry-existence Boolean(acpEntry.entry)) ---
 × reclaims a stale running ACP task with no live turn even though its session-store entry survives
   → expected +0 to be 1   (surviving entry treated as backed → zombie never reclaimed)
 ✓ keeps a running ACP task live while a prompt turn is still in flight
 ✓ does not reclaim a running ACP task from a non-authoritative process even with an empty live-turn map
 Test Files  1 failed (1)
      Tests  1 failed | 18 passed (19)

--- BEFORE (authoritative-process gate removed from ACP branch) ---
 ✓ reclaims a stale running ACP task with no live turn even though its session-store entry survives
 ✓ keeps a running ACP task live while a prompt turn is still in flight
 × does not reclaim a running ACP task from a non-authoritative process even with an empty live-turn map
   → expected 1 to be +0   (CLI process with empty live-turn map wrongly reclaims a live gateway-owned turn)
 Test Files  1 failed (1)
      Tests  1 failed | 18 passed (19)

Companion suites green after fix: src/tasks/task-registry.test.ts 72/72, src/acp/control-plane/manager.test.ts 86/86; oxfmt --check clean on all five touched files.

Observed result after fix:

  • The crashed-run zombie reconciles (reconciled: 1, status lost) and getInspectableActiveTaskRestartBlockers() becomes empty → restart unblocked.
  • A running ACP task with hasActiveAcpTurn === true is never reclaimed (reconciled: 0, stays running, still a blocker).
  • A non-authoritative maintenance process (runtimeAuthoritative: false, empty live-turn map) retains the gateway-owned running task (reconciled: 0, diagnostics reason runtime_not_authoritative, still a blocker) — no false lost.
  • Reverting only the liveness gate flips the zombie-reclaim test red (expected +0 to be 1); removing only the authoritative gate flips the non-authoritative test red (expected 1 to be +0) — each gate is independently pinned.

What was not tested:

  • End-to-end: a real opencode/codex ACP child crashing mid-turn on a managed gateway over a multi-day uptime, then a real openclaw gateway restart / openclaw update, and a real openclaw tasks maintenance --apply running while the gateway holds a live turn. The deterministic harness reproduces the necessary conditions (running record + surviving entry + no live turn; authoritative vs non-authoritative process) without a real backend or 27-day uptime.
  • Full pnpm check / full pnpm test broad gates — deferred to CI on this memory-constrained box.

Regression tests:

  • reclaims a stale running ACP task with no live turn even though its session-store entry survives (ACP zombie runs block gateway restart/update after 27 days #88205) — fails on the shipped entry-existence behavior, passes with the live-turn gate.
  • keeps a running ACP task live while a prompt turn is still in flight — passes before and after, guarding against the inverse regression (never over-reclaim a healthy in-flight turn).
  • does not reclaim a running ACP task from a non-authoritative process even with an empty live-turn map — fails if the authoritative-process gate is dropped, guarding the standalone-CLI false-lost case.

Risk / Mitigation

  • Risk: over-reclaiming a live ACP session. Mitigation: the per-turn invariant — a running ACP record always coincides with a live turn during normal operation, and there is no running record between turns of a persistent session — plus hasActiveTurn staying true for any in-flight turn (including a hung-but-alive one) means only the crashed-run set is affected.
  • Risk: a non-gateway process (e.g. the openclaw tasks maintenance CLI) observing an empty live-turn map and falsely marking live gateway-owned turns lost. Mitigation: the authoritative-process gate — process-local probes (cron job map + ACP turn map) only reclaim in the process that owns the map (the gateway, which sets runtimeAuthoritative: true at startup); every other process stays conservative and returns "backed".
  • Risk: the brief window between task-record creation and live-turn registration during backend resolution could look like "no live turn". Mitigation: that window is covered by the unchanged 5-minute reconcile grace keyed off the task's own timestamps; nothing reclaims a record younger than the grace.
  • Risk: generalizing the cron authoritative flag could alter cron behavior. Mitigation: the cron branch keeps identical gate semantics — only the flag/function identifiers are renamed; the gateway still sets the flag true, and cron behavior tests are unchanged.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration (task-registry restart blockers / reconciliation)
  • ACP agent runtime (AcpSessionManager live-turn liveness)

Linked Issue/PR

Fixes #88205

clawsweeper Bot commented May 30, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 9:10 AM ET / 13:10 UTC.

Summary
The branch adds process-local ACP active-turn liveness, gates ACP task reconciliation on gateway-authoritative runtime state, preserves cron authority diagnostics, and expands ACP/task regression coverage.

PR surface: Source +65, Tests +298. Total +363 across 8 files.

Reproducibility: yes. source-reproducible: current main still returns Boolean(acpEntry.entry) for ACP task backing, so a stale running task with a surviving session-store entry remains backed. The PR adds regression tests for stale, live, and non-authoritative ACP cases.

Review metrics: 2 noteworthy metrics.

  • Maintenance liveness seam: 1 ACP probe added, 1 authority flag generalized. The merge changes which process may decide whether ACP and cron task records are still live.
  • Operator diagnostic reasons: 1 ACP reason added, 1 cron reason preserved. Task maintenance diagnostics are operator-visible JSON and should remain intentional before merge.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] Merging intentionally changes ACP running-task liveness from durable session-entry presence to gateway-owned process-local active-turn state; marker lifecycle drift could over-reclaim a live prompt or preserve a crashed restart blocker.
  • [P1] Existing stale ACP task records may reconcile differently after upgrade: gateway-authoritative maintenance can now mark them lost, while standalone maintenance remains conservative.
  • [P1] Proof is strong deterministic production-path harness plus maintainer local checks, but no full managed gateway crash/update smoke with a real ACP child process is attached.

Maintainer options:

  1. Land With Session-State Signoff (recommended)
    Accept the gateway-owned active-turn authority model after required CI and maintainer review confirm the deterministic crash-state proof is enough for this upgrade behavior.
  2. Require Live Crash Smoke First
    Ask for a managed-gateway crash/update artifact with a real ACP child process if maintainers want proof beyond the deterministic production-path harness.
  3. Pause If Authority Model Is Disputed
    Hold or close the branch if maintainers decide standalone maintenance must reclaim ACP tasks without gateway-owned process-local liveness.

Next step before merge

  • [P2] No narrow automated repair remains; the PR needs maintainer review of the session-state authority change, required CI, and whether deterministic proof is enough before merge.

Security
Cleared: The diff does not change dependencies, workflows, permissions, credentials, lockfiles, or external code execution paths; no concrete security or supply-chain concern was found.

Review details

Best possible solution:

Land a rebased, CI-green version after maintainer session-state signoff, keeping the gateway as the only authoritative ACP/cron process-local liveness owner and preserving the focused regression coverage.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main still returns Boolean(acpEntry.entry) for ACP task backing, so a stale running task with a surviving session-store entry remains backed. The PR adds regression tests for stale, live, and non-authoritative ACP cases.

Is this the best way to solve the issue?

Yes: the patch fixes the restart blocker at the task-maintenance seam without adding TTL, heartbeat, or force-restart config knobs. The remaining question is maintainer acceptance of the gateway-owned liveness authority model and proof depth.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 3ca4e5f61676.

Label changes

Label justifications:

  • P1: The PR fixes a gateway restart/update blocker caused by stuck ACP running tasks, which can break an active user workflow.
  • merge-risk: 🚨 compatibility: Existing installations with persisted ACP running task records may reconcile differently after upgrade, and an internal maintenance authority option is generalized.
  • merge-risk: 🚨 session-state: ACP task state now depends on a process-local active-turn marker whose lifecycle must stay aligned with real prompt-turn lifetime.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body and maintainer follow-up include after-fix terminal command output plus before-revert failures for the production maintenance paths, though no full managed gateway crash/update smoke is attached.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body and maintainer follow-up include after-fix terminal command output plus before-revert failures for the production maintenance paths, though no full managed gateway crash/update smoke is attached.
Evidence reviewed

PR surface:

Source +65, Tests +298. Total +363 across 8 files.

PR surface stats:

Area       Files  Added  Removed  Net
Source     4      307    242      +65
Tests      4      307    9        +298
Docs       0      0      0        0
Config     0      0      0        0
Generated  0      0      0        0
Other      0      0      0        0
Total      8      614    251      +363

What I checked:

Likely related people:

  • steipete: Current-main blame in this shallow checkout points to Peter Steinberger for the central task-maintenance and ACP manager lines, and commit 77f1359 touched the ACP core files; he also authored the PR follow-up commit that fixed the active-turn cleanup invariant. (role: recent area contributor and reviewer; confidence: high; commits: 7d8fdef995b1, 77f1359612f6, 9a1e79db98a2; files: src/tasks/task-registry.maintenance.ts, src/acp/control-plane/manager.core.ts, src/acp/control-plane/manager.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added the labels "rating: 🧂 unranked krab" (Not merge-ready due to missing proof or serious correctness/safety concerns), "status: ⏳ waiting on author" (ClawSweeper has contributor-facing work open and is waiting for author action), "P1" (High-priority user-facing bug, regression, or broken workflow), "merge-risk: 🚨 compatibility" (🚨 May break existing users, config, migrations, defaults, or upgrade paths), and "merge-risk: 🚨 session-state" (🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state) May 30, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch from 09a021a to d6fc0a6 Compare May 30, 2026 09:52
@openclaw-barnacle openclaw-barnacle Bot added the gateway Gateway runtime label May 30, 2026
@clawsweeper clawsweeper Bot added the labels "rating: 🐚 platinum hermit" (Good normal PR readiness with ordinary maintainer review expected) and "status: 👀 ready for maintainer look" (ClawSweeper has no concrete contributor-facing blocker left for this PR), and removed the labels "rating: 🧂 unranked krab" and "status: ⏳ waiting on author" May 30, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch from d6fc0a6 to 3e1c339 Compare May 30, 2026 10:03
@clawsweeper clawsweeper Bot added the labels "rating: 🦐 gold shrimp" (Decent PR readiness signal, but merge confidence is limited), "status: ⏳ waiting on author" (ClawSweeper has contributor-facing work open and is waiting for author action), and "rating: 🧂 unranked krab" (Not merge-ready due to missing proof or serious correctness/safety concerns), and removed the labels "rating: 🐚 platinum hermit", "status: 👀 ready for maintainer look", and "rating: 🦐 gold shrimp" May 30, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch 2 times, most recently from ae3b49f to b547747 Compare May 30, 2026 15:04
@clawsweeper clawsweeper Bot added the label "rating: 🦪 silver shellfish" (Thin PR readiness signal; proof, validation, or implementation needs work) and removed the label "rating: 🧂 unranked krab" May 30, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch from b547747 to 7146dbf Compare May 30, 2026 15:42
@clawsweeper clawsweeper Bot added the label "rating: 🐚 platinum hermit" (Good normal PR readiness with ordinary maintainer review expected) and removed the labels "rating: 🦪 silver shellfish" and "status: ⏳ waiting on author" May 30, 2026
@clawsweeper clawsweeper Bot added the labels "rating: 🦐 gold shrimp" (Decent PR readiness signal, but merge confidence is limited) and "status: ⏳ waiting on author" (ClawSweeper has contributor-facing work open and is waiting for author action), and removed the labels "rating: 🐚 platinum hermit" and "status: 👀 ready for maintainer look" May 31, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch from 7f928c7 to dd9f036 Compare May 31, 2026 01:40
@clawsweeper clawsweeper Bot added the labels "rating: 🐚 platinum hermit" (Good normal PR readiness with ordinary maintainer review expected) and "status: 👀 ready for maintainer look" (ClawSweeper has no concrete contributor-facing blocker left for this PR), and removed the labels "rating: 🦐 gold shrimp" and "status: ⏳ waiting on author" May 31, 2026
@openperf openperf force-pushed the fix/88205-acp-zombie-restart-blocker branch from dd9f036 to 56b0849 Compare May 31, 2026 11:37
@steipete steipete force-pushed the fix/88205-acp-zombie-restart-blocker branch from 56b0849 to 794397a Compare May 31, 2026 12:56
@steipete

Contributor

Review finding from maintainer pass, fixed in 794397a6ba:

  • src/acp/control-plane/manager.core.ts: the ACP live-turn marker was originally cleared only through markBackgroundTaskTerminal. A retry/setup exception after markAcpTurnActive() but before terminal task update could leave the process-local marker stuck true, causing task maintenance to keep treating a dead ACP turn as backed and preserving the restart blocker. The fix moves cleanup to a runTurn-level finally, so every exit after marking clears the liveness bit.
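
The cleanup invariant from this finding can be illustrated with a small sketch. It is not the code in manager.core.ts: the activeTurns set and the runTurn signature are hypothetical, and the turn body is shown synchronously for brevity (the same try/finally shape applies to an async turn). markAcpTurnActive and hasActiveTurn are names taken from the PR discussion.

```typescript
// Hypothetical process-local registry of sessions with an in-flight turn.
const activeTurns = new Set<string>();

function markAcpTurnActive(sessionKey: string): void {
  activeTurns.add(sessionKey);
}

function hasActiveTurn(sessionKey: string): boolean {
  return activeTurns.has(sessionKey);
}

// Every exit after marking, normal or exceptional, clears the liveness bit.
function runTurn<T>(sessionKey: string, body: () => T): T {
  markAcpTurnActive(sessionKey);
  try {
    return body();
  } finally {
    activeTurns.delete(sessionKey);
  }
}

const observed = { liveDuringTurn: false, setupFailed: false };
try {
  runTurn("acp:demo", () => {
    observed.liveDuringTurn = hasActiveTurn("acp:demo");
    // Simulates a retry/setup exception before any terminal task update.
    throw new Error("setup failed before terminal task update");
  });
} catch {
  observed.setupFailed = true;
}
const stillActive = hasActiveTurn("acp:demo");
```

With cleanup tied only to a terminal task update, the thrown setup error would leave the marker set. With the outer finally, stillActive is false after the failure, so maintenance no longer treats the dead turn as backed.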

Proof after fix:

  • node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts --reporter=verbose
  • pnpm exec oxfmt --check --threads=1 src/acp/control-plane/manager.core.ts src/tasks/task-registry.maintenance.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/gateway/server-startup-early.ts src/acp/control-plane/manager.test.ts
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main -> clean, no accepted/actionable findings

@clawsweeper clawsweeper Bot added the labels "proof: sufficient" (ClawSweeper judged the real behavior proof convincing) and "rating: 🦞 diamond lobster" (Very strong PR readiness with only minor maintainer review expected), and removed the label "rating: 🐚 platinum hermit" May 31, 2026
@steipete steipete force-pushed the fix/88205-acp-zombie-restart-blocker branch from 794397a to 9a1e79d Compare May 31, 2026 13:05
@steipete

Contributor

Verification after maintainer fixup on 9a1e79db98a2426b4e0a28ebdee95ab1ad236ab3:

Behavior addressed: ACP task maintenance now uses process-local active-turn liveness for runtime-authoritative checks, and ACP liveness is cleared from the outer runTurn finally path so retry/setup failures before terminal task updates cannot leave a permanent restart blocker.

Real environment tested: local OpenClaw checkout on macOS, Node/pnpm repo wrappers.

Exact steps or command run after this patch:

  • node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts --reporter=verbose
  • pnpm exec oxfmt --check --threads=1 src/acp/control-plane/manager.core.ts src/tasks/task-registry.maintenance.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/gateway/server-startup-early.ts src/acp/control-plane/manager.test.ts
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
  • pnpm check:test-types
  • pnpm lint --threads=8

Evidence after fix:

Observed result after fix: the regression test covers an ACP retry/setup exception before terminal task update and asserts the child session is no longer reported active.

What was not tested: broad local pnpm check:test-types / pnpm lint --threads=8 are red on current origin/main outside this PR (src/agents/embedded-agent-runner.sanitize-session-history.test.ts:1375, src/infra/device-auth-store.ts:32). Fresh GitHub CI for this SHA is still queued.

openperf and others added 3 commits May 31, 2026 09:13
…w#88205)

hasBackingSession treated an ACP task as backed whenever its persisted
session-store entry existed, so a crashed mid-turn ACP run left a
status=running record that survived the crash and wedged gateway
restart/update forever.

Gate ACP backing on in-process live-turn liveness instead of entry
existence, behind the existing authoritative-process flag (generalized
from cron-only) so a standalone maintenance CLI with an empty live-turn
map stays conservative and never reclaims. The liveness signal lives in a
core-internal active-turns registry (mirroring cron active-jobs) so it
stays off the SDK-exported AcpSessionManager surface. It is marked once
before the backend loop and cleared when the task is marked terminal, so
a slow init or backend failover cleanup cannot let the sweep reclaim a
still-live turn.
Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract.
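
A rough sketch of why the reason string was split rather than merged. The two reason strings are the ones named in the commit message above. The RetainedTask shape, the retainReasonFor helper, and the filtering script are hypothetical and only illustrate how an operator script matching the shipped cron string would keep working.

```typescript
type ProbeRuntime = "cron" | "acp";
type RetainReason =
  | "cron_runtime_not_authoritative" // already shipped, treated as a stable contract
  | "acp_runtime_not_authoritative"; // new, ACP branch only

function retainReasonFor(runtime: ProbeRuntime): RetainReason {
  return runtime === "cron"
    ? "cron_runtime_not_authoritative"
    : "acp_runtime_not_authoritative";
}

// Hypothetical shape of one retained-task diagnostics entry.
interface RetainedTask {
  taskId: string;
  runtime: ProbeRuntime;
  reason: RetainReason;
}

const retained: RetainedTask[] = [
  { taskId: "cron-1", runtime: "cron", reason: retainReasonFor("cron") },
  { taskId: "acp-1", runtime: "acp", reason: retainReasonFor("acp") },
];

// An operator script that matches the shipped cron string still finds cron tasks.
const cronRetained = retained
  .filter((entry) => entry.reason === "cron_runtime_not_authoritative")
  .map((entry) => entry.taskId);
```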
@steipete steipete force-pushed the fix/88205-acp-zombie-restart-blocker branch from 9a1e79d to 6088f14 Compare May 31, 2026 13:15
@steipete

Contributor

Latest maintainer rebase/fixup is now 6088f14b0cc8079c5c504942b373149a4cc4070a on current origin/main (100dd79468).

Re-ran after rebase:

  • node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts --reporter=verbose
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main

Result: focused ACP/task-registry proof passed again, and autoreview is clean. Fresh GitHub CI is queued for the new SHA.

@steipete steipete merged commit 02c7b5b into openclaw:main May 31, 2026
140 of 141 checks passed
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 1, 2026
…w#88281)

* fix(tasks): reclaim ACP zombie runs blocking gateway restart (openclaw#88205)

hasBackingSession treated an ACP task as backed whenever its persisted
session-store entry existed, so a crashed mid-turn ACP run left a
status=running record that survived the crash and wedged gateway
restart/update forever.

Gate ACP backing on in-process live-turn liveness instead of entry
existence, behind the existing authoritative-process flag (generalized
from cron-only) so a standalone maintenance CLI with an empty live-turn
map stays conservative and never reclaims. The liveness signal lives in a
core-internal active-turns registry (mirroring cron active-jobs) so it
stays off the SDK-exported AcpSessionManager surface. It is marked once
before the backend loop and cleared when the task is marked terminal, so
a slow init or backend failover cleanup cannot let the sweep reclaim a
still-live turn.

* fix(tasks): preserve cron operator JSON diagnostic reason

Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract.

* fix(tasks): clear ACP turn liveness on retry failures

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
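
The liveness gate in the commit message above can be pictured with a minimal sketch. It assumes a module-level set as the active-turns registry and a boolean authoritative-process flag. Every identifier here (`activeAcpTurns`, `markAcpTurnActive`, `clearAcpTurnActive`, `acpBackingDecision`, `runTurn`, and the `live_turn` / `no_live_turn` reasons) is hypothetical; only `acp_runtime_not_authoritative` appears in the PR.

```typescript
// Hypothetical in-process registry of sessions with a live prompt turn.
const activeAcpTurns = new Set<string>();

function markAcpTurnActive(sessionKey: string): void {
  activeAcpTurns.add(sessionKey);
}

function clearAcpTurnActive(sessionKey: string): void {
  activeAcpTurns.delete(sessionKey);
}

interface AcpTaskRecord {
  sessionKey: string;
  status: "running" | "succeeded" | "failed";
  endedAt?: number;
}

type BackingDecision =
  | { backed: true; reason: "live_turn" | "acp_runtime_not_authoritative" }
  | { backed: false; reason: "no_live_turn" };

// A process that does not own the live-turn map (for example a standalone
// maintenance CLI) stays conservative and never reports the task as reclaimable.
function acpBackingDecision(
  task: AcpTaskRecord,
  isAuthoritativeProcess: boolean,
): BackingDecision {
  if (!isAuthoritativeProcess) {
    return { backed: true, reason: "acp_runtime_not_authoritative" };
  }
  return activeAcpTurns.has(task.sessionKey)
    ? { backed: true, reason: "live_turn" }
    : { backed: false, reason: "no_live_turn" };
}

// Mark once before the backend loop; clear only after the task is terminal,
// including the path where every backend attempt fails.
async function runTurn(
  task: AcpTaskRecord,
  backends: Array<() => Promise<void>>,
): Promise<void> {
  markAcpTurnActive(task.sessionKey);
  try {
    let lastError: unknown;
    for (const attempt of backends) {
      try {
        await attempt();
        task.status = "succeeded";
        return;
      } catch (err) {
        lastError = err; // fall through to the next backend
      }
    }
    task.status = "failed";
    throw lastError;
  } finally {
    task.endedAt = Date.now();
    clearAcpTurnActive(task.sessionKey);
  }
}
```

In this sketch a record left at `status: "running"` after a crash has no entry in the fresh process's set, so an authoritative sweep reports it as unbacked, while a non-authoritative caller always reports it as backed.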
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Labels

  • gateway Gateway runtime
  • merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths.
  • merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state.
  • P1 High-priority user-facing bug, regression, or broken workflow.
  • proof: sufficient ClawSweeper judged the real behavior proof convincing.
  • rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected.
  • size: L
  • status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Development

Successfully merging this pull request may close these issues.

ACP zombie runs block gateway restart/update after 27 days
