fix(tasks): reclaim ACP zombie runs blocking gateway restart#88281
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 9:10 AM ET / 13:10 UTC. Summary PR surface: Source +65, Tests +298. Total +363 across 8 files. Reproducibility: yes. source-reproducible: current main still returns Boolean(acpEntry.entry) for ACP task backing, so a stale running task with a surviving session-store entry remains backed. The PR adds regression tests for stale, live, and non-authoritative ACP cases. Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land a rebased, CI-green version after maintainer session-state signoff, keeping the gateway as the only authoritative ACP/cron process-local liveness owner and preserving the focused regression coverage. Do we have a high-confidence way to reproduce the issue? Yes, source-reproducible: current main still returns Boolean(acpEntry.entry) for ACP task backing, so a stale running task with a surviving session-store entry remains backed. The PR adds regression tests for stale, live, and non-authoritative ACP cases. Is this the best way to solve the issue? Yes: the patch fixes the restart blocker at the task-maintenance seam without adding TTL, heartbeat, or force-restart config knobs. The remaining question is maintainer acceptance of the gateway-owned liveness authority model and proof depth. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 3ca4e5f61676. Label changesLabel justifications:
Evidence reviewedPR surface: Source +65, Tests +298. Total +363 across 8 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
09a021a to
d6fc0a6
Compare
d6fc0a6 to
3e1c339
Compare
ae3b49f to
b547747
Compare
b547747 to
7146dbf
Compare
7f928c7 to
dd9f036
Compare
dd9f036 to
56b0849
Compare
56b0849 to
794397a
Compare
|
Review finding from maintainer pass, fixed in
Proof after fix:
|
794397a to
9a1e79d
Compare
|
Verification after maintainer fixup on Behavior addressed: ACP task maintenance now uses process-local active-turn liveness for runtime-authoritative checks, and ACP liveness is cleared from the outer Real environment tested: local OpenClaw checkout on macOS, Node/pnpm repo wrappers. Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: the regression test covers an ACP retry/setup exception before terminal task update and asserts the child session is no longer reported active. What was not tested: broad local |
…w#88205) hasBackingSession treated an ACP task as backed whenever its persisted session-store entry existed, so a crashed mid-turn ACP run left a status=running record that survived the crash and wedged gateway restart/update forever. Gate ACP backing on in-process live-turn liveness instead of entry existence, behind the existing authoritative-process flag (generalized from cron-only) so a standalone maintenance CLI with an empty live-turn map stays conservative and never reclaims. The liveness signal lives in a core-internal active-turns registry (mirroring cron active-jobs) so it stays off the SDK-exported AcpSessionManager surface. It is marked once before the backend loop and cleared when the task is marked terminal, so a slow init or backend failover cleanup cannot let the sweep reclaim a still-live turn.
Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract.
9a1e79d to
6088f14
Compare
|
Latest maintainer rebase/fixup is now Re-ran after rebase:
Result: focused ACP/task-registry proof passed again, and autoreview is clean. Fresh GitHub CI is queued for the new SHA. |
…w#88281) * fix(tasks): reclaim ACP zombie runs blocking gateway restart (openclaw#88205) hasBackingSession treated an ACP task as backed whenever its persisted session-store entry existed, so a crashed mid-turn ACP run left a status=running record that survived the crash and wedged gateway restart/update forever. Gate ACP backing on in-process live-turn liveness instead of entry existence, behind the existing authoritative-process flag (generalized from cron-only) so a standalone maintenance CLI with an empty live-turn map stays conservative and never reclaims. The liveness signal lives in a core-internal active-turns registry (mirroring cron active-jobs) so it stays off the SDK-exported AcpSessionManager surface. It is marked once before the backend loop and cleared when the task is marked terminal, so a slow init or backend failover cleanup cannot let the sweep reclaim a still-live turn. * fix(tasks): preserve cron operator JSON diagnostic reason Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract. * fix(tasks): clear ACP turn liveness on retry failures --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
…w#88281) * fix(tasks): reclaim ACP zombie runs blocking gateway restart (openclaw#88205) hasBackingSession treated an ACP task as backed whenever its persisted session-store entry existed, so a crashed mid-turn ACP run left a status=running record that survived the crash and wedged gateway restart/update forever. Gate ACP backing on in-process live-turn liveness instead of entry existence, behind the existing authoritative-process flag (generalized from cron-only) so a standalone maintenance CLI with an empty live-turn map stays conservative and never reclaims. The liveness signal lives in a core-internal active-turns registry (mirroring cron active-jobs) so it stays off the SDK-exported AcpSessionManager surface. It is marked once before the backend loop and cleared when the task is marked terminal, so a slow init or backend failover cleanup cannot let the sweep reclaim a still-live turn. * fix(tasks): preserve cron operator JSON diagnostic reason Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract. * fix(tasks): clear ACP turn liveness on retry failures --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
…w#88281) * fix(tasks): reclaim ACP zombie runs blocking gateway restart (openclaw#88205) hasBackingSession treated an ACP task as backed whenever its persisted session-store entry existed, so a crashed mid-turn ACP run left a status=running record that survived the crash and wedged gateway restart/update forever. Gate ACP backing on in-process live-turn liveness instead of entry existence, behind the existing authoritative-process flag (generalized from cron-only) so a standalone maintenance CLI with an empty live-turn map stays conservative and never reclaims. The liveness signal lives in a core-internal active-turns registry (mirroring cron active-jobs) so it stays off the SDK-exported AcpSessionManager surface. It is marked once before the backend loop and cleared when the task is marked terminal, so a slow init or backend failover cleanup cannot let the sweep reclaim a still-live turn. * fix(tasks): preserve cron operator JSON diagnostic reason Split the merged runtime_not_authoritative reason back into the existing cron_runtime_not_authoritative (shipped, consumed by openclaw tasks maintenance --json operator scripts) and a new acp_runtime_not_authoritative for the ACP branch. Strengthen the cron non-authoritative test to lock the reason string contract. * fix(tasks): clear ACP turn liveness on retry failures --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary
openclaw update-time kill) mid-ACP-turn, the gateway can refuse to restart indefinitely. The reported symptom is a managed gateway "blocked from restart/update" with a stuck ACP run, surfacing after long uptimes (the reporter saw it at ~27 days).status=runningwith noendedAt(getInspectableActiveTaskRestartBlockers). Reconciliation reclaims such records once their backing session is gone, viahasBackingSession, which gates every runtime through a real liveness probe — cron throughisCronJobActive, cli through an active run context — except ACP, which returnedBoolean(acpEntry.entry). A persistent ACP session keeps a durable session-store entry for its entire lifetime, and that entry survives a gateway crash. So a crashed ACP turn leaves arunningrecord thathasBackingSessiontreats as "backed" forever: it is never reclaimed, never times out of the restart-blocker set, and wedges restart until the operator clears state manually.AcpSessionManageralready tracks an authoritative per-session live-turn map (set when a prompt turn starts, deleted in the turn'sfinally; it is even used internally as the idle-eviction guard). This adds a small public probeAcpSessionManager.hasActiveTurn(sessionKey)and routes the ACP branch ofhasBackingSessionthrough it. The per-turn invariant makes this exact: ACP task records are created when a prompt turn begins and marked terminal inside the same turn, before the live-turn entry is deleted — sostatus=runningwith no live turn is only a crash mid-turn, or the brief create→register startup window that the unchanged 5-minute reconcile grace already covers.openclaw tasks maintenanceCLI runs the samerunTaskRegistryMaintenancein a separate process with an empty live-turn map, so a naive liveness probe there would read every genuinely-live gateway-owned ACP turn as dead and mark itlost. This reuses the existing cron pattern: the cron branch already gates its process-localisCronJobActiveprobe behind an authoritative-runtime flag set true only by the gateway at startup. The flag/function are generalized from cron-only (isCronRuntimeAuthoritative→isRuntimeAuthoritative,cronRuntimeAuthoritative→runtimeAuthoritative) and the same gate is applied to the ACP branch: a non-authoritative process stays conservative and never reclaims.--force-restartconfig knobs.src/acp/control-plane/manager.core.ts— add publichasActiveTurn(sessionKey)reading the existing in-process live-turn map.src/tasks/task-registry.maintenance.ts— addhasActiveAcpTurnto the injected maintenance runtime (default delegates togetAcpSessionManager().hasActiveTurn); the ACP branch ofhasBackingSessionnow returns it behind an authoritative-process gate. Generalize the cron-only authoritative flag/function to cover both process-local probes (cron job map + ACP turn map); the diagnostics retain-reason becomesruntime_not_authoritative.src/gateway/server-startup-early.ts— the gateway's singleconfigureTaskRegistryMaintenancecall setsruntimeAuthoritative: true(renamed fromcronRuntimeAuthoritative), marking the gateway the authoritative owner of both process-local probes.src/tasks/task-registry.maintenance.issue-60299.test.ts— real-component regression tests (see below).src/tasks/task-registry.test.ts— adopt the generalized flag name in the maintenance-runtime test helper and model the gateway as authoritative in the two real-runtime orphan-reconcile tests.docs/reference/configedits).extensions/api.ts/runtime-api.ts, registry, or loader edits).shouldCloseTerminalAcpSession) still reads the session-store entry; that path is untouched.Reproduction
runningtask record exists for that turn, and the durable session-store entry exists.openclaw update) mid-turn. The in-process live-turn registry is gone, but the persisted session-store entry survives on disk.runningtask record has noendedAt;hasBackingSessionsees the surviving entry (Boolean(acpEntry.entry) === true) and treats it as backed forever. The record never reconciles tolost, stays ingetInspectableActiveTaskRestartBlockers(), and blocks restart/update until the operator clears state by hand.hasBackingSessionconsultshasActiveAcpTurn, which is false after a crash. Once the existing 5-minute reconcile grace expires, the record is markedlost, drops out of the restart-blocker set, and restart proceeds. A turn that is genuinely mid-flight keepshasActiveTurntrue and is never reclaimed. A non-authoritativeopenclaw tasks maintenanceCLI process never reclaims any ACP record.Real behavior proof
Behavior addressed (#88205): a grace-expired
status=runningACP task whose session-store entry survives a crash but whose prompt turn is no longer live is reclaimed (lost) and leaves the restart-blocker set; arunningACP task with a live in-flight turn is never reclaimed; and a non-authoritative maintenance process (empty live-turn map) never reclaims a gateway-owned ACP task.Real environment tested (Linux, Node 22, real-component vitest harness driving production
runTaskRegistryMaintenance+previewTaskRegistryMaintenance+getInspectableActiveTaskRestartBlockers+hasBackingSession): the harness builds realTaskRecords and exercises the actual reconcile / restart-blocker code paths; only the leaf liveness probes (hasActiveAcpTurn,readAcpSessionEntry,isCronJobActive,getAgentRunContext) and the authoritative flag are injected through the sametaskRegistryMaintenanceRuntimeseam production uses to wiregetAcpSessionManager().hasActiveTurnand the gateway'sruntimeAuthoritative: true.Exact steps or command run after this patch:
node scripts/run-vitest.mjs src/tasks/task-registry.maintenance.issue-60299.test.ts --reporter=verbose(AFTER fix)hasBackingSessionto the shipped entry-existence check (return Boolean(acpEntry.entry)), rerun the same file (BEFORE — liveness gate), then restorenode scripts/run-vitest.mjs src/tasks/task-registry.test.tsnode scripts/run-vitest.mjs src/acp/control-plane/manager.test.tspnpm exec oxfmt --check --threads=1 src/acp/control-plane/manager.core.ts src/tasks/task-registry.maintenance.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/gateway/server-startup-early.tsEvidence after fix (verbatim vitest BEFORE/AFTER on the regression tests):
Companion suites green after fix:
src/tasks/task-registry.test.ts72/72,src/acp/control-plane/manager.test.ts86/86;oxfmt --checkclean on all five touched files.Observed result after fix:
reconciled: 1, statuslost) andgetInspectableActiveTaskRestartBlockers()becomes empty → restart unblocked.runningACP task withhasActiveAcpTurn === trueis never reclaimed (reconciled: 0, staysrunning, still a blocker).runtimeAuthoritative: false, empty live-turn map) retains the gateway-ownedrunningtask (reconciled: 0, diagnostics reasonruntime_not_authoritative, still a blocker) — no falselost.expected +0 to be 1); removing only the authoritative gate flips the non-authoritative test red (expected 1 to be +0) — each gate is independently pinned.What was not tested:
openclaw gateway restart/openclaw update, and a realopenclaw tasks maintenance --applyrunning while the gateway holds a live turn. The deterministic harness reproduces the necessary conditions (running record + surviving entry + no live turn; authoritative vs non-authoritative process) without a real backend or 27-day uptime.pnpm check/ fullpnpm testbroad gates — deferred to CI on this memory-constrained box.Regression tests:
reclaims a stale running ACP task with no live turn even though its session-store entry survives(ACP zombie runs block gateway restart/update after 27 days #88205) — fails on the shipped entry-existence behavior, passes with the live-turn gate.keeps a running ACP task live while a prompt turn is still in flight— passes before and after, guarding against the inverse regression (never over-reclaim a healthy in-flight turn).does not reclaim a running ACP task from a non-authoritative process even with an empty live-turn map— fails if the authoritative-process gate is dropped, guarding the standalone-CLI false-lostcase.Risk / Mitigation
runningACP record always coincides with a live turn during normal operation, and there is norunningrecord between turns of a persistent session — plushasActiveTurnstaying true for any in-flight turn (including a hung-but-alive one) means only the crashed-run set is affected.openclaw tasks maintenanceCLI) observing an empty live-turn map and falsely marking live gateway-owned turnslost. Mitigation: the authoritative-process gate — process-local probes (cron job map + ACP turn map) only reclaim in the process that owns the map (the gateway, which setsruntimeAuthoritative: trueat startup); every other process stays conservative and returns "backed".Change Type (select all)
Scope (select all touched areas)
AcpSessionManagerlive-turn liveness)Linked Issue/PR
Fixes #88205