tasks: add detached task recovery hook before markLost by garrytan · Pull Request #69313 · openclaw/openclaw

garrytan · 2026-04-20T11:29:58Z

Context

I'm building a plugin that wraps subagent execution in a durable job queue: crash recovery, retry with backoff, timeout enforcement. The DetachedTaskLifecycleRuntime seam (#68886) and plugin registration contract (#68915) give me most of what I need. The one remaining gap is stale-task recovery after a gateway restart: the maintenance sweep can find running tasks whose backing sessions are gone and mark them lost before a durable executor has a chance to re-spawn them.

This PR adds one small recovery seam so a registered detached runtime can say "I can recover this task" before core marks it lost.

What this PR does

Adds an optional tryRecoverTaskBeforeMarkLost? hook to DetachedTaskLifecycleRuntime in src/tasks/detached-task-runtime-contract.ts.

When runTaskRegistryMaintenance() is about to mark a stale running task as lost, it now:

- calls tryRecoverTaskBeforeMarkLost({ taskId, runtime, task, now })
- if the hook returns { recovered: true }, skips markTaskLost() and increments a recovered counter
- if the hook returns { recovered: false }, proceeds normally
- if the hook is absent, preserves existing behavior
- if the hook throws or returns an invalid shape, logs a warning and proceeds normally
- if the hook is slow, logs a warning so maintenance stalls are visible

After the async hook returns, the sweep re-reads the task and re-checks shouldMarkLost(...) before marking it lost, so concurrent completion or recovery wins.
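
The dispatch rules listed above can be sketched as a small wrapper. This is a minimal illustration under assumptions, not the code in this PR: the type shapes, the `dispatchRecoveryHook` name, the injected `warn` and `clock` parameters, and the 5-second slow-hook threshold are all hypothetical.

```typescript
// Hypothetical minimal types; the real contract lives in
// src/tasks/detached-task-runtime-contract.ts and may differ.
type TaskRecord = { taskId: string; runtime: string; status: string };
type RecoveryParams = { taskId: string; runtime: string; task: TaskRecord; now: number };
type RecoveryResult = { recovered: boolean };
type RecoveryHook = (params: RecoveryParams) => unknown;

const SLOW_HOOK_WARN_MS = 5_000; // assumed threshold, not taken from the PR

async function dispatchRecoveryHook(
  hook: RecoveryHook | undefined,
  params: RecoveryParams,
  warn: (message: string) => void,
  clock: () => number = Date.now,
): Promise<RecoveryResult> {
  // No hook registered: keep the existing mark-lost behavior.
  if (!hook) return { recovered: false };
  const startedAt = clock();
  try {
    const result = (await hook(params)) as { recovered?: unknown } | null | undefined;
    const elapsedMs = clock() - startedAt;
    if (elapsedMs > SLOW_HOOK_WARN_MS) {
      warn(`recovery hook for ${params.taskId} took ${elapsedMs}ms`);
    }
    // Only a boolean `recovered` field counts as a valid answer.
    if (result && typeof result.recovered === "boolean") {
      return { recovered: result.recovered };
    }
    warn(`recovery hook for ${params.taskId} returned an invalid result`);
    return { recovered: false };
  } catch (error) {
    // A throwing hook must never abort the sweep.
    warn(`recovery hook for ${params.taskId} threw: ${String(error)}`);
    return { recovered: false };
  }
}
```

Every failure path collapses to `{ recovered: false }`, so a misbehaving plugin can at worst leave the sweep on its pre-existing path.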

Why optional

cancelDetachedTaskRunById is required because every detached executor needs cancel. Recovery before markLost is advisory and only matters for executors with durable recovery state. Keeping it optional preserves current behavior for existing runtimes and test doubles.

Scope

- adds one optional recovery hook to the detached runtime contract
- adds one dispatch wrapper with invalid-return, throw, and slow-hook logging
- wires the maintenance sweep through that seam
- threads recovered through TaskRegistryMaintenanceSummary and CLI output
- updates the public Plugin SDK baseline hash because this is a real exported surface change
- adds regression coverage for:
  - recovered / not recovered / no hook
  - throw fallback
  - invalid return fallback
  - slow-hook warning
  - stale task recovered in real maintenance while preview still reports it under reconciled

Test plan

Validated on mb-server against head 2de191988a5f0f3065b4ccb48a4ffff97a67ae41:

OPENCLAW_TEST_PROFILE=serial OPENCLAW_TEST_SERIAL_GATEWAY=1 pnpm test -- src/tasks/detached-task-runtime.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/tasks/task-executor.test.ts
pnpm tsgo:core
pnpm tsgo:core:test
NODE_OPTIONS=--max-old-space-size=4096 pnpm plugin-sdk:api:check

Notes

Preview/operator inspection remains intentionally synchronous. It cannot call the async recovery hook, so recoverable stale tasks still count under reconciled in preview until the real sweep runs and either recovers or marks them lost.

greptile-apps · 2026-04-20T11:31:51Z

Greptile Summary

This PR adds an optional onBeforeMarkLost recovery hook to DetachedTaskLifecycleRuntime, allowing plugin-registered runtimes with durable backing stores (e.g. a job queue) to intercept the maintenance sweep before a stale task is marked lost. The implementation is well-scoped: the hook is optional, errors are caught and logged, a re-read guards against concurrent completion, and existing behaviour is unchanged when no hook is registered. The new recovered counter threads through the summary type, the CLI output, and the relevant tests cleanly.

Confidence Score: 5/5

Safe to merge — no behavioural change when hook is absent, error path defaults to the existing sweep behaviour, and all 72 tests pass.

All remaining findings are P2 (a documentation/comment gap on the preview/actual discrepancy, and a missing spy assertion in one test). Neither affects correctness or the production code path.

No files require special attention.

Comments Outside Diff (1)

src/tasks/task-registry.maintenance.ts, line 255-274 (link)

Preview over-counts reconciled when hook would recover tasks

previewTaskRegistryMaintenance is synchronous, so it cannot call the async onBeforeMarkLost hook. As a result, tasks that would be recovered by the hook are counted as reconciled: N in the preview but as reconciled: N-k, recovered: k when maintenance actually runs. An operator using the "preview" mode (--dry-run / no --apply) to decide whether to apply will see an inflated reconciled count, which may cause confusion.

A comment above the previewTaskRegistryMaintenance function noting this limitation would be enough to make the discrepancy explicit, since the fix (making preview async) would be a larger change.
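
The count discrepancy described in this comment can be made concrete with a small arithmetic sketch. The `MaintenanceCounts` type and both function names are hypothetical, and the real summary type carries more fields; the point is only the N versus N-k relationship.

```typescript
// Hypothetical reduced summary: only the two counters discussed here.
type MaintenanceCounts = { reconciled: number; recovered: number };

// Synchronous preview cannot ask the async hook, so every stale task
// is counted as reconciled.
function previewCounts(staleTaskIds: readonly string[]): MaintenanceCounts {
  return { reconciled: staleTaskIds.length, recovered: 0 };
}

// Outcome of the real sweep, given the set of task ids the hook recovers.
function sweepCounts(
  staleTaskIds: readonly string[],
  recoveredIds: ReadonlySet<string>,
): MaintenanceCounts {
  let recovered = 0;
  for (const id of staleTaskIds) {
    if (recoveredIds.has(id)) recovered += 1;
  }
  return { reconciled: staleTaskIds.length - recovered, recovered };
}

const stale = ["a", "b", "c", "d", "e"];
console.log(previewCounts(stale)); // { reconciled: 5, recovered: 0 }
console.log(sweepCounts(stale, new Set(["b", "d"]))); // { reconciled: 3, recovered: 2 }
```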

Reviews (1): Last reviewed commit: "test(tasks): align maintenance summary a..."

greptile-apps · 2026-04-20T11:31:59Z

+        runtime: task.runtime,
+        task,
+      });
+      expect(result).toEqual({ recovered: false });
+    });
+
+    it("returns not recovered and logs warning when hook throws", async () => {
+      const task = createFakeTaskRecord({ taskId: "task-throw", runtime: "acp" });
+      setDetachedTaskLifecycleRuntime({
+        ...getDetachedTaskLifecycleRuntime(),
+        onBeforeMarkLost: vi.fn(() => {
+          throw new Error("plugin crashed");
+        }),
+      });
+      const result = await onBeforeMarkLost({
+        taskId: task.taskId,
+        runtime: task.runtime,


Warning log assertion missing from "hook throws" test

The test verifies the return value ({ recovered: false }) when the hook throws, but doesn't assert that a warning was actually logged. Given that the log warning is an explicit behavioural commitment (operators rely on it to diagnose misbehaving plugins), a spy on log.warn would make the contract explicit and guard against accidentally silencing the log in the future.

Fixed in e276b09. Added vi.hoisted mock for createSubsystemLogger and asserted mockLogWarn is called with the expected taskId, runtime, and message.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d1e1143e2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-20T11:33:59Z

+        }
+        continue;
+      }
      const next = markTaskLost(current, now);


Re-read task state before calling markTaskLost

runTaskRegistryMaintenance now awaits onBeforeMarkLost before reconciling stale tasks, but then calls markTaskLost(current, now) using the pre-await snapshot. If a task reaches a terminal state while the hook is running (for example, an async plugin check returns { recovered: false } after the task succeeded), this path overwrites that terminal state to lost. Re-fetching the task and re-running shouldMarkLost after the await is needed to avoid corrupting final outcomes.

Fixed in e276b09. The sweep now re-reads the task and re-checks shouldMarkLost after the async hook returns, regardless of the hook result. If the task completed or was deleted during the hook call, the sweep skips it.
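
A minimal sketch of that re-read pattern follows. It is not the code from e276b09: the in-memory `Map` store, the reduced `TaskRecord` shape, the staleness rule inside `shouldMarkLost`, and the `reconcileStaleTask` name are assumptions made for illustration.

```typescript
// Hypothetical reduced task record and store.
type TaskRecord = {
  taskId: string;
  status: "running" | "succeeded" | "lost";
  lastSeenAt: number;
};
type Store = Map<string, TaskRecord>;

const STALE_AFTER_MS = 60_000; // assumed staleness window

function shouldMarkLost(task: TaskRecord, now: number): boolean {
  return task.status === "running" && now - task.lastSeenAt > STALE_AFTER_MS;
}

async function reconcileStaleTask(
  store: Store,
  taskId: string,
  now: number,
  tryRecover: (task: TaskRecord) => Promise<{ recovered: boolean }>,
): Promise<"skipped" | "recovered" | "lost"> {
  const current = store.get(taskId);
  if (!current || !shouldMarkLost(current, now)) return "skipped";

  const recovery = await tryRecover(current);

  // The snapshot taken before the await may be stale by now, so re-read
  // and re-check regardless of what the hook answered.
  const fresh = store.get(taskId);
  if (!fresh || !shouldMarkLost(fresh, now)) return "skipped";

  if (recovery.recovered) return "recovered";
  store.set(taskId, { ...fresh, status: "lost" });
  return "lost";
}
```

If the task reaches a terminal state or is deleted while the hook is pending, the second `shouldMarkLost` check fails and the terminal state is left untouched.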

chatgpt-codex-connector · 2026-04-20T11:33:59Z

+        runtime: current.runtime,
+        task: current,
+      });
+      if (recovery.recovered) {


Guard against invalid onBeforeMarkLost return values

The maintenance loop assumes the hook always returns an object with a recovered boolean and dereferences it directly. A JS plugin or buggy runtime implementation that returns undefined/null causes a TypeError here, aborting the sweep instead of falling back to normal markTaskLost behavior. Since this seam is plugin-provided and intended to be fail-safe, normalize/validate the hook result before reading recovered.

Fixed in e276b09. The dispatch wrapper now validates the hook result: if (result && typeof result.recovered === "boolean"). Invalid returns (undefined, null, missing field) normalize to { recovered: false }.

garrytan · 2026-04-20T11:37:50Z

All review findings addressed in e276b09:

Codex P1 (re-read before markTaskLost): Sweep now re-reads the task and re-checks shouldMarkLost after the async hook returns, guarding against concurrent completion during the hook.

Codex P2 (guard invalid return): Dispatch wrapper validates hook result shape. Invalid returns normalize to { recovered: false }.

Greptile P2 (log.warn assertion): Added mockLogWarn spy via vi.hoisted + vi.mock and asserted the warning is emitted with correct taskId and runtime.

Greptile P2 (preview discrepancy): Added comment on previewTaskRegistryMaintenance documenting that it cannot call the async hook, so recovered tasks show under reconciled in preview mode.

garrytan · 2026-04-20T12:03:21Z

CI failures are in src/agents/subagent-registry.announce-loop-guard.test.ts (timing-dependent mock assertion: "expected to be called 1 times, but got 0 times"). This test has zero imports from any file touched by this PR. All core test shards (core-src, core-runtime, core-fast, core-security, core-support, core-ui) passed. Could a maintainer re-run the failed jobs?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2de191988a

chatgpt-codex-connector · 2026-04-20T22:45:30Z

+      const recovery = await tryRecoverTaskBeforeMarkLost({
+        taskId: current.taskId,
+        runtime: current.runtime,
+        task: current,
+        now,
+      });


Time out detached recovery hooks in maintenance sweep

runTaskRegistryMaintenance now awaits tryRecoverTaskBeforeMarkLost(...) inline for each stale task, but there is no timeout or cancellation path if a plugin hook never resolves. In that case the sweep promise never settles, startScheduledSweep never clears sweepInProgress, and periodic maintenance stops running entirely until process restart. Please race this await with a bounded timeout and fall back to normal markTaskLost behavior on timeout so one bad hook cannot wedge reconciliation.
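
The bounded-timeout race suggested here could look roughly like the sketch below. It illustrates the reviewer's proposal only; per the maintainer note later in this thread, core warns on slow hooks but does not hard-timeout them. The `withRecoveryTimeout` name and its parameters are hypothetical.

```typescript
type RecoveryResult = { recovered: boolean };

// Race a pending hook call against a timer. On timeout, report
// "not recovered" so the caller can fall back to marking the task lost.
async function withRecoveryTimeout(
  hookCall: Promise<RecoveryResult>,
  timeoutMs: number,
  onTimeout: () => void,
): Promise<RecoveryResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<RecoveryResult>((resolve) => {
    timer = setTimeout(() => {
      onTimeout();
      resolve({ recovered: false });
    }, timeoutMs);
  });
  try {
    return await Promise.race([hookCall, timeout]);
  } finally {
    // Clear the timer so a fast hook does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

One trade-off of this approach: the losing promise is not cancelled, so a hook that finishes after the deadline may still have requeued a task that core has already marked lost.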

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c9e61fd58

chatgpt-codex-connector · 2026-04-20T22:51:50Z

@@ -306,7 +312,29 @@ export async function runTaskRegistryMaintenance(): Promise<TaskRegistryMaintena
      continue;
    }
    if (shouldMarkLost(current, now)) {
-      const next = markTaskLost(current, now);
+      const recovery = await tryRecoverTaskBeforeMarkLost({


Bound recovery hook await with a timeout

Awaiting tryRecoverTaskBeforeMarkLost(...) here has no timeout, so a plugin hook that never resolves will keep runTaskRegistryMaintenance pending indefinitely; in that state startScheduledSweep never clears sweepInProgress, and all subsequent scheduled sweeps are skipped until restart. This turns one bad recovery hook into a permanent reconciliation outage for stale tasks, so the await should be raced with a bounded timeout and fall back to normal markTaskLost behavior on timeout.

Let a registered DetachedTaskLifecycleRuntime prevent the maintenance sweep from marking a recoverable task as lost. When the optional onBeforeMarkLost hook returns { recovered: true }, the sweep skips markTaskLost and increments a new `recovered` counter in TaskRegistryMaintenanceSummary. The hook receives the full TaskRecord and is wrapped in try/catch: if it throws, the sweep logs a warning and proceeds with markTaskLost (safe default). After the async hook returns, the sweep re-reads the task to guard against concurrent completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add recovered: 0 to existing toEqual assertions in task-registry.test.ts that compare the full TaskRegistryMaintenanceSummary shape.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Re-read task after async hook returns false before calling markTaskLost, guarding against concurrent completion during the hook (Codex P1)
- Validate hook return value: normalize undefined/null to { recovered: false } instead of crashing the sweep (Codex P2)
- Add log.warn spy assertion to the "hook throws" test to verify the warning is actually emitted, not just that the return value is correct (Greptile P2)
- Add comment on previewTaskRegistryMaintenance noting it cannot call the async hook, so recovered tasks appear under reconciled in preview (Greptile P2)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24322af4f7

chatgpt-codex-connector · 2026-04-20T23:01:31Z

    }
    if (shouldMarkLost(current, now)) {
-      const next = markTaskLost(current, now);
+      const recovery = await tryRecoverTaskBeforeMarkLost({


Bound recovery hook await with timeout

Awaiting tryRecoverTaskBeforeMarkLost(...) without any timeout lets a misbehaving plugin hook stall this sweep forever (for example, if it returns a never-resolving Promise). In that state runTaskRegistryMaintenance() never settles, startScheduledSweep() never clears sweepInProgress, and all subsequent scheduled maintenance runs are skipped until restart, so stale tasks stop being reconciled entirely.

mbelinky · 2026-04-20T23:06:18Z

Maintainer note for the follow-up plugin work:

The landed seam to build against is tryRecoverTaskBeforeMarkLost on DetachedTaskLifecycleRuntime.

What the plugin should do:

- register its detached runtime as usual
- spread the current core runtime and implement tryRecoverTaskBeforeMarkLost({ taskId, runtime, task, now })
- return { recovered: true } only when the durable executor has definitely claimed/requeued/recovered that task
- return { recovered: false } for anything unknown, not owned, or not recoverable yet

Practical guidance:

- keep the hook idempotent; maintenance may ask again after restart windows
- keep it fast where possible; core now warns on slow hooks, but does not hard-timeout them
- use now if your lease/retry logic needs a deterministic sweep timestamp instead of calling Date.now() again
- preview/operator inspection is still synchronous, so recovery only takes effect in the real sweep, not in preview mode

So the intended plugin shape is roughly:

const coreRuntime = getDetachedTaskLifecycleRuntime();
registerDetachedTaskRuntime("your-plugin", {
  ...coreRuntime,
  async tryRecoverTaskBeforeMarkLost({ taskId, task, now }) {
    const job = await queue.findRecoverableJob(taskId, { now });
    if (!job) return { recovered: false };

    await queue.recover(job);
    return { recovered: true };
  },
});

No further core seam should be needed for this specific recovery path.
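
For the idempotency and `now` guidance above, one possible plugin-side shape is sketched below. Everything in it is hypothetical: the `Job` record, the `InMemoryRecoveryQueue` class (a stand-in for a real durable store), and `createRecoveryHook`. Only the hook's parameter names and return shape come from this thread.

```typescript
// Hypothetical job record kept by a durable executor; not part of core.
type Job = {
  taskId: string;
  state: "leased" | "queued";
  leaseExpiresAt: number;
  attempts: number;
};

class InMemoryRecoveryQueue {
  private readonly jobs = new Map<string, Job>();

  add(job: Job): void {
    this.jobs.set(job.taskId, { ...job });
  }

  get(taskId: string): Job | undefined {
    const job = this.jobs.get(taskId);
    return job ? { ...job } : undefined;
  }

  // Requeue the job if this queue owns it and its lease has expired at `now`.
  // Returns true when the job is, or already was, requeued.
  recoverIfExpired(taskId: string, now: number): boolean {
    const job = this.jobs.get(taskId);
    if (!job) return false; // unknown or not owned
    if (job.state === "queued") return true; // already recovered: idempotent
    if (job.leaseExpiresAt > now) return false; // lease still live: not recoverable yet
    job.state = "queued";
    job.attempts += 1;
    return true;
  }
}

// Build a hook that uses the sweep's `now` rather than calling Date.now().
function createRecoveryHook(queue: InMemoryRecoveryQueue) {
  return async (
    { taskId, now }: { taskId: string; now: number },
  ): Promise<{ recovered: boolean }> => ({
    recovered: queue.recoverIfExpired(taskId, now),
  });
}
```

Because an already-requeued job short-circuits to `true`, a repeated sweep after a restart window gets the same answer without bumping `attempts` a second time.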

@mbelinky