tasks: add detached task recovery hook before markLost#69313
Conversation
Greptile SummaryThis PR adds an optional Confidence Score: 5/5Safe to merge — no behavioural change when hook is absent, error path defaults to the existing sweep behaviour, and all 72 tests pass. All remaining findings are P2 (a documentation/comment gap on the preview/actual discrepancy, and a missing spy assertion in one test). Neither affects correctness or the production code path. No files require special attention.
|
| runtime: task.runtime, | ||
| task, | ||
| }); | ||
| expect(result).toEqual({ recovered: false }); | ||
| }); | ||
|
|
||
| it("returns not recovered and logs warning when hook throws", async () => { | ||
| const task = createFakeTaskRecord({ taskId: "task-throw", runtime: "acp" }); | ||
| setDetachedTaskLifecycleRuntime({ | ||
| ...getDetachedTaskLifecycleRuntime(), | ||
| onBeforeMarkLost: vi.fn(() => { | ||
| throw new Error("plugin crashed"); | ||
| }), | ||
| }); | ||
| const result = await onBeforeMarkLost({ | ||
| taskId: task.taskId, | ||
| runtime: task.runtime, |
There was a problem hiding this comment.
Warning log assertion missing from "hook throws" test
The test verifies the return value ({ recovered: false }) when the hook throws, but doesn't assert that a warning was actually logged. Given that the log warning is an explicit behavioural commitment (operators rely on it to diagnose misbehaving plugins), a spy on log.warn would make the contract explicit and guard against accidentally silencing the log in the future.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/tasks/detached-task-runtime.test.ts
Line: 183-199
Comment:
**Warning log assertion missing from "hook throws" test**
The test verifies the return value (`{ recovered: false }`) when the hook throws, but doesn't assert that a warning was actually logged. Given that the log warning is an explicit behavioural commitment (operators rely on it to diagnose misbehaving plugins), a spy on `log.warn` would make the contract explicit and guard against accidentally silencing the log in the future.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in e276b09. Added vi.hoisted mock for createSubsystemLogger and asserted mockLogWarn is called with the expected taskId, runtime, and message.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d1e1143e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| continue; | ||
| } | ||
| const next = markTaskLost(current, now); |
There was a problem hiding this comment.
Re-read task state before calling markTaskLost
runTaskRegistryMaintenance now awaits onBeforeMarkLost before reconciling stale tasks, but then calls markTaskLost(current, now) using the pre-await snapshot. If a task reaches a terminal state while the hook is running (for example, an async plugin check returns { recovered: false } after the task succeeded), this path overwrites that terminal state to lost. Re-fetching the task and re-running shouldMarkLost after the await is needed to avoid corrupting final outcomes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in e276b09. The sweep now re-reads the task and re-checks shouldMarkLost after the async hook returns, regardless of the hook result. If the task completed or was deleted during the hook call, the sweep skips it.
| runtime: current.runtime, | ||
| task: current, | ||
| }); | ||
| if (recovery.recovered) { |
There was a problem hiding this comment.
Guard against invalid onBeforeMarkLost return values
The maintenance loop assumes the hook always returns an object with a recovered boolean and dereferences it directly. A JS plugin or buggy runtime implementation that returns undefined/null causes a TypeError here, aborting the sweep instead of falling back to normal markTaskLost behavior. Since this seam is plugin-provided and intended to be fail-safe, normalize/validate the hook result before reading recovered.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in e276b09. The dispatch wrapper now validates the hook result: if (result && typeof result.recovered === "boolean"). Invalid returns (undefined, null, missing field) normalize to { recovered: false }.
|
All review findings addressed in e276b09: Codex P1 (re-read before markTaskLost): Sweep now re-reads the task and re-checks Codex P2 (guard invalid return): Dispatch wrapper validates hook result shape. Invalid returns normalize to Greptile P2 (log.warn assertion): Added Greptile P2 (preview discrepancy): Added comment on |
|
CI failures are in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2de191988a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const recovery = await tryRecoverTaskBeforeMarkLost({ | ||
| taskId: current.taskId, | ||
| runtime: current.runtime, | ||
| task: current, | ||
| now, | ||
| }); |
There was a problem hiding this comment.
Time out detached recovery hooks in maintenance sweep
runTaskRegistryMaintenance now awaits tryRecoverTaskBeforeMarkLost(...) inline for each stale task, but there is no timeout or cancellation path if a plugin hook never resolves. In that case the sweep promise never settles, startScheduledSweep never clears sweepInProgress, and periodic maintenance stops running entirely until process restart. Please race this await with a bounded timeout and fall back to normal markTaskLost behavior on timeout so one bad hook cannot wedge reconciliation.
Useful? React with 👍 / 👎.
2de1919 to
2c9e61f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c9e61fd58
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @@ -306,7 +312,29 @@ export async function runTaskRegistryMaintenance(): Promise<TaskRegistryMaintena | |||
| continue; | |||
| } | |||
| if (shouldMarkLost(current, now)) { | |||
| const next = markTaskLost(current, now); | |||
| const recovery = await tryRecoverTaskBeforeMarkLost({ | |||
There was a problem hiding this comment.
Bound recovery hook await with a timeout
Awaiting tryRecoverTaskBeforeMarkLost(...) here has no timeout, so a plugin hook that never resolves will keep runTaskRegistryMaintenance pending indefinitely; in that state startScheduledSweep never clears sweepInProgress, and all subsequent scheduled sweeps are skipped until restart. This turns one bad recovery hook into a permanent reconciliation outage for stale tasks, so the await should be raced with a bounded timeout and fall back to normal markTaskLost behavior on timeout.
Useful? React with 👍 / 👎.
2c9e61f to
3920bd1
Compare
Let a registered DetachedTaskLifecycleRuntime prevent the maintenance sweep
from marking a recoverable task as lost. When the optional onBeforeMarkLost
hook returns { recovered: true }, the sweep skips markTaskLost and increments
a new `recovered` counter in TaskRegistryMaintenanceSummary.
The hook receives the full TaskRecord and is wrapped in try/catch: if it
throws, the sweep logs a warning and proceeds with markTaskLost (safe
default). After the async hook returns, the sweep re-reads the task to
guard against concurrent completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add recovered: 0 to existing toEqual assertions in task-registry.test.ts that compare the full TaskRegistryMaintenanceSummary shape. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Re-read task after async hook returns false before calling markTaskLost,
guarding against concurrent completion during the hook (Codex P1)
- Validate hook return value: normalize undefined/null to { recovered: false }
instead of crashing the sweep (Codex P2)
- Add log.warn spy assertion to the "hook throws" test to verify the warning
is actually emitted, not just that the return value is correct (Greptile P2)
- Add comment on previewTaskRegistryMaintenance noting it cannot call the
async hook, so recovered tasks appear under reconciled in preview (Greptile P2)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3920bd1 to
24322af
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24322af4f7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
| if (shouldMarkLost(current, now)) { | ||
| const next = markTaskLost(current, now); | ||
| const recovery = await tryRecoverTaskBeforeMarkLost({ |
There was a problem hiding this comment.
Bound recovery hook await with timeout
Awaiting tryRecoverTaskBeforeMarkLost(...) without any timeout lets a misbehaving plugin hook stall this sweep forever (for example, if it returns a never-resolving Promise). In that state runTaskRegistryMaintenance() never settles, startScheduledSweep() never clears sweepInProgress, and all subsequent scheduled maintenance runs are skipped until restart, so stale tasks stop being reconciled entirely.
Useful? React with 👍 / 👎.
|
Maintainer note for the follow-up plugin work: The landed seam to build against is What the plugin should do:
Practical guidance:
So the intended plugin shape is roughly: const coreRuntime = getDetachedTaskLifecycleRuntime();
registerDetachedTaskRuntime("your-plugin", {
...coreRuntime,
async tryRecoverTaskBeforeMarkLost({ taskId, task, now }) {
const job = await queue.findRecoverableJob(taskId, { now });
if (!job) return { recovered: false };
await queue.recover(job);
return { recovered: true };
},
});No further core seam should be needed for this specific recovery path. |
…l run Upstream 7d1575b (openclaw#60310) introduced the activeJobIds singleton plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts:124-128). That patch wired the pair into runDueJob (timer.ts:746/586) and executeJob (timer.ts:1344/1374) but left the remaining two execution paths uninstrumented: * runStartupCatchupCandidate (timer.ts:1043-1081) * prepareManualRun / finishPreparedManualRun (ops.ts:548-686) For runs taken on those paths, the cron branch of hasBackingSession sees isCronJobActive=false and, once TASK_RECONCILE_GRACE_MS (5 min, task-registry.maintenance.ts:28) elapses, marks the task 'lost' while the cron service is still executing it. With DEFAULT_JOB_TIMEOUT_MS=10 min (cron/service/timeout-policy.ts:8) and no recordTaskRunProgressByRunId emissions on isolated agentTurn runs, lastEventAt is pinned to startedAt so the grace is exceeded in practice. This PR mirrors the existing mark/clear contract on the two missing paths inside try/finally, completing openclaw#60310's intent. No behavioural change to runDueJob / executeJob. Related openclaw#68157 (partially addresses the task-registry misclassification aspect; the runningAtMs persistence aspect described in that issue is a separate state machine not touched by this PR). Architecture note: PR openclaw#69313 introduced tryRecoverTaskBeforeMarkLost hook infrastructure but cron does not register it, and registering the hook alone would not close this gap (the recover callback would still need an alive-signal source — i.e. activeJobIds). This PR completes the existing contract; registering the hook for cross-runtime parity is a natural follow-up if maintainers prefer that direction. [AI-assisted]
Context
I'm building a plugin that wraps subagent execution in a durable job queue: crash recovery, retry with backoff, timeout enforcement. The
DetachedTaskLifecycleRuntimeseam (#68886) and plugin registration contract (#68915) give me most of what I need. The one remaining gap is stale-task recovery after a gateway restart: the maintenance sweep can find running tasks whose backing sessions are gone and mark themlostbefore a durable executor has a chance to re-spawn them.This PR adds one small recovery seam so a registered detached runtime can say "I can recover this task" before core marks it lost.
What this PR does
Adds an optional
tryRecoverTaskBeforeMarkLost?hook toDetachedTaskLifecycleRuntimeinsrc/tasks/detached-task-runtime-contract.ts.When
runTaskRegistryMaintenance()is about to mark a stale running task aslost, it now:tryRecoverTaskBeforeMarkLost({ taskId, runtime, task, now }){ recovered: true }, skipsmarkTaskLost()and increments arecoveredcounter{ recovered: false }, proceeds normallyAfter the async hook returns, the sweep re-reads the task and re-checks
shouldMarkLost(...)before marking it lost, so concurrent completion or recovery wins.Why optional
cancelDetachedTaskRunByIdis required because every detached executor needs cancel. Recovery beforemarkLostis advisory and only matters for executors with durable recovery state. Keeping it optional preserves current behavior for existing runtimes and test doubles.Scope
recoveredthroughTaskRegistryMaintenanceSummaryand CLI outputreconciledTest plan
Validated on
mb-serveragainst head2de191988a5f0f3065b4ccb48a4ffff97a67ae41:OPENCLAW_TEST_PROFILE=serial OPENCLAW_TEST_SERIAL_GATEWAY=1 pnpm test -- src/tasks/detached-task-runtime.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts src/tasks/task-registry.test.ts src/tasks/task-executor.test.tspnpm tsgo:corepnpm tsgo:core:testNODE_OPTIONS=--max-old-space-size=4096 pnpm plugin-sdk:api:checkNotes
Preview/operator inspection remains intentionally synchronous. It cannot call the async recovery hook, so recoverable stale tasks still count under
reconciledin preview until the real sweep runs and either recovers or marks them lost.