fix(cron): mirror active-jobs mark/clear on startup catchup and manual run#71040
fix(cron): mirror active-jobs mark/clear on startup catchup and manual run#71040Feelw00 wants to merge 1 commit into
Conversation
Greptile SummaryThis PR completes the Confidence Score: 5/5Safe to merge — the change is a minimal, well-scoped instrumentation fix with solid regression test coverage and no side-effects on existing paths. All three changed files are clean. The mark/clear placement in No files require special attention. Reviews (1): Last reviewed commit: "fix(cron): mirror active-jobs mark/clear..." | Re-trigger Greptile |
…l run Upstream 7d1575b (openclaw#60310) introduced the activeJobIds singleton plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts:124-128). That patch wired the pair into runDueJob (timer.ts:746/586) and executeJob (timer.ts:1344/1374) but left the remaining two execution paths uninstrumented: * runStartupCatchupCandidate (timer.ts:1043-1081) * prepareManualRun / finishPreparedManualRun (ops.ts:548-686) For runs taken on those paths, the cron branch of hasBackingSession sees isCronJobActive=false and, once TASK_RECONCILE_GRACE_MS (5 min, task-registry.maintenance.ts:28) elapses, marks the task 'lost' while the cron service is still executing it. With DEFAULT_JOB_TIMEOUT_MS=10 min (cron/service/timeout-policy.ts:8) and no recordTaskRunProgressByRunId emissions on isolated agentTurn runs, lastEventAt is pinned to startedAt so the grace is exceeded in practice. This PR mirrors the existing mark/clear contract on the two missing paths inside try/finally, completing openclaw#60310's intent. No behavioural change to runDueJob / executeJob. Related openclaw#68157 (partially addresses the task-registry misclassification aspect; the runningAtMs persistence aspect described in that issue is a separate state machine not touched by this PR). Architecture note: PR openclaw#69313 introduced tryRecoverTaskBeforeMarkLost hook infrastructure but cron does not register it, and registering the hook alone would not close this gap (the recover callback would still need an alive-signal source — i.e. activeJobIds). This PR completes the existing contract; registering the hook for cross-runtime parity is a natural follow-up if maintainers prefer that direction. [AI-assisted]
c2cf007 to
630df26
Compare
|
Codex review: needs changes before merge. Summary Reproducibility: yes. Current main gives a high-confidence static reproduction: startup catch-up and manual cron task rows can remain active past the 5-minute reconciliation grace while Next step before merge Security Review findings
Review detailsBest possible solution: Land a rebased narrow fix that preserves the manual-run mark/clear, keeps startup catch-up jobs active until their task rows are finalized or finalizes each candidate before starting the next, and adds focused regression plus changelog coverage. Do we have a high-confidence way to reproduce the issue? Yes. Current main gives a high-confidence static reproduction: startup catch-up and manual cron task rows can remain active past the 5-minute reconciliation grace while Is this the best way to solve the issue? No, not as written. Mirroring active-job bookkeeping is the right fix direction, but startup catch-up must not clear the marker inside Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against f7549079ceb3. |
|
Closing this PR — The narrow remaining gap is UX-only: during the 5-minute Thanks for the review window — closing without merge. |
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
…lost marker Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear. Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min) — common for force-mode `agentTurn` runs which can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min) — task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes. This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob: * prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun. * finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId). Scope intentionally narrower than openclaw#71040 (closed): * No change to runStartupCatchupCandidate — the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review. * No change to active-jobs.ts API. Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") — success and inner-throw — assert isCronJobActive transitions true→false around the manual run. Fixes openclaw#78233. [AI-assisted, fully tested]
Summary
activeJobIdsmark/clear only intorunDueJobandexecuteJob, leavingrunStartupCatchupCandidateandprepareManualRun/finishPreparedManualRununinstrumented.task-registry.maintenance.ts:124-128's cron branch depends solely onisCronJobActive, so runs on those two paths are misclassified aslostafterTASK_RECONCILE_GRACE_MS(5 min) while the cron service is still executing them.DEFAULT_JOB_TIMEOUT_MSis 10 min and cron isolatedagentTurnemits norecordTaskRunProgressByRunIdcalls, solastEventAtis pinned tostartedAtand the 5-min grace is routinely exceeded. Affects startup catch-up on every gateway restart with missed jobs (up toDEFAULT_MAX_MISSED_JOBS_PER_RESTART=5per restart) and everyopenclaw cron run <id>/ agent tool / UI manual run.try/finally. ~13 production lines + 1 import; 1 new test file with 2 regression tests.runDueJobandexecuteJobpaths are untouched; therunningAtMspersistence mechanism is not modified; cron does not registertryRecoverTaskBeforeMarkLost(deferred to a possible follow-up).Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
runningAtMspersistence aspect is a separate state machine and is not touched by this PR.lostcron task records after gateway restart due to transientactiveJobIdsbacking-session check #68191 — independent broader proposal by @hclsys on the same general area.tryRecoverTaskBeforeMarkLosthook infrastructure; this PR is complementary. See Risks and Mitigations below.Root Cause (if applicable)
activeJobIds+markCronJobActive/clearCronJobActiveand taughttask-registry.maintenance.hasBackingSessionto depend solely onisCronJobActiveforruntime='cron'(task-registry.maintenance.ts:124-128), but wired the mark/clear pair into only 2 of the 4 cron execution paths.activeJobIdsinvariant at the cron layer across all execution paths. The existingtask-registry.maintenance.issue-60299.test.tsstubsisCronJobActive=truedirectly on the consuming side, so a producing-side gap never surfaces there.agentTurnhas norecordTaskRunProgressByRunIdemission, solastEventAtnever advances paststartedAt— the 5-min grace compares against a frozen reference and expires deterministically while the underlying LLM round-trip is still running.Regression Test Plan (if applicable)
src/cron/active-jobs-symmetry.test.ts(new).runStartupCatchupCandidateand clears it afterwards.prepareManualRunand clears infinishPreparedManualRun'sfinally, including when the inner execution throws.CronService.start()/cron.run()entry points and uses the realactiveJobIdssingleton (viaresetCronActiveJobsForTests). The only mock isrunIsolatedAgentJob, which is a legitimate constructor-level dep and is replaced by a deferred promise so the test can observe the mid-flight invariant. NoisCronJobActivestubbing, so the test cannot pass on a phantom branch.task-registry.maintenance.issue-60299.test.ts:151-165covers the consuming side (isCronJobActive=true=> reconcile skipped) but never exercises the producing side across all four cron paths, which is where the partial-merge gap lives.User-visible / Behavior Changes
None. Internal contract completion only. Existing callers, return shapes, and events are unchanged; the only observable difference is that
isCronJobActive(jobId)now returnstrueduringrunStartupCatchupCandidateand manual-run execution (matching the behaviour that already exists forrunDueJob/executeJob).Diagram (if applicable)
```text
Before (runStartupCatchupCandidate / manual run paths):
ops.start (or cron.run)
-> runStartupCatchupCandidate (or prepareManualRun+finishPreparedManualRun)
-> executeJobCoreWithTimeout [... running, 6-10 min ...]
(activeJobIds never touched)
-> task-registry maintenance tick (every ~60s)
-> hasBackingSession(task) -> isCronJobActive(jobId) == false
-> lastEventAt === startedAt; grace expires at 5 min
-> markTaskLost // mid-execution, misclassified
After:
ops.start (or cron.run)
-> runStartupCatchupCandidate (or prepareManualRun+finishPreparedManualRun)
markCronJobActive(jobId)
try { executeJobCoreWithTimeout } finally { clearCronJobActive(jobId) }
-> task-registry maintenance tick
-> hasBackingSession(task) -> isCronJobActive(jobId) == true
-> reconcile skipped while the run is live
```
Security Impact (required)
Repro + Verification
Environment
pnpmworkspace onupstream/mainatb7fba2100fSteps
upstream/mainand copysrc/cron/active-jobs-symmetry.test.tsfrom this PR only.Expected
active-jobs-symmetry.test.ts2/2 fail.active-jobs-symmetry.test.ts2/2 green;task-registry.maintenance.issue-60299.test.ts5/5 green; full cron vitest scope (79 files / 665 tests) green.Actual
Evidence
Pre-fix run on
upstream/mainplus the new test file alone:```
FAIL |cron| src/cron/active-jobs-symmetry.test.ts > ... > startup catchup marks the job active during execution and clears it on completion
AssertionError: expected false to be true
❯ src/cron/active-jobs-symmetry.test.ts
❯ expect(isCronJobActive("catchup-isolated")).toBe(true);
FAIL |cron| src/cron/active-jobs-symmetry.test.ts > ... > manual run marks the job active during execution and clears it even when the inner throws
AssertionError: expected false to be true
Test Files 1 failed (1)
Tests 2 failed (2)
```
Post-fix:
```
Test Files 1 passed (1)
Tests 2 passed (2)
```
Cron suite (post-fix):
```
Test Files 79 passed (79)
Tests 665 passed (665)
```
Human Verification (required)
active-jobs-symmetry.test.tsfailing on the two asserts; post-fix the same file green; post-fixtask-registry.maintenance.issue-60299.test.tsgreen (5/5); post-fix full cron vitest scope green (79 files / 665 tests);pnpm tsgo:allgreen;pnpm checkexit 0;pnpm buildexit 0.prepareManualRunreturnsran:falsefor every preflight bail-out (invalid-spec,already-running,not-due) before reachingmarkCronJobActive, sofinishPreparedManualRunis only called when the mark was set; the outertry/finallyguarantees the clear even whentryFinishManualTaskRunor thelockedblock throws. Thrown-inner path —executeJobCoreWithTimeoutfailures are already captured into{ status: "error" }inside the innertry; the second regression test directly exercises the thrown-inner path with a rejecting deferred promise.task-registry.maintenance.issue-60299.test.ts). Also did not run the scenario against a real long-running production LLMagentTurn(no live model invocation).Review Conversations
Compatibility / Migration
Risks and Mitigations
active-jobs-symmetry.test.tsnow asserts the invariant across the startup-catch-up and manual-run paths. A refactor that drops mark/clear fromrunDueJoborexecuteJob, or adds a new execution path without mark/clear, would need an explicit decision about whether to extend the symmetry test, making the contract visible.tryRecoverTaskBeforeMarkLosthook infrastructure and maintainers may prefer that direction over direct injection.activeJobIdsprovides. This PR completes the existingactiveJobIdscontract; registering the hook for cross-runtime parity is a natural follow-up if that is the preferred long-term direction.[AI-assisted]