fix(cron): mirror active-jobs mark/clear on startup catchup and manual run by Feelw00 · Pull Request #71040 · openclaw/openclaw

Feelw00 · 2026-04-24T09:12:15Z

Summary

Problem: Upstream #60310 (fix: honor exec approval security and clean up stale tasks, 7d1575b) wired activeJobIds mark/clear only into runDueJob and executeJob, leaving runStartupCatchupCandidate and prepareManualRun/finishPreparedManualRun uninstrumented. The cron branch at task-registry.maintenance.ts:124-128 depends solely on isCronJobActive, so runs on those two paths are misclassified as lost after TASK_RECONCILE_GRACE_MS (5 min) while the cron service is still executing them.
Why it matters: DEFAULT_JOB_TIMEOUT_MS is 10 min and cron isolated agentTurn emits no recordTaskRunProgressByRunId calls, so lastEventAt is pinned to startedAt and the 5-min grace is routinely exceeded. Affects startup catch-up on every gateway restart with missed jobs (up to DEFAULT_MAX_MISSED_JOBS_PER_RESTART=5 per restart) and every openclaw cron run <id> / agent tool / UI manual run.
What changed: Mirror the existing mark/clear pattern on the two missing paths inside try/finally. ~13 production lines + 1 import; 1 new test file with 2 regression tests.
What did NOT change (scope boundary): runDueJob and executeJob paths are untouched; the runningAtMs persistence mechanism is not modified; cron does not register tryRecoverTaskBeforeMarkLost (deferred to a possible follow-up).

Change Type (select all)

Bug fix

Scope (select all touched areas)

Gateway / orchestration

Linked Issue/PR

Related: [Bug] Cron isolated agentTurn: "already-running" survives restart, run history always empty (#68157). This PR partially addresses the task-registry misclassification described there; the runningAtMs persistence aspect is a separate state machine and is not touched by this PR.
Related: False-positive lost cron task records after gateway restart due to transient activeJobIds backing-session check (#68191), an independent, broader proposal by @hclsys in the same general area.
Related: tasks: add detached task recovery hook before markLost (#69313), which introduced the tryRecoverTaskBeforeMarkLost hook infrastructure; this PR is complementary. See Risks and Mitigations below.
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: Partial application of the contract from #60310 (fix: honor exec approval security and clean up stale tasks, 7d1575b). The commit added activeJobIds + markCronJobActive/clearCronJobActive and taught task-registry.maintenance.hasBackingSession to depend solely on isCronJobActive for runtime='cron' (task-registry.maintenance.ts:124-128), but wired the mark/clear pair into only 2 of the 4 cron execution paths.
Missing detection / guardrail: No test asserts the activeJobIds invariant at the cron layer across all execution paths. The existing task-registry.maintenance.issue-60299.test.ts stubs isCronJobActive=true directly on the consuming side, so a producing-side gap never surfaces there.
Contributing context: isolated agentTurn has no recordTaskRunProgressByRunId emission, so lastEventAt never advances past startedAt — the 5-min grace compares against a frozen reference and expires deterministically while the underlying LLM round-trip is still running.
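
To make the timing argument concrete, here is a minimal sketch of the grace comparison. It assumes a simplified task shape and a hypothetical `wouldMarkLost` helper; neither is the real task-registry code, and the exact comparison operator is an assumption. Only the two constants' values come from the description above.

```typescript
// Values quoted in the PR description.
const TASK_RECONCILE_GRACE_MS = 5 * 60 * 1000;
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000;

// Hypothetical, simplified task shape; the real task record has more fields.
interface TaskSnapshot {
  startedAt: number;
  lastEventAt: number;
}

// Simplified stand-in for the maintenance decision on a runtime='cron' task:
// a task without a backing session is treated as lost once the grace elapses.
function wouldMarkLost(task: TaskSnapshot, now: number, cronJobActive: boolean): boolean {
  if (cronJobActive) return false;
  return now - task.lastEventAt > TASK_RECONCILE_GRACE_MS;
}

// An isolated agentTurn emits no progress, so lastEventAt stays at startedAt.
const task: TaskSnapshot = { startedAt: 0, lastEventAt: 0 };
const sixMinutes = 6 * 60 * 1000;

// Six minutes in, the job is still within its allowed runtime...
const stillWithinTimeout = sixMinutes < DEFAULT_JOB_TIMEOUT_MS;
// ...yet an unmarked path is already past the grace, while a marked one is not.
const lostWhenUnmarked = wouldMarkLost(task, sixMinutes, false);
const lostWhenMarked = wouldMarkLost(task, sixMinutes, true);
```

Under these assumptions, any run longer than 5 minutes on an unmarked path is flagged, even though the job is allowed to run for up to 10.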

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test or file: src/cron/active-jobs-symmetry.test.ts (new).
Scenario the test should lock in:
1. Startup catch-up marks the cron job active during runStartupCatchupCandidate and clears it afterwards.
2. Manual run marks during prepareManualRun and clears in finishPreparedManualRun's finally, including when the inner execution throws.
Why this is the smallest reliable guardrail: It drives the real CronService.start() / cron.run() entry points and uses the real activeJobIds singleton (via resetCronActiveJobsForTests). The only mock is runIsolatedAgentJob, which is a legitimate constructor-level dep and is replaced by a deferred promise so the test can observe the mid-flight invariant. No isCronJobActive stubbing, so the test cannot pass on a phantom branch.
Existing test that already covers this: None — task-registry.maintenance.issue-60299.test.ts:151-165 covers the consuming side (isCronJobActive=true => reconcile skipped) but never exercises the producing side across all four cron paths, which is where the partial-merge gap lives.
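
The deferred-promise technique described above can be sketched roughly as follows. This is an illustration only, not the contents of active-jobs-symmetry.test.ts: `createDeferred`, `runJob`, `observeMidFlight`, and the local `active` set are hypothetical stand-ins for the real test helpers, CronService, and activeJobIds singleton.

```typescript
// Hypothetical helper: a promise whose settlement the test controls from outside.
function createDeferred<T>() {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Stand-in for the active-jobs set and for an instrumented execution path.
const active = new Set<string>();

async function runJob(jobId: string, inner: () => Promise<void>): Promise<void> {
  active.add(jobId);
  try {
    await inner();
  } finally {
    active.delete(jobId);
  }
}

// Starts a run whose inner work is gated on the deferred promise, samples the
// active set while the run is suspended, then releases the gate and samples again.
async function observeMidFlight(): Promise<{ during: boolean; after: boolean }> {
  const gate = createDeferred<void>();
  const run = runJob("catchup-isolated", () => gate.promise);
  const during = active.has("catchup-isolated"); // inner work is still pending here
  gate.resolve();
  await run;
  return { during, after: active.has("catchup-isolated") };
}
```

Because the inner work cannot finish until the gate is released, the mid-flight sample is deterministic rather than timing-dependent, which is the property the test plan relies on.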

User-visible / Behavior Changes

None. Internal contract completion only. Existing callers, return shapes, and events are unchanged; the only observable difference is that isCronJobActive(jobId) now returns true during runStartupCatchupCandidate and manual-run execution (matching the behaviour that already exists for runDueJob / executeJob).

Diagram (if applicable)

```text
Before (runStartupCatchupCandidate / manual run paths):
ops.start (or cron.run)
-> runStartupCatchupCandidate (or prepareManualRun+finishPreparedManualRun)
-> executeJobCoreWithTimeout [... running, 6-10 min ...]
(activeJobIds never touched)
-> task-registry maintenance tick (every ~60s)
-> hasBackingSession(task) -> isCronJobActive(jobId) == false
-> lastEventAt === startedAt; grace expires at 5 min
-> markTaskLost // mid-execution, misclassified

After:
ops.start (or cron.run)
-> runStartupCatchupCandidate (or prepareManualRun+finishPreparedManualRun)
markCronJobActive(jobId)
try { executeJobCoreWithTimeout } finally { clearCronJobActive(jobId) }
-> task-registry maintenance tick
-> hasBackingSession(task) -> isCronJobActive(jobId) == true
-> reconcile skipped while the run is live
```
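
The "After" flow can be sketched in TypeScript as below. This is a simplified illustration under stated assumptions, not the PR's diff: the `Set` stands in for the activeJobIds singleton, the three helper functions reuse the names from the description but their bodies are guesses, and `runWithActiveMark` is a hypothetical wrapper that does not exist in the codebase.

```typescript
// Stand-in for the activeJobIds singleton described in this PR.
const activeJobIds = new Set<string>();

function markCronJobActive(jobId: string): void {
  activeJobIds.add(jobId);
}

function clearCronJobActive(jobId: string): void {
  activeJobIds.delete(jobId);
}

function isCronJobActive(jobId: string): boolean {
  return activeJobIds.has(jobId);
}

// Hypothetical wrapper: runs `execute` with the job marked active and clears
// the mark in `finally`, so the clear happens even when `execute` rejects.
async function runWithActiveMark<T>(jobId: string, execute: () => Promise<T>): Promise<T> {
  markCronJobActive(jobId);
  try {
    return await execute();
  } finally {
    clearCronJobActive(jobId);
  }
}
```

The try/finally is the important part: without it, a thrown execution would leave the job permanently marked active and the maintenance tick would never reconcile it.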

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS 15.4 (darwin 25.4.0)
Runtime/container: local pnpm workspace on upstream/main at b7fba2100f
Model/provider: N/A (unit-level regression)
Integration/channel (if any): N/A
Relevant config (redacted): default cron config

Steps

1. Check out upstream/main and copy only src/cron/active-jobs-symmetry.test.ts from this PR.
2. Run `pnpm exec vitest run --config test/vitest/vitest.cron.config.ts src/cron/active-jobs-symmetry.test.ts`. Both tests fail at the `expect(isCronJobActive(...)).toBe(true)` assertions (startup catch-up and manual run).
3. Apply the production changes in this PR and rerun. Both tests pass.
4. Also run `pnpm exec vitest run src/tasks/task-registry.maintenance.issue-60299.test.ts`: 5/5 green, no regression.

Expected

Pre-fix: active-jobs-symmetry.test.ts 2/2 fail.
Post-fix: active-jobs-symmetry.test.ts 2/2 green; task-registry.maintenance.issue-60299.test.ts 5/5 green; full cron vitest scope (79 files / 665 tests) green.

Actual

Matches expected.

Evidence

Failing test/log before + passing after

Pre-fix run on upstream/main plus the new test file alone:

```
FAIL |cron| src/cron/active-jobs-symmetry.test.ts > ... > startup catchup marks the job active during execution and clears it on completion
AssertionError: expected false to be true
❯ src/cron/active-jobs-symmetry.test.ts
❯ expect(isCronJobActive("catchup-isolated")).toBe(true);

FAIL |cron| src/cron/active-jobs-symmetry.test.ts > ... > manual run marks the job active during execution and clears it even when the inner throws
AssertionError: expected false to be true

Test Files 1 failed (1)
Tests 2 failed (2)
```

Post-fix:

```
Test Files 1 passed (1)
Tests 2 passed (2)
```

Cron suite (post-fix):

```
Test Files 79 passed (79)
Tests 665 passed (665)
```

Human Verification (required)

Verified scenarios: pre-fix active-jobs-symmetry.test.ts failing on the two asserts; post-fix the same file green; post-fix task-registry.maintenance.issue-60299.test.ts green (5/5); post-fix full cron vitest scope green (79 files / 665 tests); pnpm tsgo:all green; pnpm check exit 0; pnpm build exit 0.
Edge cases checked: mark-after-early-return — prepareManualRun returns ran:false for every preflight bail-out (invalid-spec, already-running, not-due) before reaching markCronJobActive, so finishPreparedManualRun is only called when the mark was set; the outer try/finally guarantees the clear even when tryFinishManualTaskRun or the locked block throws. Thrown-inner path — executeJobCoreWithTimeout failures are already captured into { status: "error" } inside the inner try; the second regression test directly exercises the thrown-inner path with a rejecting deferred promise.
What you did not verify: end-to-end behaviour against a live task-registry sweeper in a running gateway (the test asserts the cron-layer invariant directly; the consuming side is already covered by task-registry.maintenance.issue-60299.test.ts). Also did not run the scenario against a real long-running production LLM agentTurn (no live model invocation).
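
The mark-after-early-return and thrown-inner edge cases above can be illustrated with a reduced model. Everything here is a hypothetical simplification: `prepareSketch` and `finishSketch` are not the real prepareManualRun/finishPreparedManualRun signatures, and the local `Set` stands in for activeJobIds.

```typescript
const activeJobIds = new Set<string>();

type BailOut = "invalid-spec" | "already-running" | "not-due";

type PreparedRun =
  | { ran: false; reason: BailOut }
  | { ran: true; jobId: string };

// Preflight bail-outs return before the mark, so nothing needs clearing later.
function prepareSketch(jobId: string, bailOut: BailOut | null): PreparedRun {
  if (bailOut !== null) return { ran: false, reason: bailOut };
  activeJobIds.add(jobId);
  return { ran: true, jobId };
}

// Only reachable with a prepared run that set the mark. The inner try converts
// a thrown execution into an error status; the outer finally always clears.
async function finishSketch(
  prepared: { ran: true; jobId: string },
  execute: () => Promise<void>,
): Promise<{ status: "ok" | "error" }> {
  try {
    try {
      await execute();
      return { status: "ok" };
    } catch {
      return { status: "error" };
    }
  } finally {
    activeJobIds.delete(prepared.jobId);
  }
}
```

In this model the mark and the clear are always paired: a bail-out performs neither, and a prepared run performs both regardless of how the execution ends.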

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: A future fifth cron execution path could be added without mark/clear, recreating the same partial-merge gap.
- Mitigation: active-jobs-symmetry.test.ts now asserts the invariant across the startup-catch-up and manual-run paths. A refactor that drops mark/clear from runDueJob or executeJob, or adds a new execution path without mark/clear, would need an explicit decision about whether to extend the symmetry test, making the contract visible.
Risk: Architecture redirection. PR #69313 (tasks: add detached task recovery hook before markLost) introduced the tryRecoverTaskBeforeMarkLost hook infrastructure, and maintainers may prefer that direction over direct injection.
- Mitigation: Registering the hook alone would not close this gap — the recover callback would still need an alive-signal source, which is exactly what activeJobIds provides. This PR completes the existing activeJobIds contract; registering the hook for cross-runtime parity is a natural follow-up if that is the preferred long-term direction.

[AI-assisted]

greptile-apps · 2026-04-24T09:14:18Z

Greptile Summary

This PR completes the activeJobIds instrumentation contract introduced by #60310 by adding markCronJobActive/clearCronJobActive to the two previously uninstrumented cron execution paths: runStartupCatchupCandidate in timer.ts and prepareManualRun/finishPreparedManualRun in ops.ts. The fix prevents task-registry.maintenance from misclassifying in-flight cron runs on those paths as lost after the 5-minute grace period, which was a reliable outcome given isolated agentTurn never advances lastEventAt. The approach is minimal, mirrors the existing pattern exactly, and is guarded by two new regression tests that drive the real entry points and the real activeJobIds singleton.

Confidence Score: 5/5

Safe to merge — the change is a minimal, well-scoped instrumentation fix with solid regression test coverage and no side-effects on existing paths.

All three changed files are clean. The mark/clear placement in prepareManualRun is correctly guarded so only the ran:true branch is instrumented, and the finally in finishPreparedManualRun guarantees the clear even on inner throws. The startup catch-up path in timer.ts follows the same structure. Both regression tests exercise the real singleton and mid-flight state, ruling out phantom passes. No existing behaviour is altered on the runDueJob/executeJob paths. No P0 or P1 findings.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(cron): mirror active-jobs mark/clear..."

…l run

Upstream 7d1575b (openclaw#60310) introduced the activeJobIds singleton plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts:124-128). That patch wired the pair into runDueJob (timer.ts:746/586) and executeJob (timer.ts:1344/1374) but left the remaining two execution paths uninstrumented:

* runStartupCatchupCandidate (timer.ts:1043-1081)
* prepareManualRun / finishPreparedManualRun (ops.ts:548-686)

For runs taken on those paths, the cron branch of hasBackingSession sees isCronJobActive=false and, once TASK_RECONCILE_GRACE_MS (5 min, task-registry.maintenance.ts:28) elapses, marks the task 'lost' while the cron service is still executing it. With DEFAULT_JOB_TIMEOUT_MS=10 min (cron/service/timeout-policy.ts:8) and no recordTaskRunProgressByRunId emissions on isolated agentTurn runs, lastEventAt is pinned to startedAt so the grace is exceeded in practice.

This PR mirrors the existing mark/clear contract on the two missing paths inside try/finally, completing openclaw#60310's intent. No behavioural change to runDueJob / executeJob.

Related openclaw#68157 (partially addresses the task-registry misclassification aspect; the runningAtMs persistence aspect described in that issue is a separate state machine not touched by this PR).

Architecture note: PR openclaw#69313 introduced tryRecoverTaskBeforeMarkLost hook infrastructure but cron does not register it, and registering the hook alone would not close this gap (the recover callback would still need an alive-signal source, i.e. activeJobIds). This PR completes the existing contract; registering the hook for cross-runtime parity is a natural follow-up if maintainers prefer that direction.

[AI-assisted]

clawsweeper · 2026-04-27T03:29:46Z

Codex review: needs changes before merge.

Summary
The PR adds cron active-job mark/clear bookkeeping around startup catch-up and manual cron run execution paths and adds active-job symmetry regression tests.

Reproducibility: yes. Current main gives a high-confidence static reproduction: startup catch-up and manual cron task rows can remain active past the 5-minute reconciliation grace while isCronJobActive(jobId) stays false; the PR-specific startup issue reproduces with multiple catch-up candidates when an earlier completed candidate is cleared before batched outcome application.

Next step before merge
A focused repair can preserve the useful manual-run fix, adjust startup clear timing, add the missing multi-candidate regression and changelog entry, and rebase or replace the stale source branch if needed.

Security
Cleared: The diff only changes in-process cron bookkeeping and tests; it does not add dependencies, workflows, permissions, secrets handling, network calls, or package execution paths.

Review findings

[P2] Keep startup jobs active until outcomes are applied — src/cron/service/timer.ts:1078-1080
[P3] Add the required changelog entry — src/cron/service/timer.ts:1051

Review details

Best possible solution:

Land a rebased narrow fix that preserves the manual-run mark/clear, keeps startup catch-up jobs active until their task rows are finalized or finalizes each candidate before starting the next, and adds focused regression plus changelog coverage.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main gives a high-confidence static reproduction: startup catch-up and manual cron task rows can remain active past the 5-minute reconciliation grace while isCronJobActive(jobId) stays false; the PR-specific startup issue reproduces with multiple catch-up candidates when an earlier completed candidate is cleared before batched outcome application.

Is this the best way to solve the issue?

No, not as written. Mirroring active-job bookkeeping is the right fix direction, but startup catch-up must not clear the marker inside runStartupCatchupCandidate before applyStartupCatchupOutcomes finalizes the task/run outcome.

Full review comments:

[P2] Keep startup jobs active until outcomes are applied — src/cron/service/timer.ts:1078-1080
executeStartupCatchupPlan collects all missed-job outcomes and only later calls applyStartupCatchupOutcomes, so clearing the active marker in the candidate finally leaves an earlier completed-but-unapplied catch-up task inactive while later candidates run. If a later candidate exceeds the 5-minute maintenance grace, the earlier task can still be marked lost before its success/error outcome is applied. Move the clear to outcome application, or apply each candidate outcome before starting the next one.
Confidence: 0.87
[P3] Add the required changelog entry — src/cron/service/timer.ts:1051
This fix changes observable cron task status/audit behavior for manual and startup runs, but the PR does not touch CHANGELOG.md. Add a single-line entry under Unreleased Fixes so the user-facing cron status repair is recorded.
Confidence: 0.8
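
The P2 timing concern can be illustrated with a small synchronous simulation. It is a sketch under simplifying assumptions: `simulateCatchup` is hypothetical, candidates run one after another with all outcomes applied afterwards in a batch (as the finding describes), and the log simply records which already-finished jobs a maintenance tick would see as inactive while a later candidate runs.

```typescript
type ClearTiming = "candidate-finally" | "outcome-application";

// Returns a log of completed-but-unapplied jobs that look inactive
// while a later catch-up candidate is executing.
function simulateCatchup(candidates: string[], clearAt: ClearTiming): string[] {
  const active = new Set<string>();
  const completedUnapplied: string[] = [];
  const exposed: string[] = [];

  for (const jobId of candidates) {
    active.add(jobId);
    // While this candidate runs, a maintenance tick could inspect earlier ones.
    for (const earlier of completedUnapplied) {
      if (!active.has(earlier)) {
        exposed.push(`${earlier} inactive while ${jobId} runs`);
      }
    }
    // The candidate finishes; its outcome is collected but not yet applied.
    if (clearAt === "candidate-finally") active.delete(jobId);
    completedUnapplied.push(jobId);
  }

  // Batched outcome application: task rows are finalized only here.
  for (const jobId of completedUnapplied) active.delete(jobId);
  return exposed;
}

const withEarlyClear = simulateCatchup(["job-a", "job-b"], "candidate-finally");
const withLateClear = simulateCatchup(["job-a", "job-b"], "outcome-application");
```

With two candidates, clearing in the candidate's finally leaves one exposure ("job-a inactive while job-b runs"), while clearing at outcome application leaves none. The reviewer's other option, applying each outcome before starting the next candidate, would remove the exposure by emptying the completed-but-unapplied list instead.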

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

pnpm test src/cron/active-jobs-symmetry.test.ts src/tasks/task-registry.maintenance.issue-60299.test.ts
pnpm test src/cron/service/timer.test.ts src/cron/service/ops.test.ts
pnpm exec oxfmt --check --threads=1 src/cron/service/timer.ts src/cron/service/ops.ts src/cron/active-jobs-symmetry.test.ts CHANGELOG.md
pnpm check:changed

What I checked:

Current main scheduled path active marker: The normal due-job path marks markCronJobActive(job.id) before execution, and applyOutcomeToStoredJob clears the marker while finalizing the task/run outcome. (src/cron/service/timer.ts:882, f7549079ceb3)
Current main startup catch-up gap: runStartupCatchupCandidate creates the task and executes the job without active-job mark/clear, while executeStartupCatchupPlan collects all candidate outcomes and applyStartupCatchupOutcomes applies them later in a batch. (src/cron/service/timer.ts:1195, f7549079ceb3)
Current main manual-run gap: prepareManualRun reserves runningAtMs and creates the cron task, then finishPreparedManualRun executes and finalizes the job, but neither function updates activeJobIds on current main. (src/cron/service/ops.ts:629, f7549079ceb3)
Maintenance lost-task dependency: Cron task reconciliation uses a 5-minute grace and, when cron runtime authority is enabled, treats isCronJobActive(jobId) as the backing-session check before marking tasks lost. (src/tasks/task-registry.maintenance.ts:443, f7549079ceb3)
PR startup clear timing: The supplied PR diff for head 630df2624aea487bfd645c5133a0deeec678e340 adds clearCronJobActive(candidate.job.id) in runStartupCatchupCandidate's finally, before the later batched applyStartupCatchupOutcomes call finalizes task rows. (src/cron/service/timer.ts:1078, 630df2624aea)
Existing consumer-side test coverage: The current task-registry regression harness injects an active cron set on the consuming side, so it proves isCronJobActive=true prevents lost marking but does not exercise producer-side mark/clear symmetry across cron entry points. (src/tasks/task-registry.maintenance.issue-60299.test.ts:168, f7549079ceb3)

Likely related people:

lml2468: Merged PR #60310 (fix: honor exec approval security and clean up stale tasks) is the related change that introduced the cron task-maintenance behavior and active-job contract this PR is completing. (role: introduced behavior; confidence: high; commits: 7d1575b5df79; files: src/cron/active-jobs.ts, src/cron/service/timer.ts, src/tasks/task-registry.maintenance.ts)
steipete: Prior ClawSweeper context links current cron/task-maintenance blame to Peter Steinberger, and the available local history shows recent task restart-blocker maintenance by Peter near this surface. (role: recent maintainer; confidence: medium; commits: 123a507fa2b1, 4cbd1b53cf86; files: src/cron/service/timer.ts, src/tasks/task-registry.maintenance.ts)
vincentkoc: The available local blame/log for the current sparse history attributes recent cron service and task-status adjacent changes to Vincent Koc, including the current grafted base over these files. (role: recent adjacent maintainer; confidence: medium; commits: 871cd475af8f, f6f8d74419a1; files: src/cron/service/timer.ts, src/cron/service/ops.ts, src/tasks/task-registry.maintenance.ts)
garrytan: Merged PR #69313 (tasks: add detached task recovery hook before markLost) added the tryRecoverTaskBeforeMarkLost recovery hook that is directly adjacent to this lost-task reconciliation path. (role: adjacent owner; confidence: medium; commits: 24322af4f75a; files: src/tasks/detached-task-runtime-contract.ts, src/tasks/detached-task-runtime.ts, src/tasks/task-registry.maintenance.ts)

Remaining risk / open question:

No tests were executed because this was a read-only review; the assessment is based on current-main source tracing, supplied PR diff/context, and existing regression coverage.
The source branch may need a rebase or replacement repair before merge because the PR base is behind current main and a prior ClawSweeper comment reported mergeability concerns.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f7549079ceb3.

Feelw00 · 2026-05-06T02:58:59Z

Closing this PR — 1fae716a04 (fix: recover stale cron task records, merged 2026-04-26) addresses the same lost-marking concern from a different axis (sweeper-side post-recovery via cron run-log + store.lastRunStatus). Verified that the recovery path also covers the manual-run case (applyJobResult updates lastRunStatus/lastRunAtMs, and resolveCronJobStateRecovery reconciles).

The narrow remaining gap is UX-only: a manual run that outlasts the 5-minute TASK_RECONCILE_GRACE_MS surfaces a transient lost marker plus a Background task lost system message until recovery reconciles it. This is particularly noticeable for force-mode agentTurn runs (timeout up to 60 min). Filing this as a separate, narrower follow-up issue rather than blocking on this PR.

Thanks for the review window — closing without merge.

…lost marker

Upstream 7d1575b (openclaw#60310, 2026-04-04) introduced activeJobIds plus markCronJobActive/clearCronJobActive so task-registry maintenance has a backing-session signal for runtime='cron' tasks (task-registry.maintenance.ts hasBackingSession). That patch wired the pair into runDueJob and executeJob but left the manual-run path (prepareManualRun + finishPreparedManualRun in src/cron/service/ops.ts) without mark/clear.

Symptom: when a user triggers `openclaw cron run <id> --force` (or any manual-run RPC) and the run exceeds TASK_RECONCILE_GRACE_MS (5 min), which is common for force-mode `agentTurn` runs that can reach AGENT_TURN_SAFETY_TIMEOUT_MS (60 min), the task-registry sweeper marks the active task `lost` and emits a `Background task lost` system message to the session, even though the run is still progressing normally. The merged commit 1fae716 (resolveDurableCronTaskRecovery, 2026-04-26) reconciles terminal status retroactively from cron run-log + store.lastRunStatus, but only after the run finishes.

This patch suppresses the transient `lost` marker by adding the producer-side mark/clear pair, restoring symmetry with runDueJob/executeJob:

* prepareManualRun: markCronJobActive(job.id) after tryCreateManualTaskRun.
* finishPreparedManualRun: wrap body in try/finally with clearCronJobActive(jobId).

Scope intentionally narrower than openclaw#71040 (closed):

* No change to runStartupCatchupCandidate: the `deferAgentTurnJobs:true` policy added in 7877182 reroutes long-running startup catchups to runDueJob/executeJob (already wired). Non-agentTurn startup catchups are theoretical hot-path per pre-PR cross-review.
* No change to active-jobs.ts API.

Regression test (src/cron/active-jobs-manual-run.test.ts): two cases exercising production hot-path via cron.run("<id>", "force") (success and inner-throw) assert isCronJobActive transitions true→false around the manual run.

Fixes openclaw#78233.

[AI-assisted, fully tested]

openclaw-barnacle Bot added the size: M label Apr 24, 2026

Feelw00 force-pushed the fix/cron-active-jobs-symmetry branch from c2cf007 to 630df26 April 27, 2026 03:21

clawsweeper Bot mentioned this pull request Apr 29, 2026: False-positive lost cron task records after gateway restart due to transient activeJobIds backing-session check #68191 (Closed)

Feelw00 closed this May 6, 2026

This was referenced May 6, 2026:
- cron: transient 'lost' marker on long-running manual runs before sweeper recovery #78233 (Closed)
- fix(cron): mark active-jobs on manual-run path to suppress transient lost marker #78243 (Merged)

yetval mentioned this pull request Jun 9, 2026: Bug: startup catch-up cron runs never set the active marker, so long command jobs are reconciled as lost while still running #91695 (Open)

Feelw00 wants to merge 1 commit into openclaw:main from Feelw00:fix/cron-active-jobs-symmetry
