Skip to content

fix(cron): catch up overdue jobs after restart and enable immediate firing#11236

Closed
VintLin wants to merge 2 commits intoopenclaw:mainfrom
VintLin:fix/cron-catchup-recovery
Closed

fix(cron): catch up overdue jobs after restart and enable immediate firing#11236
VintLin wants to merge 2 commits intoopenclaw:mainfrom
VintLin:fix/cron-catchup-recovery

Conversation

@VintLin
Copy link
Contributor

@VintLin VintLin commented Feb 7, 2026

This PR addresses remaining edge cases in cron scheduler reliability following #10776, specifically fixing issues where missed jobs are pushed to the future and dynamic additions suffer from polling delays.

Problem

  1. Overdue jobs skipped on restart (Cron scheduler doesn't recover from past nextWakeAtMs after gateway restart #11184): While runMissedJobs handles the initial catch-up, the current recomputeNextRuns logic is too aggressive. It recalculates the nextRunAtMs to a future slot before an overdue job has a chance to successfully complete, effectively "forgetting" the missed run if it fails once or if the timing window is tight.
  2. Dynamic addition delay ([Bug] Cron jobs created after gateway start never execute until restart #11217): Newly added jobs with past anchors do not always trigger immediate execution, often waiting for the next minute-level timer tick.

Changes

  • State-aware Recomputation: Modified computeJobNextRunAtMs to implement a "past-due preservation" policy. If nextRunAtMs is in the past and lastStatus is not ok, the scheduler now preserves the overdue timestamp instead of advancing it.
  • Failure Backoff: Introduced a 5-second FAILED_RETRY_COOLDOWN to prevent tight infinite retry loops (high CPU) if an overdue job repeatedly fails (e.g., persistent network issues).
  • Immediate Timer Arming: Enabled allowPast flags during initial scheduling, allowing armTimer to detect a delay <= 0 and fire new jobs immediately upon addition.

Validation

  • Added regression tests in src/cron/service.issue-regressions.test.ts covering:
    • Overdue job preservation across recomputes.
    • 0ms execution for dynamically added jobs with past anchors.
    • Verification of the 5s cooldown window during consecutive failures.

Greptile Overview

Greptile Summary

This change adjusts cron scheduling to better handle missed/overdue runs and reduce delays when adding new jobs:

  • computeJobNextRunAtMs now preserves past-due nextRunAtMs in some cases and applies a 5s retry cooldown after failures.
  • computeNextRunAtMs accepts an allowPast option so initial scheduling can return a past timestamp (so timers can fire immediately).
  • Adds regression tests for overdue preservation, immediate scheduling for past anchors, and the failure cooldown behavior.

Key areas touched are src/cron/schedule.ts (schedule computations) and src/cron/service/jobs.ts (job next-run logic), with new tests in src/cron/service.issue-regressions.test.ts.

Confidence Score: 3/5

  • This PR is likely mergeable, but there are a couple edge cases/tests that should be tightened to avoid regressions.
  • The core intent is clear and the added tests cover the reported issues, but the new allowPast behavior for cron expressions can return nowMs at exact boundaries (breaking the prior >now invariant), and the regression tests hardcode a store version which can make them brittle across schema bumps.
  • src/cron/schedule.ts, src/cron/service.issue-regressions.test.ts

(2/5) Greptile learns from your feedback when you react with thumbs up/down!

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +63 to +74
if (options?.allowPast) {
// To find the "current" or "last" run, we can look back a bit.
// Croner doesn't have a direct "lastRun" that is efficient without a reference.
// But we can check if nowMs itself is a run time, or search backwards.
// However, the requirement is "past-due preservation", so maybe we don't
// need computeNextRunAtMs to return a past time for cron expressions
// unless it's explicitly for initial scheduling.
// Let's try to find the most recent run before or at nowMs.
const prev = cron.previousRun(new Date(nowMs + 1));
if (prev) {
return prev.getTime();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Past cron run returned

In the allowPast path for cron expressions, cron.previousRun(new Date(nowMs + 1)) can return a timestamp equal to nowMs when nowMs is exactly on a cron boundary. That breaks the usual nextRunAtMs > nowMs invariant in computeNextRunAtMs and can result in a job being scheduled as immediately due at initialization (and potentially re-armed at 0ms repeatedly depending on timer logic). Consider ensuring the returned time is strictly < nowMs (or otherwise handling the equality case explicitly).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/schedule.ts
Line: 63:74

Comment:
**Past cron run returned**

In the `allowPast` path for cron expressions, `cron.previousRun(new Date(nowMs + 1))` can return a timestamp equal to `nowMs` when `nowMs` is exactly on a cron boundary. That breaks the usual `nextRunAtMs > nowMs` invariant in `computeNextRunAtMs` and can result in a job being scheduled as immediately due at initialization (and potentially re-armed at 0ms repeatedly depending on timer logic). Consider ensuring the returned time is strictly `< nowMs` (or otherwise handling the equality case explicitly).

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +348 to +377
it("preserves past-due nextRunAtMs during recomputeNextRuns (#11184)", async () => {
const now = Date.parse("2026-02-06T10:05:00.000Z");
const past = now - 60_000;

const state = createCronServiceState({
cronEnabled: true,
storePath: "/tmp/dummy.json",
log: noopLogger,
nowMs: () => now,
enqueueSystemEvent: vi.fn(),
requestHeartbeatNow: vi.fn(),
runIsolatedAgentJob: vi.fn(),
});

const job: CronJob = {
id: "past-job",
name: "past-job",
enabled: true,
createdAtMs: past - 1000,
updatedAtMs: past - 1000,
schedule: { kind: "every", everyMs: 3600_000, anchorMs: past },
sessionTarget: "main",
payload: { kind: "systemEvent", text: "tick" },
state: { nextRunAtMs: past },
};

state.store = {
version: 1,
jobs: [job],
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded store version

These new regression tests build state.store = { version: 1, ... } directly. If the real store schema version changes (or differs in this repo today), the tests can start failing for reasons unrelated to the regression. Using the same store creation helpers/constants used by the service (or loading a minimal persisted store via existing helpers) would keep these tests stable across migrations.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 348:377

Comment:
**Hardcoded store version**

These new regression tests build `state.store = { version: 1, ... }` directly. If the real store schema version changes (or differs in this repo today), the tests can start failing for reasons unrelated to the regression. Using the same store creation helpers/constants used by the service (or loading a minimal persisted store via existing helpers) would keep these tests stable across migrations.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@tyler6204
Copy link
Member

Superseded by #11641 (merge commit: 8fae55e). Closing to reduce duplicate PR noise. Please open a new PR only if there is additional scope beyond this fix.

@tyler6204 tyler6204 closed this Feb 8, 2026
@VintLin VintLin deleted the fix/cron-catchup-recovery branch February 10, 2026 03:00
@VintLin VintLin restored the fix/cron-catchup-recovery branch February 10, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants