fix(cron): catch up overdue jobs after gateway restart #10412
izzzzzi wants to merge 1 commit into openclaw:main
Conversation
src/cron/service/ops.ts
Outdated
```ts
await ensureLoaded(state);
recomputeNextRuns(state);

// After recomputing, run any jobs that are already overdue.
// This handles the case where the gateway was restarted and jobs
// were missed while it was down. Without this, overdue jobs would
// only fire on the *next* timer tick, but armTimer computes delay
// from nextWakeAtMs which may skip past already-due jobs when
// recomputeNextRuns sets nextRunAtMs to a future slot.
await runOverdueJobsOnStartup(state);
```
**Overdue scan after recompute**
start() calls recomputeNextRuns(state) before runOverdueJobsOnStartup(), but recomputeNextRuns overwrites job.state.nextRunAtMs using computeJobNextRunAtMs(job, now) (typically advancing cron/every schedules to the next future slot). As a result, runOverdueJobsOnStartup()’s filter (now >= j.state.nextRunAtMs) will usually see no overdue jobs, so the intended catch-up won’t run for the missed occurrences after restart. This also contradicts the existing timer tick design which reloads with skipRecompute specifically so runDueJobs can observe persisted nextRunAtMs values (src/cron/service/timer.ts:44-49).
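A minimal sketch of the ordering problem the reviewer describes, using hypothetical stand-in names (`collectOverdueJobIds` and a simplified `recomputeNextRuns`, not the actual implementations): the overdue snapshot must be taken from the persisted `nextRunAtMs` values *before* recompute advances them, or the scan finds nothing.

```typescript
type Job = { id: string; state: { nextRunAtMs: number } };

// Snapshot overdue jobs while persisted nextRunAtMs values are intact.
function collectOverdueJobIds(jobs: Job[], now: number): string[] {
  return jobs.filter((j) => now >= j.state.nextRunAtMs).map((j) => j.id);
}

// Stand-in for computeJobNextRunAtMs: advance to the next future slot.
function recomputeNextRuns(jobs: Job[], now: number, intervalMs: number): void {
  for (const j of jobs) {
    while (j.state.nextRunAtMs <= now) j.state.nextRunAtMs += intervalMs;
  }
}

const now = 1_000_000;
const jobs: Job[] = [
  { id: "overdue", state: { nextRunAtMs: now - 5_000 } },
  { id: "future", state: { nextRunAtMs: now + 60_000 } },
];

// Correct order: snapshot first, then recompute.
const overdue = collectOverdueJobIds(jobs, now); // ["overdue"]
recomputeNextRuns(jobs, now, 60_000);
// Scanning only after recompute would miss the restart catch-up entirely:
const afterRecompute = collectOverdueJobIds(jobs, now); // []
console.log(overdue, afterRecompute);
```

This mirrors why the timer tick reloads with `skipRecompute`: the persisted values are the only evidence that a run was missed.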
```ts
/**
 * On startup, check for jobs whose persisted `nextRunAtMs` (from the
 * store file, before recompute) was in the past — meaning they should
 * have fired while the gateway was down. For each such job, execute
 * it once (catch-up) and then let the normal recompute advance to the
 * next future slot.
 *
 * We only catch up **once per job** (not all missed occurrences) to
 * avoid flooding after a long outage.
 */
```
**Doc/comment mismatches behavior**
The docstring says catch-up uses the persisted nextRunAtMs “from the store file, before recompute”, but the implementation reads j.state.nextRunAtMs after recomputeNextRuns() has already run (so it’s no longer the persisted value). As written, this will not reliably detect “missed while down” jobs for cron/every schedules.
Related: #10401 describes a potentially deeper issue where periodic cron jobs never fire at all (even without restarts). This PR's catch-up mechanism would help recover missed jobs on startup, but #10401 may require an additional fix in the timer tick path.
Force-pushed 42fd4f9 to 0baf503
…er on early return

Two fixes for cron scheduler reliability:

1. **Catch up overdue jobs after restart (openclaw#10389)**: When the gateway restarts, jobs that should have fired during downtime were silently skipped. On startup, `ops.start()` now loads the store with `skipRecompute`, snapshots which jobs have `nextRunAtMs` in the past, then recomputes and executes the overdue ones (one catch-up per job).
2. **Re-arm timer on early return (openclaw#10401)**: When `onTimer()` was called while a previous tick was still running, it returned without re-arming, which could permanently stall the scheduler. Now calls `armTimer()` before the early return. Also adds a second `runDueJobs()` call after `recomputeNextRuns()` in the timer tick to catch jobs that become due after recompute (e.g. after external edits to jobs.json).

Tests:
- Verifies overdue jobs are caught up on restart
- Verifies non-overdue jobs are NOT caught up (regression guard)
- Both scenarios tested in a single test to avoid fake-timer leaks

Fixes openclaw#10389
Fixes openclaw#10401
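The "re-arm on early return" fix from the commit above can be sketched as follows. This is a simplified illustration with assumed shapes (`TimerState`, a bare `armTimer`), not the actual `src/cron/service/timer.ts` code: the point is that the early-return path must leave a pending timer behind.

```typescript
type TimerState = { running: boolean; timer?: ReturnType<typeof setTimeout> };

// Schedule (or reschedule) the next tick.
function armTimer(state: TimerState, delayMs: number, tick: () => void): void {
  if (state.timer) clearTimeout(state.timer);
  state.timer = setTimeout(tick, delayMs);
}

function onTimer(state: TimerState, delayMs: number, tick: () => void): string {
  if (state.running) {
    // Before the fix this returned with no pending timer, so nothing
    // would ever fire again until an external event re-armed it.
    armTimer(state, delayMs, tick);
    return "rearmed";
  }
  state.running = true;
  try {
    // ... runDueJobs / recomputeNextRuns would go here ...
    return "ran";
  } finally {
    state.running = false;
    armTimer(state, delayMs, tick);
  }
}
```

Either way `onTimer` exits, `state.timer` holds a pending timeout, so the scheduler cannot stall.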
Force-pushed 0baf503 to 490c7f9
acastellana left a comment
Reviewed the approach - looks correct. Key insight of snapshotting overdue jobs BEFORE recompute is solid. The once-per-job catch-up (not all missed runs) is the right trade-off to avoid flooding after long outages. Tests cover the main scenarios well.
Minor suggestion for future: could add a configurable limit on catch-up jobs at startup, but not blocking.
Tested the same issue locally and this matches the fix pattern that works. 👍
…ompute

Fixes openclaw#10653

## Problem

`every` and `cron` type jobs never fire because:

1. Timer fires at T+ε (slightly late or early due to JS timer precision)
2. `recomputeNextRuns()` advances `nextRunAtMs` to next interval
3. `runDueJobs()` checks `now >= nextRunAtMs` which is now false
4. Job is perpetually pushed forward

## Solution

1. Add 2-second tolerance in `runDueJobs()`: `now >= next - 2000`
2. Skip recomputing jobs that are already due (let runDueJobs handle them)
3. Re-arm timer on early return when previous tick is running
4. Run `runDueJobs()` twice: before and after recompute
5. Catch up overdue jobs on gateway startup

## Changes

- `timer.ts`: Add `DUE_TOLERANCE_MS`, re-arm on early return, double runDueJobs
- `jobs.ts`: Skip due jobs in `recomputeNextRuns()`
- `ops.ts`: Collect and run overdue jobs on startup (from PR openclaw#10412)

Tested with 2-minute interval jobs - now fires reliably.
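The due-tolerance part of the commit above can be illustrated with a small sketch. The constant name `DUE_TOLERANCE_MS` comes from the commit message; the `isDue` helper is a hypothetical reduction of the check inside `runDueJobs()`: a job whose `nextRunAtMs` is within the tolerance window of "now" still counts as due, so sub-second timer jitter cannot perpetually push it forward.

```typescript
// 2-second tolerance, per the commit message above.
const DUE_TOLERANCE_MS = 2_000;

// A job is due if "now" has reached its slot, allowing for timer jitter.
function isDue(nextRunAtMs: number, now: number): boolean {
  return now >= nextRunAtMs - DUE_TOLERANCE_MS;
}

// Timer fired 500 ms early due to JS timer precision:
console.log(isDue(10_000, 9_500)); // true — within tolerance, job fires
console.log(isDue(10_000, 7_000)); // false — genuinely not due yet
```

Without the tolerance, the early tick would see `9_500 >= 10_000` as false, recompute would advance the slot, and the cycle would repeat on every tick.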
Closing — the fixes in this PR (overdue job catch-up on restart, timer re-arm on early return, and post-recompute due-job sweep) have been fully addressed by #10776, which landed a more comprehensive cron reliability overhaul covering the same root causes. Thanks to @tyler6204 for the thorough fix! Our issue reports (#10389, #10045) and analysis helped shape the solution. 🎉
Upstream: 141 commits merged from openclaw/openclaw origin/main.

Key fixes included:
- fix(cron): prevent recomputeNextRuns from skipping due jobs (openclaw#9823)
- fix(cron): re-arm timer in finally to survive transient errors (openclaw#9948)
- fix(cron): handle legacy atMs field in schedule (openclaw#9932)
- fix cron scheduling and reminder delivery regressions (openclaw#9733)

Local additions (ported from upstream PR openclaw#10412, not yet merged):
- Catch up overdue jobs after gateway restart (collectOverdueJobIds)
- Re-arm timer on early return when state.running
- Second runDueJobs after recompute to catch races

Conflict resolution:
- Core/gateway: took upstream (bug fixes, features)
- UI: took upstream (webchat improvements)
- Build: took upstream (deps, config)
- Fixed duplicate AgentsFiles* exports in protocol schema types
Summary
Fixes #10389, Fixes #10401, Fixes #10045
Three fixes for cron scheduler reliability:
Fix 1: Catch up overdue jobs after gateway restart (#10389, #10045)
After a gateway restart, cron jobs that should have fired while the gateway was down were silently skipped. The scheduler's `recomputeNextRuns()` advanced `nextRunAtMs` to the next future slot without executing the missed run.

Change: Added `runOverdueJobsOnStartup()` in `ops.start()` — after `recomputeNextRuns()`, it scans for jobs with `nextRunAtMs <= now` and executes them once. Only one missed occurrence per job is caught up to avoid flooding after a long outage.

Fix 2: Re-arm timer on early return (#10401)
When `onTimer()` was called while a previous tick was still running (`state.running = true`), it returned early without re-arming the timer. This could permanently stall the scheduler — no future jobs would fire until the next external event (e.g. job add/update) called `armTimer()`.

Change: Call `armTimer()` before the early return so the scheduler always has a pending timer.

Fix 3: Run due jobs after recompute (#10401)
In `onTimer()`, after `runDueJobs()` executes with persisted (`skipRecompute`) `nextRunAtMs` values, `recomputeNextRuns()` recalculates all `nextRunAtMs` from the current time. If the persisted values were stale (e.g. after a `config.patch` or external `jobs.json` edit), some jobs could become newly due after recompute but never get executed until the next timer tick — effectively skipping them.

Change: Call `runDueJobs()` a second time after `recomputeNextRuns()` to catch any jobs that became due after the recompute.

Files Changed
- `src/cron/service/ops.ts`: Added `runOverdueJobsOnStartup()` called from `start()`
- `src/cron/service/timer.ts`: Re-arm on early return + second `runDueJobs()` after recompute
- `src/cron/service.catches-up-overdue-after-restart.test.ts`: Tests for startup catch-up
- `src/cron/service.timer-rearm-and-due-race.test.ts`: Tests for timer reliability

Testing
Tests cover: