
fix(cron): catch up overdue jobs after gateway restart#10412

Closed
izzzzzi wants to merge 1 commit into openclaw:main from izzzzzi:fix/cron-catchup-after-restart

Conversation


@izzzzzi izzzzzi commented Feb 6, 2026

Summary

Fixes #10389, Fixes #10401, Fixes #10045

Three fixes for cron scheduler reliability:

Fix 1: Catch up overdue jobs after gateway restart (#10389, #10045)

After a gateway restart, cron jobs that should have fired while the gateway was down were silently skipped. The scheduler's recomputeNextRuns() advanced nextRunAtMs to the next future slot without executing the missed run.

Change: Added runOverdueJobsOnStartup() in ops.start() — after recomputeNextRuns(), it scans for jobs with nextRunAtMs <= now and executes them once. Only one missed occurrence per job is caught up to avoid flooding after a long outage.
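The catch-up pass can be sketched roughly as follows. This is a minimal illustration, not the actual `src/cron/service/ops.ts` code: the `Job` shape and the `execute` callback are stand-ins for the real scheduler internals.

```typescript
// Hypothetical minimal sketch of the startup catch-up pass.
type Job = { id: string; state: { nextRunAtMs: number } };

// Run each job whose nextRunAtMs is already in the past, at most once
// per job, to avoid flooding after a long outage.
async function runOverdueJobsOnStartup(
  jobs: Job[],
  now: number,
  execute: (job: Job) => Promise<void>,
): Promise<string[]> {
  const ran: string[] = [];
  for (const job of jobs) {
    if (job.state.nextRunAtMs <= now) {
      await execute(job); // one catch-up run per job
      ran.push(job.id);
    }
  }
  return ran;
}
```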

Fix 2: Re-arm timer on early return (#10401)

When onTimer() was called while a previous tick was still running (state.running = true), it returned early without re-arming the timer. This could permanently stall the scheduler — no future jobs would fire until the next external event (e.g. job add/update) called armTimer().

Change: Call armTimer() before the early return so the scheduler always has a pending timer.
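The guard-path fix can be sketched like this; `TimerState`, `armTimer`, and the `tick` callback are simplified stand-ins for the real `timer.ts` internals, where `armTimer` schedules an actual `setTimeout`:

```typescript
// Hypothetical sketch of the re-entrancy guard fix in onTimer().
type TimerState = { running: boolean; timerArmed: boolean };

function armTimer(state: TimerState): void {
  state.timerArmed = true; // the real code schedules a setTimeout here
}

function onTimer(state: TimerState, tick: () => void): void {
  state.timerArmed = false;
  if (state.running) {
    // Fix: re-arm before the early return so the scheduler always has
    // a pending timer instead of stalling until some external event.
    armTimer(state);
    return;
  }
  state.running = true;
  try {
    tick();
  } finally {
    state.running = false;
    armTimer(state);
  }
}
```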

Fix 3: Run due jobs after recompute (#10401)

In onTimer(), after runDueJobs() executes with persisted (skipRecompute) nextRunAtMs values, recomputeNextRuns() recalculates all nextRunAtMs from the current time. If the persisted values were stale (e.g. after a config.patch or external jobs.json edit), some jobs could become newly due after recompute but never get executed until the next timer tick — effectively skipping them.

Change: Call runDueJobs() a second time after recomputeNextRuns() to catch any jobs that became due after the recompute.
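The resulting tick shape can be illustrated with this simplified model (the `DueJob` type, the `ran` set, and the injected `recompute` callback are illustrative stand-ins, not the real `timer.ts` signatures):

```typescript
// Hypothetical sketch of the tick body after the fix: run due jobs
// against persisted nextRunAtMs values, recompute, then sweep again
// for jobs that only became due as a result of the recompute.
type DueJob = { id: string; nextRunAtMs: number };

function runDueJobs(jobs: DueJob[], now: number, ran: Set<string>): void {
  for (const j of jobs) {
    if (j.nextRunAtMs <= now && !ran.has(j.id)) ran.add(j.id);
  }
}

function tick(
  jobs: DueJob[],
  now: number,
  recompute: (jobs: DueJob[]) => void,
): Set<string> {
  const ran = new Set<string>();
  runDueJobs(jobs, now, ran);  // first pass: persisted values
  recompute(jobs);             // may make stale jobs newly due
  runDueJobs(jobs, now, ran);  // second pass: catch newly due jobs
  return ran;
}
```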

Files Changed

  • src/cron/service/ops.ts: Added runOverdueJobsOnStartup() called from start()
  • src/cron/service/timer.ts: Re-arm on early return + second runDueJobs() after recompute
  • src/cron/service.catches-up-overdue-after-restart.test.ts: Tests for startup catch-up
  • src/cron/service.timer-rearm-and-due-race.test.ts: Tests for timer reliability

Testing

Tests cover:

  1. Daily cron job catches up after being missed during downtime
  2. Non-overdue jobs are NOT double-fired (regression guard)
  3. Recurring interval jobs fire across consecutive ticks
  4. Timer re-arms correctly after job execution

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 2 comments


Comment on lines +22 to +32
await ensureLoaded(state);
recomputeNextRuns(state);

// After recomputing, run any jobs that are already overdue.
// This handles the case where the gateway was restarted and jobs
// were missed while it was down. Without this, overdue jobs would
// only fire on the *next* timer tick, but armTimer computes delay
// from nextWakeAtMs which may skip past already-due jobs when
// recomputeNextRuns sets nextRunAtMs to a future slot.
await runOverdueJobsOnStartup(state);


Overdue scan after recompute

start() calls recomputeNextRuns(state) before runOverdueJobsOnStartup(), but recomputeNextRuns overwrites job.state.nextRunAtMs using computeJobNextRunAtMs(job, now) (typically advancing cron/every schedules to the next future slot). As a result, runOverdueJobsOnStartup()’s filter (now >= j.state.nextRunAtMs) will usually see no overdue jobs, so the intended catch-up won’t run for the missed occurrences after restart. This also contradicts the existing timer tick design which reloads with skipRecompute specifically so runDueJobs can observe persisted nextRunAtMs values (src/cron/service/timer.ts:44-49).
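The ordering fix the reviewer is asking for is to snapshot overdue job ids while `nextRunAtMs` still holds the persisted store value, before `recomputeNextRuns()` overwrites it. A later revision of this PR names this helper `collectOverdueJobIds`; the body below is a hedged sketch of that idea, with `CronJob` as a stand-in type:

```typescript
// Hypothetical sketch: snapshot overdue ids using the *persisted*
// nextRunAtMs values, i.e. this must run BEFORE recomputeNextRuns()
// advances each schedule to its next future slot.
type CronJob = { id: string; state: { nextRunAtMs: number } };

function collectOverdueJobIds(jobs: CronJob[], now: number): string[] {
  return jobs.filter((j) => j.state.nextRunAtMs <= now).map((j) => j.id);
}
```

With the snapshot taken first, `start()` can then recompute and execute only the snapshotted ids, so the catch-up is no longer defeated by the recompute.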


Comment on lines +46 to +55
/**
* On startup, check for jobs whose persisted `nextRunAtMs` (from the
* store file, before recompute) was in the past — meaning they should
* have fired while the gateway was down. For each such job, execute
* it once (catch-up) and then let the normal recompute advance to the
* next future slot.
*
* We only catch up **once per job** (not all missed occurrences) to
* avoid flooding after a long outage.
*/

Doc/comment mismatches behavior

The docstring says catch-up uses the persisted nextRunAtMs “from the store file, before recompute”, but the implementation reads j.state.nextRunAtMs after recomputeNextRuns() has already run (so it’s no longer the persisted value). As written, this will not reliably detect “missed while down” jobs for cron/every schedules.



izzzzzi commented Feb 6, 2026

Related: #10401 describes a potentially deeper issue where periodic cron jobs never fire at all (even without restarts). The timer advances nextRunAtMs but the job is never executed.

This PR's catch-up mechanism would help recover missed jobs on startup, but #10401 may require an additional fix in the timer tick path (onTimer → runDueJobs), where the race between recomputeNextRuns and the due-time check could cause the scheduler to skip jobs that become due during the tick.

@izzzzzi izzzzzi force-pushed the fix/cron-catchup-after-restart branch from 42fd4f9 to 0baf503 on February 6, 2026 14:20
…er on early return

Two fixes for cron scheduler reliability:

1. **Catch up overdue jobs after restart (openclaw#10389)**: When the gateway
   restarts, jobs that should have fired during downtime were silently
   skipped.  On startup, ops.start() now loads the store with
   skipRecompute, snapshots which jobs have nextRunAtMs in the past,
   then recomputes and executes the overdue ones (one catch-up per job).

2. **Re-arm timer on early return (openclaw#10401)**: When onTimer() was called
   while a previous tick was still running, it returned without
   re-arming, which could permanently stall the scheduler.  Now calls
   armTimer() before the early return.

Also adds a second runDueJobs() call after recomputeNextRuns() in the
timer tick to catch jobs that become due after recompute (e.g. after
external edits to jobs.json).

Tests:
- Verifies overdue jobs are caught up on restart
- Verifies non-overdue jobs are NOT caught up (regression guard)
- Both scenarios tested in a single test to avoid fake-timer leaks

Fixes openclaw#10389
Fixes openclaw#10401
@izzzzzi izzzzzi force-pushed the fix/cron-catchup-after-restart branch from 0baf503 to 490c7f9 on February 6, 2026 17:13

@acastellana acastellana left a comment


Reviewed the approach; it looks correct. The key insight of snapshotting overdue jobs BEFORE recompute is solid. The once-per-job catch-up (not all missed runs) is the right trade-off to avoid flooding after long outages. Tests cover the main scenarios well.

Minor suggestion for future: could add a configurable limit on catch-up jobs at startup, but not blocking.

Tested the same issue locally and this matches the fix pattern that works. 👍

Cherwayway pushed a commit to Cherwayway/clawdbot that referenced this pull request Feb 7, 2026
…ompute

Fixes openclaw#10653

## Problem
`every` and `cron` type jobs never fire because:
1. Timer fires at T+ε (slightly late or early due to JS timer precision)
2. `recomputeNextRuns()` advances `nextRunAtMs` to next interval
3. `runDueJobs()` checks `now >= nextRunAtMs` which is now false
4. Job is perpetually pushed forward

## Solution
1. Add 2-second tolerance in `runDueJobs()`: `now >= next - 2000`
2. Skip recomputing jobs that are already due (let runDueJobs handle them)
3. Re-arm timer on early return when previous tick is running
4. Run `runDueJobs()` twice: before and after recompute
5. Catch up overdue jobs on gateway startup
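The tolerance check in step 1 can be sketched as follows. The names mirror the commit text (`DUE_TOLERANCE_MS`), but this is an illustrative stand-in, not the actual `timer.ts` code:

```typescript
// Hypothetical sketch of the 2-second due tolerance.
const DUE_TOLERANCE_MS = 2000;

// A job counts as due if the timer fired within 2s before its slot,
// so JS timer jitter can no longer push it perpetually into the future.
function isDue(nextRunAtMs: number, now: number): boolean {
  return now >= nextRunAtMs - DUE_TOLERANCE_MS;
}
```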

## Changes
- `timer.ts`: Add `DUE_TOLERANCE_MS`, re-arm on early return, double runDueJobs
- `jobs.ts`: Skip due jobs in `recomputeNextRuns()`
- `ops.ts`: Collect and run overdue jobs on startup (from PR openclaw#10412)

Tested with 2-minute interval jobs - now fires reliably.

izzzzzi commented Feb 7, 2026

Closing — the fixes in this PR (overdue job catch-up on restart, timer re-arm on early return, and post-recompute due-job sweep) have been fully addressed by #10776, which landed a more comprehensive cron reliability overhaul covering the same root causes.

Thanks to @tyler6204 for the thorough fix! Our issue reports (#10389, #10045) and analysis helped shape the solution. 🎉

@izzzzzi izzzzzi closed this Feb 7, 2026
aneym added a commit to aneym/openclaw that referenced this pull request Feb 13, 2026
Upstream: 141 commits merged from openclaw/openclaw origin/main.

Key fixes included:
- fix(cron): prevent recomputeNextRuns from skipping due jobs (openclaw#9823)
- fix(cron): re-arm timer in finally to survive transient errors (openclaw#9948)
- fix(cron): handle legacy atMs field in schedule (openclaw#9932)
- fix cron scheduling and reminder delivery regressions (openclaw#9733)

Local additions (ported from upstream PR openclaw#10412, not yet merged):
- Catch up overdue jobs after gateway restart (collectOverdueJobIds)
- Re-arm timer on early return when state.running
- Second runDueJobs after recompute to catch races

Conflict resolution:
- Core/gateway: took upstream (bug fixes, features)
- UI: took upstream (webchat improvements)
- Build: took upstream (deps, config)
- Fixed duplicate AgentsFiles* exports in protocol schema types
