fix(cron): catch up overdue jobs after gateway restart #10412
izzzzzi wants to merge 1 commit into openclaw:main
Conversation
src/cron/service/ops.ts
Outdated
```ts
await ensureLoaded(state);
recomputeNextRuns(state);

// After recomputing, run any jobs that are already overdue.
// This handles the case where the gateway was restarted and jobs
// were missed while it was down. Without this, overdue jobs would
// only fire on the *next* timer tick, but armTimer computes delay
// from nextWakeAtMs which may skip past already-due jobs when
// recomputeNextRuns sets nextRunAtMs to a future slot.
await runOverdueJobsOnStartup(state);
```
**Overdue scan after recompute**
start() calls recomputeNextRuns(state) before runOverdueJobsOnStartup(), but recomputeNextRuns overwrites job.state.nextRunAtMs using computeJobNextRunAtMs(job, now) (typically advancing cron/every schedules to the next future slot). As a result, runOverdueJobsOnStartup()’s filter (now >= j.state.nextRunAtMs) will usually see no overdue jobs, so the intended catch-up won’t run for the missed occurrences after restart. This also contradicts the existing timer tick design which reloads with skipRecompute specifically so runDueJobs can observe persisted nextRunAtMs values (src/cron/service/timer.ts:44-49).
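A minimal sketch of the ordering problem the reviewer describes, using hypothetical stand-in names (`collectOverdueJobIds` and a simplified `recomputeNextRuns`, not the actual implementations): the overdue snapshot must be taken from the persisted `nextRunAtMs` values *before* recompute advances them, or the scan finds nothing.

```typescript
type Job = { id: string; state: { nextRunAtMs: number } };

// Snapshot overdue jobs while persisted nextRunAtMs values are intact.
function collectOverdueJobIds(jobs: Job[], now: number): string[] {
  return jobs.filter((j) => now >= j.state.nextRunAtMs).map((j) => j.id);
}

// Stand-in for computeJobNextRunAtMs: advance to the next future slot.
function recomputeNextRuns(jobs: Job[], now: number, intervalMs: number): void {
  for (const j of jobs) {
    while (j.state.nextRunAtMs <= now) j.state.nextRunAtMs += intervalMs;
  }
}

const now = 1_000_000;
const jobs: Job[] = [
  { id: "overdue", state: { nextRunAtMs: now - 5_000 } },
  { id: "future", state: { nextRunAtMs: now + 60_000 } },
];

// Correct order: snapshot first, then recompute.
const overdue = collectOverdueJobIds(jobs, now); // ["overdue"]
recomputeNextRuns(jobs, now, 60_000);
// Scanning only after recompute would miss the restart catch-up entirely:
const afterRecompute = collectOverdueJobIds(jobs, now); // []
console.log(overdue, afterRecompute);
```

This mirrors why the timer tick reloads with `skipRecompute`: the persisted values are the only evidence that a run was missed.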
```ts
/**
 * On startup, check for jobs whose persisted `nextRunAtMs` (from the
 * store file, before recompute) was in the past — meaning they should
 * have fired while the gateway was down. For each such job, execute
 * it once (catch-up) and then let the normal recompute advance to the
 * next future slot.
 *
 * We only catch up **once per job** (not all missed occurrences) to
 * avoid flooding after a long outage.
 */
```
**Doc/comment mismatches behavior**
The docstring says catch-up uses the persisted nextRunAtMs “from the store file, before recompute”, but the implementation reads j.state.nextRunAtMs after recomputeNextRuns() has already run (so it’s no longer the persisted value). As written, this will not reliably detect “missed while down” jobs for cron/every schedules.
Related: #10401 describes a potentially deeper issue where periodic cron jobs never fire at all (even without restarts). This PR's catch-up mechanism would help recover missed jobs on startup, but #10401 may require an additional fix in the timer tick path.
Force-pushed 42fd4f9 to 0baf503
…er on early return

Two fixes for cron scheduler reliability:

1. **Catch up overdue jobs after restart (openclaw#10389)**: When the gateway restarts, jobs that should have fired during downtime were silently skipped. On startup, `ops.start()` now loads the store with `skipRecompute`, snapshots which jobs have `nextRunAtMs` in the past, then recomputes and executes the overdue ones (one catch-up per job).
2. **Re-arm timer on early return (openclaw#10401)**: When `onTimer()` was called while a previous tick was still running, it returned without re-arming, which could permanently stall the scheduler. Now calls `armTimer()` before the early return. Also adds a second `runDueJobs()` call after `recomputeNextRuns()` in the timer tick to catch jobs that become due after recompute (e.g. after external edits to jobs.json).

Tests:
- Verifies overdue jobs are caught up on restart
- Verifies non-overdue jobs are NOT caught up (regression guard)
- Both scenarios tested in a single test to avoid fake-timer leaks

Fixes openclaw#10389
Fixes openclaw#10401
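The "re-arm on early return" fix from the commit above can be sketched as follows. This is a simplified illustration with assumed shapes (`TimerState`, a bare `armTimer`), not the actual `src/cron/service/timer.ts` code: the point is that the early-return path must leave a pending timer behind.

```typescript
type TimerState = { running: boolean; timer?: ReturnType<typeof setTimeout> };

// Schedule (or reschedule) the next tick.
function armTimer(state: TimerState, delayMs: number, tick: () => void): void {
  if (state.timer) clearTimeout(state.timer);
  state.timer = setTimeout(tick, delayMs);
}

function onTimer(state: TimerState, delayMs: number, tick: () => void): string {
  if (state.running) {
    // Before the fix this returned with no pending timer, so nothing
    // would ever fire again until an external event re-armed it.
    armTimer(state, delayMs, tick);
    return "rearmed";
  }
  state.running = true;
  try {
    // ... runDueJobs / recomputeNextRuns would go here ...
    return "ran";
  } finally {
    state.running = false;
    armTimer(state, delayMs, tick);
  }
}
```

Either way `onTimer` exits, `state.timer` holds a pending timeout, so the scheduler cannot stall.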
Force-pushed 0baf503 to 490c7f9
acastellana left a comment
Reviewed the approach - looks correct. Key insight of snapshotting overdue jobs BEFORE recompute is solid. The once-per-job catch-up (not all missed runs) is the right trade-off to avoid flooding after long outages. Tests cover the main scenarios well.
Minor suggestion for future: could add a configurable limit on catch-up jobs at startup, but not blocking.
Tested the same issue locally and this matches the fix pattern that works. 👍
…ompute

Fixes openclaw#10653

## Problem

`every` and `cron` type jobs never fire because:

1. Timer fires at T+ε (slightly late or early due to JS timer precision)
2. `recomputeNextRuns()` advances `nextRunAtMs` to next interval
3. `runDueJobs()` checks `now >= nextRunAtMs` which is now false
4. Job is perpetually pushed forward

## Solution

1. Add 2-second tolerance in `runDueJobs()`: `now >= next - 2000`
2. Skip recomputing jobs that are already due (let runDueJobs handle them)
3. Re-arm timer on early return when previous tick is running
4. Run `runDueJobs()` twice: before and after recompute
5. Catch up overdue jobs on gateway startup

## Changes

- `timer.ts`: Add `DUE_TOLERANCE_MS`, re-arm on early return, double runDueJobs
- `jobs.ts`: Skip due jobs in `recomputeNextRuns()`
- `ops.ts`: Collect and run overdue jobs on startup (from PR openclaw#10412)

Tested with 2-minute interval jobs - now fires reliably.
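The due-tolerance part of the commit above can be illustrated with a small sketch. The constant name `DUE_TOLERANCE_MS` comes from the commit message; the `isDue` helper is a hypothetical reduction of the check inside `runDueJobs()`: a job whose `nextRunAtMs` is within the tolerance window of "now" still counts as due, so sub-second timer jitter cannot perpetually push it forward.

```typescript
// 2-second tolerance, per the commit message above.
const DUE_TOLERANCE_MS = 2_000;

// A job is due if "now" has reached its slot, allowing for timer jitter.
function isDue(nextRunAtMs: number, now: number): boolean {
  return now >= nextRunAtMs - DUE_TOLERANCE_MS;
}

// Timer fired 500 ms early due to JS timer precision:
console.log(isDue(10_000, 9_500)); // true — within tolerance, job fires
console.log(isDue(10_000, 7_000)); // false — genuinely not due yet
```

Without the tolerance, the early tick would see `9_500 >= 10_000` as false, recompute would advance the slot, and the cycle would repeat on every tick.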
Closing — the fixes in this PR (overdue job catch-up on restart, timer re-arm on early return, and post-recompute due-job sweep) have been fully addressed by #10776, which landed a more comprehensive cron reliability overhaul covering the same root causes. Thanks to @tyler6204 for the thorough fix! Our issue reports (#10389, #10045) and analysis helped shape the solution. 🎉
Upstream: 141 commits merged from openclaw/openclaw origin/main.

Key fixes included:
- fix(cron): prevent recomputeNextRuns from skipping due jobs (openclaw#9823)
- fix(cron): re-arm timer in finally to survive transient errors (openclaw#9948)
- fix(cron): handle legacy atMs field in schedule (openclaw#9932)
- fix cron scheduling and reminder delivery regressions (openclaw#9733)

Local additions (ported from upstream PR openclaw#10412, not yet merged):
- Catch up overdue jobs after gateway restart (collectOverdueJobIds)
- Re-arm timer on early return when state.running
- Second runDueJobs after recompute to catch races

Conflict resolution:
- Core/gateway: took upstream (bug fixes, features)
- UI: took upstream (webchat improvements)
- Build: took upstream (deps, config)
- Fixed duplicate AgentsFiles* exports in protocol schema types
Summary
Fixes #10389, Fixes #10401, Fixes #10045
Three fixes for cron scheduler reliability:
Fix 1: Catch up overdue jobs after gateway restart (#10389, #10045)
After a gateway restart, cron jobs that should have fired while the gateway was down were silently skipped. The scheduler's `recomputeNextRuns()` advanced `nextRunAtMs` to the next future slot without executing the missed run.

Change: Added `runOverdueJobsOnStartup()` in `ops.start()` — after `recomputeNextRuns()`, it scans for jobs with `nextRunAtMs <= now` and executes them once. Only one missed occurrence per job is caught up to avoid flooding after a long outage.

Fix 2: Re-arm timer on early return (#10401)
When `onTimer()` was called while a previous tick was still running (`state.running = true`), it returned early without re-arming the timer. This could permanently stall the scheduler — no future jobs would fire until the next external event (e.g. job add/update) called `armTimer()`.

Change: Call `armTimer()` before the early return so the scheduler always has a pending timer.

Fix 3: Run due jobs after recompute (#10401)
In `onTimer()`, after `runDueJobs()` executes with persisted (`skipRecompute`) `nextRunAtMs` values, `recomputeNextRuns()` recalculates all `nextRunAtMs` from the current time. If the persisted values were stale (e.g. after a `config.patch` or external `jobs.json` edit), some jobs could become newly due after recompute but never get executed until the next timer tick — effectively skipping them.

Change: Call `runDueJobs()` a second time after `recomputeNextRuns()` to catch any jobs that became due after the recompute.

Files Changed
- `src/cron/service/ops.ts`: Added `runOverdueJobsOnStartup()` called from `start()`
- `src/cron/service/timer.ts`: Re-arm on early return + second `runDueJobs()` after recompute
- `src/cron/service.catches-up-overdue-after-restart.test.ts`: Tests for startup catch-up
- `src/cron/service.timer-rearm-and-due-race.test.ts`: Tests for timer reliability

Testing
Tests cover: