fix(cron): prevent jobs from skipping runs after timer fires#9589
fix(cron): prevent jobs from skipping runs after timer fires#9589rodrigouroz wants to merge 3 commits intoopenclaw:mainfrom
Conversation
When jobs have nextRunAtMs in the past (due to missed runs while gateway was down), the scheduler could get stuck. This adds: 1. Defensive check in recomputeNextRuns: if computed nextRunAtMs is still in the past, log a warning and force recalculation from now+1s to ensure we get a future slot. 2. Info logging in armTimer when next wake time is in the past, to help debug any remaining edge cases. Fixes openclaw#8424
When the cron timer fired, onTimer() would call ensureLoaded() which recomputed nextRunAtMs BEFORE runDueJobs() checked for due jobs. This caused jobs to skip their scheduled run: 1. Timer fires at 06:00 (job scheduled for 06:00) 2. ensureLoaded() recomputes nextRunAtMs from current time (06:00) 3. Croner returns next occurrence: tomorrow 06:00 4. runDueJobs() sees nextRunAtMs (tomorrow) > now → skips execution Fix: Skip recomputeNextRuns in ensureLoaded when called from onTimer, then recompute AFTER runDueJobs executes the due jobs. Fixes openclaw#8424
src/cron/service/timer.ts
Outdated
| // Now recompute next runs for jobs that weren't executed (e.g., disabled). | ||
| recomputeNextRuns(state); | ||
| await persist(state); |
There was a problem hiding this comment.
Incorrect nextRun recompute
After runDueJobs() you call recomputeNextRuns(state) unconditionally (src/cron/service/timer.ts:57-59). This will recompute nextRunAtMs based on now for all enabled jobs, including ones that just executed and already set nextRunAtMs in finish() / finally (e.g. src/cron/service/timer.ts:117-120 and :229-233). If computeNextRunAtMs treats its nowMs parameter as inclusive, this can re-select the same slot that just ran (or otherwise regress the schedule), causing immediate re-runs on the next tick.
Suggestion: avoid recomputing for jobs that executed in this tick (track executed IDs, or only recompute for jobs where nextRunAtMs is missing/invalid/disabled), or recompute using a monotonic reference time (e.g. endedAt already computed per job).
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 57:59
Comment:
**Incorrect nextRun recompute**
After `runDueJobs()` you call `recomputeNextRuns(state)` unconditionally (`src/cron/service/timer.ts:57-59`). This will recompute `nextRunAtMs` based on `now` for *all* enabled jobs, including ones that just executed and already set `nextRunAtMs` in `finish()` / `finally` (e.g. `src/cron/service/timer.ts:117-120` and `:229-233`). If `computeNextRunAtMs` treats its `nowMs` parameter as inclusive, this can re-select the same slot that just ran (or otherwise regress the schedule), causing immediate re-runs on the next tick.
Suggestion: avoid recomputing for jobs that executed in this tick (track executed IDs, or only recompute for jobs where `nextRunAtMs` is missing/invalid/disabled), or recompute using a monotonic reference time (e.g. `endedAt` already computed per job).
How can I resolve this? If you propose a fix, please make it concise.executeJob() already updates nextRunAtMs for executed jobs in finish()/finally, so calling recomputeNextRuns() afterward is redundant and could potentially re-select the same slot if croner treats nowMs as inclusive.
|
Ive got a PR fixing something similar here: #9573 |
|
Can we get this merged? This is a really critical bug that is really hampering my system. |
|
Wait my fix is here, the above one is a different PR for same issue. We need to unify the fix |
|
all 3 fixes are pretty different. Gotta love vibe coding lol |
|
i hope this get release very soon 🦞 |
|
I'm running with this pull request and a few other PRs in my preview branch: kayvan/preview branch is main + these PRs:
|
Summary
When the cron timer fired,
onTimer()would callensureLoaded()which recomputednextRunAtMsbeforerunDueJobs()checked for due jobs.This caused jobs to skip their scheduled run:
ensureLoaded()recomputesnextRunAtMsfrom current timerunDueJobs()seesnextRunAtMs(tomorrow) > now → skips executionFix
Skip
recomputeNextRunsinensureLoadedwhen called fromonTimer, then recompute afterrunDueJobsexecutes the due jobs.Testing
Fixes #8424
Greptile Overview
Greptile Summary
This PR changes the cron timer tick flow to avoid skipping runs that become due exactly at the timer boundary. Specifically:
ensureLoaded()(src/cron/service/store.ts) gains an option (skipRecompute) to avoid callingrecomputeNextRuns()immediately after reloading the cron store.onTimer()(src/cron/service/timer.ts) now reloads the store withskipRecompute: true, runsrunDueJobs()against the persistednextRunAtMsvalues, then recomputes next runs and persists.recomputeNextRuns()(src/cron/service/jobs.ts) adds a defensive warning + second-pass recalculation ifcomputeJobNextRunAtMs()returns a timestamp still in the past.These changes fit into the existing cron service design where most API operations (
src/cron/service/ops.ts) work against an in-memory store, while the timer tick path force-reloads from disk to pick up external changes before deciding what is due to run.Confidence Score: 3/5
recomputeNextRuns()after executing due jobs can overwrite per-jobnextRunAtMsupdates and may cause re-running or schedule drift depending oncomputeNextRunAtMssemantics. Unable to run tests in this environment (no node/pnpm).