fix: prevent cron jobs from being skipped when timer fires#9833
fix: prevent cron jobs from being skipped when timer fires#9833christianhaberl wants to merge 1 commit intoopenclaw:mainfrom
Conversation
Root cause: When timer fires even 1ms late, ensureLoaded() calls recomputeNextRuns() which advances nextRunAtMs to the NEXT occurrence. Then runDueJobs() checks now >= nextRunAtMs → false → job skipped. Solution (based on openclaw#9661 analysis and PR openclaw#9684): 1. Add skipRecompute option to ensureLoaded() 2. In onTimer: load with skipRecompute=true → run due jobs → THEN recompute This preserves runtime reload functionality (Gemini review feedback) while fixing the timer race condition. Fixes: openclaw#9542 Related: openclaw#9661, openclaw#9684
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
| // Load without recomputing to preserve stored nextRunAtMs values | ||
| // This prevents the race condition where nextRunAtMs gets advanced | ||
| // to the next occurrence before runDueJobs can check if job is due | ||
| await ensureLoaded(state, { forceReload: true, skipRecompute: true }); | ||
| // Check and run due jobs using stored nextRunAtMs | ||
| await runDueJobs(state); | ||
| // Then recompute next runs for subsequent executions | ||
| recomputeNextRuns(state); |
There was a problem hiding this comment.
First tick can miss jobs
With ensureLoaded(..., skipRecompute: true) on the timer path, runDueJobs() relies on the persisted job.state.nextRunAtMs. If a store file contains jobs with state: {} (e.g. migrated/legacy stores) or otherwise missing nextRunAtMs, those jobs won’t be considered due on that timer tick and will only get a nextRunAtMs after recomputeNextRuns(state) runs later in the same tick, effectively delaying execution by at least one timer cycle. There are tests/fixtures that load jobs with state: {} (e.g. src/cron/service.store.migration.test.ts:41-63), so it seems realistic to encounter stores without nextRunAtMs.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 40:47
Comment:
**First tick can miss jobs**
With `ensureLoaded(..., skipRecompute: true)` on the timer path, `runDueJobs()` relies on the persisted `job.state.nextRunAtMs`. If a store file contains jobs with `state: {}` (e.g. migrated/legacy stores) or otherwise missing `nextRunAtMs`, those jobs won’t be considered due on that timer tick and will only get a `nextRunAtMs` after `recomputeNextRuns(state)` runs later in the same tick, effectively delaying execution by at least one timer cycle. There are tests/fixtures that load jobs with `state: {}` (e.g. `src/cron/service.store.migration.test.ts:41-63`), so it seems realistic to encounter stores without `nextRunAtMs`.
How can I resolve this? If you propose a fix, please make it concise.|
Closing as duplicate of #9823, which includes the same fix along with tests and passing CI. This PR had formatting issues and lacked test coverage. |
Problem
Cron jobs are systematically skipped when the timer fires even 1ms after the scheduled time due to a race condition in operation order.
Root cause: In
onTimer,ensureLoaded()callsrecomputeNextRuns()which advancesnextRunAtMsto the next occurrence beforerunDueJobs()can check if the job is due.Solution
Reorder operations in
onTimer:ensureLoaded(skipRecompute: true)— load without recomputingrunDueJobs()— check and run using storednextRunAtMsrecomputeNextRuns()— THEN recompute for next cyclepersist()+armTimer()This preserves runtime reload functionality (new/modified jobs still get scheduled correctly) while fixing the timer race condition.
Why This PR?
PR #9684 addresses the same issue but includes unrelated changes (Ollama baseUrl defaults, schema changes, numeric string parsing). This PR is focused solely on the cron fix for easier review and reduced regression risk.
Testing
Environment: Raspberry Pi 5 running OpenClaw (hackable install)
Branch:
fix/cron-timer-skip-preventionTests Performed
1. Basic Cron (
*/2 * * * *systemEvent) — 11 consecutive runs over 22 minutes2. Isolated agentTurn Cron with Haiku — 6 runs
3. Heartbeat (2 min interval) — 9 runs concurrent with crons
4. Complex Cron (curl + conditional output) — Skip conditions tested
Key Findings
kind: "cron") now fire reliably without being skippedkind: "every"(interval-based) jobs reset theirnextRunAtMson gateway restart, potentially delaying execution. This is separate from the timer-skip bug but worth noting.Related Issues
Fixes #9661
Fixes #9542
Related: #9084, #8424, #8298
Reviewed by: Gemini Code Assist ✅
Greptile Overview
Greptile Summary
This PR changes cron timer processing order to avoid skipping cron jobs when the Node timer fires slightly after the scheduled time. It adds a
skipRecomputeoption toensureLoaded()soonTimer()can load the persisted store without advancingnextRunAtMs, then runs due jobs using the storednextRunAtMs, and only afterward recomputes/persists the next run times and re-arms the timer.Confidence Score: 3/5
state.nextRunAtMsbeing present/valid; existing migration tests show stores withstate: {}which would cause due jobs to be skipped for at least one tick.(2/5) Greptile learns from your feedback when you react with thumbs up/down!