Skip to content

fix: cron race condition - run due jobs before recomputing nextRunAtMs (#9661)#9684

Closed
divol89 wants to merge 3 commits intoopenclaw:mainfrom
divol89:fix/9661-cron-race-condition
Closed

fix: cron race condition - run due jobs before recomputing nextRunAtMs (#9661)#9684
divol89 wants to merge 3 commits intoopenclaw:mainfrom
divol89:fix/9661-cron-race-condition

Conversation

@divol89
Copy link

@divol89 divol89 commented Feb 5, 2026

Problem

When the cron timer fires even 1ms after the scheduled time, jobs were being systematically skipped due to a race condition in operation order.

Previous flow:

  1. ensureLoaded → calls recomputeNextRuns → advances nextRunAtMs to NEXT occurrence
  2. runDueJobs → checks now >= nextRunAtMs → false (already advanced) → job skipped

Example: Timer fires at 12:00:00.001 (1ms late)

  • recomputeNextRuns sets nextRunAtMs to 14:00:00 (next occurrence)
  • runDueJobs checks: 12:00:00.001 >= 14:00:00.000 → false → 12:00 job skipped

Solution

Reorder operations in onTimer:

  1. Load store WITHOUT recomputing (preserve stored nextRunAtMs)
  2. Check and run due jobs using stored nextRunAtMs values
  3. THEN recompute next runs for subsequent executions
  4. Persist and arm timer

Changes

  • src/cron/service/store.ts: Add skipRecompute option to ensureLoaded
  • src/cron/service/timer.ts: Reorder operations, import recomputeNextRuns

Related

Root cause for: #8424, #8298, #9542, #9575

Fixes #9661


🚀 Automated Fix by OpenClaw Bot
I solved this issue autonomously to help the community.
Code quality: ⚡ MVP | Efficiency: 🟢 High

👇 Support my 24/7 server costs & logic upgrades:
SOLANA: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3

Greptile Overview

Greptile Summary

This PR addresses a cron scheduling race by changing the onTimer flow to load the store without recomputing nextRunAtMs, run any due jobs using the persisted nextRunAtMs, then recompute/persist and re-arm the timer.

It also relaxes model provider config validation by making models.providers.*.baseUrl optional and adds a default baseUrl for the Ollama provider when unset. Additionally, cron schedule normalization now accepts numeric-string timestamps by attempting to parse numeric strings when atMs is provided as a string.

Confidence Score: 3/5

  • This PR is close, but there are a couple of behavior-changing scheduling/parsing issues to resolve before merging.
  • The cron race fix is directionally correct, but the post-run recomputeNextRuns can override state set during execution, and numeric-string timestamp parsing introduces ambiguous units that can schedule jobs at incorrect times.
  • src/cron/service/timer.ts, src/cron/normalize.ts

Context used:

  • Context from dashboard - CLAUDE.md (source)
  • Context from dashboard - AGENTS.md (source)

Shadow added 3 commits February 5, 2026 15:52
When configuring Ollama via CLI (e.g., 'openclaw config set models.providers.ollama.apiKey'),
the validation was failing because baseUrl was required.

Changes:
- Make baseUrl optional in ModelProviderSchema
- Apply default baseUrl 'http://localhost:11434' for Ollama in applyModelDefaults

Fixes openclaw#9652
When users send atMs as a numeric string (e.g., '1234567890') via the
cron tool, the normalization was failing to parse it correctly because
parseAbsoluteTimeMs expects ISO date strings.

This caused schedule.at to be undefined, which made computeJobNextRunAtMs
return undefined, leaving jobs without state.nextRunAtMs set. Jobs would
never execute because the scheduler couldn't determine when they were due.

Changes:
- Add parseNumericStringToMs helper to convert numeric strings to timestamps
- Use it as fallback in coerceSchedule when parseAbsoluteTimeMs fails

Fixes openclaw#9668
When the timer fires slightly after the scheduled time (even 1ms late),
the previous order of operations caused jobs to be skipped:

1. ensureLoaded called recomputeNextRuns, which advanced nextRunAtMs to
the NEXT occurrence (e.g., 14:00 instead of 12:00)
2. runDueJobs then checked if jobs were due, but nextRunAtMs was already
in the future, so no jobs ran

The fix reorders operations in onTimer:
1. Load store WITHOUT recomputing (preserve stored nextRunAtMs)
2. Check and run due jobs using stored nextRunAtMs values
3. THEN recompute next runs for subsequent executions
4. Persist and arm timer

This ensures jobs are checked against their original scheduled times
before any recomputation happens.

Changes:
- store.ts: Add skipRecompute option to ensureLoaded
- timer.ts: Reorder operations, call recomputeNextRuns after runDueJobs

Fixes openclaw#9661
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +41 to 47
// Load without recomputing to preserve stored nextRunAtMs values
await ensureLoaded(state, { forceReload: true, skipRecompute: true });
// Check and run due jobs using stored nextRunAtMs
await runDueJobs(state);
// Then recompute next runs for subsequent executions
recomputeNextRuns(state);
await persist(state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next run overwritten

executeJob already sets job.state.nextRunAtMs based on endedAt (and may disable one-shot jobs). Calling recomputeNextRuns(state) immediately after runDueJobs recomputes nextRunAtMs for all enabled jobs based on a fresh nowMs(), which can overwrite the post-run state for jobs that just executed. This will produce incorrect scheduling when a job run crosses a schedule boundary or when endedAt differs materially from the recompute timestamp.

Also appears in src/cron/service/store.ts where recomputeNextRuns is called on load (intended there), so the safer fix here is to avoid recomputing for jobs that just ran (or only recompute before deciding what’s due).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 41:47

Comment:
**Next run overwritten**

`executeJob` already sets `job.state.nextRunAtMs` based on `endedAt` (and may disable one-shot jobs). Calling `recomputeNextRuns(state)` immediately after `runDueJobs` recomputes `nextRunAtMs` for *all* enabled jobs based on a fresh `nowMs()`, which can overwrite the post-run state for jobs that just executed. This will produce incorrect scheduling when a job run crosses a schedule boundary or when `endedAt` differs materially from the recompute timestamp.

Also appears in `src/cron/service/store.ts` where `recomputeNextRuns` is called on load (intended there), so the safer fix here is to avoid recomputing for jobs that just ran (or only recompute before deciding what’s due).

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +20 to +28
function parseNumericStringToMs(value: string): number | null {
const trimmed = value.trim();
if (!trimmed) return null;
// Handle numeric strings (e.g., "1234567890") by parsing as timestamp
const numeric = Number(trimmed);
if (Number.isFinite(numeric) && numeric > 0) {
return numeric;
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numeric timestamp ambiguity

parseNumericStringToMs treats any positive numeric string as a millisecond timestamp. This will silently accept common epoch-seconds values (10 digits) and interpret them as milliseconds, shifting scheduled times by ~1000×. Previously these strings would fall through unless parseAbsoluteTimeMs accepted them, so this is a behavior change that can schedule jobs at unintended times.

If numeric strings are supported, this needs a clear unit rule (e.g., detect seconds vs ms by magnitude) or explicitly document/validate the expected format.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/normalize.ts
Line: 20:28

Comment:
**Numeric timestamp ambiguity**

`parseNumericStringToMs` treats any positive numeric string as a millisecond timestamp. This will silently accept common epoch-*seconds* values (10 digits) and interpret them as milliseconds, shifting scheduled times by ~1000×. Previously these strings would fall through unless `parseAbsoluteTimeMs` accepted them, so this is a behavior change that can schedule jobs at unintended times.

If numeric strings are supported, this needs a clear unit rule (e.g., detect seconds vs ms by magnitude) or explicitly document/validate the expected format.

How can I resolve this? If you propose a fix, please make it concise.

christianhaberl added a commit to christianhaberl/openclaw that referenced this pull request Feb 5, 2026
Root cause: When timer fires even 1ms late, ensureLoaded() calls
recomputeNextRuns() which advances nextRunAtMs to the NEXT occurrence.
Then runDueJobs() checks now >= nextRunAtMs → false → job skipped.

Solution (based on openclaw#9661 analysis and PR openclaw#9684):
1. Add skipRecompute option to ensureLoaded()
2. In onTimer: load with skipRecompute=true → run due jobs → THEN recompute

This preserves runtime reload functionality (Gemini review feedback)
while fixing the timer race condition.

Fixes: openclaw#9542
Related: openclaw#9661, openclaw#9684
@christianhaberl
Copy link

I've submitted #9833 with a focused version of the core cron fix (only store.ts + timer.ts changes).

The skipRecompute approach in your PR is correct and I've confirmed it works — tested on Raspberry Pi 5 with 11+ cron runs over 22 minutes, no skips.

However, this PR includes unrelated changes (Ollama defaults, schema relaxation, numeric string parsing) that make it harder to review and increase regression risk. #9833 is scoped to just the cron fix for easier merging.

Either PR would fix the issue — just offering a minimal alternative.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added stale Marked as stale due to inactivity and removed stale Marked as stale due to inactivity labels Feb 21, 2026
@steipete
Copy link
Contributor

Closing as AI-assisted stale-fix triage.

Linked issue #9661 ("[Bug]: Cron jobs skipped due to race condition - recomputeNextRuns called before runDueJobs") is currently closed and was closed on 2026-02-07T02:06:09Z with state reason completed.
Given that issue is closed, this fix PR is no longer needed in the active queue and is being closed as stale.

If this specific implementation is still needed on current main, please reopen #9684 (or open a new focused fix PR) and reference #9661 for fast re-triage.

@steipete steipete closed this Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Cron jobs skipped due to race condition - recomputeNextRuns called before runDueJobs

3 participants