Description
Bug Description
Cron jobs can silently miss their scheduled fire time after multiple rapid gateway restarts (SIGUSR1 hot-reloads and/or SIGTERM cold restarts). The cron scheduler appears to miscalculate nextRunAtMs, jumping past the current day's scheduled window to the next valid time.
Steps to Reproduce
- Have a cron job scheduled (e.g., "30 6 * * 1-5": 6:30 AM PT on weekdays)
- Trigger multiple rapid SIGUSR1 hot-reloads in the hours before the cron is due (in our case, ~15 restarts between 10 PM and midnight, caused by sub-agents using gateway config.patch)
- Include at least one SIGTERM → full restart cycle in the sequence
- Wait for the cron's scheduled time
Expected Behavior
The cron job fires at its scheduled time (6:30 AM PT).
Actual Behavior
- The cron job does not fire
- nextRunAtMs is already set to the next valid occurrence (Monday, skipping Friday entirely), despite the gateway being up and stable for 7+ hours before the scheduled time
- No errors in logs; the job is silently skipped
- cron runs returns empty entries (no run history at all for the job)
- Manually triggering with cron run returns {"ran": false, "reason": "not-due"}
Environment
- OpenClaw version: 2026.2.3-1 (d84eb46)
- OS: macOS (Darwin 25.2.0 arm64)
- Node: v25.5.0
Timeline (from gateway.log, all UTC)
06:30:48 — SIGUSR1 config change reload (meta.lastTouchedAt, perplexity config)
06:30:50 — SIGUSR1 second reload
06:31:43 — SIGTERM full shutdown
07:17:06 — Gateway back up (new PID, stable from here)
... no restarts for 7+ hours ...
14:30:00 — Cron should fire (6:30 AM PT) — nothing happens
14:30:00 — nextRunAtMs already points to Monday 14:30 UTC
The gateway was fully stable on PID 81528 from 07:35 UTC onward. The cron window at 14:30 UTC was 7 hours into a clean run, but the scheduler had already decided that Friday's slot had passed.
Hypothesis
During one of the rapid restart cycles, the cron scheduler reinitializes and evaluates nextRunAtMs. If the restart happens during or just after the evaluation window for a scheduled time, the scheduler may:
- See the current minute has "already passed" (even if it's the restart second, not the actual cron time)
- Calculate nextRunAtMs as the next valid cron match (skipping to Monday for a weekday-only schedule)
- Persist this incorrect nextRunAtMs and never re-evaluate it
This means a restart at 10:30 PM could cause the scheduler to skip a 6:30 AM job the next morning if the internal state isn't properly reconciled on startup.
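The inconsistent state described above can be detected with a simple invariant: the persisted nextRunAtMs should never be later than the first schedule slot after the current time. The sketch below is illustrative only. nextWeekdaySlotUtc and findSkippedSlot are hypothetical helpers, not OpenClaw APIs; the schedule is hard-coded as 14:30 UTC on weekdays (the UTC equivalent from the timeline, ignoring DST); and the calendar dates are assumed purely for the example.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: first Mon-Fri occurrence of hour:minute (UTC)
// strictly after afterMs. Stands in for a real cron expression parser.
function nextWeekdaySlotUtc(afterMs: number, hour: number, minute: number): number {
  const d = new Date(afterMs);
  let slot = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate(), hour, minute);
  for (;;) {
    const day = new Date(slot).getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (slot > afterMs && day !== 0 && day !== 6) return slot;
    slot += DAY_MS; // UTC has no DST, so a fixed 24 h step is safe here
  }
}

// Returns the slot that the persisted value would skip, or null if the
// persisted value is consistent with the schedule.
function findSkippedSlot(
  nowMs: number,
  persistedNextRunAtMs: number,
  computeNext: (afterMs: number) => number,
): number | null {
  const expected = computeNext(nowMs);
  return expected < persistedNextRunAtMs ? expected : null;
}

// Assumed dates for illustration: Friday 2026-02-06 and Monday 2026-02-09.
const restartedAt = Date.UTC(2026, 1, 6, 7, 17, 6);  // gateway back up
const persistedNext = Date.UTC(2026, 1, 9, 14, 30);  // value observed on Monday
const skipped = findSkippedSlot(restartedAt, persistedNext, (t) => nextWeekdaySlotUtc(t, 14, 30));

console.log(skipped === null ? "consistent" : new Date(skipped).toISOString());
// prints "2026-02-06T14:30:00.000Z"
```

A check like this could run at startup or periodically and log a warning, which would at least turn the silent skip into a visible one.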
Suggested Fix
On gateway startup, the cron scheduler should:
- Recalculate nextRunAtMs from scratch based on the current time
- Not trust persisted nextRunAtMs values that were set during a previous process lifetime
- Consider adding a "catch-up" mechanism: if a job's nextRunAtMs is in the past but within a grace window (e.g., 5 minutes), fire it immediately
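A minimal sketch of what such startup reconciliation might look like follows. It is not the actual scheduler code: CronJobState, Decision, reconcileOnStartup and weekdaySlotAfter are hypothetical names, lastRunAtMs is an assumed field, and only nextRunAtMs comes from the report. The persisted nextRunAtMs is deliberately ignored; the next slot is recomputed from the later of the last real run and the start of the grace window.

```typescript
// Hypothetical job shape; only nextRunAtMs is a name taken from the report.
interface CronJobState {
  id: string;
  nextRunAtMs: number;        // persisted by a previous process lifetime
  lastRunAtMs: number | null; // assumed field: last time the job actually ran
}

type Decision =
  | { action: "fire-now"; missedSlotMs: number }
  | { action: "schedule"; nextRunAtMs: number };

const GRACE_WINDOW_MS = 5 * 60 * 1000; // the 5-minute example from the list above

// computeNext(afterMs) must return the first schedule slot strictly after afterMs.
function reconcileOnStartup(
  job: CronJobState,
  nowMs: number,
  computeNext: (afterMs: number) => number,
  graceMs: number = GRACE_WINDOW_MS,
): Decision {
  // Do not read job.nextRunAtMs: recompute from the later of the last real
  // run and the start of the grace window.
  const anchorMs = Math.max(job.lastRunAtMs ?? 0, nowMs - graceMs);
  const firstSlotMs = computeNext(anchorMs);
  if (firstSlotMs <= nowMs) {
    // A slot fell inside the grace window and has not run yet: catch up.
    return { action: "fire-now", missedSlotMs: firstSlotMs };
  }
  return { action: "schedule", nextRunAtMs: firstSlotMs };
}

// Hypothetical stand-in for a cron parser: weekdays at 14:30 UTC.
function weekdaySlotAfter(afterMs: number): number {
  const d = new Date(afterMs);
  let slot = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate(), 14, 30);
  for (;;) {
    const day = new Date(slot).getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (slot > afterMs && day !== 0 && day !== 6) return slot;
    slot += 24 * 60 * 60 * 1000;
  }
}

// Assumed dates for illustration: persisted value points at Monday 2026-02-09,
// gateway restarts on Friday 2026-02-06 at 07:17:06 UTC.
const staleJob: CronJobState = {
  id: "morning-job",
  nextRunAtMs: Date.UTC(2026, 1, 9, 14, 30),
  lastRunAtMs: null,
};
const decision = reconcileOnStartup(staleJob, Date.UTC(2026, 1, 6, 7, 17, 6), weekdaySlotAfter);
console.log(decision);
```

Anchoring on the last real run means a job that already fired inside the grace window is not fired a second time after a restart, while a slot older than the grace window is skipped rather than replayed.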
Workaround
Shifted cron to an off-round-number time (6:31 instead of 6:30) and added a rule preventing sub-agents from using gateway config.patch to reduce restart frequency.