Cron jobs can silently skip scheduled runs after rapid gateway restarts #10472

@ZakWinnick

Bug Description

Cron jobs can silently miss their scheduled fire time after multiple rapid gateway restarts (SIGUSR1 hot-reloads and/or SIGTERM cold restarts). The cron scheduler appears to miscalculate nextRunAtMs, jumping past the current day's scheduled window to the next valid time.

Steps to Reproduce

  1. Have a cron job scheduled (e.g., "30 6 * * 1-5" — 6:30 AM PT weekdays)
  2. Trigger multiple rapid SIGUSR1 hot-reloads in the hours before the cron is due (in our case, ~15 restarts between 10 PM and midnight, caused by sub-agents using gateway config.patch)
  3. Include at least one SIGTERM → full restart cycle in the sequence
  4. Wait for the cron's scheduled time

Expected Behavior

The cron job fires at its scheduled time (6:30 AM PT).

Actual Behavior

  • The cron job does not fire
  • nextRunAtMs is already set to the next valid occurrence (Monday, skipping Friday entirely), despite the gateway being up and stable for 7+ hours before the scheduled time
  • No errors in logs — the job is silently skipped
  • cron runs returns empty entries (no run history at all for the job)
  • Manually triggering with cron run returns {"ran": false, "reason": "not-due"}

Environment

  • OpenClaw version: 2026.2.3-1 (d84eb46)
  • OS: macOS (Darwin 25.2.0 arm64)
  • Node: v25.5.0

Timeline (from gateway.log, all UTC)

06:30:48 — SIGUSR1 config change reload (meta.lastTouchedAt, perplexity config)
06:30:50 — SIGUSR1 second reload
06:31:43 — SIGTERM full shutdown
07:17:06 — Gateway back up (new PID, stable from here)
           ... no restarts for 7+ hours ...
14:30:00 — Cron should fire (6:30 AM PT) — nothing happens
14:30:00 — nextRunAtMs already points to Monday 14:30 UTC

The gateway was fully stable on PID 81528 from 07:35 UTC onward. The cron window at 14:30 UTC was 7 hours into a clean run, but the scheduler had already decided that Friday's slot had passed.
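
To line the UTC timeline up with the Pacific times used elsewhere in this report, here is a small conversion sketch. The calendar date is an assumption (an arbitrary winter Friday), since the log excerpt above does not include one; only the time-of-day mapping matters.

```typescript
// Assumption: no calendar date is given in the log excerpt, so an arbitrary
// winter Friday (2026-02-06) is used here purely for illustration.
const pacificClock = new Intl.DateTimeFormat("en-GB", {
  timeZone: "America/Los_Angeles",
  hour: "2-digit",
  minute: "2-digit",
  hour12: false,
});

// Formats a UTC timestamp as HH:MM wall-clock time in Pacific Time.
function toPacific(utcMs: number): string {
  return pacificClock.format(new Date(utcMs));
}

const firstReload = toPacific(Date.UTC(2026, 1, 6, 6, 30, 48)); // 06:30:48 UTC
const cronSlot = toPacific(Date.UTC(2026, 1, 6, 14, 30, 0)); // 14:30:00 UTC
console.log(firstReload, cronSlot);
```

Under that assumption, the first reload falls in the late evening Pacific time of the previous day, and the 14:30 UTC slot is the 6:30 AM PT run.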

Hypothesis

During one of the rapid restart cycles, the cron scheduler reinitializes and evaluates nextRunAtMs. If the restart happens during or just after the evaluation window for a scheduled time, the scheduler may:

  1. See the current minute has "already passed" (even if it's the restart second, not the actual cron time)
  2. Calculate nextRunAtMs as the next valid cron match (skipping to Monday for a weekday-only schedule)
  3. Persist this incorrect nextRunAtMs and never re-evaluate it

This means a restart at 10:30 PM could cause the scheduler to skip a 6:30 AM job the next morning if the internal state isn't properly reconciled on startup.
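
The jump to Monday described above is what ordinary cron arithmetic produces whenever the reference time is treated as being past Friday's slot. The sketch below is not OpenClaw's scheduler: `WeekdaySchedule` and `nextRunAfter` are hypothetical names, the schedule is evaluated in UTC for simplicity (14:30 UTC standing in for 6:30 AM PT), and the date is an arbitrary Friday. It does not claim how the reference time ends up wrong during a restart, only what follows once it is.

```typescript
// Hypothetical schedule shape: fires at hour:minute UTC on the listed
// weekdays (0 = Sunday ... 6 = Saturday).
interface WeekdaySchedule {
  hour: number;
  minute: number;
  weekdays: number[];
}

const MINUTE_MS = 60000;

// Returns the first matching minute strictly after fromMs.
function nextRunAfter(s: WeekdaySchedule, fromMs: number): number {
  // Start at the next whole minute after fromMs.
  let t = Math.floor(fromMs / MINUTE_MS) * MINUTE_MS + MINUTE_MS;
  // Scan minute by minute for at most 8 days.
  for (let i = 0; i < 8 * 24 * 60; i++, t += MINUTE_MS) {
    const d = new Date(t);
    if (
      d.getUTCHours() === s.hour &&
      d.getUTCMinutes() === s.minute &&
      s.weekdays.indexOf(d.getUTCDay()) !== -1
    ) {
      return t;
    }
  }
  throw new Error("no matching time within 8 days");
}

const weekdayMorning: WeekdaySchedule = { hour: 14, minute: 30, weekdays: [1, 2, 3, 4, 5] };

// Reference time well before Friday's slot: the next run is Friday.
const fromBefore = nextRunAfter(weekdayMorning, Date.UTC(2026, 1, 6, 7, 17, 6));
// Reference time 48 seconds after the slot: the next run is Monday.
const fromAfter = nextRunAfter(weekdayMorning, Date.UTC(2026, 1, 6, 14, 30, 48));

console.log(new Date(fromBefore).toISOString());
console.log(new Date(fromAfter).toISOString());
```

If a value like `fromAfter` is persisted and never recomputed, Friday's run is lost even though the process is healthy at the scheduled time.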

Suggested Fix

On gateway startup, the cron scheduler should:

  1. Recalculate nextRunAtMs from scratch based on the current time
  2. Not trust persisted nextRunAtMs values that were set during a previous process lifetime
  3. Consider adding a "catch-up" mechanism — if a job's nextRunAtMs is in the past but within a grace window (e.g., 5 minutes), fire it immediately
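
A minimal sketch of what this startup reconciliation could look like. All names here (`JobState`, `reconcileOnStartup`, `NextAfter`) are hypothetical and are not taken from the OpenClaw codebase, and the example schedule is simplified to a daily 14:30 UTC slot. The persisted `nextRunAtMs` is deliberately ignored; the decision is derived only from the current time, the schedule, and the last recorded run.

```typescript
// Hypothetical: a schedule is modeled as "first slot strictly after fromMs".
type NextAfter = (fromMs: number) => number;

interface JobState {
  nextRunAtMs: number; // persisted by a previous process; not trusted here
  lastRunAtMs: number | null;
}

interface StartupDecision {
  fireNow: boolean;
  nextRunAtMs: number;
}

function reconcileOnStartup(
  state: JobState,
  nextAfter: NextAfter,
  nowMs: number,
  graceMs: number = 5 * 60000
): StartupDecision {
  // First slot after the start of the grace window.
  const candidate = nextAfter(nowMs - graceMs);
  // A slot is "missed" if it is already due and has not been run yet.
  const missed =
    candidate <= nowMs &&
    (state.lastRunAtMs === null || state.lastRunAtMs < candidate);
  // Always recompute the next run from the current time.
  return { fireNow: missed, nextRunAtMs: nextAfter(nowMs) };
}

// Simplified example schedule: every day at 14:30 UTC.
const DAY_MS = 86400000;
const SLOT_OFFSET_MS = (14 * 60 + 30) * 60000;
const dailyAt1430: NextAfter = (fromMs) => {
  const slot = Math.floor(fromMs / DAY_MS) * DAY_MS + SLOT_OFFSET_MS;
  return slot > fromMs ? slot : slot + DAY_MS;
};

// Stale state: nextRunAtMs points at Monday, last run was Thursday.
const stale: JobState = {
  nextRunAtMs: Date.UTC(2026, 1, 9, 14, 30),
  lastRunAtMs: Date.UTC(2026, 1, 5, 14, 30),
};

// Startup hours before the slot: nothing to catch up, next run is Friday.
const afterRestart = reconcileOnStartup(stale, dailyAt1430, Date.UTC(2026, 1, 6, 7, 17, 6));
// Startup two minutes after the slot: within the grace window, so fire now.
const lateStart = reconcileOnStartup(stale, dailyAt1430, Date.UTC(2026, 1, 6, 14, 32, 0));

console.log(afterRestart.fireNow, new Date(afterRestart.nextRunAtMs).toISOString());
console.log(lateStart.fireNow, new Date(lateStart.nextRunAtMs).toISOString());
```

Comparing against `lastRunAtMs` keeps the catch-up from firing the same slot twice if the process restarts again inside the grace window, provided the run is recorded before the next restart.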

Workaround

We shifted the cron to an off-round-number time (6:31 instead of 6:30) and added a rule preventing sub-agents from using gateway config.patch, to reduce restart frequency.
