Cron: every schedule jobs stop firing after repeated errors, no catch-up mechanism #10403

@mrz1836

Summary

every (interval) schedule jobs stopped firing for ~23 hours after encountering repeated LLM errors (rate limits, timeouts). The scheduler jumped nextRunAtMs far into the future instead of retrying on the normal interval.

Environment

  • OpenClaw version: 2026.2.3-1
  • OS: macOS 15.7.3 (arm64)
  • Node: 25.6.0

Steps to Reproduce

  1. Create an every schedule job (e.g., hourly):
{
  "schedule": {
    "kind": "every",
    "everyMs": 3600000
  }
}
  2. Let the job encounter multiple consecutive errors (rate limits, timeouts, connection errors)
  3. Observe that nextRunAtMs jumps far into the future (e.g., 24+ hours) instead of retrying on the next interval

Expected Behavior

  • After transient errors, the job should retry on the next scheduled interval (1 hour later), not jump 24+ hours ahead
  • Optional: A configurable "max catch-up" setting to handle missed runs after downtime
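
To make the expected behavior concrete, here is a minimal sketch of how the next run could be derived purely from the anchor and the interval, so that the number of failed runs never influences it. The function names (nextIntervalRun, catchUpCount) and the maxCatchUp parameter are hypothetical and not taken from OpenClaw's code; only anchorMs and everyMs correspond to fields mentioned in this issue.

```typescript
// Hypothetical helpers for illustration only; not OpenClaw's actual scheduler API.

// Next slot on the grid anchorMs + k * everyMs that lies strictly after nowMs.
// Error history is deliberately not an input, so failures cannot push it out.
function nextIntervalRun(anchorMs: number, everyMs: number, nowMs: number): number {
  if (everyMs <= 0) {
    throw new RangeError("everyMs must be positive");
  }
  if (nowMs < anchorMs) {
    return anchorMs; // schedule has not started yet
  }
  const elapsedIntervals = Math.floor((nowMs - anchorMs) / everyMs);
  return anchorMs + (elapsedIntervals + 1) * everyMs;
}

// Assumed "max catch-up" semantics: how many missed slots to replay after
// downtime, capped at maxCatchUp (a hypothetical setting).
function catchUpCount(
  lastRunAtMs: number,
  nowMs: number,
  everyMs: number,
  maxCatchUp: number
): number {
  const missed = Math.max(0, Math.floor((nowMs - lastRunAtMs) / everyMs));
  return Math.min(missed, maxCatchUp);
}
```

With an hourly interval, a job that last attempted a run at 09:00 and is evaluated at 09:05 would get 10:00 as its next slot under this sketch, regardless of whether the 09:00 attempt failed.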

Actual Behavior

  • After several errors around 09:00 EST on Feb 5, the hourly jobs didn't fire again until manually recreated on Feb 6
  • The nextRunAtMs was set to ~09:00 EST the next day, skipping ~23 hourly runs
  • Run history showed errors like:
    Error: All models failed (4): anthropic/claude-opus-4-5: LLM request timed out. (unknown) | anthropic/claude-sonnet-4-5: No available auth profile (rate_limit) | ...
    

Workaround

Delete and recreate the job with a fresh anchorMs to reset the schedule state:

openclaw cron remove <job-id>
openclaw cron add --schedule.kind=every --schedule.everyMs=3600000 --schedule.anchorMs=<recent-timestamp> ...

Additional Context

  • cron expression jobs (e.g., 0 7 * * *) were unaffected and continued running normally
  • Only every (interval) jobs exhibited this behavior
  • The gateway was running continuously during this period (not restarted until troubleshooting)

Suggested Fix

  1. After an error, calculate next run as max(now, lastRunAtMs) + everyMs rather than jumping to a much later time
  2. Consider a maxSkip or catchUp option for interval schedules
  3. Add logging when a job's next run is calculated to be significantly later than expected
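
Items 1 and 3 could look roughly like the sketch below. The function names and the toleranceFactor parameter are hypothetical; I have not checked how the scheduler computes nextRunAtMs internally, so treat this as an illustration of the suggestion rather than a patch.

```typescript
// Hypothetical sketch of suggestions 1 and 3; names are illustrative only.

// Suggestion 1: after an error, schedule exactly one interval after the
// later of "now" and the last run time.
function nextRunAfterError(nowMs: number, lastRunAtMs: number, everyMs: number): number {
  return Math.max(nowMs, lastRunAtMs) + everyMs;
}

// Suggestion 3: flag a computed next run that is much further away than the
// interval would explain. toleranceFactor is an assumed tuning knob.
function isSuspiciousNextRun(
  nextRunAtMs: number,
  nowMs: number,
  everyMs: number,
  toleranceFactor: number = 2
): boolean {
  return nextRunAtMs - nowMs > toleranceFactor * everyMs;
}

// Example wiring: warn instead of silently accepting a far-future run.
function checkAndWarn(jobId: string, nextRunAtMs: number, nowMs: number, everyMs: number): void {
  if (isSuspiciousNextRun(nextRunAtMs, nowMs, everyMs)) {
    const delayHours = (nextRunAtMs - nowMs) / 3600000;
    console.warn(`cron job ${jobId}: next run is ${delayHours.toFixed(1)}h away (everyMs=${everyMs})`);
  }
}
```

For the case in this report (hourly job, next run roughly 24 hours out), isSuspiciousNextRun would return true and the warning would have surfaced the problem in the logs.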
