Problem
When the gateway enters a restart loop (e.g., after a version downgrade), cron jobs can fire during a brief ~5-minute window, only to be killed by SIGTERM before they complete. No run record is written, and the next gateway instance schedules the job for the next occurrence — silently dropping the current one.
Observed Behavior
- Gateway restarts every ~5 minutes due to instability
- A cron job fires at its scheduled time (e.g., 15 19 * * 0)
- ~35 seconds later, SIGTERM kills the gateway mid-execution
- New gateway starts, sees scheduled time has passed, moves nextRunAtMs forward
- No run entry is written — the job is silently lost
- openclaw cron runs --id <job-id> shows 0 entries
Expected Behavior
The gateway should detect restart instability (e.g., >3 restarts in 15 minutes) and defer cron execution until the gateway has been stable for a minimum window (e.g., 5 minutes). This prevents jobs from firing in doomed windows.
Alternatively, the cron scheduler could write a "started" record before execution, so the next instance knows a job was interrupted and can retry.
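The deferral rule described above could look roughly like the sketch below. It is only an illustration: `StabilityConfig`, `shouldDeferCron`, and the idea of keeping a list of gateway start timestamps are hypothetical names and assumptions, not existing OpenClaw APIs. The sketch also approximates "restarts" by counting gateway starts inside the window.

```typescript
// Hypothetical sketch; none of these names exist in OpenClaw.
interface StabilityConfig {
  maxRestarts: number;     // e.g. 3
  restartWindowMs: number; // e.g. 15 minutes
  minStableMs: number;     // e.g. 5 minutes
}

// startTimesMs: gateway start timestamps, oldest first, current start last.
// Returns true when cron execution should be deferred.
function shouldDeferCron(
  startTimesMs: number[],
  nowMs: number,
  cfg: StabilityConfig
): boolean {
  if (startTimesMs.length === 0) return false;

  // Count starts that fall inside the look-back window.
  const recentStarts = startTimesMs.filter(
    (t) => nowMs - t <= cfg.restartWindowMs
  ).length;
  const unstable = recentStarts > cfg.maxRestarts;

  // Uptime of the current gateway instance.
  const currentStart = startTimesMs[startTimesMs.length - 1] ?? nowMs;
  const uptimeMs = nowMs - currentStart;

  // Defer only while the gateway looks unstable AND has not yet
  // stayed up for the minimum stable window.
  return unstable && uptimeMs < cfg.minStableMs;
}
```

With the example thresholds from this report (3 restarts, 15 minutes, 5 minutes), a job firing 35 seconds after the fourth start in 12 minutes would be deferred, while the same job on a gateway that has been up for 6 minutes would run.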
Environment
- OpenClaw v2026.3.11
- macOS (arm64)
- Cron job using opus model (slow to initialize, making the kill window especially dangerous)
Suggested Approaches
- Stability gate: Track gateway start time. Don't fire cron jobs until uptime > N minutes.
- Pre-execution journaling: Write a started record to the runs JSONL before firing. On startup, check for started without completed and retry.
- Restart detection: If the gateway detects it's been restarted >3 times in 15 min, enter a "degraded" mode that defers non-critical crons.
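As a rough illustration of the pre-execution journaling approach, the sketch below scans a runs log for started records that have no matching completed record. The record shape (`jobId`, `scheduledAtMs`, `status`) and the function names are assumptions made for this example; the actual runs JSONL schema is not shown in this report.

```typescript
// Hypothetical record shape; the real runs JSONL schema may differ.
type RunRecord = {
  jobId: string;
  scheduledAtMs: number;
  status: "started" | "completed";
};

// Serialize one record as a JSONL line. A "started" line would be
// appended before firing, a "completed" line after the job finishes.
function toJsonlLine(record: RunRecord): string {
  return JSON.stringify(record) + "\n";
}

// Return runs that have a "started" record but no "completed" record.
function findInterruptedRuns(jsonl: string): RunRecord[] {
  const open = new Map<string, RunRecord>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let record: RunRecord;
    try {
      record = JSON.parse(line) as RunRecord;
    } catch {
      continue; // skip a line truncated by the kill
    }
    // Key by job and scheduled time so each occurrence is tracked separately.
    const key = `${record.jobId}@${record.scheduledAtMs}`;
    if (record.status === "started") open.set(key, record);
    else if (record.status === "completed") open.delete(key);
  }
  return Array.from(open.values());
}
```

On startup, the gateway could pass the contents of the runs file to `findInterruptedRuns` and re-queue whatever it returns. Keying on job id plus scheduled time means a retry of one occurrence is not confused with the next scheduled run of the same job.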