Cron scheduler silently stops firing after ~2.5 days of gateway uptime
Summary
A cron job (schedule 0 9 * * * with tz: Asia/Shanghai) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual openclaw cron run <id> worked immediately, and a gateway restart restored automatic scheduling.
Environment
- OpenClaw version: 2026.4.14 (323493f) — also verified identical timer code in 2026.4.26
- OS: macOS 15 (Apple Silicon)
- Node.js: bundled with OpenClaw
- Gateway mode: launchd agent (ai.openclaw.gateway)
- Cron jobs: 3 enabled (two daily, one weekly)
Timeline
| Event | Timestamp | Gateway age |
|---|---|---|
| Gateway started | T+0h | n/a |
| Daily cron job A ✅ | T+15h | 15h 34m |
| Daily cron job A ✅ | T+39h | 1d 15h |
| Daily cron job B ✅ | T+52h | 2d 4h |
| Daily cron job A ❌ MISSED | T+63h | 2d 15h |
| Manual openclaw cron run ✅ | T+64h | 2d 16h |
| Gateway restart → timer reset ✅ | T+64h | n/a (fresh) |
Observations
- No run was attempted: The cron run log (cron/runs/<job-id>.jsonl) has no entry between the last successful run and the manual trigger ~36 hours later. The timer simply stopped invoking onTimer().
- cron status showed stale nextWakeAtMs: openclaw cron status returned a nextWakeAtMs value 35 minutes in the past, confirming the scheduler knew the next wake time but failed to act on it.
- Gateway process was alive and active: The process was running with RSS ~958MB. Other periodic plugin activity was firing normally every 30 minutes. Only the cron setTimeout chain appears broken.
- No cron errors in logs: gateway.log and gateway.err.log contain no cron: timer tick failed or similar entries around the scheduled time. The timer callback was simply never invoked.
- No macOS full sleep detected: pmset -g log shows no Sleep/Wake transitions around the missed window. Display sleep may have occurred but not system sleep.
Analysis of src/cron/service/timer.ts
The timer implementation uses MAX_TIMER_DELAY_MS = 60000 (60 seconds), so the cron scheduler ticks every ≤60 seconds rather than setting one long-duration timeout. This design should be resilient to timer drift.
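To make the clamping concrete, here is a minimal sketch of the delay computation, assuming the delay is derived from a next-wake timestamp. `clampedDelayMs` and its parameters are hypothetical names for illustration, and flooring negative delays at 0 is an assumption; only `MAX_TIMER_DELAY_MS` and its value come from timer.ts as quoted in this report.

```typescript
// Value quoted from timer.ts in this report.
const MAX_TIMER_DELAY_MS = 60_000;

// Hypothetical helper: how long to sleep before the next scheduler tick.
// Assumption: a wake time that has already passed collapses to a delay of 0,
// so the tick runs as soon as the event loop allows.
function clampedDelayMs(nextWakeAtMs: number, nowMs: number): number {
  const delay = Math.max(nextWakeAtMs - nowMs, 0);
  return Math.min(delay, MAX_TIMER_DELAY_MS);
}

// A job due in 2 hours still produces a 60 s timer, so the scheduler
// wakes roughly 120 times before the job actually runs.
const twoHoursMs = 2 * 60 * 60_000;
console.log(clampedDelayMs(twoHoursMs, 0));   // 60000
console.log(clampedDelayMs(30_000, 0));       // 30000
console.log(clampedDelayMs(0, 35 * 60_000));  // 0 (wake time 35 minutes in the past)
```

Under this model, a stale nextWakeAtMs can only persist if no timer callback runs at all, which matches the observations above.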
The armTimer → setTimeout → onTimer → finally { armTimer } chain appears correct:
function armTimer(state) {
  // ...
  const clampedDelay = Math.min(delay, MAX_TIMER_DELAY_MS); // max 60s
  state.timer = setTimeout(() => {
    onTimer(state).catch(err => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
      // ⚠️ No re-arm here: if onTimer rejects without reaching finally, the chain breaks
    });
  }, clampedDelay);
}

async function onTimer(state) {
  if (state.running) { armRunningRecheckTimer(state); return; }
  state.running = true;
  armRunningRecheckTimer(state); // backup timer before try
  try {
    // ... execute due jobs ...
  } finally {
    state.running = false;
    armTimer(state); // re-arm
  }
}
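One detail of the excerpt is worth modelling: `state.running = true` and `armRunningRecheckTimer(state)` execute before the `try`, so an exception raised there would skip the `finally`. The sketch below is a simplified synchronous model of that control flow, not the real timer.ts; `TickState`, `tick`, `attempt`, and the `preTry` hook are hypothetical names. It illustrates the mechanism only and is not evidence that this path was taken in the incident.

```typescript
interface TickState {
  running: boolean;
  armed: boolean; // true while a follow-up timer is scheduled
}

// Simplified model of onTimer(): `preTry` stands in for any statement
// placed before the try block.
function tick(state: TickState, preTry: () => void, runJobs: () => void): void {
  state.armed = false; // the timer that invoked us has fired
  state.running = true;
  preTry(); // a throw here never reaches the finally below
  try {
    runJobs();
  } finally {
    state.running = false;
    state.armed = true; // stands in for armTimer(state)
  }
}

// Mirrors the .catch() handler in armTimer: the error is swallowed
// (logged in the real code) and nothing is re-armed.
function attempt(preTry: () => void, runJobs: () => void): TickState {
  const state: TickState = { running: false, armed: true };
  try {
    tick(state, preTry, runJobs);
  } catch {
    // log only, no re-arm
  }
  return state;
}

const boom = () => { throw new Error("boom"); };
const noop = () => {};

// Failure inside the try block: finally still re-arms (running=false, armed=true).
console.log(attempt(noop, boom));
// Failure before the try block: stuck with running=true and nothing armed.
console.log(attempt(boom, noop));
```

Moving the pre-`try` statements inside the `try` would close this gap without changing the happy path.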
Potential failure modes
- macOS timer coalescing / App Nap: Even though the timer is 60s, macOS can aggressively defer timer callbacks for background processes that appear idle. The gateway has no incoming network activity between cron ticks, making it a candidate for App Nap. One caveat: Node.js drives setTimeout and setInterval from the same libuv timer mechanism, so the setInterval-based plugin refresh (30 min cycle) continuing to fire makes process-wide timer deferral a less likely explanation.
- Missing re-arm in .catch(): If onTimer() rejects without running the finally block, the .catch() handler logs but does NOT call armTimer(state), permanently breaking the chain. A rejection raised inside the try cannot skip the finally, and any rejection would also leave a cron: timer tick failed entry that the logs do not show, so this is a robustness gap rather than a confirmed cause.
- Event loop stall: At 958MB RSS, a major GC pause could delay timer callbacks. If a GC pause coincides with the critical window, and the subsequent timer fires into a state where nextRunAtMs is now stale, the recompute logic might skip to tomorrow.
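The event-loop-stall scenario depends on the order of operations after a late tick. As a sketch under assumptions (`Job`, `dueJobs`, and `nextDailyRunAtMs` are hypothetical names, and the real recompute logic in OpenClaw may differ), a late tick is harmless as long as due jobs are collected from the stored `nextRunAtMs` before anything recomputes it:

```typescript
const DAY_MS = 24 * 60 * 60_000;

interface Job {
  id: string;
  nextRunAtMs: number;
}

// Safe order: decide what is due from the stored timestamps first.
function dueJobs(jobs: Job[], nowMs: number): Job[] {
  return jobs.filter((job) => job.nextRunAtMs <= nowMs);
}

// Hypothetical recompute for a daily job: the first daily slot strictly
// after `nowMs`. `firstRunAtMs` anchors the daily phase.
function nextDailyRunAtMs(firstRunAtMs: number, nowMs: number): number {
  if (nowMs < firstRunAtMs) return firstRunAtMs;
  const elapsedDays = Math.floor((nowMs - firstRunAtMs) / DAY_MS);
  return firstRunAtMs + (elapsedDays + 1) * DAY_MS;
}

const scheduledAt = 1_000_000;       // arbitrary timestamp for the example
const lateNow = scheduledAt + 5_000; // the tick arrives 5 s late
const job: Job = { id: "daily-a", nextRunAtMs: scheduledAt };

// Collect first: the late job is still reported as due.
console.log(dueJobs([job], lateNow).length); // 1

// Recompute first (the risky order): the stored time jumps a day ahead
// and the same check now finds nothing to run.
const recomputed: Job = { ...job, nextRunAtMs: nextDailyRunAtMs(scheduledAt, lateNow) };
console.log(dueJobs([recomputed], lateNow).length); // 0
```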
Suggested Fix
Option A: Add a watchdog setInterval (safest, backward-compatible)
Add a periodic watchdog that checks whether nextWakeAtMs is past-due, independent of the setTimeout chain:
// In cron service initialization:
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes
setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  if (nextAt && Date.now() >= nextAt + 60_000) {
    log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch(err => {
      log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);
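The check at the heart of Option A can be isolated as a pure function, which makes the grace period easy to unit-test without real timers. This is a sketch only: `isWakeOverdue` and `WATCHDOG_GRACE_MS` are hypothetical names, and the 60 s grace mirrors the `nextAt + 60_000` condition above.

```typescript
const WATCHDOG_GRACE_MS = 60_000;

// True when a wake time exists and has been overdue for at least the grace period.
// `undefined` models "no job scheduled", where the watchdog should stay quiet.
function isWakeOverdue(
  nextWakeAtMs: number | undefined,
  nowMs: number,
  graceMs: number = WATCHDOG_GRACE_MS,
): boolean {
  return nextWakeAtMs !== undefined && nowMs >= nextWakeAtMs + graceMs;
}

const wakeAt = 1_700_000_000_000; // arbitrary timestamp for the example
console.log(isWakeOverdue(wakeAt, wakeAt + 35 * 60_000)); // true: the 35-minute-stale case from this report
console.log(isWakeOverdue(wakeAt, wakeAt + 30_000));      // false: still within the grace period
console.log(isWakeOverdue(undefined, wakeAt));            // false: nothing scheduled
```

A real implementation would presumably also keep the interval handle so the watchdog can be cleared when the cron service stops.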
Option B: Re-arm in .catch() handler
  state.timer = setTimeout(() => {
    onTimer(state).catch((err) => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+     armTimer(state); // ensure chain is never broken
    });
  }, clampedDelay);
Option C: Use setInterval instead of chained setTimeout
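Replace the setTimeout chain with a single setInterval(() => onTimer(state), 60_000) that unconditionally checks for due jobs every 60 seconds. This removes the chain-breaking risk because there is no chain to re-arm, though it would not help if the OS is deferring the process's timers.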
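If Option C is pursued, the interval callback needs a re-entrancy guard, because `setInterval` fires again even while a previous tick is still running jobs. A minimal sketch, assuming job execution is exposed as an async function; `makeGuardedTick` and `runDueJobs` are hypothetical names, not OpenClaw APIs:

```typescript
// Wraps an async job runner so overlapping interval ticks are skipped
// instead of running jobs concurrently. Resolves to true if the tick ran.
function makeGuardedTick(runDueJobs: () => Promise<void>): () => Promise<boolean> {
  let running = false;
  return async () => {
    if (running) return false; // previous tick still in flight
    running = true;
    try {
      await runDueJobs();
      return true;
    } finally {
      running = false; // always release, even if runDueJobs rejects
    }
  };
}

// Usage sketch: the interval never needs re-arming, so a failed tick
// is only logged and cannot break the schedule.
// const tick = makeGuardedTick(runDueJobs);
// setInterval(() => { tick().catch((err) => console.error(err)); }, 60_000);
```

The guard plays the same role as the `state.running` check in the current onTimer, without a separate recheck timer.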
Reproduction
Difficult to reproduce on demand — appears to be a timing-dependent issue related to macOS background process scheduling. Running the gateway continuously for 2+ days with only cron activity (no interactive sessions) may increase the likelihood.
Workaround
openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.