Cron scheduler silently stops firing after ~2.5 days of gateway uptime #73166

@SkywingsWang

Summary

A cron job (schedule 0 9 * * * with tz: Asia/Shanghai) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual openclaw cron run <id> worked immediately, and a gateway restart restored automatic scheduling.

Environment

  • OpenClaw version: 2026.4.14 (323493f) — also verified identical timer code in 2026.4.26
  • OS: macOS 15 (Apple Silicon)
  • Node.js: bundled with OpenClaw
  • Gateway mode: launchd agent (ai.openclaw.gateway)
  • Cron jobs: 3 enabled (two daily, one weekly)

Timeline

| Event | Timestamp | Gateway age |
|---|---|---|
| Gateway started | T+0h | |
| Daily cron job A ✅ | T+15h | 15h 34m |
| Daily cron job A ✅ | T+39h | 1d 15h |
| Daily cron job B ✅ | T+52h | 2d 4h |
| Daily cron job A ❌ MISSED | T+63h | 2d 15h |
| Manual openclaw cron run | T+64h | 2d 16h |
| Gateway restart → timer reset ✅ | T+64h | (fresh) |

Observations

  1. No run was attempted: The cron run log (cron/runs/<job-id>.jsonl) has no entry between the last successful run and the manual trigger ~36 hours later. The timer simply stopped invoking onTimer().

  2. cron status showed stale nextWakeAtMs: openclaw cron status returned a nextWakeAtMs value 35 minutes in the past, confirming the scheduler knew the next wake time but failed to act on it.

  3. Gateway process was alive and active: The process was running with RSS ~958MB. Other periodic plugin activity was firing normally every 30 minutes. Only the cron setTimeout chain appears broken.

  4. No cron errors in logs: gateway.log and gateway.err.log contain no cron: timer tick failed or similar entries around the scheduled time. The timer callback was simply never invoked.

  5. No macOS full sleep detected: pmset -g log shows no Sleep/Wake transitions around the missed window. Display sleep may have occurred but not system sleep.

Analysis of src/cron/service/timer.ts

The timer implementation uses MAX_TIMER_DELAY_MS = 60000 (60 seconds), so the cron scheduler ticks every ≤60 seconds rather than setting one long-duration timeout. This design should be resilient to timer drift.
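To make the clamping concrete, here is a minimal sketch of the delay arithmetic. `clampedDelayMs` and `ticksUntil` are hypothetical helpers, not functions from timer.ts, and clamping negative delays to zero is an assumption, since the excerpt elides how `delay` is computed.

```typescript
// Hypothetical helpers for illustration; not taken from timer.ts.
const MAX_TIMER_DELAY_MS = 60_000;

// Delay to pass to setTimeout: never negative, never longer than 60 seconds.
function clampedDelayMs(nextWakeAtMs: number, nowMs: number): number {
  const delay = Math.max(0, nextWakeAtMs - nowMs); // assumption: past-due means "fire now"
  return Math.min(delay, MAX_TIMER_DELAY_MS);
}

// Number of consecutive timer hops needed to reach a wake time (at least one).
function ticksUntil(nextWakeAtMs: number, nowMs: number): number {
  const remaining = Math.max(0, nextWakeAtMs - nowMs);
  return Math.max(1, Math.ceil(remaining / MAX_TIMER_DELAY_MS));
}
```

Under these assumptions a daily job is reached through roughly 1,440 consecutive hops, so a single hop that fails to re-arm is enough to stop every run after it.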

The armTimer → setTimeout → onTimer → finally { armTimer } chain appears correct:

function armTimer(state) {
  // ...
  const clampedDelay = Math.min(delay, MAX_TIMER_DELAY_MS); // max 60s
  state.timer = setTimeout(() => {
    onTimer(state).catch(err => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
      // ⚠️ No re-arm here — if onTimer rejects without reaching finally, chain breaks
    });
  }, clampedDelay);
}

async function onTimer(state) {
  if (state.running) { armRunningRecheckTimer(state); return; }
  state.running = true;
  armRunningRecheckTimer(state); // backup timer before try
  try {
    // ... execute due jobs ...
  } finally {
    state.running = false;
    armTimer(state); // re-arm
  }
}

Potential failure modes

  1. macOS timer coalescing / App Nap: Even though the timer is capped at 60s, macOS can aggressively defer setTimeout callbacks for background processes that appear idle. The gateway has no incoming network activity between cron ticks, making it a candidate for App Nap. One caveat: Node.js schedules setTimeout and setInterval callbacks through the same underlying timer mechanism, so this hypothesis does not by itself explain why the setInterval-based plugin refresh (30 min cycle) kept firing while the cron chain stopped.

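  2. Missing re-arm in .catch(): If onTimer() rejects without reaching the finally block, the .catch() handler logs but does NOT call armTimer(state), permanently breaking the chain. A rejection raised inside the try always passes through finally, so the exposed path in the excerpt above is armRunningRecheckTimer(state) throwing: it runs before the try and would leave state.running stuck at true. Note that this path would still log cron: timer tick failed, which observation 4 did not find, so it does not fully match this incident.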

  3. Event loop stall: At 958MB RSS, a major GC pause could delay timer callbacks. If a GC pause coincides with the critical window, and the subsequent timer fires into a state where nextRunAtMs is now stale, the recompute logic might skip to tomorrow.
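One way to tell these failure modes apart on a future occurrence is to record how late each tick fires relative to its deadline. The sketch below is a hypothetical diagnostic, not part of timer.ts; `TickDriftTracker` and the 5-second warning threshold are assumptions chosen for illustration.

```typescript
// Hypothetical diagnostic; names and threshold are illustrative assumptions.
interface DriftSummary {
  ticks: number;
  lateTicks: number;
  maxLateMs: number;
}

class TickDriftTracker {
  private ticks = 0;
  private lateTicks = 0;
  private maxLateMs = 0;
  private readonly warnAfterMs: number;

  constructor(warnAfterMs: number = 5_000) {
    this.warnAfterMs = warnAfterMs; // lateness above this counts as a "late tick"
  }

  // Record one tick and return how many ms after its deadline it fired.
  record(expectedAtMs: number, firedAtMs: number): number {
    const lateMs = Math.max(0, firedAtMs - expectedAtMs);
    this.ticks += 1;
    if (lateMs > this.warnAfterMs) this.lateTicks += 1;
    if (lateMs > this.maxLateMs) this.maxLateMs = lateMs;
    return lateMs;
  }

  summary(): DriftSummary {
    return { ticks: this.ticks, lateTicks: this.lateTicks, maxLateMs: this.maxLateMs };
  }
}
```

In use, the deadline would be captured when arming (`Date.now() + clampedDelay`) and passed to `record` at the top of the timer callback. Growing lateness would point toward deferral or stalls (modes 1 and 3), while a tick count that stops increasing without any late ticks would point toward a broken chain (mode 2).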

Suggested Fix

Option A: Add a watchdog setInterval (safest, backward-compatible)

Add a periodic watchdog that checks whether nextWakeAtMs is past-due, independent of the setTimeout chain:

// In cron service initialization:
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes

setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  if (nextAt && Date.now() >= nextAt + 60_000) {
    log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch(err => {
      log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);

Option B: Re-arm in .catch() handler

state.timer = setTimeout(() => {
  onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+   armTimer(state); // ensure chain is never broken
  });
}, clampedDelay);
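One caveat with Option B: when a job throws inside the try block, the finally in onTimer has already called armTimer(state) by the time .catch() runs, so the extra call arms a second timer unless armTimer cancels the pending one first. The excerpt elides the start of armTimer, so whether it already does this is unknown. A minimal sketch of an idempotent arm follows; `TimerApi`, `ChainState`, and `armIdempotent` are hypothetical names, and the timer functions are injected so the behavior can be checked without real timers.

```typescript
// Hypothetical names throughout; this is not the code in timer.ts.
interface TimerApi<H> {
  set(cb: () => void, delayMs: number): H;
  clear(handle: H): void;
}

interface ChainState<H> {
  timer: H | null;
}

// Arms the next tick, cancelling any timer recorded in state first, so that
// calling this twice in a row leaves exactly one live timer.
function armIdempotent<H>(
  state: ChainState<H>,
  api: TimerApi<H>,
  cb: () => void,
  delayMs: number,
): void {
  if (state.timer !== null) api.clear(state.timer);
  state.timer = api.set(cb, delayMs);
}

// Adapter that backs TimerApi with the real global timers.
const realTimers: TimerApi<ReturnType<typeof setTimeout>> = {
  set: (cb, delayMs) => setTimeout(cb, delayMs),
  clear: (handle) => clearTimeout(handle),
};
```

With this shape, a redundant re-arm from .catch() replaces the pending timer instead of adding a parallel chain.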

Option C: Use setInterval instead of chained setTimeout

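Replace the setTimeout chain with a single setInterval(() => onTimer(state), 60_000) that unconditionally checks for due jobs every 60 seconds. This removes the dependency on each tick re-arming the next one, at the cost of jobs firing up to 60 seconds after their scheduled time. It would not help if callbacks are being deferred by the OS or by a stalled event loop (failure modes 1 and 3).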
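If Option C is pursued, ticks can overlap whenever a job runs longer than the interval. Below is a minimal sketch of a re-entrancy guard that mirrors the state.running check in onTimer; `makeGuardedTick` is a hypothetical helper, not an existing OpenClaw function.

```typescript
// Hypothetical helper; wraps an async tick so that overlapping calls are skipped.
function makeGuardedTick(
  tick: () => Promise<void>,
  onError: (err: unknown) => void, // must not throw
): () => boolean {
  let running = false;
  return () => {
    if (running) return false; // previous tick still in flight: skip this one
    running = true;
    void (async () => {
      try {
        await tick();
      } catch (err) {
        onError(err);
      } finally {
        running = false; // always reopen the guard once the tick settles
      }
    })();
    return true; // a tick was started
  };
}
```

It could be wired up as `setInterval(makeGuardedTick(() => onTimer(state), logError), 60_000)`, where `logError` is whatever error logger is at hand. A tick that never settles keeps the guard closed indefinitely, so a real implementation would probably also want a per-tick timeout.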

Reproduction

Difficult to reproduce on demand — appears to be a timing-dependent issue related to macOS background process scheduling. Running the gateway continuously for 2+ days with only cron activity (no interactive sessions) may increase the likelihood.

Workaround

openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.
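For the watchdog-script workaround, the restart decision can be reduced to a small check on the nextWakeAtMs value that openclaw cron status reports. How that value is extracted is left out because the command's output format is not shown here; `shouldRestartGateway` and the two-minute grace period are assumptions for illustration.

```typescript
// Hypothetical decision logic for an external watchdog script.
const DEFAULT_GRACE_MS = 2 * 60_000; // assumption: two maximum-length timer hops

// How far past its wake time the scheduler is; 0 if not overdue or nothing is scheduled.
function overdueMs(nextWakeAtMs: number | null, nowMs: number): number {
  if (nextWakeAtMs === null) return 0;
  return Math.max(0, nowMs - nextWakeAtMs);
}

// True when the wake time is overdue by more than the grace period.
function shouldRestartGateway(
  nextWakeAtMs: number | null,
  nowMs: number,
  graceMs: number = DEFAULT_GRACE_MS,
): boolean {
  return overdueMs(nextWakeAtMs, nowMs) > graceMs;
}
```

With the 35-minute-stale value from the observations, this check returns true. A job that is still executing may also leave the value in the past, so a real script would likely confirm the condition on two consecutive checks before restarting.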
