Cron scheduler silently stops firing after ~2.5 days of gateway uptime #73166

# Cron scheduler silently stops firing after ~2.5 days of gateway uptime

## Summary

A cron job (schedule `0 9 * * *` with `tz: Asia/Shanghai`) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual `openclaw cron run <id>` worked immediately, and a gateway restart restored automatic scheduling.

## Environment

- **OpenClaw version**: 2026.4.14 (323493f) — also verified identical timer code in 2026.4.26
- **OS**: macOS 15 (Apple Silicon)
- **Node.js**: bundled with OpenClaw
- **Gateway mode**: launchd agent (`ai.openclaw.gateway`)
- **Cron jobs**: 3 enabled (two daily, one weekly)

## Timeline

| Event | Timestamp | Gateway age |
|---|---|---|
| Gateway started | T+0h | — |
| Daily cron job A ✅ | T+15h | 15h 34m |
| Daily cron job A ✅ | T+39h | 1d 15h |
| Daily cron job B ✅ | T+52h | 2d 4h |
| **Daily cron job A ❌ MISSED** | T+63h | **2d 15h** |
| Manual `openclaw cron run` ✅ | T+64h | 2d 16h |
| Gateway restart → timer reset ✅ | T+64h | — (fresh) |

## Observations

1. **No run was attempted**: The cron run log (`cron/runs/<job-id>.jsonl`) has no entry between the last successful run and the manual trigger ~25 hours later (T+39h to T+64h in the timeline). The timer simply stopped invoking `onTimer()`.

2. **`cron status` showed stale `nextWakeAtMs`**: `openclaw cron status` returned a `nextWakeAtMs` value 35 minutes in the past, confirming the scheduler knew the next wake time but failed to act on it.

3. **Gateway process was alive and active**: The process was running with RSS ~958MB. Other periodic plugin activity was firing normally every 30 minutes. Only the cron `setTimeout` chain appears broken.

4. **No cron errors in logs**: `gateway.log` and `gateway.err.log` contain no `cron: timer tick failed` or similar entries around the scheduled time. The timer callback was simply never invoked.

5. **No macOS full sleep detected**: `pmset -g log` shows no Sleep/Wake transitions around the missed window. Display sleep may have occurred but not system sleep.
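
Observation 2 suggests a simple external health check: compare the reported `nextWakeAtMs` with the current time. The sketch below assumes the value is available as epoch milliseconds. The name `isWakeOverdue` and the grace period are hypothetical and not part of OpenClaw.

```typescript
// Hypothetical helper, not part of OpenClaw: decides whether a reported
// wake time is stale enough to suggest the timer chain has stopped.
function isWakeOverdue(
  nextWakeAtMs: number | undefined,
  nowMs: number,
  graceMs: number = 2 * 60_000, // assumed: two missed 60 s ticks
): boolean {
  if (nextWakeAtMs === undefined) return false; // nothing scheduled
  return nowMs - nextWakeAtMs > graceMs;
}

// The situation from observation 2: wake time 35 minutes in the past.
const now = Date.now();
console.log(isWakeOverdue(now - 35 * 60_000, now)); // true
console.log(isWakeOverdue(now + 30_000, now)); // false
```

With the scheduler ticking at least once per minute, a wake time more than a couple of minutes overdue is a strong hint that the chain is no longer running.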

## Analysis of `src/cron/service/timer.ts`

The timer implementation caps each wait at `MAX_TIMER_DELAY_MS = 60000` (60 seconds), so the cron scheduler wakes at least once every 60 seconds instead of arming one long-duration timeout. This design should be resilient to timer drift, because no single timeout spans more than a minute.
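
To make the clamping concrete, the sketch below lists the delays a clamped timer would use while waiting for a wake time. `clampedDelays` is a hypothetical illustration. Only `MAX_TIMER_DELAY_MS` and its 60-second value come from `timer.ts`.

```typescript
const MAX_TIMER_DELAY_MS = 60_000;

// Hypothetical illustration: returns the successive setTimeout delays used
// when every individual wait is capped at MAX_TIMER_DELAY_MS.
function clampedDelays(nowMs: number, wakeAtMs: number): number[] {
  const delays: number[] = [];
  let t = nowMs;
  while (t < wakeAtMs) {
    const d = Math.min(wakeAtMs - t, MAX_TIMER_DELAY_MS);
    delays.push(d);
    t += d;
  }
  return delays;
}

console.log(clampedDelays(0, 150_000)); // [ 60000, 60000, 30000 ]
```

Each entry corresponds to one link in the chain, so a single lost link stops every later wake-up.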

The `armTimer → setTimeout → onTimer → finally { armTimer }` chain looks correct on the normal path, with one gap marked in the excerpt:

```js
function armTimer(state) {
  // ...
  const clampedDelay = Math.min(delay, MAX_TIMER_DELAY_MS); // max 60s
  state.timer = setTimeout(() => {
    onTimer(state).catch(err => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
      // ⚠️ No re-arm here — if onTimer rejects without reaching finally, chain breaks
    });
  }, clampedDelay);
}

async function onTimer(state) {
  if (state.running) { armRunningRecheckTimer(state); return; }
  state.running = true;
  armRunningRecheckTimer(state); // backup timer before try
  try {
    // ... execute due jobs ...
  } finally {
    state.running = false;
    armTimer(state); // re-arm
  }
}
```

### Potential failure modes

1. **macOS timer coalescing / App Nap**: Even with a 60-second cap, macOS can defer `setTimeout` callbacks for background processes that appear idle. The gateway has no incoming network activity between cron ticks, which makes it a candidate for App Nap. One caveat: Node.js schedules `setTimeout` and `setInterval` through the same internal timer mechanism, so the `setInterval`-based plugin refresh (30-minute cycle) continuing to fire (observation 3) is hard to explain if OS-level deferral were the only cause.

2. **Missing re-arm in `.catch()`**: If `onTimer()` rejects without its `finally` block having re-armed the timer (for example, if `armRunningRecheckTimer(state)` throws before the `try` is entered, or `armTimer(state)` itself throws inside `finally`), the `.catch()` handler logs the error but does NOT call `armTimer(state)`, permanently breaking the chain. This path would leave a `cron: timer tick failed` entry, which observation 4 did not find, so it is a hardening gap rather than a confirmed cause.

3. **Event loop stall**: At 958MB RSS, a major GC pause could delay timer callbacks. If a GC pause coincides with the critical window, and the subsequent timer fires into a state where `nextRunAtMs` is now stale, the recompute logic might skip to tomorrow.
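
Failure mode 2 depends on whether a rejection can escape without the `finally` block running. In plain JavaScript this happens whenever the throw occurs before the `try` is entered. The sketch below uses hypothetical stand-ins (`tick`, `TickState`, `armBackup`), not the real `onTimer`, to show the `running` flag left at `true` and the re-arm skipped.

```typescript
interface TickState {
  running: boolean;
  rearmed: boolean;
}

// Hypothetical stand-in for onTimer: armBackup plays the role of the call
// made between setting `running` and entering `try`.
async function tick(state: TickState, armBackup: () => void): Promise<void> {
  state.running = true;
  armBackup(); // a throw here rejects the promise before `try` is entered
  try {
    // ... due jobs would run here ...
  } finally {
    state.running = false;
    state.rearmed = true; // stands in for armTimer(state)
  }
}

async function demo(): Promise<TickState> {
  const state: TickState = { running: false, rearmed: false };
  await tick(state, () => {
    throw new Error("backup timer failed");
  }).catch(() => {
    // Mirrors the existing .catch(): the error is handled, nothing re-arms.
  });
  return state;
}

demo().then((s) => console.log(s)); // { running: true, rearmed: false }
```

Whether `armRunningRecheckTimer` can actually throw in `timer.ts` is not established here. The sketch only shows that the language permits this path.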

## Suggested Fix

### Option A: Add a watchdog `setInterval` (safest, backward-compatible)

Add a periodic watchdog that checks whether `nextWakeAtMs` is past-due, independent of the `setTimeout` chain:

```ts
// In cron service initialization:
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes
const WATCHDOG_GRACE_MS = 60_000; // tolerate one missed 60 s tick

// Keep the handle so it can be passed to clearInterval() when the service stops.
const watchdog = setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  const now = Date.now();
  if (nextAt && now >= nextAt + WATCHDOG_GRACE_MS) {
    state.deps.log.warn({ nextAt, now }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch((err) => {
      state.deps.log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);
```

### Option B: Re-arm in `.catch()` handler

```diff
 state.timer = setTimeout(() => {
   onTimer(state).catch((err) => {
     state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+    armTimer(state); // ensure chain is never broken
   });
 }, clampedDelay);
```

### Option C: Use `setInterval` instead of chained `setTimeout`

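Replace the `setTimeout` chain with a single `setInterval(() => onTimer(state), 60_000)` that unconditionally checks for due jobs every 60 seconds. This removes the dependency on each tick re-arming the next one, at the cost of jobs firing up to 60 seconds late. It would not help if the OS defers every timer in the process.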
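
A minimal sketch of Option C, assuming `onTimer` can be invoked without re-arming itself. `startCronInterval` and its parameters are hypothetical. The re-entrancy guard is there because `setInterval` keeps firing while a previous tick is still awaiting jobs.

```typescript
type Tick = () => Promise<void>;

// Hypothetical sketch: runs `tick` on a fixed interval and skips a firing
// when the previous tick has not settled yet.
function startCronInterval(
  tick: Tick,
  onError: (err: unknown) => void,
  intervalMs: number = 60_000,
): { fire: () => Promise<void>; stop: () => void } {
  let running = false;

  const fire = async (): Promise<void> => {
    if (running) return; // previous tick still in progress
    running = true;
    try {
      await tick();
    } catch (err) {
      onError(err); // a failed tick does not stop later firings
    } finally {
      running = false;
    }
  };

  const handle = setInterval(() => void fire(), intervalMs);
  return { fire, stop: () => clearInterval(handle) };
}

// Usage (assumed wiring, not taken from timer.ts):
//   const cron = startCronInterval(() => onTimer(state), (err) => log.error(err));
//   ...
//   cron.stop(); // on gateway shutdown
```

Because the interval is armed once, no code path inside a tick can prevent the next firing. `fire` is exposed so a tick can also be triggered manually or from tests.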

## Reproduction

Difficult to reproduce on demand. The failure appears to be timing-dependent and may be related to macOS background process scheduling. Running the gateway continuously for 2+ days with only cron activity (no interactive sessions) may increase the likelihood.

## Workaround

`openclaw gateway restart` immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.

