fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death by SkywingsWang · Pull Request #73355 · openclaw/openclaw

SkywingsWang · 2026-04-28T07:07:21Z

Summary

When the cron setTimeout chain breaks at runtime, the scheduler silently stops firing and never recovers until a gateway restart. This was observed after ~2.5 days of continuous gateway uptime on macOS (Apple Silicon) with no errors logged.

Root cause: The .catch() handler in armTimer() and armRunningRecheckTimer() logs the error but does NOT call armTimer(state), so if onTimer() ever rejects without reaching its finally block, the timer chain is permanently broken.

Contributing factor: On macOS, background processes (like the gateway running as a launchd agent) are subject to App Nap and timer coalescing. While MAX_TIMER_DELAY_MS = 60s should be short enough for the clamped setTimeout to survive, the OS can still defer callbacks in edge cases, and if the chain breaks once, there is no recovery mechanism.

Changes

1. Re-arm in `.catch()` handler (`timer.ts`)

Both armTimer() and armRunningRecheckTimer() now call armTimer(state) inside .catch(), ensuring the timer chain is never permanently broken:

void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  armTimer(state); // ensure chain is never broken
});

2. Independent watchdog `setInterval` (`timer.ts`)

Added startCronWatchdog() — a setInterval-based watchdog (every 5 minutes) that detects when nextWakeAtMs is past-due by more than MAX_TIMER_DELAY_MS and force-triggers onTimer + re-arms the chain. setInterval is more resilient to OS-level timer deferral than chained setTimeout because libuv treats it as a persistent/repeating timer.

3. Lifecycle integration (`ops.ts`, `state.ts`)

Watchdog is started in start() and stopped in stop()
Added _stopWatchdog field to CronServiceState

4. Tests (`timer.watchdog.test.ts`)

Watchdog detects stalled timer and logs warning
Watchdog does not false-positive on healthy timers
Watchdog respects cronEnabled: false
Cleanup function stops the watchdog

Relationship to other PRs

fix(cron): prevent scheduler death when startup catch-up fails #68112 fixes a different bug: start() -> runMissedJobs() throwing kills the timer. That is a startup-time issue.
This PR fixes a runtime issue: the timer chain dying during normal operation after hours/days of uptime.

The two fixes are complementary and non-overlapping.

Fixes #73166

…chain death When the cron setTimeout chain breaks at runtime (e.g. due to macOS App Nap timer coalescing for background processes), the scheduler silently stops firing. This was observed after ~2.5 days of continuous gateway uptime on macOS with Apple Silicon. Changes: 1. Add armTimer(state) in the .catch() handler of both armTimer() and armRunningRecheckTimer() setTimeout callbacks. If onTimer() ever rejects without reaching its finally block, the chain is now re-established instead of permanently broken. 2. Add startCronWatchdog() — an independent setInterval-based watchdog (every 5 minutes) that detects when nextWakeAtMs is past-due by more than MAX_TIMER_DELAY_MS and force-triggers onTimer + re-arms the chain. setInterval is more resilient to OS-level timer deferral than chained setTimeout because libuv treats it as a persistent/repeating timer. 3. Wire up the watchdog in start()/stop() lifecycle. 4. Add tests for watchdog detection, no-false-positive, cleanup, and disabled-cron scenarios. Fixes openclaw#73166

greptile-apps · 2026-04-28T07:10:44Z

Greptile Summary

This PR adds two resilience mechanisms to the cron scheduler: a .catch() re-arm in the setTimeout chain and an independent setInterval-based watchdog that detects stalled chains. The watchdog logic, lifecycle integration in ops.ts/state.ts, and the interval.unref() call are all well-considered. The main concern is that the .catch() re-arm interacts with onTimer's existing finally { armTimer(state) }, causing a double-arm (leaked timer) whenever onTimer rejects normally inside its try block, and the first test case doesn't actually cover the new .catch() code path.

Confidence Score: 4/5

Safe to merge — the watchdog and re-arm logic are sound; the double-arm edge case is benign in practice due to the state.running guard.

Only P2 findings: a double-armTimer in the rare rejection-within-try path (spurious extra ticks but no data loss or incorrect job execution), and a test that doesn't fully cover the new .catch() branch. No P0/P1 issues found.

timer.ts catch handlers (double-arm interaction with finally), timer.watchdog.test.ts first test case.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 738-745

Comment:
**Double `armTimer` call on rejection inside `try` block**

When `onTimer` rejects from within its own `try` block, the `finally` clause at line 967 (`state.running = false; armTimer(state)`) still executes first — JS `finally` always runs on async rejection. The `.catch()` handler then calls `armTimer` a second time, leaking timer A (from `finally`) and replacing `state.timer` with timer B. Both subsequently fire; whichever fires second hits the `state.running` guard and calls `armRunningRecheckTimer`, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path.

The `.catch()` re-arm only uniquely adds value for the narrow edge case where code before the `try` block (specifically `armRunningRecheckTimer` on line 819) throws synchronously. Consider guarding against that specific case, or checking `!state.timer` before re-arming in `.catch()` to avoid the double-arm:

```ts
void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  if (!state.timer) {
    armTimer(state);
  }
});
```

The same applies to the watchdog's `.catch()` handler and `armRunningRecheckTimer`'s `.catch()` handler.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.watchdog.test.ts
Line: 38-62

Comment:
**Test doesn't exercise the `.catch()` re-arm behavior**

This test only verifies that calling `armTimer` twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls `armTimer` directly, bypassing the `.catch()` handler entirely. A meaningful test would mock `onTimer` to reject (or inject a failing dep) so that the `.catch()` branch is actually exercised and `state.timer` is confirmed to be re-set after the rejection.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix(cron): add .catch() re-arm and watch..." | Re-trigger Greptile}

greptile-apps · 2026-04-28T07:10:48Z

    void onTimer(state).catch((err) => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+      // Re-arm the timer even when onTimer rejects outside its own finally
+      // block.  Without this, any unhandled rejection permanently breaks
+      // the setTimeout chain, silently killing the scheduler.
+      // See: https://github.com/openclaw/openclaw/issues/73166
+      armTimer(state);
    });


Double armTimer call on rejection inside try block

When onTimer rejects from within its own try block, the finally clause at line 967 (state.running = false; armTimer(state)) still executes first — JS finally always runs on async rejection. The .catch() handler then calls armTimer a second time, leaking timer A (from finally) and replacing state.timer with timer B. Both subsequently fire; whichever fires second hits the state.running guard and calls armRunningRecheckTimer, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path.

The .catch() re-arm only uniquely adds value for the narrow edge case where code before the try block (specifically armRunningRecheckTimer on line 819) throws synchronously. Consider guarding against that specific case, or checking !state.timer before re-arming in .catch() to avoid the double-arm:

void onTimer(state).catch((err) => { state.deps.log.error({ err: String(err) }, "cron: timer tick failed"); if (!state.timer) { armTimer(state); } });

The same applies to the watchdog's .catch() handler and armRunningRecheckTimer's .catch() handler.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service/timer.ts Line: 738-745 Comment: **Double `armTimer` call on rejection inside `try` block** When `onTimer` rejects from within its own `try` block, the `finally` clause at line 967 (`state.running = false; armTimer(state)`) still executes first — JS `finally` always runs on async rejection. The `.catch()` handler then calls `armTimer` a second time, leaking timer A (from `finally`) and replacing `state.timer` with timer B. Both subsequently fire; whichever fires second hits the `state.running` guard and calls `armRunningRecheckTimer`, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path. The `.catch()` re-arm only uniquely adds value for the narrow edge case where code before the `try` block (specifically `armRunningRecheckTimer` on line 819) throws synchronously. Consider guarding against that specific case, or checking `!state.timer` before re-arming in `.catch()` to avoid the double-arm: ```ts void onTimer(state).catch((err) => { state.deps.log.error({ err: String(err) }, "cron: timer tick failed"); if (!state.timer) { armTimer(state); } }); ``` The same applies to the watchdog's `.catch()` handler and `armRunningRecheckTimer`'s `.catch()` handler. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-28T07:10:49Z

+
+    // Arm an initial timer
+    armTimer(state);
+    expect(state.timer).not.toBeNull();
+
+    // Simulate a scenario where onTimer would fail: clear the timer
+    // reference and verify the .catch handler re-arms it.
+    // We test this indirectly by checking that after a timer fire + rejection,
+    // a new timer is set.
+    const originalTimer = state.timer;
+    clearTimeout(state.timer!);
+    state.timer = null;
+
+    // Directly call armTimer to verify it sets a timer
+    armTimer(state);
+    expect(state.timer).not.toBeNull();
+    expect(state.timer).not.toBe(originalTimer);
+  });
+});
+
+describe("startCronWatchdog", () => {
+  beforeEach(() => {
+    vi.useFakeTimers();
+    vi.clearAllMocks();
+  });


Test doesn't exercise the .catch() re-arm behavior

This test only verifies that calling armTimer twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls armTimer directly, bypassing the .catch() handler entirely. A meaningful test would mock onTimer to reject (or inject a failing dep) so that the .catch() branch is actually exercised and state.timer is confirmed to be re-set after the rejection.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service/timer.watchdog.test.ts Line: 38-62 Comment: **Test doesn't exercise the `.catch()` re-arm behavior** This test only verifies that calling `armTimer` twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls `armTimer` directly, bypassing the `.catch()` handler entirely. A meaningful test would mock `onTimer` to reject (or inject a failing dep) so that the `.catch()` branch is actually exercised and `state.timer` is confirmed to be re-set after the rejection. How can I resolve this? If you propose a fix, please make it concise.

clawsweeper · 2026-04-28T09:45:35Z

Codex review: needs real behavior proof before merge. Reviewed May 30, 2026, 12:57 AM ET / 04:57 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

PR surface: Source +68, Tests +188. Total +256 across 4 files.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

[P1] No close action taken because the review did not complete.

Maintainer options:

Decide the mitigation before merge
Retry the Codex review after fixing the execution failure.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

[P1] Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b9933b2ec119.

Label changes

Label justifications:

rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.

Evidence reviewed

PR surface:

Source +68, Tests +188. Total +256 across 4 files.

View PR surface stats

Area	Files	Added	Net
Source	3	68	+68
Tests	1	188	+188
Docs	0	0	0
Config	0	0	0
Generated	0	0	0
Other	0	0	0
Total	4	256	+256

What I checked:

failure reason: codex execution failed.
codex failure detail: Codex review failed for this PR with exit 1.
codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

stainlu · 2026-04-28T14:42:26Z

does the .catch re-arm have its own test? the new test file is for the watchdog path. the .catch path is a separate change (any onTimer rejection now re-arms) but i don't see a test that exercises it directly. mocking onTimer to reject once and asserting armTimer fires after would cover it.

openclaw-barnacle · 2026-05-30T04:43:22Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle Bot added the size: M label Apr 28, 2026

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

This was referenced Apr 30, 2026

Cron scheduler silently stops firing after ~2.5 days of gateway uptime #73166

Open

fix(cron): prevent scheduler death when startup catch-up fails #68112

Open

clawsweeper Bot mentioned this pull request May 15, 2026

[Bug]: Cron job not triggering automatically (v2026.5.12) #82006

Closed

openclaw-barnacle Bot added stale Marked as stale due to inactivity triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 30, 2026

clawsweeper Bot added the rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. label May 30, 2026

barnacle-openclaw Bot removed the stale Marked as stale due to inactivity label May 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death#73355

fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death#73355
SkywingsWang wants to merge 1 commit into
openclaw:mainfrom
SkywingsWang:fix/cron-timer-runtime-watchdog

SkywingsWang commented Apr 28, 2026

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

stainlu commented Apr 28, 2026

Uh oh!

openclaw-barnacle Bot commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

SkywingsWang commented Apr 28, 2026

Summary

Changes

1. Re-arm in .catch() handler (timer.ts)

2. Independent watchdog setInterval (timer.ts)

3. Lifecycle integration (ops.ts, state.ts)

4. Tests (timer.watchdog.test.ts)

Relationship to other PRs

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

clawsweeper Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stainlu commented Apr 28, 2026

Uh oh!

openclaw-barnacle Bot commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. Re-arm in `.catch()` handler (`timer.ts`)

2. Independent watchdog `setInterval` (`timer.ts`)

3. Lifecycle integration (`ops.ts`, `state.ts`)

4. Tests (`timer.watchdog.test.ts`)

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading