Skip to content

fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death#73355

Open
SkywingsWang wants to merge 1 commit into
openclaw:mainfrom
SkywingsWang:fix/cron-timer-runtime-watchdog
Open

fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death#73355
SkywingsWang wants to merge 1 commit into
openclaw:mainfrom
SkywingsWang:fix/cron-timer-runtime-watchdog

Conversation

@SkywingsWang

Copy link
Copy Markdown

Summary

When the cron setTimeout chain breaks at runtime, the scheduler silently stops firing and never recovers until a gateway restart. This was observed after ~2.5 days of continuous gateway uptime on macOS (Apple Silicon) with no errors logged.

Root cause: The .catch() handler in armTimer() and armRunningRecheckTimer() logs the error but does NOT call armTimer(state), so if onTimer() ever rejects without reaching its finally block, the timer chain is permanently broken.

Contributing factor: On macOS, background processes (like the gateway running as a launchd agent) are subject to App Nap and timer coalescing. While MAX_TIMER_DELAY_MS = 60s should be short enough for the clamped setTimeout to survive, the OS can still defer callbacks in edge cases, and if the chain breaks once, there is no recovery mechanism.

Changes

1. Re-arm in .catch() handler (timer.ts)

Both armTimer() and armRunningRecheckTimer() now call armTimer(state) inside .catch(), ensuring the timer chain is never permanently broken:

void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  armTimer(state); // ensure chain is never broken
});

2. Independent watchdog setInterval (timer.ts)

Added startCronWatchdog() — a setInterval-based watchdog (every 5 minutes) that detects when nextWakeAtMs is past-due by more than MAX_TIMER_DELAY_MS and force-triggers onTimer + re-arms the chain. setInterval is more resilient to OS-level timer deferral than chained setTimeout because libuv treats it as a persistent/repeating timer.

3. Lifecycle integration (ops.ts, state.ts)

  • Watchdog is started in start() and stopped in stop()
  • Added _stopWatchdog field to CronServiceState

4. Tests (timer.watchdog.test.ts)

  • Watchdog detects stalled timer and logs warning
  • Watchdog does not false-positive on healthy timers
  • Watchdog respects cronEnabled: false
  • Cleanup function stops the watchdog

Relationship to other PRs

The two fixes are complementary and non-overlapping.

Fixes #73166

…chain death

When the cron setTimeout chain breaks at runtime (e.g. due to macOS App Nap
timer coalescing for background processes), the scheduler silently stops
firing. This was observed after ~2.5 days of continuous gateway uptime on
macOS with Apple Silicon.

Changes:
1. Add armTimer(state) in the .catch() handler of both armTimer() and
   armRunningRecheckTimer() setTimeout callbacks. If onTimer() ever
   rejects without reaching its finally block, the chain is now
   re-established instead of permanently broken.

2. Add startCronWatchdog() — an independent setInterval-based watchdog
   (every 5 minutes) that detects when nextWakeAtMs is past-due by more
   than MAX_TIMER_DELAY_MS and force-triggers onTimer + re-arms the chain.
   setInterval is more resilient to OS-level timer deferral than chained
   setTimeout because libuv treats it as a persistent/repeating timer.

3. Wire up the watchdog in start()/stop() lifecycle.

4. Add tests for watchdog detection, no-false-positive, cleanup, and
   disabled-cron scenarios.

Fixes openclaw#73166
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds two resilience mechanisms to the cron scheduler: a .catch() re-arm in the setTimeout chain and an independent setInterval-based watchdog that detects stalled chains. The watchdog logic, lifecycle integration in ops.ts/state.ts, and the interval.unref() call are all well-considered. The main concern is that the .catch() re-arm interacts with onTimer's existing finally { armTimer(state) }, causing a double-arm (leaked timer) whenever onTimer rejects normally inside its try block, and the first test case doesn't actually cover the new .catch() code path.

Confidence Score: 4/5

Safe to merge — the watchdog and re-arm logic are sound; the double-arm edge case is benign in practice due to the state.running guard.

Only P2 findings: a double-armTimer in the rare rejection-within-try path (spurious extra ticks but no data loss or incorrect job execution), and a test that doesn't fully cover the new .catch() branch. No P0/P1 issues found.

timer.ts catch handlers (double-arm interaction with finally), timer.watchdog.test.ts first test case.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 738-745

Comment:
**Double `armTimer` call on rejection inside `try` block**

When `onTimer` rejects from within its own `try` block, the `finally` clause at line 967 (`state.running = false; armTimer(state)`) still executes first — JS `finally` always runs on async rejection. The `.catch()` handler then calls `armTimer` a second time, leaking timer A (from `finally`) and replacing `state.timer` with timer B. Both subsequently fire; whichever fires second hits the `state.running` guard and calls `armRunningRecheckTimer`, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path.

The `.catch()` re-arm only uniquely adds value for the narrow edge case where code before the `try` block (specifically `armRunningRecheckTimer` on line 819) throws synchronously. Consider guarding against that specific case, or checking `!state.timer` before re-arming in `.catch()` to avoid the double-arm:

```ts
void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  if (!state.timer) {
    armTimer(state);
  }
});
```

The same applies to the watchdog's `.catch()` handler and `armRunningRecheckTimer`'s `.catch()` handler.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cron/service/timer.watchdog.test.ts
Line: 38-62

Comment:
**Test doesn't exercise the `.catch()` re-arm behavior**

This test only verifies that calling `armTimer` twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls `armTimer` directly, bypassing the `.catch()` handler entirely. A meaningful test would mock `onTimer` to reject (or inject a failing dep) so that the `.catch()` branch is actually exercised and `state.timer` is confirmed to be re-set after the rejection.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(cron): add .catch() re-arm and watch..." | Re-trigger Greptile

Comment thread src/cron/service/timer.ts
Comment on lines 738 to 745
void onTimer(state).catch((err) => {
state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
// Re-arm the timer even when onTimer rejects outside its own finally
// block. Without this, any unhandled rejection permanently breaks
// the setTimeout chain, silently killing the scheduler.
// See: https://github.com/openclaw/openclaw/issues/73166
armTimer(state);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Double armTimer call on rejection inside try block

When onTimer rejects from within its own try block, the finally clause at line 967 (state.running = false; armTimer(state)) still executes first — JS finally always runs on async rejection. The .catch() handler then calls armTimer a second time, leaking timer A (from finally) and replacing state.timer with timer B. Both subsequently fire; whichever fires second hits the state.running guard and calls armRunningRecheckTimer, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path.

The .catch() re-arm only uniquely adds value for the narrow edge case where code before the try block (specifically armRunningRecheckTimer on line 819) throws synchronously. Consider guarding against that specific case, or checking !state.timer before re-arming in .catch() to avoid the double-arm:

void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  if (!state.timer) {
    armTimer(state);
  }
});

The same applies to the watchdog's .catch() handler and armRunningRecheckTimer's .catch() handler.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.ts
Line: 738-745

Comment:
**Double `armTimer` call on rejection inside `try` block**

When `onTimer` rejects from within its own `try` block, the `finally` clause at line 967 (`state.running = false; armTimer(state)`) still executes first — JS `finally` always runs on async rejection. The `.catch()` handler then calls `armTimer` a second time, leaking timer A (from `finally`) and replacing `state.timer` with timer B. Both subsequently fire; whichever fires second hits the `state.running` guard and calls `armRunningRecheckTimer`, cancelling and re-creating the tracked timer. Net effect: spurious extra timer ticks in the error path.

The `.catch()` re-arm only uniquely adds value for the narrow edge case where code before the `try` block (specifically `armRunningRecheckTimer` on line 819) throws synchronously. Consider guarding against that specific case, or checking `!state.timer` before re-arming in `.catch()` to avoid the double-arm:

```ts
void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  if (!state.timer) {
    armTimer(state);
  }
});
```

The same applies to the watchdog's `.catch()` handler and `armRunningRecheckTimer`'s `.catch()` handler.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +38 to +62

// Arm an initial timer
armTimer(state);
expect(state.timer).not.toBeNull();

// Simulate a scenario where onTimer would fail: clear the timer
// reference and verify the .catch handler re-arms it.
// We test this indirectly by checking that after a timer fire + rejection,
// a new timer is set.
const originalTimer = state.timer;
clearTimeout(state.timer!);
state.timer = null;

// Directly call armTimer to verify it sets a timer
armTimer(state);
expect(state.timer).not.toBeNull();
expect(state.timer).not.toBe(originalTimer);
});
});

describe("startCronWatchdog", () => {
beforeEach(() => {
vi.useFakeTimers();
vi.clearAllMocks();
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test doesn't exercise the .catch() re-arm behavior

This test only verifies that calling armTimer twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls armTimer directly, bypassing the .catch() handler entirely. A meaningful test would mock onTimer to reject (or inject a failing dep) so that the .catch() branch is actually exercised and state.timer is confirmed to be re-set after the rejection.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service/timer.watchdog.test.ts
Line: 38-62

Comment:
**Test doesn't exercise the `.catch()` re-arm behavior**

This test only verifies that calling `armTimer` twice sets a new timer reference — it doesn't simulate a rejection path at all. It manually clears the timer and calls `armTimer` directly, bypassing the `.catch()` handler entirely. A meaningful test would mock `onTimer` to reject (or inject a failing dep) so that the `.catch()` branch is actually exercised and `state.timer` is confirmed to be re-set after the rejection.

How can I resolve this? If you propose a fix, please make it concise.

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 30, 2026, 12:57 AM ET / 04:57 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

PR surface: Source +68, Tests +188. Total +256 across 4 files.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] No close action taken because the review did not complete.

Maintainer options:

  1. Decide the mitigation before merge
    Retry the Codex review after fixing the execution failure.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P1] Review did not complete, so no work-lane recommendation was made.
Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b9933b2ec119.

Label changes

Label justifications:

  • rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.
Evidence reviewed

PR surface:

Source +68, Tests +188. Total +256 across 4 files.

View PR surface stats
Area Files Added Removed Net
Source 3 68 0 +68
Tests 1 188 0 +188
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 4 256 0 +256

What I checked:

  • failure reason: codex execution failed.
  • codex failure detail: Codex review failed for this PR with exit 1.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@stainlu

stainlu commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

does the .catch re-arm have its own test? the new test file is for the watchdog path. the .catch path is a separate change (any onTimer rejection now re-arms) but i don't see a test that exercises it directly. mocking onTimer to reject once and asserting armTimer fires after would cover it.

@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added stale Marked as stale due to inactivity triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 30, 2026
@clawsweeper clawsweeper Bot added the rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. label May 30, 2026
@barnacle-openclaw barnacle-openclaw Bot removed the stale Marked as stale due to inactivity label May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron scheduler silently stops firing after ~2.5 days of gateway uptime

3 participants