fix(heartbeat): move scheduleNext to finally block to prevent timer death (#45772)#47107
fix(heartbeat): move scheduleNext to finally block to prevent timer death (#45772)#47107r3n3x wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9b7d8fb4a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Greptile SummaryThis PR moves Key changes:
Issues found:
Confidence Score: 3/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/infra/heartbeat-runner.ts
Line: 1227-1232
Comment:
**`requests-in-flight` retry cooldown now bypassed by `finally` block**
The comment on lines 1228–1231 still warns that calling `scheduleNext()` here "would register a 0 ms timer that races with the wake layer's 1 s retry and wins, bypassing the cooldown" — and that is exactly what now happens.
Execution order when `requests-in-flight` is returned:
1. `return res` inside the `try` block triggers the `finally` block **before** the awaited value is delivered back to the wake layer.
2. `scheduleNext()` runs synchronously. At this moment, the wake layer's timer has already been cleared (`timer = null` at the top of the callback), so `schedule(0, "normal")` creates a fresh 0 ms timer (`timerKind = "normal"`).
3. Control returns to the wake layer, which calls `schedule(DEFAULT_RETRY_MS, "retry")`. It finds the existing 0 ms timer (`timerDueAt ≤ dueAt`), so it returns early — the 1-second cooldown is never scheduled.
The guard added to `heartbeat-wake.ts/schedule` (`if (timerKind === "retry") { return; }`) does not protect against this because `timerKind` is `null` (cleared at the start of the callback) when `scheduleNext()` fires, so the guard never triggers.
The result is an immediate re-run when the main lane is still busy, rather than the intended 1-second retry backoff. In the worst case this creates a tight spin loop on contested lanes.
A minimal fix is to exclude the `requests-in-flight` path from the outer `try/finally`:
```typescript
if (res.status === "skipped" && res.reason === "requests-in-flight") {
// Do not reschedule here — the wake layer handles retry.
// Calling scheduleNext() would register a 0 ms timer that wins
// the race against the wake layer's 1 s retry (see #45772 fix).
return res; // <-- move this BEFORE the try block, like the guard returns above
}
```
Alternatively, track a flag inside the `try` block to suppress `scheduleNext()` in the `finally` for this specific path.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.ts
Line: 1219-1226
Comment:
**Stale comment references removed per-path `scheduleNext()` call**
The comment on lines 1220–1222 says "we must still advance the timer and **call scheduleNext** so heartbeats keep firing", but `scheduleNext()` is no longer called explicitly here — it's now handled entirely by the outer `finally` block. The comment should be updated to reflect the new structure.
```suggestion
} catch (err) {
// If runOnce throws (e.g. during session compaction), advance the
// agent's schedule so the next interval is computed correctly.
// scheduleNext() is called by the outer finally block.
const errMsg = formatErrorMessage(err);
log.error(`heartbeat runner: runOnce threw unexpectedly: ${errMsg}`, { error: errMsg });
advanceAgentSchedule(agent, now);
continue;
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.scheduler.test.ts
Line: 296-310
Comment:
**New test exercises the inner `catch`, not the described unhandled-rejection path**
The PR description states the bug is caused by rejections that *escape* the inner `try/catch` blocks (e.g. from code outside `runOnce`). However, `runSpy` here IS the `runOnce` function, so its rejection is caught by the existing inner `catch` in the loop — the `finally` block is not needed to re-arm the timer in this scenario (the old code would have called `scheduleNext()` after the loop completed anyway).
To accurately exercise the described bug, the rejection would need to originate outside the existing per-agent `try/catch`, for example from `advanceAgentSchedule`, from code between loop iterations, or from an async escape that bypasses the current guards. Consider adding a test that throws from a helper that is called *outside* the inner `try/catch` to ensure the `finally` block is what actually saves the timer in that case.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: d9b7d8f |
CI Status
Evidence: the same shard fails on recent The failure is in All other CI jobs pass (23/24), including:
Review feedback addressedAll 4 review findings from Codex and Greptile were fixed in e93604f:
|
|
👀 watching this issue - heartbeat affects our setup too |
0afc5b6 to
9ba398f
Compare
…eath (openclaw#45772) The heartbeat timer permanently stops after 1-2 triggers because scheduleNext() is called on each individual return path in run(). If an unhandled rejection or silent error exits the function before reaching any of those calls, the timer is never re-armed and heartbeats stop permanently. This fix wraps the run() body in try/finally with scheduleNext() in the finally block, guaranteeing the timer is always re-armed regardless of how the function exits. The three early guard returns (stopped, disabled, no agents) remain above the try/finally since they legitimately should not reschedule. Fixes openclaw#45772
- Add skipFinalSchedule flag to suppress scheduleNext() in the finally block when returning from the requests-in-flight path, preserving the wake layer's 1 s retry cooldown (Codex + greptile review) - Update stale comment in per-agent catch block to reflect that scheduleNext is now handled by the outer finally (greptile review) - Improve test to exercise the outer finally block directly by throwing from runOnce (not rejecting, which the inner catch handles)
… + outer finally path
9ba398f to
7f22f79
Compare
|
Heads up: @GMTekAI's analysis on #45772 (#45772 (comment)) suggests the scheduler re-arm logic in current source already handles the error paths this PR targets. The root cause may be in the delivery/gating layer rather than the timer core. We're going to update to v2026.3.23 and re-test with native heartbeat enabled to narrow down where the timer death actually happens. May need to revise this PR's approach based on findings — the finally-block fix might not be the right layer anymore. Keeping this open for now pending our re-test results. |
|
Closing this PR — the underlying issue #45772 has been resolved upstream in v2026.3.28. As noted in the later discussion by @GMTekAI, the root cause was in the delivery/gating paths rather than the scheduler re-arm logic we targeted here. The fix landed via a different path. We've confirmed native heartbeat working reliably on our multi-agent setup (10 agents, forum topics, high cron concurrency) since updating to v2026.3.28. The cron-based workaround we used as mitigation has been retired. Thanks to everyone who contributed to diagnosing this! |
|
Closing this as implemented after Codex review. Current main already contains the heartbeat re-arm fix this PR proposes, with the requests-in-flight exception preserved and regression coverage added. The functionality shipped in v2026.3.28 under a different PR (#52270), so this PR is superseded. What I checked:
So I’m closing this as already implemented rather than keeping a duplicate issue open. Review notes: reviewed against 68315792677f; fix evidence: release v2026.3.28. |
Summary
run()body intry/finallywithscheduleNext()in thefinallyblock, guaranteeing the timer is always re-armedadvanceAgentSchedulebehavior, andrequests-in-flighthandling remain identicalChange Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
Heartbeat timer reliably re-arms after each run instead of dying silently. No config or API changes.
Security Impact (required)
Repro + Verification
Environment
agents.defaults.heartbeat.every: "25m"(also reported with 5m, 30m, 1h)Steps
heartbeat: started)Expected
Heartbeat fires every configured interval indefinitely.
Actual
Heartbeat fires once after startup, then timer permanently dies. No error in logs.
Evidence
New test
re-arms timer after runOnce rejects with an unhandled promise rejection (#45772)validates the fix. All 63 existing heartbeat tests pass.Multiple users confirmed the pattern in #45772: timer fires exactly once after cold start, then silently stops. Gateway logs show no errors.
Human Verification (required)
requests-in-flightreturn path still works correctly (no double-scheduling); stopped runner does not reschedule; config updates during run handled properlyAI-Assisted PR 🤖
scheduleNext()is called viafinallyblock regardless of howrun()exits, preventing silent timer death from unhandled rejectionsReview Conversations
Compatibility / Migration
Failure Recovery (if this breaks)
scheduleNext()in finally causes double-scheduling, heartbeat intervals may be shorter than configured (butscheduleNext()is idempotent — it clears existing timers before setting new ones)Risks and Mitigations
scheduleNext()called infinallyeven when returning fromrequests-in-flightpath (which previously skipped it intentionally to avoid racing with wake layer retry)scheduleNext()is idempotent (clears existing timer first). The wake layer's retry timer has a"retry"kind guard that prevents preemption by normal timers. Existing testreschedules timer when runOnce returns requests-in-flightpasses unchanged, confirming no regression.