Skip to content

fix(heartbeat): move scheduleNext to finally block to prevent timer death (#45772)#47107

Closed
r3n3x wants to merge 5 commits into
openclaw:mainfrom
r3n3x:fix/heartbeat-timer-rearm-finally
Closed

fix(heartbeat): move scheduleNext to finally block to prevent timer death (#45772)#47107
r3n3x wants to merge 5 commits into
openclaw:mainfrom
r3n3x:fix/heartbeat-timer-rearm-finally

Conversation

@r3n3x

@r3n3x r3n3x commented Mar 15, 2026

Copy link
Copy Markdown

Summary

  • Problem: Heartbeat timer fires once after gateway start, then permanently dies with no error logged (Gateway Heartbeat timer stops after 1-2 triggers (introduced in v2026.3.8) #45772)
  • Why it matters: Users lose all autonomous heartbeat functionality until manual restart, with no indication anything is wrong
  • What changed: Wrapped run() body in try/finally with scheduleNext() in the finally block, guaranteeing the timer is always re-armed
  • What did NOT change: All existing logic, guard returns, advanceAgentSchedule behavior, and requests-in-flight handling remain identical

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Heartbeat timer reliably re-arms after each run instead of dying silently. No config or API changes.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS Darwin 25.3.0 (arm64) — also reported on Linux
  • Runtime/container: Node.js v25.8.1
  • Model/provider: Various (Anthropic, Morpheus M2.5, Volcengine, local Ollama)
  • Integration/channel: Telegram, WhatsApp, Discord
  • Relevant config: agents.defaults.heartbeat.every: "25m" (also reported with 5m, 30m, 1h)

Steps

  1. Configure heartbeat with any interval
  2. Start gateway
  3. Observe heartbeat fires once (logged as heartbeat: started)
  4. Wait for next interval — heartbeat never fires again

Expected

Heartbeat fires every configured interval indefinitely.

Actual

Heartbeat fires once after startup, then timer permanently dies. No error in logs.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets

New test re-arms timer after runOnce rejects with an unhandled promise rejection (#45772) validates the fix. All 63 existing heartbeat tests pass.

Multiple users confirmed the pattern in #45772: timer fires exactly once after cold start, then silently stops. Gateway logs show no errors.

Human Verification (required)

  • Verified scenarios: All 63 heartbeat tests pass (7 test files), including new regression test
  • Edge cases checked: requests-in-flight return path still works correctly (no double-scheduling); stopped runner does not reschedule; config updates during run handled properly
  • What I did not verify: Live production testing of the fix (we are running a cron-based workaround; the underlying timer bug is intermittent and hard to trigger on demand)

AI-Assisted PR 🤖

  • Mark as AI-assisted in the PR title or description
  • Note the degree of testing: Fully tested (all 63 existing tests + 1 new regression test pass)
  • Confirm understanding: The fix ensures scheduleNext() is called via finally block regardless of how run() exits, preventing silent timer death from unhandled rejections

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change: Revert single commit; heartbeat behavior returns to previous (broken) state
  • Known bad symptoms: If scheduleNext() in finally causes double-scheduling, heartbeat intervals may be shorter than configured (but scheduleNext() is idempotent — it clears existing timers before setting new ones)

Risks and Mitigations

  • Risk: scheduleNext() called in finally even when returning from requests-in-flight path (which previously skipped it intentionally to avoid racing with wake layer retry)
    • Mitigation: scheduleNext() is idempotent (clears existing timer first). The wake layer's retry timer has a "retry" kind guard that prevents preemption by normal timers. Existing test reschedules timer when runOnce returns requests-in-flight passes unchanged, confirming no regression.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9b7d8fb4a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/heartbeat-runner.ts Outdated
@greptile-apps

greptile-apps Bot commented Mar 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR moves scheduleNext() into a try/finally block in heartbeat-runner.ts to guarantee the heartbeat timer is always re-armed, even when an unexpected async rejection exits the run() function before reaching the original per-path scheduleNext() calls. The structural change is sound and addresses the root cause of the silent timer death described in #45772.

Key changes:

  • run() body wrapped in try/finally; scheduleNext() called exclusively in the finally block instead of on each of the 7 return paths.
  • Three early guard returns (state.stopped, !areHeartbeatsEnabled(), state.agents.size === 0) kept above the try/finally — these intentionally skip re-arming the timer.
  • New regression test verifies the timer fires again after a runOnce rejection.

Issues found:

  • requests-in-flight cooldown regression: The return res for the requests-in-flight path (line 1232) is now inside the try block, so scheduleNext() fires in the finally block for this case too. The original code explicitly avoided this because it would create a 0 ms interval timer that races with — and wins over — the wake layer's intended 1-second retry backoff. The guard added to heartbeat-wake.ts/schedule (if (timerKind === "retry") { return; }) does not protect against this: timerKind is cleared to null at the top of the wake-layer timer callback, so by the time run's finally block fires scheduleNext(), the guard's condition is never true.
  • Stale comment: The catch block comment at line 1220–1222 still says "call scheduleNext so heartbeats keep firing" but scheduleNext() is now the responsibility of the outer finally.
  • New test coverage gap: The added test exercises a runOnce rejection that is caught by the existing inner catch — the finally block is not what saves the timer in this path. A test simulating a rejection that truly escapes the inner guards would better validate the fix.

Confidence Score: 3/5

  • Safe to merge for the primary bug fix, but the requests-in-flight cooldown bypass is a real behavioral regression that should be addressed before merging to avoid spin-retry behavior on busy lanes.
  • The try/finally restructuring correctly solves the described timer-death bug. However, the requests-in-flight path is now inadvertently affected: scheduleNext() fires via the finally block before the wake layer can set its retry timer, causing the intended 1-second cooldown to be replaced by an immediate 0 ms re-run. The guard in heartbeat-wake.ts was designed to handle this but does not trigger due to execution ordering. The fix is one targeted change away from being clean.
  • The interaction between src/infra/heartbeat-runner.ts (the requests-in-flight return path inside the try block) and src/infra/heartbeat-wake.ts (the schedule guard) needs attention. The requests-in-flight return should be moved above the try/finally or guarded with a flag, analogous to the three existing early-return guards at the top of run().
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.ts
Line: 1227-1232

Comment:
**`requests-in-flight` retry cooldown now bypassed by `finally` block**

The comment on lines 1228–1231 still warns that calling `scheduleNext()` here "would register a 0 ms timer that races with the wake layer's 1 s retry and wins, bypassing the cooldown" — and that is exactly what now happens.

Execution order when `requests-in-flight` is returned:

1. `return res` inside the `try` block triggers the `finally` block **before** the awaited value is delivered back to the wake layer.
2. `scheduleNext()` runs synchronously. At this moment, the wake layer's timer has already been cleared (`timer = null` at the top of the callback), so `schedule(0, "normal")` creates a fresh 0 ms timer (`timerKind = "normal"`).
3. Control returns to the wake layer, which calls `schedule(DEFAULT_RETRY_MS, "retry")`. It finds the existing 0 ms timer (`timerDueAt ≤ dueAt`), so it returns early — the 1-second cooldown is never scheduled.

The guard added to `heartbeat-wake.ts/schedule` (`if (timerKind === "retry") { return; }`) does not protect against this because `timerKind` is `null` (cleared at the start of the callback) when `scheduleNext()` fires, so the guard never triggers.

The result is an immediate re-run when the main lane is still busy, rather than the intended 1-second retry backoff. In the worst case this creates a tight spin loop on contested lanes.

A minimal fix is to exclude the `requests-in-flight` path from the outer `try/finally`:

```typescript
if (res.status === "skipped" && res.reason === "requests-in-flight") {
  // Do not reschedule here — the wake layer handles retry.
  // Calling scheduleNext() would register a 0 ms timer that wins
  // the race against the wake layer's 1 s retry (see #45772 fix).
  return res;  // <-- move this BEFORE the try block, like the guard returns above
}
```

Alternatively, track a flag inside the `try` block to suppress `scheduleNext()` in the `finally` for this specific path.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/heartbeat-runner.ts
Line: 1219-1226

Comment:
**Stale comment references removed per-path `scheduleNext()` call**

The comment on lines 1220–1222 says "we must still advance the timer and **call scheduleNext** so heartbeats keep firing", but `scheduleNext()` is no longer called explicitly here — it's now handled entirely by the outer `finally` block. The comment should be updated to reflect the new structure.

```suggestion
        } catch (err) {
          // If runOnce throws (e.g. during session compaction), advance the
          // agent's schedule so the next interval is computed correctly.
          // scheduleNext() is called by the outer finally block.
          const errMsg = formatErrorMessage(err);
          log.error(`heartbeat runner: runOnce threw unexpectedly: ${errMsg}`, { error: errMsg });
          advanceAgentSchedule(agent, now);
          continue;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/heartbeat-runner.scheduler.test.ts
Line: 296-310

Comment:
**New test exercises the inner `catch`, not the described unhandled-rejection path**

The PR description states the bug is caused by rejections that *escape* the inner `try/catch` blocks (e.g. from code outside `runOnce`). However, `runSpy` here IS the `runOnce` function, so its rejection is caught by the existing inner `catch` in the loop — the `finally` block is not needed to re-arm the timer in this scenario (the old code would have called `scheduleNext()` after the loop completed anyway).

To accurately exercise the described bug, the rejection would need to originate outside the existing per-agent `try/catch`, for example from `advanceAgentSchedule`, from code between loop iterations, or from an async escape that bypasses the current guards. Consider adding a test that throws from a helper that is called *outside* the inner `try/catch` to ensure the `finally` block is what actually saves the timer in that case.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: d9b7d8f

Comment thread src/infra/heartbeat-runner.ts
Comment thread src/infra/heartbeat-runner.ts
Comment thread src/infra/heartbeat-runner.scheduler.test.ts Outdated
@r3n3x

r3n3x commented Mar 15, 2026

Copy link
Copy Markdown
Author

CI Status

checks-windows (node, test, 3, 6) failure is pre-existing on main — not caused by this PR.

Evidence: the same shard fails on recent main CI runs (e.g. run 23106828563) with the identical error:

AssertionError: no summary blocks: expected [Function] to throw error including
  "No summary returned" but got "[vitest] No "resolveModelAsync" export..."

The failure is in src/tts/tts.test.ts:490 — a vitest mock issue with pi-embedded-runner/model.js on Windows. Our changes touch only src/infra/heartbeat-runner.ts and its scheduler test.

All other CI jobs pass (23/24), including:

  • check (lint + typecheck)
  • ✅ All Linux test shards (1/2, 2/2)
  • ✅ Windows test shards 1, 2, 4, 5, 6 of 6
  • ✅ Channel tests, extension tests, bun tests
  • ✅ Install smoke, secrets scan

Review feedback addressed

All 4 review findings from Codex and Greptile were fixed in e93604f:

  1. requests-in-flight cooldown preserved via skipFinalSchedule flag
  2. ✅ Stale comment updated
  3. ✅ Test improved to exercise the outer finally block directly

@ShionEria

Copy link
Copy Markdown
Contributor

👀 watching this issue - heartbeat affects our setup too

@r3n3x r3n3x force-pushed the fix/heartbeat-timer-rearm-finally branch from 0afc5b6 to 9ba398f Compare March 18, 2026 09:16
r3n3x and others added 5 commits March 18, 2026 17:55
…eath (openclaw#45772)

The heartbeat timer permanently stops after 1-2 triggers because
scheduleNext() is called on each individual return path in run().
If an unhandled rejection or silent error exits the function before
reaching any of those calls, the timer is never re-armed and
heartbeats stop permanently.

This fix wraps the run() body in try/finally with scheduleNext() in
the finally block, guaranteeing the timer is always re-armed
regardless of how the function exits.

The three early guard returns (stopped, disabled, no agents) remain
above the try/finally since they legitimately should not reschedule.

Fixes openclaw#45772
- Add skipFinalSchedule flag to suppress scheduleNext() in the finally
  block when returning from the requests-in-flight path, preserving the
  wake layer's 1 s retry cooldown (Codex + greptile review)
- Update stale comment in per-agent catch block to reflect that
  scheduleNext is now handled by the outer finally (greptile review)
- Improve test to exercise the outer finally block directly by throwing
  from runOnce (not rejecting, which the inner catch handles)
@rex05ai

rex05ai commented Mar 24, 2026

Copy link
Copy Markdown

Heads up: @GMTekAI's analysis on #45772 (#45772 (comment)) suggests the scheduler re-arm logic in current source already handles the error paths this PR targets. The root cause may be in the delivery/gating layer rather than the timer core.

We're going to update to v2026.3.23 and re-test with native heartbeat enabled to narrow down where the timer death actually happens. May need to revise this PR's approach based on findings — the finally-block fix might not be the right layer anymore.

Keeping this open for now pending our re-test results.

@rex05ai

rex05ai commented Mar 30, 2026

Copy link
Copy Markdown

Closing this PR — the underlying issue #45772 has been resolved upstream in v2026.3.28.

As noted in the later discussion by @GMTekAI, the root cause was in the delivery/gating paths rather than the scheduler re-arm logic we targeted here. The fix landed via a different path.

We've confirmed native heartbeat working reliably on our multi-agent setup (10 agents, forum topics, high cron concurrency) since updating to v2026.3.28. The cron-based workaround we used as mitigation has been retired.

Thanks to everyone who contributed to diagnosing this!

@steipete

Copy link
Copy Markdown
Contributor

Closing this as implemented after Codex review.

Current main already contains the heartbeat re-arm fix this PR proposes, with the requests-in-flight exception preserved and regression coverage added. The functionality shipped in v2026.3.28 under a different PR (#52270), so this PR is superseded.

What I checked:

  • Clean checkout preserved: git status --short returned no output before inspection, so the repository remained clean and unmodified during review. (68315792677f)
  • Shipped changelog entry: The changelog for 2026.3.28 explicitly says: "Heartbeat/runner: guarantee the interval timer is re-armed after heartbeat runs and unexpected runner errors so scheduled heartbeats do not silently stop after an interrupted cycle." It attributes that fix to PR fix: guarantee heartbeat timer re-arm with try/finally #52270. (CHANGELOG.md:1911, 68315792677f)
  • Current main implements finally-based re-arm with busy-lane guard: run() now tracks requestsInFlight, wraps execution in try/finally, and calls scheduleNext() in the finally block unless the result was requests-in-flight, which matches the intended conservative behavior and supersedes this PR's approach. (src/infra/heartbeat-runner.ts:1479, 68315792677f)
  • Wake retry cooldown logic present on main: The wake layer keeps retry backoff as a hard minimum and requeues requests-in-flight results with schedule(DEFAULT_RETRY_MS, "retry"), covering the cooldown interaction that this PR had to address during review. (src/infra/heartbeat-wake.ts:130, 68315792677f)
  • Regression tests cover both failure modes: Scheduler tests on main verify continued scheduling after a thrown run error and verify retry behavior for requests-in-flight, which is the behavioral surface this PR was trying to protect. (src/infra/heartbeat-runner.scheduler.test.ts:134, 68315792677f)

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review notes: reviewed against 68315792677f; fix evidence: release v2026.3.28.

@steipete steipete closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway Heartbeat timer stops after 1-2 triggers (introduced in v2026.3.8)

4 participants