Skip to content

fix(heartbeat): Fix timer re-arm logic and add robust watchdog#31226

Closed
openperf wants to merge 4 commits into
openclaw:mainfrom
openperf:fix/heartbeat-timer-rearm-31139
Closed

fix(heartbeat): Fix timer re-arm logic and add robust watchdog#31226
openperf wants to merge 4 commits into
openclaw:mainfrom
openperf:fix/heartbeat-timer-rearm-31139

Conversation

@openperf

@openperf openperf commented Mar 2, 2026

Copy link
Copy Markdown
Member

Summary

Problem: After the first heartbeat batch fires, scheduleNext() sets a new setTimeout but immediately calls .unref() on it. In certain Node.js event-loop states this causes the timer to silently never fire again, so all subsequent heartbeats stop until the gateway is restarted (#31139).

Fix (two parts):

  1. Remove .unref() from the primary timer — this is the root cause. The timer must keep the event loop alive to guarantee re-arming.
  2. Add a dynamic watchdog setInterval — a safety-net that periodically checks whether any agent is overdue and re-triggers a heartbeat wake if the primary timer was somehow lost. This watchdog derives its cadence from the minimum configured heartbeat interval (minIntervalMs / 4, clamped to [15 s, 5 min]) and uses a non-zero coalesceMs to avoid tight-loop risks.

The watchdog itself uses .unref() so it won't keep the process alive on its own — only the primary timer does.

Change Type

  • Bug fix

Linked Issue/PR

Tests

Three new tests added to heartbeat-runner.scheduler.test.ts:

  • re-arms timer after successful batch and fires again — verifies 3 consecutive fires
  • watchdog recovers heartbeat when primary timer is lost — verifies watchdog recovery
  • does not fire heartbeat before agent is due — verifies no premature firing

@greptile-apps

greptile-apps Bot commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes critical bug where heartbeat timer failed to re-arm after first batch, causing all subsequent heartbeats to stop. Implements a two-part solution: removes .unref() from the primary timer to keep event loop alive, and adds a safety-net watchdog that periodically checks for overdue agents.

  • Removed .unref() call from primary setTimeout (line 1090) - the root cause of timer failures
  • Added dynamic watchdog interval that polls at minIntervalMs / 4, clamped to [15s, 5min]
  • Watchdog uses non-zero coalesce window to prevent tight-loop issues during concurrent runs
  • Watchdog itself uses .unref() so only the primary timer keeps process alive
  • Added proper cleanup for watchdog in cleanup() function
  • Three new tests verify timer re-arming, watchdog recovery, and correct timing

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix directly addresses a critical bug with a well-designed two-layer approach. The primary timer fix (removing .unref()) solves the root cause, while the watchdog provides defense-in-depth. The implementation includes proper cleanup, bounds checking, deduplication logic, and comprehensive test coverage for all scenarios. No logical errors or race conditions identified.
  • No files require special attention

Last reviewed commit: c742ce6

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c742ce6486

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/heartbeat-runner.scheduler.test.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46f94d8c8f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/heartbeat-runner.ts
root added 2 commits March 2, 2026 13:02
The TypeScript strict type checker rejects the direct cast from the
custom mock signature to `typeof globalThis.setTimeout` because the
two types do not sufficiently overlap (TS2352).  Use the standard
double-assertion pattern (`as unknown as T`) to satisfy the compiler.

Fixes CI 'Check types and lint and oxfmt' failure.
@openperf

openperf commented Mar 2, 2026

Copy link
Copy Markdown
Member Author

Thanks for the feedback, @chatgpt-codex-connector!

All automated review suggestions have been addressed. The final CI type error was resolved in the latest commit (c144637).

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@steipete

steipete commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Thanks for the PR! Multiple PRs address the same fix for #31139. Keeping #31161 as the earlier submission. Closing to reduce noise. This is an AI-assisted triage review. If we got this wrong, feel free to reopen — happy to revisit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Heartbeat timer stops after first batch - scheduleNext() timer never re-fires

2 participants