fix(heartbeat): Fix timer re-arm logic and add robust watchdog#31226
fix(heartbeat): Fix timer re-arm logic and add robust watchdog#31226openperf wants to merge 4 commits into
Conversation
Greptile SummaryFixes critical bug where heartbeat timer failed to re-arm after first batch, causing all subsequent heartbeats to stop. Implements a two-part solution: removes
Confidence Score: 5/5
Last reviewed commit: c742ce6 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c742ce6486
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46f94d8c8f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
The TypeScript strict type checker rejects the direct cast from the custom mock signature to `typeof globalThis.setTimeout` because the two types do not sufficiently overlap (TS2352). Use the standard double-assertion pattern (`as unknown as T`) to satisfy the compiler. Fixes CI 'Check types and lint and oxfmt' failure.
|
Thanks for the feedback, @chatgpt-codex-connector! All automated review suggestions have been addressed. The final CI type error was resolved in the latest commit ( |
|
To use Codex here, create a Codex account and connect to github. |
Summary
Problem: After the first heartbeat batch fires,
scheduleNext()sets a newsetTimeoutbut immediately calls.unref()on it. In certain Node.js event-loop states this causes the timer to silently never fire again, so all subsequent heartbeats stop until the gateway is restarted (#31139).Fix (two parts):
.unref()from the primary timer — this is the root cause. The timer must keep the event loop alive to guarantee re-arming.setInterval— a safety-net that periodically checks whether any agent is overdue and re-triggers a heartbeat wake if the primary timer was somehow lost. This watchdog derives its cadence from the minimum configured heartbeat interval (minIntervalMs / 4, clamped to [15 s, 5 min]) and uses a non-zerocoalesceMsto avoid tight-loop risks.The watchdog itself uses
.unref()so it won't keep the process alive on its own — only the primary timer does.Change Type
Linked Issue/PR
Tests
Three new tests added to
heartbeat-runner.scheduler.test.ts:re-arms timer after successful batch and fires again— verifies 3 consecutive fireswatchdog recovers heartbeat when primary timer is lost— verifies watchdog recoverydoes not fire heartbeat before agent is due— verifies no premature firing