fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161
fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161Sid-Qin wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d8da5420c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const now = Date.now(); | ||
| for (const agent of state.agents.values()) { | ||
| if (now >= agent.nextDueMs) { | ||
| requestHeartbeatNow({ reason: "interval", coalesceMs: 0 }); |
There was a problem hiding this comment.
Avoid zero-delay watchdog wake when a run is in flight
Using coalesceMs: 0 here can create a tight reschedule loop when a heartbeat run takes longer than one watchdog tick (60s): while runOnce is still executing, nextDueMs is still overdue, so the watchdog queues another wake, and schedule() in src/infra/heartbeat-wake.ts keeps re-arming immediate timers when running === true (schedule(delay, kind) with delay=0). In that scenario, the process can spin hot until the long run finishes, causing unnecessary CPU/event-loop pressure in production workloads with slow model/network calls.
Useful? React with 👍 / 👎.
Greptile SummaryFixes critical bug where heartbeat timers fired once then never re-armed, causing agent heartbeats to stop after first execution. Key Changes
Implementation DetailsThe watchdog uses
Deduplication in Risk AssessmentLow risk. All edge cases properly handled:
Confidence Score: 5/5
Last reviewed commit: 5d8da54 |
Summary
setTimeoutfails to fire. Removed.unref()from the primary timer. Added test for consecutive timer re-arming.Change Type
Scope
Linked Issue/PR
User-visible / Behavior Changes
None. Heartbeats now reliably fire on schedule in all environments (Docker, Linux, macOS).
Security Impact
Repro + Verification
Environment
heartbeat.every: "10m"with multiple agentsSteps
heartbeat.every: "10m"Expected
Actual (before fix)
Evidence
re-arms timer after successful batch and fires againpasses (3 consecutive fires)Human Verification
Compatibility / Migration
Failure Recovery
src/infra/heartbeat-runner.tsRisks and Mitigations
requestHeartbeatNowcoalesces duplicate wake requests viaqueuePendingWakeReasonwhich deduplicates by target keyMade with Cursor