fix: guarantee heartbeat timer re-arm with try/finally#52270
Conversation
Greptile SummaryThis PR fixes a long-standing heartbeat timer liveness bug by wrapping the Key changes:
Confidence Score: 5/5
Reviews (1): Last reviewed commit: "fix: wrap heartbeat run() in try/finally..." | Re-trigger Greptile |
|
@steipete this is a legitimate fix apologies for the tag but i feel like resolving these discord / telegram flapping issues are of utmost importance. Especially after a major Node version bump when you jumped from Node version 22 to 24 |
d0ef0b7 to
8810bac
Compare
8810bac to
42df914
Compare
The heartbeat scheduler silently dies after the first batch because run() has multiple exit paths that skip scheduleNext(). Once any path returns without re-arming the timer, heartbeats stop permanently. The most common trigger is the requests-in-flight early return, but any unhandled rejection in the async chain between runOnce() and the manual scheduleNext() calls can also kill the timer. Fix: wrap the run() body in try/finally so scheduleNext() is called on every exit path. All manual scheduleNext() calls inside are removed — the finally block is the single re-arm point. The requests-in-flight case is excluded via a flag because the wake layer (heartbeat-wake.ts) already handles retry scheduling for this case with DEFAULT_RETRY_MS. Calling scheduleNext() here would register a 0ms timer that races with the wake layer's 1s retry. Fixes openclaw#31139 Fixes openclaw#45772 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
42df914 to
cd0bcc2
Compare
|
Merged via squash.
Thanks @MiloStack! |
Summary
run()body intry/finallysoscheduleNext()is called on every exit pathscheduleNext()calls insiderun()— thefinallyblock is the single re-arm pointrequests-in-flightexemption (wake layer handles retry for this case)Root Cause
run()has multiple exit paths that skipscheduleNext(), permanently killing the heartbeat timer. The most common trigger is therequests-in-flightearly return at line 1141, but any unhandled rejection in the async chain can also break the re-arm.The cron system doesn't have this bug because
armTimer()is always called in theonTimer()catch handler and uses a single-layer timer design.Why not remove
.unref()?PRs #31226 and #31161 proposed removing
.unref()from the timer. That's the wrong fix —.unref()is the correct Node.js pattern for background schedulers. The timer dies becausescheduleNext()is never called, not because.unref()causes the timer to be GC'd. We confirmed this by patching out.unref()and observing the timer still doesn't fire.requests-in-flighthandlingPR #47107 proposed a similar
try/finallybut was flagged by Greptile for a regression:finally { scheduleNext() }causes a tight spin loop when requests are in flight (delay=0 sincenextDueMsalready passed).This PR uses a
requestsInFlightflag to skip the re-arm for that specific case. The wake layer (heartbeat-wake.ts) already handles retry scheduling viaschedule(DEFAULT_RETRY_MS, "retry")— callingscheduleNext()would race with the wake layer's 1s retry and bypass the cooldown.The existing tests (
heartbeat-runner.scheduler.test.ts) cover both the retry and schedule-preservation behavior forrequests-in-flight.Testing
Tested on OpenClaw 2026.3.14-dev, Node 24.14.0, Docker, 5 agents:
Fixes #31139
Fixes #45772