Problem
The heartbeat scheduler silently stops firing after 1-3 hours of operation. Restarting the gateway fixes it only temporarily. The root cause is different from #14892 (which was fixed by #14901's try/catch around runOnce).
Root Cause Analysis
1. Wake handler ownership race (heartbeat-wake.ts)
setHeartbeatWakeHandler(next) is a global setter with no ownership protection. During lifecycle reloads, a stale runner's cleanup() can clear the handler that a newer runner installed:
- Runner A calls setHeartbeatWakeHandler(handlerA)
- Runner B starts and calls setHeartbeatWakeHandler(handlerB)
- Runner A's cleanup runs later and calls setHeartbeatWakeHandler(null)
- Global handler is now null; Runner B's handler was silently cleared
- All future wake attempts see !active and return silently
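The sequence above can be reproduced in isolation. The following is a minimal sketch, not the actual contents of heartbeat-wake.ts: the module-level `handler` variable, the `WakeHandler` type, and the boolean return value of requestHeartbeatNow() are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the module-level state in heartbeat-wake.ts.
type WakeHandler = (reason: string) => void;

let handler: WakeHandler | null = null;

function setHeartbeatWakeHandler(next: WakeHandler | null): void {
  handler = next; // no ownership check: any caller can overwrite or clear
}

// Returns false when no handler is installed (the silent "!active" path).
function requestHeartbeatNow(reason: string): boolean {
  if (!handler) return false;
  handler(reason);
  return true;
}

const fired: string[] = [];

// Runner A installs its handler.
setHeartbeatWakeHandler((r) => fired.push(`A:${r}`));
// Runner B starts during a reload and installs its own handler.
setHeartbeatWakeHandler((r) => fired.push(`B:${r}`));
// Runner A's cleanup runs late and clears the global slot.
setHeartbeatWakeHandler(null);

// Runner B is still alive, but its wake request is dropped.
const delivered = requestHeartbeatNow("interval");
console.log({ delivered, fired });
```

In this sketch the final wake request is not delivered and `fired` stays empty, even though Runner B never cleaned up.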
2. Lossy wake scheduling (heartbeat-wake.ts)
schedule(coalesceMs) has if (timer) return; — new schedule requests are silently dropped while any timer exists, even if the new request should fire sooner. This makes the wake layer brittle under load/retries.
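One possible shape for a non-lossy schedule() is sketched below. It assumes the wake layer tracks an absolute due time next to the timer; `dueAt`, `fire()`, `pendingDueAt()`, and `cancel()` are hypothetical names, not identifiers from the real module.

```typescript
let timer: ReturnType<typeof setTimeout> | null = null;
let dueAt: number | null = null; // absolute due time of the pending timer
let wakeCount = 0;

function fire(): void {
  timer = null;
  dueAt = null;
  wakeCount += 1; // stand-in for invoking the wake handler
}

function schedule(coalesceMs: number): void {
  const delay = Math.max(0, coalesceMs);
  const requestedDueAt = Date.now() + delay;
  if (timer !== null && dueAt !== null) {
    // The pending timer already fires at or before the requested time: keep it.
    if (dueAt <= requestedDueAt) return;
    // The new request is sooner, so it preempts the later timer.
    clearTimeout(timer);
  }
  dueAt = requestedDueAt;
  timer = setTimeout(fire, delay);
}

function pendingDueAt(): number | null {
  return dueAt;
}

function cancel(): void {
  if (timer !== null) clearTimeout(timer);
  timer = null;
  dueAt = null;
}

schedule(60_000); // a slow retry is pending
const slowDueAt = pendingDueAt();
schedule(50); // an urgent wake arrives and preempts it
const fastDueAt = pendingDueAt();
schedule(60_000); // a later request must not push the wake back
console.log({ preempted: fastDueAt !== slowDueAt });
```

With the original if (timer) return; guard, the second call would have been dropped and the wake would have waited for the 60-second timer.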
3. One-shot interval depends on wake callback executing (heartbeat-runner.ts)
scheduleNext() creates a single setTimeout that calls requestHeartbeatNow(). The next interval is only armed inside run() via scheduleNext(). If the wake handler is null (root cause 1 above), run() never executes, so scheduleNext() never re-arms and heartbeats stop permanently.
4. Timer state leak in scheduleNext callback
The scheduleNext() setTimeout callback calls requestHeartbeatNow() without clearing state.timer first, leaving stale timer state.
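As a rough illustration of causes 3 and 4, the sketch below shows a scheduleNext() whose callback drops the timer ref before requesting a wake, plus a stopped guard. The `RunnerState` shape, `onIntervalElapsed()`, `hasPendingTimer()`, and `stop()` are assumptions for this example; the real heartbeat-runner.ts may be structured differently.

```typescript
interface RunnerState {
  timer: ReturnType<typeof setTimeout> | null;
  stopped: boolean;
}

const state: RunnerState = { timer: null, stopped: false };
let wakeRequests = 0;

// Hypothetical stand-in; the real function lives in heartbeat-wake.ts.
function requestHeartbeatNow(): void {
  wakeRequests += 1;
}

// Body of the setTimeout callback, named so it can be exercised directly.
function onIntervalElapsed(): void {
  state.timer = null; // the timer has fired: drop the ref before anything else
  if (state.stopped) return;
  requestHeartbeatNow();
}

function scheduleNext(intervalMs: number): void {
  if (state.stopped) return;
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = setTimeout(onIntervalElapsed, intervalMs);
}

function hasPendingTimer(): boolean {
  return state.timer !== null;
}

// Idempotent: a second call finds nothing left to clear.
function stop(): void {
  state.stopped = true;
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = null;
}
```

Because the ref is cleared first, state.timer being non-null always means a timer is genuinely pending, which lets a later scheduleNext() or stop() reason about it safely.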
Reproduction
This is a timing-dependent race that occurs during gateway lifecycle events (config reloads, session compaction triggers, etc.) where runner instances briefly overlap. It's reliably observed after 1-3 hours of uptime.
Suggested Fix
- Generation-based handler ownership: setHeartbeatWakeHandler() returns a disposer function bound to a generation counter. Stale disposers cannot clear newer handlers.
- Timer preemption in schedule(): Instead of if (timer) return;, compare due times. Earlier requests preempt later existing timers.
- state.timer = null in scheduleNext callback: Clear timer ref before calling requestHeartbeatNow().
- Idempotent cleanup with stopped guard: Prevent double-cleanup and add state.stopped check at top of run().
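The first item could look roughly like the sketch below. This is an illustration under assumptions, not the code from the upcoming PR: the `generation` counter name, the `hasWakeHandler()` helper, and the boolean return value of requestHeartbeatNow() are hypothetical.

```typescript
type WakeHandler = (reason: string) => void;

let handler: WakeHandler | null = null;
let generation = 0;

// Installs a handler and returns a disposer bound to this installation.
function setHeartbeatWakeHandler(next: WakeHandler): () => void {
  generation += 1;
  const ownGeneration = generation;
  handler = next;
  return () => {
    // A newer handler was installed since: this disposer is stale, ignore it.
    if (generation !== ownGeneration) return;
    handler = null;
  };
}

function hasWakeHandler(): boolean {
  return handler !== null;
}

function requestHeartbeatNow(reason: string): boolean {
  if (!handler) return false;
  handler(reason);
  return true;
}

const fired: string[] = [];
const disposeA = setHeartbeatWakeHandler((r) => fired.push(`A:${r}`));
const disposeB = setHeartbeatWakeHandler((r) => fired.push(`B:${r}`));

disposeA(); // stale cleanup from Runner A: must not clear B's handler
const delivered = requestHeartbeatNow("interval");
console.log({ delivered, fired });
```

Here Runner A's late cleanup becomes a no-op, so the wake reaches Runner B's handler. Calling disposeB() more than once is also harmless, which fits the idempotent-cleanup item.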
PR incoming with all fixes + tests.
Related: #14892, #14901