Heartbeat scheduler silently dies: wake handler ownership race + lossy scheduling #15106

@joeykrug

Problem

The heartbeat scheduler silently stops firing after 1-3 hours of operation. Restarting the gateway fixes it only temporarily. The root cause differs from #14892, which was fixed by the try/catch around runOnce added in #14901.

Root Cause Analysis

1. Wake handler ownership race (heartbeat-wake.ts)

setHeartbeatWakeHandler(next) is a global setter with no ownership protection. During lifecycle reloads, a stale runner's cleanup() can clear the handler that a newer runner installed:

  1. Runner A calls setHeartbeatWakeHandler(handlerA)
  2. Runner B starts and calls setHeartbeatWakeHandler(handlerB)
  3. Runner A's cleanup runs later and calls setHeartbeatWakeHandler(null)
  4. Global handler is now null — Runner B's handler was silently cleared
  5. All future wake attempts see !active and return silently
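
The sequence above can be reproduced in isolation. This is a minimal sketch, not the code from heartbeat-wake.ts: the module state, the boolean return value of requestHeartbeatNow(), and the startRunner() helper are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the module-level state in heartbeat-wake.ts.
type WakeHandler = () => void;

let handler: WakeHandler | null = null;

// Unprotected global setter: any caller can overwrite or clear the handler.
function setHeartbeatWakeHandler(next: WakeHandler | null): void {
  handler = next;
}

// Returns true if a handler was present and ran (return value is illustrative).
function requestHeartbeatNow(): boolean {
  if (!handler) return false; // no active handler: the wake is silently dropped
  handler();
  return true;
}

// Hypothetical runner: installs its handler on start, clears it on cleanup.
function startRunner(label: string, log: string[]): { cleanup: () => void } {
  setHeartbeatWakeHandler(() => {
    log.push(`${label} ran`);
  });
  return { cleanup: () => setHeartbeatWakeHandler(null) };
}

const log: string[] = [];
const runnerA = startRunner("A", log); // step 1
startRunner("B", log);                 // step 2: B replaces A's handler
runnerA.cleanup();                     // step 3: stale cleanup clears B's handler
const woke = requestHeartbeatNow();    // steps 4-5: nothing runs
console.log(woke, log);                // woke is false and log is empty
```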

2. Lossy wake scheduling (heartbeat-wake.ts)

schedule(coalesceMs) has if (timer) return; — new schedule requests are silently dropped while any timer exists, even if the new request should fire sooner. This makes the wake layer brittle under load/retries.
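
A reduced sketch of that guard shows the effect. The onFire parameter and the boolean return value are assumptions added so the dropped request is observable; the real schedule() signature may differ.

```typescript
// Hypothetical reduction of the schedule() guard described above.
let timer: ReturnType<typeof setTimeout> | null = null;

// Returns whether the request was accepted (illustrative only).
function schedule(coalesceMs: number, onFire: () => void): boolean {
  if (timer) return false; // lossy guard: dropped even if this request is sooner
  timer = setTimeout(() => {
    timer = null;
    onFire();
  }, coalesceMs);
  return true;
}

const acceptedSlow = schedule(200, () => {}); // arms a 200 ms timer
const acceptedFast = schedule(10, () => {});  // dropped: must wait the full 200 ms
console.log(acceptedSlow, acceptedFast);      // true false
```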

3. One-shot interval depends on wake callback executing (heartbeat-runner.ts)

scheduleNext() creates a single setTimeout that calls requestHeartbeatNow(). The next interval is only armed inside run() via scheduleNext(). If the wake handler is null (root cause 1 above), run() never executes, so scheduleNext() never re-arms and heartbeats stop permanently.
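
The broken chain can be modeled without real timers. In this sketch the pending setTimeout is replaced by a manually fired callback so the behavior is deterministic; fireTimer(), isArmed(), and the runs counter are hypothetical helpers, not part of heartbeat-runner.ts.

```typescript
type WakeHandler = () => void;

let wakeHandler: WakeHandler | null = null;
let pendingTimer: (() => void) | null = null; // stand-in for a real setTimeout
let runs = 0;

function requestHeartbeatNow(): void {
  if (!wakeHandler) return; // handler cleared: run() is never reached
  wakeHandler();
}

function scheduleNext(): void {
  pendingTimer = () => requestHeartbeatNow();
}

function run(): void {
  runs += 1;
  scheduleNext(); // the only place the next interval is armed
}

// Fires the pending fake timer, if any, and reports whether one was pending.
function fireTimer(): boolean {
  const cb = pendingTimer;
  if (!cb) return false;
  pendingTimer = null;
  cb();
  return true;
}

function isArmed(): boolean {
  return pendingTimer !== null;
}

wakeHandler = run;
scheduleNext();
fireTimer();                   // run #1 executes and re-arms the timer
wakeHandler = null;            // handler cleared by a stale cleanup
const fired = fireTimer();     // the timer fires, but run() is skipped
const armedAfter = isArmed();  // false: nothing will ever fire again
console.log(fired, runs, armedAfter);
```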

4. Timer state leak in scheduleNext callback

The scheduleNext() setTimeout callback calls requestHeartbeatNow() without clearing state.timer first, so state.timer still references a timer that has already fired.

Reproduction

This is a timing-dependent race that occurs during gateway lifecycle events (config reloads, session compaction triggers, etc.) where runner instances briefly overlap. It's reliably observed after 1-3 hours of uptime.

Suggested Fix

  1. Generation-based handler ownership: setHeartbeatWakeHandler() returns a disposer function bound to a generation counter. Stale disposers cannot clear newer handlers.

  2. Timer preemption in schedule(): Instead of if (timer) return;, compare due times. Earlier requests preempt later existing timers.

  3. state.timer = null in scheduleNext callback: Clear timer ref before calling requestHeartbeatNow().

  4. Idempotent cleanup with stopped guard: Prevent double-cleanup and add state.stopped check at top of run().
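
Fixes 1 and 2 could look roughly like the sketch below. It is written under assumptions: the generation and dueAt variables, the hasHandler() and currentDueAt() helpers, and the onFire parameter are illustrative names, and the incoming PR may structure this differently.

```typescript
type WakeHandler = () => void;

let handler: WakeHandler | null = null;
let generation = 0;

// Fix 1: installing a handler returns a disposer bound to its generation.
function setHeartbeatWakeHandler(next: WakeHandler): () => void {
  const mine = ++generation;
  handler = next;
  return () => {
    if (generation !== mine) return; // a newer handler owns the slot: no-op
    handler = null;
  };
}

function hasHandler(): boolean {
  return handler !== null;
}

let timer: ReturnType<typeof setTimeout> | null = null;
let dueAt = 0;

// Fix 2: an earlier request preempts a later existing timer.
function schedule(coalesceMs: number, onFire: () => void): void {
  const requestedDue = Date.now() + coalesceMs;
  if (timer) {
    if (requestedDue >= dueAt) return; // existing timer already fires soon enough
    clearTimeout(timer);               // otherwise replace it with the earlier one
  }
  dueAt = requestedDue;
  timer = setTimeout(() => {
    timer = null;
    onFire();
  }, coalesceMs);
}

function currentDueAt(): number | null {
  return timer ? dueAt : null;
}

// Ownership: the stale disposer from runner A cannot clear runner B's handler.
const disposeA = setHeartbeatWakeHandler(() => {});
const disposeB = setHeartbeatWakeHandler(() => {});
disposeA();
const stillInstalled = hasHandler(); // true
disposeB();
const clearedByOwner = !hasHandler(); // true

// Preemption: the 10 ms request replaces the pending 500 ms timer.
schedule(500, () => {});
schedule(10, () => {});
const due = currentDueAt();
```

Binding the disposer to a generation counter keeps the setter global while making cleanup order irrelevant: only the most recent installer can clear the slot.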

PR incoming with all fixes + tests.

Related: #14892, #14901

Labels: bug