Heartbeat timer stops after first batch - scheduleNext() timer never re-fires #31139

@brainyglum

Description

The heartbeat runner's internal timer fires once after the configured interval and runs all due agents sequentially, but it never fires again. scheduleNext() is called when the first batch completes, yet no further heartbeat runs are triggered until the gateway process is restarted.

Version

OpenClaw 2026.2.21, Node.js 22.22.0, Linux (Docker)

Steps to Reproduce

  1. Configure multiple agents with heartbeat.every: "10m"
  2. Start the gateway (openclaw gateway)
  3. Observe heartbeat runs fire at the first 10-minute mark
  4. Wait for the next 10-minute mark — no heartbeat runs fire
  5. Only cron module ticks continue in the log file

Expected Behavior

Heartbeat runs should fire every 10 minutes continuously for all configured agents.

Actual Behavior

  • Gateway starts, heartbeat timer is armed with intervalMs: 600000
  • After exactly one interval (10 minutes), all due agents run sequentially
  • After the batch completes, scheduleNext() is called but the timer never fires again
  • The cron module continues ticking normally (visible in logs), confirming the event loop is alive
  • Manual trigger via openclaw system event --text "wake" --mode now works and fires heartbeats immediately
  • The gateway process itself remains healthy and responsive

Observed Pattern

Tested across multiple restarts (SIGUSR1 and container restart):

  • Restart at 17:57 UTC: Heartbeats fired from 18:08 to 00:44 (39 cycles), then stopped
  • Restart at 00:52 UTC: Heartbeats fired once at 01:02, then stopped
  • Manual system event --mode now: Always works, heartbeats fire immediately

The first restart ran for about 6.5 hours (18:08 to 00:44) before stopping, while subsequent restarts only ran one cycle. This inconsistency suggests a race condition or state corruption.

Root Cause Analysis

Traced through the minified source (health-Bg0E--Yl.js and subagent-registry-cG4lnv2V.js):

Two-layer timer architecture:

  1. Heartbeat Runner (scheduleNext() in health-Bg0E--Yl.js): Sets state.timer = setTimeout(() => requestHeartbeatNow(...), delay) with .unref()
  2. Subagent Registry (schedule() in subagent-registry-cG4lnv2V.js): Receives the wake request, queues it, and calls the handler (run())

After a batch completes:

  • run() calls scheduleNext() → sets new timer for now + intervalMs with .unref()
  • Control returns to schedule()'s finally block → checks pendingWakes.size > 0 || scheduled → both false → no re-schedule at the registry level
  • The heartbeat runner's internal state.timer should fire after 10 minutes, but it never does

The .unref() on the timer shouldn't cause issues since the event loop has plenty of other active handles (HTTP server, WebSocket, Telegram polling, cron timer). However, the timer consistently fails to fire after the first batch.
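
To help isolate the problem, here is a minimal standalone sketch of the re-arm pattern described above. It is not the OpenClaw source: createRunner, onFire, and the shape of state are hypothetical names, and the timer calls the handler directly instead of going through requestHeartbeatNow() and the registry. It only illustrates that an unref'd setTimeout that re-arms itself keeps firing as long as some other handle keeps the event loop alive.

```javascript
// Hypothetical stand-in for the heartbeat runner; not the OpenClaw source.
function createRunner(intervalMs, onFire) {
  const state = { timer: null, fired: 0, stopped: false };

  function scheduleNext() {
    if (state.stopped) return;
    // Clear any stale timer so at most one is ever pending.
    if (state.timer) clearTimeout(state.timer);
    state.timer = setTimeout(run, intervalMs);
    // unref(): this timer alone will not keep the process alive.
    state.timer.unref();
  }

  function run() {
    state.timer = null;
    state.fired += 1;
    try {
      onFire(state.fired);
    } catch (err) {
      // A failing batch is logged and must not prevent re-arming.
      console.error("heartbeat handler failed:", err);
    } finally {
      scheduleNext();
    }
  }

  function stop() {
    state.stopped = true;
    if (state.timer) clearTimeout(state.timer);
    state.timer = null;
  }

  scheduleNext();
  return { state, stop };
}
```

Run next to any ref'd handle (an HTTP server, or a plain setInterval), this sketch keeps incrementing state.fired. If the real runner behaves differently, logging whether state.timer is still set and what state.timer.hasRef() returns right after scheduleNext() might show whether the timer is being cleared or overwritten by a later code path.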

Workaround

External watchdog script that runs openclaw system event --text "watchdog-heartbeat" --mode now every 10 minutes from within the container. This bypasses the internal timer entirely and reliably triggers heartbeat runs.
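
For reference, a watchdog along these lines could look like the sketch below. The command and arguments are the ones quoted above; everything else (the function names, the injectable trigger parameter, running it as a Node script instead of a shell loop) is an assumption for illustration and not necessarily how the original watchdog is written.

```javascript
const { spawn } = require("node:child_process");

// Command and arguments as quoted in this issue.
const WATCHDOG_CMD = "openclaw";
const WATCHDOG_ARGS = ["system", "event", "--text", "watchdog-heartbeat", "--mode", "now"];
const INTERVAL_MS = 10 * 60 * 1000; // matches the 10m heartbeat interval

// Run the wake command once. Resolves with the exit code,
// or null if the process could not be started.
function triggerHeartbeat(cmd = WATCHDOG_CMD, args = WATCHDOG_ARGS) {
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { stdio: "inherit" });
    child.on("error", (err) => {
      console.error("watchdog: failed to start command:", err.message);
      resolve(null);
    });
    child.on("close", (code) => resolve(code));
  });
}

// Start the loop. `trigger` is injectable so the loop can be
// exercised without the openclaw CLI being installed.
function startWatchdog(intervalMs = INTERVAL_MS, trigger = triggerHeartbeat) {
  const timer = setInterval(() => {
    Promise.resolve()
      .then(trigger)
      .catch((err) => console.error("watchdog: trigger failed:", err));
  }, intervalMs);
  // The returned function stops the loop.
  return () => clearInterval(timer);
}
```

Calling startWatchdog() with no arguments from a script inside the container starts the 10-minute loop. The setInterval handle is deliberately left ref'd so the watchdog process stays alive on its own.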

Environment

  • OS: Ubuntu 24.04 (Docker container based on Debian Bookworm)
  • Node.js: 22.22.0
  • OpenClaw: 2026.2.21
  • 11 agents configured (mix of 10m and 60m heartbeat intervals)
  • LLM: MiniMax-M2.5

Log Evidence

# Heartbeat starts
00:52:51 HEARTBEAT TIMER: {"intervalMs": 600000}

# First (and only) batch fires at 01:02
01:02:51 START googler [heartbeat]
01:03:09 DONE  googler (18087ms)
01:03:09 START communicator [heartbeat]
...
01:05:25 DONE  podcast-fileop (48133ms)

# After batch: only cron ticks, no more heartbeats
01:05:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true}  # cron module
01:06:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true}  # cron module
# ... repeats indefinitely, no heartbeat runs
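
The nextAt values in the cron lines look like Unix timestamps in milliseconds (an assumption based on their magnitude). A small sketch for decoding them; describeTick is a hypothetical helper, and the sample line is copied from the log above:

```javascript
// Assumes nextAt is a Unix timestamp in milliseconds.
function describeTick(line) {
  // Extract the JSON object embedded in the log line.
  const start = line.indexOf("{");
  const end = line.lastIndexOf("}");
  const tick = JSON.parse(line.slice(start, end + 1));
  return {
    nextAtIso: new Date(tick.nextAt).toISOString(),
    delayMs: tick.delayMs,
    clamped: tick.clamped,
  };
}

const sample =
  '01:05:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true}  # cron module';
console.log(describeTick(sample));
```

For the sample line, nextAt decodes to 2026-03-02T04:00:00.000Z. That would be consistent with a cron job due several hours later whose sleep is capped at 60 seconds per tick, so these lines reflect the cron scheduler only and say nothing about the heartbeat timer.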
