Description
The heartbeat runner's internal timer fires once after the configured interval and runs all due agents sequentially, but the re-armed timer never fires. After the first batch completes, no further heartbeat runs are triggered until the gateway process is restarted.
Version
OpenClaw 2026.2.21, Node.js 22.22.0, Linux (Docker)
Steps to Reproduce
- Configure multiple agents with heartbeat.every: "10m"
- Start the gateway (openclaw gateway)
- Observe heartbeat runs fire at the first 10-minute mark
- Wait for the next 10-minute mark — no heartbeat runs fire
- Only cron module ticks continue in the log file
Expected Behavior
Heartbeat runs should fire every 10 minutes continuously for all configured agents.
Actual Behavior
- Gateway starts, heartbeat timer is armed with intervalMs: 600000
- After exactly one interval (10 minutes), all due agents run sequentially
- After the batch completes, scheduleNext() is called but the timer never fires again
- The cron module continues ticking normally (visible in logs), confirming the event loop is alive
- Manual trigger via openclaw system event --text "wake" --mode now works and fires heartbeats immediately
- The gateway process itself remains healthy and responsive
Observed Pattern
Tested across multiple restarts (SIGUSR1 and container restart):
- Restart at 17:57 UTC: Heartbeats fired from 18:08 to 00:44 (39 cycles), then stopped
- Restart at 00:52 UTC: Heartbeats fired once at 01:02, then stopped
- Manual system event --mode now: Always works, heartbeats fire immediately
The first restart ran for ~6 hours before stopping, while subsequent restarts only ran one cycle. This inconsistency suggests a race condition or state corruption.
Root Cause Analysis
Traced through the minified source (health-Bg0E--Yl.js and subagent-registry-cG4lnv2V.js):
Two-layer timer architecture:
- Heartbeat Runner (scheduleNext() in health-Bg0E--Yl.js): Sets state.timer = setTimeout(() => requestHeartbeatNow(...), delay) with .unref()
- Subagent Registry (schedule() in subagent-registry-cG4lnv2V.js): Receives the wake request, queues it, and calls the handler (run())
After a batch completes:
- run() calls scheduleNext() → sets new timer for now + intervalMs with .unref()
- Control returns to schedule()'s finally block → checks pendingWakes.size > 0 || scheduled → both false → no re-schedule at the registry level
- The heartbeat runner's internal state.timer should fire after 10 minutes, but it never does
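For readers who have not opened the minified bundles, the flow above can be sketched roughly as follows. This is a simplified reconstruction from the description, not the actual OpenClaw source: createWakeRegistry, createHeartbeatRunner, runBatch, setTimer and clearTimer are invented names, the handler is synchronous here (the real run() is presumably async), and the timer functions are injectable only so the flow can be exercised without waiting 10 minutes.

```javascript
// Hypothetical reconstruction of the two layers; not the real OpenClaw code.

// Layer 2: queues wake requests and calls the handler (run()).
function createWakeRegistry(handler) {
  const pendingWakes = new Set();
  let running = false;
  let scheduled = false;

  function schedule() {
    if (running) {
      scheduled = true; // a wake arrived mid-batch; handle it afterwards
      return;
    }
    running = true;
    scheduled = false;
    try {
      const reasons = [...pendingWakes];
      pendingWakes.clear();
      handler(reasons);
    } finally {
      running = false;
      // Registry-level re-schedule only for wakes queued during the batch.
      if (pendingWakes.size > 0 || scheduled) schedule();
    }
  }

  return {
    requestHeartbeatNow(reason) {
      pendingWakes.add(reason);
      schedule();
    },
  };
}

// Layer 1: owns state.timer and re-arms it after every batch.
function createHeartbeatRunner({
  intervalMs,
  runBatch,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  const state = { timer: null, runs: 0 };
  const registry = createWakeRegistry(run);

  function scheduleNext() {
    if (state.timer) clearTimer(state.timer);
    state.timer = setTimer(
      () => registry.requestHeartbeatNow("interval"),
      intervalMs
    );
    // unref() so this timer alone never keeps the process alive.
    if (typeof state.timer.unref === "function") state.timer.unref();
  }

  function run(reasons) {
    try {
      runBatch(reasons);
      state.runs += 1;
    } finally {
      scheduleNext(); // re-arm even if the batch threw
    }
  }

  scheduleNext();
  return { state, wake: (reason) => registry.requestHeartbeatNow(reason) };
}
```

In this sketch the timer is re-armed on every pass, so a stall like the one reported would have to come from something the sketch does not model, for example the timer handle being cleared or overwritten elsewhere, or run() never reaching scheduleNext().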
The .unref() on the timer shouldn't cause issues since the event loop has plenty of other active handles (HTTP server, WebSocket, Telegram polling, cron timer). However, the timer consistently fails to fire after the first batch.
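To rule .unref() in or out, a small probe along these lines could be dropped into a running Node.js process. It is a diagnostic sketch, not part of OpenClaw: armProbeTimer and onFire are made-up names. Timeout.unref() and Timeout.hasRef() are standard Node.js APIs; unref() only affects whether the timer keeps the process alive, not whether it fires while other handles are active.

```javascript
// Diagnostic sketch: arm an unref'd timer the same way the runner is
// described to, and report how late it fired (if it fires at all).
function armProbeTimer(delayMs, onFire) {
  const armedAt = Date.now();
  const timer = setTimeout(() => {
    onFire({ delayMs, lateByMs: Date.now() - armedAt - delayMs });
  }, delayMs);
  timer.unref(); // same call the heartbeat runner reportedly makes
  return timer;
}

// Example (commented out so the snippet has no side effects when loaded):
// const probe = armProbeTimer(600000, (info) => console.log("probe fired", info));
// console.log("probe hasRef:", probe.hasRef()); // false after unref()
```

If the probe fires on schedule while state.timer does not, the problem is more likely in how the heartbeat timer is managed than in .unref() itself.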
Workaround
External watchdog script that runs openclaw system event --text "watchdog-heartbeat" --mode now every 10 minutes from within the container. This bypasses the internal timer entirely and reliably triggers heartbeat runs.
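A minimal Node.js version of such a watchdog might look like the sketch below. The openclaw command line is the one quoted above; the function names, the 10-minute default and the error handling are illustrative choices, and it assumes the openclaw binary is on PATH inside the container.

```javascript
const { spawn } = require("node:child_process");

// Arguments taken from the workaround command above.
const WATCHDOG_ARGS = [
  "system", "event",
  "--text", "watchdog-heartbeat",
  "--mode", "now",
];

// Runs the CLI once. spawnFn is injectable so the call can be checked
// without actually invoking openclaw.
function triggerHeartbeat(spawnFn = spawn) {
  const child = spawnFn("openclaw", WATCHDOG_ARGS, { stdio: "inherit" });
  child.on("error", (err) => {
    console.error("watchdog: failed to run openclaw:", err.message);
  });
  return child;
}

// Fires the trigger every intervalMs; returns the interval handle so the
// caller can stop it with clearInterval().
function startWatchdog(intervalMs = 10 * 60 * 1000, spawnFn = spawn) {
  return setInterval(() => triggerHeartbeat(spawnFn), intervalMs);
}

// To run as a standalone script, call: startWatchdog();
```

The interval is intentionally not unref'd, so the script stays alive on its own when started with node.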
Environment
- OS: Ubuntu 24.04 (Docker container based on Debian Bookworm)
- Node.js: 22.22.0
- OpenClaw: 2026.2.21
- 11 agents configured (mix of 10m and 60m heartbeat intervals)
- LLM: MiniMax-M2.5
Log Evidence
# Heartbeat starts
00:52:51 HEARTBEAT TIMER: {"intervalMs": 600000}
# First (and only) batch fires at 01:02
01:02:51 START googler [heartbeat]
01:03:09 DONE googler (18087ms)
01:03:09 START communicator [heartbeat]
...
01:05:25 DONE podcast-fileop (48133ms)
# After batch: only cron ticks, no more heartbeats
01:05:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true} # cron module
01:06:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true} # cron module
# ... repeats indefinitely, no heartbeat runs
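To spot the stall from logs automatically, something like the following could be used. It is only a sketch that assumes the condensed line format shown in this excerpt (HH:MM:SS prefix, "START <agent> [heartbeat]"); the real log format may differ, the function names are made up, and it does not handle logs that cross midnight.

```javascript
const TIMESTAMP = /^(\d{2}):(\d{2}):(\d{2})/;
const HEARTBEAT_START = /^\d{2}:\d{2}:\d{2} START \S+ \[heartbeat\]/;

// Converts a leading HH:MM:SS into seconds since midnight, or null.
function toSeconds(line) {
  const m = TIMESTAMP.exec(line);
  if (!m) return null;
  return Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]);
}

// Seconds between the last heartbeat START and the last timestamped line.
// Returns null when no heartbeat START was seen at all.
function secondsSinceLastHeartbeat(lines) {
  let lastHeartbeat = null;
  let lastSeen = null;
  for (const line of lines) {
    const t = toSeconds(line);
    if (t === null) continue; // comments, "..." and other untimestamped lines
    lastSeen = t;
    if (HEARTBEAT_START.test(line)) lastHeartbeat = t;
  }
  return lastHeartbeat === null ? null : lastSeen - lastHeartbeat;
}
```

A value well above the configured interval (600 seconds here) while cron ticks keep arriving would match the behavior reported in this issue.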