Heartbeat timer stops after first batch - scheduleNext() timer never re-fires #31139

## Description

The heartbeat runner's internal timer fires once after the configured interval and runs all due agents sequentially, but it never fires again. `scheduleNext()` is called when the first batch completes, yet no further heartbeat runs are triggered until the gateway process is restarted.

## Version

OpenClaw `2026.2.21`, Node.js 22.22.0, Linux (Docker)

## Steps to Reproduce

1. Configure multiple agents with `heartbeat.every: "10m"`
2. Start the gateway (`openclaw gateway`)
3. Observe heartbeat runs fire at the first 10-minute mark
4. Wait for the next 10-minute mark: no heartbeat runs fire
5. Check the log file: only cron module ticks continue

## Expected Behavior

Heartbeat runs should fire every 10 minutes continuously for all configured agents.

## Actual Behavior

- Gateway starts, heartbeat timer is armed with `intervalMs: 600000`
- After exactly one interval (10 minutes), all due agents run sequentially
- After the batch completes, `scheduleNext()` is called but the timer never fires again
- The cron module continues ticking normally (visible in logs), confirming the event loop is alive
- Manual trigger via `openclaw system event --text "wake" --mode now` works and fires heartbeats immediately
- The gateway process itself remains healthy and responsive

## Observed Pattern

Tested across multiple restarts (SIGUSR1 and container restart):

- **Restart at 17:57 UTC**: Heartbeats fired from 18:08 to 00:44 (39 cycles), then stopped
- **Restart at 00:52 UTC**: Heartbeats fired once at 01:02, then stopped
- **Manual `system event --mode now`**: Always works, heartbeats fire immediately

After the first restart, heartbeats kept firing for about 6.5 hours (18:08 to 00:44) before stopping, while subsequent restarts only ran one cycle. This inconsistency suggests a race condition or state corruption rather than a deterministic failure to re-arm.

## Root Cause Analysis

Traced through the minified source (`health-Bg0E--Yl.js` and `subagent-registry-cG4lnv2V.js`):

**Two-layer timer architecture:**

1. **Heartbeat Runner** (`scheduleNext()` in `health-Bg0E--Yl.js`): Sets `state.timer = setTimeout(() => requestHeartbeatNow(...), delay)` with `.unref()`
2. **Subagent Registry** (`schedule()` in `subagent-registry-cG4lnv2V.js`): Receives the wake request, queues it, and calls the handler (`run()`)

**After a batch completes:**
- `run()` calls `scheduleNext()` → sets new timer for `now + intervalMs` with `.unref()`
- Control returns to `schedule()`'s `finally` block → checks `pendingWakes.size > 0 || scheduled` → both false → no re-schedule at the registry level
- The heartbeat runner's internal `state.timer` should fire after 10 minutes, but it never does
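
For reference, the expected re-arm behaviour can be modelled in a few lines. This is only a sketch reconstructed from the description above, not the actual minified code: it collapses the registry layer into a direct call, models the batch as a synchronous callback, and `createHeartbeatRunner`, `runBatch`, `setTimer` and `clearTimer` are hypothetical names introduced so the timer can be swapped out for a fake.

```javascript
// Hypothetical model of the flow described above. Only scheduleNext,
// requestHeartbeatNow, state.timer and intervalMs come from the issue;
// everything else is illustrative and is not OpenClaw's implementation.
function createHeartbeatRunner({
  intervalMs,
  runBatch,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  const state = { timer: null, stopped: false };

  function scheduleNext() {
    if (state.stopped) return;
    // Drop any timer that is still pending so only one is ever armed.
    if (state.timer !== null) clearTimer(state.timer);
    state.timer = setTimer(() => {
      state.timer = null;
      requestHeartbeatNow();
    }, intervalMs);
    // unref() only means this timer alone will not keep the process alive;
    // it does not stop the callback from running while other handles exist.
    if (typeof state.timer.unref === "function") state.timer.unref();
  }

  function requestHeartbeatNow() {
    try {
      runBatch(); // modelled as synchronous for brevity
    } finally {
      // Re-arm even if the batch throws, so one failure cannot end the loop.
      scheduleNext();
    }
  }

  function stop() {
    state.stopped = true;
    if (state.timer !== null) clearTimer(state.timer);
    state.timer = null;
  }

  return { start: scheduleNext, requestHeartbeatNow, stop, state };
}
```

With this shape, `state.timer` is non-null whenever the runner is idle between batches, which is the invariant that appears to be violated (or whose timer is lost) in the gateway.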

The `.unref()` on the timer shouldn't cause issues: it only stops the timer from keeping the process alive, and the event loop has plenty of other active handles (HTTP server, WebSocket, Telegram polling, cron timer). However, the timer consistently fails to fire after the first batch.
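
To narrow down whether the timer is never re-armed, cleared by another code path, or armed and then lost, a small probe could be wired around the `setTimeout` / `clearTimeout` calls in `scheduleNext()`. The helper below is a hypothetical sketch (`createTimerProbe` and its method names are not part of OpenClaw), and the 5-second grace period is an arbitrary assumption.

```javascript
// Hypothetical diagnostic helper; not part of OpenClaw.
// Call onArmed/onFired/onCleared next to the real timer calls, then poll
// status() from something known to keep running (for example a cron tick).
function createTimerProbe(now = Date.now) {
  let armed = null; // { at, delayMs } while a timer is expected to be pending
  let lastFiredAt = null;

  return {
    onArmed(delayMs) {
      armed = { at: now(), delayMs };
    },
    onFired() {
      lastFiredAt = now();
      armed = null;
    },
    onCleared() {
      armed = null;
    },
    // "idle":    nothing armed (never re-armed, or cleared elsewhere)
    // "pending": armed and not yet past its due time plus grace
    // "overdue": armed, due time plus grace has passed, callback never ran
    status(graceMs = 5000) {
      if (armed === null) return { state: "idle", lastFiredAt };
      const dueAt = armed.at + armed.delayMs;
      const lateByMs = now() - dueAt;
      const state = lateByMs > graceMs ? "overdue" : "pending";
      return { state, dueAt, lateByMs, lastFiredAt };
    },
  };
}
```

An `idle` status after a batch would point at the re-arm path or a stray clear, while `overdue` would mean the timer was armed but its callback never ran.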

## Workaround

Run an external watchdog script inside the container that calls `openclaw system event --text "watchdog-heartbeat" --mode now` every 10 minutes. This bypasses the internal timer entirely and reliably triggers heartbeat runs.
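
A minimal sketch of such a watchdog in Node.js, assuming the `openclaw` binary is on `PATH` inside the container. The CLI arguments are the ones quoted above; `triggerHeartbeat`, `startWatchdog`, the injectable `run` parameter and the 60-second timeout are assumptions made for illustration.

```javascript
const { execFile } = require("node:child_process");

// CLI arguments as quoted in this issue.
const WAKE_ARGS = ["system", "event", "--text", "watchdog-heartbeat", "--mode", "now"];

// Runs the wake command once. Resolves to true on success, false on failure.
// `run` defaults to child_process.execFile and can be replaced in tests.
function triggerHeartbeat(run = execFile) {
  return new Promise((resolve) => {
    run("openclaw", WAKE_ARGS, { timeout: 60_000 }, (err, _stdout, stderr) => {
      if (err) {
        console.error("watchdog: wake command failed:", err.message, stderr);
      }
      resolve(!err);
    });
  });
}

// Fires the wake command every `everyMs` milliseconds (10 minutes by default).
// Returns a function that stops the watchdog.
function startWatchdog({ everyMs = 10 * 60 * 1000, run = execFile } = {}) {
  const timer = setInterval(() => {
    triggerHeartbeat(run);
  }, everyMs);
  return () => clearInterval(timer);
}
```

Calling `startWatchdog()` from a small script started alongside the gateway keeps the interval timer referenced, so the watchdog process stays alive on its own until the returned stop function is called.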

## Environment

- OS: Ubuntu 24.04 (Docker container based on Debian Bookworm)
- Node.js: 22.22.0
- OpenClaw: 2026.2.21
- 11 agents configured (mix of 10m and 60m heartbeat intervals)
- LLM: MiniMax-M2.5

## Log Evidence

```
# Heartbeat starts
00:52:51 HEARTBEAT TIMER: {"intervalMs": 600000}

# First (and only) batch fires at 01:02
01:02:51 START googler [heartbeat]
01:03:09 DONE  googler (18087ms)
01:03:09 START communicator [heartbeat]
...
01:05:25 DONE  podcast-fileop (48133ms)

# After batch: only cron ticks, no more heartbeats
01:05:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true}  # cron module
01:06:51: {"nextAt": 1772424000000, "delayMs": 60000, "clamped": true}  # cron module
# ... repeats indefinitely, no heartbeat runs
```
