Heartbeat scheduler dies silently when runOnce() throws during session compaction

## Bug Description

The heartbeat scheduler's `run()` function in `startHeartbeatRunner()` has no try/catch around the `runOnce()` call. If `runOnce()` (which calls `getReplyFromConfig`) throws an unhandled exception — which appears to happen when the heartbeat session compacts — the `scheduleNext()` call at the end of `run()` is never reached. The timer is never rescheduled, and **heartbeats silently stop forever** until the gateway is restarted.

## Steps to Reproduce

1. Configure heartbeat with `every: 60m`
2. Let the heartbeat session accumulate context over many runs
3. Wait for the heartbeat session to hit compaction threshold
4. After compaction, heartbeats never fire again

## Evidence from Logs

```
# Heartbeats running normally every ~60m:
Feb 11 07:03  messageChannel=heartbeat
Feb 11 07:20  messageChannel=heartbeat
Feb 11 08:20  messageChannel=heartbeat  ← last one before session compacted

# 34 hours of silence — no heartbeats, no errors logged

# Only fixed by gateway restart:
Feb 12 18:53  [heartbeat] started
```

## Root Cause

In `health-format-*.js`, the `run()` function inside `startHeartbeatRunner()`:

```javascript
const run = async (params) => {
    // ...
    for (const agent of state.agents.values()) {
        // No try/catch here:
        const res = await runOnce({ ... });
        // If runOnce throws, we never reach:
        // - agent.lastRunMs = now
        // - agent.nextDueMs = now + agent.intervalMs
    }
    scheduleNext();  // Never called if runOnce throws
};
```

Also: the early return for `requests-in-flight` skips `scheduleNext()`, which could also strand the timer in edge cases.

## Suggested Fix

```javascript
for (const agent of state.agents.values()) {
    if (isInterval && now < agent.nextDueMs) continue;
    let res;
    try {
        res = await runOnce({ ... });
    } catch (runErr) {
        log.error(\`heartbeat runner: runOnce threw: \${runErr?.message ?? runErr}\`);
        agent.lastRunMs = now;
        agent.nextDueMs = now + agent.intervalMs;
        continue;
    }
    if (res.status === 'skipped' && res.reason === 'requests-in-flight') {
        scheduleNext();  // Don't forget to reschedule before returning
        return res;
    }
    // ... rest unchanged
}
scheduleNext();
```

## Workaround

Applied the above patch locally to `dist/health-format-*.js`. Also set up a watchdog cron that restarts the gateway if no heartbeats fire for 2+ hours.

## Environment

- OpenClaw 2026.2.6-3
- Model: claude-opus-4-6 with 1M context window (compaction thresholds set to 200k)
- Heartbeat model: claude-sonnet-4-20250514
- OS: Linux 6.12.67 (x64)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Heartbeat scheduler dies silently when runOnce() throws during session compaction #14892

Bug Description

Steps to Reproduce

Evidence from Logs

Root Cause

Suggested Fix

Workaround

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Heartbeat scheduler dies silently when runOnce() throws during session compaction #14892

Description

Bug Description

Steps to Reproduce

Evidence from Logs

Root Cause

Suggested Fix

Workaround

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions