Bug Description
The heartbeat scheduler's run() function in startHeartbeatRunner() has no try/catch around the runOnce() call. If runOnce() (which calls getReplyFromConfig) throws, which appears to happen when the heartbeat session compacts, the scheduleNext() call at the end of run() is never reached. The timer is never rescheduled, and heartbeats stop silently and do not resume until the gateway is restarted.
Steps to Reproduce
- Configure heartbeat with every: 60m
- Let the heartbeat session accumulate context over many runs
- Wait for the heartbeat session to hit the compaction threshold
- After compaction, heartbeats never fire again
Evidence from Logs
# Heartbeats running normally every ~60m:
Feb 11 07:03 messageChannel=heartbeat
Feb 11 07:20 messageChannel=heartbeat
Feb 11 08:20 messageChannel=heartbeat ← last one before session compacted
# 34 hours of silence — no heartbeats, no errors logged
# Only fixed by gateway restart:
Feb 12 18:53 [heartbeat] started
Root Cause
In health-format-*.js, the run() function inside startHeartbeatRunner():
const run = async (params) => {
  // ...
  for (const agent of state.agents.values()) {
    // No try/catch here:
    const res = await runOnce({ ... });
    // If runOnce throws, we never reach:
    //   - agent.lastRunMs = now
    //   - agent.nextDueMs = now + agent.intervalMs
  }
  scheduleNext(); // Never called if runOnce throws
};
Separately, the early return for requests-in-flight skips scheduleNext(), which could strand the timer in edge cases.
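To illustrate the failure mode in isolation, here is a minimal sketch. makeRunner, demo, and the counter standing in for the timer are hypothetical; this is not the OpenClaw code, only the same control-flow shape (await, then reschedule).

```javascript
// Hypothetical stand-ins; none of this is the real OpenClaw code.
function makeRunner(runOnce) {
  const state = { scheduled: 0 };
  // Counts reschedules instead of arming a real timer.
  const scheduleNext = () => { state.scheduled += 1; };

  // Same shape as the unguarded run(): await, then reschedule.
  const run = async () => {
    await runOnce();   // a rejection here propagates out of run()
    scheduleNext();    // never reached when runOnce rejects
  };
  return { run, state };
}

async function demo() {
  const healthy = makeRunner(async () => ({ status: "ran" }));
  await healthy.run();

  const failing = makeRunner(async () => {
    throw new Error("simulated failure during compaction");
  });
  await failing.run().catch(() => {}); // swallow the rejection for the demo

  return {
    healthyScheduled: healthy.state.scheduled,
    failingScheduled: failing.state.scheduled,
  };
}
```

The healthy runner reschedules once; the failing runner never does, which matches the silent stop seen in the logs.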
Suggested Fix
for (const agent of state.agents.values()) {
  if (isInterval && now < agent.nextDueMs) continue;
  let res;
  try {
    res = await runOnce({ ... });
  } catch (runErr) {
    log.error(`heartbeat runner: runOnce threw: ${runErr?.message ?? runErr}`);
    agent.lastRunMs = now;
    agent.nextDueMs = now + agent.intervalMs;
    continue;
  }
  if (res.status === 'skipped' && res.reason === 'requests-in-flight') {
    scheduleNext(); // Don't forget to reschedule before returning
    return res;
  }
  // ... rest unchanged
}
scheduleNext();
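A possible variation on the patch above, not what I applied locally: wrapping the loop in try/finally makes scheduleNext() run on every exit path, so the early return no longer needs its own call. A self-contained sketch follows. makeRunner, the plain agents array, the injected log, and the bookkeeping after a successful run are assumptions made for illustration, since the real "rest unchanged" code is not shown here.

```javascript
// Hypothetical harness; only the control flow mirrors the suggested fix.
function makeRunner(agents, runOnce, log = console) {
  let scheduled = 0;
  const scheduleNext = () => { scheduled += 1; }; // stand-in for re-arming the timer

  const run = async (now = Date.now()) => {
    try {
      for (const agent of agents) {
        if (now < agent.nextDueMs) continue;
        let res;
        try {
          res = await runOnce(agent);
        } catch (runErr) {
          log.error(`heartbeat runner: runOnce threw: ${runErr?.message ?? runErr}`);
          // Advance the schedule so a failing agent is retried next interval.
          agent.lastRunMs = now;
          agent.nextDueMs = now + agent.intervalMs;
          continue;
        }
        if (res.status === "skipped" && res.reason === "requests-in-flight") {
          return res; // finally still reschedules
        }
        // Assumed bookkeeping for a normal run.
        agent.lastRunMs = now;
        agent.nextDueMs = now + agent.intervalMs;
      }
      return { status: "ran" };
    } finally {
      scheduleNext(); // runs on normal exit, early return, and unexpected throws
    }
  };

  return { run, scheduledCount: () => scheduled };
}
```

The trade-off is that scheduleNext() also runs if something else in the loop throws, which is the desired behavior here but does hide the rescheduling at the bottom of the function.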
Workaround
Applied the above patch locally to dist/health-format-*.js. Also set up a watchdog cron that restarts the gateway if no heartbeats fire for 2+ hours.
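For anyone wanting a similar watchdog, the staleness check can be sketched as a small pure function. isHeartbeatStale and TWO_HOURS_MS are hypothetical names. How the last heartbeat timestamp is obtained (for example from the gateway logs) and how the gateway is restarted depend on the deployment and are left out.

```javascript
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// Returns true when no heartbeat has been seen within maxSilenceMs.
// Timestamps are epoch milliseconds.
function isHeartbeatStale(lastHeartbeatMs, nowMs = Date.now(), maxSilenceMs = TWO_HOURS_MS) {
  if (!Number.isFinite(lastHeartbeatMs)) return true; // no heartbeat seen at all
  return nowMs - lastHeartbeatMs >= maxSilenceMs;
}
```

A cron job could call this with the most recent heartbeat time and restart the gateway when it returns true. The 34-hour gap shown in the logs would count as stale under the 2-hour threshold.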
Environment
- OpenClaw 2026.2.6-3
- Model: claude-opus-4-6 with 1M context window (compaction thresholds set to 200k)
- Heartbeat model: claude-sonnet-4-20250514
- OS: Linux 6.12.67 (x64)