Summary
#17797 ("Heartbeat flood: exec-event bypasses interval check, causing runaway heartbeat runs") has regressed in 2026.4.30, or its fix never shipped: the issue is closed as completed but has no closing PR. The code path described in #17797 is still live, with the same self-feeding loop and the same observable symptoms.
Filing a new issue because #17797 is locked.
Reproducer
Failing unit test added to src/infra/heartbeat-runner.scheduler.test.ts:
it("does not bypass interval cooldown for repeated exec-event wakes within nextDueMs", async () => {
  useFakeHeartbeatTime();
  const runSpy = vi.fn().mockResolvedValue({ status: "ran", durationMs: 1 });
  const runner = startHeartbeatRunner({
    cfg: heartbeatConfig(),
    runOnce: runSpy,
    stableSchedulerSeed: TEST_SCHEDULER_SEED,
  });

  // First exec-event wake legitimately fires the run.
  requestHeartbeatNow({ reason: "exec-event", sessionKey: "agent:main:main", coalesceMs: 0 });
  await vi.advanceTimersByTimeAsync(1);
  expect(runSpy).toHaveBeenCalledTimes(1);

  // 4 more exec-events at 10s intervals, well within the 30m configured cadence.
  for (let i = 0; i < 4; i++) {
    await vi.advanceTimersByTimeAsync(10_000);
    requestHeartbeatNow({ reason: "exec-event", sessionKey: "agent:main:main", coalesceMs: 0 });
    await vi.advanceTimersByTimeAsync(1);
  }

  // 40s elapsed. Configured every is 30m. Subsequent exec-events should be debounced.
  expect(runSpy).toHaveBeenCalledTimes(1);
  runner.stop();
});
Result: AssertionError: expected "vi.fn()" to be called 1 times, but got 5 times.
Code path (still present on main, commit e3f84fa)
src/infra/heartbeat-runner.ts:1805:
const isInterval = reason === "interval";
// ...
for (const agent of state.agents.values()) {
  if (isInterval && now < agent.nextDueMs) {
    continue;
  }
  // runOnce fires regardless of nextDueMs for non-interval reasons
}
The targeted-wake branch (heartbeat-runner.ts:1772-1801) has no cooldown gate at all: any wake with sessionKey set skips the nextDueMs check entirely.
src/agents/bash-tools.exec-runtime.ts:347 (maybeNotifyOnExit) still calls:
requestHeartbeatNow(
  scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event", coalesceMs: 0 }),
);
Other event sources also bypass: cron:*, hook:wake, acp:spawn:stream, notifications-event, cli:watchdog:stall. None enforce nextDueMs.
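To make the effect of the gate concrete, here is a minimal standalone model of the broadcast check. It is a sketch, not the real runner: AgentState, dispatchWake, and countRuns are hypothetical names, and it assumes a completed run pushes nextDueMs out by the configured cadence.

```typescript
// Simplified model of the broadcast gate quoted above. AgentState,
// dispatchWake, and countRuns are hypothetical stand-ins, not the real
// heartbeat-runner.ts code.
type AgentState = { nextDueMs: number };

const EVERY_MS = 30 * 60_000; // configured cadence: 30m

function dispatchWake(
  agent: AgentState,
  reason: string,
  now: number,
  gateAllReasons: boolean,
): boolean {
  // Current behavior gates only "interval"; the proposed behavior gates
  // everything except "manual".
  const gated = gateAllReasons ? reason !== "manual" : reason === "interval";
  if (gated && now < agent.nextDueMs) {
    return false; // skipped: not due yet
  }
  // Assumption: a run advances the schedule by the configured cadence.
  agent.nextDueMs = now + EVERY_MS;
  return true;
}

function countRuns(gateAllReasons: boolean): number {
  const agent: AgentState = { nextDueMs: 0 };
  let runs = 0;
  // Five exec-event wakes, 10s apart, mirroring the reproducer.
  for (let i = 0; i < 5; i++) {
    if (dispatchWake(agent, "exec-event", i * 10_000, gateAllReasons)) {
      runs++;
    }
  }
  return runs;
}

const currentRuns = countRuns(false); // interval-only gate
const patchedRuns = countRuns(true); // gate every non-manual reason
console.log(currentRuns, patchedRuns); // prints: 5 1
```

Under these assumptions the interval-only gate yields the same 5 runs the failing test reports, while gating every non-manual reason yields 1.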
Production observation
Heartbeat configured every: "30m". Actual interval observed in /tmp/openclaw/openclaw-2026-04-30.log:
22:43:14 run done (run took 49.9s)
22:43:25 run start <- 11s gap
22:44:08 run done (run took 50.2s)
22:44:19 run start <- 11s gap
22:45:07 run done (run took 55.6s)
22:45:24 run start <- 17s gap
22:46:08 run done (run took 51.1s)
22:46:18 run start <- 10s gap
22:47:04 run done (run took 52.2s)
22:47:14 run start <- 10s gap
Average gap between completions and next start: ~12s (configured cadence: 30 minutes).
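For anyone who wants to check the arithmetic, the average can be recomputed from the timestamps in the excerpt. This is only a convenience sketch; toSeconds is a hypothetical helper and the pairs are copied by hand from the log lines above.

```typescript
// [run done, next run start] pairs copied from the log excerpt above.
const pairs: Array<[string, string]> = [
  ["22:43:14", "22:43:25"],
  ["22:44:08", "22:44:19"],
  ["22:45:07", "22:45:24"],
  ["22:46:08", "22:46:18"],
  ["22:47:04", "22:47:14"],
];

// Convert HH:MM:SS to seconds since midnight (hypothetical helper).
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

const gaps = pairs.map(([done, start]) => toSeconds(start) - toSeconds(done));
const averageGapSec = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;
const configuredSec = 30 * 60;

console.log(gaps); // [ 11, 11, 17, 10, 10 ]
console.log(averageGapSec); // 11.8
console.log(Math.round(configuredSec / averageGapSec)); // 153
```

In other words, runs restart roughly 150 times sooner after completion than the configured cadence allows.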
Cascading effects observed in the same session
Each runaway run pegs the single-threaded gateway event loop:
22:32:35 [diagnostic] liveness warning: eventLoopDelayMaxMs=6912.2
22:38:37 [diagnostic] liveness warning: eventLoopDelayMaxMs=6761.2
22:40:40 [diagnostic] liveness warning: eventLoopDelayMaxMs=8715.8 utilization=99.5% cpu=95.3%
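This causes downstream symptoms:
- openclaw tui cannot connect: the WS handshake closes with code 1000 mid-handshake (see #75379, "TUI fully unresponsive to Ctrl+C / Ctrl+D / SIGINT after gateway WebSocket close")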
Why the existing test missed it
heartbeat-runner.scheduler.test.ts:437 ("does not fan out to unrelated agents for session-scoped exec wakes") covers exec-event routing — it verifies the wake reaches the right agent. It does not assert that the wake is rate-limited. There is currently no test guard for cooldown enforcement on non-interval reasons. The new test above closes that gap.
Suggested fix (combining #17797's option 2 with structural fixes)
Minimal patch:
// heartbeat-runner.ts: targeted branch (line 1772-)
+ if (now < targetAgent.nextDueMs && reason !== "manual") {
+   return { status: "skipped", reason: "not-due" };
+ }
  const res = await runOnce({ ... });

// heartbeat-runner.ts:1805, broadcast branch
- if (isInterval && now < agent.nextDueMs) { continue; }
+ if (now < agent.nextDueMs && reason !== "manual") { continue; }
Architectural follow-ups worth bundling to prevent this class of regression:
- Type the wake reason as a discriminated union instead of a raw string. Today reason === "interval" is a fragile string compare; a new event source bypasses the gate silently. A discriminated union plus an exhaustive switch makes "did I gate this?" a compile-time question.
- Use a single shouldDeferWake(opts) helper called by both the targeted and broadcast branches. They have different (broken) gate logic today. Centralizing the decision means future refactors can't forget one path.
- Add a min-spacing floor independent of every. Even with nextDueMs enforced, two exec wakes arriving within milliseconds could still both fire, one before advanceAgentSchedule updates state and one after. A simple "no run within the last N seconds" floor catches the race.
- Make exec wakes go through the coalescer. The wake-pending-merge logic in heartbeat-wake.ts partially handles this, but exec wakes use coalesceMs: 0, which defeats it. Setting a sensible floor (e.g. 5s) on exec wakes lets multiple completions collapse into one heartbeat.
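A rough sketch of how the first three follow-ups could fit together is below. Everything in it is hypothetical: WakeReason, AgentSchedule, DeferDecision, respectsCadence, and MIN_SPACING_MS are illustrative names, the reason list is abbreviated, and the 5s floor is an assumed value. The coalescer change is not covered.

```typescript
// Hypothetical sketch; none of these names exist in OpenClaw today.
type WakeReason =
  | { kind: "interval" }
  | { kind: "manual" }
  | { kind: "exec-event" }
  | { kind: "cron"; jobId: string }
  | { kind: "hook-wake" };

type AgentSchedule = { nextDueMs: number; lastRunStartedMs: number | null };

type DeferDecision =
  | { defer: false }
  | { defer: true; why: "not-due" | "min-spacing" };

const MIN_SPACING_MS = 5_000; // assumed floor; the right value needs tuning

// Classifies every reason explicitly. Adding a new kind to WakeReason without
// handling it here fails to compile because of the `never` assignment.
function respectsCadence(reason: WakeReason): boolean {
  switch (reason.kind) {
    case "manual":
      return false;
    case "interval":
    case "exec-event":
    case "cron":
    case "hook-wake":
      return true;
    default: {
      const unreachable: never = reason;
      return unreachable;
    }
  }
}

// Single decision point intended for both the targeted and broadcast branches.
function shouldDeferWake(opts: {
  reason: WakeReason;
  agent: AgentSchedule;
  now: number;
}): DeferDecision {
  const { reason, agent, now } = opts;
  if (!respectsCadence(reason)) {
    return { defer: false };
  }
  // Floor independent of `every`: no run within the last MIN_SPACING_MS.
  if (agent.lastRunStartedMs !== null && now - agent.lastRunStartedMs < MIN_SPACING_MS) {
    return { defer: true, why: "min-spacing" };
  }
  if (now < agent.nextDueMs) {
    return { defer: true, why: "not-due" };
  }
  return { defer: false };
}

// Usage: an agent that last ran at t=0 and is next due in 30 minutes.
const agent: AgentSchedule = { nextDueMs: 30 * 60_000, lastRunStartedMs: 0 };
const execAt10s = shouldDeferWake({ reason: { kind: "exec-event" }, agent, now: 10_000 });
const execAt1s = shouldDeferWake({ reason: { kind: "exec-event" }, agent, now: 1_000 });
const manualAt10s = shouldDeferWake({ reason: { kind: "manual" }, agent, now: 10_000 });
console.log(execAt10s, execAt1s, manualAt10s);
```

Whether manual wakes should also be subject to the min-spacing floor is a design choice; this sketch exempts them to match the reason !== "manual" carve-out in the minimal patch.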
Happy to open a PR with the failing test + minimal patch + the helper extraction as a follow-up commit.
Environment
- OpenClaw 2026.4.30, commit e3f84fa
- macOS 15.4 (Darwin 25.4.0 arm64)
- Node 24.14.0
- Heartbeat config: every: "30m", isolatedSession: true, model openai-codex/gpt-5.2
Related
- #75379: TUI fully unresponsive to Ctrl+C / Ctrl+D / SIGINT after gateway WebSocket close