Skip to content

Heartbeat exec-event interval-bypass regression (re: #17797): runaway runs every ~10s instead of configured every:30m #75436

@hexsprite

Description

@hexsprite

Summary

#17797 ("Heartbeat flood: exec-event bypasses interval check, causing runaway heartbeat runs") is regressed on 2026.4.30 (or never actually shipped — that issue is closed as completed but has no closing PR). The exact same code path described in #17797 is still live, with the same self-feeding loop and the same observable symptoms.

Filing a new issue because #17797 is locked.

Reproducer

Failing unit test added to src/infra/heartbeat-runner.scheduler.test.ts:

it("does not bypass interval cooldown for repeated exec-event wakes within nextDueMs", async () => {
  useFakeHeartbeatTime();
  const runSpy = vi.fn().mockResolvedValue({ status: "ran", durationMs: 1 });
  const runner = startHeartbeatRunner({
    cfg: heartbeatConfig(),
    runOnce: runSpy,
    stableSchedulerSeed: TEST_SCHEDULER_SEED,
  });

  // First exec-event wake legitimately fires the run.
  requestHeartbeatNow({ reason: "exec-event", sessionKey: "agent:main:main", coalesceMs: 0 });
  await vi.advanceTimersByTimeAsync(1);
  expect(runSpy).toHaveBeenCalledTimes(1);

  // 4 more exec-events at 10s intervals — well within the 30m configured cadence.
  for (let i = 0; i < 4; i++) {
    await vi.advanceTimersByTimeAsync(10_000);
    requestHeartbeatNow({ reason: "exec-event", sessionKey: "agent:main:main", coalesceMs: 0 });
    await vi.advanceTimersByTimeAsync(1);
  }

  // 40s elapsed. Configured every is 30m. Subsequent exec-events should be debounced.
  expect(runSpy).toHaveBeenCalledTimes(1);
  runner.stop();
});

Result: AssertionError: expected "vi.fn()" to be called 1 times, but got 5 times.

Code path (still present on main, commit e3f84fa)

src/infra/heartbeat-runner.ts:1805:

const isInterval = reason === "interval";
// ...
for (const agent of state.agents.values()) {
  if (isInterval && now < agent.nextDueMs) {
    continue;
  }
  // runOnce fires regardless of nextDueMs for non-interval reasons
}

The targeted-wake branch (heartbeat-runner.ts:1772-1801) has no cooldown gate at all — any wake with sessionKey set bypasses everything.

src/agents/bash-tools.exec-runtime.ts:347 (maybeNotifyOnExit) still calls:

requestHeartbeatNow(
  scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event", coalesceMs: 0 }),
);

Other event sources also bypass: cron:*, hook:wake, acp:spawn:stream, notifications-event, cli:watchdog:stall. None enforce nextDueMs.

Production observation

Heartbeat configured every: "30m". Actual interval observed in /tmp/openclaw/openclaw-2026-04-30.log:

22:43:14 run done    (run took 49.9s)
22:43:25 run start   <- 11s gap
22:44:08 run done    (run took 50.2s)
22:44:19 run start   <- 11s gap
22:45:07 run done    (run took 55.6s)
22:45:24 run start   <- 17s gap
22:46:08 run done    (run took 51.1s)
22:46:18 run start   <- 10s gap
22:47:04 run done    (run took 52.2s)
22:47:14 run start   <- 10s gap

Average gap between completions and next start: ~12s (configured cadence: 30 minutes).

Cascading effects observed in the same session

Each runaway run pegs the single-threaded gateway event loop:

22:32:35 [diagnostic] liveness warning: eventLoopDelayMaxMs=6912.2
22:38:37 [diagnostic] liveness warning: eventLoopDelayMaxMs=6761.2
22:40:40 [diagnostic] liveness warning: eventLoopDelayMaxMs=8715.8 utilization=99.5% cpu=95.3%

This causes downstream symptoms:

Why the existing test missed it

heartbeat-runner.scheduler.test.ts:437 ("does not fan out to unrelated agents for session-scoped exec wakes") covers exec-event routing — it verifies the wake reaches the right agent. It does not assert that the wake is rate-limited. There is currently no test guard for cooldown enforcement on non-interval reasons. The new test above closes that gap.

Suggested fix (combining #17797's option 2 with structural fixes)

Minimal patch:

// heartbeat-runner.ts: targeted branch (line 1772-)
+ if (now < targetAgent.nextDueMs && reason !== "manual") {
+   return { status: "skipped", reason: "not-due" };
+ }
  const res = await runOnce({ ... });

// heartbeat-runner.ts:1805 — broadcast branch
- if (isInterval && now < agent.nextDueMs) { continue; }
+ if (now < agent.nextDueMs && reason !== "manual") { continue; }

Architectural follow-ups worth bundling to prevent regression class:

  1. Type the wake reason as a discriminated union instead of raw string. Today reason === "interval" is a fragile string-compare; a new event source bypasses the gate silently. Discriminated union + exhaustive switch makes "did I gate this?" a compile-time question.

  2. Single shouldDeferWake(opts) helper called by both targeted and broadcast branches. They have different (broken) gate logic today. Centralizing the decision means future refactors can't forget one path.

  3. Min-spacing floor independent of every. Even with nextDueMs enforced, two exec wakes arriving within milliseconds could still both fire — one before advanceAgentSchedule updates state, one after. A simple "no run within the last N seconds" floor catches the race.

  4. Make exec wakes go through coalescer. The wake-pending-merge logic in heartbeat-wake.ts partially handles this, but exec wakes use coalesceMs: 0 which defeats it. Setting a sensible floor (e.g. 5s) on exec wakes lets multiple completions collapse into one heartbeat.

Happy to open a PR with the failing test + minimal patch + the helper extraction as a follow-up commit.

Environment

  • OpenClaw 2026.4.30 commit e3f84fa
  • macOS 15.4 (Darwin 25.4.0 arm64)
  • Node 24.14.0
  • Heartbeat config: every: "30m", isolatedSession: true, model openai-codex/gpt-5.2

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions