Skip to content

fix(heartbeat): centralized cooldown gate prevents exec-event runaway#75439

Closed
hexsprite wants to merge 1 commit intoopenclaw:mainfrom
hexsprite:fix/heartbeat-cooldown-non-interval-wakes
Closed

fix(heartbeat): centralized cooldown gate prevents exec-event runaway#75439
hexsprite wants to merge 1 commit intoopenclaw:mainfrom
hexsprite:fix/heartbeat-cooldown-non-interval-wakes

Conversation

@hexsprite
Copy link
Copy Markdown
Contributor

@hexsprite hexsprite commented May 1, 2026

Summary

Why

src/agents/bash-tools.exec-runtime.ts:347 (maybeNotifyOnExit) calls requestHeartbeatNow({reason: "exec-event"}) whenever a backgrounded process.start exits. The dispatcher in heartbeat-runner.ts only enforced nextDueMs when reason === "interval" — every other reason bypassed the cooldown. The targeted-wake branch had no cooldown gate at all.

Result observed in production: heartbeat configured every: "30m" actually fired every ~10s, pegging the gateway event loop with eventLoopDelayMaxMs >6s spikes and stalling control-UI asset serving and TUI handshakes.

The triage on #75436 (consolidated into #64016) confirmed the diagnosis and recommended exactly this approach: "centralizing the due/cooldown decision for targeted and broadcast wake paths, while preserving explicit manual heartbeat behavior and adding the repeated exec-event regression test."

What changes

src/infra/heartbeat-cooldown.ts (new)

Owns the wake-deferral decision. Both dispatcher branches call shouldDeferWake so the gate cannot be forgotten on one path.

export type WakeReason =
  | { kind: "interval" }
  | { kind: "manual" }
  | { kind: "exec-event" }
  | { kind: "cron"; raw: string }
  | { kind: "hook"; raw: string }
  | { kind: "wake"; raw: string }
  | { kind: "retry" }
  | { kind: "other"; raw?: string };

Adding a new wake source forces a TypeScript error in classifyWakeReason's exhaustive switch — the previous reason === "interval" string compare was a silent footgun.

Decision matrix:

Wake kind First wake (no prior run) Subsequent wakes
manual run run (never deferred)
interval defer if now < nextDueMs defer if now < nextDueMs
event-driven run (bootstrap responsive) defer if now < nextDueMs OR within floor

Plus a flood guard: if ≥5 runs land within 60s, every non-manual wake defers with reason: "flood" and a single warning log fires per loop episode. Even if a future change accidentally bypasses the cooldown gate again, the flood guard caps the damage.

src/infra/heartbeat-runner.ts

  • HeartbeatAgentState extended with lastRunStartedAtMs, recentRunStarts, floodLoggedSinceLastRun (carried across config reloads)
  • evaluateWakeDeferral and recordRunBookkeeping helpers added
  • Targeted branch and broadcast branch both call evaluateWakeDeferral before runOnce and recordRunBookkeeping after deciding to run

Tests

  • New heartbeat-cooldown.test.ts — 32 cases covering manual bypass, the bootstrap exception, all event-driven kinds (exec-event, cron:*, hook:*, acp:spawn:*, wake, unknown), the min-spacing floor, the flood guard, the discriminated-union exhaustiveness, and the run-start ring buffer
  • New regression test in heartbeat-runner.scheduler.test.ts: does not bypass interval cooldown for repeated exec-event wakes within nextDueMs. Asserts that 5 exec-event wakes fired 10s apart yield 1 run, not 5. Fails on main, passes here.
  • Existing 145 heartbeat-suite tests still pass — including does not fan out to unrelated agents for session-scoped exec wakes (Heartbeat duration every >24.85d overflows Node setTimeout, crashes gateway with no auto-respawn #71414 regression test) which now also implicitly verifies the cooldown by virtue of the same dispatcher path.

Test plan

  • pnpm test src/infra/heartbeat-cooldown.test.ts — 32 pass
  • pnpm test src/infra/heartbeat-runner.scheduler.test.ts — 14 pass (includes new regression)
  • Full heartbeat suite (ls src/infra/heartbeat*.test.ts | xargs pnpm test) — 145/145
  • pnpm tsgo — clean
  • pnpm check:test-types — clean
  • pnpm exec oxfmt --check on touched files — clean
  • pnpm check:changed --staged — exit 0
  • Repro on real gateway: configure heartbeat.every: "30m", run a heartbeat that uses backgrounded bash, observe runs spaced ≥30m instead of ~10s

Closes

Closes #64016

Triage references:

Related downstream symptoms

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 1, 2026

Codex review: needs changes before merge.

Summary
The PR adds a centralized heartbeat wake cooldown helper/bookkeeping with regression tests, updates the changelog, exposes a new plugin-sdk/tool-descriptors subpath, updates the SDK API hash, and adjusts several tests/fixtures.

Reproducibility: yes. Current main is source-reproducible by sending repeated targeted exec-event wakes inside nextDueMs; the remaining PR regressions are source-reproducible by sending hook:wake or background-task-blocked after a prior heartbeat run.

Next step before merge
The remaining blockers are narrow mechanical repairs on the PR branch: fix wake-policy exceptions/tests and add the missing SDK subpath documentation or remove the unintended export.

Security
Cleared: No concrete security or supply-chain regression found; the diff adds no dependencies, workflows, permissions, secret handling, or downloaded code.

Review findings

  • [P2] Preserve hook wake-now delivery — src/infra/heartbeat-cooldown.ts:70
  • [P2] Keep blocked task follow-ups immediate — src/infra/heartbeat-cooldown.ts:70
  • [P3] Document the tool-descriptors SDK subpath — package.json:1233-1235
Review details

Best possible solution:

Land the centralized cooldown once it distinguishes loop-prone exec/spawn/retry wakes from explicit hook/cron wake-now and task-follow-up delivery, and either document or remove the new SDK subpath.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main is source-reproducible by sending repeated targeted exec-event wakes inside nextDueMs; the remaining PR regressions are source-reproducible by sending hook:wake or background-task-blocked after a prior heartbeat run.

Is this the best way to solve the issue?

No. The shared helper is the maintainable direction, but the current policy is too broad for documented immediate delivery paths and the public SDK export needs the usual docs follow-through.

Full review comments:

  • [P2] Preserve hook wake-now delivery — src/infra/heartbeat-cooldown.ts:70
    /hooks/wake with mode: "now" emits reason: "hook:wake", but this allowlist only covers the bare wake reason. After any prior heartbeat run, hook:wake now takes the event-driven cooldown path and can be skipped as not-due until nextDueMs, so a documented now wake no longer runs promptly.
    Confidence: 0.94
  • [P2] Keep blocked task follow-ups immediate — src/infra/heartbeat-cooldown.ts:70
    queueBlockedTaskFollowup wakes the requester with reason: "background-task-blocked", but that exact reason is missing here. A blocked/needs-user task follow-up after a prior heartbeat can now wait for the next scheduled interval instead of surfacing promptly to the requester.
    Confidence: 0.92
  • [P3] Document the tool-descriptors SDK subpath — package.json:1233-1235
    This adds openclaw/plugin-sdk/tool-descriptors as a public package export, but the SDK subpath catalog still jumps from tool-payload to tool-send. Public SDK subpaths should be discoverable in docs/plugins/sdk-subpaths.md when they are promoted as supported imports.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.93

Acceptance criteria:

  • pnpm test src/infra/heartbeat-cooldown.test.ts src/infra/heartbeat-runner.scheduler.test.ts src/gateway/hooks.test.ts src/cron/service.runs-one-shot-main-job-disables-it.test.ts src/tasks/task-registry.test.ts
  • pnpm exec oxfmt --check --threads=1 src/infra/heartbeat-cooldown.ts src/infra/heartbeat-cooldown.test.ts src/infra/heartbeat-runner.ts src/infra/heartbeat-runner.scheduler.test.ts docs/plugins/sdk-subpaths.md CHANGELOG.md
  • pnpm plugin-sdk:api:check
  • pnpm docs:check-mdx
  • pnpm check:changed

What I checked:

  • Current-main cooldown gap: Current main still sends targeted wakes directly to runOnce before any nextDueMs gate, while the broadcast path only checks nextDueMs for reason === "interval", matching the reported exec-event runaway surface. (src/infra/heartbeat-runner.ts:1879, 803b7ab8085d)
  • Exec-event wake source: Backgrounded exec completion feeds the heartbeat wake layer with reason: "exec-event" and coalesceMs: 0, so repeated process exits can trigger the buggy path on main. (src/agents/bash-tools.exec-runtime.ts:347, 803b7ab8085d)
  • PR immediate allowlist is too narrow: At PR head, IMMEDIATE_WAKE_REASONS contains only manual, wake, and background-task; hook:wake, cron:* queued wake-now fallbacks, and background-task-blocked fall through to the cooldown path after a prior run. (src/infra/heartbeat-cooldown.ts:70, b5295a0648b5)
  • Hook wake-now source: /hooks/wake with mode: "now" enqueues a system event and calls requestHeartbeatNow({ reason: "hook:wake" }), so the PR cooldown would make a documented now wake wait until nextDueMs after any prior heartbeat run. (src/gateway/server/hooks.ts:53, 803b7ab8085d)
  • Blocked task follow-up source: Blocked task follow-ups enqueue the user-visible follow-up and wake heartbeat with reason: "background-task-blocked"; that exact reason is not in the PR immediate allowlist. (src/tasks/task-registry.ts:1101, 803b7ab8085d)
  • Cron wake-now fallback source: When a main-session cron wakeMode: "now" run cannot synchronously wake because the runtime is busy, cron queues requestHeartbeatNow({ reason: cron:<jobId>, heartbeat: { target: "last" } }); the PR treats that queued fallback as ordinary event-driven cooldown work. (src/cron/service/timer.ts:1404, 803b7ab8085d)

Likely related people:

  • steipete: Recent commits on src/infra/heartbeat-runner.ts and src/infra/heartbeat-wake.ts touched exec-event routing, heartbeat busy deferral, and scheduler behavior adjacent to this cooldown policy. (role: recent heartbeat maintainer; confidence: high; commits: 9180173f9a79, f5e7557c70c1, 65c9eddae87e; files: src/infra/heartbeat-runner.ts, src/infra/heartbeat-wake.ts)
  • vignesh07: The pending-wakes-per-target behavior in heartbeat-wake.ts appears to date to 064a3079cb5f, and later heartbeat rebase/commitment work touched the same runner surface. (role: introduced wake coalescing/target behavior; confidence: medium; commits: 064a3079cb5f, 7451415f36d5, 8e4035d09a88; files: src/infra/heartbeat-wake.ts, src/infra/heartbeat-runner.ts)
  • obviyus: Recent cron/heartbeat work preserved deferred cron heartbeat targets and forwarded heartbeat overrides through the queued wake path that this PR now cools down. (role: cron wake target maintainer; confidence: medium; commits: 1d4e4314dddc; files: src/cron/service/timer.ts, src/gateway/server-cron.ts, src/infra/heartbeat-wake.ts)
  • vincentkoc: Recent work on hook completion routing and task delivery/registry behavior is adjacent to the hook:wake and background-task-blocked immediate-delivery regressions. (role: task and hook delivery adjacent owner; confidence: medium; commits: 6f38425e5c18, 210cccb0fea5, a1b656705992; files: src/gateway/server/hooks.ts, src/tasks/task-registry.ts)

Remaining risk / open question:

  • The PR mixes the heartbeat runtime fix with a new public SDK subpath; maintainers should confirm the SDK export belongs in this PR and keep the docs/API contract aligned if it stays.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 803b7ab8085d.

@hexsprite
Copy link
Copy Markdown
Contributor Author

Thanks for the review. Pushed 52d41262c1 addressing both findings.

P2-1: Preserve immediate wake-now paths

You're right — the previous gate caught wake (from openclaw system event --mode now) and background-task (from task-registry.ts:1079) along with the loop-prone reasons. Both are documented as immediate.

Fix: explicit IMMEDIATE_WAKE_REASONS allowlist in heartbeat-cooldown.ts:

const IMMEDIATE_WAKE_REASONS: ReadonlySet<string> = new Set([
  "manual",
  "wake",
  "background-task",
]);

Matched on the raw reason string, not on HeartbeatReasonKind. That distinction matters because acp:spawn:* also classifies as kind wake via heartbeat-reason.ts, but those are agent-emitted spawn updates that can feedback-loop and must remain gated. The allowlist is exact-string so this stays unambiguous.

The flood guard still applies as a backstop to wake and background-task (could imagine a buggy task that completion-storms the heartbeat); only manual is fully exempt since it's user-pressed.

Tests added covering:

  • wake and background-task not deferred within nextDueMs
  • acp:spawn:stream STILL deferred (verifies the kind/raw distinction)
  • Flood guard still backstops wake and background-task
  • manual exempt even during a real flood

P2-2: Don't poison cooldown on retryable busy skip

Confirmed by inspection — moved recordRunBookkeeping to AFTER runOnce in both branches, conditional on non-retryable result. The retryable busy branch now returns without recording; the wake-layer retry will see lastRunStartedAtMs unchanged and proceed.

Throws also count as terminal attempts (so a consistently-throwing reason doesn't tight-loop on retries).

New regression test:

it("retryable busy skip does not poison the cooldown for the next retry", async () => {
  let attempt = 0;
  const runSpy = vi.fn().mockImplementation(async () => {
    attempt += 1;
    if (attempt === 1) {
      return { status: "skipped", reason: HEARTBEAT_SKIP_REQUESTS_IN_FLIGHT } as const;
    }
    return { status: "ran", durationMs: 1 } as const;
  });
  // ... 1st wake: requests-in-flight. After 1.5s wake-layer retry,
  // 2nd wake must reach runOnce — not be deferred as 'not-due'/'min-spacing'.
  expect(runSpy).toHaveBeenCalledTimes(2);
});

Fails on the prior commit. Passes on 52d41262c1.

Acceptance criteria checklist

  • pnpm test src/infra/heartbeat-cooldown.test.ts src/infra/heartbeat-runner.scheduler.test.ts66 pass (49 cooldown + 17 scheduler)
  • pnpm test src/tasks/task-registry*.test.ts src/cron/service*.test.ts79 pass
  • Full heartbeat suite (ls src/infra/heartbeat*.test.ts | xargs pnpm test) — 148 pass
  • pnpm exec oxfmt --check --threads=1 src/infra/heartbeat-cooldown.ts src/infra/heartbeat-cooldown.test.ts src/infra/heartbeat-runner.ts src/infra/heartbeat-runner.scheduler.test.ts CHANGELOG.md — clean
  • pnpm check:changed — exit 0 (typecheck core + tests, lint core, all guards green)

CHANGELOG entry — happy to add one if maintainers want it as part of this PR; the AGENTS guidance suggests pure infra/test fixes can skip but this one has user-visible cadence implications.

@hexsprite
Copy link
Copy Markdown
Contributor Author

Two updates:

Live verification on real gateway

Built 52d41262c1, started gateway with --verbose, monitored for 3 minutes with the same every: "30m" heartbeat config that previously produced ~10s runaway loops.

Pre-fix baseline (same workstation, earlier tonight): 5+ heartbeat runs in 3 minutes, all messageChannel=heartbeat, ~60s gap between completions and next start.

Post-fix:

  • T+0–60s: 0 runs
  • T+120s: 1 run started, but messageChannel=unknown (external dispatch — webchat/telegram), NOT a heartbeat
  • T+180s: that run completed (38s), no new runs

Heartbeat runaway is gone in the real gateway. The cooldown gate is holding exec-event and friends as expected, and wake/background-task paths still flow through (verified via the new unit tests).

CI failure on Scan changed paths (precise) is a false attribution

Run 25204141642 / job 73901114636 flagged extensions/voice-call/src/providers/twilio/api.ts:60 for SSRF. That file is not in this PR.

$ git diff --name-only $(git merge-base origin/main HEAD)..HEAD
src/infra/heartbeat-cooldown.test.ts
src/infra/heartbeat-cooldown.ts
src/infra/heartbeat-runner.scheduler.test.ts
src/infra/heartbeat-runner.ts

The OpenGrep step uses OPENCLAW_OPENGREP_BASE_REF: e8f9c3e6dedc...HEAD, but e8f9c3e6dedc is downstream of this PR's merge-base (76930da7ebc7). The voice-call file landed on main in commit 97d42a9614 (fix(voice-call): retry twilio answered updates), well outside this PR.

The finding is real-but-unrelated — it's a pre-existing risk on main that the voice-call team should address (or nosem annotate if it's a false positive). Happy to file a separate issue if useful, but it shouldn't block merge of this PR.

@hexsprite
Copy link
Copy Markdown
Contributor Author

CI failures here are inherited from main HEAD 0e8cb3d9 (the rebase target). The latest completed CI run on main (run 25216237781) failed with the exact same 9 jobs:

  • check-lint
  • check-test-types
  • check-dependencies
  • checks-node-core-runtime-infra
  • checks-node-core-fast-support
  • checks-node-agentic-gateway-core
  • checks-node-agentic-cli
  • check
  • checks-node-core

Failing tests are in unrelated areas (whatsapp qrcode runtime dep, pairing-store, web-search-codex-config, model-pricing-cache, plugins-cli.install, etc.) — none touch heartbeat code on this PR. Will rebase again once main is green.

@hexsprite
Copy link
Copy Markdown
Contributor Author

@obviyus — would you be open to reviewing? Heartbeat scheduler / cooldown-gate work in adjacent territory to your #75922 plugin-registry caching. Targets a separate runaway path on the same gateway-CPU-peg symptom cluster (#75703 / #75872 / #75707 / #75882).

@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from 8400bfa to 6bdf1d9 Compare May 2, 2026 05:26
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from 3551e95 to 45be8ee Compare May 2, 2026 06:12
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch 2 times, most recently from d7c2a3e to ee47219 Compare May 2, 2026 06:21
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from ee47219 to ed17910 Compare May 2, 2026 06:21
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from ed17910 to 3458164 Compare May 2, 2026 06:36
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from 3458164 to bf7ada8 Compare May 2, 2026 06:36
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch 2 times, most recently from 765d8d8 to 43f7d3c Compare May 2, 2026 06:45
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from 43f7d3c to d6a3d13 Compare May 2, 2026 06:45
hexsprite added a commit to hexsprite/openclaw that referenced this pull request May 2, 2026
…own on retryable busy skip

Address two P2 review findings on openclaw#75439:

1. Preserve immediate wake-now contract (openclaw#75439 review P2-1)
   The cooldown gate previously deferred every event-driven reason once the
   agent had already run. That broke documented contracts:

   - 'wake' from `openclaw system event --mode now` (docs/cli/system.md) —
     promised to fire the heartbeat immediately
   - 'background-task' from `task-registry.ts:1079` — promised to fire on
     terminal updates so users don't wait for the next scheduled tick

   Introduce `isImmediateWakeReason` with an explicit allowlist:
   {manual, wake, background-task}. These bypass the interval cooldown.
   Distinct from the heartbeat-reason kind 'wake' which also covers
   'acp:spawn:*' — those are agent-emitted spawn updates that can feedback-loop
   and remain gated. The allowlist matches the raw reason string for clarity.

   Flood guard still applies as a backstop to 'wake' and 'background-task' (one
   could imagine a buggy task that completion-storms the heartbeat); 'manual'
   alone is fully exempt since user-pressed.

2. Don't poison cooldown bookkeeping on retryable busy skip (openclaw#75439 review P2-2)
   Previously `recordRunBookkeeping` was called BEFORE `runOnce`. If the
   first attempt returned `requests-in-flight` / `cron-in-progress` /
   `lanes-busy`, the wake layer retries the same reason after
   DEFAULT_RETRY_MS (1s). The retry then saw populated `lastRunStartedAtMs`
   and got falsely deferred with `not-due` or `min-spacing`, dropping the
   queued event until the next scheduled heartbeat.

   Move `recordRunBookkeeping` to AFTER `runOnce` in both branches,
   conditional on a non-retryable result. Throws also count as terminal
   attempts so the retry layer doesn't tight-loop on a broken reason.

Tests:
- New: 'never defers manual wakes even during a flood'
- New: 'does not defer wake/background-task even within nextDueMs' (P2-1)
- New: 'DOES defer acp:spawn:stream within nextDueMs' (kind:wake distinction)
- New: 'flood guard still applies to wake/background-task as a backstop'
- New: 'does not bypass interval cooldown for repeated wake reasons'
- New: 'does not bypass interval cooldown for repeated background-task wakes'
- New: 'retryable busy skip does not poison the cooldown for the next retry' (P2-2)

49/49 cooldown tests, 17/17 scheduler tests, 148/148 full heartbeat suite,
79/79 task-registry + cron service suite (review acceptance criteria).

Refs:
- openclaw#75439 review by clawsweeper / Codex
- openclaw#64016 (canonical issue)
- openclaw#75436 (closed dup with full repro)
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from d6a3d13 to 43fb6e1 Compare May 2, 2026 06:49
@hexsprite hexsprite force-pushed the fix/heartbeat-cooldown-non-interval-wakes branch from 43fb6e1 to 6126d5f Compare May 2, 2026 06:56
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation scripts Repository scripts and removed extensions: qa-lab labels May 2, 2026
@steipete
Copy link
Copy Markdown
Contributor

steipete commented May 2, 2026

Thanks @hexsprite for tracking this down and for the detailed repro/writeup. I rewrote the fix into a slightly larger maintainer-owned cleanup that keeps the root cause explicit: wake producers now declare typed source / intent, event-driven wakes go through the shared cooldown, wake-now paths stay immediate, and the deprecated plugin requestHeartbeatNow() API remains as a compatibility alias.

I preserved your credit in the replacement commit with:

Co-authored-by: Jordan Baker jbb@scryent.com

Replacement PR: #76086
Replacement commit: deba234

Closing this one in favor of #76086 so review/CI can continue on the maintainer branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: heartbeat fires duplicate runs when external wake events (openclaw agent CLI) arrive during scheduled heartbeat

2 participants