fix(heartbeat): centralized cooldown gate prevents exec-event runaway#75439
fix(heartbeat): centralized cooldown gate prevents exec-event runaway#75439hexsprite wants to merge 1 commit intoopenclaw:mainfrom
Conversation
|
Codex review: needs changes before merge. Summary Reproducibility: yes. Current main is source-reproducible by sending repeated targeted Next step before merge Security Review findings
Review detailsBest possible solution: Land the centralized cooldown once it distinguishes loop-prone exec/spawn/retry wakes from explicit hook/cron wake-now and task-follow-up delivery, and either document or remove the new SDK subpath. Do we have a high-confidence way to reproduce the issue? Yes. Current main is source-reproducible by sending repeated targeted Is this the best way to solve the issue? No. The shared helper is the maintainable direction, but the current policy is too broad for documented immediate delivery paths and the public SDK export needs the usual docs follow-through. Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 803b7ab8085d. |
|
Thanks for the review. Pushed 52d41262c1 addressing both findings. P2-1: Preserve immediate wake-now pathsYou're right — the previous gate caught Fix: explicit const IMMEDIATE_WAKE_REASONS: ReadonlySet<string> = new Set([
"manual",
"wake",
"background-task",
]);Matched on the raw reason string, not on The flood guard still applies as a backstop to Tests added covering:
P2-2: Don't poison cooldown on retryable busy skipConfirmed by inspection — moved Throws also count as terminal attempts (so a consistently-throwing reason doesn't tight-loop on retries). New regression test: it("retryable busy skip does not poison the cooldown for the next retry", async () => {
let attempt = 0;
const runSpy = vi.fn().mockImplementation(async () => {
attempt += 1;
if (attempt === 1) {
return { status: "skipped", reason: HEARTBEAT_SKIP_REQUESTS_IN_FLIGHT } as const;
}
return { status: "ran", durationMs: 1 } as const;
});
// ... 1st wake: requests-in-flight. After 1.5s wake-layer retry,
// 2nd wake must reach runOnce — not be deferred as 'not-due'/'min-spacing'.
expect(runSpy).toHaveBeenCalledTimes(2);
});Fails on the prior commit. Passes on Acceptance criteria checklist
CHANGELOG entry — happy to add one if maintainers want it as part of this PR; the AGENTS guidance suggests pure infra/test fixes can skip but this one has user-visible cadence implications. |
|
Two updates: Live verification on real gatewayBuilt Pre-fix baseline (same workstation, earlier tonight): 5+ heartbeat runs in 3 minutes, all Post-fix:
Heartbeat runaway is gone in the real gateway. The cooldown gate is holding CI failure on
|
320d3c1 to
34ec38f
Compare
|
CI failures here are inherited from
Failing tests are in unrelated areas (whatsapp |
158f364 to
2d83ee7
Compare
2d83ee7 to
23dac0f
Compare
962a50a to
de418da
Compare
de418da to
8400bfa
Compare
8400bfa to
6bdf1d9
Compare
3551e95 to
45be8ee
Compare
d7c2a3e to
ee47219
Compare
ee47219 to
ed17910
Compare
ed17910 to
3458164
Compare
3458164 to
bf7ada8
Compare
765d8d8 to
43f7d3c
Compare
43f7d3c to
d6a3d13
Compare
…own on retryable busy skip Address two P2 review findings on openclaw#75439: 1. Preserve immediate wake-now contract (openclaw#75439 review P2-1) The cooldown gate previously deferred every event-driven reason once the agent had already run. That broke documented contracts: - 'wake' from `openclaw system event --mode now` (docs/cli/system.md) — promised to fire the heartbeat immediately - 'background-task' from `task-registry.ts:1079` — promised to fire on terminal updates so users don't wait for the next scheduled tick Introduce `isImmediateWakeReason` with an explicit allowlist: {manual, wake, background-task}. These bypass the interval cooldown. Distinct from the heartbeat-reason kind 'wake' which also covers 'acp:spawn:*' — those are agent-emitted spawn updates that can feedback-loop and remain gated. The allowlist matches the raw reason string for clarity. Flood guard still applies as a backstop to 'wake' and 'background-task' (one could imagine a buggy task that completion-storms the heartbeat); 'manual' alone is fully exempt since user-pressed. 2. Don't poison cooldown bookkeeping on retryable busy skip (openclaw#75439 review P2-2) Previously `recordRunBookkeeping` was called BEFORE `runOnce`. If the first attempt returned `requests-in-flight` / `cron-in-progress` / `lanes-busy`, the wake layer retries the same reason after DEFAULT_RETRY_MS (1s). The retry then saw populated `lastRunStartedAtMs` and got falsely deferred with `not-due` or `min-spacing`, dropping the queued event until the next scheduled heartbeat. Move `recordRunBookkeeping` to AFTER `runOnce` in both branches, conditional on a non-retryable result. Throws also count as terminal attempts so the retry layer doesn't tight-loop on a broken reason. Tests: - New: 'never defers manual wakes even during a flood' - New: 'does not defer wake/background-task even within nextDueMs' (P2-1) - New: 'DOES defer acp:spawn:stream within nextDueMs' (kind:wake distinction) - New: 'flood guard still applies to wake/background-task as a backstop' - New: 'does not bypass interval cooldown for repeated wake reasons' - New: 'does not bypass interval cooldown for repeated background-task wakes' - New: 'retryable busy skip does not poison the cooldown for the next retry' (P2-2) 49/49 cooldown tests, 17/17 scheduler tests, 148/148 full heartbeat suite, 79/79 task-registry + cron service suite (review acceptance criteria). Refs: - openclaw#75439 review by clawsweeper / Codex - openclaw#64016 (canonical issue) - openclaw#75436 (closed dup with full repro)
d6a3d13 to
43fb6e1
Compare
43fb6e1 to
6126d5f
Compare
|
Thanks @hexsprite for tracking this down and for the detailed repro/writeup. I rewrote the fix into a slightly larger maintainer-owned cleanup that keeps the root cause explicit: wake producers now declare typed I preserved your credit in the replacement commit with: Co-authored-by: Jordan Baker jbb@scryent.com Replacement PR: #76086 Closing this one in favor of #76086 so review/CI can continue on the maintainer branch. |
Summary
exec-eventrunaway loop in heartbeat dispatcher (BUG: heartbeat fires duplicate runs when external wake events (openclaw agent CLI) arrive during scheduled heartbeat #64016)WakeReasonreplaces raw string compare for reason classificationWhy
src/agents/bash-tools.exec-runtime.ts:347(maybeNotifyOnExit) callsrequestHeartbeatNow({reason: "exec-event"})whenever a backgroundedprocess.startexits. The dispatcher inheartbeat-runner.tsonly enforcednextDueMswhenreason === "interval"— every other reason bypassed the cooldown. The targeted-wake branch had no cooldown gate at all.Result observed in production: heartbeat configured
every: "30m"actually fired every ~10s, pegging the gateway event loop witheventLoopDelayMaxMs >6sspikes and stalling control-UI asset serving and TUI handshakes.The triage on #75436 (consolidated into #64016) confirmed the diagnosis and recommended exactly this approach: "centralizing the due/cooldown decision for targeted and broadcast wake paths, while preserving explicit manual heartbeat behavior and adding the repeated
exec-eventregression test."What changes
src/infra/heartbeat-cooldown.ts(new)Owns the wake-deferral decision. Both dispatcher branches call
shouldDeferWakeso the gate cannot be forgotten on one path.Adding a new wake source forces a TypeScript error in
classifyWakeReason's exhaustive switch — the previousreason === "interval"string compare was a silent footgun.Decision matrix:
manualintervalnow < nextDueMsnow < nextDueMsnow < nextDueMsOR within floorPlus a flood guard: if ≥5 runs land within 60s, every non-manual wake defers with
reason: "flood"and a single warning log fires per loop episode. Even if a future change accidentally bypasses the cooldown gate again, the flood guard caps the damage.src/infra/heartbeat-runner.tsHeartbeatAgentStateextended withlastRunStartedAtMs,recentRunStarts,floodLoggedSinceLastRun(carried across config reloads)evaluateWakeDeferralandrecordRunBookkeepinghelpers addedevaluateWakeDeferralbeforerunOnceandrecordRunBookkeepingafter deciding to runTests
heartbeat-cooldown.test.ts— 32 cases covering manual bypass, the bootstrap exception, all event-driven kinds (exec-event,cron:*,hook:*,acp:spawn:*,wake, unknown), the min-spacing floor, the flood guard, the discriminated-union exhaustiveness, and the run-start ring bufferheartbeat-runner.scheduler.test.ts:does not bypass interval cooldown for repeated exec-event wakes within nextDueMs. Asserts that 5 exec-event wakes fired 10s apart yield 1 run, not 5. Fails onmain, passes here.does not fan out to unrelated agents for session-scoped exec wakes(Heartbeat durationevery>24.85d overflows Node setTimeout, crashes gateway with no auto-respawn #71414 regression test) which now also implicitly verifies the cooldown by virtue of the same dispatcher path.Test plan
pnpm test src/infra/heartbeat-cooldown.test.ts— 32 passpnpm test src/infra/heartbeat-runner.scheduler.test.ts— 14 pass (includes new regression)ls src/infra/heartbeat*.test.ts | xargs pnpm test) — 145/145pnpm tsgo— cleanpnpm check:test-types— cleanpnpm exec oxfmt --checkon touched files — cleanpnpm check:changed --staged— exit 0heartbeat.every: "30m", run a heartbeat that uses backgrounded bash, observe runs spaced ≥30m instead of ~10sCloses
Closes #64016
Triage references:
Related downstream symptoms