Skip to content

fix(heartbeat): advance stale scheduler deferrals#88462

Merged
steipete merged 1 commit into
mainfrom
fix/heartbeat-stale-schedule-refire
May 31, 2026
Merged

fix(heartbeat): advance stale scheduler deferrals#88462
steipete merged 1 commit into
mainfrom
fix/heartbeat-stale-schedule-refire

Conversation

@steipete

@steipete steipete commented May 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #79380.
Supersedes #79418 with the semantic scheduler-state fix instead of a timer-floor mitigation.

This keeps retryable busy skips owned by the wake-layer retry path, but advances stale heartbeat due slots after terminal skips and non-retry deferrals. That makes scheduleNext() see a future due time instead of rearming an immediate timer against the same past-due agent.

Verification

  • pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose
  • pnpm check:changed (Testbox tbx_01ksxfavykc7qyve4ysnxg3smh)
  • pnpm test:changed (failed in unrelated src/auto-reply/reply/commands-status.thinking-default.test.ts hook timeout; reran that exact file and reproduced the same unrelated timeout)
  • autoreview --mode branch --base origin/main --parallel-tests "pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose" -> clean, no accepted/actionable findings

Real behavior proof

Behavior addressed: Gateway heartbeat scheduler can hot-loop when every due agent either returns a terminal skip without advancing nextDueMs or is deferred by a non-retry cooldown/flood gate while nextDueMs is already stale.

Real environment tested: Local macOS source checkout for focused scheduler behavior; Blacksmith Testbox for changed check gate.

Exact steps or command run after this patch: pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose and pnpm check:changed.

Evidence after fix: Scheduler regression tests pass and cover both stale terminal disabled skips and flood deferrals without wake-layer retry; changed check passed in Testbox tbx_01ksxfavykc7qyve4ysnxg3smh.

Observed result after fix: Non-retryable terminal skips now record bookkeeping and advance the agent due slot; stale non-retry deferrals advance the due slot before returning, so the scheduler does not keep rearming at 0ms. Retryable busy skips still avoid schedule advancement and continue to use the wake-layer retry path.

What was not tested: Physical Raspberry Pi 4 ARM64 Docker soak on this exact branch. The original report/PR discussion already includes physical Pi before/after proof for the same scheduler symptom; this PR changes the fix shape to repair the stale schedule state directly.

@openclaw-barnacle openclaw-barnacle Bot added size: S maintainer Maintainer-authored PR labels May 30, 2026
@clawsweeper

clawsweeper Bot commented May 30, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 30, 2026, 6:44 PM ET / 22:44 UTC.

Summary
The PR advances stale heartbeat due slots after terminal skips and non-retry deferrals, while preserving wake-layer retry ownership for busy skips and adding scheduler regression tests.

PR surface: Source +13, Tests +59. Total +72 across 2 files.

Reproducibility: yes. Source inspection gives a high-confidence reproduction path on current main: stale nextDueMs plus a disabled skip or non-retry deferral reaches scheduleNext() with a 0ms delay; I did not run tests because this review is read-only.

Review metrics: none identified.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Attach redacted runtime logs, terminal output, or a linked artifact showing the scheduler no longer re-arms at 0ms under the stale deferral scenario.
  • If feasible, include an ARM64 Docker or Raspberry Pi soak result for this exact branch because the linked user report is platform-specific.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body supplies focused fake-timer scheduler tests and Testbox check output, but no after-fix real gateway or ARM64 Docker run on this exact branch; add redacted logs, terminal output, or an artifact and update the PR body to trigger review.

Risk before merge

  • [P1] The PR changes core heartbeat scheduler rearm behavior, so maintainers should confirm the stale-schedule advancement does not suppress expected near-term heartbeat retries outside the covered disabled/flood cases.
  • [P1] The PR body records focused scheduler tests and a Testbox changed-check, but not a real gateway or ARM64 Docker run on this exact branch.

Maintainer options:

  1. Add real runtime proof (recommended)
    Before merge, attach redacted terminal output, logs, or an artifact from a real gateway run showing the stale deferral case no longer re-arms the scheduler at 0ms.
  2. Accept focused proof
    Maintainers may choose to accept the focused scheduler tests plus Testbox changed-check because the fix is a narrow state-machine repair and the older related discussion already has platform symptom proof.
  3. Pause for fix-path decision
    Pause this PR only if maintainers prefer the older timer-floor mitigation or want a different scheduler contract than advancing stale due slots.

Next step before merge

  • [P1] The maintainer label protects this PR from cleanup automation, and the remaining action is maintainer review of proof and merge readiness rather than an autonomous code repair.

Security
Cleared: The diff is limited to heartbeat scheduler logic and colocated tests; it does not touch dependencies, secrets, package metadata, CI, install scripts, or other supply-chain surfaces.

Review details

Best possible solution:

Land the semantic scheduler-state fix after maintainer review confirms the new due-slot advancement is the intended contract and the proof bar is acceptable for the linked availability regression.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection gives a high-confidence reproduction path on current main: stale nextDueMs plus a disabled skip or non-retry deferral reaches scheduleNext() with a 0ms delay; I did not run tests because this review is read-only.

Is this the best way to solve the issue?

Yes. The proposed direction is a narrower, more maintainable fix than a timer-floor mitigation because it repairs stale scheduler state while preserving wake-layer retries for busy runtimes.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 57c88dd46e2d.

Label changes

Label changes:

  • add P1: The PR targets an urgent gateway heartbeat scheduler regression that can pin CPU and stall channel processing for affected users.
  • add merge-risk: 🚨 availability: The diff changes core heartbeat scheduler rearm behavior, so an incorrect merge could alter liveness, retry cadence, or process availability even with focused tests passing.
  • add rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body supplies focused fake-timer scheduler tests and Testbox check output, but no after-fix real gateway or ARM64 Docker run on this exact branch; add redacted logs, terminal output, or an artifact and update the PR body to trigger review.

Label justifications:

  • P1: The PR targets an urgent gateway heartbeat scheduler regression that can pin CPU and stall channel processing for affected users.
  • merge-risk: 🚨 availability: The diff changes core heartbeat scheduler rearm behavior, so an incorrect merge could alter liveness, retry cadence, or process availability even with focused tests passing.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body supplies focused fake-timer scheduler tests and Testbox check output, but no after-fix real gateway or ARM64 Docker run on this exact branch; add redacted logs, terminal output, or an artifact and update the PR body to trigger review.
Evidence reviewed

PR surface:

Source +13, Tests +59. Total +72 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 20 7 +13
Tests 1 59 0 +59
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 79 7 +72

What I checked:

  • Repository policy read: Root AGENTS.md was read fully; its protected-label, ClawSweeper PR-review, proof, scheduler hot-path, and security-review guidance apply to this item. (AGENTS.md:1, 57c88dd46e2d)
  • Current main can rearm stale due slots immediately: scheduleNext() computes rawDelay = Math.max(0, nextDue - now) and resolves it with minMs: 0, so a stale nextDueMs can rearm the heartbeat timer at 0ms. (src/infra/heartbeat-runner.ts:2234, 57c88dd46e2d)
  • Current main leaves disabled skips stale: On current main, non-retryable disabled skips record bookkeeping but skip advanceAgentSchedule, leaving the old due slot for scheduleNext() to observe. (src/infra/heartbeat-runner.ts:2400, 57c88dd46e2d)
  • Current main deferrals return without due-slot repair: Targeted and broadcast deferral paths return without updating nextDueMs; the cooldown contract shows flood and min-spacing are non-retry deferrals, while wake-layer retry is reserved for busy skip reasons. (src/infra/heartbeat-runner.ts:2368, 57c88dd46e2d)
  • PR patch targets the implicated scheduler state: The PR adds advanceStaleScheduleAfterDeferral, calls it on targeted and broadcast deferrals, and removes the disabled-skip exception so non-retryable outcomes advance due slots. (src/infra/heartbeat-runner.ts:2171, 213003a85412)
  • PR patch adds focused regression coverage: The PR adds tests for non-retryable disabled skips and flood deferrals without wake-layer retry, asserting the scheduler no longer keeps rearming immediately. (src/infra/heartbeat-runner.scheduler.test.ts:363, 213003a85412)

Likely related people:

  • steipete: Current scheduler/cooldown lines blame to 3fc0df953cf17b8bc47e45250c84965765cfed76, a recent merged core refactor touching this heartbeat path; the same person also authored this PR, so they are relevant through prior current-main history, not just this branch. (role: recent area contributor; confidence: high; commits: 3fc0df953cf1, 733f7af92b01; files: src/infra/heartbeat-runner.ts, src/infra/heartbeat-cooldown.ts)
  • George Zhang: Commit 9a4a9a5993cc7862efa4bd876be755fe6c399702 changed stable heartbeat phase scheduling, which is adjacent to due-slot calculation and interval scheduling. (role: scheduler feature contributor; confidence: medium; commits: 9a4a9a5993cc; files: src/infra/heartbeat-runner.ts)
  • MiloStack: Commit b33ad4d7cb45b27faf7b21cf8333b5c5f630fc07 previously fixed heartbeat timer re-arm behavior with try/finally, making them relevant to scheduler liveness history. (role: availability fix contributor; confidence: medium; commits: b33ad4d7cb45; files: src/infra/heartbeat-runner.ts)
  • Joseph Krug: Commit 5147656d6533158904f8eccafa0449a6f94fabcf fixed heartbeat scheduler death when runOnce throws, another nearby scheduler-liveness path. (role: availability fix contributor; confidence: medium; commits: 5147656d6533; files: src/infra/heartbeat-runner.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 30, 2026
@byungskers

Copy link
Copy Markdown

This seems like the right fix shape to me — repairing stale scheduler state is cleaner than adding another timer floor.

One extra guard I would love to see is a focused regression showing that the busy-skip / wake-owned retry path is still untouched by this change. The new tests cover the non-retry deferrals well; a neighboring “still retries the old way when busy” case would make the intent even harder to accidentally regress later.

@steipete

Copy link
Copy Markdown
Contributor Author

Landing proof for head 213003a.

Commands/proof:

  • pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose passed, 24 tests.
  • pnpm check:changed passed in Testbox tbx_01ksxfavykc7qyve4ysnxg3smh.
  • /Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main --parallel-tests "pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose" returned clean: no accepted/actionable findings.
  • GitHub CI for head 213003a is green, including Real behavior proof, preflight, security-fast, build-artifacts, and checks-node-core-runtime-infra-heartbeat-runner.

Known gap: no physical Pi ARM64 Docker soak on this exact branch. Physical Pi proof for the original symptom exists in #79418 / #79380 discussion; this PR fixes the stale scheduler state directly and keeps retryable busy skips owned by the wake layer. The existing focused guard for that path is src/infra/heartbeat-runner.scheduler.test.ts:846.

@steipete steipete removed the status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. label May 31, 2026
@steipete steipete removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 31, 2026
@steipete steipete merged commit bbc4bee into main May 31, 2026
202 of 209 checks passed
@steipete steipete deleted the fix/heartbeat-stale-schedule-refire branch May 31, 2026 05:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintainer Maintainer-authored PR P1 High-priority user-facing bug, regression, or broken workflow. size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway CPU spin / crash loop on Raspberry Pi 4 (ARM64) — regression from 4.23 to 4.25+

2 participants