fix(heartbeat): advance stale scheduler deferrals#88462
Conversation
|
Codex review: needs real behavior proof before merge. Reviewed May 30, 2026, 6:44 PM ET / 22:44 UTC. Summary PR surface: Source +13, Tests +59. Total +72 across 2 files. Reproducibility: yes. Source inspection gives a high-confidence reproduction path on current Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the semantic scheduler-state fix after maintainer review confirms the new due-slot advancement is the intended contract and the proof bar is acceptable for the linked availability regression. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection gives a high-confidence reproduction path on current Is this the best way to solve the issue? Yes. The proposed direction is a narrower, more maintainable fix than a timer-floor mitigation because it repairs stale scheduler state while preserving wake-layer retries for busy runtimes. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 57c88dd46e2d. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +13, Tests +59. Total +72 across 2 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
This seems like the right fix shape to me — repairing stale scheduler state is cleaner than adding another timer floor. One extra guard I would love to see is a focused regression showing that the busy-skip / wake-owned retry path is still untouched by this change. The new tests cover the non-retry deferrals well; a neighboring “still retries the old way when busy” case would make the intent even harder to accidentally regress later. |
|
Landing proof for head 213003a. Commands/proof:
Known gap: no physical Pi ARM64 Docker soak on this exact branch. Physical Pi proof for the original symptom exists in #79418 / #79380 discussion; this PR fixes the stale scheduler state directly and keeps retryable busy skips owned by the wake layer. The existing focused guard for that path is src/infra/heartbeat-runner.scheduler.test.ts:846. |
Summary
Fixes #79380.
Supersedes #79418 with the semantic scheduler-state fix instead of a timer-floor mitigation.
This keeps retryable busy skips owned by the wake-layer retry path, but advances stale heartbeat due slots after terminal skips and non-retry deferrals. That makes
scheduleNext()see a future due time instead of rearming an immediate timer against the same past-due agent.Verification
pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbosepnpm check:changed(Testboxtbx_01ksxfavykc7qyve4ysnxg3smh)pnpm test:changed(failed in unrelatedsrc/auto-reply/reply/commands-status.thinking-default.test.tshook timeout; reran that exact file and reproduced the same unrelated timeout)autoreview --mode branch --base origin/main --parallel-tests "pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verbose"-> clean, no accepted/actionable findingsReal behavior proof
Behavior addressed: Gateway heartbeat scheduler can hot-loop when every due agent either returns a terminal skip without advancing
nextDueMsor is deferred by a non-retry cooldown/flood gate whilenextDueMsis already stale.Real environment tested: Local macOS source checkout for focused scheduler behavior; Blacksmith Testbox for changed check gate.
Exact steps or command run after this patch:
pnpm test src/infra/heartbeat-runner.scheduler.test.ts -- --reporter=verboseandpnpm check:changed.Evidence after fix: Scheduler regression tests pass and cover both stale terminal
disabledskips and flood deferrals without wake-layer retry; changed check passed in Testboxtbx_01ksxfavykc7qyve4ysnxg3smh.Observed result after fix: Non-retryable terminal skips now record bookkeeping and advance the agent due slot; stale non-retry deferrals advance the due slot before returning, so the scheduler does not keep rearming at 0ms. Retryable busy skips still avoid schedule advancement and continue to use the wake-layer retry path.
What was not tested: Physical Raspberry Pi 4 ARM64 Docker soak on this exact branch. The original report/PR discussion already includes physical Pi before/after proof for the same scheduler symptom; this PR changes the fix shape to repair the stale schedule state directly.