Skip to content

fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161

Closed
Sid-Qin wants to merge 1 commit into
openclaw:mainfrom
Sid-Qin:fix/heartbeat-timer-rearm-31139
Closed

fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161
Sid-Qin wants to merge 1 commit into
openclaw:mainfrom
Sid-Qin:fix/heartbeat-timer-rearm-31139

Conversation

@Sid-Qin

@Sid-Qin Sid-Qin commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: Heartbeat timer fires once after the configured interval, runs all due agents, then never re-arms. After first batch, no further heartbeats triggered until gateway restart.
  • Why it matters: Heartbeat-driven agents stop running silently, requiring manual restart or external watchdog scripts.
  • What changed: Added a 60-second periodic watchdog interval as a safety net that re-triggers heartbeats if the primary setTimeout fails to fire. Removed .unref() from the primary timer. Added test for consecutive timer re-arming.
  • What did NOT change: No change to agent execution logic, heartbeat config resolution, or wake handler architecture.

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue/PR

User-visible / Behavior Changes

None. Heartbeats now reliably fire on schedule in all environments (Docker, Linux, macOS).

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux (Docker), macOS
  • Runtime: Node.js 22.22.0
  • Model/provider: Any
  • Integration/channel: N/A
  • Config: heartbeat.every: "10m" with multiple agents

Steps

  1. Configure multiple agents with heartbeat.every: "10m"
  2. Start gateway
  3. Observe heartbeat runs fire at the first 10-minute mark
  4. Wait for subsequent intervals

Expected

  • Heartbeats fire every 10 minutes continuously

Actual (before fix)

  • Timer fires once then stops

Evidence

  • Failing test/log before + passing after
  • New test re-arms timer after successful batch and fires again passes (3 consecutive fires)
  • All 7 heartbeat scheduler tests pass

Human Verification

  • Verified scenarios: Timer re-arms after successful batch (3 cycles), error recovery, requests-in-flight retry, config update re-scheduling
  • Edge cases checked: Watchdog fires only when agents are due, cleanup clears both timer and watchdog, stopped runner skips watchdog
  • What was not verified: Production Docker environment with 11 agents

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery

  • How to disable/revert: Revert this commit
  • Files to restore: src/infra/heartbeat-runner.ts
  • Known bad symptoms: If watchdog interval is too aggressive, it could cause duplicate heartbeat runs (mitigated by existing de-duplication in wake handler)

Risks and Mitigations

  • Risk: Watchdog fires while primary timer is still pending, causing a duplicate wake request
    • Mitigation: requestHeartbeatNow coalesces duplicate wake requests via queuePendingWakeReason which deduplicates by target key

Made with Cursor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d8da5420c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const now = Date.now();
for (const agent of state.agents.values()) {
if (now >= agent.nextDueMs) {
requestHeartbeatNow({ reason: "interval", coalesceMs: 0 });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid zero-delay watchdog wake when a run is in flight

Using coalesceMs: 0 here can create a tight reschedule loop when a heartbeat run takes longer than one watchdog tick (60s): while runOnce is still executing, nextDueMs is still overdue, so the watchdog queues another wake, and schedule() in src/infra/heartbeat-wake.ts keeps re-arming immediate timers when running === true (schedule(delay, kind) with delay=0). In that scenario, the process can spin hot until the long run finishes, causing unnecessary CPU/event-loop pressure in production workloads with slow model/network calls.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes critical bug where heartbeat timers fired once then never re-armed, causing agent heartbeats to stop after first execution.

Key Changes

  • Added 60-second watchdog interval as safety net to re-trigger heartbeats if primary timer fails
  • Removed .unref() from primary timer (behavior change - timer now keeps process alive for reliability)
  • Added proper watchdog cleanup in stop/cleanup function
  • New test verifies 3 consecutive timer fires to catch re-arming failures

Implementation Details

The watchdog uses setInterval to check every 60 seconds if any agents are due. If found, it calls requestHeartbeatNow({ reason: "interval", coalesceMs: 0 }), same as the primary timer. The watchdog is:

  • Armed once during initialization (in scheduleNext)
  • Never cleared during reschedules (runs continuously as a safety net)
  • Only cleared when runner is stopped
  • Uses .unref() to avoid keeping process alive

Deduplication in queuePendingWakeReason (in heartbeat-wake.ts) prevents duplicate runs if both watchdog and primary timer fire simultaneously - uses Map keyed by agentId::sessionKey to ensure only one wake request per target.

Risk Assessment

Low risk. All edge cases properly handled:

  • Duplicate wake requests deduplicated by target key
  • run() function checks now < agent.nextDueMs before executing, preventing duplicate agent runs
  • Watchdog adapts to config changes automatically via state.stopped and state.agents.size checks
  • Proper cleanup prevents resource leaks

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - fixes critical production issue with well-designed safety net
  • The implementation is solid and well-tested. The watchdog pattern is a proven approach for ensuring reliability. All potential race conditions are properly handled through deduplication and state checks. The test suite verifies the fix works for consecutive cycles. The code changes are minimal, focused, and don't affect agent execution logic. The removal of .unref() from the primary timer is a reasonable trade-off for reliability.
  • No files require special attention - both implementation and test are straightforward

Last reviewed commit: 5d8da54

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Heartbeat timer stops after first batch - scheduleNext() timer never re-fires

2 participants