fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139) by Sid-Qin · Pull Request #31161 · openclaw/openclaw

Sid-Qin · 2026-03-02T01:58:34Z

Summary

Problem: Heartbeat timer fires once after the configured interval, runs all due agents, then never re-arms. After first batch, no further heartbeats triggered until gateway restart.
Why it matters: Heartbeat-driven agents stop running silently, requiring manual restart or external watchdog scripts.
What changed: Added a 60-second periodic watchdog interval as a safety net that re-triggers heartbeats if the primary setTimeout fails to fire. Removed .unref() from the primary timer. Added test for consecutive timer re-arming.
What did NOT change: No change to agent execution logic, heartbeat config resolution, or wake handler architecture.

Change Type

Bug fix

Scope

Gateway / orchestration

Linked Issue/PR

Closes Heartbeat timer stops after first batch - scheduleNext() timer never re-fires #31139

User-visible / Behavior Changes

None. Heartbeats now reliably fire on schedule in all environments (Docker, Linux, macOS).

Security Impact

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: Linux (Docker), macOS
Runtime: Node.js 22.22.0
Model/provider: Any
Integration/channel: N/A
Config: heartbeat.every: "10m" with multiple agents

Steps

Configure multiple agents with heartbeat.every: "10m"
Start gateway
Observe heartbeat runs fire at the first 10-minute mark
Wait for subsequent intervals

Expected

Heartbeats fire every 10 minutes continuously

Actual (before fix)

Timer fires once then stops

Evidence

Failing test/log before + passing after
New test re-arms timer after successful batch and fires again passes (3 consecutive fires)
All 7 heartbeat scheduler tests pass

Human Verification

Verified scenarios: Timer re-arms after successful batch (3 cycles), error recovery, requests-in-flight retry, config update re-scheduling
Edge cases checked: Watchdog fires only when agents are due, cleanup clears both timer and watchdog, stopped runner skips watchdog
What was not verified: Production Docker environment with 11 agents

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Failure Recovery

How to disable/revert: Revert this commit
Files to restore: src/infra/heartbeat-runner.ts
Known bad symptoms: If watchdog interval is too aggressive, it could cause duplicate heartbeat runs (mitigated by existing de-duplication in wake handler)

Risks and Mitigations

Risk: Watchdog fires while primary timer is still pending, causing a duplicate wake request
- Mitigation: requestHeartbeatNow coalesces duplicate wake requests via queuePendingWakeReason which deduplicates by target key

Made with Cursor

openclaw#31139) Made-with: Cursor

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d8da5420c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-02T02:01:42Z

+      const now = Date.now();
+      for (const agent of state.agents.values()) {
+        if (now >= agent.nextDueMs) {
+          requestHeartbeatNow({ reason: "interval", coalesceMs: 0 });


Avoid zero-delay watchdog wake when a run is in flight

Using coalesceMs: 0 here can create a tight reschedule loop when a heartbeat run takes longer than one watchdog tick (60s): while runOnce is still executing, nextDueMs is still overdue, so the watchdog queues another wake, and schedule() in src/infra/heartbeat-wake.ts keeps re-arming immediate timers when running === true (schedule(delay, kind) with delay=0). In that scenario, the process can spin hot until the long run finishes, causing unnecessary CPU/event-loop pressure in production workloads with slow model/network calls.

Useful? React with 👍 / 👎.

greptile-apps · 2026-03-02T02:04:40Z

Greptile Summary

Fixes critical bug where heartbeat timers fired once then never re-armed, causing agent heartbeats to stop after first execution.

Key Changes

Added 60-second watchdog interval as safety net to re-trigger heartbeats if primary timer fails
Removed .unref() from primary timer (behavior change - timer now keeps process alive for reliability)
Added proper watchdog cleanup in stop/cleanup function
New test verifies 3 consecutive timer fires to catch re-arming failures

Implementation Details

The watchdog uses setInterval to check every 60 seconds if any agents are due. If found, it calls requestHeartbeatNow({ reason: "interval", coalesceMs: 0 }), same as the primary timer. The watchdog is:

Armed once during initialization (in scheduleNext)
Never cleared during reschedules (runs continuously as a safety net)
Only cleared when runner is stopped
Uses .unref() to avoid keeping process alive

Deduplication in queuePendingWakeReason (in heartbeat-wake.ts) prevents duplicate runs if both watchdog and primary timer fire simultaneously - uses Map keyed by agentId::sessionKey to ensure only one wake request per target.

Risk Assessment

Low risk. All edge cases properly handled:

Duplicate wake requests deduplicated by target key
run() function checks now < agent.nextDueMs before executing, preventing duplicate agent runs
Watchdog adapts to config changes automatically via state.stopped and state.agents.size checks
Proper cleanup prevents resource leaks

Confidence Score: 5/5

This PR is safe to merge with minimal risk - fixes critical production issue with well-designed safety net
The implementation is solid and well-tested. The watchdog pattern is a proven approach for ensuring reliability. All potential race conditions are properly handled through deduplication and state checks. The test suite verifies the fix works for consecutive cycles. The code changes are minimal, focused, and don't affect agent execution logic. The removal of .unref() from the primary timer is a reasonable trade-off for reliability.
No files require special attention - both implementation and test are straightforward

_{Last reviewed commit: 5d8da54}

fix: add watchdog interval to re-arm heartbeat timer after first batch (

5d8da54

openclaw#31139) Made-with: Cursor

openclaw-barnacle Bot added size: XS experienced-contributor labels Mar 2, 2026

chatgpt-codex-connector Bot reviewed Mar 2, 2026

View reviewed changes

steipete mentioned this pull request Mar 3, 2026

fix(heartbeat): Fix timer re-arm logic and add robust watchdog #31226

Closed

1 task

thewilloftheshadow closed this Mar 6, 2026

MiloStack mentioned this pull request Mar 22, 2026

fix: guarantee heartbeat timer re-arm with try/finally #52270

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161

fix: add watchdog interval to re-arm heartbeat timer after first batch (#31139)#31161
Sid-Qin wants to merge 1 commit into
openclaw:mainfrom
Sid-Qin:fix/heartbeat-timer-rearm-31139

Sid-Qin commented Mar 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 2, 2026

Uh oh!

greptile-apps Bot commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Sid-Qin commented Mar 2, 2026

Summary

Change Type

Scope

Linked Issue/PR

User-visible / Behavior Changes

Security Impact

Repro + Verification

Environment

Steps

Expected

Actual (before fix)

Evidence

Human Verification

Compatibility / Migration

Failure Recovery

Risks and Mitigations

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Mar 2, 2026

Greptile Summary

Key Changes

Implementation Details

Risk Assessment

Confidence Score: 5/5

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants