Skip to content

fix(heartbeat): prevent ack-only pending delivery loops#80357

Merged
steipete merged 6 commits into
mainfrom
fix/heartbeat-pending-delivery-79258
May 10, 2026
Merged

fix(heartbeat): prevent ack-only pending delivery loops#80357
steipete merged 6 commits into
mainfrom
fix/heartbeat-pending-delivery-79258

Conversation

@steipete

@steipete steipete commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

  • recreate the closed fix(heartbeat): skip pendingFinalDelivery when reply strips to HEARTBEAT_OK #79270 heartbeat pending-delivery fix on current main, preserving @hclsys's commits and changelog credit
  • skip durable pending-final-delivery writes when heartbeat output strips to an ack-only HEARTBEAT_OK, while preserving real heartbeat notification content
  • clear stale ack-only pending replay state before it can refresh the heartbeat requests-in-flight window
  • thread the resolved heartbeat ackMaxChars policy through write, stale replay cleanup, and defer-guard paths; normalize stale replay content to the stripped real remainder

Fixes #79258.
Supersedes/replaces #79270.

Real Behavior Proof

Behavior or issue addressed: Heartbeat ack text such as HEARTBEAT_OK or short HEARTBEAT_OK remainder text should not become durable pending final-delivery replay state; stale ack-only pending entries should self-heal; real heartbeat content must still replay/deliver according to configured ackMaxChars.

Real environment tested: Local OpenClaw checkout on PR head b74aa475f9, Node 22 runtime through the repo pnpm test wrapper. The focused verification created real temporary session-store JSON files and exercised production TypeScript modules (runReplyAgent, getReplyFromConfig, runHeartbeatOnce, stripHeartbeatToken) with real file reads/writes; delivery/model edges were controlled by repo test harnesses.

Exact steps or command run after this patch:

pnpm test src/auto-reply/reply/get-reply.fast-path.test.ts src/infra/heartbeat-runner.skips-busy-session-lane.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts src/infra/heartbeat-runner.respects-ackmaxchars-heartbeat-acks.test.ts -- --reporter=verbose
pnpm test src/infra/heartbeat-runner.returns-default-unset.test.ts src/auto-reply/reply/dispatch-from-config.reply-dispatch.test.ts src/auto-reply/reply/session.heartbeat-no-reset.test.ts -- --reporter=verbose
OPENCLAW_TESTBOX= pnpm check:changed

Evidence after fix: Terminal output from the focused run:

✓ ... clears stale ack-only heartbeat pending delivery before replay
✓ ... uses ackMaxChars when replaying stale heartbeat pending delivery
✓ ... does not defer on a recent heartbeat ack pending final delivery
✓ ... keeps deferring recent pending delivery when ackMaxChars makes the remainder real content
✓ ... does not persist heartbeat ack text as pending final delivery
[test] passed 3 Vitest shards in 17.36s

Terminal output from the acceptance-adjacent run:

✓ ... clears pending final delivery after final dispatch succeeds
✓ ... preserves pending final delivery when final dispatch fails
✓ ... should NOT reset session when Provider is 'heartbeat'
[test] passed 2 Vitest shards in 6.53s

Observed result after fix: Ack-only pending delivery no longer self-refreshes the heartbeat skip loop. Existing ack-only pending state is cleared before replay. With ackMaxChars: 0, short post-token content remains real pending delivery and still defers as in-flight. Real heartbeat notification text remains durable for recovery.

What was not tested: No live 60-minute gateway daemon interval and no external Telegram/Discord adapter were run for this PR; proof is local production-module behavior with real session-store files plus CI.

Verification

pnpm exec oxfmt --check --threads=1 src/auto-reply/reply/get-reply.ts src/auto-reply/reply/get-reply.fast-path.test.ts src/infra/heartbeat-runner.skips-busy-session-lane.test.ts src/auto-reply/reply/agent-runner.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts src/infra/heartbeat-runner.ts src/infra/heartbeat-runner.test-utils.ts CHANGELOG.md
pnpm test src/auto-reply/reply/get-reply.fast-path.test.ts src/infra/heartbeat-runner.skips-busy-session-lane.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts src/infra/heartbeat-runner.respects-ackmaxchars-heartbeat-acks.test.ts -- --reporter=verbose
pnpm test src/infra/heartbeat-runner.returns-default-unset.test.ts src/auto-reply/reply/dispatch-from-config.reply-dispatch.test.ts src/auto-reply/reply/session.heartbeat-no-reset.test.ts -- --reporter=verbose
OPENCLAW_TESTBOX= pnpm check:changed
git diff --check origin/main...HEAD

Thanks @haumanto for the detailed report and @hclsys for the original fix work in #79270.

HCL and others added 6 commits May 10, 2026 18:27
shouldSkipHeartbeatPendingFinalDelivery was using the default 300-char
threshold regardless of per-agent heartbeat config. Replace with inline
logic that resolves ackMaxChars from cfg.agents[agentId].heartbeat ->
cfg.agents.defaults.heartbeat -> DEFAULT_HEARTBEAT_ACK_MAX_CHARS.

Also fix: store the stripped text (remainder after HEARTBEAT_OK) rather
than the raw payload text. Previously pendingFinalDeliveryText would
have contained the HEARTBEAT_OK prefix, causing heartbeat-runner to
re-deliver it verbatim on retry.

Resolves clawsweeper P2 review finding on #79270.
@openclaw-barnacle openclaw-barnacle Bot added size: M maintainer Maintainer-authored PR labels May 10, 2026
@clawsweeper

clawsweeper Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR adds heartbeat-specific pending-final-delivery handling so ack-only HEARTBEAT_OK text is not persisted or replayed, threads ackMaxChars through replay/defer decisions, and adds regression tests plus a changelog entry.

Reproducibility: yes. Current main source shows a concrete loop path: ack text can be persisted before stripping, heartbeat replay refreshes updatedAt, and the runner then defers on recent pending delivery; I did not run a live fixture in this read-only review.

Real behavior proof
Needs real behavior proof before merge: The PR body lists tests and checks but does not show after-fix real behavior output from a real setup; add redacted terminal output, logs, or a recording/artifact showing the pending state no longer loops, then update the PR body to trigger re-review or ask a maintainer for @clawsweeper re-review.

Next step before merge
This active protected-label PR needs maintainer review and after-fix real behavior proof rather than a ClawSweeper repair PR.

Security
Cleared: The diff only changes TypeScript heartbeat/session logic, tests, and changelog text; I found no concrete security or supply-chain concern.

Review details

Best possible solution:

Land a focused heartbeat/session-state fix after maintainer review, preserving durable replay for real heartbeat content while keeping ack-only heartbeat text out of pending replay state.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source shows a concrete loop path: ack text can be persisted before stripping, heartbeat replay refreshes updatedAt, and the runner then defers on recent pending delivery; I did not run a live fixture in this read-only review.

Is this the best way to solve the issue?

Yes, the code direction is the narrow maintainable fix: reuse the existing heartbeat token classifier at write, replay, and defer points rather than redesigning durable delivery. The remaining blocker is proof and maintainer handling, not an obvious patch defect.

What I checked:

Likely related people:

  • MertBasar0: Commit history for the heartbeat runner and reply paths shows c240e71 introduced the main-session durable pending-final-delivery feature, including the heartbeat deferral behavior this PR narrows. (role: introduced behavior; confidence: high; commits: c240e718e91e; files: src/infra/heartbeat-runner.ts, src/auto-reply/reply/agent-runner.ts, src/auto-reply/reply/get-reply.ts)
  • steipete: Recent current-main history for the same reply/heartbeat/session delivery surface includes delivery API, followup-drain, queue, and heartbeat fallback cleanup work; this also makes the PR author relevant beyond merely opening this PR. (role: recent area contributor; confidence: high; commits: a4b17d65a8ff, 0909df1a4f3d, 751423299bb5; files: src/auto-reply/reply/agent-runner.ts, src/auto-reply/reply/get-reply.ts, src/infra/heartbeat-runner.ts)
  • shakkernerd: Local blame and GitHub file history show recent touches on the same heartbeat and agent-turn diagnostics area, making them a secondary routing candidate if maintainers need context on recent path movement. (role: recent adjacent contributor; confidence: medium; commits: 01741f81f84e, 61223a74a43f; files: src/infra/heartbeat-runner.ts, src/auto-reply/reply/agent-runner.ts, src/auto-reply/reply/get-reply.ts)

Remaining risk / open question:

  • The PR body lists tests and checks but does not include after-fix real behavior proof from a real setup or copied terminal/log output showing the fixed behavior.
  • I did not execute the verification commands because this was a read-only review that had to leave the checkout clean.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 3ba2ab7a0950.

Re-review progress:

@steipete

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@steipete steipete merged commit 1c1253e into main May 10, 2026
134 of 140 checks passed
@steipete steipete deleted the fix/heartbeat-pending-delivery-79258 branch May 10, 2026 18:00
@steipete

Copy link
Copy Markdown
Contributor Author

Landed via rebase onto main.

  • Gate: pnpm exec oxfmt --check --threads=1 ...; focused heartbeat/reply tests; acceptance-adjacent heartbeat/reply tests; OPENCLAW_TESTBOX= pnpm check:changed; git diff --check origin/main...HEAD
  • Source head before merge: b74aa47
  • Landed head on main: 1c1253e

Thanks @hclsys for the original fix work, and @haumanto for the report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Heartbeat death loop — pendingFinalDelivery stuck on agent main session, blocks all future heartbeats for days

2 participants