Skip to content

fix(heartbeat): clear pendingFinalDelivery* on send success#83187

Open
agocs wants to merge 3 commits into
openclaw:mainfrom
agocs:fix/heartbeat-clear-pending-final-delivery
Open

fix(heartbeat): clear pendingFinalDelivery* on send success#83187
agocs wants to merge 3 commits into
openclaw:mainfrom
agocs:fix/heartbeat-clear-pending-final-delivery

Conversation

@agocs

@agocs agocs commented May 17, 2026

Copy link
Copy Markdown

Summary

Fixes #83184.

The heartbeat send path in src/infra/heartbeat-runner.ts:2011-2022 wrote lastHeartbeatText / lastHeartbeatSentAt after a successful sendDurableMessageBatch but never nulled the pendingFinalDelivery* recovery fields. A heartbeat-driven agent run that previously set pendingFinalDelivery: true (via src/auto-reply/reply/agent-runner.ts:2169) therefore left the session permanently stuck: subsequent heartbeats short-circuited in get-reply's redelivery path, the agent was never re-invoked, and the symptom surfaced as silent loss of notifications.

This change extends the existing post-success updateSessionStore block to clear all seven pendingFinalDelivery* fields, mirroring the canonical clearPendingFinalDeliveryAfterSuccess helper at src/auto-reply/reply/dispatch-from-config.ts:365-391 (already invoked from the user-message dispatch path at :1671).

Affected versions: verified on 2026.5.7 through 2026.5.16-beta.2; fix applies against current main (2026.5.17).

Verification

Commands run locally against this branch on Ubuntu 24.04, Node 22.22.2, pnpm 11.1.0:

  • node scripts/run-vitest.mjs src/infra/heartbeat-runner.clears-pending-final-delivery.test.ts — new regression test, 1 passed (also confirmed RED before the fix landed in the same branch).
  • node scripts/run-vitest.mjs <all 16 src/infra/heartbeat-runner.*.test.ts non-e2e files> — 16 files / 152 tests passed.
  • pnpm check:changed — green: conflict markers, changelog attributions, wildcard re-export guards, duplicate coverage, dependency pin guard, package patch guard, typecheck core, typecheck core tests, lint core (0/0), media download helper guard, runtime sidecar loader guard, runtime import cycles, webhook body guard, pairing store guard, pairing account guard.
  • oxfmt --check and core oxlint — both clean on the two touched files.

The new test seeds a session with pendingFinalDelivery: true + a heartbeat-ack pendingFinalDeliveryText (so the heartbeat-defer check at heartbeat-runner.ts:~1328 does not short-circuit), manually patches in the remaining five pendingFinalDelivery* fields the seeder does not expose, runs runHeartbeatOnce with a substantive reply, and asserts all seven fields are undefined afterward. It also asserts lastHeartbeatText / lastHeartbeatSentAt are written, proving the success branch was actually taken.

Real behavior proof

Behavior addressed: heartbeat send-success path leaves pendingFinalDelivery* set, causing the next heartbeat to skip the agent via the get-reply redelivery short-circuit.

Real environment tested: Docker-deployed gateway on Ubuntu 24.04 (kernel 6.8.0-111-generic). Prior to this patch the gateway was on OpenClaw 2026.5.12; the bug was deterministic there, confirmed by an out-of-band cron job (*/10 * * * *) that nulls the stuck fields whenever pendingFinalDeliveryText matches lastHeartbeatText. Cron log baseline showed one "cleared stuck" entry per hour (e.g., 2026-05-17T15:50:01Z and 2026-05-17T16:50:01Z, both agent:main:main) — i.e., every substantive heartbeat left the session stuck and the cron had to clean up.

Exact steps or command run after this patch:

  1. Checked out this branch (fix/heartbeat-clear-pending-final-delivery, current main + the diff in this PR; reported OpenClaw 2026.5.17).
  2. Built the openclaw image from this branch: DOCKER_BUILDKIT=1 docker build -t openclaw-source:fix-heartbeat-pending-delivery ..
  3. Pointed the deployment stack at that locally built image and redeployed: docker compose build && docker compose down && docker compose up -d. New container came up at 2026-05-17T17:04:33Z.
  4. Confirmed the cleanup block compiled into the runtime bundle: docker compose exec openclaw-gateway grep -c "pendingFinalDeliveryContext: void 0" /app/dist/heartbeat-runner-*.js1 in the send-path bundle (heartbeat-runner-CecDa3Rk.js), 0 in the runtime stub (heartbeat-runner-DPGtQP6X.js). Context grep around lastHeartbeatSentAt: startedAt shows the seven pendingFinalDelivery*: void 0 lines immediately after it, in the same updateSessionStore write that records lastHeartbeatText/lastHeartbeatSentAt.
  5. Watched agent:main:main in the host-mounted sessions.json and the cron log over the next ~2 hours of real heartbeats.

Evidence after fix:

  • New regression test (src/infra/heartbeat-runner.clears-pending-final-delivery.test.ts) goes RED on main (expected true to be undefined at the pendingFinalDelivery assertion) and GREEN with this patch. Sister tests in src/infra/heartbeat-runner.* (16 files / 152 tests) remain green.
  • Live prod heartbeat trace (agent:main:main, host-mounted sessions store, redacted preview):
    • 2026-05-17T17:04:33Z — patched container started (OpenClaw 2026.5.17).
    • 2026-05-17T17:04:41Z — heartbeat scheduler logged heartbeat: started intervalMs=1800000. Subsequent ticks ran but were all HEARTBEAT_OK (silent, no store write), keeping lastHeartbeatSentAt at the pre-redeploy 2026-05-17T16:40:18Z for ~2.5 hours.
    • 2026-05-17T19:10:18Z — first substantive heartbeat after the redeploy. lastHeartbeatSentAt advanced to 1779045018769, lastHeartbeatText written ("One mildly notable item — everything else is promos: …").
    • 2026-05-17T19:11:04Z (≤46s after the send write, well before the next */10 cron tick at 19:20:00Z) — pendingFinalDelivery: None, pendingFinalDeliveryText: "". This clean state is attributable only to the patched send-path write, not to the cron.
    • 2026-05-17T19:20:00Z cron tick — zero new lines in clear-stuck-pending-delivery.log (baseline was 2 lines, post-tick was still 2 lines). On the unpatched image this is the tick that would have logged a "cleared stuck" line, matching the prior 15:50 / 16:50 cadence.

Observed result after fix: a real substantive heartbeat in the deployed gateway now leaves all seven pendingFinalDelivery* fields cleared in the same updateSessionStore write that records lastHeartbeatText/lastHeartbeatSentAt, and the standing cron mitigation no longer finds anything to clean up.

What was not tested:

  • Multi-hour soak: only one substantive heartbeat has been observed post-deploy at the time of writing; a longer window would harden confidence that the cron stays silent across many cycles.
  • Channels other than Telegram: this deployment only uses Telegram for heartbeat delivery, so WhatsApp / Slack send-success paths have not been exercised end-to-end (the unit test covers the post-success store write generically via the test-runtime channel plugin).
  • The pendingFinalDelivery: true recovery branch where final delivery failed and a retry is genuinely needed — that path is still gated by the clearPendingFinalDeliveryAfterSuccess guard in dispatch-from-config.ts; this PR only adds the equivalent clear in the heartbeat send-success path.

@openclaw-barnacle openclaw-barnacle Bot added size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 17, 2026
@clawsweeper

clawsweeper Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs changes before merge. Reviewed May 30, 2026, 3:02 AM ET / 07:02 UTC.

Summary
The branch clears durable pendingFinalDelivery* session fields after a successful heartbeat send and adds a focused regression test for that send-success path.

PR surface: Source +11, Tests +98. Total +109 across 2 files.

Reproducibility: yes. Source inspection shows sendDurableMessageBatch can return suppressed, while the heartbeat branch only rejects failed and partial_failed before the PR's new cleanup runs.

Review metrics: 1 noteworthy metric.

  • Durable send terminal statuses entering cleanup: 2 non-failure statuses: sent and suppressed. The cleanup is safe for visible sends, but suppressed can mean no outbound message was delivered.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦀 challenger crab
Patch quality: 🦐 gold shrimp
Result: needs maintainer review before merge.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Preserve pending recovery state when sendDurableMessageBatch returns suppressed and add a regression test for that path.
  • Have maintainers decide whether the duplicate-suppression cleanup from the closed sibling belongs in this branch or a separate follow-up.

Risk before merge

  • [P1] The new cleanup treats suppressed durable send results as successful delivery; hook-cancelled or no-visible-result sends can lose pendingFinalDelivery* recovery state.
  • [P1] The closed sibling work identified a separate duplicate-suppression cleanup path that this branch does not include, so maintainers should decide whether that belongs in this PR or a follow-up.

Maintainer options:

  1. Gate Cleanup On Visible Delivery (recommended)
    Update the heartbeat branch so pendingFinalDelivery* is cleared only after an actual sent durable result, while preserving recovery state for suppressed results and covering that with a regression test.
  2. Accept Suppressed As Terminal
    Maintainers may intentionally treat suppressed as terminal for heartbeat delivery, but the PR should say that explicitly and add proof that this will not drop recoverable user-visible heartbeats.
  3. Split Duplicate Suppression
    If the duplicate-suppression cleanup is not part of this branch, leave it tracked as a separate follow-up rather than expanding this PR without maintainer agreement.
Copy recommended automerge instruction
@clawsweeper automerge

Special instructions:
Update `src/infra/heartbeat-runner.ts` so the pendingFinalDelivery cleanup added by this PR only runs after an actual visible durable send result, preserve pending recovery state for `send.status === "suppressed"`, add a regression test for the suppressed-send case, and keep the successful-send cleanup test green.

Next step before merge

  • [P2] A narrow automated repair can update the heartbeat cleanup condition and add the missing suppressed-send regression without a product decision.

Security
Cleared: The diff only changes heartbeat session-state cleanup and a colocated test; it does not add dependencies, workflow changes, secret handling, or new code-execution surfaces.

Review findings

  • [P2] Gate cleanup on an actual sent heartbeat — src/infra/heartbeat-runner.ts:2080-2086
Review details

Best possible solution:

Gate the pending-final-delivery cleanup to actual visible heartbeat delivery, add regression coverage for the suppressed-send case, and make an explicit maintainer call on whether duplicate-suppression cleanup lands here or separately.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows sendDurableMessageBatch can return suppressed, while the heartbeat branch only rejects failed and partial_failed before the PR's new cleanup runs.

Is this the best way to solve the issue?

No, not quite. Clearing after a real sent heartbeat is the right fix shape, but the branch should distinguish sent from suppressed and add the missing regression coverage before merge.

Full review comments:

  • [P2] Gate cleanup on an actual sent heartbeat — src/infra/heartbeat-runner.ts:2080-2086
    This cleanup now runs for every non-failed durable result, but sendDurableMessageBatch also returns suppressed when hooks cancel delivery or no visible outbound result exists. In that case no heartbeat was delivered, so clearing pendingFinalDelivery* here can erase the retry state that should remain for recovery; only clear after a real sent result, and add a suppressed-result regression test.
    Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.86

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f90b8cffc744.

Label changes

Label justifications:

  • P1: The PR targets a high-impact heartbeat bug that can silently stop user notifications and model invocations.
  • merge-risk: 🚨 message-delivery: Clearing pending recovery state after a suppressed durable send can drop a still-undelivered heartbeat final message.
  • merge-risk: 🚨 session-state: The diff mutates durable session recovery fields, and the current condition can erase them for a no-visible-result send.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦀 challenger crab and patch quality is 🦐 gold shrimp.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (logs): The PR body supplies after-fix real gateway proof with redacted live heartbeat/session-store logs for the sent-path behavior.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix real gateway proof with redacted live heartbeat/session-store logs for the sent-path behavior.
Evidence reviewed

PR surface:

Source +11, Tests +98. Total +109 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 11 0 +11
Tests 1 98 0 +98
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 109 0 +109

Acceptance criteria:

  • [P1] node scripts/run-vitest.mjs src/infra/heartbeat-runner.clears-pending-final-delivery.test.ts.
  • [P1] node scripts/run-vitest.mjs src/channels/message/send.test.ts.

What I checked:

  • PR diff adds cleanup inside the existing non-failed heartbeat send branch: The PR adds pendingFinalDelivery* clears to the same store update that records lastHeartbeatText and lastHeartbeatSentAt; the surrounding code only throws for failed and partial_failed, so suppressed still reaches the new cleanup. (src/infra/heartbeat-runner.ts:2080, 2e62d2e0bb64)
  • Current heartbeat branch treats non-failed durable results as delivered bookkeeping: Current main calls sendDurableMessageBatch, throws only for failed or partial_failed, and then records heartbeat delivery metadata for the remaining statuses. (src/infra/heartbeat-runner.ts:2035, f90b8cffc744)
  • Durable send contract includes a no-visible-delivery suppressed status: sendDurableMessageBatch can return suppressed with empty results, and its own tests cover both no visible outbound result and hook-cancelled delivery as committed suppressed sends. (src/channels/message/send.ts:70, f90b8cffc744)
  • User-message dispatch does not clear pending state when no final route succeeded: The normal dispatch path marks final delivery failed when no routed final delivery occurred, and only calls clearPendingFinalDeliveryAfterSuccess when attempted delivery did not fail. (src/auto-reply/reply/dispatch-from-config.ts:2749, f90b8cffc744)
  • Related sibling comment identifies remaining duplicate-suppression scope: A contributor linked closed sibling work showing that a matching stale pending state can also survive when the duplicate-suppression branch returns before the send-success write; that is a scope decision for this canonical PR or a follow-up.
  • Real behavior proof is strong for the sent path: The PR body includes a Docker-deployed gateway proof with before/after session-store observations, cron-mitigation silence after the patch, and focused regression-test output for the successful-send path. (2e62d2e0bb64)

Likely related people:

  • Ayaan Zaidi: Current checkout blame attributes the heartbeat runner, get-reply pending-delivery branch, session type, and dispatch clear helper snapshots to commit 25dfe92; the local history is shallow/grafted, so this is a recent-current-source signal rather than full authorship proof. (role: recent area contributor; confidence: medium; commits: 25dfe9294fbb; files: src/infra/heartbeat-runner.ts, src/auto-reply/reply/get-reply.ts, src/auto-reply/reply/dispatch-from-config.ts)
  • void: git log -S lastHeartbeatText finds commit e274b5a adding heartbeat prompt and dedupe behavior in src/infra/heartbeat-runner.ts, credited in the subject to @voidserf. (role: heartbeat dedupe feature history; confidence: medium; commits: e274b5a040b8; files: src/infra/heartbeat-runner.ts)
  • Peter Steinberger: Recent visible release/current-main commits in the shallow checkout include broad snapshots of the central heartbeat and session files; useful as a low-confidence routing signal when deeper history is unavailable locally. (role: recent release/current-main contributor; confidence: low; commits: 27ae826f6525, 5db2cd6c00b5; files: src/infra/heartbeat-runner.ts, src/auto-reply/reply/get-reply.ts, src/auto-reply/reply/dispatch-from-config.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added P1 High-priority user-facing bug, regression, or broken workflow. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. labels May 17, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 17, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 17, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. impact:message-loss Channel message delivery can be lost, duplicated, or misrouted. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg: ✨ hatched 🥚 common Clockwork Signal Puff. Rarity: 🥚 common. Trait: hums during re-review.

Details

Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Clockwork Signal Puff in ClawSweeper.
Hatchability:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

About:

  • Eggs appear after real-behavior proof passes. They are collectible flavor only.
  • Review momentum changes the shell state: follow-up work warms it, re-review makes it wobble, and a clean final review lets it hatch.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

Copy link
Copy Markdown
Contributor

Sibling PR #87583 was closed as superseded by this canonical cleanup path, which makes sense for the send-success portion. One remaining gap from #87583 may still be worth porting here: duplicate-suppression cleanup.

#83187 clears pendingFinalDelivery* after the successful send path. #87583 additionally clears matching stale pending state when the duplicate branch returns before the send-success store write, using this predicate:

  • pendingFinalDelivery === true
  • trimmed pendingFinalDeliveryText exactly matches the normalized heartbeat text
  • duplicate suppression already proved the same payload was delivered recently via lastHeartbeatText / lastHeartbeatSentAt

I ran a disposable local OpenClaw source-runtime proof on #87583 (codex/fix-heartbeat-pending-replay, 9497a7a032cf) with a temporary config/workspace/session store and the actual runHeartbeatOnce + sendDurableMessageBatch heartbeat path. The model reply was stubbed to deterministic text so the proof isolates delivery/session-state behavior.

Relevant duplicate-suppression proof:

{
  "scenario": "duplicate-skip",
  "result": { "status": "ran", "durationMs": 51 },
  "sendCallCount": 0,
  "before": {
    "pendingFinalDelivery": true,
    "pendingFinalDeliveryText": "PR87583 duplicate-skip heartbeat payload",
    "pendingFinalDeliveryCreatedAt": 1779957646044,
    "pendingFinalDeliveryLastAttemptAt": 1779957706044,
    "pendingFinalDeliveryAttemptCount": 2,
    "pendingFinalDeliveryLastError": "redacted prior delivery failure",
    "pendingFinalDeliveryContext": { "channel": "telegram", "to": "proof-chat" },
    "pendingFinalDeliveryIntentId": "intent-duplicate-skip",
    "lastHeartbeatText": "PR87583 duplicate-skip heartbeat payload",
    "lastHeartbeatSentAt": 1779957706044
  },
  "after": {
    "pendingFinalDelivery": "<absent>",
    "pendingFinalDeliveryText": "<absent>",
    "pendingFinalDeliveryCreatedAt": "<absent>",
    "pendingFinalDeliveryLastAttemptAt": "<absent>",
    "pendingFinalDeliveryAttemptCount": "<absent>",
    "pendingFinalDeliveryLastError": "<absent>",
    "pendingFinalDeliveryContext": "<absent>",
    "pendingFinalDeliveryIntentId": "<absent>",
    "lastHeartbeatText": "PR87583 duplicate-skip heartbeat payload",
    "lastHeartbeatSentAt": 1779957706044
  },
  "clearedAllPendingFields": true,
  "deliveryEvidence": "duplicate suppression skipped outbound send because lastHeartbeatText matched within 24h"
}

If maintainers want this PR to remain the single canonical fix, please consider porting the duplicate-suppression cleanup and regression test from #87583 rather than reopening the sibling PR.

@RomneyDa

Copy link
Copy Markdown
Member

Heads up: this PR needs to be updated against current main before the new required Dependency Guard check can pass.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 29, 2026
@agocs

agocs commented May 29, 2026

Copy link
Copy Markdown
Author

Button clicked, thanks @RomneyDa

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 29, 2026
Heartbeat-driven agent runs can leave pendingFinalDelivery set on the
session; runHeartbeatOnce wrote lastHeartbeatText/lastHeartbeatSentAt on
send success but never cleared the recovery fields, so the next heartbeat
hit the get-reply redelivery short-circuit and skipped the agent. Extend
the existing post-success updateSessionStore block to null all seven
pendingFinalDelivery* fields, mirroring clearPendingFinalDeliveryAfterSuccess
in dispatch-from-config.ts.

Refs: openclaw#83184
@agocs agocs force-pushed the fix/heartbeat-clear-pending-final-delivery branch from 20311f3 to e2f3d73 Compare May 30, 2026 06:42
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 30, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. and removed rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 30, 2026
@aw-stevens

Copy link
Copy Markdown

Confirming this on a production deployment — the fix is correct and the approach is the right one.

We hit this on OpenClaw 2026.5.26 in a multi-agent setup: one agent runs a main-session heartbeat (isolatedSession: false) on a reasoning model that occasionally emits a no-op reply exceeding ackMaxChars: 300 instead of a bare HEARTBEAT_OK. The moment that happens, pendingFinalDelivery gets set and is never cleared — and every subsequent tick short-circuits into the redelivery path, replaying the stale text with no model call. So the awareness heartbeat silently stops doing any real work and re-posts the same stale message to the channel. Fully deterministic; we watched pendingFinalDeliveryAttemptCount climb to 18 before intervening.

What makes @agocs' method the right fix specifically:

  • The bug is an asymmetry between the two delivery paths.
  • The user-message dispatch path already calls clearPendingFinalDeliveryAfterSuccess after a non-failed send; the heartbeat path simply never had the equivalent.
  • Mirroring that into the post-success updateSessionStore block in runHeartbeatOnce is the minimal, symmetric fix -- and, crucially, it clears at send-success, which is the only place that works.
  • The existing replay-time self-heal can't cover this: classifyHeartbeatPendingFinalDelivery only clears when stripHeartbeatToken(text).shouldSkip is true, but a final is stored precisely because it is substantive (shouldSkip === false), so that gate can never fire for exactly the messages that get stuck.
  • Clearing has to happen when the send succeeds, not when the stale state is re-examined later — which is what this PR does.

#80357 already landed the ack-only side of this in 2026.5.26 — clearing on send-success here closes the remaining substantive-text gap in the same heartbeat path, so the two together cover both cases at the source rather than downstream.

(Our current workarounds are both stopgaps for this exact issue: manually nulling the pendingFinalDelivery* keys in sessions.json + restarting, and pinning a non-reasoning model for heartbeat that reliably emits a clean ack so the substantive-final state never gets created in the first place. Neither should be necessary once state can't go stale).

Checks are green and there are no conflicts — would love to see this prioritized for merge. 😊 🙏🏻 Happy to validate against our repro if that's useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: S status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Heartbeat-driven agent replies leave pendingFinalDelivery stuck, blocking subsequent heartbeats

4 participants