[Bug]: Heartbeat death loop — pendingFinalDelivery stuck on agent main session, blocks all future heartbeats for days #79258

@haumanto

Bug type

Behavior bug (silent state corruption, no crash, no error logs)

Beta release blocker

No (workaround exists)

Summary

A heartbeat run that returns any non-token text to a session whose origin.to
is the auto-reply pseudo-target "heartbeat" puts the agent's main session
into a permanent pendingFinalDelivery: true state. Every subsequent
heartbeat tick retries the dead delivery first. The retry fails silently:
pendingFinalDeliveryLastError stays null while
pendingFinalDeliveryAttemptCount climbs. Each retry also sets
updatedAt = now, which keeps the 30-second skip-window check
(runHeartbeatOnce, lines 866-870 of heartbeat-runner-DpQCcYf2.js)
perpetually true. Result: the heartbeat scheduler logs
heartbeat: started intervalMs: 3600000 cleanly at every gateway boot, but no
actual heartbeat run ever happens again. We observed 64 consecutive hours of
silence (2026-05-05 17:54 → 2026-05-08 13:53 WIB) before investigating.
cron list, doctor, and system heartbeat last all surface nothing about the
stuck state.

Reproduction (deterministic on v2026.5.7)

  1. Configure an agent with heartbeat: { every: "60m" } and target unset
    (defaults to "none"). Agent has no real Telegram/Discord delivery
    target for its heartbeat.
  2. Let the agent's heartbeat fire normally a few times. The session
    ~/.openclaw/agents/<agent>/sessions/sessions.json for agent:<id>:main
    accumulates origin: { label: "heartbeat", from: "heartbeat", to: "heartbeat" }.
  3. Force the heartbeat to return any text other than the bare token. In our
    case the agent appended a preamble: "All clear.\n\nHEARTBEAT_OK" instead
    of bare HEARTBEAT_OK. (See "Even bare HEARTBEAT_OK still triggers
    pending" below — non-token preamble accelerates it but isn't required.)
  4. The session enters pendingFinalDelivery: true with the preamble text.
  5. Wait one heartbeat interval. Inspect:
    python3 -c "import json; e=json.load(open('/home/openclaw/.openclaw/agents/<agent>/sessions/sessions.json'))['agent:<agent>:main']; print({k:e.get(k) for k in ['pendingFinalDelivery','pendingFinalDeliveryAttemptCount','pendingFinalDeliveryLastError','updatedAt']})"
    
  6. pendingFinalDeliveryAttemptCount climbs by 1 every hour, but
    pendingFinalDeliveryLastError stays null. updatedAt matches each
    retry timestamp.

In our reproduction we hit attemptCount: 64 before noticing.
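
For repeated checks, the one-liner in step 5 can be unpacked into a small script. This is a minimal sketch that assumes, as the one-liner does, that sessions.json is a JSON object keyed by session key. The helper names summarize_pending and looks_stuck are hypothetical and not part of OpenClaw.

```python
import json
from pathlib import Path

# Fields named in this report; everything else in the entry is ignored.
PENDING_FIELDS = (
    "pendingFinalDelivery",
    "pendingFinalDeliveryAttemptCount",
    "pendingFinalDeliveryLastError",
    "updatedAt",
)


def summarize_pending(sessions_path, session_key):
    """Return the pending-delivery fields for one session entry.

    Assumes sessions.json is a JSON object keyed by session key.
    Fields missing from the entry are reported as None.
    """
    sessions = json.loads(Path(sessions_path).read_text(encoding="utf-8"))
    entry = sessions.get(session_key)
    if entry is None:
        raise KeyError(f"no session entry named {session_key!r}")
    return {field: entry.get(field) for field in PENDING_FIELDS}


def looks_stuck(summary):
    """Heuristic from this report: pending, retried, and no error recorded."""
    return (
        summary["pendingFinalDelivery"] is True
        and (summary["pendingFinalDeliveryAttemptCount"] or 0) > 0
        and summary["pendingFinalDeliveryLastError"] is None
    )
```

Call summarize_pending with the path to the agent's sessions.json and the key agent:<agent>:main, then pass the result to looks_stuck.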

Even bare HEARTBEAT_OK still triggers pending

After tightening the agent's HEARTBEAT.md to forbid preamble, we forced a
run on a freshly created session entry with a manual
openclaw system event --mode now --text "test" --url ws://127.0.0.1:18789 --token $OPENCLAW_GATEWAY_TOKEN.
The agent returned the literal token HEARTBEAT_OK and still
ended with pendingFinalDelivery: true, pendingFinalDeliveryText: "HEARTBEAT_OK".
So the stripHeartbeatToken call in heartbeat-Dynyl6hI.js:52-87 either runs
after the pending-queue write or its empty-after-strip output isn't gating the
queueing. The runner should treat "output that strips to empty" as
effectively-empty and skip the final-delivery queue entirely.

The only thing that prevented the death loop from re-forming after the fresh
session was that the new session has origin: null and lastTo: null
(rather than origin.to: "heartbeat"), so there is nothing for dispatch to
retry against. Pending stays cosmetically true, but updatedAt is not
bumped, and the next 60m tick fires normally.

Two distinct issues compounding

Bug A — pending-delivery flag set even when output is the bare token

  • agent-runner.runtime-DQsCsHUA.js:4093-4095 sets pendingFinalDelivery: true
    and pendingFinalDeliveryText: pendingText whenever pendingText is
    non-empty by the runner's metric.
  • For heartbeat sessions, "output that strips to the empty string" should
    count as effectively-empty. Currently pendingText = "HEARTBEAT_OK"
    reaches that block.
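
The gate described in the bullets above could look roughly like the following. This is a Python model of the idea only, not the runner's actual JavaScript. strip_heartbeat_token is a simplified stand-in for stripHeartbeatToken, whose real behavior is not shown in this report, and pending_delivery_fields is a hypothetical name.

```python
HEARTBEAT_TOKEN = "HEARTBEAT_OK"


def strip_heartbeat_token(text):
    """Simplified stand-in: drop the token from the start or end of the text."""
    stripped = text.strip()
    if stripped.startswith(HEARTBEAT_TOKEN):
        stripped = stripped[len(HEARTBEAT_TOKEN):]
    if stripped.endswith(HEARTBEAT_TOKEN):
        stripped = stripped[:-len(HEARTBEAT_TOKEN)]
    return stripped.strip()


def pending_delivery_fields(output_text, is_heartbeat_run):
    """Return the session fields to write, or {} when nothing should be queued."""
    pending_text = output_text.strip()
    if not pending_text:
        return {}
    # Proposed gate: heartbeat output that strips to empty is never queued.
    if is_heartbeat_run and not strip_heartbeat_token(pending_text):
        return {}
    return {
        "pendingFinalDelivery": True,
        "pendingFinalDeliveryText": pending_text,
    }
```

In this model a bare HEARTBEAT_OK from a heartbeat run queues nothing. Output with a preamble is still queued unchanged, so the Bug B fixes remain necessary for that case.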

Bug B — silent retry against pseudo-target with no error captured

  • dispatch-8E8vi2HV.js:227-246 (clearPendingFinalDeliveryAfterSuccess)
    only clears the flag on success. There is no corresponding
    recordPendingFinalDeliveryFailure that captures the error string into
    pendingFinalDeliveryLastError — so failures look identical to "still
    trying" and never surface in logs.
  • When delivery.to === "heartbeat" (the auto-reply pseudo-channel set on
    the session origin) and no real channel adapter resolves, the dispatch
    path returns silently. Compare #78532, "deliverySucceeded=true returned
    when no adapter was invoked (early returns in deliverOutboundPayloads
    masquerade as success)" (closed 2026-05-07), which addressed a similar
    deliverySucceeded=true masquerade. This is the same family of
    telemetry-vs-state mismatch, on the failure side.
  • The 30s skip window in runHeartbeatOnce:
    if (recentSessionEntry?.pendingFinalDelivery === true
        && recentSessionEntry?.updatedAt
        && startedAt - recentSessionEntry.updatedAt < 3e4) return SKIP_REQUESTS_IN_FLIGHT;
    is correct in principle, but combined with retry logic that bumps
    updatedAt = now on each silent failure, it becomes a perpetual block.
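
The interaction between the silent retry and the skip window can be reproduced in a toy model. The sketch below is Python, not the gateway's code. It assumes timestamps are epoch milliseconds and that the retry runs a few milliseconds before the skip check on each tick, as the Summary describes. All function names are hypothetical.

```python
SKIP_WINDOW_MS = 30_000      # the 3e4 in the quoted check
INTERVAL_MS = 3_600_000      # intervalMs from the scheduler log line


def retry_pending_delivery(entry, now_ms):
    """Model of the silent retry: the count goes up, no error is stored."""
    entry["pendingFinalDeliveryAttemptCount"] += 1
    entry["updatedAt"] = now_ms  # this bump is what re-arms the skip window


def run_heartbeat_once(entry, started_at_ms):
    """Mirror of the quoted skip check; returns 'skipped' or 'ran'."""
    if (entry.get("pendingFinalDelivery") is True
            and entry.get("updatedAt")
            and started_at_ms - entry["updatedAt"] < SKIP_WINDOW_MS):
        return "skipped"
    return "ran"


def simulate(ticks, retry_bumps_updated_at=True):
    """Run `ticks` hourly heartbeat ticks against one stuck session entry."""
    entry = {
        "pendingFinalDelivery": True,
        "pendingFinalDeliveryAttemptCount": 0,
        "pendingFinalDeliveryLastError": None,
        "updatedAt": 1,  # stale timestamp from long before the first tick
    }
    outcomes = []
    for tick in range(1, ticks + 1):
        now_ms = tick * INTERVAL_MS
        if retry_bumps_updated_at:
            retry_pending_delivery(entry, now_ms)
        # Assumption: the skip check runs 5 ms after the retry.
        outcomes.append(run_heartbeat_once(entry, now_ms + 5))
    return outcomes, entry
```

With the bump enabled every tick is skipped while the attempt count climbs and the error stays None. With retry_bumps_updated_at=False, which corresponds to the fresh-session case where dispatch has nothing to retry against, every tick runs even though the pending flag is still set.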

Workaround (proven on v2026.5.7)

Stop gateway → drop the entire agent:<agent>:main entry from
sessions.json and remove its associated *.jsonl/*.trajectory.jsonl
files → restart gateway. The runner re-creates the session on the next
tick with origin: null, breaking the dispatch retry loop.

Clearing only the pendingFinalDelivery* fields is insufficient — we
verified within 8 minutes that the same heartbeat output re-creates the
stuck state, because origin.to: "heartbeat" is still on the session and
keeps re-dispatching.
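
The sessions.json part of the workaround can be scripted. This is a minimal sketch that assumes the file is a JSON object keyed by session key and that the gateway is stopped while it runs. drop_session_entry is a hypothetical helper. It deliberately leaves the *.jsonl and *.trajectory.jsonl files alone, because this report does not state how those files map to a session entry.

```python
import json
import shutil
from pathlib import Path


def drop_session_entry(sessions_path, session_key):
    """Remove one entry from sessions.json and return it (None if absent).

    A backup copy named sessions.json.bak is written next to the file
    before anything is changed. Run this only while the gateway is stopped.
    """
    path = Path(sessions_path)
    sessions = json.loads(path.read_text(encoding="utf-8"))
    if session_key not in sessions:
        return None
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    removed = sessions.pop(session_key)
    path.write_text(json.dumps(sessions, indent=2) + "\n", encoding="utf-8")
    return removed
```

The returned entry can be inspected to locate the transcript files that still need to be removed by hand before restarting the gateway.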

Suggested fixes

  1. (Bug A) In the agent-runner pending-delivery write, gate on
    isHeartbeatContentEffectivelyEmpty(stripHeartbeatToken(text).text). If
    stripped output is empty, skip the pending queue write entirely.
  2. (Bug B-1) Capture dispatch failures into
    pendingFinalDeliveryLastError so silent failures become visible.
  3. (Bug B-2) When delivery.to === "heartbeat" and no channel plugin
    resolves, treat the delivery as successful and call
    clearPendingFinalDeliveryAfterSuccess. Reaching the pseudo-target is
    itself the acknowledgement; the persistent retry is the bug.
  4. (Hardening) openclaw doctor should warn when any session has
    pendingFinalDelivery: true AND now - pendingFinalDeliveryCreatedAt > 1h
    AND pendingFinalDeliveryLastError === null. That triple is the
    signature of this bug, and nothing currently reports it.
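
The doctor check in item 4 amounts to a filter over the session entries. The sketch below models it in Python. It assumes pendingFinalDeliveryCreatedAt is an epoch-millisecond timestamp like updatedAt, and find_silently_stuck_sessions is a hypothetical name, not an existing doctor routine.

```python
ONE_HOUR_MS = 3_600_000


def find_silently_stuck_sessions(sessions, now_ms, max_age_ms=ONE_HOUR_MS):
    """Return sorted session keys that are pending, old, and error-free.

    `sessions` is the parsed sessions.json object (session key -> entry).
    Entries without a numeric pendingFinalDeliveryCreatedAt are ignored.
    """
    stuck = []
    for key, entry in sessions.items():
        created = entry.get("pendingFinalDeliveryCreatedAt")
        if (entry.get("pendingFinalDelivery") is True
                and isinstance(created, (int, float))
                and now_ms - created > max_age_ms
                and entry.get("pendingFinalDeliveryLastError") is None):
            stuck.append(key)
    return sorted(stuck)
```

A non-empty result would be the trigger for the warning. Once fix 2 lands, entries with a recorded error drop out of this list and can be reported separately with their error string.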

Environment

  • OpenClaw: 2026.5.7 (eeef486)
  • OS: Ubuntu 24.04 / Linux 6.8.0-110-generic x86_64
  • Node: v22.22.2
  • Install: npm global as system-level systemd service
  • Topology: 5-agent (orchestrator + reasoner + coder + fast + multimodal)
  • Affected agent: fast — 60m heartbeat, no target set, runs
    deepseek-v4-flash via opencode-go provider
  • Channel: Telegram-bound orchestrator; the affected agent has no direct
    user-facing channel.

Related issues (not duplicates)
