Skip to content

Heartbeat scheduler silently stops dispatching polls after session compaction/recreation #87438

@ovrsr

Description

@ovrsr

Summary

The internal heartbeat scheduler silently stopped dispatching polls to agent:main:main for ~2h52m despite an active, responsive session and a stable gateway process. Heartbeats resumed only after a config.patch write to agents.defaults.heartbeat.every (same value), suggesting the scheduler was in a stopped state and the config file write re-initialized it.

Environment

  • OpenClaw: gateway running as systemd-style process (PID stable, no restart during incident)
  • Session: agent:main:main (direct/webchat)
  • Heartbeat config: agents.defaults.heartbeat.every: "10m" — always set, unchanged
  • Model: openrouter/openrouter/primary (402 billing exhausted), failover to openrouter/openrouter/owl-alpha

Timeline (UTC)

Time Event
14:00–18:08 Heartbeats every 10m, normal
18:03–18:07 Session compacts 3x due to context pressure
18:08 Last heartbeat on old session; compaction recovery in progress
18:19 New session created
18:24–18:44 Heartbeats resume normally in new session
18:44–21:36 Silent gap — no heartbeat polls. Session active and processing direct messages.
21:36 config.patch on agents.defaults.heartbeat.every (no-op value change)
21:44 Heartbeats resume and continue normally

Key Facts

  1. Gateway process never restarted (PID stable throughout).
  2. Session was fully functional during the gap — processing direct user messages, running tool calls, and producing responses. Only heartbeat polls were affected.
  3. The gap began 30 minutes into a new session (after 3 normal heartbeats on that session), not immediately at session creation.
  4. A config file write (even a no-op merge) re-triggered the scheduler.
  5. A manually created backup cron (10m interval, system event to main session) also did not fire during the gap, suggesting the issue extends beyond the per-session timer.

Possible Causes

  • Heartbeat timer bound to session object that got GC'd or orphaned during compaction, with the new session inheriting a stale handle.
  • Compaction风暴 (4 compacts in ~5 min) may have triggered an edge case in the scheduler's session-registration logic.
  • Config hot-reload on write re-initializes the global scheduler, which would explain the self-healing behavior.

Expected Behavior

Heartbeat scheduling should survive session recreation and compaction without requiring a config write.

Severity

Medium. Heartbeats are the primary in-band channel for an autonomous agent to self-report, check calendars, scan email, and surface issues to their human. A silent multi-hour gap with no logged error is a reliability concern.

Logs

No explicit error messages were logged during the gap. The session transcript simply shows no [OpenClaw heartbeat poll] messages arriving between 18:44 and 21:36. The next poll after the gap correlates with the config.patch timestamp.

Metadata

Metadata

Assignees

Labels

P2Normal backlog priority with limited blast radius.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.impact:message-lossChannel message delivery can be lost, duplicated, or misrouted.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions