
fix: drop heartbeat runs that arrive while another run is active #25610

Closed
mcaxtr wants to merge 2 commits into openclaw:main from
mcaxtr:fix/25606-heartbeat-followup

@mcaxtr (Contributor) commented Feb 24, 2026

Summary

  • Problem: Heartbeat runs that arrive while another agent run is active are enqueued as followups. When the queue later drains, they produce duplicate agent runs that send multiple response branches to the user.
  • Why it matters: Users receive 2-4 duplicate messages instead of 1, which is confusing and noisy.
  • What changed: Added an early-return guard in runReplyAgent that drops heartbeat runs when isActive is true, before they reach the enqueue path. Also added markRunComplete() calls to all early-return paths for consistency.
  • What did NOT change: Non-heartbeat runs still enqueue normally when another run is active; the followup mechanism is unchanged; heartbeat runs that arrive when no run is active execute normally.
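A minimal sketch of the guard ordering, assuming a simplified model in which each incoming run carries an isHeartbeat flag next to the existing isActive flag. RunContext, RunDecision and decideRun are hypothetical names for illustration, not the actual runReplyAgent code:

```ts
// Hypothetical sketch: RunContext, RunDecision and decideRun are illustrative
// names, not openclaw APIs. Only isActive and the heartbeat/followup concepts
// come from the PR description.
type RunDecision = "run" | "enqueue" | "drop";

interface RunContext {
  isHeartbeat: boolean; // run was triggered by the heartbeat timer
  isActive: boolean; // another agent run is already active for this session
}

function decideRun(ctx: RunContext): RunDecision {
  // The heartbeat check must come before the enqueue check: if the enqueue
  // check came first, heartbeat runs would be queued as followups and never
  // reach the drop branch.
  if (ctx.isHeartbeat && ctx.isActive) {
    return "drop"; // the next heartbeat interval re-checks the session
  }
  if (ctx.isActive) {
    return "enqueue"; // non-heartbeat runs still become followups
  }
  return "run";
}

// decideRun({ isHeartbeat: true, isActive: true })   → "drop"
// decideRun({ isHeartbeat: false, isActive: true })  → "enqueue"
// decideRun({ isHeartbeat: true, isActive: false })  → "run"
```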

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Users with heartbeat enabled will no longer receive duplicate messages when heartbeat intervals coincide with active agent runs. Heartbeat runs that arrive while another run is active are now silently dropped, and the next heartbeat interval independently re-checks the session.
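To see why dropping a heartbeat (rather than queuing it) does not lose coverage, here is a simplified, hypothetical timeline model. ActiveSpan, HeartbeatPolicy and heartbeatRunTimes are illustrative names, and the model only tracks when a heartbeat-triggered run executes, not what is delivered to the user:

```ts
// Simplified, hypothetical timeline model (not openclaw code): heartbeat ticks
// fire at fixed times, and an ActiveSpan marks when another agent run is active.
interface ActiveSpan {
  start: number; // inclusive
  end: number; // exclusive: the active run has finished at this time
}

type HeartbeatPolicy = "drop" | "enqueue";

function activeSpanAt(t: number, spans: ActiveSpan[]): ActiveSpan | undefined {
  return spans.find((s) => t >= s.start && t < s.end);
}

// Returns the times at which a heartbeat-triggered agent run actually executes.
function heartbeatRunTimes(ticks: number[], spans: ActiveSpan[], policy: HeartbeatPolicy): number[] {
  const runs: number[] = [];
  for (const t of ticks) {
    const span = activeSpanAt(t, spans);
    if (span === undefined) {
      runs.push(t); // no active run: the heartbeat executes immediately
    } else if (policy === "enqueue") {
      runs.push(span.end); // old behavior: a stale followup runs once the queue drains
    }
    // "drop": the tick is discarded; the next tick re-checks on its own.
  }
  return runs;
}

const ticks = [0, 30, 60];
const spans = [{ start: 25, end: 40 }];
// heartbeatRunTimes(ticks, spans, "enqueue") → [0, 40, 60]
// heartbeatRunTimes(ticks, spans, "drop")    → [0, 60]
```

In this model, the enqueue policy turns the tick at t=30 into a stale run at t=40, right after the active run ends; the drop policy skips it, and the tick at t=60 re-checks the session.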

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux / macOS / all platforms
  • Runtime/container: Node 22+
  • Model/provider: Any (tested with gpt-4o-mini, ministral-3b)
  • Integration/channel: All channels affected
  • Relevant config: Heartbeat enabled with any interval

Steps

  1. Configure heartbeat with default settings
  2. Trigger a heartbeat check while another agent run is active for the same session
  3. Wait for the active run to complete and the followup queue to drain
  4. Observe messages delivered to user

Expected

Single message delivered (either HEARTBEAT_OK or brief status update)

Actual

Before the fix: 2-4 duplicate messages, because a stale heartbeat followup creates additional agent-run branches
After the fix: a single message (the heartbeat is dropped while isActive is true, and the next interval re-checks)

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

New test suite with 3 test cases:

  1. Heartbeat + active → returns undefined, does not enqueue, does not run
  2. Non-heartbeat + active → returns undefined, enqueues normally
  3. Heartbeat + inactive → executes normally, returns payload

All tests pass. Existing agent-runner and heartbeat test suites pass.
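The three scenarios reduce to a small table of observable outcomes: return value, enqueued followups, and executed runs. This is a hypothetical harness sketch; runReplyAgentSketch, Harness, RunOptions and the payload text are stand-ins, not the real runReplyAgent signature or the actual test file:

```ts
// Hypothetical harness: these types and functions are illustrative stand-ins,
// not the real runReplyAgent signature or test suite.
interface ReplyPayload {
  text: string;
}

interface Harness {
  queue: string[]; // prompts enqueued as followups
  runs: number; // number of times the agent actually executed
}

interface RunOptions {
  isHeartbeat: boolean;
  isActive: boolean;
  prompt: string;
}

function runReplyAgentSketch(h: Harness, opts: RunOptions): ReplyPayload | undefined {
  if (opts.isHeartbeat && opts.isActive) {
    return undefined; // dropped: neither enqueued nor executed
  }
  if (opts.isActive) {
    h.queue.push(opts.prompt); // normal followup path
    return undefined;
  }
  h.runs += 1;
  return { text: opts.isHeartbeat ? "HEARTBEAT_OK" : `reply: ${opts.prompt}` };
}

// Each scenario starts from a fresh harness and reports what happened.
function scenario(isHeartbeat: boolean, isActive: boolean) {
  const h: Harness = { queue: [], runs: 0 };
  const result = runReplyAgentSketch(h, { isHeartbeat, isActive, prompt: "check" });
  return { result, enqueued: h.queue.length, runs: h.runs };
}

const heartbeatActive = scenario(true, true);
// → { result: undefined, enqueued: 0, runs: 0 }
const normalActive = scenario(false, true);
// → { result: undefined, enqueued: 1, runs: 0 }
const heartbeatIdle = scenario(true, false);
// → { result: { text: "HEARTBEAT_OK" }, enqueued: 0, runs: 1 }
```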

Human Verification (required)

Verified scenarios:

  • New test suite passes (3 tests covering heartbeat drop, non-heartbeat enqueue, heartbeat execute)
  • Existing agent-runner tests pass (13 test files)
  • Existing heartbeat tests pass (43 tests)
  • pnpm build && pnpm check passes (oxfmt, oxlint, tsgo all clean)

Edge cases checked:

  • Non-heartbeat runs still enqueue normally when isActive
  • Heartbeat runs execute normally when not active
  • Early-return paths all call markRunComplete() before cleanup()

What I did not verify:

  • End-to-end testing with real heartbeat intervals and live channels (requires production setup with timing coordination)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert the commit (or cherry-pick its inverse)
  • Files/config to restore: None (no config changes)
  • Known bad symptoms reviewers should watch for: Heartbeat messages not being delivered at all (would indicate guard is triggering incorrectly)

Risks and Mitigations

  • Risk: Heartbeats might be dropped too aggressively if the isActive logic has false positives

    • Mitigation: The next heartbeat interval re-checks independently, and the isActive flag is managed by the existing, well-tested typing controller
  • Risk: Subtle differences in markRunComplete() timing across early-return paths

    • Mitigation: Added markRunComplete() to all early-return paths for consistency, following the pattern used in the finally block

Greptile Summary

This PR fixes a bug where heartbeat runs arriving while another agent run is active would get enqueued as followups, causing duplicate agent runs and multiple response branches delivered to users (2-4 duplicate messages instead of 1).

  • Adds an early-return guard in runReplyAgent that silently drops heartbeat runs when isActive: true, before they reach the enqueue path. The next heartbeat interval independently re-checks the session, so no heartbeat cycles are permanently lost.
  • Adds markRunComplete() calls to all pre-try early-return paths (steered, heartbeat-dropped, and enqueue paths) for consistency. These paths return before the try/finally block, so without explicit markRunComplete() calls the typing controller would skip that step.
  • Non-heartbeat runs continue to enqueue normally when another run is active; the followup mechanism is unchanged.
  • New test suite covers all three key scenarios: heartbeat + active (dropped), non-heartbeat + active (enqueued), heartbeat + inactive (executes normally).
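The pre-try paths need their own calls because of standard JavaScript semantics: a finally block only runs once execution has entered its try block, so a return placed before the try skips it. A minimal sketch, assuming a recording stand-in for the typing controller; only the method names markRunComplete() and cleanup() come from this PR, and TypingControllerSketch, runSketch, SketchOptions and the relative order of the steered and heartbeat checks are assumptions:

```ts
// Hypothetical sketch: only markRunComplete() and cleanup() are named in the PR;
// everything else here is illustrative.
class TypingControllerSketch {
  readonly calls: string[] = [];
  markRunComplete(): void {
    this.calls.push("markRunComplete");
  }
  cleanup(): void {
    this.calls.push("cleanup");
  }
}

interface SketchOptions {
  steered: boolean;
  isHeartbeat: boolean;
  isActive: boolean;
}

function runSketch(typing: TypingControllerSketch, opts: SketchOptions): string {
  // Pre-try early returns: the finally block below never runs for these,
  // so each path calls markRunComplete() before cleanup() itself.
  if (opts.steered) {
    typing.markRunComplete();
    typing.cleanup();
    return "steered";
  }
  if (opts.isHeartbeat && opts.isActive) {
    typing.markRunComplete();
    typing.cleanup();
    return "heartbeat-dropped";
  }
  if (opts.isActive) {
    typing.markRunComplete();
    typing.cleanup();
    return "enqueued";
  }
  try {
    return "ran"; // the real agent run would happen here
  } finally {
    // Reached only because execution entered the try block above.
    typing.markRunComplete();
    typing.cleanup();
  }
}

const demo = new TypingControllerSketch();
runSketch(demo, { steered: false, isHeartbeat: true, isActive: true }); // → "heartbeat-dropped"
// demo.calls → ["markRunComplete", "cleanup"]
```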

Confidence Score: 5/5

  • This PR is safe to merge: the change is narrowly scoped, correctly guarded, and well-tested.
  • The fix is minimal and surgical: a single conditional guard that only affects heartbeat runs when isActive is true. The logic is straightforward (two boolean checks), placement is correct (before the enqueue path), and the consistency improvements to markRunComplete() calls are harmless since they target early-return paths outside the try/finally block. No existing behavior is altered for non-heartbeat runs. The test suite covers the critical scenarios.
  • No files require special attention.

Last reviewed commit: d77683e

@mcaxtr force-pushed the fix/25606-heartbeat-followup branch from d77683e to 3a953f5 on February 24, 2026 16:26
@openclaw-barnacle bot added the app: web-ui label Feb 24, 2026
steipete added a commit that referenced this pull request Feb 25, 2026
Co-authored-by: Marcus Castro <mcaxtr@gmail.com>
@steipete (Contributor)

Implemented and landed on main as c736778b3.

What I changed (reimplemented after code review):

  • Added an early guard in runReplyAgent to drop heartbeat runs when isActive is already true, before the followup enqueue path.
  • Kept non-heartbeat active-run behavior unchanged (still enqueues followups).
  • Added regression coverage in src/auto-reply/reply/agent-runner.runreplyagent.test.ts for:
    • heartbeat + active => dropped (no enqueue)
    • non-heartbeat + active => enqueued

Validation:

  • pnpm lint
  • pnpm build
  • pnpm test

Thanks for the report and original PR, @mcaxtr.

@steipete closed this Feb 25, 2026
joshavant pushed a commit that referenced this pull request Feb 25, 2026
Co-authored-by: Marcus Castro <mcaxtr@gmail.com>
margulans pushed a commit to margulans/Neiron-AI-assistant that referenced this pull request Feb 25, 2026
Jackson3195 pushed a commit to Jackson3195/openclaw-with-a-personal-touch that referenced this pull request Feb 25, 2026
brianleach pushed a commit to brianleach/openclaw that referenced this pull request Feb 26, 2026
execute008 pushed a commit to execute008/openclaw that referenced this pull request Feb 27, 2026
r4jiv007 pushed a commit to r4jiv007/openclaw that referenced this pull request Feb 28, 2026
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Mar 1, 2026
@mcaxtr)

Co-authored-by: Marcus Castro <mcaxtr@gmail.com>
(cherry picked from commit c736778)

# Conflicts:
#	src/auto-reply/reply/agent-runner.runreplyagent.test.ts
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Mar 3, 2026
@mcaxtr)

Co-authored-by: Marcus Castro <mcaxtr@gmail.com>
(cherry picked from commit c736778)

# Conflicts:
#	src/auto-reply/reply/agent-runner.runreplyagent.test.ts
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
thebenjaminlee pushed a commit to escape-velocity-ventures/openclaw that referenced this pull request Mar 7, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Heartbeat runs create duplicate response branches when enqueued as followups
