Skip to content

fix(heartbeat): skip when target session lane is busy#40526

Merged
vincentkoc merged 1 commit into
openclaw:mainfrom
lucky7323:fix/heartbeat-skip-busy-session-lane
Apr 5, 2026
Merged

fix(heartbeat): skip when target session lane is busy#40526
vincentkoc merged 1 commit into
openclaw:mainfrom
lucky7323:fix/heartbeat-skip-busy-session-lane

Conversation

@lucky7323

Copy link
Copy Markdown
Contributor

Summary

Exec completion/failure notifications enqueue a session-scoped system event and immediately trigger a scoped heartbeat wake. If the target session is already streaming and the queue mode is interrupt, that synthetic heartbeat turn can clear the session lane and abort the active run — causing the user to lose the in-progress response.

This change defers heartbeat execution until the target session lane is idle by returning requests-in-flight from the heartbeat runner when the resolved session lane is busy. The existing heartbeat wake retry logic (heartbeat-wake.ts) already re-schedules those wakes, so exec system events are preserved and delivered once the active turn completes.

Root Cause

  1. maybeNotifyOnExit() in bash-tools.exec-runtime.ts calls enqueueSystemEvent() + requestHeartbeatNow()
  2. runHeartbeatOnce() only checked CommandLane.Main — it did not check the resolved session lane
  3. So a heartbeat turn could enter get-reply-run.ts while the session was actively streaming
  4. drainFormattedSystemEvents() consumes the queued events before the heartbeat drop guard runs
  5. In interrupt mode, clearCommandLane() + abortEmbeddedPiRun() aborts the active turn

What Changed

  • src/infra/heartbeat-runner.ts: After preflight resolves the target sessionKey, check the session lane via resolveEmbeddedSessionLane(). If busy, return { status: "skipped", reason: "requests-in-flight" }.
  • New test: heartbeat-runner.skips-busy-session-lane.test.ts — verifies heartbeat is skipped when session lane is busy, and proceeds when idle.

Why This Fix

Fixing in the heartbeat runner prevents both failure modes:

  1. Active streaming turn interruption (interrupt mode)
  2. Silent loss of drained system events (non-interrupt modes where heartbeat is dropped after drain)

A local guard in get-reply-run.ts (e.g. skip interrupt for heartbeats) would stop the hard abort but not the premature drain.

Prior Art

This is a re-submission of the core fix from #14396, which was closed without merge when #14901 addressed a different scheduler-death bug. The session lane busy check from #14396 was not included in #14901.

Complements #31638 (exponential backoff for requests-in-flight retries).

Refs

@greptile-apps

greptile-apps Bot commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a race condition where a heartbeat turn triggered by maybeNotifyOnExit() could enter the agent run pipeline while the target session was still actively streaming, causing either an interrupted response (in interrupt mode) or silent loss of drained system events (other modes).

The fix adds a session lane queue-size check in runHeartbeatOnce — placed after resolveHeartbeatPreflight has resolved the concrete sessionKey — so the heartbeat is deferred rather than dropped when the session is busy. The existing heartbeat-wake.ts retry logic will re-schedule the wake automatically.

Key changes:

  • heartbeat-runner.ts: Calls resolveEmbeddedSessionLane(sessionKey) and returns { status: "skipped", reason: "requests-in-flight" } if that lane's queue is non-empty. Emits a heartbeat event before returning, consistent with the preflight skip path.
  • heartbeat-runner.skips-busy-session-lane.test.ts: Two new integration-style tests — one verifying the skip when the session lane is busy, one verifying normal execution when it is idle. The busy-lane test is comprehensive; the idle-lane test only asserts that getReplyFromConfig was called without checking the final result.status, which is a minor gap in coverage.

Confidence Score: 5/5

  • This PR is safe to merge; it adds a well-scoped, reversible guard with tests and no breaking changes to existing behavior.
  • The fix is a narrow, additive check that mirrors the already-tested CommandLane.Main guard pattern. It uses the injectable getQueueSize dep for testability, emits a telemetry event consistent with the surrounding code, and relies on the existing wake retry loop for re-delivery — no new infrastructure required. The only notable gap is the idle-lane test's missing result.status assertion, which is a test quality issue rather than a correctness risk.
  • No files require special attention.

Last reviewed commit: 9bf56a1

Comment on lines +142 to +151
const result = await runHeartbeatOnce({
cfg,
deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps,
});

// Should have proceeded past the lane checks and called getReplyFromConfig
expect(replySpy).toHaveBeenCalled();

replySpy.mockRestore();
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Idle lane" test missing result-status assertion

The second test case only verifies that getReplyFromConfig was called, but never asserts on the return value of runHeartbeatOnce. This means the test would still pass even if the function returned an unexpected status (e.g., "skipped") as long as the reply spy was triggered somewhere in the call chain.

Adding an explicit assertion on result would make the test's intent clearer and guard against future regressions:

Suggested change
const result = await runHeartbeatOnce({
cfg,
deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps,
});
// Should have proceeded past the lane checks and called getReplyFromConfig
expect(replySpy).toHaveBeenCalled();
replySpy.mockRestore();
});
const result = await runHeartbeatOnce({
cfg,
deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps,
});
// Should have proceeded past the lane checks and called getReplyFromConfig
expect(replySpy).toHaveBeenCalled();
// Confirm the run completed (not skipped due to a lane check)
expect(result.status).not.toBe("skipped");
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.skips-busy-session-lane.test.ts
Line: 142-151

Comment:
**"Idle lane" test missing result-status assertion**

The second test case only verifies that `getReplyFromConfig` was called, but never asserts on the return value of `runHeartbeatOnce`. This means the test would still pass even if the function returned an unexpected status (e.g., `"skipped"`) as long as the reply spy was triggered somewhere in the call chain.

Adding an explicit assertion on `result` would make the test's intent clearer and guard against future regressions:

```suggestion
    const result = await runHeartbeatOnce({
      cfg,
      deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps,
    });

    // Should have proceeded past the lane checks and called getReplyFromConfig
    expect(replySpy).toHaveBeenCalled();
    // Confirm the run completed (not skipped due to a lane check)
    expect(result.status).not.toBe("skipped");
```

How can I resolve this? If you propose a fix, please make it concise.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd3bbf1852

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if (sessionLaneSize > 0) {
emitHeartbeatEvent({
status: "skipped",
reason: "requests-in-flight",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Differentiate session-lane busy from global queue saturation

Returning reason: "requests-in-flight" for a busy session lane conflates two different conditions: global main-lane contention and per-session contention. The scheduler path in startHeartbeatRunner (the run loop in src/infra/heartbeat-runner.ts) treats requests-in-flight as a global backpressure signal and returns early, so in multi-agent setups one busy session can block heartbeat execution for other agents that are otherwise idle. This should use a distinct skip reason (or scoped handling) so only the contended session is retried.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdcdf53947

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if (sessionLaneSize > 0) {
emitHeartbeatEvent({
status: "skipped",
reason: "requests-in-flight",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Return a distinct skip reason for busy session lanes

This new branch reports reason: "requests-in-flight" even when only the resolved session lane is busy, but unscoped scheduling paths interpret that reason as global backpressure: startHeartbeatRunner exits the agent loop on that reason (src/infra/heartbeat-runner.ts:1208) and heartbeat-wake.ts retries it as “main lane busy” (src/infra/heartbeat-wake.ts:157). In multi-agent or multi-session runs, one busy session lane can therefore repeatedly prevent heartbeats for otherwise idle agents/sessions. Please use a separate skip reason (or separate handling) for session-lane contention so retries stay scoped to the contended target.

Useful? React with 👍 / 👎.

@openclaw-barnacle openclaw-barnacle Bot added cli CLI command changes size: S and removed size: M labels Mar 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8566f126a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

reason: "requests-in-flight",
durationMs: Date.now() - startedAt,
});
return { status: "skipped", reason: "requests-in-flight" };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Return distinct skip reason for busy session lanes

Using "requests-in-flight" for session-lane contention makes scoped backpressure look like global main-lane saturation, and that changes scheduler behavior: in the unscoped agent loop, startHeartbeatRunner returns early on this reason (src/infra/heartbeat-runner.ts:1208), while heartbeat-wake retries it as a global busy condition (src/infra/heartbeat-wake.ts:156), so one busy session can delay heartbeats for other agents/sessions that are idle. Fresh evidence in this commit is the new session-lane branch now returning this shared reason, which introduces the conflation directly.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f16b2c189

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +663 to +666
reason: "requests-in-flight",
durationMs: Date.now() - startedAt,
});
return { status: "skipped", reason: "requests-in-flight" };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use distinct skip reason for busy session lanes

Returning reason: "requests-in-flight" for session-lane contention conflates per-session backpressure with global main-lane saturation, and downstream logic treats that reason as global: startHeartbeatRunner exits the unscoped agent loop early on this reason (src/infra/heartbeat-runner.ts:1208) and heartbeat-wake.ts retries it as a main-lane busy condition (src/infra/heartbeat-wake.ts:156). In multi-agent runs, a single busy session can therefore delay heartbeats for unrelated idle agents/sessions. Fresh evidence in this commit is the new session-lane guard now returning this shared reason.

Useful? React with 👍 / 👎.

@vincentkoc vincentkoc force-pushed the fix/heartbeat-skip-busy-session-lane branch from 0f16b2c to 61a74f5 Compare April 5, 2026 08:24
@openclaw-barnacle openclaw-barnacle Bot removed the cli CLI command changes label Apr 5, 2026
@vincentkoc vincentkoc self-assigned this Apr 5, 2026
@vincentkoc vincentkoc force-pushed the fix/heartbeat-skip-busy-session-lane branch from 61a74f5 to c14c333 Compare April 5, 2026 08:25
@vincentkoc vincentkoc merged commit 5b9cdb8 into openclaw:main Apr 5, 2026
7 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c14c33315e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (sessionLaneSize > 0) {
emitHeartbeatEvent({
status: "skipped",
reason: "requests-in-flight",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use a distinct skip reason for busy session lanes

Returning "requests-in-flight" here makes per-session contention look identical to global main-lane saturation, but downstream scheduling treats that reason as global backpressure: the unscoped runner loop exits early on this reason (startHeartbeatRunner run loop) and heartbeat-wake retries it as a main-lane busy condition. In multi-agent or multi-session usage, one busy session lane can therefore delay heartbeats for unrelated idle sessions until that session drains.

Useful? React with 👍 / 👎.

lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants