fix(heartbeat): skip when target session lane is busy#40526
Conversation
Greptile SummaryThis PR fixes a race condition where a heartbeat turn triggered by The fix adds a session lane queue-size check in Key changes:
Confidence Score: 5/5
Last reviewed commit: 9bf56a1 |
| const result = await runHeartbeatOnce({ | ||
| cfg, | ||
| deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps, | ||
| }); | ||
|
|
||
| // Should have proceeded past the lane checks and called getReplyFromConfig | ||
| expect(replySpy).toHaveBeenCalled(); | ||
|
|
||
| replySpy.mockRestore(); | ||
| }); |
There was a problem hiding this comment.
"Idle lane" test missing result-status assertion
The second test case only verifies that getReplyFromConfig was called, but never asserts on the return value of runHeartbeatOnce. This means the test would still pass even if the function returned an unexpected status (e.g., "skipped") as long as the reply spy was triggered somewhere in the call chain.
Adding an explicit assertion on result would make the test's intent clearer and guard against future regressions:
| const result = await runHeartbeatOnce({ | |
| cfg, | |
| deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps, | |
| }); | |
| // Should have proceeded past the lane checks and called getReplyFromConfig | |
| expect(replySpy).toHaveBeenCalled(); | |
| replySpy.mockRestore(); | |
| }); | |
| const result = await runHeartbeatOnce({ | |
| cfg, | |
| deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps, | |
| }); | |
| // Should have proceeded past the lane checks and called getReplyFromConfig | |
| expect(replySpy).toHaveBeenCalled(); | |
| // Confirm the run completed (not skipped due to a lane check) | |
| expect(result.status).not.toBe("skipped"); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/heartbeat-runner.skips-busy-session-lane.test.ts
Line: 142-151
Comment:
**"Idle lane" test missing result-status assertion**
The second test case only verifies that `getReplyFromConfig` was called, but never asserts on the return value of `runHeartbeatOnce`. This means the test would still pass even if the function returned an unexpected status (e.g., `"skipped"`) as long as the reply spy was triggered somewhere in the call chain.
Adding an explicit assertion on `result` would make the test's intent clearer and guard against future regressions:
```suggestion
const result = await runHeartbeatOnce({
cfg,
deps: { getQueueSize, nowMs: () => Date.now() } as HeartbeatDeps,
});
// Should have proceeded past the lane checks and called getReplyFromConfig
expect(replySpy).toHaveBeenCalled();
// Confirm the run completed (not skipped due to a lane check)
expect(result.status).not.toBe("skipped");
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd3bbf1852
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (sessionLaneSize > 0) { | ||
| emitHeartbeatEvent({ | ||
| status: "skipped", | ||
| reason: "requests-in-flight", |
There was a problem hiding this comment.
Differentiate session-lane busy from global queue saturation
Returning reason: "requests-in-flight" for a busy session lane conflates two different conditions: global main-lane contention and per-session contention. The scheduler path in startHeartbeatRunner (the run loop in src/infra/heartbeat-runner.ts) treats requests-in-flight as a global backpressure signal and returns early, so in multi-agent setups one busy session can block heartbeat execution for other agents that are otherwise idle. This should use a distinct skip reason (or scoped handling) so only the contended session is retried.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdcdf53947
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (sessionLaneSize > 0) { | ||
| emitHeartbeatEvent({ | ||
| status: "skipped", | ||
| reason: "requests-in-flight", |
There was a problem hiding this comment.
Return a distinct skip reason for busy session lanes
This new branch reports reason: "requests-in-flight" even when only the resolved session lane is busy, but unscoped scheduling paths interpret that reason as global backpressure: startHeartbeatRunner exits the agent loop on that reason (src/infra/heartbeat-runner.ts:1208) and heartbeat-wake.ts retries it as “main lane busy” (src/infra/heartbeat-wake.ts:157). In multi-agent or multi-session runs, one busy session lane can therefore repeatedly prevent heartbeats for otherwise idle agents/sessions. Please use a separate skip reason (or separate handling) for session-lane contention so retries stay scoped to the contended target.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8566f126a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| reason: "requests-in-flight", | ||
| durationMs: Date.now() - startedAt, | ||
| }); | ||
| return { status: "skipped", reason: "requests-in-flight" }; |
There was a problem hiding this comment.
Return distinct skip reason for busy session lanes
Using "requests-in-flight" for session-lane contention makes scoped backpressure look like global main-lane saturation, and that changes scheduler behavior: in the unscoped agent loop, startHeartbeatRunner returns early on this reason (src/infra/heartbeat-runner.ts:1208), while heartbeat-wake retries it as a global busy condition (src/infra/heartbeat-wake.ts:156), so one busy session can delay heartbeats for other agents/sessions that are idle. Fresh evidence in this commit is the new session-lane branch now returning this shared reason, which introduces the conflation directly.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f16b2c189
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| reason: "requests-in-flight", | ||
| durationMs: Date.now() - startedAt, | ||
| }); | ||
| return { status: "skipped", reason: "requests-in-flight" }; |
There was a problem hiding this comment.
Use distinct skip reason for busy session lanes
Returning reason: "requests-in-flight" for session-lane contention conflates per-session backpressure with global main-lane saturation, and downstream logic treats that reason as global: startHeartbeatRunner exits the unscoped agent loop early on this reason (src/infra/heartbeat-runner.ts:1208) and heartbeat-wake.ts retries it as a main-lane busy condition (src/infra/heartbeat-wake.ts:156). In multi-agent runs, a single busy session can therefore delay heartbeats for unrelated idle agents/sessions. Fresh evidence in this commit is the new session-lane guard now returning this shared reason.
Useful? React with 👍 / 👎.
0f16b2c to
61a74f5
Compare
61a74f5 to
c14c333
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c14c33315e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (sessionLaneSize > 0) { | ||
| emitHeartbeatEvent({ | ||
| status: "skipped", | ||
| reason: "requests-in-flight", |
There was a problem hiding this comment.
Use a distinct skip reason for busy session lanes
Returning "requests-in-flight" here makes per-session contention look identical to global main-lane saturation, but downstream scheduling treats that reason as global backpressure: the unscoped runner loop exits early on this reason (startHeartbeatRunner run loop) and heartbeat-wake retries it as a main-lane busy condition. In multi-agent or multi-session usage, one busy session lane can therefore delay heartbeats for unrelated idle sessions until that session drains.
Useful? React with 👍 / 👎.
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Summary
Exec completion/failure notifications enqueue a session-scoped system event and immediately trigger a scoped heartbeat wake. If the target session is already streaming and the queue mode is
interrupt, that synthetic heartbeat turn can clear the session lane and abort the active run — causing the user to lose the in-progress response.This change defers heartbeat execution until the target session lane is idle by returning
requests-in-flightfrom the heartbeat runner when the resolved session lane is busy. The existing heartbeat wake retry logic (heartbeat-wake.ts) already re-schedules those wakes, so exec system events are preserved and delivered once the active turn completes.Root Cause
maybeNotifyOnExit()inbash-tools.exec-runtime.tscallsenqueueSystemEvent()+requestHeartbeatNow()runHeartbeatOnce()only checkedCommandLane.Main— it did not check the resolved session laneget-reply-run.tswhile the session was actively streamingdrainFormattedSystemEvents()consumes the queued events before the heartbeat drop guard runsinterruptmode,clearCommandLane()+abortEmbeddedPiRun()aborts the active turnWhat Changed
src/infra/heartbeat-runner.ts: After preflight resolves the targetsessionKey, check the session lane viaresolveEmbeddedSessionLane(). If busy, return{ status: "skipped", reason: "requests-in-flight" }.heartbeat-runner.skips-busy-session-lane.test.ts— verifies heartbeat is skipped when session lane is busy, and proceeds when idle.Why This Fix
Fixing in the heartbeat runner prevents both failure modes:
interruptmode)A local guard in
get-reply-run.ts(e.g. skip interrupt for heartbeats) would stop the hard abort but not the premature drain.Prior Art
This is a re-submission of the core fix from #14396, which was closed without merge when #14901 addressed a different scheduler-death bug. The session lane busy check from #14396 was not included in #14901.
Complements #31638 (exponential backoff for
requests-in-flightretries).Refs