fix(agent): prevent session lock deadlock on timeout during compaction#9855
Conversation
Additional Comments (1)
Also appears in Prompt To Fix With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run.overflow-compaction.test.ts
Line: 157:174
Comment:
**Test factory missing field**
`EmbeddedRunAttemptResult` gained a required `timedOutDuringCompaction` field (`src/agents/pi-embedded-runner/run/types.ts:93-112`), but `makeAttemptResult()` doesn’t set it. This will break TS typechecking (and any test compilation) once this PR lands.
Also appears in `src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.test.ts:47-62` (`makeAttempt`).
How can I resolve this? If you propose a fix, please make it concise. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70f24ff64b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
7d3230b to
041a8da
Compare
00a91e7 to
7b6fdb8
Compare
e01452b to
d08b39f
Compare
a0d07fe to
e8037c7
Compare
e8037c7 to
3af1f85
Compare
bb88c2f to
5f60f68
Compare
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.ts
Line: 590:603
Comment:
**Unhandled compaction wait rejection**
`waitForCompactionRetry()` creates a microtask promise and, if compaction starts, it does `state.compactionRetryPromise.then(resolve)` without a rejection handler (`src/agents/pi-embedded-subscribe.ts:590-603`). Since `unsubscribe()` now rejects `compactionRetryPromise` with an AbortError (`:544-557`), this path will leave the outer promise pending forever on unsubscribe/abort, recreating the hang it’s meant to prevent. Handle rejection too (e.g. `then(resolve, resolve)` or propagate AbortError via a `reject` in the outer promise).
How can I resolve this? If you propose a fix, please make it concise. |
5f60f68 to
4838d19
Compare
When a run times out during compaction, waitForCompactionRetry() was not respecting the abort signal, causing execution to hang before reaching the finally block that releases the session write lock. This left sessions permanently unresponsive. Additionally, the initial fix had issues with stale snapshots and mixed compaction state sources that could cause incorrect snapshot selection or never-settling promises after unsubscribe. Changes: - Wrap waitForCompactionRetry() in abortable() to respect abort signal - Add unsubscribed flag to prevent creating un-resolvable promises after cleanup - Use activeSession.isCompacting consistently (not subscription state) for snapshot decisions to maintain single source of truth - Gate pre-compaction snapshot usage on whether compaction was already running when snapshot was captured, preventing use of incomplete mid-stream snapshots - Force-reject pending compaction promises in unsubscribe with AbortError Fixes the issue where sessions become stuck after hitting the 10-minute default timeout, requiring gateway restart to recover.
…potency - Make unsubscribe() idempotent with early return guard on state.unsubscribed - Gate rotation shouldRotate on !aborted for timeout branch (avoid penalizing profile when run was externally aborted) - Narrow timedOutDuringCompaction to actual compaction in-flight only, excluding compaction retry prompts where timeouts are genuine provider issues - Expose isCompactionInFlight() on subscription for precise compaction state - Remove dead awaitingCompaction variable
abortRun(true) sets both timedOut and aborted, making the timedOut && !aborted gate unreachable. Remove !aborted from the timeout branch since timedOut already implies the abort was self-inflicted by the timeout handler, not an external abort.
Move state.unsubscribed = true to the beginning of unsubscribe(), immediately after the early-return check. This prevents waitForCompactionRetry() from creating new promises during the teardown window, eliminating the race where a promise created mid-unsubscribe would never resolve or reject, causing deadlock.
Use subscription.isCompacting() instead of isCompactionInFlight() when classifying timeouts during compaction. isCompacting() checks both compactionInFlight and pendingCompactionRetry > 0, correctly capturing the window after auto_compaction_end schedules a retry but before it starts. Without this, timeouts in the pending-retry window were misclassified as provider timeouts, incorrectly triggering auth profile rotation for what are actually infrastructure timeouts. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previously, unsubscribe() failures were logged at warn level, making resource leaks (event handlers, timers, maps) easy to miss. Now logs at error level with CRITICAL prefix to ensure visibility. Cannot rethrow in finally block as it would mask exceptions from try block.
2b88a4b to
64a2890
Compare
|
Merged via squash. Thanks @mverrilli! |
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
When a heartbeat run's embedded agent hits a compaction timeout, the getReplyFromConfig promise can hang forever if the compaction retry promise never settles (the root cause was fixed in openclaw#9855 by wrapping waitForCompactionRetry with abortable(), but that fix hasn't shipped yet in a release). Even after openclaw#9855 ships, similar hung-promise bugs in getReplyFromConfig could still permanently block the heartbeat scheduler, since: - The `running` flag in heartbeat-wake stays true forever - The lane task wrapping the heartbeat never completes - All future heartbeats AND incoming messages on that lane are blocked This caused a 22+ hour outage on Feb 14 2026 where the gateway stayed alive as a zombie — no heartbeats, no message processing. Add two defense-in-depth safety timeouts: 1. heartbeat-runner.ts: Promise.race with 12-minute timeout around getReplyFromConfig(). If the promise hangs, it rejects and the existing catch block handles it as a normal failure. 2. heartbeat-wake.ts: 13-minute safety timer on the handler callback. If the handler promise never settles, forcibly reset the `running` flag and reschedule. This ensures the scheduler always recovers regardless of what goes wrong inside the handler. Includes test for the safety timeout in heartbeat-wake.test.ts. Relates to openclaw#9855
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras
Summary
waitForCompactionRetry()was not respecting the abort signal, causing execution to hang before reaching thefinallyblock that releases the session write lock, leaving sessions permanently unresponsivetimedOutDuringCompactionflag to skip auth profile rotation on compaction timeouts (model succeeded; not a profile issue)Root cause
When a run hits the timeout (default 10 minutes) while
waitForCompactionRetry()is blocked, the abort signal fires but the compaction wait has no way to be cancelled. Thefinallyblock that callsunsubscribe()and releases the session write lock never runs, leaving the session permanently stuck until gateway restart.Changes
Subscription cleanup (
pi-embedded-subscribe.ts):unsubscribe()now actively rejects pending compaction promises withAbortError, aborts in-flight compaction, and sets anunsubscribedflag to prevent new promises from being createdwaitForCompactionRetry()rejects withAbortErrorafter unsubscribe (cancellation, not false success) and propagates rejections through all code paths including the microtask race-checkAttempt lifecycle (
attempt.ts):waitForCompactionRetry()inabortable()so the abort signal can cancel the waitawaitingCompactionflag +subscription.isCompacting()for robust timeout classification that doesn't depend on a single instantaneous readactiveSession.isCompacting(session state) for snapshot safety decisions,subscription.isCompacting()(broader signal including pending retries) for timeout classificationsessionIdUsedwith whichever snapshot source is actually used to prevent downstream correlation issuesunsubscribe()with try/catch in thefinallyblock so remaining cleanup (clearActiveEmbeddedRun, abort listener removal) always runsProfile rotation (
run.ts):Greptile Overview
Greptile Summary
This PR hardens embedded session teardown to prevent deadlocks when a run times out during post-prompt compaction.
Key changes:
subscribeEmbeddedPiSession()now tracks anunsubscribedstate and rejects any pending compaction-wait promise with anAbortErroron teardown, andwaitForCompactionRetry()rejects after unsubscribe instead of returning a false “success”.runEmbeddedAttempt()wrapswaitForCompactionRetry()in the localabortable()helper so timeouts/aborts can cancel the wait and reliably reach thefinallythat releases the session write lock.runEmbeddedAttempt()persists a safe message snapshot (pre-compaction if compaction wasn’t running during snapshot capture) and reportstimedOutDuringCompactionupward.runEmbeddedPiAgent()usestimedOutDuringCompactionto skip auth profile rotation for compaction-induced timeouts (model succeeded; compaction/infra was the bottleneck).Overall this aligns timeout behavior with cancellation semantics, ensures teardown is idempotent, and reduces the chance of sessions becoming permanently stuck due to unreleased locks.
Confidence Score: 5/5
Last reviewed commit: c10290d