Skip to content

fix(agent): prevent session lock deadlock on timeout during compaction#9855

Merged
gumadeiras merged 11 commits intoopenclaw:mainfrom
mverrilli:mverrilli/compaction_timeout_deadlock_fix
Feb 14, 2026
Merged

fix(agent): prevent session lock deadlock on timeout during compaction#9855
gumadeiras merged 11 commits intoopenclaw:mainfrom
mverrilli:mverrilli/compaction_timeout_deadlock_fix

Conversation

@mverrilli
Copy link
Contributor

@mverrilli mverrilli commented Feb 5, 2026

Summary

  • Fix session lock deadlock when a run times out during post-prompt compaction
  • waitForCompactionRetry() was not respecting the abort signal, causing execution to hang before reaching the finally block that releases the session write lock, leaving sessions permanently unresponsive
  • Add timedOutDuringCompaction flag to skip auth profile rotation on compaction timeouts (model succeeded; not a profile issue)
  • Capture pre-compaction message snapshot to persist a complete transcript when compaction times out mid-restructure

Root cause

When a run hits the timeout (default 10 minutes) while waitForCompactionRetry() is blocked, the abort signal fires but the compaction wait has no way to be cancelled. The finally block that calls unsubscribe() and releases the session write lock never runs, leaving the session permanently stuck until gateway restart.

Changes

Subscription cleanup (pi-embedded-subscribe.ts):

  • unsubscribe() now actively rejects pending compaction promises with AbortError, aborts in-flight compaction, and sets an unsubscribed flag to prevent new promises from being created
  • waitForCompactionRetry() rejects with AbortError after unsubscribe (cancellation, not false success) and propagates rejections through all code paths including the microtask race-check
  • Orphaned compaction promises (created by late event handlers with no consumer) are logged at debug level to prevent silent unhandled rejections

Attempt lifecycle (attempt.ts):

  • Wrap waitForCompactionRetry() in abortable() so the abort signal can cancel the wait
  • Capture a pre-compaction message snapshot with before/after compaction state checks to avoid copying a mid-compaction array
  • Use awaitingCompaction flag + subscription.isCompacting() for robust timeout classification that doesn't depend on a single instantaneous read
  • Use activeSession.isCompacting (session state) for snapshot safety decisions, subscription.isCompacting() (broader signal including pending retries) for timeout classification
  • Pair sessionIdUsed with whichever snapshot source is actually used to prevent downstream correlation issues
  • Guard unsubscribe() with try/catch in the finally block so remaining cleanup (clearActiveEmbeddedRun, abort listener removal) always runs

Profile rotation (run.ts):

  • Skip auth profile rotation when timeout occurred during compaction — the model succeeded, compaction was the bottleneck


  • This was AI assisted. I understand what the code does.
  • Tested for 48 hours, regular hangs experienced have disappeared.

Greptile Overview

Greptile Summary

This PR hardens embedded session teardown to prevent deadlocks when a run times out during post-prompt compaction.

Key changes:

  • subscribeEmbeddedPiSession() now tracks an unsubscribed state and rejects any pending compaction-wait promise with an AbortError on teardown, and waitForCompactionRetry() rejects after unsubscribe instead of returning a false “success”.
  • runEmbeddedAttempt() wraps waitForCompactionRetry() in the local abortable() helper so timeouts/aborts can cancel the wait and reliably reach the finally that releases the session write lock.
  • On timeout during compaction, runEmbeddedAttempt() persists a safe message snapshot (pre-compaction if compaction wasn’t running during snapshot capture) and reports timedOutDuringCompaction upward.
  • runEmbeddedPiAgent() uses timedOutDuringCompaction to skip auth profile rotation for compaction-induced timeouts (model succeeded; compaction/infra was the bottleneck).

Overall this aligns timeout behavior with cancellation semantics, ensures teardown is idempotent, and reduces the chance of sessions becoming permanently stuck due to unreleased locks.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • Changes are localized to compaction wait/teardown and timeout classification, and the new behavior (rejecting waits on unsubscribe + making unsubscribe idempotent) directly addresses the deadlock root cause without altering unrelated runner flows. Snapshot/sessionId pairing is handled consistently, and abort errors are correctly classified by existing runner helpers.
  • No files require special attention

Last reviewed commit: c10290d

@openclaw-barnacle openclaw-barnacle bot added the agents Agent runtime and tooling label Feb 5, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 5, 2026

Additional Comments (1)

src/agents/pi-embedded-runner/run.overflow-compaction.test.ts
Test factory missing field

EmbeddedRunAttemptResult gained a required timedOutDuringCompaction field (src/agents/pi-embedded-runner/run/types.ts:93-112), but makeAttemptResult() doesn’t set it. This will break TS typechecking (and any test compilation) once this PR lands.

Also appears in src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.test.ts:47-62 (makeAttempt).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run.overflow-compaction.test.ts
Line: 157:174

Comment:
**Test factory missing field**

`EmbeddedRunAttemptResult` gained a required `timedOutDuringCompaction` field (`src/agents/pi-embedded-runner/run/types.ts:93-112`), but `makeAttemptResult()` doesn’t set it. This will break TS typechecking (and any test compilation) once this PR lands.

Also appears in `src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.test.ts:47-62` (`makeAttempt`).

How can I resolve this? If you propose a fix, please make it concise.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70f24ff64b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mverrilli mverrilli marked this pull request as draft February 5, 2026 20:08
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from 7d3230b to 041a8da Compare February 5, 2026 21:40
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 00a91e7 to 7b6fdb8 Compare February 6, 2026 00:34
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from e01452b to d08b39f Compare February 6, 2026 01:54
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from a0d07fe to e8037c7 Compare February 6, 2026 02:09
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from e8037c7 to 3af1f85 Compare February 6, 2026 02:45
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from bb88c2f to 5f60f68 Compare February 6, 2026 03:19
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 6, 2026

Additional Comments (1)

src/agents/pi-embedded-subscribe.ts
Unhandled compaction wait rejection
waitForCompactionRetry() creates a microtask promise and, if compaction starts, it does state.compactionRetryPromise.then(resolve) without a rejection handler (src/agents/pi-embedded-subscribe.ts:590-603). Since unsubscribe() now rejects compactionRetryPromise with an AbortError (:544-557), this path will leave the outer promise pending forever on unsubscribe/abort, recreating the hang it’s meant to prevent. Handle rejection too (e.g. then(resolve, resolve) or propagate AbortError via a reject in the outer promise).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-subscribe.ts
Line: 590:603

Comment:
**Unhandled compaction wait rejection**
`waitForCompactionRetry()` creates a microtask promise and, if compaction starts, it does `state.compactionRetryPromise.then(resolve)` without a rejection handler (`src/agents/pi-embedded-subscribe.ts:590-603`). Since `unsubscribe()` now rejects `compactionRetryPromise` with an AbortError (`:544-557`), this path will leave the outer promise pending forever on unsubscribe/abort, recreating the hang it’s meant to prevent. Handle rejection too (e.g. `then(resolve, resolve)` or propagate AbortError via a `reject` in the outer promise).

How can I resolve this? If you propose a fix, please make it concise.

@mverrilli mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 5f60f68 to 4838d19 Compare February 6, 2026 03:29
mverrilli and others added 11 commits February 14, 2026 14:23
When a run times out during compaction, waitForCompactionRetry() was not
respecting the abort signal, causing execution to hang before reaching the
finally block that releases the session write lock. This left sessions
permanently unresponsive.

Additionally, the initial fix had issues with stale snapshots and mixed
compaction state sources that could cause incorrect snapshot selection or
never-settling promises after unsubscribe.

Changes:
- Wrap waitForCompactionRetry() in abortable() to respect abort signal
- Add unsubscribed flag to prevent creating un-resolvable promises after cleanup
- Use activeSession.isCompacting consistently (not subscription state) for
  snapshot decisions to maintain single source of truth
- Gate pre-compaction snapshot usage on whether compaction was already running
  when snapshot was captured, preventing use of incomplete mid-stream snapshots
- Force-reject pending compaction promises in unsubscribe with AbortError

Fixes the issue where sessions become stuck after hitting the 10-minute
default timeout, requiring gateway restart to recover.
…potency

- Make unsubscribe() idempotent with early return guard on state.unsubscribed
- Gate rotation shouldRotate on !aborted for timeout branch (avoid penalizing
  profile when run was externally aborted)
- Narrow timedOutDuringCompaction to actual compaction in-flight only, excluding
  compaction retry prompts where timeouts are genuine provider issues
- Expose isCompactionInFlight() on subscription for precise compaction state
- Remove dead awaitingCompaction variable
abortRun(true) sets both timedOut and aborted, making the
timedOut && !aborted gate unreachable. Remove !aborted from the
timeout branch since timedOut already implies the abort was
self-inflicted by the timeout handler, not an external abort.
Move state.unsubscribed = true to the beginning of unsubscribe(),
immediately after the early-return check. This prevents
waitForCompactionRetry() from creating new promises during the teardown
window, eliminating the race where a promise created mid-unsubscribe
would never resolve or reject, causing deadlock.
Use subscription.isCompacting() instead of isCompactionInFlight() when
classifying timeouts during compaction. isCompacting() checks both
compactionInFlight and pendingCompactionRetry > 0, correctly capturing
the window after auto_compaction_end schedules a retry but before it
starts. Without this, timeouts in the pending-retry window were
misclassified as provider timeouts, incorrectly triggering auth profile
rotation for what are actually infrastructure timeouts.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previously, unsubscribe() failures were logged at warn level, making
resource leaks (event handlers, timers, maps) easy to miss. Now logs
at error level with CRITICAL prefix to ensure visibility. Cannot
rethrow in finally block as it would mask exceptions from try block.
@gumadeiras gumadeiras force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 2b88a4b to 64a2890 Compare February 14, 2026 19:24
@gumadeiras gumadeiras merged commit e6f67d5 into openclaw:main Feb 14, 2026
9 checks passed
@gumadeiras
Copy link
Member

Merged via squash.

Thanks @mverrilli!

philga7 pushed a commit to philga7/openclaw-fork that referenced this pull request Feb 14, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
@mverrilli mverrilli deleted the mverrilli/compaction_timeout_deadlock_fix branch February 14, 2026 20:45
akoscz pushed a commit to akoscz/openclaw that referenced this pull request Feb 15, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
joeykrug added a commit to joeykrug/openclaw that referenced this pull request Feb 15, 2026
When a heartbeat run's embedded agent hits a compaction timeout, the
getReplyFromConfig promise can hang forever if the compaction retry
promise never settles (the root cause was fixed in openclaw#9855 by wrapping
waitForCompactionRetry with abortable(), but that fix hasn't shipped
yet in a release).

Even after openclaw#9855 ships, similar hung-promise bugs in getReplyFromConfig
could still permanently block the heartbeat scheduler, since:
- The `running` flag in heartbeat-wake stays true forever
- The lane task wrapping the heartbeat never completes
- All future heartbeats AND incoming messages on that lane are blocked

This caused a 22+ hour outage on Feb 14 2026 where the gateway stayed
alive as a zombie — no heartbeats, no message processing.

Add two defense-in-depth safety timeouts:

1. heartbeat-runner.ts: Promise.race with 12-minute timeout around
   getReplyFromConfig(). If the promise hangs, it rejects and the
   existing catch block handles it as a normal failure.

2. heartbeat-wake.ts: 13-minute safety timer on the handler callback.
   If the handler promise never settles, forcibly reset the `running`
   flag and reschedule. This ensures the scheduler always recovers
   regardless of what goes wrong inside the handler.

Includes test for the safety timeout in heartbeat-wake.test.ts.

Relates to openclaw#9855
GwonHyeok pushed a commit to learners-superpumped/openclaw that referenced this pull request Feb 15, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
vincentkoc pushed a commit to vincentkoc/openclaw that referenced this pull request Feb 15, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
snowzlm pushed a commit to snowzlm/openclaw that referenced this pull request Feb 15, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
openclaw#9855)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 64a2890
Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

3 participants