fix(agent): prevent session lock deadlock on timeout during compaction by mverrilli · Pull Request #9855 · openclaw/openclaw

mverrilli · 2026-02-05T19:20:50Z

Summary

- Fix session lock deadlock when a run times out during post-prompt compaction
- waitForCompactionRetry() was not respecting the abort signal, so execution hung before reaching the finally block that releases the session write lock, leaving sessions permanently unresponsive
- Add timedOutDuringCompaction flag to skip auth profile rotation on compaction timeouts (model succeeded; not a profile issue)
- Capture pre-compaction message snapshot to persist a complete transcript when compaction times out mid-restructure

Root cause

When a run hits the timeout (default 10 minutes) while waitForCompactionRetry() is blocked, the abort signal fires but the compaction wait has no way to be cancelled. The finally block that calls unsubscribe() and releases the session write lock never runs, leaving the session permanently stuck until gateway restart.

Changes

Subscription cleanup (pi-embedded-subscribe.ts):

- unsubscribe() now actively rejects pending compaction promises with AbortError, aborts in-flight compaction, and sets an unsubscribed flag to prevent new promises from being created
- waitForCompactionRetry() rejects with AbortError after unsubscribe (cancellation, not false success) and propagates rejections through all code paths including the microtask race-check
- Orphaned compaction promises (created by late event handlers with no consumer) are logged at debug level to prevent silent unhandled rejections
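The teardown behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in pi-embedded-subscribe.ts: `createSubscriptionSketch`, `beginCompaction`, and `makeAbortError` are hypothetical names, and the no-op `catch` stands in for the debug-level logging of orphaned promises.

```typescript
// Hypothetical, simplified subscription. Only unsubscribe() and
// waitForCompactionRetry() are names taken from the PR description.
function makeAbortError(message: string): Error {
  const err = new Error(message);
  err.name = "AbortError";
  return err;
}

interface SubscriptionSketch {
  beginCompaction(): void;
  waitForCompactionRetry(): Promise<void>;
  unsubscribe(): void;
}

function createSubscriptionSketch(): SubscriptionSketch {
  let unsubscribed = false;
  let pending: Promise<void> | undefined;
  let rejectPending: ((err: Error) => void) | undefined;

  return {
    beginCompaction() {
      if (unsubscribed) return; // never create promises after teardown
      pending = new Promise<void>((_resolve, reject) => {
        rejectPending = reject;
      });
      // A promise nobody awaits must not surface as an unhandled rejection.
      // (The PR logs these at debug level; this sketch just swallows them.)
      pending.catch(() => undefined);
    },
    waitForCompactionRetry() {
      if (unsubscribed) {
        // Cancellation, not a false "success".
        return Promise.reject(makeAbortError("unsubscribed"));
      }
      return pending ?? Promise.resolve();
    },
    unsubscribe() {
      if (unsubscribed) return; // idempotent
      unsubscribed = true; // set first so no new waits start during teardown
      rejectPending?.(makeAbortError("unsubscribed during compaction"));
      pending = undefined;
      rejectPending = undefined;
    },
  };
}
```

A caller that was already awaiting `waitForCompactionRetry()` when `unsubscribe()` runs sees an `AbortError` rejection instead of a promise that never settles.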

Attempt lifecycle (attempt.ts):

- Wrap waitForCompactionRetry() in abortable() so the abort signal can cancel the wait
- Capture a pre-compaction message snapshot with before/after compaction state checks to avoid copying a mid-compaction array
- Use awaitingCompaction flag + subscription.isCompacting() for robust timeout classification that doesn't depend on a single instantaneous read
- Use activeSession.isCompacting (session state) for snapshot safety decisions, and subscription.isCompacting() (broader signal including pending retries) for timeout classification
- Pair sessionIdUsed with whichever snapshot source is actually used to prevent downstream correlation issues
- Guard unsubscribe() with try/catch in the finally block so remaining cleanup (clearActiveEmbeddedRun, abort listener removal) always runs
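To make the first item concrete, here is one way an `abortable()` wrapper could look. The PR refers to a local `abortable()` helper but does not show its body, so this implementation, along with `runAttemptSketch` and `releaseLock`, is an assumption for illustration only.

```typescript
function makeAbortError(): Error {
  const err = new Error("aborted");
  err.name = "AbortError";
  return err;
}

// Assumed shape of an abortable() helper: settle with the wrapped promise,
// or reject as soon as the signal aborts, whichever happens first.
function abortable<T>(promise: Promise<T>, signal: AbortSignal): Promise<T> {
  if (signal.aborted) {
    return Promise.reject(makeAbortError());
  }
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(makeAbortError());
    signal.addEventListener("abort", onAbort, { once: true });
    promise.then(
      (value) => {
        signal.removeEventListener("abort", onAbort);
        resolve(value);
      },
      (err) => {
        signal.removeEventListener("abort", onAbort);
        reject(err);
      },
    );
  });
}

// Hypothetical attempt body: because the wait is abortable, a timeout
// always reaches the finally block that releases the lock.
async function runAttemptSketch(
  waitForCompactionRetry: () => Promise<void>,
  signal: AbortSignal,
  releaseLock: () => void,
): Promise<"ok" | "aborted"> {
  try {
    await abortable(waitForCompactionRetry(), signal);
    return "ok";
  } catch (err) {
    if ((err as Error).name === "AbortError") return "aborted";
    throw err;
  } finally {
    releaseLock();
  }
}
```

With a wait that never settles, calling `abort()` on the controller makes `runAttemptSketch` return `"aborted"` and run `releaseLock()`, which is the path the original code could not reach.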

Profile rotation (run.ts):

- Skip auth profile rotation when the timeout occurred during compaction: the model succeeded, and compaction was the bottleneck
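As a rough sketch of that rotation rule, restricted to the timeout branch: `AttemptOutcomeSketch` and `shouldRotateOnTimeout` are hypothetical names, and the real decision in run.ts weighs more inputs than the two flags shown here.

```typescript
// Hypothetical, trimmed-down view of an attempt result.
interface AttemptOutcomeSketch {
  timedOut: boolean;
  timedOutDuringCompaction: boolean;
}

// Rotate the auth profile only for timeouts that were not caused by compaction.
function shouldRotateOnTimeout(outcome: AttemptOutcomeSketch): boolean {
  return outcome.timedOut && !outcome.timedOutDuringCompaction;
}
```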

This was AI assisted. I understand what the code does.
Tested for 48 hours; the regular hangs I had been experiencing have disappeared.

Greptile Overview

Greptile Summary

This PR hardens embedded session teardown to prevent deadlocks when a run times out during post-prompt compaction.

Key changes:

- subscribeEmbeddedPiSession() now tracks an unsubscribed state and rejects any pending compaction-wait promise with an AbortError on teardown, and waitForCompactionRetry() rejects after unsubscribe instead of returning a false “success”.
- runEmbeddedAttempt() wraps waitForCompactionRetry() in the local abortable() helper so timeouts/aborts can cancel the wait and reliably reach the finally that releases the session write lock.
- On timeout during compaction, runEmbeddedAttempt() persists a safe message snapshot (pre-compaction if compaction wasn’t running during snapshot capture) and reports timedOutDuringCompaction upward.
- runEmbeddedPiAgent() uses timedOutDuringCompaction to skip auth profile rotation for compaction-induced timeouts (model succeeded; compaction/infra was the bottleneck).

Overall this aligns timeout behavior with cancellation semantics, ensures teardown is idempotent, and reduces the chance of sessions becoming permanently stuck due to unreleased locks.

Confidence Score: 5/5

This PR is safe to merge with minimal risk.
Changes are localized to compaction wait/teardown and timeout classification, and the new behavior (rejecting waits on unsubscribe + making unsubscribe idempotent) directly addresses the deadlock root cause without altering unrelated runner flows. Snapshot/sessionId pairing is handled consistently, and abort errors are correctly classified by existing runner helpers.
No files require special attention

Last reviewed commit: c10290d

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-02-05T19:22:36Z

Additional Comments (1)

src/agents/pi-embedded-runner/run.overflow-compaction.test.ts (lines 157-174)
Test factory missing field

`EmbeddedRunAttemptResult` gained a required `timedOutDuringCompaction` field (`src/agents/pi-embedded-runner/run/types.ts:93-112`), but `makeAttemptResult()` doesn’t set it. This will break TS typechecking (and any test compilation) once this PR lands.

The same omission appears in `src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.test.ts:47-62` (`makeAttempt`).
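One possible shape for the fix is a factory that defaults the new field and accepts overrides. This is only a sketch: `AttemptResultSketch` is a hypothetical, trimmed-down stand-in for `EmbeddedRunAttemptResult`, and the `overrides` parameter is an assumption about how the test helper could be written, not a description of the existing one.

```typescript
// Hypothetical subset of the result type; the real type has more fields.
interface AttemptResultSketch {
  aborted: boolean;
  timedOut: boolean;
  timedOutDuringCompaction: boolean;
}

// Defaults every required field so adding a new one only touches this factory.
function makeAttemptResult(
  overrides: Partial<AttemptResultSketch> = {},
): AttemptResultSketch {
  return {
    aborted: false,
    timedOut: false,
    timedOutDuringCompaction: false,
    ...overrides,
  };
}
```

A test that needs the compaction-timeout case can then call `makeAttemptResult({ timedOut: true, timedOutDuringCompaction: true })` without restating the other fields.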

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70f24ff64b

src/agents/pi-embedded-runner/run/attempt.ts

greptile-apps review rounds (files reviewed, comments, then the file or files commented on):

1 file reviewed, 1 comment: src/agents/pi-embedded-subscribe.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-subscribe.ts
1 file reviewed, 2 comments: src/agents/pi-embedded-subscribe.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-runner/run/attempt.ts
2 files reviewed, 2 comments: src/agents/pi-embedded-subscribe.ts, src/agents/pi-embedded-runner/run/attempt.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-subscribe.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-runner/run/attempt.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-subscribe.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-runner/run/attempt.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-runner/run/attempt.ts
2 files reviewed, 3 comments: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run.ts
1 file reviewed, 1 comment: src/agents/pi-embedded-runner/run/attempt.ts
2 files reviewed, 2 comments: src/agents/pi-embedded-runner/run/attempt.ts

greptile-apps · 2026-02-06T03:28:26Z

Additional Comments (1)

src/agents/pi-embedded-subscribe.ts
Unhandled compaction wait rejection
waitForCompactionRetry() creates a microtask promise and, if compaction starts, it does state.compactionRetryPromise.then(resolve) without a rejection handler (src/agents/pi-embedded-subscribe.ts:590-603). Since unsubscribe() now rejects compactionRetryPromise with an AbortError (:544-557), this path will leave the outer promise pending forever on unsubscribe/abort, recreating the hang it’s meant to prevent. Handle rejection too (e.g. then(resolve, resolve) or propagate AbortError via a reject in the outer promise).
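The difference between the two forms can be shown in isolation. This is a generic sketch of the pattern, not the code at the cited lines; `waitForInner` is a hypothetical name.

```typescript
function waitForInner(inner: Promise<void>): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // Problematic form: inner.then(resolve)
    //   If `inner` rejects, nothing ever settles the outer promise,
    //   so anyone awaiting it waits forever.
    // Forwarding both outcomes lets a rejection (for example an
    // AbortError from teardown) reach the awaiter:
    inner.then(resolve, reject);
  });
}
```

Passing `resolve` as the second handler instead, as the comment also suggests, would settle the outer promise too, but it would report cancellation as success; forwarding to `reject` keeps the two outcomes distinguishable.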

When a run times out during compaction, waitForCompactionRetry() was not respecting the abort signal, causing execution to hang before reaching the finally block that releases the session write lock. This left sessions permanently unresponsive.

Additionally, the initial fix had issues with stale snapshots and mixed compaction state sources that could cause incorrect snapshot selection or never-settling promises after unsubscribe.

Changes:
- Wrap waitForCompactionRetry() in abortable() to respect abort signal
- Add unsubscribed flag to prevent creating un-resolvable promises after cleanup
- Use activeSession.isCompacting consistently (not subscription state) for snapshot decisions to maintain single source of truth
- Gate pre-compaction snapshot usage on whether compaction was already running when snapshot was captured, preventing use of incomplete mid-stream snapshots
- Force-reject pending compaction promises in unsubscribe with AbortError

Fixes the issue where sessions become stuck after hitting the 10-minute default timeout, requiring gateway restart to recover.

…potency
- Make unsubscribe() idempotent with early return guard on state.unsubscribed
- Gate rotation shouldRotate on !aborted for timeout branch (avoid penalizing profile when run was externally aborted)
- Narrow timedOutDuringCompaction to actual compaction in-flight only, excluding compaction retry prompts where timeouts are genuine provider issues
- Expose isCompactionInFlight() on subscription for precise compaction state
- Remove dead awaitingCompaction variable

abortRun(true) sets both timedOut and aborted, making the timedOut && !aborted gate unreachable. Remove !aborted from the timeout branch since timedOut already implies the abort was self-inflicted by the timeout handler, not an external abort.

Move state.unsubscribed = true to the beginning of unsubscribe(), immediately after the early-return check. This prevents waitForCompactionRetry() from creating new promises during the teardown window, eliminating the race where a promise created mid-unsubscribe would never resolve or reject, causing deadlock.

Use subscription.isCompacting() instead of isCompactionInFlight() when classifying timeouts during compaction. isCompacting() checks both compactionInFlight and pendingCompactionRetry > 0, correctly capturing the window after auto_compaction_end schedules a retry but before it starts. Without this, timeouts in the pending-retry window were misclassified as provider timeouts, incorrectly triggering auth profile rotation for what are actually infrastructure timeouts.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
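The two predicates differ only in the pending-retry window. A minimal sketch, assuming the state is just the two fields named in the commit message (`CompactionStateSketch` is a hypothetical type):

```typescript
// Hypothetical state shape built from the two fields the commit message names.
interface CompactionStateSketch {
  compactionInFlight: boolean;
  pendingCompactionRetry: number;
}

// Narrow signal: true only while a compaction is actually running.
function isCompactionInFlight(state: CompactionStateSketch): boolean {
  return state.compactionInFlight;
}

// Broader signal: also true after a retry has been scheduled but not yet started.
function isCompacting(state: CompactionStateSketch): boolean {
  return state.compactionInFlight || state.pendingCompactionRetry > 0;
}
```

For a state of `{ compactionInFlight: false, pendingCompactionRetry: 1 }`, `isCompactionInFlight` returns false while `isCompacting` returns true, which is the window the commit describes.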

Previously, unsubscribe() failures were logged at warn level, making resource leaks (event handlers, timers, maps) easy to miss. Now logs at error level with CRITICAL prefix to ensure visibility. Cannot rethrow in finally block as it would mask exceptions from try block.
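The guarded cleanup could look roughly like this. `runCleanupSketch`, `remainingSteps`, and `logError` are hypothetical names; the point is only that a throwing unsubscribe() is logged rather than rethrown, so later steps still run.

```typescript
type CleanupStep = () => void;

// Sketch of cleanup intended to run inside a finally block.
function runCleanupSketch(
  unsubscribe: CleanupStep,
  remainingSteps: CleanupStep[],
  logError: (message: string, err: unknown) => void,
): void {
  try {
    unsubscribe();
  } catch (err) {
    // Logged, not rethrown: throwing from finally would mask an
    // exception raised in the try block.
    logError("CRITICAL: unsubscribe failed during cleanup", err);
  }
  for (const step of remainingSteps) {
    step(); // e.g. clearing the active run, removing the abort listener
  }
}
```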

gumadeiras · 2026-02-14T19:24:23Z

Merged via squash.

Prepared head SHA: 64a2890
Merge commit: e6f67d5

Thanks @mverrilli!

@gumadeiras

openclaw#9855) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 64a2890 Co-authored-by: mverrilli <816450+mverrilli@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras

When a heartbeat run's embedded agent hits a compaction timeout, the getReplyFromConfig promise can hang forever if the compaction retry promise never settles. (The root cause was fixed in openclaw#9855 by wrapping waitForCompactionRetry with abortable(), but that fix hasn't shipped yet in a release.)

Even after openclaw#9855 ships, similar hung-promise bugs in getReplyFromConfig could still permanently block the heartbeat scheduler, since:
- The `running` flag in heartbeat-wake stays true forever
- The lane task wrapping the heartbeat never completes
- All future heartbeats AND incoming messages on that lane are blocked

This caused a 22+ hour outage on Feb 14 2026 where the gateway stayed alive as a zombie: no heartbeats, no message processing.

Add two defense-in-depth safety timeouts:
1. heartbeat-runner.ts: Promise.race with 12-minute timeout around getReplyFromConfig(). If the promise hangs, it rejects and the existing catch block handles it as a normal failure.
2. heartbeat-wake.ts: 13-minute safety timer on the handler callback. If the handler promise never settles, forcibly reset the `running` flag and reschedule. This ensures the scheduler always recovers regardless of what goes wrong inside the handler.

Includes test for the safety timeout in heartbeat-wake.test.ts.

Relates to openclaw#9855
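A Promise.race safety timeout of the kind described for heartbeat-runner.ts might be sketched like this. `withTimeout` is a hypothetical helper, not a function from the referenced change, and the 12-minute figure would be passed in by the caller.

```typescript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards so it
  // cannot keep the process alive or fire late.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Note that the race only stops the caller from waiting; the underlying operation keeps running unless it is cancelled separately, which is why this is a safety net rather than a replacement for making the wait abortable.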

openclaw-barnacle bot added the agents label (Agent runtime and tooling) Feb 5, 2026

greptile-apps bot reviewed Feb 5, 2026

chatgpt-codex-connector bot reviewed Feb 5, 2026
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

mverrilli marked this pull request as draft February 5, 2026 20:08

greptile-apps bot reviewed Feb 5, 2026
src/agents/pi-embedded-subscribe.ts (resolved)

greptile-apps bot reviewed Feb 5, 2026
src/agents/pi-embedded-subscribe.ts (resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from 7d3230b to 041a8da, February 5, 2026 21:40

greptile-apps bot reviewed Feb 5, 2026
src/agents/pi-embedded-subscribe.ts (outdated, resolved)
src/agents/pi-embedded-subscribe.ts (outdated, resolved)

greptile-apps bot reviewed Feb 5, 2026
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

greptile-apps bot reviewed Feb 5, 2026
src/agents/pi-embedded-subscribe.ts (resolved)
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-subscribe.ts (resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 00a91e7 to 7b6fdb8, February 6, 2026 00:34

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-subscribe.ts (outdated, resolved)

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from e01452b to d08b39f, February 6, 2026 01:54

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (outdated, resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from a0d07fe to e8037c7, February 6, 2026 02:09

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (outdated, resolved)
src/agents/pi-embedded-runner/run/attempt.ts (resolved)
src/agents/pi-embedded-runner/run.ts (resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from e8037c7 to 3af1f85, February 6, 2026 02:45

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (outdated, resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch 2 times, most recently from bb88c2f to 5f60f68, February 6, 2026 03:19

greptile-apps bot reviewed Feb 6, 2026
src/agents/pi-embedded-runner/run/attempt.ts (resolved)

mverrilli force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 5f60f68 to 4838d19, February 6, 2026 03:29

mverrilli and others added 11 commits February 14, 2026 14:23

fix(agent): harden compaction timeout cleanup

4515811

lint

04489de

lint

30d7f0d

fix(agent): classify external compaction timeout aborts

eeaf721

test(agent): cover compaction-timeout parity and unsubscribe abort

64a2890

gumadeiras force-pushed the mverrilli/compaction_timeout_deadlock_fix branch from 2b88a4b to 64a2890, February 14, 2026 19:24

gumadeiras merged commit e6f67d5 into openclaw:main Feb 14, 2026
9 checks passed

openclaw-barnacle bot added size: M and removed size: S labels Feb 14, 2026

github-actions bot mentioned this pull request Feb 14, 2026

📡 Upstream Digest — 2026-02-14 20:17 UTC curtismercier/openclaw-mods#26

Open

mverrilli deleted the mverrilli/compaction_timeout_deadlock_fix branch February 14, 2026 20:45

mverrilli mentioned this pull request Feb 14, 2026

fix(agent): prevent session deadlock on timeout during tool execution #16554

Open

joeykrug mentioned this pull request Feb 15, 2026

fix(heartbeat): add safety timeouts to prevent permanent scheduler hang #16780

Closed

nathankjer mentioned this pull request Feb 17, 2026

BUG: Compaction-retry timeout leaves session permanently stuck in processing state (deadlock) #12375

Closed

This was referenced Feb 23, 2026

Bug: Telegram bot receives messages but fails to send replies #6278

Closed

Fix: make embedded compaction wait abortable #5467

Closed
