fix(agents): file-scoped prompt-window guard for same-session embedded races #86067

Closed
ubehera wants to merge 2 commits into openclaw:main from ubehera:fix/session-takeover-file-scoped-prompt-guard

Conversation

ubehera commented May 24, 2026

Contributor

Update 2026-05-25: An earlier revision of this PR also included a waitForSessionEvents drain in installPromptSubmissionLockRelease's finally block (reported by @kesslerio against AlphaClaw). That commit has been moved to its own focused PR with honest framing — see the followup discussion on #85913. This PR is scoped strictly to the file-scoped prompt-window guard. The vanilla-openclaw same-lane race (a separate failure mode from the cross-lane race this PR addresses) is being fixed in #86584.

Summary

  • Adds a file-scoped prompt-window guard that serialises embedded-runner prompt windows on the same resolved session JSONL across concurrent lanes.
  • Introduces releaseForSessionIdleWait() for the post-prompt compaction window so it does not block other lanes' real prompt windows.
  • Routes every exit path through a turn-release try/catch and try/finally so the file-scoped queue is vacated on success, takeover, lock-timeout, fence-mismatch, and mid-flight release errors. A wedge in one lane never leaves later same-file waiters blocked.
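The exit-path guarantee in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `runPromptWindow`, `openWindow`, `stream`, and `reacquire` are hypothetical names, and the real controller splits this logic across releaseForPrompt() and reacquireAfterPrompt().

```typescript
// Hypothetical sketch; names are illustrative and not taken from the PR's source.
type TurnSlot = { release: () => void };

async function runPromptWindow<T>(
  turn: TurnSlot,
  openWindow: () => Promise<void>, // e.g. drop the OS lock before streaming
  stream: () => Promise<T>, // the provider call
  reacquire: () => Promise<void>, // retake the lock and check the fence
): Promise<T> {
  try {
    await openWindow();
  } catch (err) {
    // Failure before the window opened: vacate the slot, then rethrow.
    turn.release();
    throw err;
  }
  try {
    const reply = await stream();
    await reacquire();
    return reply;
  } finally {
    // Success, takeover, lock timeout, or any other error: always vacate.
    turn.release();
  }
}
```

Each path calls `turn.release()` exactly once, so a waiter queued behind this turn is unblocked whether the window succeeds or fails.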

Fixes #85913.

Background

createEmbeddedAttemptSessionLockController releases the OS-level session write lock during provider streaming so long model calls don't hold the lock for minutes at a time. After streaming, reacquireAfterPrompt() reacquires the lock and assertSessionFileFence() checks an inode/size/mtime fingerprint to detect external mutation. If the file changed in an unowned way, the controller throws EmbeddedAttemptSessionTakeoverError.
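A fingerprint fence of this kind can be sketched in a few lines. The field names below mirror Node's `fs.Stats` (`ino`, `size`, `mtimeMs`); everything else (`StatLike`, `fingerprintOf`, `assertFence`, `SessionTakeoverError`) is a hypothetical stand-in, and the real assertSessionFileFence() may compare more state than this.

```typescript
// Hypothetical sketch: helper names and the error class are illustrative only.
type StatLike = { ino: number; size: number; mtimeMs: number };

class SessionTakeoverError extends Error {}

// Collapse the three stat fields into one comparable token.
function fingerprintOf(stat: StatLike): string {
  return `${stat.ino}:${stat.size}:${stat.mtimeMs}`;
}

// Throw if the file no longer matches the fingerprint captured at release time.
function assertFence(expectedFingerprint: string, current: StatLike): void {
  const actual = fingerprintOf(current);
  if (actual !== expectedFingerprint) {
    throw new SessionTakeoverError(
      `session file changed: ${expectedFingerprint} -> ${actual}`,
    );
  }
}

// Usage: capture at release, check at reacquire.
const statAtRelease: StatLike = { ino: 42, size: 1024, mtimeMs: 1000 };
const fenceAtRelease = fingerprintOf(statAtRelease);
assertFence(fenceAtRelease, statAtRelease); // unchanged file passes
```

A check like this can only report that a mutation already happened, which is why the next paragraph calls the fence a symptom-catcher.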

The fence is a symptom-catcher, not a race preventer. The live-repro evidence on the issue shows the underlying race: two lanes on the same OpenClaw process can both resolve to the same JSONL (stuck-session recovery firing a fresh embedded run on a session UUID whose file resolution aliases to an already-active file). Both lanes release for prompt simultaneously, both stride into provider streams, the slower one's reacquire trips the fence after the provider call has already produced output, and the chain falls all the way to Embedded agent failed before reply: All models failed. The user sees a dropped reply.

RuntimeEnv is module-scoped at ownedSessionFileWrites / trustedSessionFileStates — both keyed by path.resolve(sessionFile). The fix extends that pattern with a third process-global map: a per-file queue of prompt turns.

What changed

src/agents/pi-embedded-runner/run/attempt.session-lock.ts

  • New module-level promptSessionFileTurnTails: Map<string, Promise<void>> keyed by the same resolveSessionFileFenceKey the fence already uses.
  • acquirePromptSessionFileTurn(sessionFileKey) returns { awaitPrior, release }. The first entrant on a file gets awaitPrior: undefined and falls through with no extra acquire/release. The second entrant gets the first's tail and waits on it before its own prompt window opens.
  • performHeldLockReleaseForWindow(registerPromptHolder) consolidates the held-lock release path. When registerPromptHolder is true and a prior turn exists, the entrant yields its OS lock, awaits the prior turn, reacquires fresh, and then continues with the existing fingerprint/snapshot/release sequence. Wrapped in try/catch so any exception between acquiring the turn and the controller taking ownership releases the turn slot before rethrowing.
  • releaseForPrompt() becomes a one-liner over performHeldLockReleaseForWindow(true).
  • releaseForSessionIdleWait() (new public method) becomes a one-liner over performHeldLockReleaseForWindow(false). Used for the post-prompt compaction window, which is not a provider-prompt window and must not serialise other lanes' prompts.
  • reacquireAfterPrompt() releases the turn in a finally block — the success path and every error branch (lock-timeout, takeover fence, any other exception) all vacate the slot. The no-op-early-return branch (takeoverDetected || heldLock) also releases the turn so a no-op entrant cannot strand the queue.
  • resetEmbeddedAttemptSessionFilePromptGuardsForTest() (new test helper) clears the module-level map so tests start from a clean slate.
  • Exported releaseForSessionIdleWait on the EmbeddedAttemptSessionLockController type.
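The per-file turn queue described in these bullets follows a common promise-tail pattern. The sketch below is an assumption about its general shape, not the PR's implementation: `turnTails`, `acquireTurn`, and `PromptTurn` are hypothetical names, and the real acquirePromptSessionFileTurn() may differ in how it cleans up map entries.

```typescript
// Hypothetical sketch of a per-key promise-tail queue; not the PR's actual code.
type PromptTurn = {
  awaitPrior: Promise<void> | undefined; // undefined for the first entrant
  release: () => void; // idempotent
};

const turnTails = new Map<string, Promise<void>>();

function acquireTurn(fileKey: string): PromptTurn {
  const prior = turnTails.get(fileKey);
  let resolveOwn!: () => void;
  const own = new Promise<void>((resolve) => {
    resolveOwn = resolve;
  });
  // The new tail settles only after the prior turn and this turn are both done.
  const tail = (prior ?? Promise.resolve()).then(() => own);
  turnTails.set(fileKey, tail);

  let released = false;
  return {
    awaitPrior: prior,
    release: () => {
      if (released) return;
      released = true;
      resolveOwn();
      // Drop the map entry only if nobody queued behind this turn.
      if (turnTails.get(fileKey) === tail) turnTails.delete(fileKey);
    },
  };
}
```

The first entrant on a key sees `awaitPrior === undefined` and pays nothing extra; a second entrant awaits the first's tail before opening its own window. Keys for different files never interact.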

src/agents/pi-embedded-runner/run/attempt.ts

  • Line 4277: the post-prompt compaction wait uses releaseForSessionIdleWait() instead of releaseForPrompt(). This path was already a non-prompt release — the rename makes it not contend with real prompt windows on the same file. The matching post-compaction withSessionWriteLock reacquires the OS lock internally; no other behavior changes.
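The split between the two entry points can be pictured as one shared helper with a boolean flag. The method names below follow the PR description, but the bodies are a hypothetical sketch: the methods are made synchronous for brevity, and `promptHolders` stands in for the real turn queue.

```typescript
// Hypothetical sketch: method names follow the PR text; bodies are illustrative.
class WindowControllerSketch {
  // Files that currently have a registered prompt holder (stand-in for the queue).
  static promptHolders = new Set<string>();

  private readonly fileKey: string;

  constructor(fileKey: string) {
    this.fileKey = fileKey;
  }

  private performHeldLockReleaseForWindow(registerPromptHolder: boolean): void {
    if (registerPromptHolder) {
      // Only real prompt windows join the file-scoped queue.
      WindowControllerSketch.promptHolders.add(this.fileKey);
    }
    // The shared fingerprint/snapshot/release sequence would run here.
  }

  releaseForPrompt(): void {
    this.performHeldLockReleaseForWindow(true);
  }

  releaseForSessionIdleWait(): void {
    this.performHeldLockReleaseForWindow(false);
  }
}
```

Under this shape, a compaction wait drops the OS lock without registering as a prompt holder, so it cannot delay another lane's prompt window on the same file.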

src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

  • afterEach clears the new queue alongside the existing resetSessionWriteLockStateForTest() cleanup.
  • Six existing multi-controller test bodies call resetEmbeddedAttemptSessionFilePromptGuardsForTest() between the first controller's releaseForPrompt() and the second controller's releaseForPrompt(). These tests were written before the file-scoped guard existed and use releaseForPrompt() as a "drop the OS lock" call without a paired reacquireAfterPrompt(). The reset preserves their existing assertions without forcing them to invent reacquire calls they don't need.
  • Five new test cases under the same describe block:
    • Serialisation happy path — second same-file controller's releaseForPrompt() blocks until the first calls reacquireAfterPrompt().
    • Cleanup-on-reacquire-failure — first controller's reacquireAfterPrompt() rejects with a SessionWriteLockTimeoutError; the second controller's releaseForPrompt() still completes (the turn was released in the finally).
    • Cleanup-on-release-failure — first controller's held-lock release rejects partway through releaseForPrompt(); the second controller's releaseForPrompt() still completes (the turn was released in the catch before re-throwing).
    • Compaction-wait isolation — releaseForSessionIdleWait() does not block another lane's releaseForPrompt() on the same file.
    • Cross-file isolation — controllers on different sessionFiles do not serialise.

Verification

# All acceptance-criteria tests from the issue's bot review:
node scripts/run-vitest.mjs \
  src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts \
  src/logging/diagnostic-stuck-session-recovery.runtime.test.ts \
  src/infra/heartbeat-runner.skips-busy-session-lane.test.ts \
  src/agents/model-fallback.test.ts \
  src/agents/failover-error.test.ts

# Test Files  7 passed (7)
#      Tests  306 passed (306)

pnpm check:changed exits 0 (typecheck, lint, runtime import cycles, repo guards).

Real behavior proof

  • Behavior or issue addressed: Two concurrent embedded attempts on the same resolved session JSONL no longer race during the provider-streaming window. The losing lane's reply was being dropped because the takeover fence fired mid-stream after the provider had already started producing output. With the file-scoped turn queue, the second entrant waits for the first to reacquire (or fail cleanly) before its own prompt window opens. Failure-path cleanup ensures one wedged prompt cannot block later prompts on the same file.

  • Real environment tested: Local OpenClaw source checkout (macOS 15.5, Node 22.19+) at the patched branch HEAD 65705d8c39. The patched createEmbeddedAttemptSessionLockController runs in two real concurrent controllers against a shared session JSONL with the production acquireSessionWriteLock contract; per-microsecond timestamps and the lock-acquire/release event ledger are captured below. The acceptance-criteria test sweep from the issue's bot review runs against the same patched source.

  • Exact steps or command run after this patch:

    pnpm install
    pnpm build
    node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
    
    # Acceptance-criteria sweep from the issue's bot review:
    node scripts/run-vitest.mjs \
      src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts \
      src/logging/diagnostic-stuck-session-recovery.runtime.test.ts \
      src/infra/heartbeat-runner.skips-busy-session-lane.test.ts \
      src/agents/model-fallback.test.ts \
      src/agents/failover-error.test.ts
    
    # Drive the patched controller end-to-end against a shared session JSONL:
    npx tsx /abs/path/to/pr-proof/run-session-takeover-guard-proof.mjs /abs/path/to/openclaw
  • Evidence after fix:

    Before/after reproduction: the same scenario run against pre-patch (5be62e779b, the commit immediately before this fix) and post-patch (65705d8c39) source via the same run-same-file-race-repro.mjs script. Two real EmbeddedAttemptSessionLockController instances share a temp session JSONL, with production acquireSessionWriteLock (no test mocks) and an external mutation injected mid-prompt to simulate the aliased-UUID concurrent lane from the live-repro evidence in #85913 (EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file, internal ref #83510):

    PRE-PATCH (5be62e779b):
      [   2.60ms] ctrl-A   prompt in flight (200ms hold)
      [   9.47ms] ctrl-B   prompt in flight (200ms hold)
      [ 111.70ms] external appending bytes outside the ownership chain
      [ 209.67ms] ctrl-A   THREW on reacquire: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
      [ 210.98ms] ctrl-B   THREW on reacquire: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
      Result:  replies delivered 0 of 2  ·  takeover errors 2
    
    POST-PATCH (65705d8c39):
      [   2.64ms] ctrl-A   prompt in flight (200ms hold)
      [ 111.53ms] external appending bytes outside the ownership chain
      [ 209.33ms] ctrl-A   THREW on reacquire: TAKEOVER (EmbeddedAttemptSessionTakeoverError)
      [ 210.89ms] ctrl-B   prompt in flight (200ms hold)
      [ 415.45ms] ctrl-B   DELIVERED (reacquire succeeded)
      Result:  replies delivered 1 of 2  ·  takeover errors 1
    

    Reading the timestamps: pre-patch, the two controllers enter their mid-prompt sleep at ms 2 and ms 9, so their prompt windows overlap and both fences hold the same fingerprint, which then goes stale. When the external write lands at ms 111, both lanes observe the mutation on reacquire at about ms 210. Post-patch, lane B's "prompt in flight" log line does not appear until ms 210, just after lane A's reacquire throws at ms 209. The file-scoped queue held lane B's releaseForPrompt until lane A's turn vacated; by then the file was stable, so lane B's release captured a fresh fingerprint and its reacquire succeeded at ms 415.

    Lane A's takeover is the unavoidable case (a true external mutation during its prompt window — no coordination primitive can save it; the fence correctly fires as the safety net). Lane B is the cascade casualty the patch saves.

    Internal controller proof — earlier in the proof pipeline, two real controllers driven through same-file contention with timestamped event ledger. Verbatim stdout from the run on this branch (65705d8c39):

    ============================================================
    PR #86067 — file-scoped prompt-window guard proof
    ============================================================
    
    openclaw checkout: /Users/.../openclaw
    source under test:  src/agents/pi-embedded-runner/run/attempt.session-lock.ts
    
    ============================================================
    Scenario A — same-file prompt windows serialise
    ============================================================
    
      [  509.55ms] setup                  two controllers on /tmp/proof-session.jsonl
      [  509.59ms] first                  releaseForPrompt() — register turn-1
      [  514.63ms] second                 releaseForPrompt() — should block on turn-1
      [  514.69ms] observe                t+0.06ms: secondCompleted=false (expected: false)
      [  514.70ms] first                  reacquireAfterPrompt() — release turn-1, second waiter unblocks
      [  515.86ms] observe                t+1.23ms: secondCompleted=true (expected: true)
    
      event ledger:
        acquire(A-1)
        acquire(A-2)
        release(A-1)
        release(A-2)
        acquire(A-3)
        acquire(A-4)
        release(A-4)
    
      Scenario A: PASS — second controller waited for first's reacquire
    
    ============================================================
    Scenario B — cleanup-on-failure: first's reacquire throws
    ============================================================
    
      [  516.19ms] setup                  two controllers; first's reacquire mocked to throw SessionWriteLockTimeoutError
      [  516.21ms] first                  releaseForPrompt() — register turn-1
      [  516.31ms] second                 releaseForPrompt() — should block on turn-1
      [  516.37ms] observe                t+0.05ms: secondCompleted=false (expected: false)
      [  516.38ms] first                  reacquireAfterPrompt() — throws SessionWriteLockTimeoutError
      [  516.41ms] first                  threw: SessionWriteLockTimeoutError (OPENCLAW_SESSION_WRITE_LOCK_TIMEOUT)
      [  516.42ms] observe                turn-1 must have been vacated in reacquire's finally
      [  516.50ms] observe                t+0.18ms: secondCompleted=true (expected: true)
    
      Scenario B: PASS — second waiter completed despite first's reacquire failure
    

    Scenario A's event ledger is the key observable: lock A-1 is the first controller's initial acquire, A-2 the second's initial acquire, A-3 is the first controller's reacquire-after-prompt (which fires the turn release on the queue), and A-4 is the second controller's reacquire-fresh after its prior-wait dance. Without the file-scoped guard, both A-1 and A-2 would release simultaneously and both controllers would stride into provider streams.

    Scenario B's timestamps show the cleanup-on-failure path is wired correctly: secondCompleted flips from false to true within about 0.1 ms of the first controller's reacquire throwing (516.41 ms to 516.50 ms in the log above). If reacquireAfterPrompt's finally block were not releasing the turn on the error path, this assertion would have timed out instead.

    Targeted session-lock suite (74 tests including the 5 new file-scoped guard regressions):

    RUN  v4.1.7 /Users/.../openclaw
     Test Files  2 passed (2)
          Tests  74 passed (74)
       Duration  919ms
    

    Full acceptance-criteria sweep (the five files the issue's bot review named):

    RUN  v4.1.7 /Users/.../openclaw
     Test Files  7 passed (7)
          Tests  306 passed (306)
       Duration  46.34s
    

    The two regression tests that distinguish a correct fix from a half-fix are:

    • vacates the prompt turn so later waiters proceed when reacquireAfterPrompt throws — drives the first controller's reacquire mock to reject with SessionWriteLockTimeoutError. Asserts that a second waiter blocked on the first's turn completes its releaseForPrompt() after the throw. If the cleanup is wired only on the success path, this test times out (the second waiter blocks forever on a turn promise that never resolves).
    • vacates the prompt turn when releaseForPrompt itself fails mid-flight — drives the first controller's held-lock release to reject. The first releaseForPrompt() throws partway, but the catch in performHeldLockReleaseForWindow releases the turn before rethrowing. A second releaseForPrompt() then completes cleanly. If the cleanup is only in reacquireAfterPrompt, this test also times out.

    Both of these are deterministic — they trigger the failure path via vi.fn().mockRejectedValueOnce(...), not via timing.

  • Observed result after fix: Concurrent same-file prompt windows serialise instead of racing. Failed prompt windows (lock-timeout, takeover, release error) vacate the file-scoped queue so subsequent prompts on the same file proceed. The takeover fence remains in place as a defense-in-depth check against external mutation.

  • What was not tested: A live multi-agent gateway run that fires stuck-session recovery on an aliased session UUID. The unit tests cover the locking primitive directly through its public contract; the existing heartbeat-runner.skips-busy-session-lane.test.ts and diagnostic-stuck-session-recovery.runtime.test.ts continue to pass and exercise the surrounding flows. Verifying the end-to-end fix against the real 122-event/3-day takeover pattern from the issue would require running a patched gateway under that workload for the same window.

  • Before evidence: Without the file-scoped guard, two same-file controllers both call releaseForPrompt(), both stride into provider streams, and whichever reacquires second observes the other's writes and throws EmbeddedAttemptSessionTakeoverError. The chain falls through model-fallback → failover-error → "Embedded agent failed before reply: All models failed." The live log slices in #85913 (EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file, internal ref #83510) show 122 such events across 3 days on one local gateway, including the user-visible drop trace at 9dada894-1cb4-4610-a135-68d5ece67e51 on 2026-05-19 02:42:34.
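The two regression tests in the evidence above are described as injecting failure through vitest's vi.fn().mockRejectedValueOnce(...) instead of through timing. For readers without vitest at hand, the idea can be approximated with a dependency-free stub. `rejectOnce` is a hypothetical helper, not part of the PR or of vitest, and unlike a bare vitest mock it resolves with a caller-supplied value after the first call.

```typescript
// Hypothetical hand-rolled stand-in for a mock that rejects exactly once.
function rejectOnce<T>(
  error: Error,
  thenValue: T,
): { fn: () => Promise<T>; calls: () => number } {
  let calls = 0;
  return {
    fn: () => {
      calls += 1;
      // First call fails deterministically; later calls succeed.
      return calls === 1 ? Promise.reject(error) : Promise.resolve(thenValue);
    },
    calls: () => calls,
  };
}
```

Because the failure is tied to call order and not to wall-clock delays, a test built on a stub like this fails the same way on every run.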

Refreshed proof for current HEAD 2606e52d74 (post-rebase + dispose cleanup)

The PR was rebased onto current upstream/main (which includes #86427's retained-lock dispose() from 32ddfc22f5). The merge conflict in attempt.session-lock.ts was resolved by keeping the new activePromptSessionTurn.release() block at the top of dispose() and preserving the existing heldLock release that landed in main. Run against the rebased HEAD 2606e52d74:

Test suite — 80 cases across 2 files green:

 Test Files  2 passed (2)
      Tests  80 passed (80)

The two existing dispose tests from #86014 ("releases the eagerly-held attempt lock on dispose when cleanup is skipped" and "dispose does not double-release a lock already handed to cleanup") still pass after the rebase — the dispose extension is backward-compatible.

Dispose-after-releaseForPrompt proof harness — three scenarios driven against real EmbeddedAttemptSessionLockController instances on a shared temp session JSONL with production acquireSessionWriteLock. Verbatim stdout:

============================================================
PR #86067 — dispose-after-releaseForPrompt proof
============================================================

-------- Scenario A — happy path: releaseForPrompt -> reacquireAfterPrompt --------
  [   1.46ms] ctrlA       constructed, eager lock held
  [   2.38ms] ctrlA       releaseForPrompt — activePromptSessionTurn set
  [   2.71ms] ctrlA       reacquireAfterPrompt — turn cleared via finally
  [   2.95ms] ctrlA       acquireForCleanup -> release — OS lock fully released
  [   3.21ms] ctrlB       constructed on same session file
  [   3.55ms] ctrlB       releaseForPrompt: completed

-------- Scenario B — pre-fix bug shape: skip reacquire AND skip dispose --------
  [   0.27ms] ctrlA       constructed, eager lock held
  [   0.54ms] ctrlA       releaseForPrompt — activePromptSessionTurn set
  [   0.55ms] ctrlA       SKIP reacquireAfterPrompt AND SKIP dispose (pre-fix bug shape)
  [   0.77ms] ctrlB       constructed on same session file
  [ 202.09ms] ctrlB       releaseForPrompt: timeout

-------- Scenario C — post-fix: dispose() vacates the prompt turn --------
  [   0.98ms] ctrlA       constructed, eager lock held
  [   1.85ms] ctrlA       releaseForPrompt — activePromptSessionTurn set
  [   1.89ms] ctrlA       dispose() — prompt turn released, heldLock already absent
  [   1.90ms] ctrlA       dispose() again — idempotent, no-op
  [   2.56ms] ctrlB       constructed on same session file
  [   3.52ms] ctrlB       releaseForPrompt: completed

============================================================
  Scenario A (happy):              ctrlB releaseForPrompt = completed
  Scenario B (pre-fix leak shape): ctrlB releaseForPrompt = timeout
  Scenario C (post-fix dispose):   ctrlB releaseForPrompt = completed

  RESULT: PASS
============================================================

Scenario B is the bug-shape proof: ctrlA.releaseForPrompt() sets activePromptSessionTurn without subsequently releasing it via either reacquireAfterPrompt or dispose. ctrlB.releaseForPrompt() then blocks on the prior turn's tail, the 200ms test timeout fires, and the harness records timeout. Scenario C confirms that dispose()'s new activePromptSessionTurn.release() call at the top of the body fixes this — ctrlB's release completes in 3.52ms.
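The dispose() ordering that Scenario C exercises can be sketched as below. The field names `activePromptSessionTurn` and `heldLock` come from the PR text; the class, its constructor, and the `Releasable` type are hypothetical, and the real dispose() is likely asynchronous.

```typescript
// Hypothetical sketch: field names follow the PR text; the rest is illustrative.
type Releasable = { release: () => void };

class DisposeSketch {
  private activePromptSessionTurn: Releasable | undefined;
  private heldLock: Releasable | undefined;

  constructor(turn: Releasable | undefined, lock: Releasable | undefined) {
    this.activePromptSessionTurn = turn;
    this.heldLock = lock;
  }

  dispose(): void {
    // Vacate the prompt turn first so same-file waiters are never stranded.
    const turn = this.activePromptSessionTurn;
    this.activePromptSessionTurn = undefined;
    turn?.release();

    // Then release any OS lock still held; absent after releaseForPrompt().
    const lock = this.heldLock;
    this.heldLock = undefined;
    lock?.release();
  }
}
```

Clearing each field before calling `release()` is what makes a second dispose() a no-op, matching the "idempotent, no-op" line in the Scenario C log.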

Harness: /Users/umank/code/openclaw-tickets/proof/run-dispose-after-release-proof.mjs.

Risk / compatibility

  • Public Plugin SDK surface: unchanged. EmbeddedAttemptSessionLockController is internal to src/agents/pi-embedded-runner/run/ and not exposed through openclaw/plugin-sdk/*.
  • Backward compatibility: releaseForPrompt() keeps its prior signature; releaseForSessionIdleWait() is additive. The only production caller of releaseForPrompt is installPromptSubmissionLockRelease (attempt.session-lock.ts:1014), which already pairs it with reacquireAfterPrompt in a try/finally — that wrapper is what makes the file-scoped guard correct in production.
  • Single new module-level map (promptSessionFileTurnTails) sits alongside the existing ownedSessionFileWrites and trustedSessionFileStates singletons; same lifetime, same key scheme, same reset hook. No persistent state, no disk artefacts, no cross-process coordination.
  • Performance: the uncontended path is a single map lookup plus an early continue. The contended path adds one extra acquireLock + release pair on the second entrant, which yields the OS lock during the wait. This is necessary because the prior turn-holder must be able to reacquire after its provider call.
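The cost difference between the two paths in the Performance bullet can be sketched like this. `enterPromptWindow`, `OsLock`, and the `acquireLock` parameter are hypothetical names for illustration; the real logic lives inside performHeldLockReleaseForWindow() and does more work after this point.

```typescript
// Hypothetical sketch of the uncontended vs. contended entry paths.
type OsLock = { release: () => Promise<void> };

async function enterPromptWindow(
  held: OsLock,
  awaitPrior: Promise<void> | undefined,
  acquireLock: () => Promise<OsLock>,
): Promise<OsLock> {
  // Uncontended: no prior turn, so keep the lock already held.
  if (awaitPrior === undefined) return held;

  // Contended: yield the OS lock so the prior holder can reacquire it,
  // wait for that turn to vacate, then take a fresh lock.
  await held.release();
  await awaitPrior;
  return acquireLock();
}
```

If the second entrant kept its OS lock while waiting, the first entrant's reacquire could never succeed, so the extra release/acquire pair is the price of avoiding that deadlock.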

Security

  • New permissions/capabilities? No.
  • Secrets/tokens handling changed? No.
  • New/changed network calls? No.
  • Command/tool execution surface changed? No.
  • Data access scope changed? No. Touches in-process lock state and one new process-global map keyed by resolved session file paths.

This PR is AI-assisted. Code authored with Claude.

openclaw-barnacle Bot added labels May 24, 2026: agents (Agent runtime and tooling), size: M, proof: supplied (External PR includes structured after-fix real behavior proof).

clawsweeper Bot commented May 24, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed May 25, 2026, 8:17 PM ET / 00:17 UTC.

Summary
This PR adds a per-session-file prompt-window queue to the embedded-runner session-lock controller, separates post-prompt compaction idle release from prompt release, and extends session-lock regression coverage.

PR surface: Source +128, Tests +267. Total +395 across 3 files.

Reproducibility: yes. The PR body provides a concrete before/after harness for two real controllers on one session JSONL, and the current source path still matches the unguarded release/reacquire race.

Review metrics: 1 noteworthy metric.

  • Prompt-window coordination: 1 process-global per-file queue added. This is the main behavior change and the source of the availability tradeoff maintainers need to review before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Refresh the branch or merge result onto current main before landing.
  • Rerun the focused session-lock test sweep and same-file contention proof on the final head.

Risk before merge

  • The new per-file prompt-turn queue is in the embedded-runner hot path; if a provider stream or abort path never reaches reacquire or dispose, later same-session-file prompt windows can wait behind that turn for the process lifetime.
  • The PR merge ref was created against an older base while current main has since moved through the same runner files, so maintainers should require a current-main rebase or merge-result proof before landing.

Maintainer options:

  1. Refresh And Prove On Current Main (recommended)
    Rebase or regenerate the merge result on current main, then rerun the focused session-lock tests and same-file contention proof before merge.
  2. Accept The Serialized Prompt Tradeoff
    Maintainers can land the queue if they are comfortable relying on provider timeouts, abort propagation, and dispose cleanup to prevent same-file waiters from hanging.
  3. Pause For Race-Fix Ordering
    If the related same-lane race PRs change the same controller invariants, pause this PR until the final ownership shape is clear.

Next step before merge
No narrow automated repair remains; the next action is maintainer judgment on the hot-path availability tradeoff, current-main freshness, and landing order with related session-race PRs.

Security
Cleared: The diff changes embedded-runner coordination code and tests only; I found no dependency, workflow, credential, or supply-chain surface change.

Review details

Best possible solution:

Land the focused guard after maintainer acceptance of the availability tradeoff, current-main merge proof, and coordination with the separate same-lane session-race fixes.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body provides a concrete before/after harness for two real controllers on one session JSONL, and the current source path still matches the unguarded release/reacquire race.

Is this the best way to solve the issue?

Yes for the cross-lane same-file race. The patch serializes only real prompt windows by resolved session file and keeps compaction idle waits out of that queue; the separate same-lane race remains tracked by #86584.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against ea2496b00c5c.

Label changes

Label justifications:

  • P1: The PR targets a user-visible embedded-agent race that can drop replies in active agent or channel workflows.
  • merge-risk: 🚨 availability: The patch serializes same-session-file prompt windows through a process-global queue, so a dangling turn could block later work on that session file.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes after-fix terminal proof and before/after controller output for the changed runtime behavior, which is sufficient for this non-visual embedded-runner fix.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal proof and before/after controller output for the changed runtime behavior, which is sufficient for this non-visual embedded-runner fix.
Evidence reviewed

PR surface:

Source +128, Tests +267. Total +395 across 3 files.

PR surface stats:

Area       Files  Added  Removed  Net
Source     2      135    7        +128
Tests      1      267    0        +267
Docs       0      0      0        0
Config     0      0      0        0
Generated  0      0      0        0
Other      0      0      0        0
Total      3      402    7        +395

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts src/infra/heartbeat-runner.skips-busy-session-lane.test.ts src/agents/model-fallback.test.ts src/agents/failover-error.test.ts
  • pnpm check:changed or Testbox equivalent on the final merge head

Likely related people:

  • openperf: Authored the merged session-lock cleanup PR whose dispose and outer-finally release path this PR builds on. (role: recent adjacent contributor; confidence: high; commits: 32ddfc22f5e4; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts)
  • Michael Zelbel: Current-main blame for the embedded-attempt session-lock controller points to recent work touching the same controller surface. (role: recent area contributor; confidence: medium; commits: 9c79a0f8f417; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts)
  • Peter Steinberger: Git history shows prior release-era and refactor work touching the embedded-runner session-lock file and related cleanup seams. (role: prior area contributor; confidence: medium; commits: a374c3a5bfd5, bcbfb357bec7; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper Bot added labels May 24, 2026: rating: 🦪 silver shellfish (Thin PR readiness signal; proof, validation, or implementation needs work), status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask), P1 (High-priority user-facing bug, regression, or broken workflow), merge-risk: 🚨 availability (May cause crashes, hangs, restart loops, stalls, or process outages).

clawsweeper Bot commented May 24, 2026

Contributor

ClawSweeper PR egg

✨ Hatched: 💎 rare Tiny Crabkin

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 💎 rare.
Trait: sniffs out flaky tests.
Image traits: location merge queue dock; accessory commit compass; palette charcoal, cyan, and signal green; mood sleepy but ready; pose standing beside its cracked shell; shell smooth pearl shell; lighting subtle sparkle highlights; background gentle dashboard dots.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@clawsweeper

clawsweeper Bot commented May 24, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@ubehera

ubehera commented May 24, 2026

Contributor Author

@clawsweeper re-review

PR body updated with verbatim runtime output from the patched controller: Scenario A (same-file serialisation) and Scenario B (cleanup-on-failure when reacquireAfterPrompt throws SessionWriteLockTimeoutError), with microsecond-resolution timestamps and the lock-acquire/release event ledger.

@clawsweeper

clawsweeper Bot commented May 24, 2026

Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 24, 2026
@ubehera

ubehera commented May 24, 2026

Contributor Author

Added a before/after reproduction to the Evidence section. Same run-same-file-race-repro.mjs script, two real EmbeddedAttemptSessionLockController instances on a shared session JSONL with the production acquireSessionWriteLock (no test mocks), external mutation injected mid-prompt to model the aliased-UUID concurrent-lane scenario from the issue's live-repro logs.

  • Pre-patch (5be62e779b): 0 of 2 replies delivered, 2 takeover errors
  • Post-patch (65705d8c39): 1 of 2 replies delivered, 1 takeover error (the unavoidable lane-A case — external mutation during its prompt window; the fence correctly fires as the safety net)

The patch saves lane B by serialising its prompt window behind lane A's turn. This is visible in the timestamps: lane B's "prompt in flight" log line does not appear until ms 210, the same moment lane A's reacquire throws and vacates its turn.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 24, 2026
@ubehera

ubehera commented May 24, 2026

Contributor Author

@clawsweeper re-review

The PR body's Evidence section now leads with a pre-patch vs post-patch reproduction run — same script (run-same-file-race-repro.mjs), same scenario, real production acquireSessionWriteLock, two real EmbeddedAttemptSessionLockController instances on a shared session JSONL. Pre-patch (5be62e779b): 0/2 replies delivered, 2 takeover errors. Post-patch (65705d8c39): 1/2 replies delivered, 1 takeover error.

Lane A's takeover is the unavoidable case (external mutation during its prompt window — the fence correctly fires as the safety net). Lane B is the cascade casualty the patch saves by serialising B's release behind A's turn.

@clawsweeper

clawsweeper Bot commented May 24, 2026

Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@ubehera

ubehera commented May 24, 2026

Contributor Author

The merge-risk: 🚨 availability flag points to a real concern: a coordination primitive in the embedded-runner hot path can wedge sessions if the queue ever gains a dangling turn. Walking through every exit path that touches the turn, with the source lines that handle each:

1. Normal success path: releaseForPrompt() completes, the provider streams, and reacquireAfterPrompt() returns cleanly.

  • attempt.session-lock.ts:849 promotes the turn to controller ownership only after every step that can throw has passed.
  • attempt.session-lock.ts:911-912 (the finally block in reacquireAfterPrompt) clears activePromptSessionTurn on the success path. ✅

2. releaseForPrompt() throws before the controller takes ownership — e.g., acquireLock() rejects during the prior-wait reacquire, readSessionFileFingerprint() rejects, or lock.release() rejects at the end.

  • attempt.session-lock.ts:851-862 catch block: promptTurn.release() is called before rethrowing. The try/catch wraps every awaited step including the prior-wait dance, the fingerprint read, the fence-snapshot read, and the final OS lock release. ✅

3. reacquireAfterPrompt() no-op early return: takeoverDetected is already true (set by a prior lock-timeout) or heldLock is already present (defensive idempotency).

  • attempt.session-lock.ts:882-885 releases the turn before returning early — so even a no-op reacquire doesn't strand the queue. ✅

4. reacquireAfterPrompt() throws — lock-timeout (SessionWriteLockTimeoutError), takeover-fence trip (EmbeddedAttemptSessionTakeoverError), or any other exception.

  • attempt.session-lock.ts:905-911 catch block re-throws; the finally at line 912-914 still runs, releasing the turn regardless of what the catch did. ✅ (Regression-covered by vacates the prompt turn so later waiters proceed when reacquireAfterPrompt throws.)

5. Provider stream throws — the wrapped streamFn in installPromptSubmissionLockRelease (attempt.session-lock.ts:1014) puts reacquireAfterPrompt() in a finally, so the model error path still hits case 1 / case 4. ✅

6. Abort signal fires mid-prompt — same path as case 5; the abort propagates as a thrown error, the wrapping finally still calls reacquireAfterPrompt. ✅

7. Process exits mid-prompt — the entire promptSessionFileTurnTails Map dies with the process; no surviving waiter exists in another lane (all lanes share the same Map). ✅
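
The cases above share one invariant: a lane that takes a turn on a session file must vacate it on every exit. Below is a minimal sketch of such a file-scoped turn queue, assuming a per-path promise-tail chain. The names turnTails, acquirePromptTurn, and withPromptTurn are hypothetical and chosen for illustration; they are not the identifiers or the exact structure used in attempt.session-lock.ts.

```typescript
// Hypothetical sketch of a file-scoped turn queue; not the PR's actual code.
type PromptTurn = { release: () => void };

// Module-level map: resolved session file path -> tail of the waiter chain.
const turnTails = new Map<string, Promise<void>>();

async function acquirePromptTurn(sessionFile: string): Promise<PromptTurn> {
  const previousTail = turnTails.get(sessionFile) ?? Promise.resolve();

  // `vacate` resolves this lane's turn; the executor runs synchronously.
  let vacate!: () => void;
  const myTurn = new Promise<void>((resolve) => {
    vacate = resolve;
  });

  // Later entrants wait for every earlier turn plus this one.
  const myTail = previousTail.then(() => myTurn);
  turnTails.set(sessionFile, myTail);

  await previousTail; // blocks until earlier same-file turns are vacated

  let released = false;
  return {
    release: () => {
      if (released) return; // idempotent: safe to call from several exit paths
      released = true;
      vacate();
      // Drop the map entry only if no later waiter has replaced the tail.
      if (turnTails.get(sessionFile) === myTail) turnTails.delete(sessionFile);
    },
  };
}

// Runs `body` inside a turn and vacates the turn on success or failure.
async function withPromptTurn<T>(
  sessionFile: string,
  body: () => Promise<T>,
): Promise<T> {
  const turn = await acquirePromptTurn(sessionFile);
  try {
    return await body();
  } finally {
    turn.release();
  }
}
```

In this sketch a throwing body still releases the turn through the finally block, so a second caller on the same path proceeds once the first one fails; callers on different paths never wait on each other because each path has its own tail.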

The one residual scenario the patch cannot defend against is a provider stream that hangs indefinitely without the abort signal ever firing — reacquireAfterPrompt() would never get called, and the turn would dangle for the lifetime of the process. In practice this is bounded by:

  • The provider SDK's own request timeout (configured at the agent level)
  • The agent run abort controller (fires on outer timeouts; the wrapping try/finally still calls reacquireAfterPrompt)
  • maxHoldMs on the session write lock (separate from the prompt turn, but tied to the same configured timeout budget)

This isn't a new exposure: pre-patch, an indefinitely-hung streamFn also leaks the run (the OS lock is released but the run cleanup never runs, leaving session state in limbo). The patch changes the symptom — pre-patch a hung run silently loses one reply; post-patch a hung run blocks one specific same-file waiter behind it until the hang resolves. Both scenarios share the same root cause and the same mitigation (provider-level timeouts and abort propagation), neither of which this PR touches.
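
One way to picture the mitigation named above (an outer deadline plus abort propagation, so the wrapping finally always runs) is the following sketch. It is an illustration under stated assumptions, not code from this PR: streamWithDeadline, its parameters, and the error message are hypothetical, and afterPrompt merely stands in for a post-prompt step such as reacquireAfterPrompt().

```typescript
// Hypothetical sketch: bound a provider stream with a deadline so the
// post-prompt step always runs, even if the stream itself never settles.
async function streamWithDeadline<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  afterPrompt: () => Promise<void>,
): Promise<T> {
  const controller = new AbortController();

  // Rejects when the abort signal fires; never settles otherwise.
  const deadline = new Promise<never>((_resolve, reject) => {
    controller.signal.addEventListener(
      "abort",
      () => reject(new Error("prompt deadline exceeded")),
      { once: true },
    );
  });

  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Racing covers streams that ignore the signal entirely.
    return await Promise.race([run(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
    await afterPrompt(); // stand-in for the reacquire/turn-release step
  }
}
```

The race is a deliberate choice in this sketch: passing the signal alone only helps when the provider call honours it, whereas the race guarantees the finally block is reached once the deadline fires.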

The cleanup paths are exhaustively tested:

  • vacates the prompt turn so later waiters proceed when reacquireAfterPrompt throws (the SessionWriteLockTimeoutError path)
  • vacates the prompt turn when releaseForPrompt itself fails mid-flight (the held-lock-release-throws path)
  • serialises prompt windows on the same session file across two controllers (the happy path)
  • does not block prompt-window waiters behind a compaction-wait release on the same file (separation of releaseForSessionIdleWait)
  • does not serialise prompt windows across different session files (cross-file isolation)

And the same scenarios run end-to-end against real acquireSessionWriteLock via run-same-file-race-repro.mjs (before/after capture in the PR body).

@ubehera

ubehera commented May 25, 2026

Contributor Author

@clawsweeper re-review

Added commit 4254386508 with a post-stream waitForSessionEvents drain in installPromptSubmissionLockRelease's finally block. Reported by @kesslerio against AlphaClaw production. PR body's leading addendum honestly notes: this is symmetric with the pre-release drain (no-op in vanilla pi where _agentEventQueue isn't populated, load-bearing in forks that wire it). The architectural fix for the vanilla-openclaw post-fence listener-write race is filed separately as #86572.

Existing tests updated for the new event-order marker; one new test (drains prompt-emitted session events before reacquireAfterPrompt's fence check) pins down the ordering invariant. 76 cases green.

@clawsweeper

clawsweeper Bot commented May 25, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 25, 2026
@ubehera

ubehera commented May 25, 2026

Contributor Author

Force-pushed 9d031490b6 addressing the ClawSweeper P2 finding on 65705d8c39.

What the bot caught:

The rebase that removed the kesslerio drain commit from this branch also stripped unrelated infrastructure — EmbeddedAttemptSessionLockController.dispose(), its outer-finally call site in runEmbeddedAttempt via releaseRetainedSessionLock, AND two existing tests from main that exercise it (#86014). Without dispose() the eager session lock leaks on post-prompt error paths; with the new activePromptSessionTurn introduced by this PR, the gap is worse: teardown after releaseForPrompt without reacquireAfterPrompt leaves the file-scoped queue tail unresolved, and the next same-file controller hangs.

Restored:

  • dispose() on the controller type and implementation. Behavior: vacate activePromptSessionTurn first (NEW per bot guidance), then release heldLock if held (the original main behavior).
  • releaseRetainedSessionLock declaration + assignment + outer-finally invocation in runEmbeddedAttempt.
  • The two pre-existing dispose tests from main.

New regression (the bot-required dispose-after-releaseForPrompt):

```ts
it("dispose-after-releaseForPrompt vacates the prompt turn so the next same-file controller does not hang", async () => {
  await controllerA.releaseForPrompt();
  await controllerA.dispose();          // teardown without reacquireAfterPrompt
  await controllerA.dispose();          // idempotent

  // Controller B on same file must NOT hang on A's vacated turn.
  await expect(Promise.race([
    controllerB.releaseForPrompt().then(() => "released"),
    new Promise<string>(r => setTimeout(() => r("timeout"), 200)),
  ])).resolves.toBe("released");
});
```

80 tests green (was 74 before; +2 restored + 1 new). pnpm check:changed clean on the touched surface.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 25, 2026
@clawsweeper clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 25, 2026
ubehera added 2 commits May 25, 2026 15:16
…d races

Two embedded-runner lanes on the same OpenClaw process (typically a
heartbeat and a channel reply, or a fresh stuck-session recovery run
whose UUID aliases to an active session file) can both release their
session write lock for provider streaming, both stride into the model
call, and have the takeover fence fire mid-stream on whichever
reacquires second. The losing lane's reply is dropped; the user sees
'Embedded agent failed before reply: All models failed'.

Add a file-scoped prompt-window queue (module-level Map keyed by the
resolved session JSONL path, alongside the existing
ownedSessionFileWrites / trustedSessionFileStates singletons) that
serialises prompt windows on the same file. The second entrant yields
its OS lock, waits for the first lane's turn to vacate, reacquires,
then proceeds with its own release/prompt/reacquire cycle.

Critically, every exit path vacates the turn: the catch in
performHeldLockReleaseForWindow handles errors before the controller
takes ownership of the turn, and the finally in reacquireAfterPrompt
handles errors after. Lock-timeout, takeover-fence trip, mid-flight
release failure, or no-op early returns all leave the queue clean for
later same-file waiters. Without the cleanup path, one wedged prompt
would block every later prompt on that session file for the rest of
the process.

The post-prompt compaction-wait release now uses the new
releaseForSessionIdleWait(), which goes through the same lock-release
machinery but skips queue registration — compaction-wait isn't a
provider-prompt window and must not block other lanes' real prompts.

Tests: 5 new cases covering same-file serialisation, both
cleanup-on-failure paths (reacquire throws + release throws),
compaction-wait isolation, and cross-file isolation. Existing
multi-controller tests get a reset hook between the first and second
controllers' releaseForPrompt calls so their assertions hold against
the new queue without forcing them to invent reacquire calls they
don't need. afterEach now clears the queue for test isolation.

Acceptance criteria (74 + 232 = 306 tests across 7 files, the issue's
bot review named four of them):
  - attempt.session-lock.test.ts (74 tests, includes the new cases)
  - diagnostic-stuck-session-recovery.runtime.test.ts
  - heartbeat-runner.skips-busy-session-lane.test.ts
  - model-fallback.test.ts
  - failover-error.test.ts

Closes openclaw#85913.
…ardown

ClawSweeper review on the previous force-push found that the rebase that
dropped the kesslerio drain commit also removed unrelated infrastructure:
the EmbeddedAttemptSessionLockController.dispose() method, its outer-
finally call site in runEmbeddedAttempt, and two existing test cases on
main that exercise it. Without dispose() the eager session lock leaks on
post-prompt error paths (the original openclaw#86014 fix). The new
activePromptSessionTurn introduced by this PR makes the gap worse —
teardown after releaseForPrompt without going through reacquireAfterPrompt
leaves the file-scoped queue tail unresolved and the next same-file
controller waits forever (bot finding, confidence 0.9).

Restored:

- EmbeddedAttemptSessionLockController.dispose(): Promise<void> back on
  the type and implementation. Behavior: vacate activePromptSessionTurn
  first (new), then release heldLock if held (the original main behavior).
  Idempotent — both heldLock and activePromptSessionTurn are cleared
  after first call.

- releaseRetainedSessionLock infrastructure in runEmbeddedAttempt:
  declaration before the try block, assignment after the controller
  constructor, and the outer-finally invocation with the log.error
  fallback. Restored from main as it was before this branch's rebase.

- The two existing dispose tests from main (openclaw#86014):
  "releases the eagerly-held attempt lock on dispose when cleanup is
  skipped" and "dispose does not double-release a lock already handed to
  cleanup".

New regression for the prompt-turn cleanup invariant:

- "dispose-after-releaseForPrompt vacates the prompt turn so the next
  same-file controller does not hang". Verifies the case the bot
  flagged: controller A calls releaseForPrompt, then dispose without
  going through reacquireAfterPrompt. Controller B on the same session
  file must be able to releaseForPrompt without hanging. The test races
  releaseForPrompt against a 200 ms timeout and asserts the release
  resolves first.

80 cases across 2 test files green.

Closes openclaw#85913.
@ubehera ubehera force-pushed the fix/session-takeover-file-scoped-prompt-guard branch from 9d03149 to 2606e52 Compare May 25, 2026 21:23
@ubehera

ubehera commented May 25, 2026

Contributor Author

Force-pushed 2606e52d74 (rebased onto current upstream/main).

Two issues from the prior ClawSweeper review resolved:

  1. Dirty merge state (rebase): Upstream main now includes 32ddfc22f5 ("fix(agents): release embedded-attempt session lock on every exit path", #86427), which independently restored dispose(). Rebasing produced one conflict in attempt.session-lock.ts, where both sides defined dispose() for different invariants. Resolution: keep the new activePromptSessionTurn.release() block at the top (this PR's prompt-turn cleanup) and preserve the existing heldLock release that landed in main. Both dispose invariants now coexist in one body.

  2. Fresh proof for current HEAD: PR body's "Real behavior proof" section now includes a "Refreshed proof for current HEAD 2606e52d74" subsection with verbatim stdout from the new run-dispose-after-release-proof.mjs harness. Three scenarios:

    • Happy: releaseForPrompt -> reacquireAfterPrompt -> acquireForCleanup. ctrlB.releaseForPrompt = completed.
    • Pre-fix bug shape: Skip both reacquireAfterPrompt AND dispose. ctrlB.releaseForPrompt = timeout (200ms).
    • Post-fix dispose: releaseForPrompt -> dispose() (twice, idempotent). ctrlB.releaseForPrompt = completed.

The two pre-existing #86014 dispose tests from main still pass after the rebase — the dispose extension is backward-compatible.
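
As a rough illustration of how the two dispose invariants can coexist in one body, here is a minimal sketch. SketchController and Releasable are hypothetical stand-ins, not the real EmbeddedAttemptSessionLockController; only the field names activePromptSessionTurn and heldLock and the ordering (vacate the turn first, then release the held lock) come from the description above. The try/finally around the turn release is an extra assumption of this sketch.

```typescript
// Hypothetical stand-in types; the real controller has a richer surface.
interface Releasable {
  release(): Promise<void> | void;
}

class SketchController {
  private activePromptSessionTurn: Releasable | undefined;
  private heldLock: Releasable | undefined;

  constructor(turn?: Releasable, lock?: Releasable) {
    this.activePromptSessionTurn = turn;
    this.heldLock = lock;
  }

  async dispose(): Promise<void> {
    // Clear both fields up front so a second dispose() is a no-op.
    const turn = this.activePromptSessionTurn;
    const lock = this.heldLock;
    this.activePromptSessionTurn = undefined;
    this.heldLock = undefined;

    try {
      await turn?.release(); // vacate the file-scoped prompt turn first
    } finally {
      await lock?.release(); // then release the OS-level lock if still held
    }
  }
}
```

Clearing the fields before awaiting anything is what makes the repeated dispose() call harmless in this sketch: the second call finds nothing left to release.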

Diff stats post-rebase: Source +140 / Tests +267 across 3 files (was +345 / +306 pre-rebase; much of the apparent "delta" was upstream code that landed via #86427).

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 25, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. labels May 25, 2026
@joshavant

Contributor

Thanks @ubehera for putting together this focused file-scoped prompt-window guard PR and for iterating through the cleanup/availability edge cases. This PR helped validate the right fix shape: the bug was not the takeover fence itself, but allowing multiple embedded prompt owners to enter the same physical transcript file window.

We ended up landing the fix through #87159 in commit 3349fe2. That merged implementation keeps the same core invariant from this PR, but broadens it to the rest of the observed failure surface: canonical session-file ownership, active-run indexing by session file, reply/recovery sibling lookup, and production diagnostic heartbeat recovery carrying sessionFile.

The merged PR also includes focused regressions and AWS Crabbox proof for the same-file sibling recovery and heartbeat path. Since #87159 is now merged, I’m closing this PR as superseded by the landed fix, with thanks for the original implementation and review work here.


Labels

agents Agent runtime and tooling merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510)

2 participants