Skip to content

Fix embedded session file ownership race#87159

Merged
joshavant merged 4 commits into
mainfrom
fix/85913-session-file-owner
May 27, 2026
Merged

Fix embedded session file ownership race#87159
joshavant merged 4 commits into
mainfrom
fix/85913-session-file-owner

Conversation

@joshavant

@joshavant joshavant commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • serialize embedded attempts by canonical session file so sibling session keys cannot open overlapping prompt windows on the same transcript
  • add session-file active-run indexing for reply admission, interrupt handling, stuck-session recovery, and production diagnostic heartbeat recovery requests
  • drain queued transcript events before reacquiring the prompt lock and add regressions for owner, symlink, registry, heartbeat, and recovery paths

Fixes #85913.

Verification

  • .agents/skills/autoreview/scripts/autoreview --mode local - clean, no accepted/actionable findings
  • node scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts src/logging/diagnostic-stuck-session-recovery.integration.test.ts src/agents/pi-embedded-runner/runs.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/auto-reply/reply/get-reply-run.media-only.test.ts - 8 files, 275 tests passed
  • AWS Crabbox run_3a67455a592e, provider aws, lease cbx_8c50f1cc49d4, type c7a.8xlarge: source-level heartbeat proof passed and emitted OPENCLAW_85913_HEARTBEAT_FIX_PROOF

Real behavior proof

Behavior addressed: Same-process embedded runs that resolve to the same physical transcript file are serialized by session-file owner, and sibling/fallback-key recovery now observes active same-file runs from both direct recovery and production diagnostic heartbeat recovery paths.

Real environment tested: AWS Crabbox Linux, provider aws, lease cbx_8c50f1cc49d4, run run_3a67455a592e, Node v24.16.0, synced branch patch.

Exact steps or command run after this patch: Source-level proof registered an active embedded run with a canonical session file, verified diagnostic state retained that file, verified sibling-key stuck recovery skipped because that same session file had an active embedded run, then aged a processing diagnostic state and let the real heartbeat interval request recovery.

Evidence after fix: Crabbox proof emitted OPENCLAW_85913_HEARTBEAT_FIX_PROOF {"status":"fixed","recordedFile":"/tmp/openclaw-85913-heartbeat-proof.jsonl","siblingRecovery":{"status":"skipped","reason":"active_embedded_run","activeSessionId":"openclaw-85913-active-session"},"heartbeatRequest":{"sessionId":"openclaw-85913-heartbeat-session","sessionKey":"agent:main:openclaw-85913-heartbeat","sessionFile":"/tmp/openclaw-85913-heartbeat-proof.jsonl"}}.

Observed result after fix: The active run's transcript path was retained in diagnostic state, same-file sibling recovery returned status=skipped, reason=active_embedded_run, and the production heartbeat recovery request carried the same sessionFile into recovery.

What was not tested: A real Telegram bot roundtrip was not rerun for this update because the fix is in the shared embedded runner/session ownership and diagnostic recovery paths, and the AWS proof exercises the reported same-file race plus the highest-risk heartbeat sibling recovery path without live channel credentials.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M maintainer Maintainer-authored PR labels May 27, 2026
@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 2:12 AM ET / 06:12 UTC.

Summary
The branch serializes embedded attempts by canonical session file, indexes active embedded runs by session file for reply and recovery paths, threads session files through diagnostic heartbeat recovery, and adds targeted regressions.

PR surface: Source +317, Tests +224. Total +541 across 19 files.

Reproducibility: yes. Source inspection and the linked issue give a high-confidence path: two same-process embedded lanes resolve to the same transcript while the session lock is released for a provider prompt, then reacquire sees a changed fence; I did not rerun the live scenario in this read-only review.

Review metrics: 2 noteworthy metrics.

  • Session-file state indexes: 1 active-run file index added, 1 diagnostic sessionFile field added. These are the stateful pieces that make the fix work and need lifecycle review beyond ordinary unit coverage.
  • Embedded runtime lookup exports: 3 export surfaces updated. The new session-file active-run lookup is exposed through embedded runtime barrels, so maintainers should confirm this internal runtime contract before merge.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • This PR intentionally changes embedded run coordination for shared transcript files; maintainer review should focus on owner acquire/release ordering, abort propagation, and file-key lifecycle.
  • Stuck-session recovery now trusts session-file active-run indexing before resetting sibling lanes; if that index becomes stale, recovery could leave a genuinely stuck lane active longer than before.
  • The provided GitHub context reports mergeable but unstable status, so the final head still needs green required checks before merge.

Maintainer options:

  1. Accept after focused maintainer review (recommended)
    Review the owner acquire/release paths, abort handling, and diagnostic recovery fallback ordering, then merge once the final head has green required checks.
  2. Request extra cleanup proof
    Ask for one more focused proof around owner timeout or abort cleanup if maintainers are not comfortable with the current owner lifecycle coverage.
  3. Pause if recovery semantics are disputed
    Hold the PR if maintainers want a different policy for whether heartbeat recovery should wait, skip, or force a fresh session when same-file active work is detected.

Next step before merge
The protected maintainer label and session-state/availability merge risk make this a human maintainer review item, not an automated repair candidate.

Security
Cleared: The diff does not add dependencies, CI, package scripts, credential handling, or external code execution paths; no concrete security or supply-chain concern was found.

Review details

Best possible solution:

Land this as the canonical fix after maintainers accept the session-file ownership semantics, cleanup ordering, and final CI for the protected session-state and availability surface.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection and the linked issue give a high-confidence path: two same-process embedded lanes resolve to the same transcript while the session lock is released for a provider prompt, then reacquire sees a changed fence; I did not rerun the live scenario in this read-only review.

Is this the best way to solve the issue?

Yes. Serializing by canonical session file and teaching reply/recovery lookups about same-file active work is a maintainable fix for the reported race while keeping the existing takeover detection as a safety net.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9c2a6a8df519.

Label changes

Label changes:

  • add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.

Label justifications:

  • P1: The linked issue reports user-visible dropped replies from an embedded runner coordination race, and this PR targets that active agent/channel workflow regression.
  • merge-risk: 🚨 session-state: The diff adds canonical session-file ownership and active-run indexing, so stale or wrong ownership state could affect reply admission, recovery, or interrupts.
  • merge-risk: 🚨 availability: The owner wait/timeout path and diagnostic recovery skip behavior can affect whether embedded runs stall, recover, or abort under contention.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix AWS Crabbox live output with a concrete heartbeat proof payload showing same-file recovery skips active embedded work and preserves sessionFile data.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix AWS Crabbox live output with a concrete heartbeat proof payload showing same-file recovery skips active embedded work and preserves sessionFile data.
Evidence reviewed

PR surface:

Source +317, Tests +224. Total +541 across 19 files.

View PR surface stats
Area Files Added Removed Net
Source 13 331 14 +317
Tests 6 229 5 +224
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 19 560 19 +541

What I checked:

  • Repository policy read: The root AGENTS.md and scoped agent-runner guides were read and applied to the protected-label, session-state, proof, and merge-risk handling for this review. (AGENTS.md:1, 9c2a6a8df519)
  • Protected maintainer handling: The provided GitHub context shows the protected maintainer label on this PR, so the cleanup workflow should keep it open for explicit maintainer handling.
  • Session-file owner implementation: The PR acquires an embedded attempt session-file owner before creating the attempt session lock, using the session write-lock max-hold window and attempt abort signal. (src/agents/pi-embedded-runner/run/attempt.ts:2286, fdbd854cba4b)
  • Owner wait lifecycle: The owner registry waits by canonical session-file key, cleans up waiters on resolve, reject, timeout, or abort, and releases all waiters when the current owner releases. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:458, fdbd854cba4b)
  • Canonical file key helper: The new helper realpaths existing transcript files and canonicalizes the parent directory when the transcript does not exist yet, matching the symlink aliasing case in the tests. (src/agents/pi-embedded-runner/session-file-key.ts:4, fdbd854cba4b)
  • Active-run lookup by file: The PR adds a session-file index for active embedded runs and clears stale file entries by session id before setting or clearing file ownership. (src/agents/pi-embedded-runner/runs.ts:129, fdbd854cba4b)

Likely related people:

  • vincentkoc: Current-main blame for the session-lock wrapper, active-run registry, failover classification, and diagnostic recovery paths points to commit 6d5c15a across the central files affected by this PR. (role: recent area contributor; confidence: high; commits: 6d5c15a744ca; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/runs.ts, src/agents/failover-error.ts)
  • steipete: Shortlog and current-main history show repeated recent work around sessions, gateway/session lifecycle, and adjacent diagnostic/runtime surfaces, including the latest main commit in this checkout. (role: recent adjacent contributor; confidence: medium; commits: 9c2a6a8df519, a374c3a5bfd5; files: src/auto-reply/reply/get-reply.ts, src/config/sessions/store.ts, src/gateway/session-lifecycle-state.ts)
  • ubehera: The linked issue author supplied production traces, lane histograms, and a proposed fix shape for the same transcript-file race, making them useful for validating whether the fix covers the observed deployment pattern. (role: reporter with production trace context; confidence: medium; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/logging/diagnostic-stuck-session-recovery.runtime.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 27, 2026
@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Moonlit Branchling

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: sniffs out flaky tests.
Image traits: location merge queue dock; accessory review stamp; palette amber, ink, and glacier blue; mood focused; pose stepping out of a freshly hatched shell; shell soft velvet shell; lighting bright celebratory glints; background subtle branch markers.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Moonlit Branchling in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@clawsweeper clawsweeper Bot added P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. labels May 27, 2026
@joshavant joshavant force-pushed the fix/85913-session-file-owner branch from d8a1f2c to 2e249a1 Compare May 27, 2026 05:11
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 27, 2026
@joshavant joshavant force-pushed the fix/85913-session-file-owner branch 2 times, most recently from bbbb74e to 151dfa0 Compare May 27, 2026 05:49
@joshavant joshavant force-pushed the fix/85913-session-file-owner branch from 151dfa0 to 5f17e82 Compare May 27, 2026 05:53
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 27, 2026
@clawsweeper clawsweeper Bot added merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. and removed merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. labels May 27, 2026
@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 27, 2026
@joshavant joshavant merged commit 3349fe2 into main May 27, 2026
108 of 110 checks passed
@joshavant joshavant deleted the fix/85913-session-file-owner branch May 27, 2026 06:18
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 28, 2026
* fix: serialize embedded session file attempts

* test: update reply runtime mock for session file lookup

* fix: thread session files into diagnostic recovery

* fix: attach causes to session owner abort errors
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
* fix: serialize embedded session file attempts

* test: update reply runtime mock for session file lookup

* fix: thread session files into diagnostic recovery

* fix: attach causes to session owner abort errors
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
* fix: serialize embedded session file attempts

* test: update reply runtime mock for session file lookup

* fix: thread session files into diagnostic recovery

* fix: attach causes to session owner abort errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: L status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510)

1 participant