Fix embedded session file ownership race#87159
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 2:12 AM ET / 06:12 UTC. Summary PR surface: Source +317, Tests +224. Total +541 across 19 files. Reproducibility: yes. Source inspection and the linked issue give a high-confidence path: two same-process embedded lanes resolve to the same transcript while the session lock is released for a provider prompt, then reacquire sees a changed fence; I did not rerun the live scenario in this read-only review. Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land this as the canonical fix after maintainers accept the session-file ownership semantics, cleanup ordering, and final CI for the protected session-state and availability surface. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection and the linked issue give a high-confidence path: two same-process embedded lanes resolve to the same transcript while the session lock is released for a provider prompt, then reacquire sees a changed fence; I did not rerun the live scenario in this read-only review. Is this the best way to solve the issue? Yes. Serializing by canonical session file and teaching reply/recovery lookups about same-file active work is a maintainable fix for the reported race while keeping the existing takeover detection as a safety net. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 9c2a6a8df519. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +317, Tests +224. Total +541 across 19 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Moonlit Branchling Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
d8a1f2c to
2e249a1
Compare
bbbb74e to
151dfa0
Compare
151dfa0 to
5f17e82
Compare
* fix: serialize embedded session file attempts * test: update reply runtime mock for session file lookup * fix: thread session files into diagnostic recovery * fix: attach causes to session owner abort errors
* fix: serialize embedded session file attempts * test: update reply runtime mock for session file lookup * fix: thread session files into diagnostic recovery * fix: attach causes to session owner abort errors
* fix: serialize embedded session file attempts * test: update reply runtime mock for session file lookup * fix: thread session files into diagnostic recovery * fix: attach causes to session owner abort errors
Summary
Fixes #85913.
Verification
.agents/skills/autoreview/scripts/autoreview --mode local- clean, no accepted/actionable findingsnode scripts/run-vitest.mjs src/logging/diagnostic.test.ts src/logging/diagnostic-stuck-session-recovery.runtime.test.ts src/logging/diagnostic-stuck-session-recovery.integration.test.ts src/agents/pi-embedded-runner/runs.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/auto-reply/reply/get-reply-run.media-only.test.ts- 8 files, 275 tests passedrun_3a67455a592e, provideraws, leasecbx_8c50f1cc49d4, typec7a.8xlarge: source-level heartbeat proof passed and emittedOPENCLAW_85913_HEARTBEAT_FIX_PROOFReal behavior proof
Behavior addressed: Same-process embedded runs that resolve to the same physical transcript file are serialized by session-file owner, and sibling/fallback-key recovery now observes active same-file runs from both direct recovery and production diagnostic heartbeat recovery paths.
Real environment tested: AWS Crabbox Linux, provider
aws, leasecbx_8c50f1cc49d4, runrun_3a67455a592e, Nodev24.16.0, synced branch patch.Exact steps or command run after this patch: Source-level proof registered an active embedded run with a canonical session file, verified diagnostic state retained that file, verified sibling-key stuck recovery skipped because that same session file had an active embedded run, then aged a processing diagnostic state and let the real heartbeat interval request recovery.
Evidence after fix: Crabbox proof emitted
OPENCLAW_85913_HEARTBEAT_FIX_PROOF {"status":"fixed","recordedFile":"/tmp/openclaw-85913-heartbeat-proof.jsonl","siblingRecovery":{"status":"skipped","reason":"active_embedded_run","activeSessionId":"openclaw-85913-active-session"},"heartbeatRequest":{"sessionId":"openclaw-85913-heartbeat-session","sessionKey":"agent:main:openclaw-85913-heartbeat","sessionFile":"/tmp/openclaw-85913-heartbeat-proof.jsonl"}}.Observed result after fix: The active run's transcript path was retained in diagnostic state, same-file sibling recovery returned
status=skipped,reason=active_embedded_run, and the production heartbeat recovery request carried the samesessionFileinto recovery.What was not tested: A real Telegram bot roundtrip was not rerun for this update because the fix is in the shared embedded runner/session ownership and diagnostic recovery paths, and the AWS proof exercises the reported same-file race plus the highest-risk heartbeat sibling recovery path without live channel credentials.