[Fix] Deliver restart recovery replies#86089
Conversation
|
Codex review: needs changes before merge. Reviewed May 30, 2026, 12:37 PM ET / 16:37 UTC. Summary PR surface: Source +434, Tests +670. Total +1104 across 14 files. Reproducibility: yes. Source inspection shows current main resumes restart-aborted sessions with Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Copy recommended automerge instructionNext step before merge
Security Review findings
Review detailsBest possible solution: Repair the post-run guards to use the effective current session id, add focused compaction-rotation coverage, then land the restart-recovery behavior with the existing QA Gateway proof. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main resumes restart-aborted sessions with Is this the best way to solve the issue? No, not yet. The overall fix direction is appropriate, but the latest patch still uses the original session id in post-run persistence guards after the code has already computed an effective rotated session id. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 5ba3505fedd9. Label changesLabel justifications:
Evidence reviewedPR surface: Source +434, Tests +670. Total +1104 across 14 files. View PR surface stats
Acceptance criteria:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc4c034418
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Pearl Proofling Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
@codex review |
|
Context from pruning stale PRs: #81202 attempted stale reply-turn recovery, but its implementation depended on |
This comment was marked as spam.
This comment was marked as spam.
f341a81 to
7a01cb2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a01cb2923
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sessionKey, | ||
| storePath, | ||
| entry: next, | ||
| shouldPersist: (current) => shouldPersistCurrentRunSessionCleanup(current, sessionId), |
There was a problem hiding this comment.
Keep rotated runs eligible for delivery persistence
When this same run rotates the backing session during compaction, updateSessionStoreAfterAgentRun has already rewritten the store entry to effectiveSessionId, but this guard still compares against the original sessionId. In that normal rotated-run path the pending-final write is rejected, so a visible final payload is no longer made durable before delivery; the cleanup guard below has the same original-id check, leaving the old restartRecoveryDeliveryRunId behind and preventing later runs from claiming a fresh recovery route. Compare against the effective/current session id for this run instead of the pre-rotation id.
Useful? React with 👍 / 👎.
9dc5646 to
216bf79
Compare
|
Verification for head Behavior addressed: restart recovery now keeps enough delivery context to send visible recovered final replies back to the original channel instead of transcript-only Real environment tested: CI on the PR head, plus local focused Node/Vitest/type checks in this checkout. Exact steps or command run after this patch:
Evidence after fix: local Observed result after fix: completed lanes have no failures on the rebased head. What was not tested: no fresh live Discord/Telegram restart roundtrip was run locally for this maintainer fixup. |
Summary
Motivation
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
Behavior addressed: Gateway restart recovery produces visible feedback for interrupted chat-bound main sessions after a Gateway restart.
Real environment tested: Local Darwin 25.4.0 arm64, Node v24.14.0, real OpenClaw QA Gateway child over WebSocket/RPC, bundled qa-channel plugin, qa-lab synthetic bus, mock-openai provider.
Exact steps or command run after this patch:
node scripts/run-node.mjs qa suite --provider-mode mock-openai --scenario gateway-restart-inflight-run --concurrency 1 --output-dir .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-finalEvidence after fix: QA suite report
.artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final/qa-suite-report.mdincludes this copied terminal/report output from the real Gateway/qa-channel run:Observed result after fix: The in-flight run settled across Gateway restart without duplicate interrupted delivery (
interruptedMarkers=0), Gateway and qa-channel became healthy again, and the follow-up recovery marker was delivered exactly once to the qa-channel transcript.What was not tested: Live Discord transport/guild delivery and live model provider; this proof used qa-channel for the real Gateway/plugin transport path and mock-openai for deterministic model behavior.
Before evidence (optional but encouraged): Not collected from a real environment before the fix.
Root Cause (if applicable)
Regression Test Plan (if applicable)
src/agents/main-session-restart-recovery.test.ts,src/agents/agent-command.live-model-switch.test.ts,src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts,src/config/sessions/sessions.test.tsUser-visible / Behavior Changes
Interrupted chat-bound main sessions can now produce a visible best-effort recovery reply or failure notice after Gateway restart.
Diagram (if applicable)
Security Impact (required)
Yes/No): NoYes/No): NoYes/No): YesYes/No): NoYes/No): NoYes, explain risk + mitigation: recovery can now call existing Gateway delivery paths after restart using persisted route context; mitigation is deliverable-channel validation, existing send policy checks, and best-effort delivery.Repro + Verification
Environment
Steps
node scripts/run-node.mjs qa suite --provider-mode mock-openai --scenario gateway-restart-inflight-run --concurrency 1 --output-dir .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final..artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final/qa-suite-report.mdandqa-suite-summary.json.Expected
Actual
runId=3019924e-f9c4-4e3f-b970-c9a3d7e83207 interruptedStatus=ok interruptedMarkers=0followed byRESTART-RECOVERY-OK.Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
Review Conversations
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoRisks and Mitigations