[Fix] Deliver restart recovery replies #86089

Merged
steipete merged 4 commits into openclaw:main from samzong:fix/gateway-restart-visible-recovery
May 30, 2026

Conversation

@samzong

@samzong samzong commented May 24, 2026

Contributor

Summary

  • Problem: gateway restarts could leave in-flight Discord/chat turns silent when the previous Gateway process died mid-run.
  • Solution: persist restart delivery context for direct agent and auto-reply runs, then use it when recovery resumes or emits a visible failure notice.
  • What changed: restart recovery now delivers best-effort resumed replies through the captured route, preserves pending final delivery context, normalizes new session-store fields, and keeps legacy route fallback only for unresumable notices.
  • What did NOT change (scope boundary): no new channel defaults, no plugin-specific policy in core, and no live transport implementation changes.

Motivation

  • Users should see either the recovered reply or a clear recovery failure notice after a Gateway restart instead of a silent aborted turn.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

Behavior addressed: restart recovery now produces visible feedback for chat-bound main sessions interrupted by a Gateway restart.
Real environment tested: Local Darwin 25.4.0 arm64, Node v24.14.0, real OpenClaw QA Gateway child over WebSocket/RPC, bundled qa-channel plugin, qa-lab synthetic bus, mock-openai provider.
Exact steps or command run after this patch: node scripts/run-node.mjs qa suite --provider-mode mock-openai --scenario gateway-restart-inflight-run --concurrency 1 --output-dir .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final
Evidence after fix: QA suite report .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final/qa-suite-report.md includes this copied terminal/report output from the real Gateway/qa-channel run:

OpenClaw QA Scenario Suite
Passed: 1
Failed: 0

Gateway restart in-flight recovery
Status: pass
runId=3019924e-f9c4-4e3f-b970-c9a3d7e83207 interruptedStatus=ok interruptedMarkers=0
RESTART-RECOVERY-OK
Notes: qa-channel + qa-lab bus + real gateway child + mock-openai provider.

Observed result after fix: The in-flight run settled across Gateway restart without duplicate interrupted delivery (interruptedMarkers=0), Gateway and qa-channel became healthy again, and the follow-up recovery marker was delivered exactly once to the qa-channel transcript.
What was not tested: Live Discord transport/guild delivery and live model provider; this proof used qa-channel for the real Gateway/plugin transport path and mock-openai for deterministic model behavior.
Before evidence (optional but encouraged): Not collected from a real environment before the fix.

Root Cause (if applicable)

  • Root cause: restart recovery tracked run/session state but did not carry the outbound delivery route needed for a replacement Gateway process to send visible recovery output.
  • Missing detection / guardrail: restart recovery tests did not assert the resumed delivery context or the visible failure notice route.
  • Contributing context (if known): direct agent runs and auto-reply runs persisted delivery metadata differently, so recovery needed one normalized context path plus a scoped legacy fallback for failure notices.
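
A minimal sketch of the root cause described above, assuming hypothetical type and field names (`DeliveryContext`, `abortedByRestart`, `restartRecoveryDeliveryContext`, and `planRecovery` are illustrative and are not taken from the diff). Without a persisted route the replacement process can only resume with `deliver: false`; with one, it can resume and deliver visibly.

```typescript
// Hypothetical shapes; the PR's real types are not shown in this thread.
interface DeliveryContext {
  channel: string;
  to: string;
}

interface SessionEntry {
  sessionId: string;
  abortedByRestart?: boolean; // assumed marker for a restart-aborted run
  restartRecoveryDeliveryContext?: DeliveryContext; // assumed field name
}

type RecoveryAction =
  | { kind: "resume"; deliver: true; route: DeliveryContext }
  | { kind: "resume"; deliver: false }
  | { kind: "none" };

// Decide how a replacement Gateway process should treat one session entry.
function planRecovery(entry: SessionEntry): RecoveryAction {
  if (!entry.abortedByRestart) {
    return { kind: "none" };
  }
  const route = entry.restartRecoveryDeliveryContext;
  // No persisted route: the old behavior, a transcript-only resume.
  return route
    ? { kind: "resume", deliver: true, route }
    : { kind: "resume", deliver: false };
}
```

The point of the sketch is only that the decision depends on state that survives the old process, so the route has to be written before the restart rather than reconstructed after it.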

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/main-session-restart-recovery.test.ts, src/agents/agent-command.live-model-switch.test.ts, src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts, src/config/sessions/sessions.test.ts
  • Scenario the test should lock in: interrupted main sessions resume with captured route context, pending final delivery prefers pending context, unresumable notices can use legacy session route, and store loading normalizes recovery fields.
  • Why this is the smallest reliable guardrail: it exercises persisted session recovery decisions without needing a live channel transport.
  • Existing test that already covers this (if any): N/A before this PR.
  • If no new test is added, why not: N/A - tests were added/updated.
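
As an illustration of the "store loading normalizes recovery fields" scenario, here is a hedged sketch of what such normalization could look like. The `DeliveryContext` shape and `normalizeDeliveryContext` are assumptions made for the example; the actual normalizer in the session-store code may differ.

```typescript
// Hypothetical persisted shape for a recovery route.
interface DeliveryContext {
  channel: string;
  to: string;
}

// Accept unknown JSON loaded from disk and return a clean route, or
// undefined when the stored value is missing or malformed.
function normalizeDeliveryContext(raw: unknown): DeliveryContext | undefined {
  if (typeof raw !== "object" || raw === null) {
    return undefined;
  }
  const { channel, to } = raw as Record<string, unknown>;
  if (typeof channel !== "string" || typeof to !== "string") {
    return undefined;
  }
  const trimmedChannel = channel.trim();
  const trimmedTo = to.trim();
  // Empty strings are treated the same as a missing route.
  return trimmedChannel && trimmedTo
    ? { channel: trimmedChannel, to: trimmedTo }
    : undefined;
}
```

Dropping malformed values at load time means recovery code only ever sees either a usable route or no route, which keeps the later routing decision simple.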

User-visible / Behavior Changes

Interrupted chat-bound main sessions can now produce a visible best-effort recovery reply or failure notice after Gateway restart.

Diagram (if applicable)

Before:
[gateway restart] -> [running session marked aborted] -> [recovery resumes deliver=false] -> [chat may stay silent]

After:
[gateway restart] -> [delivery context persisted] -> [recovery resumes or sends notice] -> [best-effort visible chat feedback]

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: recovery can now call existing Gateway delivery paths after restart using persisted route context; mitigation is deliverable-channel validation, existing send policy checks, and best-effort delivery.

Repro + Verification

Environment

  • OS: Darwin 25.4.0 arm64
  • Runtime/container: Node v24.14.0 local Codex worktree
  • Model/provider: mock-openai
  • Integration/channel (if any): real Gateway child over WebSocket/RPC + bundled qa-channel plugin + qa-lab bus; no live Discord
  • Relevant config (redacted): QA suite generated isolated config/state

Steps

  1. Run node scripts/run-node.mjs qa suite --provider-mode mock-openai --scenario gateway-restart-inflight-run --concurrency 1 --output-dir .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final.
  2. The scenario starts an agent run, applies a restart-required config change, waits for Gateway + qa-channel readiness after restart, then sends a same-session recovery follow-up.
  3. Inspect .artifacts/qa-e2e/gateway-restart-visible-recovery-ci-fix-final/qa-suite-report.md and qa-suite-summary.json.

Expected

  • Gateway returns healthy after restart, qa-channel returns ready, interrupted output is not duplicated, and the same session delivers the recovery marker exactly once.

Actual

  • QA scenario passed. Report details: runId=3019924e-f9c4-4e3f-b970-c9a3d7e83207 interruptedStatus=ok interruptedMarkers=0 followed by RESTART-RECOVERY-OK.

Evidence

  • Real Gateway/qa-channel trace or runtime report
  • Supplemental focused test output
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: real QA Gateway restart in-flight recovery over qa-channel; focused restart recovery tests; local review gates.
  • Edge cases checked: Gateway restart readiness, qa-channel readiness, no duplicate interrupted delivery, exactly-once recovery marker delivery, pending final delivery context precedence, legacy route notice fallback, send-policy denial, and preserving no-historical-route behavior for resumable recovery.
  • What you did not verify: live Gateway restart against real Discord or a live model provider.

Review Conversations

  • N/A - no GitHub review conversations yet.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: persisted route context could be stale.
    • Mitigation: resumable delivery uses only pending/restart recovery context, while historical route fallback is limited to unresumable failure notices and still passes send policy checks.
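
The mitigation above can be sketched as a route-selection function. This is an illustration under stated assumptions: `RecoveryState`, its three field names, `SendPolicy`, and `selectRoute` are hypothetical, and the real send policy check is an existing Gateway path that is only stubbed here as a predicate.

```typescript
interface Route {
  channel: string;
  to: string;
}

// Hypothetical view of the route candidates stored on a session entry.
interface RecoveryState {
  pendingFinalDelivery?: Route; // route captured for a pending final reply
  restartRecovery?: Route; // route captured when the run started
  legacyRoute?: Route; // historical session route
}

// Stand-in for the existing send policy check.
type SendPolicy = (route: Route) => boolean;

function selectRoute(
  state: RecoveryState,
  resumable: boolean,
  allow: SendPolicy,
): Route | undefined {
  // Pending context wins over the restart-recovery context.
  const captured = state.pendingFinalDelivery ?? state.restartRecovery;
  // Only unresumable failure notices may fall back to the historical route.
  const candidate = resumable ? captured : (captured ?? state.legacyRoute);
  return candidate !== undefined && allow(candidate) ? candidate : undefined;
}
```

A resumable session with only a historical route therefore gets no route at all, which matches the "no-historical-route behavior for resumable recovery" edge case listed under Human Verification.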

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XL triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 24, 2026
@clawsweeper

clawsweeper Bot commented May 24, 2026

Contributor

Codex review: needs changes before merge. Reviewed May 30, 2026, 12:37 PM ET / 16:37 UTC.

Summary
The PR persists restart-recovery delivery context for agent and auto-reply runs, uses it during main-session restart recovery or failure notices, normalizes session-store fields, reserves slot keys, and adds regression coverage.

PR surface: Source +434, Tests +670. Total +1104 across 14 files.

Reproducibility: yes. Source inspection shows current main resumes restart-aborted sessions with deliver: false, and the PR proof exercises a real Gateway/qa-channel restart recovery path that now emits a visible marker.

Review metrics: 1 noteworthy metric.

  • Session compatibility surface: 2 persisted fields added; 2 reserved slot keys added. Maintainers should explicitly notice the additive session-store and plugin-slot contract before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🦞 diamond lobster
Patch quality: 🧂 unranked krab
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Fix the effective-session-id guard for pending-final persistence and restart-recovery cleanup.
  • [P2] Add focused regression coverage for compaction/session rotation plus pending-final and cleanup ownership.
  • [P2] Rerun the focused agent-command and restart-recovery tests after the repair.

Risk before merge

  • [P1] Compaction-rotated main runs can still reject pending-final persistence and cleanup because the new guards compare against the pre-rotation session id.
  • [P1] Persisting and replaying route context after restart can misroute, duplicate, or suppress user-visible messages if ownership cleanup is wrong.
  • [P1] The PR adds two persisted session fields and reserves two plugin slot keys, which is an additive but compatibility-visible session/plugin-slot surface.
  • [P1] Live Discord delivery was not tested; the supplied proof exercises the real Gateway/plugin path through qa-channel.

Maintainer options:

  1. Fix rotated-session ownership before merge (recommended)
    Update the pending-final and restart-recovery cleanup guards to compare against the effective post-run session id, with regression coverage for compaction rotation.
  2. Accept additive session-state surface after repair
    After the identity guard is fixed, maintainers can explicitly accept the two new persisted fields and reserved plugin slot keys as the intended upgrade surface.
Recommended automerge instruction:
@clawsweeper automerge

Special instructions:
Update src/agents/agent-command.ts so restart-recovery and pending-final persistence/cleanup guards use the effective post-run session id after compaction/session rotation, and add focused tests in src/agents/agent-command.live-model-switch.test.ts proving a rotated current run still persists pending-final delivery and clears only its own restart-recovery claim.

Next step before merge

  • [P2] A narrow automated repair can update the rotated-session guard and add focused regression coverage before maintainer re-review.

Security
Cleared: No dependency, lockfile, CI, permission, secret, or new third-party execution surface was changed; the remaining concern is functional delivery/session-state correctness.

Review findings

  • [P1] Use the effective session id for delivery writes — src/agents/agent-command.ts:1909
Review details

Best possible solution:

Repair the post-run guards to use the effective current session id, add focused compaction-rotation coverage, then land the restart-recovery behavior with the existing QA Gateway proof.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main resumes restart-aborted sessions with deliver: false, and the PR proof exercises a real Gateway/qa-channel restart recovery path that now emits a visible marker.

Is this the best way to solve the issue?

No, not yet. The overall fix direction is appropriate, but the latest patch still uses the original session id in post-run persistence guards after the code has already computed an effective rotated session id.

Full review comments:

  • [P1] Use the effective session id for delivery writes — src/agents/agent-command.ts:1909
    When this run rotates the backing session during compaction, updateSessionStoreAfterAgentRun writes the store entry under effectiveSessionId, but this new guard still compares the current entry to the original sessionId. That rejects the current run's pending-final write; the matching cleanup guards below have the same issue, so a rotated run can lose durable final delivery and leave its restart-recovery claim stale. Hoist/use the effective current session id for these guards and cover the compaction-rotated delivery path.
    Confidence: 0.9
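
To make the finding concrete, a small sketch of the identity check, assuming a simplified `StoreEntry` and hypothetical helpers `ownsEntry` and `guardSessionId` (the real guard in src/agents/agent-command.ts is not reproduced here):

```typescript
interface StoreEntry {
  sessionId: string;
}

// Hypothetical ownership guard: the write is allowed only if the entry
// currently in the store still belongs to the run doing the write.
function ownsEntry(current: StoreEntry | undefined, runSessionId: string): boolean {
  return current !== undefined && current.sessionId === runSessionId;
}

// Pick the id the store entry is expected to carry after the run.
function guardSessionId(sessionId: string, effectiveSessionId?: string): string {
  return effectiveSessionId ?? sessionId;
}

// Compaction rotated the run from "s-1" to "s-2", and the store entry was
// already rewritten under the rotated id.
const rotatedEntry: StoreEntry = { sessionId: "s-2" };

// Comparing against the original id rejects the run's own write.
const staleCheck = ownsEntry(rotatedEntry, "s-1");
// Comparing against the effective id accepts it.
const fixedCheck = ownsEntry(rotatedEntry, guardSessionId("s-1", "s-2"));
```

When no rotation happened, `guardSessionId` returns the original id, so non-rotated runs behave as before.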

Overall correctness: patch is incorrect
Overall confidence: 0.88

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5ba3505fedd9.

Label changes

Label justifications:

  • P1: The PR targets a broken user-visible agent/channel workflow where gateway restarts can leave chat turns silent.
  • merge-risk: 🚨 compatibility: The diff adds persisted SessionEntry fields and reserves matching plugin slot keys, which affects upgrade and plugin-slot compatibility.
  • merge-risk: 🚨 message-delivery: The diff changes how stored route context is used to send user-visible recovery output after restart.
  • merge-risk: 🚨 session-state: The diff writes, normalizes, and clears durable restart-recovery state on session entries.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🦞 diamond lobster and patch quality is 🧂 unranked krab.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix terminal output from a real Gateway child over WebSocket/RPC with qa-channel showing the restart-recovery marker delivered once.
Evidence reviewed

PR surface:

Source +434, Tests +670. Total +1104 across 14 files.

| Area | Files | Added | Removed | Net |
|---|---|---|---|---|
| Source | 8 | 472 | 38 | +434 |
| Tests | 6 | 744 | 74 | +670 |
| Docs | 0 | 0 | 0 | 0 |
| Config | 0 | 0 | 0 | 0 |
| Generated | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 |
| Total | 14 | 1216 | 112 | +1104 |

Acceptance criteria:

  • [P1] node scripts/run-vitest.mjs src/agents/agent-command.live-model-switch.test.ts -t "restart recovery".
  • [P1] node scripts/run-vitest.mjs src/agents/main-session-restart-recovery.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts.
  • [P1] git diff --check.

What I checked:

  • Current main gap: Current main resumes interrupted sessions with deliver: false, so the central visible-recovery behavior is not already implemented on main. (src/agents/main-session-restart-recovery.ts:349, 5ba3505fedd9)
  • Blocking guard defect: The new pending-final write guard compares the current store entry with the original sessionId; after this same run rotates to effectiveSessionId, the guard rejects the current run's durable delivery write. (src/agents/agent-command.ts:1909, 216bf79457f4)
  • Cleanup guard has same identity problem: The successful-delivery cleanup and finally cleanup use the same original-session guard, so a compaction-rotated current run can leave its own restartRecoveryDeliveryRunId claim behind. (src/agents/agent-command.ts:1972, 216bf79457f4)
  • Session rotation contract: The existing session-store update path writes the effective post-run sessionId back to the session entry and preserves existing fields by spreading the prior entry, which makes the original-id guard stale after rotation. (src/agents/command/session-store.ts:128, 5ba3505fedd9)
  • Compatibility surface: The PR adds two persisted SessionEntry fields and reserves the same two keys for plugin slots, so AGENTS.md compatibility guidance applies. (src/config/sessions/types.ts:375, 216bf79457f4)
  • Repository policy: Root AGENTS.md marks provider routing, auth/session state, persisted preferences, config loading, defaults, migrations, setup, startup checks, and fallback behavior as compatibility-sensitive review surfaces. (AGENTS.md:26, 18e7d28b2179)

Likely related people:

  • Peter Steinberger: Dominates recent history for the touched agent/session/recovery files and authored recent current-main commits in this area, including the current-main session-store and delivery-adjacent lines inspected. (role: recent area contributor; confidence: high; commits: b668ffe7ca79, 27ae826f6525, 1f1ff0567a01; files: src/agents/agent-command.ts, src/agents/main-session-restart-recovery.ts, src/auto-reply/reply/agent-runner.ts)
  • Vincent Koc: Shortlog and recent history show repeated work on agent-command.ts, delivery runtime loading, and session-store update paths adjacent to this PR's hot path. (role: adjacent owner; confidence: medium; commits: 99755fcb2f1f, dd27aa945e43, f126088761b7; files: src/agents/agent-command.ts)
  • Ayaan Zaidi: Recent history includes CLI transcript persistence and session metadata work that interacts with the effective-session-id and compaction path implicated by the finding. (role: adjacent feature contributor; confidence: medium; commits: 898fd0482a40, b8ef507cc082; files: src/agents/agent-command.ts, src/agents/command/session-store.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 24, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc4c034418

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/agent-command.ts Outdated
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 24, 2026
@clawsweeper

clawsweeper Bot commented May 24, 2026

Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Pearl Proofling

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: guards the happy path.
Image traits: location green-check meadow; accessory miniature diff map; palette violet, aqua, and starlight; mood curious; pose leaning over a miniature review desk; shell brushed metal shell; lighting calm overcast light; background quiet workflow signs.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 24, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 24, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 24, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. proof: sufficient ClawSweeper judged the real behavior proof convincing. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 24, 2026
@samzong

samzong commented May 25, 2026

Contributor Author

@codex review

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 25, 2026
@clawsweeper clawsweeper Bot added status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. labels May 25, 2026
@moeedahmed

Contributor

Context from pruning stale PRs: #81202 attempted stale reply-turn recovery, but its implementation depended on replyTurn* fields that current main does not type or persist, so the branch would not run against real session stores. Any future stale-turn/restart recovery path should be rebuilt on the current persisted session contract and backed by real Telegram stale-turn recovery proof, not hand-seeded test fields.

@BingqingLyu

This comment was marked as spam.

@steipete steipete self-assigned this May 30, 2026
@steipete steipete force-pushed the fix/gateway-restart-visible-recovery branch from f341a81 to 7a01cb2 Compare May 30, 2026 16:03
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a01cb2923

sessionKey,
storePath,
entry: next,
shouldPersist: (current) => shouldPersistCurrentRunSessionCleanup(current, sessionId),

P1 Badge Keep rotated runs eligible for delivery persistence

When this same run rotates the backing session during compaction, updateSessionStoreAfterAgentRun has already rewritten the store entry to effectiveSessionId, but this guard still compares against the original sessionId. In that normal rotated-run path the pending-final write is rejected, so a visible final payload is no longer made durable before delivery; the cleanup guard below has the same original-id check, leaving the old restartRecoveryDeliveryRunId behind and preventing later runs from claiming a fresh recovery route. Compare against the effective/current session id for this run instead of the pre-rotation id.
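
The stale-claim half of this comment can be illustrated with a sketch of "clear only your own claim". The field name `restartRecoveryDeliveryRunId` comes from the review text; the `SessionEntry` shape and `clearOwnClaim` helper are assumptions for the example, not the repository's code.

```typescript
interface SessionEntry {
  sessionId: string;
  restartRecoveryDeliveryRunId?: string; // run that currently holds the claim
}

// Release the restart-recovery claim only when this run holds it, so one run
// never clears a claim that a different run made.
function clearOwnClaim(entry: SessionEntry, runId: string): SessionEntry {
  if (entry.restartRecoveryDeliveryRunId !== runId) {
    return entry; // not our claim: leave the entry untouched
  }
  const next: SessionEntry = { ...entry };
  delete next.restartRecoveryDeliveryRunId;
  return next;
}
```

If the surrounding guard rejects the cleanup write for a rotated run, a helper like this is never reached, and the old claim stays behind as the comment describes.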

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 30, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 30, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 30, 2026
@steipete steipete force-pushed the fix/gateway-restart-visible-recovery branch from 9dc5646 to 216bf79 Compare May 30, 2026 16:29
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime channel: feishu Channel integration: feishu labels May 30, 2026
@steipete

Contributor

Verification for head 216bf79457f4fb41bde107ac3394467ce7cf482a:

Behavior addressed: restart recovery now keeps enough delivery context to send visible recovered final replies back to the original channel instead of transcript-only deliver:false recovery.

Real environment tested: CI on the PR head, plus local focused Node/Vitest/type checks in this checkout.

Exact steps or command run after this patch:

  • pnpm check:test-types
  • node scripts/run-vitest.mjs src/agents/agent-command.live-model-switch.test.ts src/agents/main-session-restart-recovery.test.ts src/config/sessions/sessions.test.ts src/commands/agent.test.ts extensions/feishu/src/bot-sender-name.test.ts
  • CI run 26689010749 on head 216bf79457f4fb41bde107ac3394467ce7cf482a

Evidence after fix: local check:test-types passed; focused Vitest wrapper passed 4 shards / 125 tests; CI check-test-types, check-prod-types, check-lint, build, auto-reply, agent, command, gateway, security, dependency, OpenGrep, and Real behavior proof lanes are green on the current head. Two CI shards (checks-node-core-runtime-infra-state, checks-node-agentic-commands-doctor) were still pending in GitHub background wait at the time this note was posted; auto-merge is left to branch protection.

Observed result after fix: completed lanes have no failures on the rebased head.

What was not tested: no fresh live Discord/Telegram restart roundtrip was run locally for this maintainer fixup.

Labels

agents Agent runtime and tooling channel: feishu Channel integration: feishu commands Command implementations gateway Gateway runtime merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: XL status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Development

Successfully merging this pull request may close these issues.

Gateway restart can silently abort an in-flight Discord turn, with no automatic recovery message to the user

4 participants