[Fix] Preserve node reconnect state#78351
Conversation
|
Codex review: needs changes before merge. Summary Reproducibility: yes. Source inspection on current main gives a high-confidence path: register the same node id on an old and new connection, then close the old connId; current cleanup can remove the live session and publish stale disconnect presence. Real behavior proof Next step before merge Security Review findings
Review detailsBest possible solution: Land the connId-scoped Gateway fix with its regression coverage after adding the Unreleased changelog note and letting exact-head checks pass. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection on current main gives a high-confidence path: register the same node id on an old and new connection, then close the old connId; current cleanup can remove the live session and publish stale disconnect presence. Is this the best way to solve the issue? Yes for the code direction. Scoping unregister cleanup, pending invokes, invoke-result acceptance, and node disconnect presence to the owning connId is the narrow maintainable fix; the remaining issue is the required changelog entry. Full review comments:
Overall correctness: patch is correct Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 36f847a60e39. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e74a09d24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@clawsweeper re-review |
5e74a09 to
42fbd3b
Compare
42fbd3b to
19edcb3
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
## Considered and deferred - src/gateway/server/ws-connection.test.ts:321 [BOT-NIT]: New test repeats the local socket harness pattern already used in this file. Extracting a helper would be cleanup, not a correctness fix, and would broaden this commit. Signed-off-by: samzong <samzong.lu@gmail.com>
19edcb3 to
7fdd864
Compare
|
Merged. Thanks @samzong. |
Preserve node registry ownership across same-node WebSocket reconnect races so stale old-socket closes cannot clear the replacement session or complete the wrong pending invoke. Thanks @samzong.
Preserve node registry ownership across same-node WebSocket reconnect races so stale old-socket closes cannot clear the replacement session or complete the wrong pending invoke. Thanks @samzong.
Summary
node.list, and node invokes can diverge during reconnect races, making a live node appear disconnected until another presence update arrives.NodeRegistrynow scopes pending invokes and stale unregister cleanup byconnId;node.invoke.resultmatches pending work bynodeId + connId; WebSocket close handling only writes node disconnect presence when the closing socket still owns the current node session.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
pnpm test src/gateway/node-registry.test.ts src/gateway/gateway-misc.test.ts src/gateway/server/ws-connection.test.ts -- --reporter=verbose;pnpm check:changed; one-offpnpm exec tsxlive Gateway harness that starts a loopback Gateway, pairs operator and node devices, connects the same node id twice, closes the old socket late, then verifiesnode.list,node.invoke, andsystem-presence.pnpm check:changedpassed core/coreTests typecheck, lint, and guards; live Gateway reconnect harness passed with copied terminal output:null, keeps the reconnected session, rejects only the old connection's pending invokes, ignores mismatched invoke results, skips node presence disconnect broadcast for stale sockets, and the live replacement node still handlesnode.invokeafter the old socket closes.node-registryregression fails on the old implementation becauseunregister(oldConnId)deletes the currentnodesByIdentry.Root Cause (if applicable)
NodeRegistry.unregister(connId)resolved the old connection to anodeIdand then cleanednodesById, pending invokes, and close-side node state bynodeIdwithout verifying that the closing connection still owned the current node session.nodeId, while WebSocket lifecycle events are keyed byconnId.Regression Test Plan (if applicable)
src/gateway/node-registry.test.ts;src/gateway/server/ws-connection.test.ts.User-visible / Behavior Changes
A reconnected node should no longer briefly appear disconnected or lose pending node invoke state because an older socket closed late.
Diagram (if applicable)
Security Impact (required)
Yes/No): NoYes/No): NoYes/No): NoYes/No): NoYes/No): NoYes, explain risk + mitigation: N/ARepro + Verification
Environment
OPENCLAW_STATE_DIRand loopback token authSteps
node-1and start a node invoke.node-1.connId, then close the old connection.nodeRegistry.unregister(connId)returnsnullfor a stale node socket.node.invoke.Expected
node.list, keeps connected presence, and handlesnode.invokeafter the old socket closes.Actual
Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
node.invokeafter the stale old socket closes.node.invoke.resultstill treats unknown/late invoke ids as ignored success throughgateway-misc.test.ts; current close cleanup remains gated onunregisterreturning the active node id.Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoRisks and Mitigations
presenceKey; the new guard only suppresses node disconnect presence whenunregister(connId)reports a stale close.Considered and deferred