Skip to content

[Fix] Preserve node reconnect state#78351

Merged
frankekn merged 3 commits intoopenclaw:mainfrom
samzong:fix/node-reconnect-registry
May 8, 2026
Merged

[Fix] Preserve node reconnect state#78351
frankekn merged 3 commits intoopenclaw:mainfrom
samzong:fix/node-reconnect-registry

Conversation

@samzong
Copy link
Copy Markdown
Contributor

@samzong samzong commented May 6, 2026

Summary

  • Problem: a stale node WebSocket close could unregister or mark disconnected state for a node that had already reconnected with the same node id.
  • Why it matters: gateway node presence, node.list, and node invokes can diverge during reconnect races, making a live node appear disconnected until another presence update arrives.
  • What changed: NodeRegistry now scopes pending invokes and stale unregister cleanup by connId; node.invoke.result matches pending work by nodeId + connId; WebSocket close handling only writes node disconnect presence when the closing socket still owns the current node session.
  • What did NOT change (scope boundary): no protocol schema changes, no node pairing changes, and no reconnect policy changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes N/A
  • Related N/A
  • This PR fixes a bug or regression

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: stale node socket close after reconnect should not clear the new live node session, should not complete new-connection pending invokes from the old connection, and should not broadcast a node presence disconnect for the replacement session.
  • Real environment tested: local macOS workspace with Node/pnpm test runner plus a real Gateway WebSocket reconnect harness using a paired node-role client.
  • Exact steps or command run after this patch: pnpm test src/gateway/node-registry.test.ts src/gateway/gateway-misc.test.ts src/gateway/server/ws-connection.test.ts -- --reporter=verbose; pnpm check:changed; one-off pnpm exec tsx live Gateway harness that starts a loopback Gateway, pairs operator and node devices, connects the same node id twice, closes the old socket late, then verifies node.list, node.invoke, and system-presence.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): targeted tests passed: 3 files, 40 tests; pnpm check:changed passed core/coreTests typecheck, lint, and guards; live Gateway reconnect harness passed with copied terminal output:
LIVE_PROOF_PASS pr=78351 sha=5e74a09d24ba501c821b68390a1135cc4f0f70fc
gateway_url=ws://127.0.0.1:49817
node_id=c628f148a45664d0a21cf434846fed9e68b115e4c2eee4eb3a2b42d9d20931d9
before_reconnect.connected=true commands=["device.info"]
after_reconnect.connected=true commands=["device.info"]
after_old_close.connected=true commands=["device.info"]
invoke.payload.handledBy=new
presence.reason=connect presence.instanceId=c628f148a45664d0a21cf434846fed9e68b115e4c2eee4eb3a2b42d9d20931d9
invoke_requests.old=0 invoke_requests.new=1
  • Observed result after fix: stale unregister returns null, keeps the reconnected session, rejects only the old connection's pending invokes, ignores mismatched invoke results, skips node presence disconnect broadcast for stale sockets, and the live replacement node still handles node.invoke after the old socket closes.
  • What was not tested: physical mobile/desktop app binary reconnect. A live loopback Gateway WebSocket reconnect with a paired node-role client was tested.
  • Before evidence (optional but encouraged): the added node-registry regression fails on the old implementation because unregister(oldConnId) deletes the current nodesById entry.

Root Cause (if applicable)

  • Root cause: NodeRegistry.unregister(connId) resolved the old connection to a nodeId and then cleaned nodesById, pending invokes, and close-side node state by nodeId without verifying that the closing connection still owned the current node session.
  • Missing detection / guardrail: no regression test covered reconnecting the same node id before the old socket close event ran.
  • Contributing context (if known): node sessions are keyed by nodeId, while WebSocket lifecycle events are keyed by connId.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/node-registry.test.ts; src/gateway/server/ws-connection.test.ts.
  • Scenario the test should lock in: old/new connections with the same node id, stale old close, current session preserved, stale invoke result rejected, old pending invoke disconnected, and stale close presence disconnect skipped.
  • Why this is the smallest reliable guardrail: it exercises the registry and close-handler seams directly without starting a full Gateway process.
  • Existing test that already covers this (if any): none before this PR.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

A reconnected node should no longer briefly appear disconnected or lose pending node invoke state because an older socket closed late.

Diagram (if applicable)

Before:
old socket close -> unregister(nodeId) -> deletes new session / broadcasts disconnect

After:
old socket close -> unregister(connId) -> stale close returns null -> new session remains live

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS local development workspace
  • Runtime/container: Node 22+ via repo pnpm scripts
  • Model/provider: N/A
  • Integration/channel (if any): Gateway node WebSocket lifecycle
  • Relevant config (redacted): default local test configuration; live harness used an isolated temporary OPENCLAW_STATE_DIR and loopback token auth

Steps

  1. Register an old node connection for node-1 and start a node invoke.
  2. Register a new node connection for the same node-1.
  3. Deliver a result from the wrong connId, then close the old connection.
  4. Exercise the WebSocket close handler where nodeRegistry.unregister(connId) returns null for a stale node socket.
  5. In the live harness, start a real loopback Gateway, pair operator and node devices, connect the same node id twice, close the old socket, then verify the replacement node remains connected and handles node.invoke.

Expected

  • The new session remains registered.
  • The mismatched invoke result is ignored.
  • The old pending invoke is rejected on old socket close.
  • The stale close does not write or broadcast node presence disconnect.
  • The replacement live node remains visible in node.list, keeps connected presence, and handles node.invoke after the old socket closes.

Actual

  • Matches expected in the new regression tests and in the live Gateway reconnect harness.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: targeted registry reconnect race, stale invoke result rejection, old pending invoke disconnect cleanup, stale close presence disconnect skip, and live Gateway WebSocket reconnect where the replacement node remains connected and handles node.invoke after the stale old socket closes.
  • Edge cases checked: node.invoke.result still treats unknown/late invoke ids as ignored success through gateway-misc.test.ts; current close cleanup remains gated on unregister returning the active node id.
  • What you did not verify: reconnect with a physical mobile/desktop app binary.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: close-handler ordering changes could affect non-node clients.
    • Mitigation: disconnect presence still runs for non-node clients with a presenceKey; the new guard only suppresses node disconnect presence when unregister(connId) reports a stale close.

Considered and deferred

  • src/gateway/server/ws-connection.test.ts:321 [BOT-NIT]: New test repeats the local socket harness pattern already used in this file. Extracting a helper would be cleanup, not a correctness fix, and would broaden this commit.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. gateway Gateway runtime agents Agent runtime and tooling size: L and removed proof: supplied External PR includes structured after-fix real behavior proof. labels May 6, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 6, 2026

Codex review: needs changes before merge.

Summary
The PR scopes Gateway node unregister, pending invoke cleanup, invoke-result matching, and stale-close presence updates by connection id, with registry and WebSocket close-handler regression tests.

Reproducibility: yes. Source inspection on current main gives a high-confidence path: register the same node id on an old and new connection, then close the old connId; current cleanup can remove the live session and publish stale disconnect presence.

Real behavior proof
Sufficient (terminal): The PR body includes copied terminal output from a real loopback Gateway WebSocket reconnect harness showing the replacement node remains connected and handles node.invoke after the stale old socket closes.

Next step before merge
A repair worker can add the single required Unreleased changelog line; the code direction otherwise looks correct.

Security
Cleared: No concrete security or supply-chain concern was found in the narrowed Gateway lifecycle and regression-test diff.

Review findings

  • [P3] Add the missing changelog entry — src/gateway/node-registry.ts:89-91
Review details

Best possible solution:

Land the connId-scoped Gateway fix with its regression coverage after adding the Unreleased changelog note and letting exact-head checks pass.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection on current main gives a high-confidence path: register the same node id on an old and new connection, then close the old connId; current cleanup can remove the live session and publish stale disconnect presence.

Is this the best way to solve the issue?

Yes for the code direction. Scoping unregister cleanup, pending invokes, invoke-result acceptance, and node disconnect presence to the owning connId is the narrow maintainable fix; the remaining issue is the required changelog entry.

Full review comments:

  • [P3] Add the missing changelog entry — src/gateway/node-registry.ts:89-91
    This PR changes user-visible Gateway node reconnect behavior, but the current five-file PR diff does not touch CHANGELOG.md. Repo policy requires user-facing fixes to add an Unreleased changelog entry, so release notes would otherwise miss this reconnect fix.
    Confidence: 0.92

Overall correctness: patch is correct
Overall confidence: 0.87

Acceptance criteria:

  • git diff --check
  • pnpm test src/gateway/node-registry.test.ts src/gateway/gateway-misc.test.ts src/gateway/server/ws-connection.test.ts -- --reporter=verbose
  • pnpm check:changed

What I checked:

  • Current-main stale unregister path: On current main, NodeRegistry.unregister(connId) resolves a node id from the closing connection, then deletes nodesById and rejects pending invokes by node id, so a late old-socket close can remove a replacement session for the same node id. (src/gateway/node-registry.ts:85, 36f847a60e39)
  • Current-main stale presence path: The current close handler writes disconnect presence before node unregister cleanup, so the stale close can publish disconnected presence for a node that already reconnected. (src/gateway/server/ws-connection.ts:390, 36f847a60e39)
  • PR scopes registry state by connId: The PR adds connId to pending invokes, preserves nodesById unless the closing connection owns the current session, rejects only pending invokes from that connection, and requires invoke results to match node id plus connection id. (src/gateway/node-registry.ts:89, 19edcb3c2e63)
  • PR gates stale node disconnect presence: The PR unregisters node sockets before writing disconnect presence and skips the node disconnect presence write when unregister(connId) returns null for a stale close. (src/gateway/server/ws-connection.ts:397, 19edcb3c2e63)
  • Prior device-less-node concern checked: Runtime policy only lets operator shared-auth skip device identity, and missing node device identity is rejected with DEVICE_IDENTITY_REQUIRED, so the review concern about device-less node presence is not a blocking runtime path. (src/gateway/role-policy.ts:14, 36f847a60e39)
  • No protocol schema expansion: NodeInvokeResultParamsSchema still has no connId; the PR derives the connection id from the authenticated WebSocket client context instead of changing the public protocol. (src/gateway/protocol/schema/nodes.ts:117, 36f847a60e39)

Likely related people:

  • steipete: Path history shows Peter Steinberger introduced the current node registry/WebSocket node split and repeatedly maintained node identity, role, and Gateway WebSocket behavior around this code path. (role: introduced behavior and primary Gateway node owner; confidence: high; commits: 2f8206862a68, 9dbc1435a6ca, f16c176a4cc1; files: src/gateway/node-registry.ts, src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection/message-handler.ts)
  • vincentkoc: Path history for src/gateway/server/ws-connection.ts includes recent Gateway/WebSocket maintenance work by Vincent Koc, making them an adjacent reviewer for lifecycle regressions even though the node registry history points more strongly to Peter. (role: recent adjacent Gateway WebSocket maintainer; confidence: medium; commits: be21d64d080f, dc8b881c1166; files: src/gateway/server/ws-connection.ts)

Remaining risk / open question:

  • Exact-head CI still had in-progress checks at review time, so merge should wait for the pending Gateway/core checks to complete.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 36f847a60e39.

@openclaw-barnacle openclaw-barnacle Bot added size: S and removed agents Agent runtime and tooling size: L labels May 6, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e74a09d24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/gateway/server/ws-connection.ts
@openclaw-barnacle openclaw-barnacle Bot added triage: blank-template Candidate: PR template appears mostly untouched. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. and removed proof: supplied External PR includes structured after-fix real behavior proof. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
@samzong
Copy link
Copy Markdown
Contributor Author

samzong commented May 6, 2026

@clawsweeper re-review

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@samzong samzong force-pushed the fix/node-reconnect-registry branch from 5e74a09 to 42fbd3b Compare May 7, 2026 16:03
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
@frankekn frankekn self-assigned this May 8, 2026
@frankekn frankekn force-pushed the fix/node-reconnect-registry branch from 42fbd3b to 19edcb3 Compare May 8, 2026 04:14
@frankekn
Copy link
Copy Markdown
Member

frankekn commented May 8, 2026

@codex review

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 8, 2026
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 8, 2026
samzong and others added 3 commits May 8, 2026 12:29
## Considered and deferred

- src/gateway/server/ws-connection.test.ts:321 [BOT-NIT]: New test repeats the local socket harness pattern already used in this file. Extracting a helper would be cleanup, not a correctness fix, and would broaden this commit.

Signed-off-by: samzong <samzong.lu@gmail.com>
@frankekn frankekn force-pushed the fix/node-reconnect-registry branch from 19edcb3 to 7fdd864 Compare May 8, 2026 04:30
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 8, 2026
@frankekn frankekn merged commit 1819e41 into openclaw:main May 8, 2026
106 of 110 checks passed
@frankekn
Copy link
Copy Markdown
Member

frankekn commented May 8, 2026

Merged. Thanks @samzong.

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Preserve node registry ownership across same-node WebSocket reconnect races so stale old-socket closes cannot clear the replacement session or complete the wrong pending invoke.

Thanks @samzong.
rogerdigital pushed a commit to rogerdigital/openclaw that referenced this pull request May 9, 2026
Preserve node registry ownership across same-node WebSocket reconnect races so stale old-socket closes cannot clear the replacement session or complete the wrong pending invoke.

Thanks @samzong.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime proof: supplied External PR includes structured after-fix real behavior proof. size: S triage: blank-template Candidate: PR template appears mostly untouched.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants