Skip to content

fix(meshcore): pin virtual-node DeviceInfo to protocol v1 (#3705)#3828

Merged
Yeraze merged 2 commits into
mainfrom
fix/meshcore-vn-deviceinfo-protocol-version
Jun 28, 2026
Merged

fix(meshcore): pin virtual-node DeviceInfo to protocol v1 (#3705)#3828
Yeraze merged 2 commits into
mainfrom
fix/meshcore-vn-deviceinfo-protocol-version

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #3705 — MeshCore virtual node: app connects once after a container restart, then gets stuck on "Connecting…" forever.

Root cause (from the packet captures in the issue)

The captures disprove the earlier thread hypothesis that the app's reconnect path simply never sends AppStart. Decoding the frames shows the meshcore-flutter app sends DeviceQuery first, then AppStart — and it aborts the handshake based on the DeviceInfo reply:

DeviceQuery reply (DeviceInfo) App sends AppStart?
Working connect (cold cache, right after restart) version byte = 1, model MeshMonitor Virtual Node ✅ yes → SelfInfo → connected
Failing connect (warm cache, seconds later) version byte = real node's fw ver, model Heltec V3, build 25-Jun-2026 ❌ no → drops socket after ~5s

MeshCoreVirtualNodeServer.handleDeviceQuery() was sending localNode.firmwareVer (the real node's firmware-version byte, read by the manager's background deviceQuery()) as the DeviceInfo version field. That byte is the companion protocol version the app must use — but the VN only implements v1 frames.

The timing explains the symptom exactly: right after a restart the manager hasn't finished its async deviceQuery(), so localNode.firmwareVer is undefined and the code fell back to SUPPORTED_COMPANION_PROTOCOL_VERSION (1) → app connects. Once the real value is cached, every subsequent DeviceQuery returns a version the app can't reconcile with our v1 wire format → handshake aborts.

Fix

Always advertise SUPPORTED_COMPANION_PROTOCOL_VERSION for the DeviceInfo version byte. Build date / model stay real (display-only) so the app shows a faithful node identity.

Tests

  • Strengthened the existing DeviceQuery test to assert the version byte == SUPPORTED_COMPANION_PROTOCOL_VERSION.
  • Added a regression test: with the local node reporting firmwareVer: 7, DeviceInfo still pins the version byte to v1.
  • Full meshcoreVirtualNodeServer.test.ts suite: 21 passing (success: true via JSON reporter).

Caveat

Both the version byte and the model/build strings differ between the working and failing captures, so packets alone can't prove the version byte is the sole trigger (vs. the app rejecting a changed identity). The version byte is the high-confidence cause — it's the protocol-negotiation field the VN genuinely can't honor above v1. If field testing shows it's insufficient, the follow-up is to also make model/build deterministic across connects.

🤖 Generated with Claude Code

https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

The MeshCore Virtual Node leaked the real node's firmware-version byte
into the DeviceInfo response once the manager's background deviceQuery()
had cached it. The first byte of DeviceInfo is the *companion protocol*
version the app must use, not the proxied node's firmware version — and
the VN only implements v1 frames.

Packet captures in #3705 show the meshcore-flutter app sends DeviceQuery
*before* AppStart and aborts the handshake (never sending AppStart, then
dropping the socket after ~5s) when DeviceInfo reports a version it can't
reconcile with our v1 wire format. Right after a container restart the
cache is cold, so we fell back to v1 and the app connected — which is
exactly the reported "works once after restart, then never again"
symptom.

Always advertise SUPPORTED_COMPANION_PROTOCOL_VERSION for the version
byte. Build date / model stay real (display-only) so the app still shows
a faithful node identity. Adds a regression test asserting the version
byte stays v1 even when the local node reports firmwareVer 7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Gather context and read changed files
  • Review code quality and correctness
  • Check test coverage
  • Post feedback

Summary

This is a well-targeted, minimal fix for a timing-dependent bug. The root cause analysis in the PR description is excellent and the fix is correct.


Code Review

src/server/meshcoreVirtualNodeServer.tshandleDeviceQuery() (line 355–374)

The fix is correct. The firmwareVer field in DeviceInfo is the companion protocol negotiation field, not a display value. Pinning it to SUPPORTED_COMPANION_PROTOCOL_VERSION (always 1) is the right behavior for the virtual node, which only implements v1 wire frames.

One minor style note: The localNode variable is fetched but only used for display fields (firmwareBuild, model). If localNode is null (manager not yet connected), the function still sends a valid DeviceInfo response using defaults — this is arguably the correct behavior for DeviceQuery (unlike AppStart which correctly early-exits with BadState). The current behavior is fine, but a comment acknowledging the intentional difference from handleAppStart would add clarity to future readers.

// Note: intentionally does NOT gate on isConnected() — DeviceQuery is sent
// by the app BEFORE AppStart, so we must reply even with a cold manager.
// Display fields fall back to safe defaults; the version byte is always pinned.

This is purely a suggestion, not a required change.

No issues with the fix itself.


src/server/meshcoreVirtualNodeServer.test.ts

Test coverage is strong. Two improvements were made:

  1. Line 179–184 — Strengthened the existing DeviceQuery test to assert payload.readInt8(1) === SUPPORTED_COMPANION_PROTOCOL_VERSION. Previously it only checked the response code; now it validates the protocol version byte, which is the precise field the bug was in.

  2. Lines 186–207 — New regression test constructs a manager with firmwareVer: 7 (simulating the warm-cache state from the issue) and asserts:

    • payload[0] is ResponseCodes.DeviceInfo
    • payload.readInt8(1) equals SUPPORTED_COMPANION_PROTOCOL_VERSION
    • payload.readInt8(1) is not 7

    The dual assertion (both positive and negative) makes the regression intent maximally clear. Good pattern.

Minor: The regression test spins up a second full server instance with a try/finally. This works correctly. An alternative would be to expose a test helper or reuse the beforeEach manager via mutation (e.g., (manager as any).localNode.firmwareVer = 7), but the current approach avoids coupling to internal implementation details of FakeManager, which is the right call.


Verdict

Approve — The fix is correct, minimal, and well-tested. The regression test directly covers the bug scenario described in the issue. No functional concerns.


…-up)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@Yeraze

Yeraze commented Jun 28, 2026

Copy link
Copy Markdown
Owner Author

Thanks for the review! Applied the optional clarity suggestion in the latest commit — added a comment on handleDeviceQuery noting it intentionally does not gate on isConnected()/localNode (unlike handleAppStart), because the app sends DeviceQuery before AppStart and we must reply even with a cold manager. No logic change.

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review

  • Gather context and read changed files
  • Review code quality and correctness
  • Check test coverage
  • Post feedback

Summary

This follow-up commit adds the clarity comment suggested in the previous review. The change is purely documentation — no logic was altered.


handleDeviceQuery() comment (lines 357–372)

The comment is accurate and informative. A few observations:

Positive: It correctly explains both why localNode gating is intentionally absent (unlike handleAppStart) and why the version byte must always be SUPPORTED_COMPANION_PROTOCOL_VERSION. The comment also cross-references the issue (#3705) and describes the exact failure mode, which is exactly the right level of detail for a subtle timing-dependent behavior.

Minor nit (non-blocking): The comment is quite long (11 lines). The last paragraph is a full re-statement of the root cause from the PR description. For source-level comments, a shorter form is often preferred — something like:

// Does NOT gate on isConnected()/localNode (unlike handleAppStart, which
// replies BadState): DeviceQuery arrives *before* AppStart, so we must
// reply even with a cold manager; display fields fall back to safe defaults.
//
// IMPORTANT: firmwareVer must always be SUPPORTED_COMPANION_PROTOCOL_VERSION —
// NOT the real node's firmwareVer. This byte is the companion *protocol*
// negotiation field. Leaking the real fw version causes the app to abort the
// handshake (see issue #3705).

This is purely a style preference — the current version is not wrong, just verbose. Given that the PR description already documents the full root cause, the in-code comment doesn't need to repeat all of it.


Verdict

Approve — The clarifying comment fulfills the review suggestion and accurately documents the intentional design difference. No functional concerns. The fix, tests, and documentation are all correct.


@Yeraze Yeraze merged commit 54587ba into main Jun 28, 2026
20 checks passed
@Yeraze Yeraze deleted the fix/meshcore-vn-deviceinfo-protocol-version branch June 28, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] MeshCore: Cannot connect to virtual node

1 participant