fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)#3794
Conversation
…3705) Two gaps left a MeshCore source unable to recover from a dropped real-node link without a manual reconnect or container restart: 1. The manager ignored the native backend's own 'disconnected' event (it only logged it), so a socket/serial-level drop left `connected = true`. isConnected() kept returning true, so the Virtual Node server answered AppStart with a stale SelfInfo while real sends silently failed, and nothing recovered. handleUnexpectedDisconnect() now reflects reality (drops to disconnected + stops the VN server) or, when auto-reconnect is enabled, hands off to the existing backoff machinery. An intentionalTeardown guard plus a stale-backend-instance check keep it from firing on our own teardowns. 2. Heartbeat/auto-reconnect was unreachable: neither the source config type nor meshcoreConfigFromSource() carried heartbeatIntervalSeconds, and there was no UI for it — so startHeartbeat() always early-returned and the reconnect path (incl. PR #3706's fix) never ran. Added a heartbeat field to the MeshCore source form (mirroring Meshtastic), plumbed heartbeatIntervalSeconds through MeshCoreSourceConfig and meshcoreConfigFromSource(). The update route already restarts the manager on any config change, so editing it reconnects. Regression tests cover the drop-detection branches and the config passthrough. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
The UI Disconnect (POST /api/sources/:id/disconnect) called meshcoreManagerRegistry.remove(), deleting the manager from the registry. Every /api/sources/:id/meshcore/* route is behind a guard that 404s with "No MeshCore manager for source <id>" when the manager is absent — so after a manual disconnect the page's status/read polling errored out and the source couldn't be driven again without a container restart. Disconnect now tears down the device link via manager.disconnect() (which already stops the VN server, heartbeat and schedulers) while leaving the manager registered, so /meshcore/* keeps serving a clean disconnected state and /connect re-establishes against the existing manager. Mirrors the intent of the autoConnect=false stop/start workflow without making the source unaddressable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
|
Claude finished @Yeraze's task —— View job PR Review: fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)Tasks:
SummaryThis PR fixes three distinct but related lifecycle bugs in the MeshCore source management. The changes are well-targeted, well-documented, and include appropriate regression tests. Here is my detailed analysis: Code Quality & Best Practices
backend.on('disconnected', () => {
if (this.nativeBackend !== backend) return; // ← stale backend guard
void this.handleUnexpectedDisconnect();
});This prevents a late event from a replaced backend from triggering a spurious recovery cycle — a subtle race that would be easy to miss.
Potential Bugs / Issues1. 2. 3. Config-change triggered reconnect (sourceRoutes line 692) — When Performance ConsiderationsNo performance concerns. The Security ConsiderationsNo security issues. The new disconnect behavior (keep manager, call Test CoverageThe test coverage is solid and targets the right things:
One gap to note: The Minor Observations
VerdictThe core fixes are correct, well-reasoned, and the regression tests are appropriately targeted. The The one substantive note is the potential for confusing error messages when |
Addresses Claude review finding #2: when a socket drop is handled with auto-reconnect disabled, handleUnexpectedDisconnect() left this.nativeBackend pointing at the closed connection, so sendBridgeCommand()'s `!nativeBackend` guard never tripped and callers got a write-to-closed error instead of a clean "disconnected" until /connect ran. Now tears down and nulls the dead backend (connectionState is set to 'disconnected' first so any re-emitted event short-circuits). Added a regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
|
Claude finished @Yeraze's task —— View job PR Review: fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)Tasks:
SummaryThis PR correctly addresses three distinct MeshCore connection lifecycle bugs. The changes are well-targeted, well-documented, and include appropriate regression tests. Here is my detailed analysis: Code Quality & Best Practices
backend.on('disconnected', () => {
if (this.nativeBackend !== backend) return; // ← stale backend guard
void this.handleUnexpectedDisconnect();
});This prevents a late event from a replaced backend from triggering a spurious recovery cycle.
Potential Issues1. 2. 3. Minor: Test CoverageThe test coverage is solid and targets the right behaviors:
One remaining gap: No test exercises what happens when Performance & SecurityNo performance or security issues. The VerdictApprove. The three fixes are correct, complementary, and the test suite exercises the right regression paths. The The remaining items are minor UX considerations (heartbeat default value) that don't affect correctness. |
Summary
Investigating #3705 (MeshCore Virtual Node stuck "Connecting") surfaced three real gaps in how MeshMonitor handles a MeshCore source's connection lifecycle. None of them were the app-side
AppStartbehavior the issue ultimately hinges on, but they explain the "works once, then stuck until restart" class of failure — including the reporter's restart-fixes-it symptom and a locally-reproduced "Disconnect → can't reconnect" bug. Firmware was verified againstcompanion-v1.15.0/v1.16.0(silent-until-AppStart, single-client, byte-identical between versions); findings posted on #3705.Changes
meshcoreConfigFromSource()carriedheartbeatIntervalSeconds, and there was no UI for it — sostartHeartbeat()always early-returned and the reconnect path (including PR fix(meshcore): stop virtual node server during transport teardown #3706's fix) never ran. Added a Heartbeat field to the MeshCore source form (mirrors Meshtastic) and plumbedheartbeatIntervalSecondsthroughMeshCoreSourceConfig→meshcoreConfigFromSource(). The update route already restarts the manager on config change.disconnectedevent, leavingconnected = trueon a real link drop — so the VN server served a stale identity and sends silently failed with no recovery. AddedhandleUnexpectedDisconnect()(guarded by anintentionalTeardownflag + a stale-backend-instance check) that auto-reconnects when heartbeat is enabled, or cleanly marks the source disconnected otherwise.POST /api/sources/:id/disconnect) calledmeshcoreManagerRegistry.remove(), deleting the manager; every/meshcore/*route then 404'd with "No MeshCore manager for source" and the source was unaddressable until a restart. Disconnect now callsmanager.disconnect()and keeps the manager registered.docs/features/meshcore.md; CHANGELOG entries.Issues Resolved
Relates to #3705 (addresses the real-node link-recovery failure modes; the residual app-side
AppStart-on-reconnect behavior is upstream in meshcore-flutter). Activates PR #3706's teardown fix, which was previously unreachable by default.Documentation Updates
docs/features/meshcore.md— added the optional Heartbeat step to "Adding a Source".CHANGELOG.md— Added (heartbeat) + Fixed (disconnect reconnect, socket-drop detection) entries.Testing
tsconfig.server.json— no new errors in changed files)/meshcore/*routes return 200 (not 404) while disconnected, full node state restored on reconnect🤖 Generated with Claude Code