Skip to content

fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)#3794

Merged
Yeraze merged 4 commits into
mainfrom
fix/meshcore-heartbeat-and-drop-detection
Jun 26, 2026
Merged

fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)#3794
Yeraze merged 4 commits into
mainfrom
fix/meshcore-heartbeat-and-drop-detection

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 26, 2026

Copy link
Copy Markdown
Owner

Summary

Investigating #3705 (MeshCore Virtual Node stuck "Connecting") surfaced three real gaps in how MeshMonitor handles a MeshCore source's connection lifecycle. None of them were the app-side AppStart behavior the issue ultimately hinges on, but they explain the "works once, then stuck until restart" class of failure — including the reporter's restart-fixes-it symptom and a locally-reproduced "Disconnect → can't reconnect" bug. Firmware was verified against companion-v1.15.0/v1.16.0 (silent-until-AppStart, single-client, byte-identical between versions); findings posted on #3705.

Changes

  • Heartbeat / auto-reconnect for MeshCore sources (was unreachable). Neither the source config type nor meshcoreConfigFromSource() carried heartbeatIntervalSeconds, and there was no UI for it — so startHeartbeat() always early-returned and the reconnect path (including PR fix(meshcore): stop virtual node server during transport teardown #3706's fix) never ran. Added a Heartbeat field to the MeshCore source form (mirrors Meshtastic) and plumbed heartbeatIntervalSeconds through MeshCoreSourceConfigmeshcoreConfigFromSource(). The update route already restarts the manager on config change.
  • Socket-drop detection. The manager only logged the native backend's disconnected event, leaving connected = true on a real link drop — so the VN server served a stale identity and sends silently failed with no recovery. Added handleUnexpectedDisconnect() (guarded by an intentionalTeardown flag + a stale-backend-instance check) that auto-reconnects when heartbeat is enabled, or cleanly marks the source disconnected otherwise.
  • Reconnect-after-disconnect. The UI Disconnect (POST /api/sources/:id/disconnect) called meshcoreManagerRegistry.remove(), deleting the manager; every /meshcore/* route then 404'd with "No MeshCore manager for source" and the source was unaddressable until a restart. Disconnect now calls manager.disconnect() and keeps the manager registered.
  • Docs: documented the new Heartbeat field in docs/features/meshcore.md; CHANGELOG entries.

Issues Resolved

Relates to #3705 (addresses the real-node link-recovery failure modes; the residual app-side AppStart-on-reconnect behavior is upstream in meshcore-flutter). Activates PR #3706's teardown fix, which was previously unreachable by default.

Documentation Updates

  • docs/features/meshcore.md — added the optional Heartbeat step to "Adding a Source".
  • CHANGELOG.md — Added (heartbeat) + Fixed (disconnect reconnect, socket-drop detection) entries.

Testing

  • Unit tests pass (full suite: 7657 passed, 0 failed, 0 suite failures)
  • TypeScript compiles cleanly (tsconfig.server.json — no new errors in changed files)
  • New regression tests: drop-detection branches, heartbeat config passthrough, disconnect keeps manager registered
  • Manually verified live against a real Heltec V3 (v1.15.0) in the dev container: 3× disconnect→reconnect cycles, /meshcore/* routes return 200 (not 404) while disconnected, full node state restored on reconnect
  • Reviewer: edit a MeshCore source, set a Heartbeat value, confirm it persists and that pulling the link triggers an automatic reconnect

🤖 Generated with Claude Code

Yeraze and others added 3 commits June 26, 2026 12:07
…3705)

Two gaps left a MeshCore source unable to recover from a dropped real-node
link without a manual reconnect or container restart:

1. The manager ignored the native backend's own 'disconnected' event (it only
   logged it), so a socket/serial-level drop left `connected = true`.
   isConnected() kept returning true, so the Virtual Node server answered
   AppStart with a stale SelfInfo while real sends silently failed, and nothing
   recovered. handleUnexpectedDisconnect() now reflects reality (drops to
   disconnected + stops the VN server) or, when auto-reconnect is enabled, hands
   off to the existing backoff machinery. An intentionalTeardown guard plus a
   stale-backend-instance check keep it from firing on our own teardowns.

2. Heartbeat/auto-reconnect was unreachable: neither the source config type nor
   meshcoreConfigFromSource() carried heartbeatIntervalSeconds, and there was no
   UI for it — so startHeartbeat() always early-returned and the reconnect path
   (incl. PR #3706's fix) never ran. Added a heartbeat field to the MeshCore
   source form (mirroring Meshtastic), plumbed heartbeatIntervalSeconds through
   MeshCoreSourceConfig and meshcoreConfigFromSource(). The update route already
   restarts the manager on any config change, so editing it reconnects.

Regression tests cover the drop-detection branches and the config passthrough.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
The UI Disconnect (POST /api/sources/:id/disconnect) called
meshcoreManagerRegistry.remove(), deleting the manager from the registry.
Every /api/sources/:id/meshcore/* route is behind a guard that 404s with
"No MeshCore manager for source <id>" when the manager is absent — so after a
manual disconnect the page's status/read polling errored out and the source
couldn't be driven again without a container restart.

Disconnect now tears down the device link via manager.disconnect() (which
already stops the VN server, heartbeat and schedulers) while leaving the
manager registered, so /meshcore/* keeps serving a clean disconnected state and
/connect re-establishes against the existing manager. Mirrors the intent of the
autoConnect=false stop/start workflow without making the source unaddressable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)

Tasks:

  • Gather context and understand the PR
  • Read changed files
  • Analyze code quality, potential bugs, and test coverage
  • Post final review feedback

Summary

This PR fixes three distinct but related lifecycle bugs in the MeshCore source management. The changes are well-targeted, well-documented, and include appropriate regression tests. Here is my detailed analysis:


Code Quality & Best Practices

src/server/meshcoreRegistry.ts — The MeshCoreSourceConfig interface addition of heartbeatIntervalSeconds (lines 32) is clean, matches the Meshtastic precedent, and the JSDoc is precise. The passthrough in meshcoreConfigFromSource() for both SERIAL and TCP paths (lines 78, 89) is correctly symmetrical.

src/server/meshcoreManager.ts — The new intentionalTeardown flag (line 562) is the right pattern for distinguishing expected closes from unexpected drops. The stale-backend-instance guard on the disconnected listener (lines 986–988) is an excellent defensive touch:

backend.on('disconnected', () => {
  if (this.nativeBackend !== backend) return;   // ← stale backend guard
  void this.handleUnexpectedDisconnect();
});

This prevents a late event from a replaced backend from triggering a spurious recovery cycle — a subtle race that would be easy to miss.

handleUnexpectedDisconnect() (lines 4471–4495) — The two-guard pattern (intentionalTeardown + connectionState check) is correct and the logic flows clearly. The "else: drop to disconnected + stop VN server" branch (no auto-reconnect) properly cleans up so stale state doesn't linger.

src/server/routes/sourceRoutes.ts — disconnect handler (lines 1095–1108) — The comment explaining why the manager is kept registered is helpful and the reasoning is sound. One minor observation: the route checks existingMc?.isConnected() for the "already stopped" early return, but existingMc can be undefined (the ?. handles that). If the manager doesn't exist yet (source was never connected), the code falls through to the alreadyStopped: true branch correctly since there's no manager to disconnect. This is fine.

src/pages/DashboardPage.tsx — The heartbeat field for MeshCore sources reuses the existing formHeartbeat state variable that already serves the Meshtastic form. Since the form only shows one source type at a time, sharing the state is acceptable, but it's worth noting that setFormHeartbeat(String(cfg?.heartbeatIntervalSeconds ?? 0)) (line 415) runs for both Meshtastic TCP and MeshCore sources from the same code path (line 407 branches on type, but line 415 is shared). This is intentional and correct as the field is the same semantic concept.


Potential Bugs / Issues

1. formHeartbeat shared state — MeshCore path validates range 0–3600; Meshtastic path validates separately. The MeshCore path at line 542–546 correctly validates mcHeartbeat, but if formHeartbeat is set to a very large value and then the user switches form type without clearing, stale values could be submitted. This is a minor pre-existing UX issue, not introduced by this PR.

2. handleUnexpectedDisconnect() does not null out nativeBackend — When shouldReconnect = false, the method sets connected = false and stops the VN server but does not null out this.nativeBackend. In contrast, disconnect() explicitly nulls it (line 924). This means isConnected() would return false (correct), but this.nativeBackend still holds the old (dead) object. This seems intentional given the stale-backend guard on the event listener, but it means sendBridgeCommand() would still try to use the dead backend until a manual reconnect clears it. A guard at the start of sendBridgeCommand() presumably catches this since it checks if (!this.nativeBackend) — but here the backend is non-null and dead, so that guard won't help. This could cause confusing error messages rather than clean "disconnected" errors until /connect is called.

3. Config-change triggered reconnect (sourceRoutes line 692) — When meshcoreManagerRegistry.remove(source.id) is called in the "config changed" branch, the manager IS deleted. This is different from the disconnect route's behavior (manager kept). This is intentional (config changes need a fresh instance) but creates a brief window where the source is unaddressable between remove() and the subsequent getOrCreate() + connect(). This was pre-existing behavior and is acceptable given the atomic nature of the code path.


Performance Considerations

No performance concerns. The handleUnexpectedDisconnect is on the hot path (socket close event) but is fast — two boolean checks and either a beginReconnect() or async VN server stop.


Security Considerations

No security issues. The new disconnect behavior (keep manager, call disconnect()) is strictly more conservative than the old behavior (delete from registry). Routes still require requirePermission('sources', 'write').


Test Coverage

The test coverage is solid and targets the right things:

  • meshcoreManager.dropDetection.test.ts — Tests all four branches of handleUnexpectedDisconnect(): no-reconnect path, with-reconnect path, intentionalTeardown guard, and connectionState guard. The fifth test verifying that disconnect() sets the flag is a good regression anchor.

  • meshcoreRegistry.test.ts — Two new tests for heartbeatIntervalSeconds passthrough on both SERIAL and TCP paths, and one for the "undefined when not configured" case. Clean.

  • sourceRoutes.autoConnect.test.ts — The new MeshCore disconnect tests (lines 273–321) directly target the regression: verify that disconnect() is called but meshcoreManagerRegistry.remove() is NOT called. This is exactly the right thing to test.

One gap to note: The handleUnexpectedDisconnect test at line 70 sets connectionState = 'reconnecting' to verify the no-op path, but doesn't test what happens when connectionState = 'connecting' (before initial connection completes). This is a low-risk gap since both states pass the === 'connected' check, but the connecting case isn't explicitly exercised.


Minor Observations

  • The MeshCoreSourceConfig interface is defined in meshcoreRegistry.ts (not exported), while the runtime MeshCoreConfig interface is in meshcoreManager.ts. This split is understandable given the layering, but a future cleanup could unify them or at least cross-reference them.

  • The heartbeat UI help text (line 1486) says "Applies to Companion devices only" — this correctly matches the code where startHeartbeat() is only called when this.nativeBackend is set (line 786), which is the Companion path.

  • The formHeartbeat default is '30' (line 135 in state init). For new MeshCore sources this means the field starts at 30 seconds even before a user has set anything. Since heartbeatIntervalSeconds is only added to the config when mcHeartbeat > 0 (line 547), and the form validation allows 0–3600, users can set it to 0 to disable. But the default of 30 displayed in the form might surprise users who expect heartbeat to default to off. This is a UX consideration worth discussing — the Meshtastic form also defaults to 30 via the same formHeartbeat state.


Verdict

The core fixes are correct, well-reasoned, and the regression tests are appropriately targeted. The intentionalTeardown guard + stale-backend-instance check is the right solution for the problem domain. The "keep manager registered on disconnect" change is the right call for the UX and addresses the root cause directly.

The one substantive note is the potential for confusing error messages when handleUnexpectedDisconnect runs with shouldReconnect = false and the dead nativeBackend isn't nulled — but this is likely recovered cleanly on the next /connect call.

Addresses Claude review finding #2: when a socket drop is handled with
auto-reconnect disabled, handleUnexpectedDisconnect() left this.nativeBackend
pointing at the closed connection, so sendBridgeCommand()'s `!nativeBackend`
guard never tripped and callers got a write-to-closed error instead of a clean
"disconnected" until /connect ran. Now tears down and nulls the dead backend
(connectionState is set to 'disconnected' first so any re-emitted event
short-circuits). Added a regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(meshcore): heartbeat/auto-reconnect, drop detection, reconnect-after-disconnect (#3705)

Tasks:

  • Gather context and understand the PR
  • Read changed files
  • Analyze code quality, potential bugs, and test coverage
  • Post final review feedback

Summary

This PR correctly addresses three distinct MeshCore connection lifecycle bugs. The changes are well-targeted, well-documented, and include appropriate regression tests. Here is my detailed analysis:


Code Quality & Best Practices

src/server/meshcoreRegistry.ts — The heartbeatIntervalSeconds addition to MeshCoreSourceConfig (line 32) is clean with precise JSDoc. The passthrough in meshcoreConfigFromSource() is correctly symmetrical for both SERIAL (line 78) and TCP (line 89) paths.

src/server/meshcoreManager.ts — The intentionalTeardown flag (line 562) is exactly the right primitive for distinguishing expected closes from unexpected socket drops. The stale-backend-instance guard on the disconnected listener (lines 986–988) is an excellent defensive touch:

backend.on('disconnected', () => {
  if (this.nativeBackend !== backend) return; // ← stale backend guard
  void this.handleUnexpectedDisconnect();
});

This prevents a late event from a replaced backend from triggering a spurious recovery cycle.

handleUnexpectedDisconnect() (lines 4471–4510) — The two-guard pattern (intentionalTeardown + connectionState !== 'connected') is correct and the logic flows clearly. Notably, the PR includes explicit nativeBackend nulling (lines 4500–4507) with a try/catch for the dead-backend cleanup — this addresses the "confusing write-to-closed error" issue that was flagged in the previous review pass.

src/server/routes/sourceRoutes.ts — disconnect handler (lines 1095–1108) — The inline comment explaining why the manager is kept registered is valuable. The code logic is correct.

src/pages/DashboardPage.tsx — The heartbeat field for MeshCore shares formHeartbeat state with the Meshtastic form. Since the form only renders one source type at a time and both types use the same 0–3600 range with identical validation logic (lines 543–546 for MeshCore, lines 568–571 for Meshtastic), this sharing is intentional and correct.


Potential Issues

1. formHeartbeat defaults to '30' for new MeshCore sources. The initial state (presumably set at component initialization) means a new MeshCore source shows 30 seconds in the heartbeat field before a user interacts. Setting it to '0' would be less surprising for users who expect heartbeat to default to off for a new source. Since heartbeatIntervalSeconds is only emitted when mcHeartbeat > 0 (line 547), this doesn't cause a bug, but a first-time user may not notice the pre-filled value. Low severity.

2. alreadyStopped behavior when manager doesn't exist for MeshCore source. If a MeshCore source exists in the database but was never connected (no manager in the registry), meshcoreManagerRegistry.get(source.id) returns undefined, existingMc?.isConnected() is undefined (falsy), and the route returns alreadyStopped: true. This is the correct fallback behavior, but it differs subtly from the Meshtastic path which uses sourceManagerRegistry.getManager(source.id). No bug, but worth noting the two different registry APIs.

3. Minor: shouldReconnect state is not reset in handleUnexpectedDisconnect's no-reconnect branch. When shouldReconnect = false, the method nulls the backend and emits disconnected, but shouldReconnect stays false. This is correct since shouldReconnect reflects configuration (heartbeatIntervalSeconds > 0), not transient state — it shouldn't be toggled here. Just confirming this is intentional.


Test Coverage

The test coverage is solid and targets the right behaviors:

meshcoreManager.dropDetection.test.ts — Tests all key branches of handleUnexpectedDisconnect():

  • No-reconnect path (drops to disconnected, stops VN server) ✓
  • New in this revision: Dead nativeBackend is nulled after disconnect (line 44–54) — this directly tests the fix for the "confusing error" issue ✓
  • Auto-reconnect hands off to beginReconnect()
  • intentionalTeardown guard ignores the drop ✓
  • connectionState !== 'connected' guard is a no-op ✓
  • disconnect() sets the intentionalTeardown flag ✓

meshcoreRegistry.test.ts — New tests for heartbeat passthrough on both SERIAL and TCP paths, plus the "undefined when not configured" case. Clean.

sourceRoutes.autoConnect.test.ts — The three new MeshCore disconnect tests (lines 273–321) directly target the regression: manager kept, disconnect() called, remove() not called.

One remaining gap: No test exercises what happens when handleUnexpectedDisconnect fires and shouldReconnect = true but the beginReconnect() itself throws or the reconnect sequence stalls. This is a lower-level concern already covered by existing reconnect tests (presumably), so it's not blocking.


Performance & Security

No performance or security issues. The handleUnexpectedDisconnect path is correctly lightweight (two boolean checks before doing any I/O). The disconnect route retains the permission guard (requirePermission('sources', 'write')).


Verdict

Approve. The three fixes are correct, complementary, and the test suite exercises the right regression paths. The intentionalTeardown + stale-backend-instance double-guard is the right solution. The nativeBackend nulling in the no-reconnect path closes the "confusing write-to-closed error" gap identified in the first review. The disconnect route fix (keep manager, don't remove()) directly addresses the "source unaddressable after disconnect" root cause.

The remaining items are minor UX considerations (heartbeat default value) that don't affect correctness.

@Yeraze Yeraze merged commit 7c60423 into main Jun 26, 2026
17 checks passed
@Yeraze Yeraze deleted the fix/meshcore-heartbeat-and-drop-detection branch June 26, 2026 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant