[BUG] Large Meshtastic TCP node repeatedly disconnects during/after full config sync; passive per-source mode improves stability #3122

Description

@TheWISPRer

Describe the bug

MeshMonitor 4.6.1 has stability problems when connected to a very large Meshtastic TCP infrastructure node.

The node accepts TCP connections and short-lived Meshtastic CLI commands work, but MeshMonitor’s persistent client behavior is associated with repeated disconnects during or shortly after full config sync / NodeDB sync. The UI may show “connected,” and commands may sometimes send, but receive-side behavior becomes unreliable: message sounds can fire while channel chat updates lag or fail to appear, traffic monitor and channel chat can disagree, and the connection can enter repeated reconnect/config-sync loops.

This appears related to a large NodeDB/config stream plus repeated want_config_id and post-config outbound requests against a large TCP node.

To Reproduce

  1. Deploy MeshMonitor 4.6.1 with Docker Compose.
  2. Configure a Meshtastic TCP source pointing at a large infrastructure node.
  3. Source settings:
    • Type: Meshtastic TCP
    • Auto-connect: enabled
    • Virtual Node: disabled
    • Remote Admin Scanner: disabled
    • Time Sync Scheduler: disabled
    • Heartbeat: tested both default and explicit heartbeat
  4. Start MeshMonitor and allow it to connect.
  5. Watch logs during initial config sync and the minutes after configComplete.

Observed log pattern:

[DataEventEmitter] Connection status: connected
Starting init config capture for virtual node server
Init config capture complete! Captured ~900-1150 messages for virtual node replay
Config capture complete — schedulers will start over the next 55 seconds
[DataEventEmitter] Connection status: disconnected
Node disconnect notification sent for source ...
Requesting LoRa config from device...
Failed to request LoRa config: Error: Not connected to Meshtastic node
Requesting all module configs for backup...
Failed to request all module configs: Error: Not connected to Meshtastic node

In some runs, over ~20 minutes, the app repeatedly connected/disconnected and restarted init config capture.
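To put a number on the churn, the log can be reduced to a few counters. The following is a minimal sketch, not part of MeshMonitor: `summarizeChurn` and `ChurnSummary` are hypothetical names, and the matched substrings are taken only from the log excerpt above, so they may need adjusting for other versions.

```typescript
// Hypothetical helper (not a MeshMonitor API): counts connection churn in log lines.
// The substrings below are copied from the log excerpt in this report.
interface ChurnSummary {
  connects: number;
  disconnects: number;
  configCaptures: number;
  failedRequests: number;
}

function summarizeChurn(lines: string[]): ChurnSummary {
  const summary: ChurnSummary = {
    connects: 0,
    disconnects: 0,
    configCaptures: 0,
    failedRequests: 0,
  };
  for (const line of lines) {
    if (line.includes("Connection status: connected")) {
      summary.connects++;
    } else if (line.includes("Connection status: disconnected")) {
      summary.disconnects++;
    } else if (line.includes("Init config capture complete")) {
      summary.configCaptures++;
    } else if (line.includes("Not connected to Meshtastic node")) {
      // Outbound request attempted after the session had already dropped.
      summary.failedRequests++;
    }
  }
  return summary;
}
```

A high `configCaptures` count relative to the run time indicates that full config syncs are being repeated, which is the loop described here.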

Expected behavior

MeshMonitor should remain usable and receive packets reliably from large Meshtastic TCP nodes.

For large/fragile TCP sources, it should avoid repeatedly forcing a full config/NodeDB sync after every reconnect and should avoid post-config outbound requests that appear to destabilize the node/session.

Screenshots

No public screenshot included because the instance contains private mesh/user data. I can provide sanitized screenshots/log excerpts if helpful.

Environment

Host OS: Debian 12 VM in Hyper-V Server 2019
Deployment type: Docker Compose
Container version tested: ghcr.io/yeraze/meshmonitor:4.6.1
Earlier versions also showed similar behavior: 4.3.2, 4.5.0
Networking customizations:

  • MeshMonitor behind a local reverse proxy for HTTPS
  • Meshtastic TCP source is reached over a LAN-routed WireGuard VPN connection.

Database backend: SQLite

Node type:
Nebra CM3 / Raspberry Pi Compute Module 3 Rev 1.0, ARM64 Cortex-A53, running Debian GNU/Linux 13 (trixie) with meshtasticd 2.7.22. Meshtastic reports hwModel=PORTDUINO, pioEnv=native-tft, firmwareEdition=VANILLA, role=ROUTER, firmwareVersion=2.7.22. Large infrastructure/router-style deployment with dual Wehooper Pi-HAT LoRa radios.
MeshMonitor observed source size: <1100 nodes over 24h; Meshtastic CLI reported nodedbCount=1183 at the time of testing.

Node Connection Type:

  • Meshtastic TCP over WireGuard VPN. The node is a WireGuard client to a UniFi WireGuard server; MeshMonitor runs on a routed VLAN with access to the VPN network.

VirtualNode:

  • Disabled in the UI during testing.

Additional context

Things ruled out:

  • Another local process was not competing for the TCP connection.
  • Docker bridge networking was not the root cause; host networking behaved similarly.
  • Increasing Linux TCP buffers did not materially fix the issue.
  • Meshtastic CLI worked for short-lived commands against the same node, suggesting the TCP API is reachable but MeshMonitor’s persistent sync behavior is the trigger.
  • SQLite telemetry bloat occurred during reconnect loops, but clearing telemetry only reduced DB size; it did not fix the disconnect loop.

Relevant behavior:

MeshMonitor connects
→ sends want_config_id
→ large config / NodeDB / NodeInfo stream arrives
→ configComplete
→ post-config scheduler/request block starts
→ TCP session closes or destabilizes
→ reconnect
→ want_config_id repeats
→ large sync repeats

Socket diagnostics showed the MeshMonitor container connection moving from ESTABLISHED to CLOSE-WAIT after failure. Since CLOSE-WAIT was on the MeshMonitor side, this indicates the remote Meshtastic TCP peer closed its side of the TCP session. MeshMonitor then had to recover from the remote close.
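The reasoning behind the CLOSE-WAIT observation can be written down as a small lookup. This is a sketch based on the standard TCP state machine, not MeshMonitor code; `closeInitiator` is a hypothetical name, and the input is the state reported for the local end of the socket.

```typescript
// Hypothetical helper: infers which side started closing a TCP session
// from the state observed on the LOCAL socket (standard TCP state machine).
type CloseInitiator = "remote" | "local" | "both" | "unknown";

function closeInitiator(localState: string): CloseInitiator {
  // Accept both "CLOSE-WAIT" and "CLOSE_WAIT" spellings.
  switch (localState.toUpperCase().replace(/_/g, "-")) {
    case "CLOSE-WAIT": // remote sent FIN; local application has not closed yet
    case "LAST-ACK": // remote sent FIN first; local then closed and awaits the final ACK
      return "remote";
    case "FIN-WAIT-1": // local sent FIN first
    case "FIN-WAIT-2":
    case "TIME-WAIT": // typically reached by the side that closed first
      return "local";
    case "CLOSING": // both sides sent FIN at about the same time
      return "both";
    default: // e.g. ESTABLISHED: no close in progress
      return "unknown";
  }
}
```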

I do not want to overstate the cause: I am not claiming MeshMonitor directly causes the node to close the socket. The safer observation is that the remote peer appears to close the TCP session during or shortly after MeshMonitor’s full config/NodeDB sync. The working theory is that repeated full syncs and post-config requests from MeshMonitor trigger or amplify this behavior on large nodes.

During testing, several local patches were tried. These are not submitted as a PR yet, but may help point to a proper upstream design.

Patches that did not fully solve it:

  • Disabling virtual-node init replay cache accumulation reduced memory/cache work but did not stop disconnects.
  • SQLite telemetry cleanup reduced DB size but did not fix the root disconnect behavior.

Patches that materially helped:

  1. Skip virtual-node init replay cache accumulation for this large source.
  2. Skip post-config scheduler/outbound config requests for this source:
    • requestConfig(LoRa)
    • requestAllModuleConfigs
    • other staggered post-config jobs
  3. Preserve cached local node/config state across passive reconnects instead of treating each disconnect as a full cold start.
  4. Send want_config_id only in a controlled way:
    • First full config sync allowed.
    • Reconnects avoid immediate repeated full sync unless stale/recovery logic requires it.
    • Rate-limit controlled reconnect sync attempts.
  5. Add heartbeat/liveness behavior so a quiet source is not treated as stale just because no mesh traffic arrived.
  6. Add fast initial reconnect for the first post-sync drop, then restore normal reconnect backoff after startup settles.
  7. For channel chat display, use DB arrival time (createdAt) for channel-message ordering instead of remote packet timestamp, because some received packets had future-skewed device timestamps. This fixed cases where traffic monitor showed new messages but channel chat appeared stuck behind an old future-dated message.
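The controlled want_config_id behavior in the list above could be expressed as a single decision function. This is a minimal sketch of the idea, not the local patch itself: `SyncState`, `SyncPolicy`, `shouldSendWantConfig`, and all field names and thresholds are assumptions made for illustration.

```typescript
// Hypothetical types and names; only the policy itself comes from this report.
interface SyncState {
  hasCompletedFirstSync: boolean; // true once configComplete was seen at least once
  lastSyncRequestMs: number | null; // when want_config_id was last sent, null if never
  cachedConfigAgeMs: number | null; // age of the preserved config, null if nothing is cached
}

interface SyncPolicy {
  minSyncIntervalMs: number; // rate limit between full sync requests
  maxCachedConfigAgeMs: number; // cached state older than this counts as stale
}

function shouldSendWantConfig(state: SyncState, policy: SyncPolicy, nowMs: number): boolean {
  // The first full config sync is always allowed.
  if (state.lastSyncRequestMs === null) return true;
  // Rate-limit every later attempt, including recovery attempts.
  if (nowMs - state.lastSyncRequestMs < policy.minSyncIntervalMs) return false;
  // Recovery: no completed sync or no usable cache, so a full sync is needed.
  if (!state.hasCompletedFirstSync || state.cachedConfigAgeMs === null) return true;
  // Otherwise resync only when the preserved state has gone stale.
  return state.cachedConfigAgeMs > policy.maxCachedConfigAgeMs;
}
```

Checking the rate limit before the recovery branch means a node that drops the session mid-sync is not immediately asked for another full NodeDB stream.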

The most useful shape seems to be a per-source “Passive Mode” / “Large TCP Node Mode,” not a global behavior change.

Suggested feature behavior:

Source setting: Passive Mode

When enabled for a Meshtastic TCP source:
- Disable init replay cache accumulation unless Virtual Node actually needs it.
- Avoid post-config outbound scheduler/config backup requests.
- Preserve cached local node/config state across reconnects.
- Do not send repeated want_config_id on every reconnect.
- Allow first full config sync, then controlled/rate-limited recovery sync.
- Optional heartbeat/liveness probe.
- Optional short initial reconnect delay during startup, then normal reconnect backoff.
- Prefer DB arrival time for channel chat ordering to avoid future-skewed packet timestamps pinning old messages at the bottom.
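The arrival-time ordering in the last item can be illustrated with a short sketch. Only `createdAt` is named in this report; the `ChannelMessage` shape, the millisecond fields, and `orderForChat` are hypothetical.

```typescript
// Hypothetical message shape; MeshMonitor's real schema may differ.
interface ChannelMessage {
  id: string;
  text: string;
  packetTimestampMs: number; // device-reported time, may be skewed into the future
  createdAtMs: number; // time the row was written to the local DB
}

// Orders oldest to newest by DB arrival time; id breaks ties deterministically.
function orderForChat(messages: ChannelMessage[]): ChannelMessage[] {
  return [...messages].sort((a, b) => {
    if (a.createdAtMs !== b.createdAtMs) return a.createdAtMs - b.createdAtMs;
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
  });
}
```

With this ordering, a packet whose device clock is days ahead sorts by when it actually arrived, so later messages still appear after it.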

This should be per-source because smaller/normal nodes probably should keep the current stock behavior.
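One way to keep stock behavior as the default is to derive the individual behaviors from a single per-source flag. This is a sketch under stated assumptions: `SourceSettings`, `PassiveModeOptions`, `resolvePassiveOptions`, and every field name are hypothetical and do not reflect MeshMonitor's actual source schema.

```typescript
// Hypothetical per-source settings; not MeshMonitor's real configuration schema.
interface SourceSettings {
  passiveMode?: boolean; // unset means stock behavior
  virtualNodeEnabled?: boolean;
}

interface PassiveModeOptions {
  skipInitReplayCache: boolean;
  skipPostConfigRequests: boolean;
  preserveCachedStateOnReconnect: boolean;
  controlledWantConfig: boolean;
  orderChatByArrivalTime: boolean;
}

function resolvePassiveOptions(source: SourceSettings): PassiveModeOptions {
  const passive = source.passiveMode === true;
  return {
    // The replay cache is only skipped when Virtual Node does not need it.
    skipInitReplayCache: passive && source.virtualNodeEnabled !== true,
    skipPostConfigRequests: passive,
    preserveCachedStateOnReconnect: passive,
    controlledWantConfig: passive,
    orderChatByArrivalTime: passive,
  };
}
```

Sources that never set the flag resolve to all-false options, so smaller nodes see no change.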

Current local proof-of-concept status:

  • A local patched image using per-source Passive Mode has been running substantially better against the large node.
  • The node still may close the initial TCP session after the first config sync, but a short startup reconnect grace period makes recovery much faster.
  • After recovery, MeshMonitor is generally usable and message receive behavior is significantly improved.
  • More soak time is planned before attempting a clean branch/PR, and I still need to compare against newer MeshMonitor versions such as 4.6.3+.
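The startup reconnect grace period mentioned above might look like the following sketch. It is not the proof-of-concept code: `ReconnectPolicy`, `reconnectDelayMs`, and all values are assumptions, and exponential backoff is used here only as a stand-in for whatever "normal reconnect backoff" means in MeshMonitor.

```typescript
// Hypothetical policy; field names and the backoff shape are assumptions.
interface ReconnectPolicy {
  startupGraceMs: number; // window after the first configComplete that gets a fast retry
  fastDelayMs: number; // delay for the first drop inside the grace window
  baseDelayMs: number; // starting delay for normal backoff
  maxDelayMs: number; // upper bound for normal backoff
}

// attempt: number of consecutive failed reconnects so far (0 for the first retry).
// msSinceFirstSync: time since the first configComplete, or null if none happened yet.
function reconnectDelayMs(
  attempt: number,
  msSinceFirstSync: number | null,
  policy: ReconnectPolicy,
): number {
  const inGrace = msSinceFirstSync !== null && msSinceFirstSync <= policy.startupGraceMs;
  if (attempt === 0 && inGrace) {
    return policy.fastDelayMs; // first post-sync drop: retry quickly
  }
  // Otherwise use capped exponential backoff.
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, attempt));
}
```

Limiting the fast path to the first attempt keeps a node that keeps closing the session from being hammered with rapid reconnects.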

Related upstream Meshtastic firmware issue that may be relevant:
meshtastic/firmware#10101

If maintainers prefer a different architecture, I am happy to adapt the proof-of-concept into whatever source-level setting or transport behavior would fit MeshMonitor best.
