Skip to content

fix(virtual-node): stop creating zombie nodes on map (#2602)#2615

Merged
Yeraze merged 1 commit into
mainfrom
fix/2602-virtual-node-zombies
Apr 9, 2026
Merged

fix(virtual-node): stop creating zombie nodes on map (#2602)#2615
Yeraze merged 1 commit into
mainfrom
fix/2602-virtual-node-zombies

Conversation

@Yeraze

@Yeraze Yeraze commented Apr 9, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #2602.

Virtual Node Server (TCP :4404) was shipping synthetic / indirectly-discovered nodes to connected Meshtastic apps, where they showed up on the map/node list as "zombies" that could not be deleted — the apps' delete command targets the physical node's NodeDB, which never contained them in the first place.

Root cause: four independent code paths in meshtasticManager.ts were stamping a fresh lastHeard = Date.now()/1000 on rows that represent nodes we have never directly heard from. That stamp made them pass the getActiveNodes activity filter and leak out through virtualNodeServer.sendNodeInfosFromDb.

Changes

1. Stop stamping lastHeard on indirect / synthetic nodes

  • Broadcast pseudo-node (!ffffffff) — row is still created (required as FK target for messages.toNodeNum) but no longer carries lastHeard. meshtasticManager.ts:~4230
  • Traceroute intermediate hops — known hops are now skipped completely (no upsert → can't clobber name OR update lastHeard from relay activity). Unknown hops still get a stub row so name lookups resolve, but with NULL lastHeard. meshtasticManager.ts:~5300
  • estimateIntermediatePositions — FK-satisfaction stubs created here also no longer stamp lastHeard. meshtasticManager.ts:~5996
  • NeighborInfo neighbors-of-neighbors — placeholder rows for nodes only the reporter has heard no longer stamp lastHeard. meshtasticManager.ts:~6190

2. Virtual Node Server hardening

  • sendNodeInfosFromDb now scopes its getActiveNodes query to the manager's sourceId (multi-source isolation for 4.0)
  • Defensive filter: never ship rows with nodeNum === 0xFFFFFFFF or nodeId === '!ffffffff', even if pre-existing installs still have lastHeard stamped on the broadcast row

3. Ack-and-drop removeByNodenum

When a connected client sends an AdminMessage { removeByNodenum }:

  • Intercepted before the existing allow/block flow, so it fires regardless of allowAdminCommands
  • New helper meshtasticProtobufService.createFakeRoutingAck() builds a FromRadio { MeshPacket { Data { portnum=ROUTING_APP, Routing { errorReason=NONE }, requestId } } }
  • Sent only to the requesting client (no broadcast ghost acks)
  • Never forwarded to the physical node

This matches the project's existing precedent (issue comment): "Reset Node DB" is already disabled on the VN path because of the damage it can cause. Node deletion follows the same rule — it's a MeshMonitor-UI-only operation now.

Tests

  • New: src/server/virtualNodeServer.zombieFix.test.ts (8 tests, all passing)

    • Broadcast pseudo-node filter (by nodeNum AND nodeId)
    • sourceId scoping on getActiveNodes
    • Default 24h fallback when maxNodeAgeHours setting missing
    • removeByNodenum intercept under allowAdminCommands=false AND =true
    • setFavoriteNode (another admin op) is left alone → passes through normally
    • Graceful fallthrough on AdminMessage decode failure
  • Rewritten: src/server/meshtasticManager.traceroute-hops.test.ts (6 tests, all passing)

  • Full vitest suite: 4179 passed, 0 failed (213 test files, ~4min)

  • TypeScript tsc --noEmit: clean

System tests

./tests/system-tests.sh results:

Test Result Related to this fix?
Configuration Import
Quick Start No — channel 0 naming assertion (tests/test-quick-start.sh:393: Expected Channel 0 to be unnamed, got 'primary') — no channel-handling code was touched in this PR
Security No — tied to Quick Start result (tests/system-tests.sh:211)
V1 API
Reverse Proxy NoNo response received after 3 attempts nginx/proxy layer — no HTTP routing code was touched in this PR
Reverse Proxy + OIDC
Virtual Node CLI Yes — directly exercises this PR's code
Backup & Restore
Database Migration
DB Backing Consistency
API Exercise (3 DBs: sqlite + postgres + mysql)

The Virtual Node CLI system test (the one that directly covers the changed surface) passed. The 3 failing tests exercise code paths that are not in this PR's diff (channel name parsing, nginx reverse proxy HTTP response). Flagging them here rather than silently proceeding — happy to investigate in a separate PR if desired.

Test plan

  • Connect a Meshtastic app to the Virtual Node Server and confirm the map no longer contains !ffffffff, NeighborInfo placeholders, or traceroute-relay-only stubs
  • Attempt "delete node" from the connected app → verify the app's UI acks immediately and nothing changes on the physical device or in MeshMonitor
  • Confirm dashboard still shows real peers (direct packets continue to update lastHeard via the normal processMeshPacket path)
  • Trigger a traceroute → verify unknown hops appear in the DB with NULL lastHeard and do NOT appear in the VN's node list

🤖 Generated with Claude Code

Three independent code paths in meshtasticManager were stamping a fresh
`lastHeard` on synthetic / indirectly-discovered node rows, causing them
to pass the activity-time filter in `getActiveNodes` and leak to
connected Meshtastic clients via `sendNodeInfosFromDb`. The mobile app
then rendered them on its node list and map as "zombies" the user could
not delete (they don't exist in the physical node's NodeDB).

Fixes by source:

1. `processTextMessageProtobuf` broadcast pseudo-node creation: keep the
   `!ffffffff` row (FK target for `messages.toNodeNum`) but never stamp
   `lastHeard`. The row now stays out of `getActiveNodes` entirely.

2. `processTracerouteMessage` intermediate hop loop: known hops are now
   skipped completely (no upsert at all) so we don't accidentally clobber
   their `longName`/`shortName` *or* update their `lastHeard` from a
   relay event. Unknown hops still get a stub row so future name lookups
   resolve, but with NULL `lastHeard`. This intentionally reverses the
   behavior from #2610 — surfacing relay-only hops on the dashboard was
   the wrong tradeoff once we saw the same row was leaking to virtual
   node clients.

3. `estimateIntermediatePositions`: same fix — intermediate-hop stub
   rows created here for FK satisfaction also lose their `lastHeard`
   stamp.

4. NeighborInfo neighbors-of-neighbors creation in `processNeighborInfo`:
   placeholder rows for nodes the reporter has heard but the local node
   has not no longer get a `lastHeard` stamp.

5. `virtualNodeServer.sendNodeInfosFromDb` now scopes its query to the
   manager's `sourceId` (multi-source isolation) and defensively filters
   out any row with nodeNum 0xFFFFFFFF or nodeId `!ffffffff` to handle
   pre-existing installs whose broadcast row is already stamped.

6. `virtualNodeServer.handleClientMessage` intercepts AdminMessage
   `removeByNodenum` from connected clients, fabricates a routing-ack
   with errorReason=NONE so the client UI doesn't hang, and never
   forwards the request to the physical node. Per the issue resolution,
   node deletion is a MeshMonitor-only operation regardless of
   `allowAdminCommands`. New helper `meshtasticProtobufService.createFakeRoutingAck`
   builds the FromRadio response.

Tests:
- New `virtualNodeServer.zombieFix.test.ts` (8 tests): broadcast filter,
  textual !ffffffff filter, sourceId scoping, default 24h fallback,
  removeByNodenum intercept under both `allowAdminCommands` modes,
  setFavoriteNode passthrough, decode-failure fallthrough.
- Rewrote `meshtasticManager.traceroute-hops.test.ts` (6 tests) to
  assert the new behavior: known hops untouched, unknown hops stubbed
  WITHOUT lastHeard, reserved values still filtered. Mocks now track
  upsert state so downstream `estimateIntermediatePositions` calls
  resolve to the existing stub instead of duplicating it.

Closes #2602.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 9, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


Code Review: Fix Virtual Node Zombies (#2602)

Overall Assessment: Excellent

This is a well-architected fix for a complex multi-system issue. The root cause analysis is thorough, the solution is surgical, and the implementation demonstrates strong software engineering practices.


🔍 Code Quality & Best Practices

Strengths:

  • Excellent documentation: Every code path includes detailed comments explaining the behavior change and rationale (e.g., meshtasticManager.ts:4234-4239)
  • Defensive programming: Multiple layers of protection in virtualNodeServer.ts:775-778 (both nodeNum === BROADCAST_NODE_NUM and nodeId === '!ffffffff')
  • Clean separation of concerns: Virtual Node hardening vs. data generation fixes are handled in appropriate modules
  • Comprehensive test coverage: New dedicated test file with 8 focused test cases plus rewritten existing tests

Minor observations:

  • Code consistently follows established patterns and error handling conventions
  • Variable naming is clear and intention-revealing (intermediateHops, BROADCAST_NODE_NUM)
  • Proper async/await usage throughout

🐛 Bug Analysis

No bugs identified. The implementation correctly addresses all aspects of the zombie node issue:

  1. Root cause fixed: Four code paths no longer stamp lastHeard on synthetic nodes (meshtasticManager.ts:4248-4250, 5338-5343, 6002-6007, 6205-6211)
  2. Defensive filtering: virtualNodeServer.ts:775-778 prevents any pre-existing zombie rows from leaking through
  3. Source isolation: getActiveNodes now properly scoped to sourceId (virtualNodeServer.ts:765)

Edge case handling:

  • Gracefully handles existing installations that may have pre-stamped broadcast rows
  • Properly scopes to source ID for multi-source deployments
  • Handles AdminMessage decode failures gracefully

🔒 Security Analysis

Strong security posture:

  1. Admin command filtering strengthened: removeByNodenum is now blocked regardless of allowAdminCommands setting
  2. Fake ACK implementation: meshtasticProtobufService.ts:1340-1380 creates proper routing acknowledgments without exposing internal state
  3. Input validation: AdminMessage decoding includes try/catch blocks with safe defaults
  4. Principle of least privilege: Virtual Node Server continues to block dangerous operations by default

No security concerns identified.


Performance Considerations

Positive impact:

  • Reduced database load: Skipping known nodes in traceroute processing (meshtasticManager.ts:5329-5334) avoids unnecessary upserts
  • Efficient filtering: Broadcast node filter happens early in the loop (virtualNodeServer.ts:776-778)
  • Minimal overhead: Source ID scoping adds one parameter to existing query

No performance regressions identified.


🧪 Test Coverage Analysis

Excellent coverage:

  1. New comprehensive test suite: virtualNodeServer.zombieFix.test.ts with 8 targeted test cases
  2. Integration scenarios: Tests both allowAdminCommands=false and =true configurations
  3. Edge case coverage: Broadcast node filtering, source ID scoping, decode failures
  4. Behavior preservation: Ensures setFavoriteNode still works while removeByNodenum is blocked
  5. Legacy compatibility: Tests handle existing installations with pre-stamped zombie data

Rewritten tests properly document behavior change: meshtasticManager.traceroute-hops.test.ts header clearly explains the #2610#2602 tradeoff decision.


🎯 Specific Technical Feedback

  1. meshtasticManager.ts:5320-5334: The approach of skipping known hops entirely is smart - avoids unnecessary database churn while preserving the intent of creating stubs for unknown nodes.

  2. virtualNodeServer.ts:378-408: The removeByNodenum interception logic is well-structured. Using createFakeRoutingAck maintains client compatibility while preventing harmful operations.

  3. meshtasticProtobufService.ts:1340+: New createFakeRoutingAck method follows established patterns and properly constructs protobuf messages.

  4. Test architecture: The mock setup in virtualNodeServer.zombieFix.test.ts is comprehensive and properly isolates the system under test.


🚀 Recommendations

This PR is ready for merge. Consider these minor enhancements for future iterations:

  1. Metrics: Could add counters for blocked removeByNodenum attempts to help administrators understand client behavior
  2. Configuration: The Virtual Node Server could expose zombie filtering as a configurable option (though the current behavior is correct by default)
  3. Documentation: Consider adding this fix to architecture documentation as an example of multi-system issue resolution

📊 System Test Results Assessment

The failing system tests (Quick Start, Security, Reverse Proxy) are correctly identified as unrelated to this PR's changes. The Virtual Node CLI test passed, which is the critical validation for this fix.

Verdict: System test results support approval.


Final Recommendation: APPROVE

This PR demonstrates excellent software engineering practices:

  • Thorough root cause analysis and surgical fixes
  • Comprehensive testing with clear behavior documentation
  • Strong security posture with defense-in-depth
  • Performance-conscious implementation
  • Clear documentation and maintainability

The zombie node issue is fully resolved without introducing regressions.

@github-actions

github-actions Bot commented Apr 9, 2026

Copy link
Copy Markdown
System Test Results

MeshMonitor System Test Results

Test Run: 2026-04-09 14:23:40 EDT

Test Summary

Test Suite Result
Configuration Import ✅ PASSED
Quick Start Test ❌ FAILED
Security Test ❌ FAILED
V1 API Test ✅ PASSED
Reverse Proxy Test ❌ FAILED
Reverse Proxy + OIDC ✅ PASSED
Virtual Node CLI Test ✅ PASSED
Backup & Restore Test ✅ PASSED
Database Migration Test ✅ PASSED
DB Backing Consistency ✅ PASSED
API Exercise (3 DBs) ✅ PASSED

❌ Overall Result: FAILED

Some tests failed. Please review the failures above and fix before creating/updating PR.

Failed Tests

  • Quick Start Test: Zero-config deployment test failed
  • Security Test: API endpoint security test failed
  • Reverse Proxy Test: Production HTTPS deployment test failed

@github-actions

github-actions Bot commented Apr 9, 2026

Copy link
Copy Markdown
System Test Results

MeshMonitor System Test Results

Test Run: 2026-04-09 14:42:44 EDT

Test Summary

Test Suite Result
Configuration Import ✅ PASSED
Quick Start Test ✅ PASSED
Security Test ✅ PASSED
V1 API Test ✅ PASSED
Reverse Proxy Test ✅ PASSED
Reverse Proxy + OIDC ✅ PASSED
Virtual Node CLI Test ✅ PASSED
Backup & Restore Test ✅ PASSED
Database Migration Test ✅ PASSED
DB Backing Consistency ✅ PASSED
API Exercise (3 DBs) ✅ PASSED

✅ Overall Result: PASSED

All deployment configurations are working correctly!

Test Details

Configuration Import:

  • Tests configuration import and device reboot cycle
  • Verifies channel roles, PSKs, and LoRa configuration
  • Note: Channel name verification skipped due to architectural limitation

Quick Start Test:

  • Zero-config deployment (no SESSION_SECRET or COOKIE_SECURE required)
  • HTTP access without HSTS
  • Auto-generated admin user with default credentials
  • Session cookies work over HTTP
  • Meshtastic node connection and message exchange verified

Security Test:

  • Verifies Node IP address hidden from anonymous users in API responses
  • Verifies MQTT configuration hidden from anonymous users
  • Verifies Node IP address visible to authenticated users
  • Verifies MQTT configuration visible to authenticated users
  • Verifies protected endpoints require authentication

V1 API Test:

  • Tests v1 REST API endpoints with Bearer token authentication
  • Verifies Bearer token requests bypass CSRF protection
  • Verifies POST/PUT/DELETE work without CSRF token when using Bearer auth
  • Verifies session-based requests still require CSRF token

Reverse Proxy Test:

  • Production deployment with COOKIE_SECURE=true
  • HTTPS-ready configuration
  • Trust proxy enabled for reverse proxy compatibility
  • CORS configured for HTTPS domain
  • Meshtastic node connection and message exchange verified

Reverse Proxy + OIDC Test:

  • OIDC authentication integration
  • Mock OIDC provider health checks
  • Authorization flow and session creation
  • Hybrid mode (OIDC + local auth)
  • Meshtastic node connection verified

Virtual Node CLI Test:

  • Virtual Node Server enabled on TCP port 4404
  • Meshtastic Python client successfully connects
  • Node data download and synchronization verified
  • Test message sent on gauntlet channel (index 3)
  • Message delivery confirmed via Web UI API
  • Virtual Node Server connection logging verified

Backup & Restore Test:

  • System backup created from running dev container
  • New container spun up with RESTORE_FROM_BACKUP env var
  • Data integrity verified (node count, message count, settings)
  • Restore event logged in audit log
  • Dev container unaffected by restore test

Database Migration Test:

  • SQLite to PostgreSQL migration verified
  • SQLite to MySQL migration verified
  • Data integrity confirmed for both target databases
  • Row counts match between source and target

DB Backing Consistency Test:

  • SQLite, PostgreSQL, and MySQL backends tested with same device
  • Node counts within ±10 across all three backends
  • Favorite counts identical across all backends
  • Key station verified as favorite on all backends

@Yeraze Yeraze merged commit c051ee4 into main Apr 9, 2026
18 of 19 checks passed
Yeraze added a commit that referenced this pull request May 15, 2026
…port is fresh (#3026)

PR #2615 (zombie-node fix) intentionally leaves `lastHeard=NULL` on
placeholder neighbor rows we've only learned about second-hand through a
NeighborInfo packet. That was correct in isolation, but the read-side
filter in both /api/neighbor-info (server.ts) and
/api/sources/:id/neighbor-info (sourceRoutes.ts) rejects any row whose
neighbor.lastHeard is NULL or older than the cutoff, so every
indirect-neighbor link gets dropped on read.

This change keeps the reporter freshness check (we must have directly
heard from the reporting node), but falls back to the NeighborInfo
record's own `timestamp` / firmware-supplied `lastRxTime` when the
neighbor's `lastHeard` is missing or stale. `timestamp` is stored in ms
(set via `Date.now()` in meshtasticManager), `lastRxTime` is firmware
seconds — handled accordingly.

Tests:
- src/server/routes/sourceRoutes.neighbor-info.test.ts — fixture
  timestamp corrected to ms (matches real DB shape); existing
  staleness/null tests retargeted at the reporter side; new cases cover
  neighbor.lastHeard NULL + fresh NI report (kept), NULL + stale NI
  report (dropped), and NULL + fresh lastRxTime (kept).
- src/server/server.neighbor-info-position.test.ts — inline shadow
  filter synced with the real handler so the test mirror does not
  drift.

Refs #3025

Authored by Merlin 🪄

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Yeraze added a commit that referenced this pull request Jun 19, 2026
…Neighbors view (#3560)

The live source map and the Map Analysis page disagreed on which neighbor
links to draw. Both read the same neighbor_info table, but applied different
freshness filters:

- Map Analysis "Neighbors" (/api/analysis/neighbors) windows by the
  NeighborInfo record's own `timestamp` (report recency).
- The live map (/api/sources/:id/neighbor-info and the legacy
  /api/neighbor-info) gated on the endpoint *nodes'* `lastHeard`, so a node's
  stale last-known neighbor list kept showing for as long as the node stayed
  otherwise active (still sending position/telemetry). That surfaced far more
  links than the analysis view and misrepresented how recently each RF link
  was actually observed.

Switch the live-map endpoints to report-time freshness (record `timestamp`
within maxNodeAge), matching the analysis definition. Because each NeighborInfo
packet deletes-then-reinserts the reporter's full list with a fresh timestamp,
a fresh record inherently means we heard the reporter within the window.
Keying off the record timestamp (instead of node `lastHeard`) also naturally
preserves indirect-neighbor links whose neighbor row has a null lastHeard
(#3025/#2615), since lastHeard is no longer consulted. Firmware-supplied
`lastRxTime` is also dropped from the predicate: it is subject to device clock
skew and the analysis view does not use it, so it could otherwise reintroduce
divergence.

Tests updated to assert report-time semantics (stale report dropped even when
endpoint nodes are still active; fresh report kept even when a node's lastHeard
is stale/null).


Claude-Session: https://claude.ai/code/session_01SVGxkuD4Fwa2JGVim8ZeVj

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] 3.12.0 - Virtual Node zombie nodes on map

1 participant