fix(Router): localize p_encrypted to prevent recursive-overwrite leak by nightjoker7 · Pull Request #10311 · meshtastic/firmware

nightjoker7 · 2026-04-26T14:48:19Z

What this changes

Promotes Router::p_encrypted from a class member to a function-local in handleReceived(), so each invocation owns its own decrypt copy for its full lifetime. ~5 lines of functional change; no behavioral change on the non-recursive path.

Why

Router::handleReceived() re-enters itself: callModules() invokes module replies that go back through MeshService::sendToMesh -> Router::sendLocal, and on a broadcast reply that recursively calls handleReceived(). The inner call writes its own packetPool.allocCopy(*p) into this->p_encrypted, dropping the outer call's pointer with the allocation still live. The inner call also nulls the member on its way out, so when the outer call reaches packetPool.release(p_encrypted) it releases nullptr. The outer's allocation is permanently leaked. ~381 B per recursion event.

Copilot's review of #8999 (the original introduction of the member) flagged this exact pattern at merge time:

Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer.

The historical justification — a pending S&F path needing to retain the encrypted copy across calls — was satisfied differently by #9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing.

How this fits with active issues

The recursion-overwrite is one mechanism behind:

[Bug]: Heap memory leak when enabling MeshMonitor on Heltec V4 (firmware 2.7.15 and 2.7.19) #9632 — Heltec V4 heap leak when MeshMonitor enabled
[Bug]: Station G2 TCP connection drops repeatedly when MeshMonitor is connected (2.7.21 alpha) #10101 — Station G2 TCP drops repeatedly when MeshMonitor connected (ROUTER_LATE 2.7.20)
[Bug]: Configuring wifi on LILYGO T-LoRa V2.1-1.6 makes node unreachable for clients #8729 — LILYGO unreachable after wifi config (errno: 11 EAGAIN task-handle exhaustion during ServerAPI dump → abort())

All three share the trigger shape: persistent client + stale-connection timeout reconnects fire while a prior dump is still in flight. This PR closes one path on which that retry storm leaks; it does not claim to close all of them.

Reproduction and evidence

Two Station G2 devices on the same WiFi LAN. Each hammered with a slow-client TCP-API retry storm (open :4403, send want_config_id, read 2.5 s of the ServerAPI dump, close mid-stream, repeat at ~0.6/s). Pure TCP, no LoRa traffic. This is the trigger pattern that #9632 / #10101 attribute to MeshMonitor's STALE_CONNECTION_TIMEOUT=5min reconnect loop.

Build	heapFree drift over ~9 min	rebootCount delta
this patch	-1.5 KB (≈390 B/min, noise floor)	0
stock 2.7.13	-73 KB (8.1 KB/min)	+1 (OOM-class crash at t≈9 min)

Stock's heapTotal also contracted by ~11 KB during the test (allocator pool itself shrinking) — characteristic of a leak the heap can't reclaim. The patched build's heapTotal was stable to within 80 B.

The bench A/B without retry-storm trigger (~4,000 broadcast text sends each, with and without MQTT bridge re-injection) shows no drift on either build — consistent with the corpus, since the recursion window closes when the client is fast.

Why the diff is small

This matches the merge pattern of the recently landed leak fixes (#9799, #9693, #5901, #5328, #9781). Deliberately not bundled with any allocator restructure or other concerns.

Risk

Minimal. Lifetime of the function-local is identical to the prior member usage inside handleReceived (allocated at entry, released before return). All references to p_encrypted were already confined to that function body — verified by grep across src/. No subclass, friend class, or external module accesses the field. The pre-existing null-check at the MQTT publish site (added by #9136 as a workaround) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion.

Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Fixes a confirmed heap leak in the mesh Router receive path by making the encrypted packet copy (p_encrypted) invocation-scoped, preventing recursive calls to Router::handleReceived() from overwriting the pointer and skipping the corresponding packetPool.release().

Changes:

Remove Router::p_encrypted as a class member in favor of a function-local variable in handleReceived().
Keep the allocation/release lifetime within handleReceived() while making it safe under re-entrant execution.
Simplify teardown by relying on packetPool.release(nullptr) being a safe no-op.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/mesh/Router.h`	Removes the `p_encrypted` member to eliminate shared mutable state across `handleReceived()` invocations.
`src/mesh/Router.cpp`	Introduces a function-local `p_encrypted` copy and releases it at function exit, preventing recursive overwrite leaks during broadcast-induced re-entry.

…#10311) Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR #9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR #8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR #9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues #9632 / #10101 / #8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ben Meadors <benmmeadors@gmail.com>

…mLock concurrency::Lock has no destructor — xSemaphoreCreateBinary is never paired with vSemaphoreDelete (commit 0a6059b, 2020-07-05). For instances on a churning lifecycle, that leaks ~80 B per destruction. The two members covered here are PhoneAPI::nodeInfoMutex and StreamAPI::streamLock; both are recreated on every TCP API stream connect/disconnect cycle, which makes the leak measurable under MeshMonitor / web-flasher / CLI-loop reconnect cadence (~12 reconnects/hour × 80 B × 2 locks = ~23 KB/day). std::mutex is the right primitive here: the C++ standard library contract guarantees a proper RAII destructor that releases the underlying synchronization handle. Replacing per-instance concurrency::Lock with per-instance std::mutex closes the leak without introducing the cross-transport contention that the previous file-scope-static workaround caused (Copilot's review concern in this PR's earlier revision — hardware-measured 3.5x latency regression on Heltec V4). Net change: 4 files, 22 insertions, 16 deletions. Hardware-validated: - Heltec V4 TFT (ESP32-S3 + 2 MB PSRAM): NodeInfo response latency 1.1s avg vs 26.5s with file-scope-static lock-as-globals (24x faster) - Station G2 (ESP32-S3 + 8 MB PSRAM): DM round-trips clean, traceroute clean - Heltec V3 (ESP32-S3, no PSRAM, ~70 KB free heap idle): boots cleanly, normal init, no crash on tight-RAM target The leak path itself is unambiguous from std::mutex's RAII destructor contract; Lock-only-fix vs combined-fix bench A/B is documented in companion meshtastic#10311 (already merged). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…meshtastic#10311) Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ben Meadors <benmmeadors@gmail.com>

github-actions Bot added the bugfix Pull request that fixes bugs label Apr 26, 2026

thebentern requested a review from Copilot April 27, 2026 01:20

Merge branch 'develop' into fix-router-pencrypted-recursion

b270a6f

Copilot started reviewing on behalf of thebentern April 27, 2026 01:20 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

thebentern merged commit bfadf0c into meshtastic:develop Apr 27, 2026
81 of 82 checks passed

nightjoker7 mentioned this pull request Apr 27, 2026

fix(api): use std::mutex for PhoneAPI nodeInfoMutex + StreamAPI streamLock #10315

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(Router): localize p_encrypted to prevent recursive-overwrite leak#10311

fix(Router): localize p_encrypted to prevent recursive-overwrite leak#10311
thebentern merged 2 commits into
meshtastic:developfrom
nightjoker7:fix-router-pencrypted-recursion

nightjoker7 commented Apr 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

nightjoker7 commented Apr 26, 2026

What this changes

Why

How this fits with active issues

Reproduction and evidence

Why the diff is small

Risk

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants