fix(Router): localize p_encrypted to prevent recursive-overwrite leak#10311
Merged
thebentern merged 2 commits intoApr 27, 2026
Merged
Conversation
Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a confirmed heap leak in the mesh Router receive path by making the encrypted packet copy (p_encrypted) invocation-scoped, preventing recursive calls to Router::handleReceived() from overwriting the pointer and skipping the corresponding packetPool.release().
Changes:
- Remove
Router::p_encryptedas a class member in favor of a function-local variable inhandleReceived(). - Keep the allocation/release lifetime within
handleReceived()while making it safe under re-entrant execution. - Simplify teardown by relying on
packetPool.release(nullptr)being a safe no-op.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/mesh/Router.h |
Removes the p_encrypted member to eliminate shared mutable state across handleReceived() invocations. |
src/mesh/Router.cpp |
Introduces a function-local p_encrypted copy and releases it at function exit, preventing recursive overwrite leaks during broadcast-induced re-entry. |
thebentern
added a commit
that referenced
this pull request
Apr 27, 2026
…#10311) Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR #9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR #8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR #9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues #9632 / #10101 / #8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ben Meadors <benmmeadors@gmail.com>
nightjoker7
added a commit
to nightjoker7/firmware
that referenced
this pull request
Apr 27, 2026
…mLock concurrency::Lock has no destructor — xSemaphoreCreateBinary is never paired with vSemaphoreDelete (commit 0a6059b, 2020-07-05). For instances on a churning lifecycle, that leaks ~80 B per destruction. The two members covered here are PhoneAPI::nodeInfoMutex and StreamAPI::streamLock; both are recreated on every TCP API stream connect/disconnect cycle, which makes the leak measurable under MeshMonitor / web-flasher / CLI-loop reconnect cadence (~12 reconnects/hour × 80 B × 2 locks = ~23 KB/day). std::mutex is the right primitive here: the C++ standard library contract guarantees a proper RAII destructor that releases the underlying synchronization handle. Replacing per-instance concurrency::Lock with per-instance std::mutex closes the leak without introducing the cross-transport contention that the previous file-scope-static workaround caused (Copilot's review concern in this PR's earlier revision — hardware-measured 3.5x latency regression on Heltec V4). Net change: 4 files, 22 insertions, 16 deletions. Hardware-validated: - Heltec V4 TFT (ESP32-S3 + 2 MB PSRAM): NodeInfo response latency 1.1s avg vs 26.5s with file-scope-static lock-as-globals (24x faster) - Station G2 (ESP32-S3 + 8 MB PSRAM): DM round-trips clean, traceroute clean - Heltec V3 (ESP32-S3, no PSRAM, ~70 KB free heap idle): boots cleanly, normal init, no crash on tight-RAM target The leak path itself is unambiguous from std::mutex's RAII destructor contract; Lock-only-fix vs combined-fix bench A/B is documented in companion meshtastic#10311 (already merged). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Evil8it
pushed a commit
to Evil8it/ME4TACTNK
that referenced
this pull request
Jun 10, 2026
…meshtastic#10311) Router::handleReceived stores its allocCopy of the encrypted packet in the class member p_encrypted. callModules() invokes module replies that re-enter the router via MeshService::sendToMesh -> Router::sendLocal, which on a broadcast reply recursively calls handleReceived. The inner call overwrites the outer's p_encrypted without releasing it; on the way out it nulls the member, the outer release(p_encrypted) now releases nullptr, and the original allocation is permanently leaked. ~381 B per recursion. Promote p_encrypted to a function-local so each invocation owns its own copy for its full lifetime. The MQTT-publish null check at the call site (added by PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion. Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact pattern at merge time: "Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer." The historical reason for the member (S&F needing to retain the encrypted copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing. Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort). Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403, request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run: | Build | heapFree drift | rebootCount delta | | this patch | -1.5 KB (noise)| 0 | | stock 2.7.13 | -73 KB (8.1KB/min) | +1 (OOM crash) | Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Ben Meadors <benmmeadors@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this changes
Promotes
Router::p_encryptedfrom a class member to a function-local inhandleReceived(), so each invocation owns its own decrypt copy for its full lifetime. ~5 lines of functional change; no behavioral change on the non-recursive path.Why
Router::handleReceived()re-enters itself:callModules()invokes module replies that go back throughMeshService::sendToMesh -> Router::sendLocal, and on a broadcast reply that recursively callshandleReceived(). The inner call writes its ownpacketPool.allocCopy(*p)intothis->p_encrypted, dropping the outer call's pointer with the allocation still live. The inner call also nulls the member on its way out, so when the outer call reachespacketPool.release(p_encrypted)it releasesnullptr. The outer's allocation is permanently leaked. ~381 B per recursion event.Copilot's review of #8999 (the original introduction of the member) flagged this exact pattern at merge time:
The historical justification — a pending S&F path needing to retain the encrypted copy across calls — was satisfied differently by #9799 (ServerAPI converted to
std::unique_ptr+ cleanup on connection close), so the member is no longer load-bearing.How this fits with active issues
The recursion-overwrite is one mechanism behind:
ROUTER_LATE2.7.20)errno: 11 EAGAINtask-handle exhaustion during ServerAPI dump →abort())All three share the trigger shape: persistent client + stale-connection timeout reconnects fire while a prior dump is still in flight. This PR closes one path on which that retry storm leaks; it does not claim to close all of them.
Reproduction and evidence
Two Station G2 devices on the same WiFi LAN. Each hammered with a slow-client TCP-API retry storm (open
:4403, sendwant_config_id, read 2.5 s of the ServerAPI dump, close mid-stream, repeat at ~0.6/s). Pure TCP, no LoRa traffic. This is the trigger pattern that #9632 / #10101 attribute to MeshMonitor'sSTALE_CONNECTION_TIMEOUT=5minreconnect loop.Stock's
heapTotalalso contracted by ~11 KB during the test (allocator pool itself shrinking) — characteristic of a leak the heap can't reclaim. The patched build'sheapTotalwas stable to within 80 B.The bench A/B without retry-storm trigger (~4,000 broadcast text sends each, with and without MQTT bridge re-injection) shows no drift on either build — consistent with the corpus, since the recursion window closes when the client is fast.
Why the diff is small
This matches the merge pattern of the recently landed leak fixes (#9799, #9693, #5901, #5328, #9781). Deliberately not bundled with any allocator restructure or other concerns.
Risk
Minimal. Lifetime of the function-local is identical to the prior member usage inside
handleReceived(allocated at entry, released before return). All references top_encryptedwere already confined to that function body — verified by grep acrosssrc/. No subclass, friend class, or external module accesses the field. The pre-existing null-check at the MQTT publish site (added by #9136 as a workaround) stays in place becauseallocCopycan still legitimately returnnullptronpacketPoolexhaustion.