Skip to content

fix(Router): localize p_encrypted to prevent recursive-overwrite leak#10311

Merged
thebentern merged 2 commits into
meshtastic:developfrom
nightjoker7:fix-router-pencrypted-recursion
Apr 27, 2026
Merged

fix(Router): localize p_encrypted to prevent recursive-overwrite leak#10311
thebentern merged 2 commits into
meshtastic:developfrom
nightjoker7:fix-router-pencrypted-recursion

Conversation

@nightjoker7

Copy link
Copy Markdown
Contributor

What this changes

Promotes Router::p_encrypted from a class member to a function-local in handleReceived(), so each invocation owns its own decrypt copy for its full lifetime. ~5 lines of functional change; no behavioral change on the non-recursive path.

Why

Router::handleReceived() re-enters itself: callModules() invokes module replies that go back through MeshService::sendToMesh -> Router::sendLocal, and on a broadcast reply that recursively calls handleReceived(). The inner call writes its own packetPool.allocCopy(*p) into this->p_encrypted, dropping the outer call's pointer with the allocation still live. The inner call also nulls the member on its way out, so when the outer call reaches packetPool.release(p_encrypted) it releases nullptr. The outer's allocation is permanently leaked. ~381 B per recursion event.

Copilot's review of #8999 (the original introduction of the member) flagged this exact pattern at merge time:

Storing p_encrypted as a class member can cause issues with recursive or concurrent calls to handleReceived() since each call would overwrite the previous packet pointer.

The historical justification — a pending S&F path needing to retain the encrypted copy across calls — was satisfied differently by #9799 (ServerAPI converted to std::unique_ptr + cleanup on connection close), so the member is no longer load-bearing.

How this fits with active issues

The recursion-overwrite is one mechanism behind:

All three share the trigger shape: persistent client + stale-connection timeout reconnects fire while a prior dump is still in flight. This PR closes one path on which that retry storm leaks; it does not claim to close all of them.

Reproduction and evidence

Two Station G2 devices on the same WiFi LAN. Each hammered with a slow-client TCP-API retry storm (open :4403, send want_config_id, read 2.5 s of the ServerAPI dump, close mid-stream, repeat at ~0.6/s). Pure TCP, no LoRa traffic. This is the trigger pattern that #9632 / #10101 attribute to MeshMonitor's STALE_CONNECTION_TIMEOUT=5min reconnect loop.

Build heapFree drift over ~9 min rebootCount delta
this patch -1.5 KB (≈390 B/min, noise floor) 0
stock 2.7.13 -73 KB (8.1 KB/min) +1 (OOM-class crash at t≈9 min)

Stock's heapTotal also contracted by ~11 KB during the test (allocator pool itself shrinking) — characteristic of a leak the heap can't reclaim. The patched build's heapTotal was stable to within 80 B.

The bench A/B without retry-storm trigger (~4,000 broadcast text sends each, with and without MQTT bridge re-injection) shows no drift on either build — consistent with the corpus, since the recursion window closes when the client is fast.

Why the diff is small

This matches the merge pattern of the recently landed leak fixes (#9799, #9693, #5901, #5328, #9781). Deliberately not bundled with any allocator restructure or other concerns.

Risk

Minimal. Lifetime of the function-local is identical to the prior member usage inside handleReceived (allocated at entry, released before return). All references to p_encrypted were already confined to that function body — verified by grep across src/. No subclass, friend class, or external module accesses the field. The pre-existing null-check at the MQTT publish site (added by #9136 as a workaround) stays in place because allocCopy can still legitimately return nullptr on packetPool exhaustion.

Router::handleReceived stores its allocCopy of the encrypted packet in the
class member p_encrypted. callModules() invokes module replies that re-enter
the router via MeshService::sendToMesh -> Router::sendLocal, which on a
broadcast reply recursively calls handleReceived. The inner call overwrites
the outer's p_encrypted without releasing it; on the way out it nulls the
member, the outer release(p_encrypted) now releases nullptr, and the original
allocation is permanently leaked. ~381 B per recursion.

Promote p_encrypted to a function-local so each invocation owns its own copy
for its full lifetime. The MQTT-publish null check at the call site (added by
PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can
still legitimately return nullptr on packetPool exhaustion.

Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact
pattern at merge time:
  "Storing p_encrypted as a class member can cause issues with recursive or
  concurrent calls to handleReceived() since each call would overwrite the
  previous packet pointer."

The historical reason for the member (S&F needing to retain the encrypted
copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted
to std::unique_ptr + cleanup on connection close), so the member is no longer
load-bearing.

Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor
connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort).

Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403,
request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run:

  | Build         | heapFree drift | rebootCount delta |
  | this patch    | -1.5 KB (noise)| 0                 |
  | stock 2.7.13  | -73 KB (8.1KB/min) | +1 (OOM crash) |

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label Apr 26, 2026
@thebentern thebentern requested a review from Copilot April 27, 2026 01:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a confirmed heap leak in the mesh Router receive path by making the encrypted packet copy (p_encrypted) invocation-scoped, preventing recursive calls to Router::handleReceived() from overwriting the pointer and skipping the corresponding packetPool.release().

Changes:

  • Remove Router::p_encrypted as a class member in favor of a function-local variable in handleReceived().
  • Keep the allocation/release lifetime within handleReceived() while making it safe under re-entrant execution.
  • Simplify teardown by relying on packetPool.release(nullptr) being a safe no-op.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/mesh/Router.h Removes the p_encrypted member to eliminate shared mutable state across handleReceived() invocations.
src/mesh/Router.cpp Introduces a function-local p_encrypted copy and releases it at function exit, preventing recursive overwrite leaks during broadcast-induced re-entry.

@thebentern thebentern merged commit bfadf0c into meshtastic:develop Apr 27, 2026
81 of 82 checks passed
thebentern added a commit that referenced this pull request Apr 27, 2026
…#10311)

Router::handleReceived stores its allocCopy of the encrypted packet in the
class member p_encrypted. callModules() invokes module replies that re-enter
the router via MeshService::sendToMesh -> Router::sendLocal, which on a
broadcast reply recursively calls handleReceived. The inner call overwrites
the outer's p_encrypted without releasing it; on the way out it nulls the
member, the outer release(p_encrypted) now releases nullptr, and the original
allocation is permanently leaked. ~381 B per recursion.

Promote p_encrypted to a function-local so each invocation owns its own copy
for its full lifetime. The MQTT-publish null check at the call site (added by
PR #9136 as a workaround for this bug) stays in place because allocCopy can
still legitimately return nullptr on packetPool exhaustion.

Copilot's review of PR #8999 (the original introduction) flagged this exact
pattern at merge time:
  "Storing p_encrypted as a class member can cause issues with recursive or
  concurrent calls to handleReceived() since each call would overwrite the
  previous packet pointer."

The historical reason for the member (S&F needing to retain the encrypted
copy across calls) was satisfied differently by PR #9799 (ServerAPI converted
to std::unique_ptr + cleanup on connection close), so the member is no longer
load-bearing.

Reproduces issues #9632 / #10101 / #8729 (heap leak when MeshMonitor
connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort).

Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403,
request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run:

  | Build         | heapFree drift | rebootCount delta |
  | this patch    | -1.5 KB (noise)| 0                 |
  | stock 2.7.13  | -73 KB (8.1KB/min) | +1 (OOM crash) |

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Ben Meadors <benmmeadors@gmail.com>
nightjoker7 added a commit to nightjoker7/firmware that referenced this pull request Apr 27, 2026
…mLock

concurrency::Lock has no destructor — xSemaphoreCreateBinary is never paired
with vSemaphoreDelete (commit 0a6059b, 2020-07-05). For instances on a
churning lifecycle, that leaks ~80 B per destruction. The two members
covered here are PhoneAPI::nodeInfoMutex and StreamAPI::streamLock; both
are recreated on every TCP API stream connect/disconnect cycle, which
makes the leak measurable under MeshMonitor / web-flasher / CLI-loop
reconnect cadence (~12 reconnects/hour × 80 B × 2 locks = ~23 KB/day).

std::mutex is the right primitive here: the C++ standard library
contract guarantees a proper RAII destructor that releases the underlying
synchronization handle. Replacing per-instance concurrency::Lock with
per-instance std::mutex closes the leak without introducing the
cross-transport contention that the previous file-scope-static workaround
caused (Copilot's review concern in this PR's earlier revision —
hardware-measured 3.5x latency regression on Heltec V4).

Net change: 4 files, 22 insertions, 16 deletions.

Hardware-validated:
- Heltec V4 TFT (ESP32-S3 + 2 MB PSRAM): NodeInfo response latency
  1.1s avg vs 26.5s with file-scope-static lock-as-globals (24x faster)
- Station G2 (ESP32-S3 + 8 MB PSRAM): DM round-trips clean,
  traceroute clean
- Heltec V3 (ESP32-S3, no PSRAM, ~70 KB free heap idle): boots
  cleanly, normal init, no crash on tight-RAM target

The leak path itself is unambiguous from std::mutex's RAII destructor
contract; Lock-only-fix vs combined-fix bench A/B is documented in
companion meshtastic#10311 (already merged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Evil8it pushed a commit to Evil8it/ME4TACTNK that referenced this pull request Jun 10, 2026
…meshtastic#10311)

Router::handleReceived stores its allocCopy of the encrypted packet in the
class member p_encrypted. callModules() invokes module replies that re-enter
the router via MeshService::sendToMesh -> Router::sendLocal, which on a
broadcast reply recursively calls handleReceived. The inner call overwrites
the outer's p_encrypted without releasing it; on the way out it nulls the
member, the outer release(p_encrypted) now releases nullptr, and the original
allocation is permanently leaked. ~381 B per recursion.

Promote p_encrypted to a function-local so each invocation owns its own copy
for its full lifetime. The MQTT-publish null check at the call site (added by
PR meshtastic#9136 as a workaround for this bug) stays in place because allocCopy can
still legitimately return nullptr on packetPool exhaustion.

Copilot's review of PR meshtastic#8999 (the original introduction) flagged this exact
pattern at merge time:
  "Storing p_encrypted as a class member can cause issues with recursive or
  concurrent calls to handleReceived() since each call would overwrite the
  previous packet pointer."

The historical reason for the member (S&F needing to retain the encrypted
copy across calls) was satisfied differently by PR meshtastic#9799 (ServerAPI converted
to std::unique_ptr + cleanup on connection close), so the member is no longer
load-bearing.

Reproduces issues meshtastic#9632 / meshtastic#10101 / meshtastic#8729 (heap leak when MeshMonitor
connected; TCP drops on Station G2 / LILYGO ServerAPI dump abort).

Hardware A/B on Station G2 under sustained TCP-API retry storm (open :4403,
request config, disconnect mid-stream, repeat at ~0.6/s) - 9 min run:

  | Build         | heapFree drift | rebootCount delta |
  | this patch    | -1.5 KB (noise)| 0                 |
  | stock 2.7.13  | -73 KB (8.1KB/min) | +1 (OOM crash) |

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Ben Meadors <benmmeadors@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants