PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation by nightjoker7 · Pull Request #10271 · meshtastic/firmware

nightjoker7 · 2026-04-23T13:58:30Z

Summary

Add a 2000 ms minimum interval between full config-dump handshakes in PhoneAPI::handleStartConfig. A misbehaving or flapping client that reconnects faster than a config dump can complete is currently able to exhaust the internal SRAM heap via fragmentation.

Evidence

Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap, 8 MB PSRAM) running 2.7.23.ef4693d with full DEBUG_HEAP instrumentation:

Meshtastic-python `TCPInterface` client with a reader thread that gets `Connection reset by peer` from a remote under heap pressure → Python side tears down and reconnects → full `want_config_id` re-issued → new `handleStartConfig` every ~1 s.
466 complete reconnect cycles over 85 minutes.
Net heap loss: 108 KB (avg 232 B / median 192 B per cycle).
`heap_free` trajectory: 150 KB → 50 KB over the 85 min, crashed into reboot.
Post-reboot with the storm stopped: heap rock-stable at 145 KB for 11 hours (no continued leak).

Stage-by-stage heap delta per cycle (from log parse over 398 cycles):

Segment	Avg Δ	Median Δ
`wants_config → got_files`	+2,815	+2,052
`force_close → incoming_api`	+1,472	+1,164
`send_myinfo → send_metadata`	-3,556	-3,692
`send_cfg_device → send_cfg_sec`	-501	-568

Most of the loss is heap fragmentation from the repeated alloc/free of mid-sized protobuf encode buffers (MY_INFO, OWN_NODEINFO, METADATA, NodeDB entries) interleaved with longer-lived queue entries. Once the storm stops, the free total does not recover until reboot.

Fix

Add `PhoneAPI::lastStartConfigMs` as a static member (shared across all transport subclasses — BLE, WiFi/TCP, Serial, HTTP, Ethernet) and reject handshakes within 2000 ms of the previous one:

2000 ms chosen to be just longer than a full config dump takes to complete on a saturated (250-node) NodeDB, so legitimate reconnect-after-disconnect flows still work.
Well-behaved clients only issue one `want_config_id` per reconnect so are unaffected.
The throttled path returns early before touching state, observers, or `getFiles` — no resources are leaked per throttled call.
The client either waits and retries or times out and recovers naturally.

Test plan

Builds clean on heltec-v4 (ESP32-S3) — local verification
CI build across all targets
Repro: run a meshtastic-python TCP reconnect loop against a Station G2, confirm heap_free no longer bleeds out

Related work

The underlying fragmentation vulnerability is an architectural issue — large per-session protobuf buffers land in internal SRAM even on boards with 8 MB of idle PSRAM. Separately scoping a PSRAM-routing change for the large allocators; this PR is the narrow defensive fix.

…on reconnect storms A misbehaving or flapping client (e.g. a TCP reader-thread loop that RSTs and reconnects every second) can drive handleStartConfig at roughly 1 Hz. Every handshake exercises the protobuf encode path for: - MY_INFO / OWN_NODEINFO / METADATA - all channels, all configs, all module configs - the entire NodeDB (up to MAX_NUM_NODES = 250 entries) - a recursive filesystem walk via getFiles("/", 10) Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap): - 466 reconnect cycles over 85 minutes - Net heap loss: 108 KB (avg 232 B / median 192 B per cycle) - Dropped heap_free from ~150 KB to ~50 KB, wedged TCP handshake Most of this is heap fragmentation from the repeated alloc/free of mid-sized protobuf encode buffers interleaved with longer-lived state. Once the reconnect storm stops, heap does not recover until reboot. Add a 2000 ms minimum interval between full config dumps, shared across all PhoneAPI transports (BLE / WiFi / Serial / HTTP / Ethernet) via a static member. Well-behaved clients issue one want_config_id per reconnect so are unaffected; a flapping client is throttled to one handshake per ~2 s, which matches roughly the time a single config dump needs to complete. The throttled path returns early without touching state, observers, or running getFiles — so no resources are leaked per throttled call. The client either waits and retries or times out and recovers naturally.

Copilot

Pull request overview

Adds a global rate-limit to PhoneAPI::handleStartConfig() to prevent reconnect storms from repeatedly triggering full config-dump handshakes that fragment internal SRAM heap, improving stability under misbehaving/flapping clients.

Changes:

Introduces a shared PhoneAPI::lastStartConfigMs timestamp used to throttle config-dump handshakes across all PhoneAPI transports.
Enforces a 2000 ms minimum interval between full config-dump handshakes, logging when throttling occurs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
src/mesh/PhoneAPI.h	Adds shared static timestamp used to rate-limit config handshakes across transports.
src/mesh/PhoneAPI.cpp	Implements the handshake throttling check and defines the shared timestamp + interval constant.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@thebentern

Three issues raised in the automated review: 1. **Nonce race**: If a new want_config_id arrives while a previous config dump is still in-flight, the throttled path returned early but config_nonce had already been overwritten in the caller — so sendConfigComplete would return the new nonce to a client that asked for the old one. Fix: pass the incoming nonce as a parameter and only commit it to config_nonce AFTER the throttle check passes. 2. **Data race on lastStartConfigMs**: Shared static accessed without synchronization from BLE, TCP/HTTP, and serial-console tasks. Fix: std::atomic<uint32_t> with compare_exchange_weak so concurrent handshakes across transports can't both win the throttle check. 3. **Minor**: Captured 'now = millis()' once — already addressed by @thebentern's pre-applied autofix. Side benefit: PacketAPI's want_config_id handler had a latent bug where it declared a local 'config_nonce' that shadowed the member, so the member was never updated and handleStartConfig used whatever stale value was previously stored. The signature change forces us to pass the new nonce explicitly, fixing this too.

nightjoker7 · 2026-04-23T20:22:16Z

Thanks @copilot — all three points addressed in 179821e00:

Nonce race — Changed handleStartConfig to take the incoming nonce as a parameter and only commit it to config_nonce after the throttle check passes. An in-flight dump keeps its original nonce so sendConfigComplete matches what the client asked for. Side benefit: this also fixed a latent bug in PacketAPI where uint32_t config_nonce = mr->want_config_id; declared a local that shadowed the member — the member was never being updated, so handshakes from that path used whatever stale value was left over.
Data race on lastStartConfigMs — Changed to std::atomic<uint32_t> with compare_exchange_weak so concurrent handshakes across transports (BLE callback task vs APIServerPort OSThread vs serial-console task) can't both pass the check.
millis() multiple times — Already fixed by @thebentern's pre-applied autofix in 4c8f555a5.

Built clean on heltec-v4. Ready for re-review.

Copilot

Pull request overview

Adds a global rate limit for full config-dump handshakes in PhoneAPI::handleStartConfig to mitigate heap fragmentation/reboot risk caused by rapid reconnect loops across transports.

Changes:

Changes handleStartConfig() to accept and commit the client-provided want_config_id nonce only after passing a global throttle check.
Introduces a shared, cross-transport minimum interval (2000 ms) between config handshakes using a static std::atomic<uint32_t>.
Updates PacketAPI and PhoneAPI ToRadio handlers to pass the nonce into handleStartConfig(...).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
src/mesh/api/PacketAPI.cpp	Passes `want_config_id` nonce into `handleStartConfig(...)` for PacketAPI (screen) clients.
src/mesh/PhoneAPI.h	Declares shared atomic `lastStartConfigMs` and updates `handleStartConfig` signature/docs.
src/mesh/PhoneAPI.cpp	Implements global handshake throttling and commits nonce after throttle passes.

Copilot (follow-up on the atomic rate-limit fix): 1. Backwards-time CAS race: the prior revision captured `now = millis()` once before the CAS loop. If a concurrent task raced past us, our compare_exchange_weak failure path updated `last` to the newer value; looping with a stale `now` let the next CAS commit the older timestamp into the atomic, breaking the rate limit for whichever handshake arrived next. Move `now = millis()` inside the loop so it always advances monotonically. 2. LOG_WARN flood during a reconnect-storm: this branch fires once per throttled handshake attempt, which is exactly the scenario the throttle exists to defend against. Gate the log on a 5s Throttle window so a flapping client can't turn the defensive code into its own log-flood.

Copilot

Pull request overview

This PR adds a global rate limit to PhoneAPI::handleStartConfig to prevent rapid reconnect/config-handshake cycles (across any transport) from repeatedly triggering large config dumps that fragment internal SRAM heap on ESP32-class targets.

Changes:

Add a shared, atomic PhoneAPI::lastStartConfigMs timestamp and enforce a 2000ms minimum interval between accepted config handshakes.
Update handleStartConfig to accept and commit the want_config_id nonce only after passing the throttle check.
Update PacketAPI and PhoneAPI::handleToRadio call sites to pass through the nonce.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`src/mesh/api/PacketAPI.cpp`	Passes `want_config_id` into the updated `handleStartConfig(newConfigNonce)` API.
`src/mesh/PhoneAPI.h`	Introduces atomic shared throttle state and updates the `handleStartConfig` signature/contract.
`src/mesh/PhoneAPI.cpp`	Implements the cross-transport handshake rate limit and throttled logging behavior.

+            static uint32_t lastThrottleLogMs = 0;
+            if (!Throttle::isWithinTimespanMs(lastThrottleLogMs, HANDSHAKE_THROTTLE_LOG_INTERVAL_MS)) {
+                lastThrottleLogMs = now;
+                LOG_WARN("Config handshake throttled (last one %u ms ago, min interval %u ms)",
+                         (unsigned)(now - last), (unsigned)MIN_CONFIG_HANDSHAKE_INTERVAL_MS);
+            }


…light The original throttle returned early on every handshake within the 2s window regardless of state. That stranded fresh-client connections (state == STATE_SEND_NOTHING) and post-dump re-handshake requests (state == STATE_SEND_PACKETS) — the early-return doesn't advance state, so getFromRadio returned nothing and clients sat on "reading config" forever. Symptom: after a successful BLE handshake (which sets static lastStartConfigMs), opening a TCP/IP connection from the Android app within 2 seconds caused the TCP handshake to silently fail. State never advanced, app stuck on "reading config" indefinitely. Reproducible on Heltec V4 TFT 2026-04-27. Restrict the throttle to states [SEND_UIDATA..SEND_COMPLETE_ID], i.e. only when a dump is actually in-flight. Fresh clients and post-dump steady-state re-handshakes pass through. The original anti-fragmentation guarantee is preserved: concurrent overlapping dumps still get caught by the rate limit. Hardware-verified on V4 TFT: pre-fix, second handshake at state=11 was throttled at 570ms; post-fix, second handshake passes through and Android completes config dump cleanly across BLE->TCP transport switches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

nightjoker7 · 2026-04-27T19:31:04Z

Caught a state-machine bug in my own throttle while hardware-testing today on a Heltec V4 TFT: the original early-return path didn't advance the state machine, so when a client issued a second want_config_id while state was either STATE_SEND_NOTHING (fresh client) or STATE_SEND_PACKETS (post-dump steady state), the throttle would silently swallow the request and the client stalled on "reading config" forever.

Reproduced 2026-04-27 during a BLE → TCP transport switch on the Android app: BLE handshake completed → static lastStartConfigMs captured timestamp → switched to TCP → second handshake hit at state=11 (POST_DUMP) within 570ms → throttle returned early → state never advanced → Android stuck.

Fixed in 6d9641fc9: restrict the throttle to states [SEND_UIDATA..SEND_COMPLETE_ID] only — i.e. only when a dump is actually in-flight, which is what the heap-fragmentation guarantee the throttle was meant to enforce actually cares about. Fresh clients and post-dump re-handshakes now pass through unconditionally. The original anti-fragmentation property is preserved: concurrent overlapping dumps from a flapping client still hit the throttle.

Hardware-verified on V4 TFT both before and after the fix:

Pre-fix: second handshake hit at state=11, throttled at 570ms, Android stalled on every BLE→TCP cycle (4× attempts captured)
Post-fix: second handshake passes through, dump completes cleanly, Android moves past "reading config" to the main UI

Also added state=%d to the throttle WARN log so future debugging shows exactly which state the throttle fired in.

cvaldess · 2026-05-17T15:56:39Z

Tested on Nordic nRF54L15-DK with the iOS Meshtastic app — and the
throttle / lightweight-request path actually fires in normal use, not
just under a synthetic reconnect storm.

Trace from the smoke test (uptime in seconds):

INFO  | 25 [BleDeferred] Client wants config, nonce=69420
INFO  | 25 [BleDeferred] Start API client config millis=25231
DEBUG | 26 Send config: device
...                              ← full config + module config stream
DEBUG | 27 FromRadio=STATE_SEND_FILEMANIFEST
DEBUG | 27 FromRadio=STATE_SEND_PACKETS
INFO  | 28 [BleDeferred] Client wants config, nonce=69421
INFO  | 28 [BleDeferred] Start API client config millis=28050
INFO  | 28 [BleDeferred] Client only wants node info, skipping other config  ← new path

iOS's second wantConfig arrived ~2.8 s after the first START and was
correctly identified as a node-info-only request, so the heavy dump was
bypassed. Heap pressure on this port is real (256 KB RAM, ~50 % used
at idle) so any reconnect-storm protection is welcome.

All 5 commits cherry-picked clean (only conflict on rebase was an include
swap <cstdint> → <atomic> in PhoneAPI.h, trivially resolved). MITM
pair, encrypted ATT, MTU 185, full config + 2nd lightweight wantConfig
all green. LGTM.

Tested-by: cvaldess

github-actions Bot added needs-review Needs human review bugfix Pull request that fixes bugs labels Apr 23, 2026

thebentern requested a review from Copilot April 23, 2026 13:59

Copilot started reviewing on behalf of thebentern April 23, 2026 14:00 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Comment thread src/mesh/PhoneAPI.cpp Outdated

Comment thread src/mesh/PhoneAPI.cpp Outdated

Comment thread src/mesh/PhoneAPI.cpp Outdated

thebentern and others added 2 commits April 23, 2026 09:18

Potential fix for pull request finding

4c8f555

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

thebentern requested a review from Copilot April 23, 2026 23:21

Copilot started reviewing on behalf of thebentern April 23, 2026 23:23 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Comment thread src/mesh/PhoneAPI.cpp Outdated

Comment thread src/mesh/PhoneAPI.cpp Outdated

thebentern requested a review from Copilot April 27, 2026 13:04

Copilot started reviewing on behalf of thebentern April 27, 2026 13:04 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

cvaldess mentioned this pull request May 17, 2026

fix(mesh): return real status from reconfigure() and stop panicking on SPI errors #10430

Open

2 tasks

cvaldess mentioned this pull request May 17, 2026

RadioLibInterface: harden RX path (4 related fixes) #10252

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271

PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271
nightjoker7 wants to merge 5 commits into
meshtastic:developfrom
nightjoker7:fix/phoneapi-throttle-config-handshake

nightjoker7 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nightjoker7 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

nightjoker7 commented Apr 27, 2026

Uh oh!

cvaldess commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Uh oh!

Conversation

nightjoker7 commented Apr 23, 2026

Summary

Evidence

Fix

Test plan

Related work

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nightjoker7 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nightjoker7 commented Apr 27, 2026

Uh oh!

cvaldess commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants