Skip to content

PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271

Open
nightjoker7 wants to merge 5 commits into
meshtastic:developfrom
nightjoker7:fix/phoneapi-throttle-config-handshake
Open

PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271
nightjoker7 wants to merge 5 commits into
meshtastic:developfrom
nightjoker7:fix/phoneapi-throttle-config-handshake

Conversation

@nightjoker7

Copy link
Copy Markdown
Contributor

Summary

Add a 2000 ms minimum interval between full config-dump handshakes in PhoneAPI::handleStartConfig. A misbehaving or flapping client that reconnects faster than a config dump can complete is currently able to exhaust the internal SRAM heap via fragmentation.

Evidence

Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap, 8 MB PSRAM) running 2.7.23.ef4693d with full DEBUG_HEAP instrumentation:

  • Meshtastic-python `TCPInterface` client with a reader thread that gets `Connection reset by peer` from a remote under heap pressure → Python side tears down and reconnects → full `want_config_id` re-issued → new `handleStartConfig` every ~1 s.
  • 466 complete reconnect cycles over 85 minutes.
  • Net heap loss: 108 KB (avg 232 B / median 192 B per cycle).
  • `heap_free` trajectory: 150 KB → 50 KB over the 85 min, crashed into reboot.
  • Post-reboot with the storm stopped: heap rock-stable at 145 KB for 11 hours (no continued leak).

Stage-by-stage heap delta per cycle (from log parse over 398 cycles):

Segment Avg Δ Median Δ
`wants_config → got_files` +2,815 +2,052
`force_close → incoming_api` +1,472 +1,164
`send_myinfo → send_metadata` -3,556 -3,692
`send_cfg_device → send_cfg_sec` -501 -568

Most of the loss is heap fragmentation from the repeated alloc/free of mid-sized protobuf encode buffers (MY_INFO, OWN_NODEINFO, METADATA, NodeDB entries) interleaved with longer-lived queue entries. Once the storm stops, the free total does not recover until reboot.

Fix

Add `PhoneAPI::lastStartConfigMs` as a static member (shared across all transport subclasses — BLE, WiFi/TCP, Serial, HTTP, Ethernet) and reject handshakes within 2000 ms of the previous one:

  • 2000 ms chosen to be just longer than a full config dump takes to complete on a saturated (250-node) NodeDB, so legitimate reconnect-after-disconnect flows still work.
  • Well-behaved clients only issue one `want_config_id` per reconnect so are unaffected.
  • The throttled path returns early before touching state, observers, or `getFiles` — no resources are leaked per throttled call.
  • The client either waits and retries or times out and recovers naturally.

Test plan

  • Builds clean on heltec-v4 (ESP32-S3) — local verification
  • CI build across all targets
  • Repro: run a meshtastic-python TCP reconnect loop against a Station G2, confirm heap_free no longer bleeds out

Related work

The underlying fragmentation vulnerability is an architectural issue — large per-session protobuf buffers land in internal SRAM even on boards with 8 MB of idle PSRAM. Separately scoping a PSRAM-routing change for the large allocators; this PR is the narrow defensive fix.

…on reconnect storms

A misbehaving or flapping client (e.g. a TCP reader-thread loop that
RSTs and reconnects every second) can drive handleStartConfig at
roughly 1 Hz. Every handshake exercises the protobuf encode path for:

  - MY_INFO / OWN_NODEINFO / METADATA
  - all channels, all configs, all module configs
  - the entire NodeDB (up to MAX_NUM_NODES = 250 entries)
  - a recursive filesystem walk via getFiles("/", 10)

Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap):
  - 466 reconnect cycles over 85 minutes
  - Net heap loss: 108 KB (avg 232 B / median 192 B per cycle)
  - Dropped heap_free from ~150 KB to ~50 KB, wedged TCP handshake

Most of this is heap fragmentation from the repeated alloc/free of
mid-sized protobuf encode buffers interleaved with longer-lived state.
Once the reconnect storm stops, heap does not recover until reboot.

Add a 2000 ms minimum interval between full config dumps, shared
across all PhoneAPI transports (BLE / WiFi / Serial / HTTP / Ethernet)
via a static member. Well-behaved clients issue one want_config_id
per reconnect so are unaffected; a flapping client is throttled to
one handshake per ~2 s, which matches roughly the time a single
config dump needs to complete.

The throttled path returns early without touching state, observers,
or running getFiles — so no resources are leaked per throttled call.
The client either waits and retries or times out and recovers naturally.
@github-actions github-actions Bot added needs-review Needs human review bugfix Pull request that fixes bugs labels Apr 23, 2026
@thebentern thebentern requested a review from Copilot April 23, 2026 13:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a global rate-limit to PhoneAPI::handleStartConfig() to prevent reconnect storms from repeatedly triggering full config-dump handshakes that fragment internal SRAM heap, improving stability under misbehaving/flapping clients.

Changes:

  • Introduces a shared PhoneAPI::lastStartConfigMs timestamp used to throttle config-dump handshakes across all PhoneAPI transports.
  • Enforces a 2000 ms minimum interval between full config-dump handshakes, logging when throttling occurs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/mesh/PhoneAPI.h Adds shared static timestamp used to rate-limit config handshakes across transports.
src/mesh/PhoneAPI.cpp Implements the handshake throttling check and defines the shared timestamp + interval constant.

Comment thread src/mesh/PhoneAPI.cpp Outdated
Comment thread src/mesh/PhoneAPI.cpp Outdated
Comment thread src/mesh/PhoneAPI.cpp Outdated
thebentern and others added 2 commits April 23, 2026 09:18
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Three issues raised in the automated review:

1. **Nonce race**: If a new want_config_id arrives while a previous
   config dump is still in-flight, the throttled path returned early
   but config_nonce had already been overwritten in the caller — so
   sendConfigComplete would return the new nonce to a client that
   asked for the old one. Fix: pass the incoming nonce as a parameter
   and only commit it to config_nonce AFTER the throttle check passes.

2. **Data race on lastStartConfigMs**: Shared static accessed without
   synchronization from BLE, TCP/HTTP, and serial-console tasks. Fix:
   std::atomic<uint32_t> with compare_exchange_weak so concurrent
   handshakes across transports can't both win the throttle check.

3. **Minor**: Captured 'now = millis()' once — already addressed by
   @thebentern's pre-applied autofix.

Side benefit: PacketAPI's want_config_id handler had a latent bug
where it declared a local 'config_nonce' that shadowed the member,
so the member was never updated and handleStartConfig used whatever
stale value was previously stored. The signature change forces us
to pass the new nonce explicitly, fixing this too.
@nightjoker7

Copy link
Copy Markdown
Contributor Author

Thanks @copilot — all three points addressed in 179821e00:

  1. Nonce race — Changed handleStartConfig to take the incoming nonce as a parameter and only commit it to config_nonce after the throttle check passes. An in-flight dump keeps its original nonce so sendConfigComplete matches what the client asked for. Side benefit: this also fixed a latent bug in PacketAPI where uint32_t config_nonce = mr->want_config_id; declared a local that shadowed the member — the member was never being updated, so handshakes from that path used whatever stale value was left over.

  2. Data race on lastStartConfigMs — Changed to std::atomic<uint32_t> with compare_exchange_weak so concurrent handshakes across transports (BLE callback task vs APIServerPort OSThread vs serial-console task) can't both pass the check.

  3. millis() multiple times — Already fixed by @thebentern's pre-applied autofix in 4c8f555a5.

Built clean on heltec-v4. Ready for re-review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a global rate limit for full config-dump handshakes in PhoneAPI::handleStartConfig to mitigate heap fragmentation/reboot risk caused by rapid reconnect loops across transports.

Changes:

  • Changes handleStartConfig() to accept and commit the client-provided want_config_id nonce only after passing a global throttle check.
  • Introduces a shared, cross-transport minimum interval (2000 ms) between config handshakes using a static std::atomic<uint32_t>.
  • Updates PacketAPI and PhoneAPI ToRadio handlers to pass the nonce into handleStartConfig(...).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/mesh/api/PacketAPI.cpp Passes want_config_id nonce into handleStartConfig(...) for PacketAPI (screen) clients.
src/mesh/PhoneAPI.h Declares shared atomic lastStartConfigMs and updates handleStartConfig signature/docs.
src/mesh/PhoneAPI.cpp Implements global handshake throttling and commits nonce after throttle passes.

Comment thread src/mesh/PhoneAPI.cpp Outdated
Comment thread src/mesh/PhoneAPI.cpp Outdated
Copilot (follow-up on the atomic rate-limit fix):

1. Backwards-time CAS race: the prior revision captured `now = millis()`
   once before the CAS loop. If a concurrent task raced past us, our
   compare_exchange_weak failure path updated `last` to the newer value;
   looping with a stale `now` let the next CAS commit the older timestamp
   into the atomic, breaking the rate limit for whichever handshake
   arrived next. Move `now = millis()` inside the loop so it always
   advances monotonically.

2. LOG_WARN flood during a reconnect-storm: this branch fires once per
   throttled handshake attempt, which is exactly the scenario the throttle
   exists to defend against. Gate the log on a 5s Throttle window so a
   flapping client can't turn the defensive code into its own log-flood.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a global rate limit to PhoneAPI::handleStartConfig to prevent rapid reconnect/config-handshake cycles (across any transport) from repeatedly triggering large config dumps that fragment internal SRAM heap on ESP32-class targets.

Changes:

  • Add a shared, atomic PhoneAPI::lastStartConfigMs timestamp and enforce a 2000ms minimum interval between accepted config handshakes.
  • Update handleStartConfig to accept and commit the want_config_id nonce only after passing the throttle check.
  • Update PacketAPI and PhoneAPI::handleToRadio call sites to pass through the nonce.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/mesh/api/PacketAPI.cpp Passes want_config_id into the updated handleStartConfig(newConfigNonce) API.
src/mesh/PhoneAPI.h Introduces atomic shared throttle state and updates the handleStartConfig signature/contract.
src/mesh/PhoneAPI.cpp Implements the cross-transport handshake rate limit and throttled logging behavior.

Comment thread src/mesh/PhoneAPI.cpp
Comment on lines +84 to +89
static uint32_t lastThrottleLogMs = 0;
if (!Throttle::isWithinTimespanMs(lastThrottleLogMs, HANDSHAKE_THROTTLE_LOG_INTERVAL_MS)) {
lastThrottleLogMs = now;
LOG_WARN("Config handshake throttled (last one %u ms ago, min interval %u ms)",
(unsigned)(now - last), (unsigned)MIN_CONFIG_HANDSHAKE_INTERVAL_MS);
}
…light

The original throttle returned early on every handshake within the 2s window
regardless of state. That stranded fresh-client connections (state ==
STATE_SEND_NOTHING) and post-dump re-handshake requests (state ==
STATE_SEND_PACKETS) — the early-return doesn't advance state, so getFromRadio
returned nothing and clients sat on "reading config" forever.

Symptom: after a successful BLE handshake (which sets static lastStartConfigMs),
opening a TCP/IP connection from the Android app within 2 seconds caused the
TCP handshake to silently fail. State never advanced, app stuck on "reading
config" indefinitely. Reproducible on Heltec V4 TFT 2026-04-27.

Restrict the throttle to states [SEND_UIDATA..SEND_COMPLETE_ID], i.e. only
when a dump is actually in-flight. Fresh clients and post-dump steady-state
re-handshakes pass through. The original anti-fragmentation guarantee is
preserved: concurrent overlapping dumps still get caught by the rate limit.

Hardware-verified on V4 TFT: pre-fix, second handshake at state=11 was
throttled at 570ms; post-fix, second handshake passes through and Android
completes config dump cleanly across BLE->TCP transport switches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@nightjoker7

Copy link
Copy Markdown
Contributor Author

Caught a state-machine bug in my own throttle while hardware-testing today on a Heltec V4 TFT: the original early-return path didn't advance the state machine, so when a client issued a second want_config_id while state was either STATE_SEND_NOTHING (fresh client) or STATE_SEND_PACKETS (post-dump steady state), the throttle would silently swallow the request and the client stalled on "reading config" forever.

Reproduced 2026-04-27 during a BLE → TCP transport switch on the Android app: BLE handshake completed → static lastStartConfigMs captured timestamp → switched to TCP → second handshake hit at state=11 (POST_DUMP) within 570ms → throttle returned early → state never advanced → Android stuck.

Fixed in 6d9641fc9: restrict the throttle to states [SEND_UIDATA..SEND_COMPLETE_ID] only — i.e. only when a dump is actually in-flight, which is what the heap-fragmentation guarantee the throttle was meant to enforce actually cares about. Fresh clients and post-dump re-handshakes now pass through unconditionally. The original anti-fragmentation property is preserved: concurrent overlapping dumps from a flapping client still hit the throttle.

Hardware-verified on V4 TFT both before and after the fix:

  • Pre-fix: second handshake hit at state=11, throttled at 570ms, Android stalled on every BLE→TCP cycle (4× attempts captured)
  • Post-fix: second handshake passes through, dump completes cleanly, Android moves past "reading config" to the main UI

Also added state=%d to the throttle WARN log so future debugging shows exactly which state the throttle fired in.

@cvaldess

Copy link
Copy Markdown
Contributor

Tested on Nordic nRF54L15-DK with the iOS Meshtastic app — and the
throttle / lightweight-request path actually fires in normal use, not
just under a synthetic reconnect storm.

Trace from the smoke test (uptime in seconds):

INFO  | 25 [BleDeferred] Client wants config, nonce=69420
INFO  | 25 [BleDeferred] Start API client config millis=25231
DEBUG | 26 Send config: device
...                              ← full config + module config stream
DEBUG | 27 FromRadio=STATE_SEND_FILEMANIFEST
DEBUG | 27 FromRadio=STATE_SEND_PACKETS
INFO  | 28 [BleDeferred] Client wants config, nonce=69421
INFO  | 28 [BleDeferred] Start API client config millis=28050
INFO  | 28 [BleDeferred] Client only wants node info, skipping other config  ← new path

iOS's second wantConfig arrived ~2.8 s after the first START and was
correctly identified as a node-info-only request, so the heavy dump was
bypassed. Heap pressure on this port is real (256 KB RAM, ~50 % used
at idle) so any reconnect-storm protection is welcome.

All 5 commits cherry-picked clean (only conflict on rebase was an include
swap <cstdint><atomic> in PhoneAPI.h, trivially resolved). MITM
pair, encrypted ATT, MTU 185, full config + 2nd lightweight wantConfig
all green. LGTM.

Tested-by: cvaldess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs needs-review Needs human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants