PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271
PhoneAPI: rate-limit handleStartConfig to prevent reconnect-storm heap fragmentation#10271nightjoker7 wants to merge 5 commits into
Conversation
…on reconnect storms
A misbehaving or flapping client (e.g. a TCP reader-thread loop that
RSTs and reconnects every second) can drive handleStartConfig at
roughly 1 Hz. Every handshake exercises the protobuf encode path for:
- MY_INFO / OWN_NODEINFO / METADATA
- all channels, all configs, all module configs
- the entire NodeDB (up to MAX_NUM_NODES = 250 entries)
- a recursive filesystem walk via getFiles("/", 10)
Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap):
- 466 reconnect cycles over 85 minutes
- Net heap loss: 108 KB (avg 232 B / median 192 B per cycle)
- Dropped heap_free from ~150 KB to ~50 KB, wedged TCP handshake
Most of this is heap fragmentation from the repeated alloc/free of
mid-sized protobuf encode buffers interleaved with longer-lived state.
Once the reconnect storm stops, heap does not recover until reboot.
Add a 2000 ms minimum interval between full config dumps, shared
across all PhoneAPI transports (BLE / WiFi / Serial / HTTP / Ethernet)
via a static member. Well-behaved clients issue one want_config_id
per reconnect so are unaffected; a flapping client is throttled to
one handshake per ~2 s, which matches roughly the time a single
config dump needs to complete.
The throttled path returns early without touching state, observers,
or running getFiles — so no resources are leaked per throttled call.
The client either waits and retries or times out and recovers naturally.
There was a problem hiding this comment.
Pull request overview
Adds a global rate-limit to PhoneAPI::handleStartConfig() to prevent reconnect storms from repeatedly triggering full config-dump handshakes that fragment internal SRAM heap, improving stability under misbehaving/flapping clients.
Changes:
- Introduces a shared
PhoneAPI::lastStartConfigMstimestamp used to throttle config-dump handshakes across allPhoneAPItransports. - Enforces a 2000 ms minimum interval between full config-dump handshakes, logging when throttling occurs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/mesh/PhoneAPI.h | Adds shared static timestamp used to rate-limit config handshakes across transports. |
| src/mesh/PhoneAPI.cpp | Implements the handshake throttling check and defines the shared timestamp + interval constant. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Three issues raised in the automated review: 1. **Nonce race**: If a new want_config_id arrives while a previous config dump is still in-flight, the throttled path returned early but config_nonce had already been overwritten in the caller — so sendConfigComplete would return the new nonce to a client that asked for the old one. Fix: pass the incoming nonce as a parameter and only commit it to config_nonce AFTER the throttle check passes. 2. **Data race on lastStartConfigMs**: Shared static accessed without synchronization from BLE, TCP/HTTP, and serial-console tasks. Fix: std::atomic<uint32_t> with compare_exchange_weak so concurrent handshakes across transports can't both win the throttle check. 3. **Minor**: Captured 'now = millis()' once — already addressed by @thebentern's pre-applied autofix. Side benefit: PacketAPI's want_config_id handler had a latent bug where it declared a local 'config_nonce' that shadowed the member, so the member was never updated and handleStartConfig used whatever stale value was previously stored. The signature change forces us to pass the new nonce explicitly, fixing this too.
|
Thanks @copilot — all three points addressed in 179821e00:
Built clean on heltec-v4. Ready for re-review. |
There was a problem hiding this comment.
Pull request overview
Adds a global rate limit for full config-dump handshakes in PhoneAPI::handleStartConfig to mitigate heap fragmentation/reboot risk caused by rapid reconnect loops across transports.
Changes:
- Changes
handleStartConfig()to accept and commit the client-providedwant_config_idnonce only after passing a global throttle check. - Introduces a shared, cross-transport minimum interval (2000 ms) between config handshakes using a static
std::atomic<uint32_t>. - Updates PacketAPI and PhoneAPI ToRadio handlers to pass the nonce into
handleStartConfig(...).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/mesh/api/PacketAPI.cpp | Passes want_config_id nonce into handleStartConfig(...) for PacketAPI (screen) clients. |
| src/mesh/PhoneAPI.h | Declares shared atomic lastStartConfigMs and updates handleStartConfig signature/docs. |
| src/mesh/PhoneAPI.cpp | Implements global handshake throttling and commits nonce after throttle passes. |
Copilot (follow-up on the atomic rate-limit fix): 1. Backwards-time CAS race: the prior revision captured `now = millis()` once before the CAS loop. If a concurrent task raced past us, our compare_exchange_weak failure path updated `last` to the newer value; looping with a stale `now` let the next CAS commit the older timestamp into the atomic, breaking the rate limit for whichever handshake arrived next. Move `now = millis()` inside the loop so it always advances monotonically. 2. LOG_WARN flood during a reconnect-storm: this branch fires once per throttled handshake attempt, which is exactly the scenario the throttle exists to defend against. Gate the log on a 5s Throttle window so a flapping client can't turn the defensive code into its own log-flood.
There was a problem hiding this comment.
Pull request overview
This PR adds a global rate limit to PhoneAPI::handleStartConfig to prevent rapid reconnect/config-handshake cycles (across any transport) from repeatedly triggering large config dumps that fragment internal SRAM heap on ESP32-class targets.
Changes:
- Add a shared, atomic
PhoneAPI::lastStartConfigMstimestamp and enforce a 2000ms minimum interval between accepted config handshakes. - Update
handleStartConfigto accept and commit thewant_config_idnonce only after passing the throttle check. - Update
PacketAPIandPhoneAPI::handleToRadiocall sites to pass through the nonce.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/mesh/api/PacketAPI.cpp |
Passes want_config_id into the updated handleStartConfig(newConfigNonce) API. |
src/mesh/PhoneAPI.h |
Introduces atomic shared throttle state and updates the handleStartConfig signature/contract. |
src/mesh/PhoneAPI.cpp |
Implements the cross-transport handshake rate limit and throttled logging behavior. |
| static uint32_t lastThrottleLogMs = 0; | ||
| if (!Throttle::isWithinTimespanMs(lastThrottleLogMs, HANDSHAKE_THROTTLE_LOG_INTERVAL_MS)) { | ||
| lastThrottleLogMs = now; | ||
| LOG_WARN("Config handshake throttled (last one %u ms ago, min interval %u ms)", | ||
| (unsigned)(now - last), (unsigned)MIN_CONFIG_HANDSHAKE_INTERVAL_MS); | ||
| } |
…light The original throttle returned early on every handshake within the 2s window regardless of state. That stranded fresh-client connections (state == STATE_SEND_NOTHING) and post-dump re-handshake requests (state == STATE_SEND_PACKETS) — the early-return doesn't advance state, so getFromRadio returned nothing and clients sat on "reading config" forever. Symptom: after a successful BLE handshake (which sets static lastStartConfigMs), opening a TCP/IP connection from the Android app within 2 seconds caused the TCP handshake to silently fail. State never advanced, app stuck on "reading config" indefinitely. Reproducible on Heltec V4 TFT 2026-04-27. Restrict the throttle to states [SEND_UIDATA..SEND_COMPLETE_ID], i.e. only when a dump is actually in-flight. Fresh clients and post-dump steady-state re-handshakes pass through. The original anti-fragmentation guarantee is preserved: concurrent overlapping dumps still get caught by the rate limit. Hardware-verified on V4 TFT: pre-fix, second handshake at state=11 was throttled at 570ms; post-fix, second handshake passes through and Android completes config dump cleanly across BLE->TCP transport switches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Caught a state-machine bug in my own throttle while hardware-testing today on a Heltec V4 TFT: the original early-return path didn't advance the state machine, so when a client issued a second Reproduced 2026-04-27 during a BLE → TCP transport switch on the Android app: BLE handshake completed → static Fixed in 6d9641fc9: restrict the throttle to states Hardware-verified on V4 TFT both before and after the fix:
Also added |
|
Tested on Nordic nRF54L15-DK with the iOS Meshtastic app — and the Trace from the smoke test (uptime in seconds): iOS's second All 5 commits cherry-picked clean (only conflict on rebase was an include Tested-by: cvaldess |
Summary
Add a 2000 ms minimum interval between full config-dump handshakes in PhoneAPI::handleStartConfig. A misbehaving or flapping client that reconnects faster than a config dump can complete is currently able to exhaust the internal SRAM heap via fragmentation.
Evidence
Measured on a Station G2 (ESP32-S3, 248 KB internal SRAM heap, 8 MB PSRAM) running
2.7.23.ef4693dwith full DEBUG_HEAP instrumentation:Stage-by-stage heap delta per cycle (from log parse over 398 cycles):
Most of the loss is heap fragmentation from the repeated alloc/free of mid-sized protobuf encode buffers (MY_INFO, OWN_NODEINFO, METADATA, NodeDB entries) interleaved with longer-lived queue entries. Once the storm stops, the free total does not recover until reboot.
Fix
Add `PhoneAPI::lastStartConfigMs` as a static member (shared across all transport subclasses — BLE, WiFi/TCP, Serial, HTTP, Ethernet) and reject handshakes within 2000 ms of the previous one:
Test plan
Related work
The underlying fragmentation vulnerability is an architectural issue — large per-session protobuf buffers land in internal SRAM even on boards with 8 MB of idle PSRAM. Separately scoping a PSRAM-routing change for the large allocators; this PR is the narrow defensive fix.