Skip to content

mqtt: harden bridge lifecycle and packet queueing#3

Closed
robekl wants to merge 15 commits into
agessaman:mqtt-bridge-implementationfrom
robekl:fix/mqtt-bridge-lowmem
Closed

mqtt: harden bridge lifecycle and packet queueing#3
robekl wants to merge 15 commits into
agessaman:mqtt-bridge-implementationfrom
robekl:fix/mqtt-bridge-lowmem

Conversation

@robekl

@robekl robekl commented Mar 4, 2026

Copy link
Copy Markdown

Summary

This PR aggregates reliability and low-memory hardening changes for MQTT bridge behavior on embedded targets, compared to mqtt-bridge-implementation (dcbbe5a6).

Focus areas:

  • Correctness under asynchronous packet lifetimes
  • Restart/lifecycle safety
  • Heap/stack pressure control
  • Safer CLI parsing and config persistence
  • Defensive string/config handling

Conceptual Changes

1) Queue packet snapshots instead of Packet* references

Problem
Queued MQTT work previously stored raw packet pointers that could outlive their owner lifecycle, creating stale pointer/use-after-free risk in async publish paths.

Change
Queue entries now store serialized packet bytes (writeTo) and reconstruct temporary packets (readFrom) when processing.

Why this fixes it
Queue ownership is now explicit and self-contained. Publish logic no longer depends on external packet lifetime guarantees, eliminating pointer lifetime races.


2) Make bridge begin/end lifecycle restart-safe on ESP

Problem
Restarting bridge/task paths could leak or retain stale state (event handlers, analyzer clients, buffers), and begin/end symmetry was incomplete.

Change

  • Move long-lived buffer allocation into begin() with null checks
  • Reset token/reconnect timestamps on begin
  • Cleanly remove WiFi event handler in end()
  • Cleanly disconnect/delete analyzer clients in end()
  • Free queue storage/mutex/task resources on failure and shutdown paths

Why this fixes it
Resource ownership and teardown are now consistent across start/stop cycles, reducing leaks and duplicate callback/resource state after bridge restarts.


3) Switch JWT JSON docs to static allocation, then harden overflow handling

Problem
JWT creation used dynamic JSON allocation, causing heap churn/fragmentation risk on constrained devices. Static capacity change introduced potential silent overflow risk if not checked.

Change

  • Replace dynamic docs with static docs
  • Add explicit doc.overflowed() checks for header/payload builders
  • Increase payload doc capacity from 384 to 512 for safer optional-claim headroom

Why this fixes it
Removes heap churn while failing safely on capacity overflow. Increasing to 512 reduces false failures and optional-claim truncation risk while keeping deterministic memory use.


4) Remove large stack JSON fallback in packet/raw publish path

Problem
Fallback to large stack buffers under allocation failure increased stack pressure and risked instability on weak/embedded targets.

Change
Publish now requires PSRAM/heap buffer allocation and skips publish when allocation fails.

Why this fixes it
Keeps stack usage bounded and predictable in hot paths. Under pressure, behavior degrades by dropping publish rather than risking stack-related instability.


5) CLI timezone and memory command correctness fixes

Problem

  • get timezone matched before get timezone.offset due to prefix overlap
  • timezone.offset parsing used permissive conversion (atoi-style behavior), allowing malformed input to be accepted as 0
  • Non-ESP memory command referenced ESP-only APIs

Change

  • Evaluate timezone.offset before timezone
  • Parse with strtol + full-string validation and range check (-12..+14)
  • Add non-ESP-safe memory command response

Why this fixes it
Prevents wrong key matching and malformed config acceptance. Improves command correctness across both ESP and non-ESP builds.


6) Preserve user MQTT prefs on boot; restore safe fresh-install origin default

Problem
Boot-time logic was overwriting persisted mqtt_origin/analyzer settings, but removing it entirely could leave fresh installs with empty origin.

Change

  • Remove unconditional boot-time overwrite in examples
  • Add conditional fallback in prefs load: if mqtt_origin is empty, set it to node_name and persist /mqtt_prefs

Why this fixes it
Keeps user-configured values intact while still giving first-boot devices a deterministic non-empty origin.


7) Harden broker and build-flag string handling

Problem
Potential null/termination edge cases when copying broker/build-flag strings.

Change

  • Ensure explicit null termination after bounded copies
  • Guard setBroker() against null input pointers for host/user/password

Why this fixes it
Prevents accidental unterminated strings and null pointer misuse in config plumbing.


8) Reduce queued-packet metadata footprint

Problem
Snapshot queue correctness introduced additional per-entry memory/copy cost.

Change
Trim queue struct fields and narrow types:

  • Remove unused queued timestamp
  • Use uint8_t for packet length
  • Use uint16_t for raw length

Why this fixes it
Preserves snapshot correctness while slightly reducing queue item size and copy overhead.


9) Align comments with actual publish behavior

Problem
Comments still claimed stack fallback behavior after logic changed to skip publish on alloc failure.

Change
Update comments to document real behavior.

Why this fixes it
Prevents maintenance confusion and incorrect assumptions during future debugging/refactors.


10) Check /mqtt_prefs write result

Problem
Config persistence writes ignored byte-count result, masking partial writes or storage faults.

Change
Verify file.write(...) byte count equals expected struct size and log partial-write failures.

Why this fixes it
Improves reliability/diagnostics for flash/filesystem failure cases and prevents silent persistence mismatch.

Notes

  • This PR intentionally prioritizes deterministic behavior and stability under memory pressure over best-effort publish during low-memory states.
  • Queue snapshot model remains a deliberate correctness-first tradeoff; this PR includes a small memory-footprint refinement but does not redesign the model.

@agessaman

Copy link
Copy Markdown
Owner

Hi, sorry I didn't see this earlier. Looks like some good changes that were needed. I'm going to give it some testing over the next few days before merging.

@agessaman agessaman self-assigned this Mar 9, 2026
@agessaman agessaman added the enhancement New feature or request label Mar 9, 2026
agessaman pushed a commit that referenced this pull request Mar 20, 2026
Default LNA enabled=true and fix the sleep order
@agessaman agessaman closed this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants