fix(firmware): harden ESP32 OTA + nRF DFU update paths (hardware-validated)#5915

Merged: jamesarich merged 8 commits into main from claude/exciting-kirch-e975a5 on Jun 23, 2026
@jamesarich

Collaborator

Why

:feature:firmware was traced end-to-end against the firmware source, the AdminMessage OTA/DFU protocol, the meshtastic/esp32-unified-ota loader, and the Nordic DFU spec to confirm every update path actually works. That audit surfaced several latent, OTA-breaking bugs — all fixed here and confirmed on real hardware (Pixel 6a driving a Heltec V3 over Wi-Fi & BLE, and a RAK4631 / WisMesh Pocket over Nordic DFU).

🐛 Bug Fixes

  • ESP32 BLE OTA hang: the transport streamed fixed 512-byte chunks, but Android negotiates MTU 512 → a 509-byte write payload, so every chunk fragmented into [509, 3] while the device sends one ACK per drained chunk → the per-fragment ACK wait timed out (10 s) and BLE OTA failed in the common case. Each chunk is now clamped to the negotiated write payload (1 chunk = 1 write = 1 ACK), and the streaming loop is response-type-driven (ACK→continue, OK→done, ERR→fail) so a late device error can never be mis-reported as success.
  • Manifest never parsed: FirmwareManifest.hwModel was typed String, but published .mt.json manifests emit it as an integer, so every manifest threw and the retriever silently fell back to filename heuristics. The model now parses the real manifest shape.
  • Wrong artifact on the no-manifest path: for pre-2.7.17 releases the plain firmware-<target>.bin is a merged bootloader+app image; flashing it to app0 left it misaligned and the device's esp_ota_end rejected it after a byte-perfect transfer. The fallback now prefers the bare-app -update.bin.
  • BLE OTA review hardening: @Volatile on the connection-loss guard, explicit failure on a premature OK, and a guard against a non-positive negotiated write length.
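The chunk clamp and the non-positive-length guard described above can be sketched as follows. This is a minimal illustration, not the PR's code: `writePayloadFor`, `splitForWrites`, and `ATT_WRITE_OVERHEAD` are hypothetical names, and the 3-byte overhead is the standard ATT write header (1-byte opcode plus 2-byte handle).

```kotlin
// Assumption: standard ATT write overhead (1-byte opcode + 2-byte attribute handle).
const val ATT_WRITE_OVERHEAD = 3

/** Largest payload a single GATT write can carry at the given ATT MTU. */
fun writePayloadFor(mtu: Int): Int = mtu - ATT_WRITE_OVERHEAD

/**
 * Splits [data] into pieces that each fit in one GATT write.
 * A non-positive [maxWriteLen] falls back to a single whole-buffer piece,
 * so a caller's write loop can never spin without advancing.
 */
fun splitForWrites(data: ByteArray, maxWriteLen: Int): List<ByteArray> {
    if (data.isEmpty()) return emptyList()
    if (maxWriteLen <= 0) return listOf(data)
    return (data.indices step maxWriteLen).map { start ->
        data.copyOfRange(start, minOf(start + maxWriteLen, data.size))
    }
}
```

With MTU 512, `writePayloadFor` gives 509. Splitting a fixed 512-byte chunk at 509 reproduces the `[509, 3]` fragmentation, whereas chunking the image at 509 bytes in the first place yields one write per chunk.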

🌟 New

  • Slow-bootloader tip: when an nRF52 DFU runs on a bootloader that can't negotiate a large MTU (e.g. a stock Adafruit AdaDFU capped at MTU 23 → 20-byte packets), the upload screen now shows a hint that flashing the OTAFIX bootloader enables much faster BLE updates.

🛠️ Refactoring & Performance

  • MTU-gated DFU packet size: Legacy DFU now sizes packets to the negotiated ATT MTU (word-aligned to the bootloader's multiple-of-4 rule, capped at 244) instead of an OTAFIX advertised-name heuristic — self-gating, and ~12× faster on bootloaders that grant a high MTU (e.g. OTAFIX 2.1). Also fixes a latent word-alignment bug in the old high-MTU path.
  • Dedup: dropped redundant per-transport scan constants (equal to the scanForBleDevice defaults) and an unused DFU UUID.
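The MTU-gated packet sizing can be illustrated with a short sketch. The function and constant names are hypothetical, and the clamping order (align down, then bound to 20..244) is an assumption consistent with the figures in this PR rather than a copy of the actual implementation.

```kotlin
// Assumption: standard ATT write overhead (1-byte opcode + 2-byte handle).
const val ATT_HEADER_BYTES = 3
const val MIN_DFU_PACKET = 20  // payload available at the default MTU of 23
const val MAX_DFU_PACKET = 244 // cap stated in this PR

/** Packet size for Legacy DFU, derived only from the negotiated ATT MTU. */
fun dfuPacketSize(negotiatedMtu: Int): Int {
    val payload = negotiatedMtu - ATT_HEADER_BYTES
    val wordAligned = payload - payload % 4 // round down to a multiple of 4
    return wordAligned.coerceIn(MIN_DFU_PACKET, MAX_DFU_PACKET)
}

/** Hypothetical helper: true when the transfer is stuck on the slow 20-byte path. */
fun isLowSpeed(packetSize: Int): Boolean = packetSize <= MIN_DFU_PACKET
```

Under these assumptions MTU 23 yields 20-byte packets and MTU 247 yields 244-byte packets, a ratio of 244 / 20 ≈ 12, which matches the "~12×" figure quoted above.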

🧹 Chores

  • Removed an unnecessary Android-style apostrophe escape in a Compose string resource (would have rendered a literal backslash).

Testing Performed

Unit tests (feature/firmware commonTest):

  • BleOtaTransportTest — chunk-clamp (1 write/1 ACK per chunk), final-chunk ERR → failure (not success), premature-OK → failure, post-loop terminal OK/ERR.
  • FirmwareManifestTest — real-shaped manifest (integer hwModel) parses + resolves app0.
  • CommonFirmwareRetrieverTest — prefers -update.bin over the plain .bin when both exist.
  • LegacyDfuTransportTest — high-MTU packet path + 4-byte-alignment floor.

Full baseline green: spotlessApply spotlessCheck detekt assembleDebug test allTests (651 tests).

Hardware validation (Pixel 6a):

| Path | Device | Result |
| --- | --- | --- |
| ESP32 Wi-Fi OTA | Heltec V3 | ✅ flashed + rebooted |
| ESP32 BLE OTA | Heltec V3 | ✅ 4118 chunks, clean 1-ACK cadence |
| ESP32 OTA (no-manifest 2.7.15) | Heltec V3 | -update.bin → success |
| nRF52 Nordic DFU | RAK4631 / WisMesh Pocket | ✅ full Legacy DFU flash |

Follow-ups filed separately: a BLE-transport dedup refactor (touches these just-validated connect/stream paths) and an OTAFIX bootloader "how-to" link for the slow-DFU case.

jamesarich and others added 7 commits June 22, 2026 20:32
The BLE OTA transport streamed fixed 512-byte chunks, but the Android BLE
layer negotiates MTU 512 (requestMtu(512)) -> a 509-byte write payload. Each
512-byte chunk fragmented into two GATT writes ([509, 3]) and the loop waited
for one ACK per fragment, while the device (esp32-unified-ota) is a byte stream
that coalesces the writes and emits a single ACK per drain. The second
waitForResponse then timed out (10s), failing BLE OTA in the common case.

Clamp each streamed chunk to the negotiated write payload so one chunk is
exactly one GATT write eliciting exactly one device response, and drive the
loop on response type (ACK->continue, OK->complete on the last chunk,
ERR->fail) instead of a fragment count. Success now requires an explicit
terminal OK and any ERR fails the transfer, so a late device error can never
be reported as success.

WiFi OTA (no per-chunk ACK) and the nRF DFU paths are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Apply review findings on the BLE OTA streaming rewrite:
- Throw TransferFailed on an OK received before the final chunk instead of
  silently treating it as an ACK (the device sends OK only at completion, so an
  early OK signals a size disagreement).
- Mark isConnected @Volatile: it is written from the connectionState collector
  and read by the streaming loop's connection-loss guard on a different
  dispatcher, matching the @Volatile idiom in KableBleConnection.
- Guard writeData against a non-positive negotiated write length so the write
  loop cannot spin without advancing (fall back to a single whole-buffer write).

Add tests for the post-loop terminal path (ACK-then-OK success, ACK-then-ERR
failure) and the premature-OK failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
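
The response-type-driven loop from the two commits above (ACK continues, OK completes only on the last chunk, ERR always fails) can be sketched like this. It is an illustrative sketch only: `OtaResponse`, `streamChunks`, and the `send` / `awaitResponse` callbacks are hypothetical stand-ins for the real transport, `TransferFailed` here is a local placeholder for the exception named in the commit, and treating a post-loop ACK as a failure is an assumption.

```kotlin
enum class OtaResponse { ACK, OK, ERR }

// Local placeholder for the exception type named in the commit message.
class TransferFailed(message: String) : Exception(message)

/**
 * Streams [chunks] one write at a time and reacts to the response type.
 * [send] and [awaitResponse] are hypothetical hooks into the transport.
 */
fun streamChunks(
    chunks: List<ByteArray>,
    send: (ByteArray) -> Unit,
    awaitResponse: () -> OtaResponse,
) {
    for ((index, chunk) in chunks.withIndex()) {
        val isLast = index == chunks.lastIndex
        send(chunk)
        when (awaitResponse()) {
            OtaResponse.ERR -> throw TransferFailed("device reported ERR at chunk $index")
            OtaResponse.OK ->
                if (isLast) return else throw TransferFailed("premature OK at chunk $index")
            OtaResponse.ACK -> Unit // keep streaming
        }
    }
    // Every chunk was ACKed; success still requires an explicit terminal OK.
    when (awaitResponse()) {
        OtaResponse.OK -> return
        else -> throw TransferFailed("no terminal OK after the final chunk")
    }
}
```

Because success is only reachable through an explicit OK, a late ERR after the last chunk surfaces as `TransferFailed` instead of being mistaken for completion.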
FirmwareManifest declared `hwModel: String`, but published manifests emit it as
an integer (the HardwareModel enum value), so every real manifest threw
JsonDecodingException and FirmwareRetriever silently fell back to filename
heuristics -- never using the authoritative app0 entry or its md5.

Model only the consumed `files` array and let ignoreUnknownKeys drop the
decorative scalar metadata, so an upstream scalar-type change can never break
OTA resolution again. The test is rewritten against the real manifest shape
(integer hwModel), so it now serves as a regression test for the bug.

Verified on hardware (Pixel 6a -> Heltec V3): the 2.7.25 manifest now resolves
"firmware-heltec-v3-2.7.25.104df5f.bin (app0, md5=...)" and the WiFi OTA
completes -- device reboots into the new image and reconnects.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…P32 OTA

For pre-2.7.17 releases (which ship no .mt.json manifest), firmware-<target>-
<version>.bin is a *merged* image (bootloader + partition table + app), while
firmware-<target>-<version>-update.bin is the bare app image. The no-manifest
fallback tried the plain .bin first, so it flashed the merged image to the app0
partition: the bytes transferred intact (device SHA256 matched) but esp_ota_end
rejected the misaligned image ("OTA End failed").

Try -update.bin before the plain .bin in the fallback. Safe for 2.7.17+, which
resolve via the manifest and ship no -update.bin, so they fall through to the
plain .bin (= app0). Adds a regression test for the both-present case.

Found via hardware testing: flashing 2.7.15 to a Heltec V3 failed at esp_ota_end
with the merged .bin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
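
The fallback ordering described in the commit above can be sketched as a simple preference list. The function names and the `available` set of release asset names are hypothetical; only the two filename patterns come from the commit message.

```kotlin
/** Candidate asset names for the no-manifest path, most preferred first. */
fun fallbackCandidates(target: String, version: String): List<String> = listOf(
    "firmware-$target-$version-update.bin", // bare app image on pre-2.7.17 releases
    "firmware-$target-$version.bin",        // merged image pre-2.7.17; app0 image from 2.7.17 on
)

/** Returns the first candidate present in [available], or null if none is. */
fun pickFallbackAsset(target: String, version: String, available: Set<String>): String? =
    fallbackCandidates(target, version).firstOrNull { it in available }
```

When both files exist, the `-update.bin` image wins. A release that ships no `-update.bin` falls through to the plain `.bin`, which is the behavior the commit describes for 2.7.17 and later.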
…tloader notice

Legacy DFU streamed fixed 20-byte packets unless the bootloader advertised the
OTAFIX `_DFU` name. Gate the packet size on the negotiated ATT MTU instead
(word-aligned to the bootloader's multiple-of-4 rule, capped at 244): a
bootloader that can't take large writes won't grant a large MTU, so it
self-gates back to 20. This unlocks ~12x faster DFU on high-MTU bootloaders
(e.g. OTAFIX 2.1) and fixes a latent word-alignment bug in the old high-MTU
path. Removes the dead OTAFIX_NAME_SUFFIX and an unverified "bricks the device"
comment the Adafruit bootloader source contradicts.

Stock RAK4631 (AdaDFU) negotiates only MTU 23, so it stays at 20-byte packets;
when that slow path is taken the upload screen now shows a tip that flashing the
OTAFIX bootloader enables much faster BLE updates (ProgressState.hint, surfaced
via DfuUploadTransport.isLowSpeedTransfer).

Verified on hardware (Pixel 6a + RAK4631): self-gates to 20-byte packets and the
DFU completes; tests cover the high-MTU path and the 4-byte-alignment floor.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The three BLE transports each re-declared SCAN_RETRY_COUNT=3 / SCAN_RETRY_DELAY=2s
and passed them explicitly to scanForBleDevice, whose defaults are already those
values — rely on the defaults. Also remove the unused BUTTONLESS_WITH_BONDS
(8EC90004) UUID; only BUTTONLESS_NO_BONDS is ever written. No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…hint

Compose Multiplatform string resources don't use Android's apostrophe escaping;
the `\'` would render a literal backslash in the UI. Match the file's existing
unescaped convention (e.g. firmware_update_save_dfu_file's "device's").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the bugfix PR tag label Jun 23, 2026
@jamesarich jamesarich marked this pull request as ready for review June 23, 2026 17:32
@jamesarich jamesarich enabled auto-merge June 23, 2026 17:32
workpad.md holds the /craft pipeline's per-task working notes; it leaked into
version control via an earlier run. Remove it and gitignore root-level
workpad files so future runs stay untracked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jamesarich jamesarich force-pushed the claude/exciting-kirch-e975a5 branch from 7117020 to 4a648d8 Compare June 23, 2026 17:34
@jamesarich jamesarich added this pull request to the merge queue Jun 23, 2026
Merged via the queue into main with commit b692670 Jun 23, 2026
22 checks passed
@jamesarich jamesarich deleted the claude/exciting-kirch-e975a5 branch June 23, 2026 17:55
jeremiah-k pushed a commit to jeremiah-k/Meshtastic-Android that referenced this pull request Jun 23, 2026
…audit (meshtastic#5916)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>