Skip to content

fix(meshcore): raw-frame commands share the uncorrelated Err ack channel and can false-reject each other #3725

Description

@Yeraze

Summary

Several MeshCore native-backend handlers send a raw frame via sendToRadioFrame and then once()-race the device's immediate command ack against ResponseCodes.Err. The Err channel is connection-wide and carries no correlation tag, so if a second command's Err arrives during another command's ack window, it fires the wrong handler's once(Err) listener and falsely rejects an unrelated command.

Surfaced during review of #3722 (the 0x8C reply-theft fix). That PR serialized the 0x8C reply phase; this is the separate, pre-existing Ok/Sent/Err command-ack phase.

Affected handlers (src/server/meshcoreNativeBackend.ts)

Handler Ack awaited Lines
discover_path Sent vs Err ~922–936
discover_nodes Ok vs Err ~962–976
request_regions Sent (then BinaryResponse) vs Err ~1015–1042

Each does c.once(Err, onErr) where onErr rejects the promise. None owns the Err channel exclusively, so a concurrent send_message / discover_* / other command failing with Err in the window between our frame send and our own ack will reject us spuriously.

Severity

Low. These are manual/scheduled, retryable operations, and device Err responses are relatively rare, so the race window is narrow. Worst case is a spurious "Device rejected …" failure on a discovery/regions request that would have succeeded — a re-run clears it. No data corruption.

Why it's not a trivial patch

  • For request_regions there's a cheap partial mitigation: only treat Err as ours while !sentReceived (once Sent arrives, any later Err is definitely another command's). One line, clearly correct, but only helps this one handler.
  • discover_path and discover_nodes resolve their entire ack in one immediate window (Sent/Ok vs Err fired back-to-back), so there's no later phase to gate on. The only correct fix for them is serializing the command-ack window so just one raw-frame command is awaiting an ack at a time.

Proposed fix (design decision for maintainer)

Add a per-instance command-ack mutex (mirroring runExclusiveBinaryResponse from #3722, which is per-source by construction — each source owns its own backend instance and connection) that wraps the send→ack window of the raw-frame handlers. Options on breadth:

  1. Narrow: one runExclusiveCommandAck lock covering just the Ok/Sent/Err window of the three handlers above.
  2. Unified: a single "exclusive radio op" lock that subsumes both the command-ack window and the 0x8C reply window (would absorb runExclusiveBinaryResponse), giving one simple mental model at the cost of serializing these admin/discovery/telemetry ops against each other more broadly.

Either keeps the lock per-source so concurrent sources never block each other. Each handler needs a focused test (a concurrent foreign Err must not reject the in-flight command).

Acceptance criteria

  • A stray Err from a concurrent command cannot reject discover_path / discover_nodes / request_regions.
  • Per-source isolation preserved (no cross-source blocking).
  • Regression tests per handler.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingmeshcore

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions