Summary
Several MeshCore native-backend handlers send a raw frame via sendToRadioFrame and then once()-race the device's immediate command ack against ResponseCodes.Err. The Err channel is connection-wide and carries no correlation tag, so if a second command's Err arrives during another command's ack window, it fires the wrong handler's once(Err) listener and falsely rejects an unrelated command.
Surfaced during review of #3722 (the 0x8C reply-theft fix). That PR serialized the 0x8C reply phase; this is the separate, pre-existing Ok/Sent/Err command-ack phase.
Affected handlers (src/server/meshcoreNativeBackend.ts)
| Handler |
Ack awaited |
Lines |
discover_path |
Sent vs Err |
~922–936 |
discover_nodes |
Ok vs Err |
~962–976 |
request_regions |
Sent (then BinaryResponse) vs Err |
~1015–1042 |
Each does c.once(Err, onErr) where onErr rejects the promise. None owns the Err channel exclusively, so a concurrent send_message / discover_* / other command failing with Err in the window between our frame send and our own ack will reject us spuriously.
Severity
Low. These are manual/scheduled, retryable operations, and device Err responses are relatively rare, so the race window is narrow. Worst case is a spurious "Device rejected …" failure on a discovery/regions request that would have succeeded — a re-run clears it. No data corruption.
Why it's not a trivial patch
- For
request_regions there's a cheap partial mitigation: only treat Err as ours while !sentReceived (once Sent arrives, any later Err is definitely another command's). One line, clearly correct, but only helps this one handler.
discover_path and discover_nodes resolve their entire ack in one immediate window (Sent/Ok vs Err fired back-to-back), so there's no later phase to gate on. The only correct fix for them is serializing the command-ack window so just one raw-frame command is awaiting an ack at a time.
Proposed fix (design decision for maintainer)
Add a per-instance command-ack mutex (mirroring runExclusiveBinaryResponse from #3722, which is per-source by construction — each source owns its own backend instance and connection) that wraps the send→ack window of the raw-frame handlers. Options on breadth:
- Narrow: one
runExclusiveCommandAck lock covering just the Ok/Sent/Err window of the three handlers above.
- Unified: a single "exclusive radio op" lock that subsumes both the command-ack window and the
0x8C reply window (would absorb runExclusiveBinaryResponse), giving one simple mental model at the cost of serializing these admin/discovery/telemetry ops against each other more broadly.
Either keeps the lock per-source so concurrent sources never block each other. Each handler needs a focused test (a concurrent foreign Err must not reject the in-flight command).
Acceptance criteria
- A stray
Err from a concurrent command cannot reject discover_path / discover_nodes / request_regions.
- Per-source isolation preserved (no cross-source blocking).
- Regression tests per handler.
Summary
Several MeshCore native-backend handlers send a raw frame via
sendToRadioFrameand thenonce()-race the device's immediate command ack againstResponseCodes.Err. TheErrchannel is connection-wide and carries no correlation tag, so if a second command'sErrarrives during another command's ack window, it fires the wrong handler'sonce(Err)listener and falsely rejects an unrelated command.Surfaced during review of #3722 (the 0x8C reply-theft fix). That PR serialized the
0x8Creply phase; this is the separate, pre-existingOk/Sent/Errcommand-ack phase.Affected handlers (
src/server/meshcoreNativeBackend.ts)discover_pathSentvsErrdiscover_nodesOkvsErrrequest_regionsSent(thenBinaryResponse) vsErrEach does
c.once(Err, onErr)whereonErrrejects the promise. None owns theErrchannel exclusively, so a concurrentsend_message/discover_*/ other command failing withErrin the window between our frame send and our own ack will reject us spuriously.Severity
Low. These are manual/scheduled, retryable operations, and device
Errresponses are relatively rare, so the race window is narrow. Worst case is a spurious "Device rejected …" failure on a discovery/regions request that would have succeeded — a re-run clears it. No data corruption.Why it's not a trivial patch
request_regionsthere's a cheap partial mitigation: only treatErras ours while!sentReceived(onceSentarrives, any laterErris definitely another command's). One line, clearly correct, but only helps this one handler.discover_pathanddiscover_nodesresolve their entire ack in one immediate window (Sent/OkvsErrfired back-to-back), so there's no later phase to gate on. The only correct fix for them is serializing the command-ack window so just one raw-frame command is awaiting an ack at a time.Proposed fix (design decision for maintainer)
Add a per-instance command-ack mutex (mirroring
runExclusiveBinaryResponsefrom #3722, which is per-source by construction — each source owns its own backend instance and connection) that wraps the send→ack window of the raw-frame handlers. Options on breadth:runExclusiveCommandAcklock covering just theOk/Sent/Errwindow of the three handlers above.0x8Creply window (would absorbrunExclusiveBinaryResponse), giving one simple mental model at the cost of serializing these admin/discovery/telemetry ops against each other more broadly.Either keeps the lock per-source so concurrent sources never block each other. Each handler needs a focused test (a concurrent foreign
Errmust not reject the in-flight command).Acceptance criteria
Errfrom a concurrent command cannot rejectdiscover_path/discover_nodes/request_regions.