fix(autoping): match acks by destination node + close duplicate-send race#3522
Conversation
Users reported auto-ping responses arriving out of order or failing even when the target was reachable. Two root causes: 1. ACK misattribution. A want_ack DM produces TWO Routing packets with the SAME request_id, both error_reason=NONE, differing only by 'from': an implicit transmit ACK from our OWN local node (arrives first, only means the packet entered the mesh) and the real end-to-end ACK from the destination (arrives second, or never). A genuine failure is a non-zero error_reason NAK (MAX_RETRANSMIT) from the local node. handleAutoPingResponse matched on request_id alone, so it latched the transmit ACK and reported a false success with a transmit-only duration — dropping the real ACK and any later failure NAK. Now it resolves a ping only on error_reason!=0 (NAK) or error_reason==0 AND from==the destination (real ACK); transmit/relay confirmations are ignored so the session keeps waiting. (Firmware behavior confirmed via the meshtastic-expert: ReliableRouter implicit-ack vs destination ack.) 2. Send race. sendNextAutoPing checked pendingRequestId, then awaited the send before setting it, and the setInterval callback isn't awaited — so a tick firing during a slow send (queue/airtime backpressure, or a short interval) slipped past the guard and launched a duplicate ping whose ack never matched. Added a synchronous 'sending' guard, and arm pendingRequestId + the ack timeout synchronously after the send (timeoutMs resolved once at session start) so a fast response can't leave an orphaned timeout. Source-awareness was already correct (per-source manager/sessions/settings). Adds meshtasticManager.autoPingAck.test.ts driving the real methods: self-transmit/relay acks ignored, destination ack counts once, non-zero error_reason is a NAK, timeout cleared on resolve, and the duplicate-send guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @Yeraze's task —— View job PR Review: fix(autoping): match acks by destination node + close duplicate-send race
Overall AssessmentThis is a well-reasoned, focused bug fix. The root cause analysis in the PR description is accurate, the fix is minimal and correct, and the new tests verify the exact failure modes. A few observations below. Bug 1 Fix — ACK Matching (
|
…#3522) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @Yeraze's task —— View job PR Review: fix(autoping): match acks by destination node + close duplicate-send race
Overall AssessmentThis is a well-reasoned, focused bug fix. The root cause analysis in the PR description is accurate, the implementation is minimal and correct, and the new tests verify the exact failure modes described. A few observations below. Bug 1 Fix — ACK Matching (
|
Release candidate for 4.11.0. Bumps the version across all five tracked locations (package.json, package-lock.json, helm Chart.yaml version+appVersion, desktop/package.json, desktop/src-tauri/tauri.conf.json). 4.11.0 carries the multi-source route-extraction refactors, the telemetry ingestion unification (#3506/#3507/#3515), the auto-ping ack/race fix (#3522), NodeInfo position-precision guard (#3516), backup-route hardening (#3524), and the Node Details telemetry range selector (#3530). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Users reported auto-ping responses arriving out of order or failing even when the target node was reachable. Investigation found two root causes (source-awareness itself was already correct — sessions, settings, and response handling are all per-source).
Bug 1 — local transmit ACK misread as a delivery success
A
want_ackDM produces two Routing packets with the samerequest_id, botherror_reason=NONE, differing only byfrom(confirmed against current firmware —ReliableRouterimplicit-ack vs destination ack):A genuine delivery failure is a non-zero
error_reasonNAK (e.g.MAX_RETRANSMIT) from the local node.handleAutoPingResponsematched onrequest_idalone, so it latched the transmit ACK, recorded a false success with a transmit-only duration, and cleared the pending slot — dropping the real ACK and any later failure NAK. Behavior therefore varied by topology (zero-hop direct often worked; multi-hop reported false successes), which matches "mostly works but many fail / out of order."Fix: resolve a ping only on
error_reason != 0(NAK) orerror_reason == 0 AND from == the destination(real ACK). Transmit/relay confirmations (error_reason 0not from the destination) are ignored so the session keeps waiting for the real signal/timeout.Bug 2 — check-then-act send race
sendNextAutoPingcheckedpendingRequestId, then awaited the send before setting it, and thesetIntervalcallback isn't awaited. A tick firing during a slow send (queue/airtime backpressure, or a short user interval) slipped past the guard and launched a duplicate ping whose ack never matched → recorded as a timeout, clobbered timers.Fix: a synchronous
sendingguard set before the first await; and armpendingRequestId+ the ack timeout synchronously after the send (timeoutMsresolved once at session start) so a fast response can't leave an orphaned timeout.Tests
New
meshtasticManager.autoPingAck.test.tsdrives the real methods (the existingauto-ping.test.tsonly re-implements logic inline):error_reason⇒ NAK; pending timeout cleared on resolveFull suite: 6631 passed, 0 failed.
🤖 Generated with Claude Code