Skip to content

fix(autoping): match acks by destination node + close duplicate-send race#3522

Merged
Yeraze merged 2 commits into
mainfrom
fix/autoping-ack-matching-and-race
Jun 17, 2026
Merged

fix(autoping): match acks by destination node + close duplicate-send race#3522
Yeraze merged 2 commits into
mainfrom
fix/autoping-ack-matching-and-race

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

Users reported auto-ping responses arriving out of order or failing even when the target node was reachable. Investigation found two root causes (source-awareness itself was already correct — sessions, settings, and response handling are all per-source).

Bug 1 — local transmit ACK misread as a delivery success

A want_ack DM produces two Routing packets with the same request_id, both error_reason=NONE, differing only by from (confirmed against current firmware — ReliableRouter implicit-ack vs destination ack):

  1. an implicit transmit ACK from our own local node — arrives first, only means the packet entered the mesh;
  2. the real end-to-end ACK from the destination — arrives second, or never.

A genuine delivery failure is a non-zero error_reason NAK (e.g. MAX_RETRANSMIT) from the local node.

handleAutoPingResponse matched on request_id alone, so it latched the transmit ACK, recorded a false success with a transmit-only duration, and cleared the pending slot — dropping the real ACK and any later failure NAK. Behavior therefore varied by topology (zero-hop direct often worked; multi-hop reported false successes), which matches "mostly works but many fail / out of order."

Fix: resolve a ping only on error_reason != 0 (NAK) or error_reason == 0 AND from == the destination (real ACK). Transmit/relay confirmations (error_reason 0 not from the destination) are ignored so the session keeps waiting for the real signal/timeout.

Bug 2 — check-then-act send race

sendNextAutoPing checked pendingRequestId, then awaited the send before setting it, and the setInterval callback isn't awaited. A tick firing during a slow send (queue/airtime backpressure, or a short user interval) slipped past the guard and launched a duplicate ping whose ack never matched → recorded as a timeout, clobbered timers.

Fix: a synchronous sending guard set before the first await; and arm pendingRequestId + the ack timeout synchronously after the send (timeoutMs resolved once at session start) so a fast response can't leave an orphaned timeout.

Tests

New meshtasticManager.autoPingAck.test.ts drives the real methods (the existing auto-ping.test.ts only re-implements logic inline):

  • self-transmit ACK and relay NONE are ignored; destination ACK counts exactly once
  • non-zero error_reason ⇒ NAK; pending timeout cleared on resolve
  • the duplicate-send guard (a second tick during an in-flight send sends nothing)

Full suite: 6631 passed, 0 failed.

🤖 Generated with Claude Code

Users reported auto-ping responses arriving out of order or failing even when
the target was reachable. Two root causes:

1. ACK misattribution. A want_ack DM produces TWO Routing packets with the SAME
   request_id, both error_reason=NONE, differing only by 'from': an implicit
   transmit ACK from our OWN local node (arrives first, only means the packet
   entered the mesh) and the real end-to-end ACK from the destination (arrives
   second, or never). A genuine failure is a non-zero error_reason NAK
   (MAX_RETRANSMIT) from the local node. handleAutoPingResponse matched on
   request_id alone, so it latched the transmit ACK and reported a false success
   with a transmit-only duration — dropping the real ACK and any later failure
   NAK. Now it resolves a ping only on error_reason!=0 (NAK) or error_reason==0
   AND from==the destination (real ACK); transmit/relay confirmations are
   ignored so the session keeps waiting. (Firmware behavior confirmed via the
   meshtastic-expert: ReliableRouter implicit-ack vs destination ack.)

2. Send race. sendNextAutoPing checked pendingRequestId, then awaited the send
   before setting it, and the setInterval callback isn't awaited — so a tick
   firing during a slow send (queue/airtime backpressure, or a short interval)
   slipped past the guard and launched a duplicate ping whose ack never matched.
   Added a synchronous 'sending' guard, and arm pendingRequestId + the ack
   timeout synchronously after the send (timeoutMs resolved once at session
   start) so a fast response can't leave an orphaned timeout.

Source-awareness was already correct (per-source manager/sessions/settings).

Adds meshtasticManager.autoPingAck.test.ts driving the real methods:
self-transmit/relay acks ignored, destination ack counts once, non-zero
error_reason is a NAK, timeout cleared on resolve, and the duplicate-send guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(autoping): match acks by destination node + close duplicate-send race

  • Read changed files
  • Review code quality and correctness
  • Check test coverage
  • Post final review

Overall Assessment

This is a well-reasoned, focused bug fix. The root cause analysis in the PR description is accurate, the fix is minimal and correct, and the new tests verify the exact failure modes. A few observations below.


Bug 1 Fix — ACK Matching (handleAutoPingResponse) ✅

meshtasticManager.ts:9450–9495

The logic correctly distinguishes:

  • errorReason != 0 → NAK (any sender)
  • errorReason == 0 && fromNum == session.requestedBy → real destination ACK

One subtle correctness note: the early return at line 9461 (when ignoring a transmit-only ACK) exits the entire function, not just the loop iteration. This is fine because requestId is unique per ping, so at most one session can match — but it's worth being aware of. If the loop were ever refactored to handle multiple sessions, this return should become a continue.

meshtasticManager.ts:7221–7223 — The typeof errorReason === 'number' guard is a good defensive move. The existing routing.error_reason || routing.errorReason extraction at line 7199 could return undefined, and the fallback to 1 (a non-zero failure code) is a safe conservative choice.


Bug 2 Fix — Send Race (sendNextAutoPing) ✅

meshtasticManager.ts:9376–9424

The fix is correct. Setting session.sending = true synchronously before the first await closes the check-then-act window. Resetting it in finally ensures no permanent lockout on error.

One minor observation: lastPingSentAt is set after await sendTextMessage() returns (line 9403), so it measures the time from when the message was queued/sent, not from when the user triggered the ping. This appears intentional (measuring round-trip from confirmed send) and is unchanged from before, just noted for clarity.


Test Coverage ✅

meshtasticManager.autoPingAck.test.ts

Coverage is solid for the stated bugs:

  • Self-transmit ACK ignored (line 67)
  • Relay NONE ignored (line 76)
  • Destination ACK resolves once (line 83)
  • Sequential self-transmit then destination ACK counts exactly once (line 93)
  • Non-zero errorReason → NAK (line 102)
  • Pending timeout cleared on resolve (line 112)
  • Request ID mismatch ignored (line 121)
  • Duplicate-send guard (line 147)

Gap worth noting: There's no test for the scenario where the NAK arrives before the timeout fires — i.e., verifying pendingTimeout is cleared on a NAK (only the ACK path is checked in the timeout test at line 112). This is a minor omission; the code path does clear it (clearTimeout(session.pendingTimeout) is called unconditionally before the status branch), but a test would make it explicit.

Gap worth noting: No test for requestId being falsy/undefined (the guard at line 7220 in processRoutingErrorMessage — coverage here depends on the existing integration flow, which is acceptable).

The test imports the singleton manager (line 43: module.default), which means tests share state between describe blocks unless autoPingSessions.clear() is called in beforeEach — which it is. Good.


Minor Observations

  1. Comment verbosity: The inline comments in sendNextAutoPing and handleAutoPingResponse are thorough — arguably more detailed than needed for production code, but they're accurate and will be valuable for future maintainers given how subtle the timing bugs were.

  2. session.sending not guarded in handleAutoPingTimeout: If a timeout fires while sending = true (unlikely — the timeout only starts after a successful send — but theoretically possible if send returns quickly and timeout is very short), handleAutoPingTimeout doesn't check sending. Not a bug in practice because pendingRequestId is set synchronously with pendingTimeout, so the timeout timer can't fire while the send is still awaiting.

  3. processRoutingErrorMessage doesn't early-return after auto-ping match: At line 7223, handleAutoPingResponse is called but the function continues into the general message-delivery ACK path. This is unchanged from before and correct because auto-ping DMs are also real messages in the DB, but it means a ping ACK also triggers getMessageByRequestIdAsync lookup. Low cost, not a bug.


Verdict

The fix is correct, well-documented, and appropriately tested. The two gaps in test coverage are minor. Ready to merge.

…#3522)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(autoping): match acks by destination node + close duplicate-send race

  • Read changed files
  • Review code quality and correctness
  • Check test coverage
  • Post final review

Overall Assessment

This is a well-reasoned, focused bug fix. The root cause analysis in the PR description is accurate, the implementation is minimal and correct, and the new tests verify the exact failure modes described. A few observations below.


Bug 1 Fix — ACK Matching (handleAutoPingResponse) ✅

meshtasticManager.ts:9450–9495

The logic correctly distinguishes the three cases:

  • errorReason != 0 → NAK (any sender — local node reports delivery failure)
  • errorReason == 0 && fromNum == session.requestedBy → real destination ACK
  • Anything else → transmit/relay confirmation, ignored

Subtle correctness note: The early return at line 9461 exits the entire function, not just the loop iteration. This is correct only because requestId is unique per ping, so at most one session matches. If this loop is ever refactored to handle multiple sessions concurrently, that return would need to become continue.

meshtasticManager.ts:7221–7223 — The typeof errorReason === 'number' guard with fallback to 1 is a good defensive move. routing.error_reason || routing.errorReason at line 7199 can return undefined (e.g., for a packet with an explicit error_reason=0), and defaulting to 1 (non-zero failure) is the safer conservative choice.

Minor observation: processRoutingErrorMessage at line 7223 calls handleAutoPingResponse and then continues into the general message-delivery ACK path below (line 7227+). This means a ping ACK also triggers getMessageByRequestIdAsync. This is technically correct (ping DMs are real messages in the DB), but it's a subtle flow — the auto-ping handler being a side-effect of general routing message processing rather than consuming it. Worth a comment noting this intentional fall-through if it isn't already clear to future readers.


Bug 2 Fix — Send Race (sendNextAutoPing) ✅

meshtasticManager.ts:9376–9425

The fix is correct. session.sending = true is set synchronously before the first await at line 9391, which closes the check-then-act window. The finally block at line 9422 ensures sending is always reset even on error, preventing a permanent lockout.

One correctness question: In the error path (catch block, lines 9409–9419), session.pendingRequestId is never set (correctly, since the send failed), but session.completedPings and session.failedPings are incremented. This means after a send failure, the sending = false reset in finally allows the next interval tick to attempt another ping — good. However, the status: 'timeout' label used for a send failure (line 9414) is slightly misleading; it's a send error, not a timeout. This is pre-existing behavior and not introduced by this PR, just noted.

lastPingSentAt timing: Set after await sendTextMessage() returns (line 9403), so it measures round-trip from when the message was queued/sent, not when the user triggered the ping. This appears intentional and is unchanged from before.


Test Coverage ✅

meshtasticManager.autoPingAck.test.ts

Coverage is solid for the stated bugs:

  • Self-transmit ACK ignored (line 67) ✅
  • Relay NONE ignored (line 76) ✅
  • Destination ACK resolves exactly once (line 83) ✅
  • Sequential self-transmit + destination ACK counts exactly once (line 93) ✅
  • Non-zero errorReason → NAK (line 102) ✅
  • Pending timeout cleared on ACK (line 112) ✅
  • Pending timeout cleared on NAK (line 121) ✅ — good, this was flagged as missing in earlier review and has now been added
  • Request ID mismatch ignored (line 131) ✅
  • Duplicate-send guard (line 157) ✅

One minor gap: No test for the scenario where sendTextMessage throws (the catch block in sendNextAutoPing). That path increments failedPings and resets sending — the finally reset is critical and untested. Low risk since finally is straightforward, but a test would harden the behavior.

Test design note: The test imports the singleton manager (line 43: module.default), which means tests share state unless cleaned between runs — which they are (autoPingSessions.clear() in beforeEach). Good discipline.


Minor Observations

  1. Comment verbosity: The inline comments in sendNextAutoPing and handleAutoPingResponse are more detailed than typical production code, but they're accurate and will be valuable for future maintainers given how subtle these timing bugs are. Appropriate for this case.

  2. handleAutoPingTimeout doesn't check session.sending: If a timeout fires while sending = true (theoretically possible if timeoutMs is extremely short and the send queues quickly), the timeout handler runs. In practice, pendingTimeout is only set after a successful await sendTextMessage() returns (line 9404), and sending is reset in finally immediately after — so by the time pendingTimeout is armed, sending is already false. Not a bug, but the sequencing is worth being aware of.

  3. handleAutoPingTimeout has a guard if (session.pendingRequestId === null) return (line 9501) but no corresponding guard in handleAutoPingResponse for the case where pendingTimeout fires concurrently with a real ACK. The pendingRequestId !== requestId check in the loop (line 9453) handles this correctly since pendingRequestId is nulled after resolution — safe.


Verdict

The fix is correct, well-documented, and well-tested. The two minor gaps (send-error path test, fall-through comment) are optional polish. Ready to merge.

@Yeraze Yeraze merged commit 25302e5 into main Jun 17, 2026
20 checks passed
@Yeraze Yeraze deleted the fix/autoping-ack-matching-and-race branch June 17, 2026 14:22
@Yeraze Yeraze mentioned this pull request Jun 17, 2026
Yeraze added a commit that referenced this pull request Jun 17, 2026
Release candidate for 4.11.0. Bumps the version across all five tracked
locations (package.json, package-lock.json, helm Chart.yaml version+appVersion,
desktop/package.json, desktop/src-tauri/tauri.conf.json).

4.11.0 carries the multi-source route-extraction refactors, the telemetry
ingestion unification (#3506/#3507/#3515), the auto-ping ack/race fix (#3522),
NodeInfo position-precision guard (#3516), backup-route hardening (#3524), and
the Node Details telemetry range selector (#3530).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant