Skip to content

NextHopRouter: fix 49.7-day millis() rollover in retransmission timing#10227

Open
nightjoker7 wants to merge 3 commits into
meshtastic:developfrom
nightjoker7:fix/nexthoprouter-millis-rollover
Open

NextHopRouter: fix 49.7-day millis() rollover in retransmission timing#10227
nightjoker7 wants to merge 3 commits into
meshtastic:developfrom
nightjoker7:fix/nexthoprouter-millis-rollover

Conversation

@nightjoker7

Copy link
Copy Markdown
Contributor

Summary

Fix the // FIXME, handle 51 day rolloever here!!! in NextHopRouter::doRetransmissions() by switching the retransmission-due check to a signed-difference comparison.

Problem

// FIXME, handle 51 day rolloever here!!!
if (p.nextTxMsec <= now) {

millis() wraps every ~49.7 days (2^32 ms). Both p.nextTxMsec and now are uint32_t. When now wraps past the stored nextTxMsec, the plain <= comparison silently does the wrong thing in two ways:

  1. Just before rollover (large nextTxMsec, small now): all pending retransmissions stall until now catches up — minutes to days depending on how close to the wrap the packets were queued.
  2. Just after rollover (small nextTxMsec, large now before wrap): the entire retransmission queue becomes due simultaneously, causing a burst of airtime at the rollover boundary.

The packet-retransmission table is long-lived on routers and infrastructure nodes that stay up for weeks, so this FIXME is reachable in normal operation.

Fix

Standard Arduino/embedded idiom for rollover-safe "deadline has passed" checks — subtract first, then compare as signed:

if ((int32_t)(p.nextTxMsec - now) <= 0) {

(p.nextTxMsec - now) is computed in uint32_t and yields a value whose two's-complement interpretation as int32_t is the true signed time delta, provided the actual delta is within ±2^31 ms (~24.8 days). Retransmission deadlines are milliseconds to tens of seconds in the future, so this is always satisfied.

This is the same pattern used elsewhere in the Arduino ecosystem for rollover-safe timing (e.g. (long)(millis() - target) >= 0).

Testing

  • Builds cleanly against develop.
  • Running on a 4-node fleet (RAK4631 / Station G2 / T114 / Heltec V3). Reliable-DM retransmissions continue to behave correctly under normal millis() values; the rollover case is by construction impossible to exercise in a reasonable test window but the replacement expression is a drop-in for the classic Arduino rollover-safe compare.

Risk

Minimal. One-line change. On non-rollover timelines (int32_t)(future - now) is positive and <= 0 is false exactly when nextTxMsec <= now was false — identical behavior. Difference only manifests across the wrap boundary, which is the bug being fixed.

Resolves the "FIXME, handle 51 day rolloever here!!!" in
NextHopRouter::doRetransmissions() by switching the retransmission-due
comparison from plain unsigned <= to a signed-difference cast.

The previous p.nextTxMsec <= now comparison silently breaks across the
~49.7 day millis() wraparound: pending retransmissions either stall
for the remainder of the wrap window, or all fire simultaneously at
the rollover boundary. Long-running router/infrastructure nodes do hit
this in practice.

The replacement (int32_t)(p.nextTxMsec - now) <= 0 is the standard
Arduino/embedded idiom for rollover-safe deadline checks and behaves
identically to the original for any non-wrap timing.
@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label Apr 21, 2026
@thebentern thebentern requested a review from Copilot April 21, 2026 20:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes retransmission scheduling across the ~49.7-day millis() rollover by replacing an unsafe unsigned <= time comparison with a rollover-safe delta comparison.

Changes:

  • Replace p.nextTxMsec <= now with a signed-delta due check to handle millis() wraparound.
  • Remove the old FIXME and document the rollover behavior and fix rationale inline.

Comment thread src/mesh/NextHopRouter.cpp Outdated
…ransmit check

Review feedback from @Copilot on PR meshtastic#10227: casting a uint32_t
subtraction to int32_t is implementation-defined in C++ when the
unsigned value exceeds INT32_MAX (even though it works on typical
two's-complement targets).

Switch to the fully well-defined unsigned half-range form:
  nextTxMsec is in the past-or-equal iff (now - nextTxMsec) has not
  wrapped past 2^31 ms. Future offsets < 2^31 ms wrap into the top
  half and read as 'not yet'.

Same semantics as the signed-cast version on every two's-complement
platform we care about, but portable to any conforming C++ impl.
@nightjoker7

Copy link
Copy Markdown
Contributor Author

Addressed in 1cd825bce: switched from signed-cast of unsigned subtraction to the fully well-defined unsigned half-range form — (now - nextTxMsec) < 0x80000000u. Same semantics on every two's-complement platform, but portable to any conforming C++ impl.

@cvaldess

Copy link
Copy Markdown
Contributor

Tested on Nordic nRF54L15-DK (Zephyr 4.2.1 + arm-none-eabi-gcc).

  • Cherry-picked both commits clean on top of upstream/develop.
  • Builds without warnings, no footprint change.
  • doRetransmissions() runs as expected during ~5 min of mesh traffic;
    no behavioural regressions observed (LoRa RX/TX, BLE pair, config stream
    all OK).

Can't repro the 49.7-day rollover in a smoke test obviously, but the
unsigned half-range form is well-defined per the standard whereas the
prior (int32_t)(unsigned - unsigned) <= 0 was implementation-defined
when the result exceeded INT32_MAX — so this is also a portability win,
not just a correctness one. LGTM.

Tested-by: cvaldess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants