NextHopRouter: fix 49.7-day millis() rollover in retransmission timing#10227
NextHopRouter: fix 49.7-day millis() rollover in retransmission timing#10227nightjoker7 wants to merge 3 commits into
Conversation
Resolves the "FIXME, handle 51 day rolloever here!!!" in NextHopRouter::doRetransmissions() by switching the retransmission-due comparison from plain unsigned <= to a signed-difference cast. The previous p.nextTxMsec <= now comparison silently breaks across the ~49.7 day millis() wraparound: pending retransmissions either stall for the remainder of the wrap window, or all fire simultaneously at the rollover boundary. Long-running router/infrastructure nodes do hit this in practice. The replacement (int32_t)(p.nextTxMsec - now) <= 0 is the standard Arduino/embedded idiom for rollover-safe deadline checks and behaves identically to the original for any non-wrap timing.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes retransmission scheduling across the ~49.7-day millis() rollover by replacing an unsafe unsigned <= time comparison with a rollover-safe delta comparison.
Changes:
- Replace
p.nextTxMsec <= nowwith a signed-delta due check to handlemillis()wraparound. - Remove the old FIXME and document the rollover behavior and fix rationale inline.
…ransmit check Review feedback from @Copilot on PR meshtastic#10227: casting a uint32_t subtraction to int32_t is implementation-defined in C++ when the unsigned value exceeds INT32_MAX (even though it works on typical two's-complement targets). Switch to the fully well-defined unsigned half-range form: nextTxMsec is in the past-or-equal iff (now - nextTxMsec) has not wrapped past 2^31 ms. Future offsets < 2^31 ms wrap into the top half and read as 'not yet'. Same semantics as the signed-cast version on every two's-complement platform we care about, but portable to any conforming C++ impl.
|
Addressed in 1cd825bce: switched from signed-cast of unsigned subtraction to the fully well-defined unsigned half-range form — |
|
Tested on Nordic nRF54L15-DK (Zephyr 4.2.1 + arm-none-eabi-gcc).
Can't repro the 49.7-day rollover in a smoke test obviously, but the Tested-by: cvaldess |
Summary
Fix the
// FIXME, handle 51 day rolloever here!!!inNextHopRouter::doRetransmissions()by switching the retransmission-due check to a signed-difference comparison.Problem
millis()wraps every ~49.7 days (2^32 ms). Bothp.nextTxMsecandnowareuint32_t. Whennowwraps past the storednextTxMsec, the plain<=comparison silently does the wrong thing in two ways:nextTxMsec, smallnow): all pending retransmissions stall untilnowcatches up — minutes to days depending on how close to the wrap the packets were queued.nextTxMsec, largenowbefore wrap): the entire retransmission queue becomes due simultaneously, causing a burst of airtime at the rollover boundary.The packet-retransmission table is long-lived on routers and infrastructure nodes that stay up for weeks, so this FIXME is reachable in normal operation.
Fix
Standard Arduino/embedded idiom for rollover-safe "deadline has passed" checks — subtract first, then compare as signed:
(p.nextTxMsec - now)is computed inuint32_tand yields a value whose two's-complement interpretation asint32_tis the true signed time delta, provided the actual delta is within ±2^31 ms (~24.8 days). Retransmission deadlines are milliseconds to tens of seconds in the future, so this is always satisfied.This is the same pattern used elsewhere in the Arduino ecosystem for rollover-safe timing (e.g.
(long)(millis() - target) >= 0).Testing
develop.Risk
Minimal. One-line change. On non-rollover timelines
(int32_t)(future - now)is positive and<= 0is false exactly whennextTxMsec <= nowwas false — identical behavior. Difference only manifests across the wrap boundary, which is the bug being fixed.