Fix handling of radio/network overruns by TMRh20 · Pull Request #249 · nRF24/RF24Network

TMRh20 · 2025-05-01T11:02:34Z

The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed
Also return NETWORK_OVERRUN when this happens
Add network system type 160 NETWORK_OVERRUN
Also return before radio.read() if payload too small

I was finally able to recreate this issue repeatedly using the following Arduino Sketch while the RPi was actively transmitting and receiving data.

#include <SPI.h>
#include <RF24.h>
#include <RF24Network.h>

RF24 radio(7, 8);  // nRF24L01(+) radio attached using Getting Started board

RF24Network network(radio);  // Network uses that radio

const uint16_t this_node = 04;   // Address of our node in Octal format
const uint16_t other_node = 00;  // Address of the other node in Octal format

const unsigned long interval = 0;  // How often (in ms) to send 'hello world' to the other unit

unsigned long last_sent;     // When did we last send?
unsigned long packets_sent;  // How many have we sent already


struct payload_t {  // Structure of our payload
  unsigned long ms;
  unsigned long counter;
};

void setup(void) {
  Serial.begin(115200);
  while (!Serial) {
    // some boards need this because of native USB capability
  }
  Serial.println(F("RF24Network/examples/helloworld_tx/"));

  if (!radio.begin()) {
    Serial.println(F("Radio hardware not responding!"));
    while (1) {
      // hold in infinite loop
    }
  }
  radio.setChannel(50);
  network.begin(/*node address*/ this_node);
  RF24NetworkHeader header(/*to node*/ other_node);
  payload_t payload = { millis(), packets_sent++ };
  network.write(header, &payload, sizeof(payload));
  radio.stopListening();
}

void loop() {

  uint8_t data[32];
  radio.write(data,sizeof(data));

}

The system type used etc can be up for debate, I just threw this fix together.

Closes nRF24/RF24#1033

- The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed - Also return NETWORK_OVERRUN when this happens - Add network system type 160 NETWORK_OVERRUN - Also return before radio.read() if payload too small

TMRh20 · 2025-05-01T11:05:30Z

nrf_to_nrf lib has no flush_rx function, so the build is failing...

TMRh20 · 2025-05-01T11:24:40Z

Added flush_rx() to nrf_to_nrf, I'm tempted to just merge this anyway, because it may take a while to pickup on the new release for the jobs to run successfully.

- Need to clear the packets from the radio even if they are shorter than RF24Network header size

TMRh20 · 2025-05-01T16:06:18Z

So current thoughts are that this addresses the problem, but I'm still not entirely sure why the radio would so often go into an available() loop unless there was random interference popping up. It seems to be pretty random and still a bit difficult to recreate on demand.

As per below, it seems to happen in bunches, with multiple timeouts happening in short periods of time:

pi@RPi5:~ $ tail -F failLog.txt
10
tail: failLog.txt: file truncated
13
tail: failLog.txt: file truncated
14
tail: failLog.txt: file truncated
15
tail: failLog.txt: file truncated
17
tail: failLog.txt: file truncated
19
tail: failLog.txt: file truncated
20

I also want to play around with the timeout, because 1 second seems like a long time to remain in the available function.

2bndy5 · 2025-05-01T16:43:56Z

The payload is not removed from the RX FIFO unless the entire payload's length is read(). Is it possible that getDynamicPayloadSize() returns something shorter than the actual payload length?

2bndy5 · 2025-05-01T16:46:43Z

100 milliseconds would match the timeout in RF24::write().

TMRh20 · 2025-05-01T18:04:08Z

Is it possible that getDynamicPayloadSize() returns something shorter than the actual payload length?

Hmm, the DPL is just an added field in the overall packet that gets sent, so it probably would be possible for this to happen. I should be able to do it with NRF52 devices on purpose. I can probably test that out later.

2bndy5 · 2025-05-01T18:13:32Z

I doubt you'd be able to with RF24 lib because payloads written to the TX FIFO are truncated to 32 bytes (or they should be). If that software limit didn't exist, then I believe it might be possible to mess up the dynamic payload size by uploading a buffer larger than 32 bytes to the TX FIFO when dynamic payloads is enabled.

TMRh20 · 2025-05-01T22:38:16Z

Yeah, it looks like I can't replicate that behavior with RF52 either.

100ms seems to be a good value for the timeout in testing, I think we can go with that.

2bndy5

I'm guessing you have local patches for the higher layer(s) to handle this new msg type.

2bndy5 · 2025-05-02T07:50:55Z

RF24Network.h

+/**
+ * Messages of this type indicate the network is being overrun with data & the RX FIFO has been flushed.
+ **/
+#define NETWORK_OVERRUN 160


Here's my thoughts on the actual number. Beware, I still thinking in binary from exposing the STATUS byte...

The msg type is an 8-bit var. I was thinking we'd assert the top 3 bits (0b11100000 = 0xE0 = 224) and use the lower 5 bits to indicate the layer.

It doesn't really matter though. As long as the number isn't already used.

If you really want to get fancy about it, we could declare an enum of ErrorKinds (or ReservedMsgTypes). That way the source code (and docs) would be a bit more organized about OSI errors.

This idea can wait I guess, until we flesh out what errors we want to propagate up the stack...

I was thinking we'd assert the top 3 bits (0b11100000 = 0xE0 = 224) and use the lower 5 bits to indicate the layer.

I think I'd need to see how this works out before forming an opinion. Knowing you it will all make sense in the end. :p

If you really want to get fancy about it, we could declare an enum of ErrorKinds.

That I understand, and its sounds like a good idea, but maybe for another PR.

2bndy5 · 2025-05-02T07:52:25Z

RF24Network.cpp

-            radio.failureDetected = 1;
-#endif
-            break;
+        if (millis() > timeout) {


Ah, that sweet performant math. 😆

TMRh20 · 2025-05-02T08:00:30Z

I'm guessing you have local patches for the higher layer(s) to handle this new msg type.

Not yet, currently nothing really needs to be done on encountering this error, but it would be nice for RF24Gateway to display the error count, instead of logging to a file like i have it doing currently.

2bndy5 · 2025-05-02T08:02:21Z

Yeah File IO is so timely. Although on Linux, its probably the same as writing to stdout or stderr.

- The rx buffer also needs to be flushed if R_RX_PL_WID returns 0 - No need for the delay nRF24/RF24Network#249

* Fix handling of radio/network overruns - The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed - Also return NETWORK_OVERRUN when this happens - Add network system type 160 NETWORK_OVERRUN * Set timeout to 100 in available()

TMRh20 requested a review from 2bndy5 May 1, 2025 11:26

Unneeded change

7a088ae

- Need to clear the packets from the radio even if they are shorter than RF24Network header size

Set timeout to 100 in available()

4b8cf7b

2bndy5 approved these changes May 2, 2025

View reviewed changes

TMRh20 merged commit 54063b3 into master May 2, 2025
25 checks passed

TMRh20 deleted the FixAvailableLoop branch May 2, 2025 07:59

TMRh20 mentioned this pull request May 2, 2025

Utilize network overrun var in the main code and display in examples nRF24/RF24Gateway#51

Merged

TMRh20 added a commit to nRF24/RF24 that referenced this pull request May 2, 2025

Fix for Dynamic Payloads

4c3e2fc

- The rx buffer also needs to be flushed if R_RX_PL_WID returns 0 - No need for the delay nRF24/RF24Network#249

TMRh20 mentioned this pull request May 2, 2025

Fix for Dynamic Payloads nRF24/RF24#1036

Merged

TMRh20 added a commit to nRF24/RF24 that referenced this pull request May 2, 2025

Fix for Dynamic Payloads causing apparent Radio Failures (#1036)

9de7fdd

- The rx buffer also needs to be flushed if R_RX_PL_WID returns 0 - No need for the delay nRF24/RF24Network#249

2bndy5 mentioned this pull request May 7, 2025

updates from C++ upstream nRF24/CircuitPython_nRF24L01#66

Merged

Conversation

TMRh20 commented May 1, 2025

Uh oh!

TMRh20 commented May 1, 2025

Uh oh!

TMRh20 commented May 1, 2025

Uh oh!

TMRh20 commented May 1, 2025

Uh oh!

2bndy5 commented May 1, 2025

Uh oh!

2bndy5 commented May 1, 2025

Uh oh!

TMRh20 commented May 1, 2025

Uh oh!

2bndy5 commented May 1, 2025

Uh oh!

TMRh20 commented May 1, 2025

Uh oh!

2bndy5 left a comment

Choose a reason for hiding this comment

Uh oh!

2bndy5 May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TMRh20 May 2, 2025

Choose a reason for hiding this comment

Uh oh!

2bndy5 May 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

TMRh20 commented May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

2bndy5 commented May 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2bndy5 May 2, 2025 •

edited

Loading

TMRh20 commented May 2, 2025 •

edited

Loading