Skip to content

Fix handling of radio/network overruns#249

Merged
TMRh20 merged 3 commits intomasterfrom
FixAvailableLoop
May 2, 2025
Merged

Fix handling of radio/network overruns#249
TMRh20 merged 3 commits intomasterfrom
FixAvailableLoop

Conversation

@TMRh20
Copy link
Member

@TMRh20 TMRh20 commented May 1, 2025

  • The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed
  • Also return NETWORK_OVERRUN when this happens
  • Add network system type 160 NETWORK_OVERRUN
  • Also return before radio.read() if payload too small

I was finally able to recreate this issue repeatedly using the following Arduino Sketch while the RPi was actively transmitting and receiving data.

#include <SPI.h>
#include <RF24.h>
#include <RF24Network.h>

RF24 radio(7, 8);  // nRF24L01(+) radio attached using Getting Started board

RF24Network network(radio);  // Network uses that radio

const uint16_t this_node = 04;   // Address of our node in Octal format
const uint16_t other_node = 00;  // Address of the other node in Octal format

const unsigned long interval = 0;  // How often (in ms) to send 'hello world' to the other unit

unsigned long last_sent;     // When did we last send?
unsigned long packets_sent;  // How many have we sent already


struct payload_t {  // Structure of our payload
  unsigned long ms;
  unsigned long counter;
};

void setup(void) {
  Serial.begin(115200);
  while (!Serial) {
    // some boards need this because of native USB capability
  }
  Serial.println(F("RF24Network/examples/helloworld_tx/"));

  if (!radio.begin()) {
    Serial.println(F("Radio hardware not responding!"));
    while (1) {
      // hold in infinite loop
    }
  }
  radio.setChannel(50);
  network.begin(/*node address*/ this_node);
  RF24NetworkHeader header(/*to node*/ other_node);
  payload_t payload = { millis(), packets_sent++ };
  network.write(header, &payload, sizeof(payload));
  radio.stopListening();
}

void loop() {

  uint8_t data[32];
  radio.write(data,sizeof(data));

}

The system type used etc can be up for debate, I just threw this fix together.

Closes nRF24/RF24#1033

- The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed
- Also return NETWORK_OVERRUN when this happens
- Add network system type 160 NETWORK_OVERRUN
- Also return before radio.read() if payload too small
@TMRh20
Copy link
Member Author

TMRh20 commented May 1, 2025

nrf_to_nrf lib has no flush_rx function, so the build is failing...

@TMRh20
Copy link
Member Author

TMRh20 commented May 1, 2025

Added flush_rx() to nrf_to_nrf, I'm tempted to just merge this anyway, because it may take a while to pickup on the new release for the jobs to run successfully.

@TMRh20 TMRh20 requested a review from 2bndy5 May 1, 2025 11:26
- Need to clear the packets from the radio even if they are shorter than RF24Network header size
@TMRh20
Copy link
Member Author

TMRh20 commented May 1, 2025

So current thoughts are that this addresses the problem, but I'm still not entirely sure why the radio would so often go into an available() loop unless there was random interference popping up. It seems to be pretty random and still a bit difficult to recreate on demand.

As per below, it seems to happen in bunches, with multiple timeouts happening in short periods of time:

pi@RPi5:~ $ tail -F failLog.txt
10
tail: failLog.txt: file truncated
13
tail: failLog.txt: file truncated
14
tail: failLog.txt: file truncated
15
tail: failLog.txt: file truncated
17
tail: failLog.txt: file truncated
19
tail: failLog.txt: file truncated
20

I also want to play around with the timeout, because 1 second seems like a long time to remain in the available function.

@2bndy5
Copy link
Member

2bndy5 commented May 1, 2025

The payload is not removed from the RX FIFO unless the entire payload's length is read(). Is it possible that getDynamicPayloadSize() returns something shorter than the actual payload length?

@2bndy5
Copy link
Member

2bndy5 commented May 1, 2025

100 milliseconds would match the timeout in RF24::write().

@TMRh20
Copy link
Member Author

TMRh20 commented May 1, 2025

Is it possible that getDynamicPayloadSize() returns something shorter than the actual payload length?

Hmm, the DPL is just an added field in the overall packet that gets sent, so it probably would be possible for this to happen. I should be able to do it with NRF52 devices on purpose. I can probably test that out later.

@2bndy5
Copy link
Member

2bndy5 commented May 1, 2025

I doubt you'd be able to with RF24 lib because payloads written to the TX FIFO are truncated to 32 bytes (or they should be). If that software limit didn't exist, then I believe it might be possible to mess up the dynamic payload size by uploading a buffer larger than 32 bytes to the TX FIFO when dynamic payloads is enabled.

@TMRh20
Copy link
Member Author

TMRh20 commented May 1, 2025

Yeah, it looks like I can't replicate that behavior with RF52 either.

100ms seems to be a good value for the timeout in testing, I think we can go with that.

Copy link
Member

@2bndy5 2bndy5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing you have local patches for the higher layer(s) to handle this new msg type.

/**
* Messages of this type indicate the network is being overrun with data & the RX FIFO has been flushed.
**/
#define NETWORK_OVERRUN 160
Copy link
Member

@2bndy5 2bndy5 May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my thoughts on the actual number. Beware, I still thinking in binary from exposing the STATUS byte...

The msg type is an 8-bit var. I was thinking we'd assert the top 3 bits (0b11100000 = 0xE0 = 224) and use the lower 5 bits to indicate the layer.

It doesn't really matter though. As long as the number isn't already used.


If you really want to get fancy about it, we could declare an enum of ErrorKinds (or ReservedMsgTypes). That way the source code (and docs) would be a bit more organized about OSI errors.

This idea can wait I guess, until we flesh out what errors we want to propagate up the stack...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we'd assert the top 3 bits (0b11100000 = 0xE0 = 224) and use the lower 5 bits to indicate the layer.

I think I'd need to see how this works out before forming an opinion. Knowing you it will all make sense in the end. :p

If you really want to get fancy about it, we could declare an enum of ErrorKinds.

That I understand, and its sounds like a good idea, but maybe for another PR.

radio.failureDetected = 1;
#endif
break;
if (millis() > timeout) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that sweet performant math. 😆

@TMRh20 TMRh20 merged commit 54063b3 into master May 2, 2025
25 checks passed
@TMRh20 TMRh20 deleted the FixAvailableLoop branch May 2, 2025 07:59
@TMRh20
Copy link
Member Author

TMRh20 commented May 2, 2025

I'm guessing you have local patches for the higher layer(s) to handle this new msg type.

Not yet, currently nothing really needs to be done on encountering this error, but it would be nice for RF24Gateway to display the error count, instead of logging to a file like i have it doing currently.

@2bndy5
Copy link
Member

2bndy5 commented May 2, 2025

Yeah File IO is so timely. Although on Linux, its probably the same as writing to stdout or stderr.

TMRh20 added a commit to nRF24/RF24 that referenced this pull request May 2, 2025
- The rx buffer also needs to be flushed if R_RX_PL_WID returns 0
- No need for the delay

nRF24/RF24Network#249
TMRh20 added a commit to nRF24/RF24 that referenced this pull request May 2, 2025
- The rx buffer also needs to be flushed if R_RX_PL_WID returns 0
- No need for the delay

nRF24/RF24Network#249
2bndy5 pushed a commit that referenced this pull request May 4, 2025
* Fix handling of radio/network overruns

- The network can get stuck in an available loop if the radio is receiving data faster than the network can process it. If this happens, the RX FIFO needs to be flushed
- Also return NETWORK_OVERRUN when this happens
- Add network system type 160 NETWORK_OVERRUN

* Set timeout to 100 in available()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

csDelay required for Linux devices

2 participants