Skip to content

Conversation

@CapnBry
Copy link
Member

@CapnBry CapnBry commented Aug 4, 2025

Changes OtaCrcInitializer to also include the OtaNonce to attempt to guarantee RX and TX sync. The purpose is to resolve "low LQ" connections where an RX is close but is only randomly receiving packets by chance due to the TX restarting and having a completely different OtaNonce.

Much thanks go to @mha1 for all the technical discussion, proof of concept testing, initial work, and finding the F500 telemetry issue.

Fixes #3258
Perhaps addresses #3157

Details

If a TX stops for a moment due to being power cycled, the user hitting Bind while connected, or some config commits, a receiver may not fully time out and go to the disconnected state, meaning that if it manages to receive packets by random chance it will continue to stay in a "connected" state but at a very low LQ. The receiver usually will not get a sync packet to resolve the issue as the RX and TX will never be on the sync channel at the same time. This PR aims to resolve this by initializing the packet's CRC using the OtaNonce. CRC errors effectively prevent those random packet receptions from keeping the link alive, and the RX will return to the disconnected state and listen for a new SYNC to restart.

Technical

OtaCrcInitializer = (UID[4] << 8) | UID[5]) ^ (OTA_VERSION_ID << 8) ^ OtaNonce

The OTA version has been shifted up into the high byte of the initializer (was in the low byte), which will allow for up to 16 OTA versions before there is a version collision on the standard CRC-14. The Nonce is then XORed into the low byte if the packet is not a SYNC-type packet. SYNC packets always use 0 as the Nonce for the purpose of CRC, as the receiver does not know the OtaNonce before SYNC is received. This "unprotected" SYNC packet does not break the purpose of the PR as the SYNC packet would resolve the low-LQ connection even without this additional check.

OTA changes

  • Wide packet mode no longer uses the FHSS index to attempt to guarantee RX / TX sync. This is redundant with these changes and while the Wide CRC init is helpful in preventing a single Nonce slip, it clearly did not work when the two were further apart.

RX code changes

  • Changed to move the OtaNonce advancement to at the start of the packet period, rather than doing it at the end and always using OtaNonce + 1 in all its calculations.
  • Always FHSS at the beginning of the packet period rather than maybe doing it after packet reception. There's no safety afforded by doing this in advance, as if a packet is missed it will have to do the FHSS before it can receive anyway so there's no point to having this redundant code.
  • Change scheduling of the RX's packet period start (TOCK) to much later for F500, and any other future low duty cycle packet rates.

TOCK issue

There's a major architecture issue with F500 that caused telemetry to stop working once the OtaNonce integrity check was added. This is because the RX receives the packet roughly 512us into the 2000us packet period. The old code schedules the next TOCK for 200us after the packet reception, which meant that the F500 telemetry would actually be received in the previous packet period on the TX. Excuse the crudity of this drawing but note how for F500 the RX's next packet period begins so early that it is received in the previous packet period

image
Rate TOA (us) TXdone (us) Telem Arrives at TX (us)
50Hz 10798 10885 2460
100Hz Full 7605 7695 5875
150Hz 5871 5961 5718
250Hz 3300 3422 3188
333Hz Full 2374 2469 2256
500Hz 1507 1600 1492
F500 389 512 1335 in Nonce-1
F1000 389 512 331

To resolve this, the RX code now will schedule the TOCK roughly 1x TOA from where it thinks the end of the TX's packet period is. This estimate does not include the roughly 2x 100us of time required on the TX and the RX to begin a transmission, so this guarantees that the RX's telemetry is received by the TX in the appropriate packet period. Affected packet rates:

  • F500 - Telemetry received ~950us later, around 350us into proper period
  • F1000, D500, D250 - Telemetry received ~20us later, around 350us into proper period

Tested

Tested on every available packet rate on each and every one of the following hardware combinations

  • Ranger Micro + HappyModel EP2 (FCC)
  • Ranger Micro + HappyModel EP2 (LBT)
  • HappyModel ES900 TX + HappyModel ES900 RX
  • HappyModel ES900 TX + DAKEFPV 900 (fail on 200Hz)
  • BetaFPV SuperG Gemini + HappyModel EP Dual (in Gemini, Switch, Ant 2 modes)
  • RadioMaster TX15 + HappyModel EP2 (FCC)
  • RadioMaster TX15 + DAKEFPV 900 (fail on 200Hz Team900)

// For rates where the TOA is longer than half the packet period schedule the TOCK for rougly 1x TOA before
// the TX's end of the period so telemetry is received by the TX in the correct period. For all others,
// schedule TOCK to be PACKET_TO_TOCK_SLACK (us) after RX packet reception.
int32_t slack = std::max(ExpressLRS_currAirRate_Modparams->interval - 2 * ExpressLRS_currAirRate_RFperfParams->TOA, (int32_t)PACKET_TO_TOCK_SLACK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to check is how well this plays with the transmitters Rxnb timeout, which is set to 1 interval.

With this change and at close range the transmitter will have finished receiving tlm right at the end of the timeout, which is ok because the timeout is cancelled on preamble/sync. Add delay due to ToF and you have artificially caused a tlm loss due to the preamble not being received before the timeout.

I know I know... its F500 and who TF is flying at any kind of range. But this is the type of thing to bite us later. So it may be worthwhile looking at the transmitters timeout interval and having it cancel a period of TOA before the next packet Txnb.

Copy link
Member Author

@CapnBry CapnBry Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah damn, you're right about that, I had forgotten about the Rx timeout.

For F500 the transmit ends at ~500us, meaning there's a deadline of approximately 500us into the next cycle for a preamble to be received and cancel the timer. Given that the F500 telemetry RXdoneISR fires around 350us into the cycle (this PR) and subtracting the 389us TOA, that means the preamble started at roughly -40us which would be 540us of "range margin" before telemetry stopped working (~80km round trip for vacuum light).

It's been a while, but remind me what the RX timeout does for us? For telemetry it seems the timeout could only be a hindrance to achieving long range since it would always fire too early. It should be set for 2 * interval - 2 * TOA - 2 * Overhead for it to fire at the last possible start-of-reception in the telemetry packet period. For F500 that's 2*2000 - 2*389 - 2*100 = 3022 or more than 50% additional time left on the table just from the timeout. For the RX it seems we have other mechanisms of timing which would fire before the timeout as well. Is it just a safety mechanism to prevent stalling, and if so can we up it to 1.5x interval?

@JyeSmith
Copy link
Member

JyeSmith commented Aug 4, 2025

Nice addition of jamming nonce in the crc.

Some SPI trances showing the IRQs would be great to confirm the changes around SLACK.

@CapnBry
Copy link
Member Author

CapnBry commented Aug 17, 2025

There hasn't been any movement on this because I was unsure how we can calculate an appropriate timeout before the first packet is even transmitted. The timeout is strictly used on the TX side to take the RF out of receive mode, I believe for the purpose of reducing the turnaround time for the next period's transmit and keep it on a steady schedule. The complexity is that there's a relatively fixed amount of time between TOCK and starting the transmit, and possibly an LBA Clear Channel Assessment, and only then timeout starts at the end of the transmit.

Table below shows measured timing of how far from the end of a period a typical telemetry packet is received using the existing interval timeout. Note that the F500 value is using the RX sync adjustment from this PR; without it its value is >100% due to arriving in the previous period. "Percentage of interval remaining" is the units used to allow the values from different packet rates to be compared. This was measured by turning on the ISR for RX_TX_TIMEOUT. LBT was not used for this measurement as LBT almost always throws a timeout during the CCA, making measurement more cumbersome.

Rate (non-LBT) Telem Arrives (% rem) Timeout (% rem)
50Hz 87 44
100Hz Full 40 21
150Hz 13 7
250Hz 19 9
333Hz Full 23 11
500Hz 23 10
F500 80 65
F1000 61 30

-- Deletes a bunch of other tables of calculated data and theory to get to the point --

  1. We don't seem to be leaving any telemetry range performance on the the table with the existing timeout (F500 excluded). Great!
  2. We're fine with the timeout firing way later in the period than it would be to safely receive telemetry, giving a lot of leeway in what's an appropriate value

Results

For rates where TOA is less than half the packet period, use the theoretical time remaining this period (interval - TOA) plus half packet period. This is at least 1 full interval, but has enough padding that it shouldn't get close to the next TOCK.

F500 - Timeout increased to 2000 -> 2611us. It now times out with 34% remaining compared to 65% on master (measured both), increasing the range margin from 540us (~80km) to 1160us (~173km) but still less than the roughly 1700us (~255km) achievable using master's system. That value is impossible by any fixed value logic in which the telemetry safely arrives in the current period.

Just like the F500 change from the OP, all other rates are unchanged.

@CapnBry
Copy link
Member Author

CapnBry commented Sep 15, 2025

Thanks for the help keeping this up to date, @mha1 ❤️ !

@mha1
Copy link
Contributor

mha1 commented Sep 15, 2025

@CapnBry: Sure thing.

Re-tested it after the merge commit and everything works fine.

This PR could use some reviews and approvals.

@pkendall64
Copy link
Collaborator

I was going to ask @CapnBry, is this ready to review and for merging?

@CapnBry
Copy link
Member Author

CapnBry commented Sep 15, 2025

I was going to ask @CapnBry, is this ready to review and for merging?

Yes it is. I've tested it six ways to Sunday on all the modes for SX127x, SX1280, and LR1121 (although I can't remember now why it was OK Team900 200Hz didn't work at original time of testing). All the timing was tested in software measuring from the TOCK ISR to the appropriate RF ISR and the only one that has any change beyond a handful of microseconds is F500.

Funny that the whole point of this PR was just to put the OtaNonce in the CRC seed, which Mickey did pretty fast, and then it became dozens of hours of figuring out then shifting the F500 telemetry to be received in the correct period 🤣

EDIT: I haven't done a full run through of all the testing after the merges but if we're going to move on getting this merged I can make another pass to confirm!

@pkendall64
Copy link
Collaborator

pkendall64 commented Sep 16, 2025

I think we will need to get the 3.x.x merge done to master, then merge master to this branch and test that LR1121 & LBT is still working correctly before merging this to master.

@CapnBry
Copy link
Member Author

CapnBry commented Sep 16, 2025

Yeah that's a good idea. I'll wait for 3.x.x-maint to get rolled in, merge to this pr and run the full suite again.

@CapnBry
Copy link
Member Author

CapnBry commented Sep 20, 2025

Thanks, PK! I'm out of town through Sunday but I'll have to run through the 800 different test cases again on Monday and report back

@CapnBry
Copy link
Member Author

CapnBry commented Sep 23, 2025

Testing methodology

  • RX connected to standalone_test.py to monitor RQly
  • TX DEBUG with timing info added / monitor TQly
  • Wide switch mode
  • Std telemetry ratio, some additionally tested at 1:2
  • At least 20 seconds per rate
    • Verify 100% LQ on RX
    • Verify 100% LQ on TX
    • Note 1RSS/2RSS for anomalies
    • Note telemetry arrival time

Results

  • 2.4GHz LBT SX1280 shows sub 100 LQ on all rates. This seems far to consistent to be LBT doing its thing. bdc340b (random commit from early Aug) and 3.5.6 also affected so I suppose this is just LBT? But why only SX1280 and not LR1121
  • master issue "No more deferred function slots available" on rate switch 2.4GHz SX1280 LBT and lock up. Incredibly rare (3x in maybe 50 switches)
  • ?!??!??! TX on bdc340 (4.0) can connect to 3.5.6 RX?

Other than that, everything related to this PR checks out, so long as DEBUG_LOG isn't defined. With DEBUG_LOG, 2.4GHz K modes sometimes could get out of sync, but the CRC was correct on the packets received, which is baffling to me. Why just the 2.4 K modes only? Could not reproduce it without DEBUG_LOG.

Test Details

LR1121 Gemini 915/FCC (4 full tests)

  • TX Radiomaster Nomad
  • RX BAYCK Dual Band Gemini
  • Also tested with LR1121 "Ant 1" with practically identical results
  • RX Test2 SX1280 HappyModel EP2 (Overlapping 2.4)
  • RX Test3 SX127x BetaFPV SuperP 900 (Overlapping 900)
  • Unable to reproduce 2.4 K mode issues with DEBUG_LOG off
Rate Telem Arrives (% rem) Note
900 D50 8
900 25Hz 45
900 50Hz 8
900 100Hz 14
900 100 Full 57
900 200Hz 7
900 200 Full 36
900 250Hz 31
900 K1000 Full 26 TLM CRC Error rate increased?
2.4 50Hz 86
2.4 100 Full 38
2.4 150Hz 10
2.4 250Hz 12
2.4 333 Full 14
2.4 500Hz 7
2.4 DK250 22 FAIL - Connect drops / partial sync
2.4 DK500 22 FAIL ^^
2.4 K1000 15 FAIL ^^
X 100 Full 31
X 150Hz 32

LR1121 Gemini 868/LBT

  • TX Radiomaster Nomad
  • RX BAYCK Dual Band Gemini
  • Frequent Non-TLM CRC Errors (Noise?) on 868 100Hz Full, 200Hz
  • RX: Guru Meditation Error: Core 1 panic once when changing rates on 868, repeated on 2.4, repeated on X. Various places between timer locked / matching tx power. Unable to reproduce with DEBUG_LOG off
  • Unable to reproduce 2.4 K mode issues with DEBUG_LOG off
Rate Telem Arrives (% rem) Note
868 D50 8
868 25Hz 45
868 50Hz 8
868 100Hz 14
868 100 Full 57
868 200Hz 7
868 200 Full 36
868 250Hz 31
868 K1000 26
2.4 50Hz 85
2.4 100 Full 38
2.4 150Hz 10
2.4 250Hz 11
2.4 333 Full 13
2.4 500Hz 7
2.4 DK250 19 FAIL - Connect drops / partial sync
2.4 DK500 18 FAIL ^^
2.4 K1000 13 FAIL ^^
X 100 Full 32
X 150Hz 33

SX1280 2.4 FCC

  • TX Radiomaster Ranger Micro
  • RX HappyModel EP2
Rate Telem Arrives (% rem) Note
50Hz 87
100 Full 40
150Hz 12
250Hz 17
333 Full 21
500Hz 19
D250 61
D500 61
F500 77
F1000 54

SX1280 2.4 LBT

  • TX Radiomaster Ranger Micro
  • RX HappyModel EP2
  • "No more deferred function slots available" a ton then locked up on switch 50Hz -> 100, 150Hz -> 250Hz, incredibly rare, rate doesn't matter
Rate Telem Arrives (% rem) Note
50Hz 86 94% RQly
100 Full 39 90s% RQly
150Hz 11 85-95% RQly
250Hz 15 90s% RQly
333 Full 18 90s% RQly
500Hz 15 90s% RQly
D250 52
D500 52
F500 72 95-99% RQly
F1000 44 97-100% RQly

SX127x 915 FCC

  • TX HappyModel ES900
  • RX BetaFPV SuperD 900
Rate Telem Arrives (% rem) Note
D50 9
25Hz 45
50Hz 8
100Hz 15
100 Full 58
200Hz 8

DEBUG_LOG diff

diff --git a/src/lib/CrsfProtocol/CRSFEndpoint.cpp b/src/lib/CrsfProtocol/CRSFEndpoint.cpp
index c6cf682b..f95e3b4a 100644
--- a/src/lib/CrsfProtocol/CRSFEndpoint.cpp
+++ b/src/lib/CrsfProtocol/CRSFEndpoint.cpp
@@ -304,9 +304,9 @@ void CRSFEndpoint::parameterUpdateReq(const crsf_addr_e origin, const bool isElr
     switch (parameterType)
     {
     case CRSF_FRAMETYPE_PARAMETER_WRITE:
-        DBGLN("Set parameter [%s]=%u", parameter->name, parameterArg);
         if (parameterIndex < MAX_CRSF_PARAMETERS && paramCallbacks[parameterIndex])
         {
+            DBGLN("Set parameter [%s]=%u", parameter->name, parameterArg);
             // While the command is executing, the handset will send `WRITE state=lcsQuery`.
             // paramCallbacks will set the value when nextStatusChunk == 0, or send any
             // remaining chunks when nextStatusChunk != 0
diff --git a/src/lib/Handset/CRSFHandset.cpp b/src/lib/Handset/CRSFHandset.cpp
index 2313a6b7..6dbdf269 100644
--- a/src/lib/Handset/CRSFHandset.cpp
+++ b/src/lib/Handset/CRSFHandset.cpp
@@ -585,9 +585,9 @@ bool CRSFHandset::UARTwdt()
             retval = true;
         }
 #ifdef DEBUG_OPENTX_SYNC
-        if (abs((int)((1000000 / (ExpressLRS_currAirRate_Modparams->interval * ExpressLRS_currAirRate_Modparams->numOfSends)) - (int)GoodPktsCount)) > 1)
+        //if (abs((int)((1000000 / (ExpressLRS_currAirRate_Modparams->interval * ExpressLRS_currAirRate_Modparams->numOfSends)) - (int)GoodPktsCount)) > 1)
 #endif
-            DBGLN("UART STATS Bad:Good = %u:%u", BadPktsCount, GoodPktsCount);
+            //DBGLN("UART STATS Bad:Good = %u:%u", BadPktsCount, GoodPktsCount);

         UARTwdtLastChecked = now;
         if (retval)
diff --git a/src/src/tx_main.cpp b/src/src/tx_main.cpp
index 3563d0e8..efd811b2 100644
--- a/src/src/tx_main.cpp
+++ b/src/src/tx_main.cpp
@@ -743,11 +743,13 @@ void ICACHE_RAM_ATTR nonceAdvance()
   }
 }

+uint32_t tockUs;
 /*
  * Called as the TOCK timer ISR when there is a CRSF connection from the handset
  */
 void ICACHE_RAM_ATTR timerCallback()
 {
+  tockUs = micros();
   /* If we are busy writing to EEPROM (committing config changes) then we just advance the nonces, i.e. no SPI traffic
*/
   if (commitInProgress)
   {
@@ -927,6 +929,13 @@ bool ICACHE_RAM_ATTR RXdoneISR(SX12xxDriverCommon::rx_status const status)
   {
     return false; // Already received tlm, do not run ProcessTLMpacket() again.
   }
+  static uint32_t last;
+  uint32_t now = millis();
+  if (now - last >= 10)
+  {
+    DBGLN("REM %u LQ %u", (ExpressLRS_currAirRate_Modparams->interval - (micros() - tockUs)) * 100 / ExpressLRS_currAirRate_Modparams->interval, LQCalc.getLQ());
+    last = now;
+  }

   bool packetSuccessful = ProcessTLMpacket(status);
 #if defined(Regulatory_Domain_EU_CE_2400)

@CapnBry
Copy link
Member Author

CapnBry commented Sep 23, 2025

Bit of a problem. Both Mickey and I experience different issues with this and LBT in long term testing, so hold off on this until we can determine where the problem lies.

He experiences telemetry drop off that stays dropped off and may not recover, while the uplink remains steady. I experienced 3x instances of the receiver losing connection for ~2.6s as if the TX stopped transmission. Further testing and logging is underway.

@CapnBry
Copy link
Member Author

CapnBry commented Oct 8, 2025

Two weeks of testing later, the issue was determined to be present in the RF drivers in master and 3.x.x-maint, so I'd like to move forward on this PR. Changes to resolve the connection loss and jitter forthcoming from my dedicated peers.

@CapnBry CapnBry merged commit eb376dc into ExpressLRS:master Oct 8, 2025
25 checks passed
CapnBry added a commit to CapnBry/ExpressLRS that referenced this pull request Oct 9, 2025
* initial commit

* Adjust RX TOCK to later in period for low duty cycle rates

* Move all FHSS to TOCK only

* Extend TXs telemetry timeout for short TOA packets

* Remove incorrect comment

---------

Co-authored-by: mha1
Co-authored-by: pkendall64
@CapnBry CapnBry deleted the Exp_Desync_C branch October 25, 2025 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ExpressLRS partial desync and inability to reestablish connection

4 participants