Skip to content

Fix spiLock deadlocks / frequent watchdog restarts in WarmNodeStore::save()#10809

Merged
thebentern merged 3 commits into
meshtastic:developfrom
compumike:compumike/fix-deadlock-restarts-in-warm-node-store
Jun 29, 2026
Merged

Fix spiLock deadlocks / frequent watchdog restarts in WarmNodeStore::save()#10809
thebentern merged 3 commits into
meshtastic:developfrom
compumike:compumike/fix-deadlock-restarts-in-warm-node-store

Conversation

@compumike

Copy link
Copy Markdown
Contributor

Fixes frequent watchdog restarts on platforms with FSCom.

I was seeing frequent (every few minutes to tens of minutes) hangs and then watchdog reboots. I tracked it down to WarmNodeStore::save() which holds spiLock while calling SafeFile f(filename.c_str(), false);. The constructor for SafeFile itself tries to grab spiLock, which is already held, so this causes a deadlock, until the watchdog timer kicks in and reboots.

This PR makes it so WarmNodeStore::save() adds locking only around the calls that aren't already covered by an spiLock: the mkdir and the write calls.

(Also found and fixed similar in MessageStore::clearAllMessages, but that's not very commonly used.)

It appears that this logic was introduced about two weeks ago in https://github.com/meshtastic/firmware/pull/10705/changes#diff-fd0d79c004a2b4611b3eee806f16eef5d803bebd2e1a13f7b891e6b4bc312b2dR523-R534 @NomDeTom @thebentern

My logs looked like:

DEBUG | 16:07:05 264 [Router] Saved Pubkey:  ...
DEBUG | 16:07:05 264 [Router] Update changed=1 user Redacted/RDCT, id=0x12345678, channel=1
DEBUG | 16:07:06 264 [Router] Node status update: 78 online, 250 total
DEBUG | 16:07:06 264 [Router] Save to disk 16
DEBUG | 16:07:06 264 [Router] Opening /prefs/nodes.proto, fullAtomic=0
INFO  | 16:07:06 264 [Router] Save /prefs/nodes.proto
E (355189) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (355189) task_wdt:  - loopTask (CPU 1)
E (355190) task_wdt: Tasks currently running:
E (355190) task_wdt: CPU 0: IDLE0
E (355190) task_wdt: CPU 1: IDLE1
E (355190) task_wdt: Aborting.
E (355210) task_wdt: Print CPU 1 (current core) backtrace
...

🤝 Attestations

Testing is in progress; would appreciate other testers!

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

⚡ Try this PR in the Web Flasher

Flash this PR in the Web Flasher

firmware commit boards expires

Warning

This is an automated, unreviewed CI test build. Back up your device configuration
before flashing, and only flash devices you are able to recover.

Supported boards built by this PR (25)
Device Board Platform
Crowpanel Adv 3.5 TFT elecrow-adv-35-tft esp32-s3
Heltec HT62 heltec-ht62-esp32c3-sx1262 esp32-c3
Heltec Mesh Node 096 heltec-mesh-node-t096 nrf52840
Heltec Mesh Node T1 heltec-mesh-node-t1 nrf52840
Heltec Mesh Node T114 heltec-mesh-node-t114 nrf52840
Heltec V3 heltec-v3 esp32-s3
Heltec V4 heltec-v4 esp32-s3
Raspberry Pi Pico pico rp2040
Raspberry Pi Pico W picow rp2040
RAK WisMesh Tag rak_wismeshtag nrf52840
RAK WisBlock 11200 rak11200 esp32
RAK WisBlock 11310 rak11310 rp2040
RAK3312 rak3312 esp32-s3
RAK WisBlock 4631 rak4631 nrf52840
Seeed Wio Tracker L1 seeed_wio_tracker_L1 nrf52840
Seeed Xiao NRF52840 Kit seeed_xiao_nrf52840_kit nrf52840
Seeed Xiao ESP32-S3 seeed-xiao-s3 esp32-s3
Station G2 station-g2 esp32-s3
Station G3 station-g3 esp32-s3
LILYGO T-Deck t-deck-tft esp32-s3
LILYGO T-Echo t-echo nrf52840
LILYGO T-Echo Plus t-echo-plus nrf52840
LILYGO T-Impulse Plus t-impulse-plus nrf52840
LilyGo T3-C6 tlora-c6 esp32-c6
Seeed SenseCAP T1000-E tracker-t1000-e nrf52840

Build artifacts expire on 2026-07-28. Updated for c42dc2b.

@cvaldess

Copy link
Copy Markdown
Contributor

Confirmed on hardware — this fixes the RP2350/W5500 reboot-loop tracked in #10746.

Your spiLock deadlock diagnosis matches every symptom we saw there. WarmNodeStore::save() takes LockGuard g(spiLock) and then constructs SafeFile(warmFileName, false), whose constructor calls openFile() which grabs spiLock again — non-recursive, so it hangs until the watchdog fires. That explains why our captures cut off right after Save /prefs/nodes.proto with no warm.dat line (the deadlock happens at the LockGuard before the Opening … LOG_DEBUG), why the post-reboot filesystem had nodes.proto but never warm.dat, and why WARM_NODE_COUNT=0 avoided it while #10759's watchdog_update() (=150) did not.

Test setup

Result — survives, with warm.dat written

[Router] Node database full with 120 nodes ... Erasing oldest entry
[Router] Save /prefs/nodes.proto
[Router] Opening /prefs/warm.dat, fullAtomic=0
[Router] WarmStore: saved 8 warm nodes to /prefs/warm.dat

The Opening /prefs/warm.dat debug line — which never appeared in the broken build because the deadlock hit the LockGuard ahead of it — now shows, the warm write completes, and the board rides through repeated full-DB evictions with rebootCount = 0, no watchdog resets, and monotonic uptime. Previously this same sequence reboot-looped every ~60–100 s.

Thanks for tracking this down — happy to keep it on a longer soak if useful.

@compumike

Copy link
Copy Markdown
Contributor Author

Works for me here -- now getting multi-hour uptime on a Heltec V4, when previously it would reboot in < 20 minutes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes watchdog-triggered hangs on FSCom platforms by removing a non-recursive spiLock re-entrancy deadlock caused by calling SafeFile while already holding spiLock. It narrows explicit spiLock usage to only the filesystem operations that aren’t already internally locked by SafeFile.

Changes:

  • Update WarmNodeStore::save() to avoid holding spiLock across SafeFile construction, while still locking mkdir and write calls.
  • Apply the same deadlock fix pattern to MessageStore::clearAllMessages().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/MessageStore.cpp Avoids nested spiLock acquisition by moving locking to just the write call (constructor/close already lock internally).
src/mesh/WarmNodeStore.cpp Prevents spiLock re-entrancy deadlock during warm store snapshot persistence by not holding the lock during SafeFile construction.

@github-actions

Copy link
Copy Markdown
Contributor

Firmware Size Report

22 targets | vs develop: 22 increased, net +11,072 (+10.8 KB)

Target Size vs develop
heltec-vision-master-e213-inkhud 2,230,240 📈 +2,880 (+2.8 KB)
station-g3 2,266,848 📈 +496
rak11200 1,861,088 📈 +480
station-g2 2,266,832 📈 +480
t-deck-tft 3,811,152 📈 +480
Show 17 more target(s)
Target Size vs develop
heltec-v3 2,264,544 📈 +464
heltec-v4 2,277,760 📈 +464
rak3312 2,272,704 📈 +464
heltec-ht62-esp32c3-sx1262 2,135,440 📈 +448
seeed-xiao-s3 2,276,592 📈 +448
tlora-c6 2,368,752 📈 +448
t-eth-elite 2,491,920 📈 +432
elecrow-adv-35-tft 3,417,040 📈 +416
pico2w 1,220,924 📈 +408
picow 1,245,240 📈 +360
pico2 770,128 📈 +352
seeed_xiao_rp2350 768,288 📈 +352
pico 783,008 📈 +320
rak11310 805,744 📈 +320
seeed_xiao_rp2040 781,224 📈 +320
wio-e5 238,756 📈 +128
rak3172 186,408 📈 +112

Updated for 8c32f8d

@NomDeTom

Copy link
Copy Markdown
Collaborator

I'm trying to work out when these additional lock guards were added. The automated tools were very aggressive in demanding that spilock was guarded, but it hadn't occurred to me that it was being done twice in some places.

I'll see if I can recover one of my older histories and work it out.

@NomDeTom

Copy link
Copy Markdown
Collaborator

@compumike I've spent a bit of time looking at this, and I bow to your superior knowledge.

Is it worth adding some commentary to SafeFile to note the fact that there's different levels of locking needed?

@thebentern thebentern merged commit b44ed45 into meshtastic:develop Jun 29, 2026
91 checks passed
@compumike

Copy link
Copy Markdown
Contributor Author

Is it worth adding some commentary to SafeFile to note the fact that there's different levels of locking needed?

Yes, it could be worth adding a comment inside SafeFile, similar to the ones I added in this PR. Feel free to do so if you'd like!

I don't have enough historical context to know why SafeFile was implemented the way it was, or what the other callers do! It seems "cleaner" to either have all SafeFile methods do the locking, or none (and have the caller always be responsible for that), but that's a bigger investigation and refactor.

cvaldess added a commit to cvaldess/firmware that referenced this pull request Jun 29, 2026
…ODE_COUNT=0)

Upstream meshtastic#10809 fixes the real root cause of the full-DB-save watchdog
reboot: a nested spiLock deadlock in WarmNodeStore::save() (SafeFile's
ctor/close re-acquire the non-recursive spiLock already held by the
LockGuard). With that merged into develop, the warm tier can run again on
RP2350 via the ARCH_RP2040 branch (=150, meshtastic#10759). Drop the
WARM_NODE_COUNT=0 override from both W5500 variants so they inherit the
upstream fix instead of disabling the warm tier.

Validated on-hardware (wiznet_5500_evb_pico2_e22p @ .236): ~18h soak with
the warm tier active, 208 eviction-at-full (120 nodes) events and 44
warm.dat writes (150 warm nodes), zero watchdog reboots. Build SUCCESS
(2.8.0.b5c2a81). See firmware#10809, firmware#10746.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants