Fix `spiLock` deadlocks / frequent watchdog restarts in `WarmNodeStore::save()` by compumike · Pull Request #10809 · meshtastic/firmware

compumike · 2026-06-28T21:53:22Z

Fixes frequent watchdog restarts on platforms with FSCom.

I was seeing frequent (every few minutes to tens of minutes) hangs and then watchdog reboots. I tracked it down to WarmNodeStore::save() which holds spiLock while calling SafeFile f(filename.c_str(), false);. The constructor for SafeFile itself tries to grab spiLock, which is already held, so this causes a deadlock, until the watchdog timer kicks in and reboots.

This PR makes it so WarmNodeStore::save() adds locking only around the calls that aren't already covered by an spiLock: the mkdir and the write calls.

(Also found and fixed similar in MessageStore::clearAllMessages, but that's not very commonly used.)

It appears that this logic was introduced about two weeks ago in https://github.com/meshtastic/firmware/pull/10705/changes#diff-fd0d79c004a2b4611b3eee806f16eef5d803bebd2e1a13f7b891e6b4bc312b2dR523-R534 @NomDeTom @thebentern

My logs looked like:

DEBUG | 16:07:05 264 [Router] Saved Pubkey:  ...
DEBUG | 16:07:05 264 [Router] Update changed=1 user Redacted/RDCT, id=0x12345678, channel=1
DEBUG | 16:07:06 264 [Router] Node status update: 78 online, 250 total
DEBUG | 16:07:06 264 [Router] Save to disk 16
DEBUG | 16:07:06 264 [Router] Opening /prefs/nodes.proto, fullAtomic=0
INFO  | 16:07:06 264 [Router] Save /prefs/nodes.proto
E (355189) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (355189) task_wdt:  - loopTask (CPU 1)
E (355190) task_wdt: Tasks currently running:
E (355190) task_wdt: CPU 0: IDLE0
E (355190) task_wdt: CPU 1: IDLE1
E (355190) task_wdt: Aborting.
E (355210) task_wdt: Print CPU 1 (current core) backtrace
...

🤝 Attestations

Testing is in progress; would appreciate other testers!

…save()

github-actions · 2026-06-28T21:53:31Z

⚡ Try this PR in the Web Flasher

Warning

This is an automated, unreviewed CI test build. Back up your device configuration
before flashing, and only flash devices you are able to recover.

Supported boards built by this PR (25)

Device	Board	Platform
Crowpanel Adv 3.5 TFT	`elecrow-adv-35-tft`	esp32-s3
Heltec HT62	`heltec-ht62-esp32c3-sx1262`	esp32-c3
Heltec Mesh Node 096	`heltec-mesh-node-t096`	nrf52840
Heltec Mesh Node T1	`heltec-mesh-node-t1`	nrf52840
Heltec Mesh Node T114	`heltec-mesh-node-t114`	nrf52840
Heltec V3	`heltec-v3`	esp32-s3
Heltec V4	`heltec-v4`	esp32-s3
Raspberry Pi Pico	`pico`	rp2040
Raspberry Pi Pico W	`picow`	rp2040
RAK WisMesh Tag	`rak_wismeshtag`	nrf52840
RAK WisBlock 11200	`rak11200`	esp32
RAK WisBlock 11310	`rak11310`	rp2040
RAK3312	`rak3312`	esp32-s3
RAK WisBlock 4631	`rak4631`	nrf52840
Seeed Wio Tracker L1	`seeed_wio_tracker_L1`	nrf52840
Seeed Xiao NRF52840 Kit	`seeed_xiao_nrf52840_kit`	nrf52840
Seeed Xiao ESP32-S3	`seeed-xiao-s3`	esp32-s3
Station G2	`station-g2`	esp32-s3
Station G3	`station-g3`	esp32-s3
LILYGO T-Deck	`t-deck-tft`	esp32-s3
LILYGO T-Echo	`t-echo`	nrf52840
LILYGO T-Echo Plus	`t-echo-plus`	nrf52840
LILYGO T-Impulse Plus	`t-impulse-plus`	nrf52840
LilyGo T3-C6	`tlora-c6`	esp32-c6
Seeed SenseCAP T1000-E	`tracker-t1000-e`	nrf52840

Build artifacts expire on 2026-07-28. Updated for c42dc2b.

cvaldess · 2026-06-28T23:26:25Z

Confirmed on hardware — this fixes the RP2350/W5500 reboot-loop tracked in #10746.

Your spiLock deadlock diagnosis matches every symptom we saw there. WarmNodeStore::save() takes LockGuard g(spiLock) and then constructs SafeFile(warmFileName, false), whose constructor calls openFile() which grabs spiLock again — non-recursive, so it hangs until the watchdog fires. That explains why our captures cut off right after Save /prefs/nodes.proto with no warm.dat line (the deadlock happens at the LockGuard before the Opening … LOG_DEBUG), why the post-reboot filesystem had nodes.proto but never warm.dat, and why WARM_NODE_COUNT=0 avoided it while #10759's watchdog_update() (=150) did not.

Test setup

Board: WIZnet W5500-EVB-Pico2 (RP2350, 2 MB) — the exact board from NodeDB warmstore (#10705) causes hard reboot on RP2350/W5500 at full-DB nodes.proto save #10746
develop + this PR's diff, warm store active (override removed, so it takes the ARCH_RP2040 branch → WARM_NODE_COUNT = 150)
Forced the failing path: hot store full at 120, eviction triggered by an incoming NodeInfo (portnum 4)

Result — survives, with `warm.dat` written

[Router] Node database full with 120 nodes ... Erasing oldest entry
[Router] Save /prefs/nodes.proto
[Router] Opening /prefs/warm.dat, fullAtomic=0
[Router] WarmStore: saved 8 warm nodes to /prefs/warm.dat

The Opening /prefs/warm.dat debug line — which never appeared in the broken build because the deadlock hit the LockGuard ahead of it — now shows, the warm write completes, and the board rides through repeated full-DB evictions with rebootCount = 0, no watchdog resets, and monotonic uptime. Previously this same sequence reboot-looped every ~60–100 s.

Thanks for tracking this down — happy to keep it on a longer soak if useful.

…ode-store

compumike · 2026-06-29T00:17:04Z

Works for me here -- now getting multi-hour uptime on a Heltec V4, when previously it would reboot in < 20 minutes.

Copilot

Pull request overview

This PR fixes watchdog-triggered hangs on FSCom platforms by removing a non-recursive spiLock re-entrancy deadlock caused by calling SafeFile while already holding spiLock. It narrows explicit spiLock usage to only the filesystem operations that aren’t already internally locked by SafeFile.

Changes:

Update WarmNodeStore::save() to avoid holding spiLock across SafeFile construction, while still locking mkdir and write calls.
Apply the same deadlock fix pattern to MessageStore::clearAllMessages().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
src/MessageStore.cpp	Avoids nested `spiLock` acquisition by moving locking to just the `write` call (constructor/close already lock internally).
src/mesh/WarmNodeStore.cpp	Prevents `spiLock` re-entrancy deadlock during warm store snapshot persistence by not holding the lock during `SafeFile` construction.

github-actions · 2026-06-29T07:05:10Z

Firmware Size Report

22 targets | vs develop: 22 increased, net +11,072 (+10.8 KB)

Target	Size	vs `develop`
`heltec-vision-master-e213-inkhud`	2,230,240	📈 +2,880 (+2.8 KB)
`station-g3`	2,266,848	📈 +496
`rak11200`	1,861,088	📈 +480
`station-g2`	2,266,832	📈 +480
`t-deck-tft`	3,811,152	📈 +480

Show 17 more target(s)

Target	Size	vs `develop`
`heltec-v3`	2,264,544	📈 +464
`heltec-v4`	2,277,760	📈 +464
`rak3312`	2,272,704	📈 +464
`heltec-ht62-esp32c3-sx1262`	2,135,440	📈 +448
`seeed-xiao-s3`	2,276,592	📈 +448
`tlora-c6`	2,368,752	📈 +448
`t-eth-elite`	2,491,920	📈 +432
`elecrow-adv-35-tft`	3,417,040	📈 +416
`pico2w`	1,220,924	📈 +408
`picow`	1,245,240	📈 +360
`pico2`	770,128	📈 +352
`seeed_xiao_rp2350`	768,288	📈 +352
`pico`	783,008	📈 +320
`rak11310`	805,744	📈 +320
`seeed_xiao_rp2040`	781,224	📈 +320
`wio-e5`	238,756	📈 +128
`rak3172`	186,408	📈 +112

Updated for 8c32f8d

NomDeTom · 2026-06-29T07:09:05Z

I'm trying to work out when these additional lock guards were added. The automated tools were very aggressive in demanding that spilock was guarded, but it hadn't occurred to me that it was being done twice in some places.

I'll see if I can recover one of my older histories and work it out.

NomDeTom · 2026-06-29T09:05:54Z

@compumike I've spent a bit of time looking at this, and I bow to your superior knowledge.

Is it worth adding some commentary to SafeFile to note the fact that there's different levels of locking needed?

compumike · 2026-06-29T12:55:50Z

Is it worth adding some commentary to SafeFile to note the fact that there's different levels of locking needed?

Yes, it could be worth adding a comment inside SafeFile, similar to the ones I added in this PR. Feel free to do so if you'd like!

I don't have enough historical context to know why SafeFile was implemented the way it was, or what the other callers do! It seems "cleaner" to either have all SafeFile methods do the locking, or none (and have the caller always be responsible for that), but that's a bigger investigation and refactor.

…ODE_COUNT=0) Upstream meshtastic#10809 fixes the real root cause of the full-DB-save watchdog reboot: a nested spiLock deadlock in WarmNodeStore::save() (SafeFile's ctor/close re-acquire the non-recursive spiLock already held by the LockGuard). With that merged into develop, the warm tier can run again on RP2350 via the ARCH_RP2040 branch (=150, meshtastic#10759). Drop the WARM_NODE_COUNT=0 override from both W5500 variants so they inherit the upstream fix instead of disabling the warm tier. Validated on-hardware (wiznet_5500_evb_pico2_e22p @ .236): ~18h soak with the warm tier active, 208 eviction-at-full (120 nodes) events and 44 warm.dat writes (150 warm nodes), zero watchdog reboots. Build SUCCESS (2.8.0.b5c2a81). See firmware#10809, firmware#10746. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fix spiLock deadlocks / frequent watchdog restarts in WarmNodeStore::…

672996c

…save()

github-actions Bot added the bugfix Pull request that fixes bugs label Jun 28, 2026

compumike mentioned this pull request Jun 28, 2026

NodeDB warmstore (#10705) causes hard reboot on RP2350/W5500 at full-DB nodes.proto save #10746

Open

thebentern added 2 commits June 28, 2026 18:47

Merge branch 'develop' into compumike/fix-deadlock-restarts-in-warm-n…

702ec1c

…ode-store

Merge branch 'develop' into compumike/fix-deadlock-restarts-in-warm-n…

c42dc2b

…ode-store

thebentern requested a review from Copilot June 29, 2026 01:30

Copilot started reviewing on behalf of thebentern June 29, 2026 01:31 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

thebentern merged commit b44ed45 into meshtastic:develop Jun 29, 2026
91 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix `spiLock` deadlocks / frequent watchdog restarts in `WarmNodeStore::save()`#10809

Fix `spiLock` deadlocks / frequent watchdog restarts in `WarmNodeStore::save()`#10809
thebentern merged 3 commits into
meshtastic:developfrom
compumike:compumike/fix-deadlock-restarts-in-warm-node-store

compumike commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

cvaldess commented Jun 28, 2026

Uh oh!

compumike commented Jun 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented Jun 29, 2026

Uh oh!

NomDeTom commented Jun 29, 2026

Uh oh!

NomDeTom commented Jun 29, 2026

Uh oh!

Uh oh!

compumike commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Uh oh!

Conversation

compumike commented Jun 28, 2026

🤝 Attestations

Uh oh!

github-actions Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚡ Try this PR in the Web Flasher

Uh oh!

cvaldess commented Jun 28, 2026

Test setup

Result — survives, with warm.dat written

Uh oh!

compumike commented Jun 29, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

github-actions Bot commented Jun 29, 2026

Firmware Size Report

Uh oh!

NomDeTom commented Jun 29, 2026

Uh oh!

NomDeTom commented Jun 29, 2026

Uh oh!

Uh oh!

compumike commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

github-actions Bot commented Jun 28, 2026 •

edited

Loading

Result — survives, with `warm.dat` written