Fix spiLock deadlocks / frequent watchdog restarts in WarmNodeStore::save()#10809
Conversation
⚡ Try this PR in the Web FlasherWarning This is an automated, unreviewed CI test build. Back up your device configuration Supported boards built by this PR (25)
Build artifacts expire on 2026-07-28. Updated for |
|
Confirmed on hardware — this fixes the RP2350/W5500 reboot-loop tracked in #10746. Your Test setup
Result — survives, with
|
|
Works for me here -- now getting multi-hour uptime on a Heltec V4, when previously it would reboot in < 20 minutes. |
There was a problem hiding this comment.
Pull request overview
This PR fixes watchdog-triggered hangs on FSCom platforms by removing a non-recursive spiLock re-entrancy deadlock caused by calling SafeFile while already holding spiLock. It narrows explicit spiLock usage to only the filesystem operations that aren’t already internally locked by SafeFile.
Changes:
- Update
WarmNodeStore::save()to avoid holdingspiLockacrossSafeFileconstruction, while still lockingmkdirandwritecalls. - Apply the same deadlock fix pattern to
MessageStore::clearAllMessages().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/MessageStore.cpp | Avoids nested spiLock acquisition by moving locking to just the write call (constructor/close already lock internally). |
| src/mesh/WarmNodeStore.cpp | Prevents spiLock re-entrancy deadlock during warm store snapshot persistence by not holding the lock during SafeFile construction. |
Firmware Size Report22 targets | vs
Show 17 more target(s)
Updated for 8c32f8d |
|
I'm trying to work out when these additional lock guards were added. The automated tools were very aggressive in demanding that spilock was guarded, but it hadn't occurred to me that it was being done twice in some places. I'll see if I can recover one of my older histories and work it out. |
|
@compumike I've spent a bit of time looking at this, and I bow to your superior knowledge. Is it worth adding some commentary to SafeFile to note the fact that there's different levels of locking needed? |
Yes, it could be worth adding a comment inside I don't have enough historical context to know why |
…ODE_COUNT=0) Upstream meshtastic#10809 fixes the real root cause of the full-DB-save watchdog reboot: a nested spiLock deadlock in WarmNodeStore::save() (SafeFile's ctor/close re-acquire the non-recursive spiLock already held by the LockGuard). With that merged into develop, the warm tier can run again on RP2350 via the ARCH_RP2040 branch (=150, meshtastic#10759). Drop the WARM_NODE_COUNT=0 override from both W5500 variants so they inherit the upstream fix instead of disabling the warm tier. Validated on-hardware (wiznet_5500_evb_pico2_e22p @ .236): ~18h soak with the warm tier active, 208 eviction-at-full (120 nodes) events and 44 warm.dat writes (150 warm nodes), zero watchdog reboots. Build SUCCESS (2.8.0.b5c2a81). See firmware#10809, firmware#10746. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes frequent watchdog restarts on platforms with
FSCom.I was seeing frequent (every few minutes to tens of minutes) hangs and then watchdog reboots. I tracked it down to
WarmNodeStore::save()which holdsspiLockwhile callingSafeFile f(filename.c_str(), false);. The constructor forSafeFileitself tries to grabspiLock, which is already held, so this causes a deadlock, until the watchdog timer kicks in and reboots.This PR makes it so
WarmNodeStore::save()adds locking only around the calls that aren't already covered by anspiLock: themkdirand thewritecalls.(Also found and fixed similar in
MessageStore::clearAllMessages, but that's not very commonly used.)It appears that this logic was introduced about two weeks ago in https://github.com/meshtastic/firmware/pull/10705/changes#diff-fd0d79c004a2b4611b3eee806f16eef5d803bebd2e1a13f7b891e6b4bc312b2dR523-R534 @NomDeTom @thebentern
My logs looked like:
🤝 Attestations
Testing is in progress; would appreciate other testers!