fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799) #31130
|
Closely related to #31158 (which I just opened), but the failure modes are not identical:
Same underlying class (gateway-side WAL FD lifecycle in a long-running multi-threaded process with concurrent multi-process writers), different specific manifestation. I have a deterministic reproduction of the #31158 wedge (4 verifier-subprocess completions on a single board → wedge; restart resolves; reproduces on next run). Happy to apply this PR's patch locally and run my repro against it to either confirm or rule out that #31130 also addresses the deleted-inode-FD class, if that's useful data for moving this PR forward. Currently I have a separate local workaround on #31158 (switched the kanban DB to |
|
Thanks for the detailed write-up @valhir1. The inode-deletion wedge is an interesting failure mode -- I hadn't considered the 'last connection cleanup' unlinking -wal/-shm while gateway threads are between connect() calls. Yes, please do apply the patch and run your repro against it. The PR branch is fix/30799-kanban-fd-leak-wal-checkpoint -- you can grab it with: git fetch origin fix/30799-kanban-fd-leak-wal-checkpoint The key change is in tools/kanban_db.py: the _WalSafeConnection wrapper calls PRAGMA wal_checkpoint(TRUNCATE) before each close(). If it resolves your wedge, that's useful signal that the WAL consolidation theory holds across both failure modes. If not, your journal_mode=DELETE workaround is pragmatic and you should ship it -- WAL's concurrent-reader benefit is marginal for the low-frequency kanban DB. Either way, posting your findings on #31158 would help the maintainers understand the full scope of the WAL lifecycle issue. |
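For readers following the journal_mode=DELETE workaround mentioned above, here is a minimal sketch of what switching a database out of WAL mode could look like. It is not taken from #31158 or from this PR: the helper name `open_with_rollback_journal` and the demo file name are hypothetical, and it assumes no other connection holds the database open at the moment of the switch.

```python
import os
import sqlite3
import tempfile


def open_with_rollback_journal(path: str) -> sqlite3.Connection:
    """Hypothetical helper: open a database and switch it to the DELETE journal."""
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode actually in effect afterwards.
    # Leaving WAL can fail to take effect while other connections are open,
    # so the result is checked instead of assumed.
    mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    if mode.lower() != "delete":
        conn.close()
        raise RuntimeError(f"journal mode is still {mode!r}")
    return conn


def demo() -> str:
    """Create a WAL database, reopen it in DELETE mode, and report the mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kanban-demo.db")  # hypothetical file name
        setup = sqlite3.connect(path)
        setup.execute("PRAGMA journal_mode=WAL")
        setup.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        setup.commit()
        setup.close()

        conn = open_with_rollback_journal(path)
        mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        conn.close()
        return mode
```

The journal mode is stored in the database file, so once the switch succeeds, later connections open the database in rollback-journal mode without -wal/-shm files.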
NousResearch#30799) Long-running gateway processes open a new kanban SQLite connection every dispatcher tick. In WAL mode SQLite defers cleanup of WAL/shm file descriptors, causing a slow FD leak that eventually hits the process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs after 14 hours) and triggers cascading failures. Fix: add _WalSafeConnection, a sqlite3.Connection subclass that runs PRAGMA wal_checkpoint(TRUNCATE) before each close. This forces SQLite to consolidate the WAL and release its file descriptors immediately. All existing callers get this behaviour automatically since connect() uses the subclass via the factory parameter.
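The approach described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in tools/kanban_db.py: the class name `CheckpointOnCloseConnection`, the `connect` helper, and the demo table are hypothetical, and the checkpoint is treated as best-effort on the assumption that it may fail (for example on an already-closed connection or while a transaction is open).

```python
import os
import sqlite3
import tempfile


class CheckpointOnCloseConnection(sqlite3.Connection):
    """Hypothetical sqlite3.Connection subclass that checkpoints before closing."""

    def close(self) -> None:
        try:
            # Move WAL contents into the main database file and truncate the WAL.
            self.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
        except sqlite3.Error:
            # Best-effort: never let a failed checkpoint prevent the close.
            pass
        super().close()


def connect(path: str) -> sqlite3.Connection:
    """Open a WAL-mode connection using the subclass via the factory parameter."""
    conn = sqlite3.connect(path, factory=CheckpointOnCloseConnection)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def wal_size_after_close() -> int:
    """Write one row, close, and return the size of the -wal file (0 if absent)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kanban-demo.db")  # hypothetical file name
        conn = connect(path)
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute("INSERT INTO tasks (title) VALUES ('demo')")
        conn.commit()
        conn.close()
        wal = path + "-wal"
        return os.path.getsize(wal) if os.path.exists(wal) else 0
```

Because the subclass is wired in through `factory`, callers keep calling `connect()` and `close()` as before; only the close path changes.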
fd2c5e4 to aef989d
Problem
Long-running gateway processes open a new kanban SQLite connection every dispatcher tick. In WAL mode SQLite defers cleanup of WAL/shm file descriptors, causing a slow FD leak that eventually hits the process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs after 14 hours) and triggers cascading failures -- too many open files, unable to open database file.
Fix
Added _WalSafeConnection, a sqlite3.Connection subclass that runs PRAGMA wal_checkpoint(TRUNCATE) before each close(). This forces SQLite to consolidate the WAL and release its file descriptors immediately. connect() uses the subclass via the factory parameter so all existing callers get the behaviour automatically -- no API changes needed.
Testing
tests/hermes_cli/test_kanban_cli.py: 46 passed (no regressions)
Closes #30799
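FD counts like the ones quoted in the problem description can be gathered with a small diagnostic along these lines. This is a sketch under assumptions, not the method the author used: it relies on the Linux-specific /proc/<pid>/fd directory (returning an empty result elsewhere), and the function names `open_fd_targets` and `kanban_fd_summary` are hypothetical.

```python
import os
from collections import Counter


def open_fd_targets(pid: str = "self") -> Counter:
    """Count open file descriptors of a process, grouped by target path (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts: Counter = Counter()
    if not os.path.isdir(fd_dir):
        return counts  # no procfs here (non-Linux platform); nothing to report
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD was closed between listdir() and readlink()
        counts[target] += 1
    return counts


def kanban_fd_summary(counts: Counter, needle: str = "kanban.db") -> dict:
    """Keep only targets whose path contains the needle (matches -wal and -shm too)."""
    return {target: n for target, n in counts.items() if needle in target}
```

Sampling `kanban_fd_summary(open_fd_targets())` periodically from inside the gateway process, before and after the patch, would show whether the per-file counts keep growing.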