fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799) #31130

Open
mohamedorigami-jpg wants to merge 1 commit into NousResearch:main from mohamedorigami-jpg:fix/30799-kanban-fd-leak-wal-checkpoint

@mohamedorigami-jpg

Contributor

Problem

Long-running gateway processes open a new kanban SQLite connection every dispatcher tick. In WAL mode SQLite defers cleanup of WAL/shm file descriptors, causing a slow FD leak that eventually hits the process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs after 14 hours) and triggers cascading failures -- too many open files, unable to open database file.
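
To watch for this kind of leak in a running process, one option is to count the open descriptors whose target mentions the database file. The following is a minimal sketch, assuming a Linux host where /proc/self/fd is available; the helper name count_open_fds is hypothetical and not part of this PR.

```python
import os


def count_open_fds(path_fragment: str) -> int:
    """Count this process's open FDs whose target path contains path_fragment.

    Assumption: Linux-style /proc/self/fd. Returns 0 where it is unavailable.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return 0
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir() and readlink().
            continue
        if path_fragment in target:
            count += 1
    return count
```

Logging count_open_fds("kanban.db") once per dispatcher tick would show whether the number grows across connect/close cycles; the match also catches kanban.db-wal and kanban.db-shm because it is a substring test.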

Fix

Added _WalSafeConnection, a sqlite3.Connection subclass that runs PRAGMA wal_checkpoint(TRUNCATE) before each close(). This forces SQLite to consolidate the WAL and release its file descriptors immediately.

connect() uses the subclass via the factory parameter so all existing callers get the behaviour automatically -- no API changes needed.

Testing

  • tests/hermes_cli/test_kanban_cli.py: 46 passed (no regressions)
  • Integration: WAL/shm files confirmed removed after close
  • Full block/unblock cycle with multi-word reasons verified end-to-end
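
An integration check along these lines could look like the sketch below. It uses plain sqlite3 rather than the project's connect(), and the function name, the tasks table, and the returned dictionary are hypothetical, introduced only for illustration.

```python
import os
import sqlite3


def wal_state_after_cycle(db_path: str) -> dict:
    """Run one write/checkpoint/close cycle and report what is left on disk."""
    wal_path = db_path + "-wal"

    def wal_size() -> int:
        return os.path.getsize(wal_path) if os.path.exists(wal_path) else 0

    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.execute("INSERT INTO tasks (title) VALUES ('probe')")
    conn.commit()

    size_before = wal_size()
    # The pragma returns one row: (busy, log_frames, checkpointed_frames).
    busy, _log, _ckpt = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    size_after = wal_size()
    conn.close()

    # Sidecar files still present after the last connection has closed.
    leftovers = [s for s in ("-wal", "-shm") if os.path.exists(db_path + s)]
    return {
        "busy": busy,
        "wal_bytes_before": size_before,
        "wal_bytes_after": size_after,
        "leftover_sidecars": leftovers,
    }
```

A busy value of 0 means the checkpoint was not blocked by another connection, and wal_bytes_after shows whether the WAL was actually truncated. An empty leftover_sidecars list corresponds to the "WAL/shm files removed after close" observation.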

Closes #30799

@alt-glitch added labels on May 23, 2026: type/bug (Something isn't working), comp/cli (CLI entry point, hermes_cli/, setup wizard), P3 (Low: cosmetic, nice to have)
@valhir1

valhir1 commented May 23, 2026


Closely related to #31158 (which I just opened), but the failure modes are not identical:

  • This PR's symptom: WAL FDs accumulate over time → eventual "too many open files" / "unable to open database file"
  • My #31158 (kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning) symptom: WAL/SHM FDs in the long-running gateway end up referencing deleted inodes. A verifier subprocess's connection close triggers SQLite's "last connection cleanup" while gateway threads are momentarily between connect() calls, unlinking -wal/-shm; the gateway's existing mmaps then point at the deleted inode. This surfaces as sqlite3.OperationalError: disk I/O error on SELECTs in release_stale_claims.

Same underlying class (gateway-side WAL FD lifecycle in a long-running multi-threaded process with concurrent multi-process writers), different specific manifestation.
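
One way to tell the two manifestations apart on a live gateway is to list descriptors whose target has been unlinked. This is a minimal sketch, assuming Linux, where /proc/self/fd marks unlinked targets with a " (deleted)" suffix; the helper name is hypothetical.

```python
import os


def deleted_fd_targets(path_fragment: str) -> list:
    """Return targets of this process's FDs that point at unlinked files.

    Assumption: Linux-style /proc/self/fd. Returns [] where it is unavailable.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return []
    hits = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD went away while we were iterating
        if target.endswith(" (deleted)") and path_fragment in target:
            hits.append(target)
    return hits
```

A steadily growing FD count with no deleted targets would match the leak described in this PR, while any hit from deleted_fd_targets("kanban.db") would point toward the deleted-inode wedge.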

The _WalSafeConnection approach in this PR (PRAGMA wal_checkpoint(TRUNCATE) before each close()) might also resolve my wedge: the checkpoint should force WAL consolidation so a closing subprocess's "last connection cleanup" finds nothing to unlink, leaving the gateway's other connections with valid (not deleted) mmap targets. But I haven't validated that, and it's an inference from SQLite's documented per-process WAL state semantics, not a measurement.

I have a deterministic reproduction of the #31158 wedge (4 verifier-subprocess completions on a single board → wedge; restart resolves; reproduces on next run). Happy to apply this PR's patch locally and run my repro against it to either confirm or rule out that #31130 also addresses the deleted-inode-FD class, if that's useful data for moving this PR forward.

Currently I have a separate local workaround on #31158 (switched the kanban DB to journal_mode=DELETE) — pragmatic, removes the failure class by construction, but trades away WAL's concurrent-reader benefits (acceptable for the small, low-frequency kanban DB in my case, but your approach is more general).
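
For reference, the journal_mode=DELETE workaround amounts to something like the sketch below. The function name and the RuntimeError are assumptions made for illustration; the actual local patch on #31158 may differ.

```python
import sqlite3


def connect_rollback_journal(db_path: str) -> sqlite3.Connection:
    """Open the database with the classic rollback journal instead of WAL."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    if mode.lower() != "delete":
        # Leaving WAL mode needs exclusive access, so the switch can be
        # refused while other connections hold the database open.
        conn.close()
        raise RuntimeError(f"journal_mode is still {mode!r}")
    return conn
```

With the rollback journal there are no -wal or -shm sidecar files, which is why this removes the failure class by construction. The cost is that readers and the writer block each other.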

@mohamedorigami-jpg

Contributor Author

Thanks for the detailed write-up @valhir1. The inode-deletion wedge is an interesting failure mode -- I hadn't considered the 'last connection cleanup' unlinking -wal/-shm while gateway threads are between connect() calls.

Yes, please do apply the patch and run your repro against it. The PR branch is fix/30799-kanban-fd-leak-wal-checkpoint on my fork. Assuming your origin remote points at NousResearch, you can grab it through the PR ref:

git fetch origin pull/31130/head:fix/30799-kanban-fd-leak-wal-checkpoint
git checkout fix/30799-kanban-fd-leak-wal-checkpoint

The key change is in tools/kanban_db.py: the _WalSafeConnection subclass runs PRAGMA wal_checkpoint(TRUNCATE) before each close(). If it resolves your wedge, that's a useful signal that the WAL consolidation theory holds across both failure modes. If not, your journal_mode=DELETE workaround is pragmatic and you should ship it -- WAL's concurrent-reader benefit is marginal for the low-frequency kanban DB.

Either way, posting your findings on #31158 would help the maintainers understand the full scope of the WAL lifecycle issue.

NousResearch#30799)

Long-running gateway processes open a new kanban SQLite connection
every dispatcher tick.  In WAL mode SQLite defers cleanup of WAL/shm
file descriptors, causing a slow FD leak that eventually hits the
process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs
after 14 hours) and triggers cascading failures.

Fix: add _WalSafeConnection, a sqlite3.Connection subclass that runs
PRAGMA wal_checkpoint(TRUNCATE) before each close.  This forces SQLite
to consolidate the WAL and release its file descriptors immediately.
All existing callers get this behaviour automatically since connect()
uses the subclass via the factory parameter.

Development

Successfully merging this pull request may close these issues.

kanban dispatcher FD leak: SQLite connections not releasing file descriptors in WAL mode
