fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799) by mohamedorigami-jpg · Pull Request #31130 · NousResearch/hermes-agent

mohamedorigami-jpg · 2026-05-23T20:23:42Z

Problem

Long-running gateway processes open a new kanban SQLite connection on every dispatcher tick. In WAL mode SQLite defers cleanup of WAL/shm file descriptors, causing a slow FD leak that eventually hits the process limit (observed: ~500 kanban.db + ~500 kanban.db-wal FDs after 14 hours) and triggers cascading failures: "too many open files" and "unable to open database file".

Fix

Added _WalSafeConnection, a sqlite3.Connection subclass that runs PRAGMA wal_checkpoint(TRUNCATE) before each close(). This forces SQLite to consolidate the WAL and release its file descriptors immediately.

connect() uses the subclass via the factory parameter so all existing callers get the behaviour automatically -- no API changes needed.
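A minimal sketch of the approach described above, for readers who don't want to open the diff. The names `WalSafeConnectionSketch` and `connect_sketch` are illustrative, and the error handling is an assumption; the actual code in tools/kanban_db.py may differ.

```python
import sqlite3


class WalSafeConnectionSketch(sqlite3.Connection):
    """Illustrative subclass: checkpoint the WAL before closing."""

    def close(self):
        try:
            # Fold WAL contents back into the main database file and
            # truncate the WAL to zero bytes.
            self.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except sqlite3.Error:
            # Assumption: the checkpoint is best effort. A locked database
            # or an already-closed connection must not block close().
            pass
        super().close()


def connect_sketch(path):
    # The `factory` argument makes sqlite3.connect() return the subclass,
    # so callers keep using the ordinary Connection API.
    conn = sqlite3.connect(path, factory=WalSafeConnectionSketch)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

Because the override lives on the connection class, callers that already call close() need no changes, which matches the "no API changes" point above.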

Testing

tests/hermes_cli/test_kanban_cli.py: 46 passed (no regressions)
Integration: WAL/shm files confirmed removed after close
Full block/unblock cycle with multi-word reasons verified end-to-end
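The integration check above (WAL/shm files removed after close) could look roughly like the following. This is a sketch under assumptions: the helper name and the `tasks` table are hypothetical and are not taken from the test suite.

```python
import os
import sqlite3


def sidecars_after_close(db_path):
    """Write in WAL mode, checkpoint, close, and report leftover sidecars."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.execute("INSERT INTO tasks (title) VALUES ('demo')")
    conn.commit()
    # While the connection is open, the -wal sidecar should exist.
    wal_seen = os.path.exists(db_path + "-wal")
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    conn.close()
    leftover = [s for s in ("-wal", "-shm") if os.path.exists(db_path + s)]
    return wal_seen, leftover
```

One caveat: with default build options SQLite already removes the sidecar files when the last connection to a database closes, so a single-connection check like this one does not by itself exercise the leak seen in the long-running gateway.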

Closes #30799

valhir1 · 2026-05-23T22:34:54Z

Closely related to #31158 (which I just opened), but the failure modes are not identical:

This PR's symptom: WAL FDs accumulate over time → eventual "too many open files" / "unable to open database file"
#31158's symptom: WAL/SHM FDs in the long-running gateway end up referencing deleted inodes. A verifier subprocess's connection close triggers SQLite's "last connection cleanup" while gateway threads are momentarily between connect() calls, which unlinks -wal/-shm; the gateway's existing mmaps then point at the deleted inode. This surfaces as sqlite3.OperationalError: disk I/O error on SELECTs in release_stale_claims.

Same underlying class (gateway-side WAL FD lifecycle in a long-running multi-threaded process with concurrent multi-process writers), different specific manifestation.
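Both symptoms can be observed from inside the affected process. The sketch below is Linux-only (it reads the /proc/self/fd symlinks) and the function name is hypothetical; it counts open FDs whose target mentions kanban.db and flags the ones whose inode has been unlinked.

```python
import os
from collections import Counter


def kanban_fd_summary(needle="kanban.db", fd_dir="/proc/self/fd"):
    """Count this process's open FDs whose target path contains `needle`.

    Returns a Counter keyed by (target, is_deleted). On systems without
    /proc/self/fd the result is empty.
    """
    summary = Counter()
    if not os.path.isdir(fd_dir):
        return summary
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD was closed between listdir() and readlink()
        if needle in target:
            # Linux appends " (deleted)" once the file has been unlinked.
            deleted = target.endswith(" (deleted)")
            summary[(target, deleted)] += 1
    return summary
```

A steadily growing count for kanban.db and kanban.db-wal would point at the FD accumulation described in this PR, while entries with the deleted flag set would point at the #31158 class.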

The _WalSafeConnection approach in this PR — PRAGMA wal_checkpoint(TRUNCATE) before each close() — might also resolve my wedge: the checkpoint should force WAL consolidation so a closing subprocess's "last connection cleanup" finds nothing to unlink, leaving the gateway's other connections with valid (not deleted) mmap targets. But I haven't validated that, and it's an inference from SQLite's documented per-process WAL state semantics — not a measurement.

I have a deterministic reproduction of the #31158 wedge (4 verifier-subprocess completions on a single board → wedge; restart resolves; reproduces on next run). Happy to apply this PR's patch locally and run my repro against it to either confirm or rule out that #31130 also addresses the deleted-inode-FD class, if that's useful data for moving this PR forward.

Currently I have a separate local workaround on #31158 (switched the kanban DB to journal_mode=DELETE) — pragmatic, removes the failure class by construction, but trades away WAL's concurrent-reader benefits (acceptable for the small, low-frequency kanban DB in my case, but your approach is more general).
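For reference, the journal_mode=DELETE workaround amounts to something like the sketch below. The function name is hypothetical, and raising on failure is an assumption about how one might want to handle a refused mode switch.

```python
import sqlite3


def connect_rollback_journal(db_path):
    """Open the database and switch it to the rollback-journal (DELETE) mode."""
    conn = sqlite3.connect(db_path)
    # The PRAGMA returns the journal mode actually in effect as one row.
    mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    if mode.lower() != "delete":
        # Leaving WAL mode needs exclusive access; if another connection
        # holds the database, SQLite keeps the old mode.
        conn.close()
        raise RuntimeError(f"journal_mode is still {mode!r}")
    return conn
```

In DELETE mode there are no -wal/-shm sidecars to unlink, which is why this removes the failure class by construction.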

mohamedorigami-jpg · 2026-05-24T09:29:31Z

Thanks for the detailed write-up @valhir1. The inode-deletion wedge is an interesting failure mode -- I hadn't considered the 'last connection cleanup' unlinking -wal/-shm while gateway threads are between connect() calls.

Yes, please do apply the patch and run your repro against it. The PR branch is fix/30799-kanban-fd-leak-wal-checkpoint -- you can grab it with:

git fetch origin fix/30799-kanban-fd-leak-wal-checkpoint
git checkout fix/30799-kanban-fd-leak-wal-checkpoint

The key change is in tools/kanban_db.py: the _WalSafeConnection wrapper calls PRAGMA wal_checkpoint(TRUNCATE) before each close(). If it resolves your wedge, that's useful signal that the WAL consolidation theory holds across both failure modes. If not, your journal_mode=DELETE workaround is pragmatic and you should ship it -- WAL's concurrent-reader benefit is marginal for the low-frequency kanban DB.

Either way, posting your findings on #31158 would help the maintainers understand the full scope of the WAL lifecycle issue.


alt-glitch added labels type/bug (Something isn't working), comp/cli (CLI entry point, hermes_cli/, setup wizard), and P3 (Low: cosmetic, nice to have) · May 23, 2026

alt-glitch mentioned this pull request May 23, 2026

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158

Closed

mohamedorigami-jpg mentioned this pull request May 24, 2026

fix(kanban): stop treating board default_workdir as a scratch workspace #31358

Open

Tranquil-Flow mentioned this pull request May 24, 2026

fix(state): proactively skip WAL journal mode on BTRFS filesystems (#30846) #31586

Open

alt-glitch mentioned this pull request May 25, 2026

Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure #31736

Closed

This was referenced May 25, 2026

fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226

Closed

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322

Closed

alt-glitch mentioned this pull request May 26, 2026

feat(memory): add staleness warning for outdated memory files #32321

Closed


This was referenced May 26, 2026

fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531

Closed

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

alt-glitch mentioned this pull request May 28, 2026

kanban_db.py: connection leak causes 'Too many open files' on macOS (FD exhaustion) #33580

Closed

someaka mentioned this pull request May 30, 2026

[Bug]: kanban specify helpers leak sqlite connections in long-lived processes #28802

Open


mohamedorigami-jpg force-pushed the fix/30799-kanban-fd-leak-wal-checkpoint branch from fd2c5e4 to aef989d · May 31, 2026 11:41

alt-glitch mentioned this pull request May 31, 2026

fix(db): close() methods on SQLite classes + WAL FD leak fix #36116

Open

mohamedorigami-jpg wants to merge 1 commit into NousResearch:main from mohamedorigami-jpg:fix/30799-kanban-fd-leak-wal-checkpoint
