fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226
Multiple gateway watcher paths opened and closed short-lived SQLite
connections to the same kanban DB in the same process. When a
connection closed as the last WAL holder, SQLite's in-process
unixShmNode unlinked and recreated the -shm/-wal files. Remaining
open connections held mmap references to the deleted inodes; the next
fresh connection found stale entries in the per-process WAL lock table
and raised SQLITE_IOERR_SHMMAP ("disk I/O error").
Fix: add _kanban_conn_cache (dict[str, Connection], guarded by
threading.Lock) on Gateway. _kb_conn(slug) lazily creates and caches
one connection per board slug; all watcher paths (_kanban_notifier_
watcher, _kanban_sub/unsub/rewind, _tick_once_for_board,
_ready_nonempty, auto-subscribe handler) now use the shared
connection instead of open/close per call. Connections are closed
only in stop(). kanban_db.connect() gains check_same_thread kwarg
so the shared connection can be used safely across watcher threads
(kanban_db's own _WRITE_LOCK/_INIT_LOCK already serializes writes).
Addresses upstream community report NousResearch#31158.
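A minimal sketch of the cache this commit message describes, assuming plain sqlite3.connect() in place of kanban_db.connect(). BoardConnCache, path_for_slug, and the strip/lowercase slug normalization are hypothetical names and rules picked for illustration, not the Gateway code. The review comments on this PR report database corruption with this approach, so the sketch documents the idea rather than recommending it.

```python
import sqlite3
import threading


class BoardConnCache:
    """Hypothetical stand-in for the per-board connection cache on Gateway."""

    def __init__(self, path_for_slug):
        # path_for_slug: assumed callable mapping a board slug to a DB path.
        self._path_for_slug = path_for_slug
        self._lock = threading.Lock()
        self._conns = {}  # slug -> sqlite3.Connection

    def get(self, slug):
        # Slug normalization rule is an assumption, not taken from the PR.
        key = slug.strip().lower()
        with self._lock:
            conn = self._conns.get(key)
            if conn is None:
                # Lazily create one connection per board; the lock serializes
                # concurrent first-time initialization.
                conn = sqlite3.connect(
                    self._path_for_slug(key), check_same_thread=False
                )
                self._conns[key] = conn
            return conn

    def close_all(self):
        # Intended to be called once at shutdown (stop() in the PR).
        with self._lock:
            for conn in self._conns.values():
                conn.close()
            self._conns.clear()
```

The lock here only protects the dictionary; it does not serialize statements issued on a cached connection by different threads.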
Self-review note: do not merge as-is. Deployed locally on a hermes profile and reproduced kanban.db corruption on both default and writing boards within ~30 min of restart (default = index damage, writing = b-tree shared-page damage in
PR is unsafe and needs more research. Closing for now.
For future reference: this same shared-per-board approach was applied locally and reverted after production corruption was confirmed. Concurrent thread access to one SQLite connection violates SQLite's threading model. The dispatcher wedge it tried to fix is now addressed at a lower layer by #32489 (skip redundant WAL pragma on already-WAL connections), bundled into the consolidated #32857.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted
via connection-sharing/caching approaches.
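The probe-then-set order described above might look roughly like the following. This is a sketch under assumptions: ensure_wal_sketch is a hypothetical name, and the real apply_wal_with_fallback() signature, error handling, and fallback conditions are not shown in this thread.

```python
import sqlite3


def ensure_wal_sketch(conn):
    """Return the journal mode in effect, avoiding a redundant WAL set."""
    # Read-only probe: reports the current journal mode without changing it.
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if str(current).lower() == "wal":
        return "wal"  # already in WAL mode, skip the set-pragma path

    try:
        # Set-pragma path: only reached when the DB is not yet in WAL mode.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    except sqlite3.OperationalError:
        # Assumed shape of the DELETE fallback mentioned in the commit message.
        mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
    return str(mode).lower()
```

With this ordering, a second connection to a database that is already in WAL mode returns from the probe and never issues the set pragma.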
What does this PR do?
This PR fixes a SQLite WAL inode-rotation race in the gateway dispatcher that causes disk I/O error (EIO) loops after the first dispatch tick. The fix is a reliability improvement (single shared connection per board per process) with no security-boundary implications per SECURITY.md §3.2.
The gateway holds multiple short-lived SQLite connections to the same kanban DB in the same process (notifier watcher, sub/unsub/rewind helpers, per-tick dispatcher connect, ready-probe, auto-subscribe handler). When the last WAL holder closes, SQLite's in-process unixShmNode unlinks and recreates the -shm/-wal files. Other open connections still hold mmap references to the deleted inodes; the next fresh connection finds stale entries in the per-process WAL lock table and raises SQLITE_IOERR_SHMMAP, surfacing as sqlite3.OperationalError: disk I/O error. From that point every dispatcher tick fails until the gateway is restarted.
This is reliably reproducible: on a gateway running for an hour or so under moderate kanban write activity, ls -l /proc/<gateway_pid>/fd/ | grep deleted shows lingering kanban.db-shm (deleted) and kanban.db-wal (deleted) entries, and every subsequent PRAGMA journal_mode=WAL (or any WAL-touching query) raises EIO from inside the dispatcher, while a separate process can connect and query the same DB without issue.
Fix
Add a _kanban_conn_cache: dict[str, sqlite3.Connection] on Gateway, guarded by a threading.Lock. The new _kb_conn(slug) helper lazily creates and caches one connection per board slug; all watcher paths now use the shared cached connection instead of opening and closing one per call:
- _kanban_notifier_watcher
- _kanban_sub / _kanban_unsub / _kanban_rewind / _kanban_advance
- _tick_once_for_board (dispatcher tick)
- _ready_nonempty (cheap dispatcher health probe)
Connections are closed exactly once in stop(). kanban_db.connect() gains a check_same_thread keyword so the shared connection can be used safely from the gateway's worker threads; kanban_db's own _WRITE_LOCK and _INIT_LOCK already serialize writes, so connection sharing across threads is safe under the existing locking discipline.
By construction, no per-tick close happens anymore, so the inode-rotation race cannot fire.
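For context on the check_same_thread keyword, here is a small standalone demonstration using Python's sqlite3 module directly (not kanban_db.connect(), whose exact signature is not shown here). The helper name use_from_other_thread is hypothetical. The flag only disables the module's thread-ownership check; it adds no locking of its own, so any serialization has to come from the caller.

```python
import sqlite3
import threading


def use_from_other_thread(conn):
    """Run a trivial query on conn from a new thread.

    Returns the sqlite3.ProgrammingError raised, or None on success.
    """
    result = {}

    def worker():
        try:
            conn.execute("SELECT 1").fetchone()
            result["error"] = None
        except sqlite3.ProgrammingError as exc:
            result["error"] = exc

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["error"]


# Default: the connection may only be used by the thread that created it.
strict = sqlite3.connect(":memory:")
# Opt-out: other threads may use it, but nothing serializes their access.
shared = sqlite3.connect(":memory:", check_same_thread=False)
```

Calling use_from_other_thread(strict) returns a ProgrammingError, while use_from_other_thread(shared) returns None because only one thread touches the connection at a time in this demo.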
Related Issue
Addresses #31158, the upstream community report of the exact same symptom (deleted-inode FDs, EIO after a few completions) with the same diagnosis (per-process unixShmNode poisoning). The issue suggests three surgical alternatives; this PR implements the first one (single long-lived connection per board).
Coordinates with (does NOT duplicate):
- wal_checkpoint(TRUNCATE) on close. Different mechanism; this PR does not touch hermes_cli/kanban.py and addresses the inode-rotation race rather than the FD-count growth.
- hermes_state.apply_wal_with_fallback (transient EIO silently downgrading to DELETE mode). No hermes_state.py changes in this PR.
Type of Change
Changes Made
- gateway/run.py: added _kanban_conn_cache dict + _kb_conn(slug) helper; routed every watcher / helper that previously called _kb.connect() directly to the cached connection; cache cleanup in stop().
- hermes_cli/kanban_db.py: connect() accepts check_same_thread keyword (default preserves current behavior).
- tests/hermes_cli/test_kanban_dispatcher_wal.py: new file. 6 tests covering: slug normalization, concurrent first-init serialization, clean shutdown semantics, identity (not equality) of cached connections, check_same_thread=False propagation, multi-board isolation.
- tests/gateway/test_kanban_notifier.py: small adjustments so existing notifier tests work with the cached connection.
How to Test
Manual reproduction (before the fix): run hermes-gateway with the kanban dispatcher embedded, drive 1–10 kanban writes/min, watch /proc/<gateway_pid>/fd/ accumulate kanban.db-shm (deleted) / kanban.db-wal (deleted) entries, and observe sqlite3.OperationalError: disk I/O error in gateway.log on every subsequent dispatcher tick. After the fix: the same (deleted) mappings still appear briefly during normal SQLite lifecycle but no longer cause errors, because no further per-tick connection lifecycle triggers fresh inode rotation against stale mmaps. Verified locally: gateway has been ticking cleanly for hours with 0 EIO events.
Checklist
- fix(gateway): ...)
- scripts/run_tests.sh passes locally (184 kanban + notifier tests green)
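The manual check from How to Test (looking for (deleted) entries under /proc/<gateway_pid>/fd/) can be scripted. The following is a Linux-only sketch; deleted_fd_targets and its needle parameter are hypothetical names, and the caller needs permission to read the target process's fd directory.

```python
import os


def deleted_fd_targets(pid, needle="kanban.db"):
    """List fd link targets of a process that point at deleted files.

    Only targets containing `needle` are returned.
    """
    fd_dir = f"/proc/{pid}/fd"
    hits = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        # The kernel appends " (deleted)" to the link target of an unlinked file.
        if needle in target and target.endswith("(deleted)"):
            hits.append(target)
    return sorted(hits)
```

Polling this for the gateway PID while driving kanban writes would show whether -shm/-wal entries accumulate over time.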