fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race by steveonjava · Pull Request #32226 · NousResearch/hermes-agent

steveonjava · 2026-05-25T20:06:34Z

What does this PR do?

This PR fixes a SQLite WAL inode-rotation race in the gateway dispatcher that causes a loop of disk I/O error (EIO) failures after the first dispatch tick. The fix is a reliability improvement (single shared connection per board per process) with no security-boundary implications per SECURITY.md §3.2.

The gateway holds multiple short-lived SQLite connections to the same kanban DB in the same process (notifier watcher, sub/unsub/rewind helpers, per-tick dispatcher connect, ready-probe, auto-subscribe handler). When the last WAL holder closes, SQLite's in-process unixShmNode unlinks and recreates the -shm / -wal files. Other open connections still hold mmap references to the deleted inodes; the next fresh connection finds stale entries in the per-process WAL lock table and raises SQLITE_IOERR_SHMMAP, surfacing as sqlite3.OperationalError: disk I/O error. From that point every dispatcher tick fails until the gateway is restarted.

This is reliably reproducible: on a gateway running for an hour or so under moderate kanban write activity, ls -l /proc/<gateway_pid>/fd/ | grep deleted shows lingering kanban.db-shm (deleted) and kanban.db-wal (deleted) entries, and every subsequent PRAGMA journal_mode=WAL (or any WAL-touching query) raises EIO from inside the dispatcher, while a separate process can connect and query the same DB without issue.
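The /proc check described above can also be scripted. The following is a minimal, Linux-only sketch; the function names deleted_wal_targets and scan_pid are hypothetical and are not part of this PR or of the hermes-agent codebase.

```python
import os


def deleted_wal_targets(link_targets):
    """Return link targets that look like unlinked SQLite WAL/SHM sidecars."""
    return [
        t for t in link_targets
        if t.endswith(" (deleted)") and ("-wal" in t or "-shm" in t)
    ]


def scan_pid(pid):
    """Linux-only: collect the symlink targets under /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor may have been closed between listdir and readlink.
            continue
    return deleted_wal_targets(targets)
```

Calling scan_pid with the gateway's PID would return the same entries that the grep shows; an empty list means no deleted sidecar files are currently held open.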

Fix

Add a _kanban_conn_cache: dict[str, sqlite3.Connection] on Gateway, guarded by a threading.Lock. The new _kb_conn(slug) helper lazily creates and caches one connection per board slug; all watcher paths now use the shared cached connection instead of opening and closing one per call:

_kanban_notifier_watcher
_kanban_sub / _kanban_unsub / _kanban_rewind / _kanban_advance
_tick_once_for_board (dispatcher tick)
_ready_nonempty (cheap dispatcher health probe)
auto-subscribe handler

Connections are closed exactly once in stop(). kanban_db.connect() gains a check_same_thread keyword so the shared connection can be used safely from the gateway's worker threads — kanban_db's own _WRITE_LOCK and _INIT_LOCK already serialize writes, so connection sharing across threads is safe under the existing locking discipline.

By construction, no per-tick close happens anymore, so the inode-rotation race cannot fire.
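As an illustration of the cache shape described in this section, here is a minimal sketch. It assumes a hypothetical ConnCache class and a path_for_slug callable; the actual change lives on Gateway as _kanban_conn_cache and _kb_conn(slug), and its code is not reproduced here. Note that the author's later self-review in this thread reports that sharing one connection across threads in this way led to database corruption, so the sketch shows the pattern under discussion rather than a recommended design.

```python
import sqlite3
import threading


class ConnCache:
    """Hypothetical stand-in for the per-board cache; not the PR's code."""

    def __init__(self, path_for_slug):
        self._path_for_slug = path_for_slug  # assumed: callable slug -> DB path
        self._lock = threading.Lock()
        self._conns: dict[str, sqlite3.Connection] = {}

    def get(self, slug: str) -> sqlite3.Connection:
        # Hold the lock so two threads racing on first use create one connection.
        with self._lock:
            conn = self._conns.get(slug)
            if conn is None:
                conn = sqlite3.connect(
                    self._path_for_slug(slug), check_same_thread=False
                )
                self._conns[slug] = conn
            return conn

    def close_all(self) -> None:
        # Swap the dict out under the lock, then close outside it.
        with self._lock:
            conns, self._conns = list(self._conns.values()), {}
        for conn in conns:
            conn.close()
```

The lock here protects only the dictionary. It does not serialize statements executed on the returned connection, which is the gap the later comments identify.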

Related Issue

Addresses #31158 — upstream community report of the exact same symptom (deleted-inode FDs, EIO after a few completions) with the same diagnosis (per-process unixShmNode poisoning). The issue suggests three surgical alternatives; this PR implements the first one (single long-lived connection per board).

Coordinates with (does NOT duplicate):

fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799) #31130 — addresses a slow FD-count leak via wal_checkpoint(TRUNCATE) on close. Different mechanism; this PR does not touch hermes_cli/kanban.py and addresses the inode-rotation race rather than the FD-count growth.
fix(state): never silently downgrade WAL to DELETE on transient EIO #31294 — fixes a different bug in hermes_state.apply_wal_with_fallback (transient EIO silently downgrading to DELETE mode). No hermes_state.py changes in this PR.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature
Breaking change
Documentation update

Changes Made

gateway/run.py — added _kanban_conn_cache dict + _kb_conn(slug) helper; routed every watcher / helper that previously called _kb.connect() directly to the cached connection; cache cleanup in stop().
hermes_cli/kanban_db.py — connect() accepts check_same_thread keyword (default preserves current behavior).
tests/hermes_cli/test_kanban_dispatcher_wal.py — new file. 6 tests covering: slug normalization, concurrent first-init serialization, clean shutdown semantics, identity (not equality) of cached connections, check_same_thread=False propagation, multi-board isolation.
tests/gateway/test_kanban_notifier.py — small adjustments so existing notifier tests work with the cached connection.

How to Test

scripts/run_tests.sh                                   # full suite (matches CI)
pytest tests/hermes_cli/test_kanban_dispatcher_wal.py  # the new targeted tests
pytest tests/gateway/test_kanban_notifier.py           # adjusted notifier tests

Manual reproduction (before the fix): run hermes-gateway with the kanban dispatcher embedded, drive 1–10 kanban writes/min, watch /proc/<gateway_pid>/fd/ accumulate kanban.db-shm (deleted) / kanban.db-wal (deleted) entries, and observe sqlite3.OperationalError: disk I/O error in gateway.log on every subsequent dispatcher tick. After the fix: the same (deleted) mappings still appear briefly during normal SQLite lifecycle but no longer cause errors, because no further per-tick connection lifecycle triggers fresh inode rotation against stale mmaps. Verified locally — gateway has been ticking cleanly for hours with 0 EIO events.

Checklist

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(gateway): ...)
I've checked there isn't already a PR for this change
My PR includes only changes related to one bug/feature
scripts/run_tests.sh passes locally (184 kanban + notifier tests green)
I've added tests for my changes (required for bug fixes)
I've tested on my platform (Linux/macOS/Windows) — human reviewer to confirm

…node-rotation race

Multiple gateway watcher paths opened and closed short-lived SQLite connections to the same kanban DB in the same process. When a connection closed as the last WAL holder, SQLite's in-process unixShmNode unlinked and recreated the -shm/-wal files. Remaining open connections held mmap references to the deleted inodes; the next fresh connection found stale entries in the per-process WAL lock table and raised SQLITE_IOERR_SHMMAP ("disk I/O error").

Fix: add _kanban_conn_cache (dict[str, Connection], guarded by threading.Lock) on Gateway. _kb_conn(slug) lazily creates and caches one connection per board slug; all watcher paths (_kanban_notifier_watcher, _kanban_sub/unsub/rewind, _tick_once_for_board, _ready_nonempty, auto-subscribe handler) now use the shared connection instead of open/close per call. Connections are closed only in stop(). kanban_db.connect() gains check_same_thread kwarg so the shared connection can be used safely across watcher threads (kanban_db's own _WRITE_LOCK/_INIT_LOCK already serializes writes).

Addresses upstream community report NousResearch#31158.

steveonjava · 2026-05-25T20:45:38Z

Self-review note: do not merge as-is. Deployed locally on a hermes profile and reproduced kanban.db corruption on both default and writing boards within ~30 min of restart (default = index damage, writing = b-tree shared-page damage in task_comments). The shared-connection pattern with check_same_thread=False and _WRITE_LOCK-only synchronization is unsafe: _WRITE_LOCK serializes writes against writes but does not serialize concurrent reads on the same connection against in-flight writes; the watcher threads (_kanban_notifier_watcher, _tick_once_for_board, _ready_nonempty, auto-subscribe handler, sub/unsub helpers) all share the same connection from different asyncio.to_thread workers and corrupt the b-tree under load.

The right fix is almost certainly a per-thread connection cache via threading.local(): each watcher thread gets its own long-lived connection, no cross-thread sharing, no inode-rotation race. Test tests/hermes_cli/test_kanban_dispatcher_wal.py would need to be rewritten to assert per-thread isolation rather than identity of the cached connection. Closing this draft until the rework is verified safe under concurrent traffic; reopening when ready.
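For reference, the per-thread idea mentioned in the comment above could look roughly like the sketch below. PerThreadConns and path_for_slug are hypothetical names chosen for illustration; this is not code from this PR or from the follow-up #32322, and it has not been validated under the gateway's concurrent traffic.

```python
import sqlite3
import threading


class PerThreadConns:
    """Hypothetical per-thread connection cache; names are illustrative only."""

    def __init__(self, path_for_slug):
        self._path_for_slug = path_for_slug  # assumed: callable slug -> DB path
        self._local = threading.local()

    def get(self, slug):
        # Each thread sees its own `conns` dict, so a connection object
        # is never handed to a second thread.
        conns = getattr(self._local, "conns", None)
        if conns is None:
            conns = self._local.conns = {}
        conn = conns.get(slug)
        if conn is None:
            # Default check_same_thread=True makes sqlite3 raise if the
            # connection is ever used from another thread by mistake.
            conn = conns[slug] = sqlite3.connect(self._path_for_slug(slug))
        return conn
```

Shutdown cleanup is deliberately left out of the sketch: each connection would need to be closed from the thread that owns it, and with asyncio.to_thread the owning threads belong to a shared pool, so that part would need its own design.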

steveonjava · 2026-05-25T23:21:04Z

PR is unsafe and needs more research. Closing for now.

steveonjava · 2026-05-26T22:14:37Z

For future reference: this same shared-per-board approach was applied locally and reverted after production corruption was confirmed. Concurrent thread access to one SQLite connection violates SQLite's threading model. The dispatcher wedge it tried to fix is now addressed at a lower layer by #32489 (skip redundant WAL pragma on already-WAL connections), bundled into the consolidated #32857.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted via connection-sharing/caching approaches.
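The read-probe-then-return-early logic described in this commit message can be sketched as follows. ensure_wal is a hypothetical helper written for illustration; it is not the project's apply_wal_with_fallback() and omits the DELETE fallback path that the commit message says is unchanged.

```python
import sqlite3


def ensure_wal(conn: sqlite3.Connection) -> str:
    """Hypothetical helper: switch to WAL only if not already in WAL mode."""
    # Reading the pragma (no "=WAL") only reports the current mode.
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if str(current).lower() == "wal":
        return "already-wal"
    # Set path: SQLite returns the resulting mode, which may not be "wal"
    # (for example, in-memory databases do not support WAL).
    result = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return str(result).lower()
```

Because WAL mode is persistent in the database file, a second connection to the same file takes the early-return branch and never issues the set-pragma statement.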


alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), comp/plugins (Plugin system and bundled plugins), and P3 (Low: cosmetic, nice to have) labels on May 25, 2026

steveonjava closed this May 25, 2026

steveonjava mentioned this pull request May 26, 2026

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322

Closed

alt-glitch mentioned this pull request May 26, 2026

feat(memory): add staleness warning for outdated memory files #32321

Closed

6 tasks

This was referenced May 26, 2026

fix(kanban): skip redundant WAL pragma on already-WAL connections #32489

Closed

fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531

Closed

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

alt-glitch mentioned this pull request May 27, 2026

fix(gateway): close kanban DB connection after dispatch tick #33113

Closed

kshitijk4poor mentioned this pull request May 28, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

steveonjava wants to merge 1 commit into NousResearch:main from steveonjava:fix/kanban-dispatcher-wal-deleted-mapping-eio
