
fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226

Closed
steveonjava wants to merge 1 commit into NousResearch:main from steveonjava:fix/kanban-dispatcher-wal-deleted-mapping-eio

@steveonjava


What does this PR do?

This PR fixes a SQLite WAL inode-rotation race in the gateway dispatcher that leaves the dispatcher looping on disk I/O error (EIO) after the first dispatch tick. The fix is a reliability improvement (a single shared connection per board per process) with no security-boundary implications per SECURITY.md §3.2.

The gateway holds multiple short-lived SQLite connections to the same kanban DB in the same process (notifier watcher, sub/unsub/rewind helpers, per-tick dispatcher connect, ready-probe, auto-subscribe handler). When the last WAL holder closes, SQLite's in-process unixShmNode unlinks and recreates the -shm / -wal files. Other open connections still hold mmap references to the deleted inodes; the next fresh connection finds stale entries in the per-process WAL lock table and raises SQLITE_IOERR_SHMMAP, surfacing as sqlite3.OperationalError: disk I/O error. From that point every dispatcher tick fails until the gateway is restarted.

This is reliably reproducible: on a gateway running for an hour or so under moderate kanban write activity, ls -l /proc/<gateway_pid>/fd/ | grep deleted shows lingering kanban.db-shm (deleted) and kanban.db-wal (deleted) entries, and every subsequent PRAGMA journal_mode=WAL (or any WAL-touching query) raises EIO from inside the dispatcher, while a separate process can connect and query the same DB without issue.
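The descriptor check described above can also be scripted. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name deleted_fds is hypothetical and is not part of the gateway code.

```python
import os


def deleted_fds(pid="self"):
    """Return (fd, target) pairs for descriptors whose file has been unlinked.

    Hypothetical diagnostic helper: reads the symlinks under /proc/<pid>/fd,
    which Linux suffixes with " (deleted)" once the underlying file is gone.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    if not os.path.isdir(fd_dir):
        # Platforms without /proc/<pid>/fd: nothing to report.
        return found
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # Descriptor was closed between listdir() and readlink().
            continue
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found


if __name__ == "__main__":
    for fd, target in deleted_fds():
        print(fd, target)
```

Passing the gateway's PID instead of "self" would list that process's unlinked descriptors, provided the caller has permission to read its /proc entry.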

Fix

Add a _kanban_conn_cache: dict[str, sqlite3.Connection] on Gateway, guarded by a threading.Lock. The new _kb_conn(slug) helper lazily creates and caches one connection per board slug; all watcher paths now use the shared cached connection instead of opening and closing one per call:

  • _kanban_notifier_watcher
  • _kanban_sub / _kanban_unsub / _kanban_rewind / _kanban_advance
  • _tick_once_for_board (dispatcher tick)
  • _ready_nonempty (cheap dispatcher health probe)
  • auto-subscribe handler

Connections are closed exactly once in stop(). kanban_db.connect() gains a check_same_thread keyword so the shared connection can be used safely from the gateway's worker threads — kanban_db's own _WRITE_LOCK and _INIT_LOCK already serialize writes, so connection sharing across threads is safe under the existing locking discipline.

By construction, no per-tick close happens anymore, so the inode-rotation race cannot fire.
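The caching pattern described above might look roughly like the sketch below. It is an illustration only, not the PR's code: the class name ConnCache, the path_for_slug callable, and the strip/lower slug normalization are all assumptions. It shows lazy creation under a lock and a single close at shutdown; it does not address whether concurrent use of one connection from several threads is safe.

```python
import sqlite3
import threading


class ConnCache:
    """Hypothetical stand-in for the per-board cache described for Gateway."""

    def __init__(self, path_for_slug):
        # path_for_slug: assumed callable mapping a board slug to a DB path.
        self._path_for_slug = path_for_slug
        self._conns = {}
        self._lock = threading.Lock()

    def get(self, slug):
        key = slug.strip().lower()  # assumed normalization rule
        with self._lock:
            conn = self._conns.get(key)
            if conn is None:
                # check_same_thread=False lets other threads use the handle;
                # it does not add any synchronization by itself.
                conn = sqlite3.connect(
                    self._path_for_slug(key), check_same_thread=False
                )
                self._conns[key] = conn
            return conn

    def close_all(self):
        # Detach the cached connections under the lock, then close each once.
        with self._lock:
            conns, self._conns = list(self._conns.values()), {}
        for conn in conns:
            conn.close()
```

Holding the lock across the lookup and the connect call is what serializes concurrent first-time initialization for the same slug.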

Related Issue

Addresses #31158 — upstream community report of the exact same symptom (deleted-inode FDs, EIO after a few completions) with the same diagnosis (per-process unixShmNode poisoning). The issue suggests three surgical alternatives; this PR implements the first one (single long-lived connection per board).

Coordinates with (does NOT duplicate):

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Documentation update

Changes Made

  • gateway/run.py: added the _kanban_conn_cache dict and the _kb_conn(slug) helper; routed every watcher / helper that previously called _kb.connect() directly to the cached connection; added cache cleanup in stop().
  • hermes_cli/kanban_db.py: connect() accepts a check_same_thread keyword (default preserves current behavior).
  • tests/hermes_cli/test_kanban_dispatcher_wal.py: new file with 6 tests covering slug normalization, concurrent first-init serialization, clean shutdown semantics, identity (not equality) of cached connections, check_same_thread=False propagation, and multi-board isolation.
  • tests/gateway/test_kanban_notifier.py: small adjustments so existing notifier tests work with the cached connection.

How to Test

scripts/run_tests.sh                                   # full suite (matches CI)
pytest tests/hermes_cli/test_kanban_dispatcher_wal.py  # the new targeted tests
pytest tests/gateway/test_kanban_notifier.py           # adjusted notifier tests

Manual reproduction (before the fix): run hermes-gateway with the kanban dispatcher embedded, drive 1–10 kanban writes/min, watch /proc/<gateway_pid>/fd/ accumulate kanban.db-shm (deleted) / kanban.db-wal (deleted) entries, and observe sqlite3.OperationalError: disk I/O error in gateway.log on every subsequent dispatcher tick. After the fix: the same (deleted) mappings still appear briefly during normal SQLite lifecycle but no longer cause errors, because no further per-tick connection lifecycle triggers fresh inode rotation against stale mmaps. Verified locally — gateway has been ticking cleanly for hours with 0 EIO events.

Checklist

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway): ...)
  • I've checked there isn't already a PR for this change
  • My PR includes only changes related to one bug/feature
  • scripts/run_tests.sh passes locally (184 kanban + notifier tests green)
  • I've added tests for my changes (required for bug fixes)
  • I've tested on my platform (Linux/macOS/Windows) — human reviewer to confirm

…node-rotation race

Multiple gateway watcher paths opened and closed short-lived SQLite
connections to the same kanban DB in the same process. When a
connection closed as the last WAL holder, SQLite's in-process
unixShmNode unlinked and recreated the -shm/-wal files. Remaining
open connections held mmap references to the deleted inodes; the next
fresh connection found stale entries in the per-process WAL lock table
and raised SQLITE_IOERR_SHMMAP ("disk I/O error").

Fix: add _kanban_conn_cache (dict[str, Connection], guarded by
threading.Lock) on Gateway. _kb_conn(slug) lazily creates and caches
one connection per board slug; all watcher paths
(_kanban_notifier_watcher, _kanban_sub/unsub/rewind,
_tick_once_for_board, _ready_nonempty, auto-subscribe handler) now use
the shared connection instead of open/close per call. Connections are
closed only in stop(). kanban_db.connect() gains a check_same_thread
kwarg so the shared connection can be used safely across watcher
threads (kanban_db's own _WRITE_LOCK/_INIT_LOCK already serialize
writes).

Addresses upstream community report NousResearch#31158.
alt-glitch added the type/bug, comp/gateway, comp/plugins, and P3 labels on May 25, 2026
@steveonjava


Self-review note: do not merge as-is. Deployed locally on a hermes profile and reproduced kanban.db corruption on both default and writing boards within ~30 min of restart (default = index damage, writing = b-tree shared-page damage in task_comments). The shared-connection pattern with check_same_thread=False and _WRITE_LOCK-only synchronization is unsafe: _WRITE_LOCK serializes writes against writes but does not serialize concurrent reads on the same connection against in-flight writes; the watcher threads (_kanban_notifier_watcher, _tick_once_for_board, _ready_nonempty, auto-subscribe handler, sub/unsub helpers) all share the same connection from different asyncio.to_thread workers and corrupt the b-tree under load.

The right fix is almost certainly a per-thread connection cache via threading.local(): each watcher thread gets its own long-lived connection, no cross-thread sharing, no inode-rotation race. Test tests/hermes_cli/test_kanban_dispatcher_wal.py would need to be rewritten to assert per-thread isolation rather than identity of the cached connection. Closing this draft until the rework is verified safe under concurrent traffic; reopening when ready.
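A per-thread cache of the kind proposed here could be sketched as follows. This is only an illustration of the threading.local() idea, not verified gateway code: the function name thread_conn and the choice to key on the DB path are assumptions.

```python
import sqlite3
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def thread_conn(path):
    """Return the calling thread's own long-lived connection to `path`."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    conn = conns.get(path)
    if conn is None:
        # Default check_same_thread=True: Python raises ProgrammingError if
        # this handle is ever used from a thread other than its creator.
        conn = conns[path] = sqlite3.connect(path)
    return conn
```

Because asyncio.to_thread runs work on pool threads, each connection would live as long as its pool thread does. The sketch has no shutdown hook, so closing these connections cleanly would need separate handling.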

@steveonjava


PR is unsafe and needs more research. Closing for now.

@steveonjava

steveonjava commented May 26, 2026


For future reference: this same shared-per-board approach was applied locally and reverted after production corruption was confirmed. Concurrent thread access to one SQLite connection violates SQLite's threading model. The dispatcher wedge it tried to fix is now addressed at a lower layer by #32489 (skip redundant WAL pragma on already-WAL connections), bundled into the consolidated #32857.

kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted
via connection-sharing/caching approaches.
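The read-probe idea in this commit message can be illustrated with a short sketch. The function name ensure_wal is hypothetical; the real apply_wal_with_fallback() body is not shown on this page, and the DELETE fallback path it mentions is omitted here.

```python
import sqlite3


def ensure_wal(conn):
    """Switch to WAL only if the database is not already in WAL mode.

    Hypothetical helper: PRAGMA journal_mode with no argument only reads
    the current mode, so an already-WAL database never reaches the
    set-pragma path.
    """
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if str(current).lower() == "wal":
        return "wal"  # already WAL: return early
    # Not WAL yet: issue the set-pragma and report the resulting mode.
    return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
```

Since WAL mode is persistent in the database file, only the first connection to a fresh database should take the set-pragma branch; later connections return at the probe.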