Skip to content

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322

Closed
steveonjava wants to merge 3 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-gateway-per-thread-conn-cache
Closed

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322
steveonjava wants to merge 3 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-gateway-per-thread-conn-cache

Conversation

@steveonjava

@steveonjava steveonjava commented May 26, 2026

Copy link
Copy Markdown
Contributor

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner

Summary

Introduces per-thread SQLite connection caching for kanban dispatcher queries. Previously, connections were opened and closed on every call, creating opportunities for b-tree corruption under concurrent read/write traffic and wasting connection overhead. The new threading.local() cache stores one connection per (thread, board-slug) pair, properly isolating database access per POSIX thread.

Problem

The embedded kanban dispatcher in GatewayRunner was opening a fresh SQLite connection on each operation (_kanban_advance, _kanban_rewind, etc.) and closing it in a finally block. Under multi-threaded executor traffic, this pattern creates:

  1. Corruption risk — concurrent readers on distinct connections may see partially-written pages mid-transaction.
  2. WAL inode-rotation race — connections close and reopen, causing the WAL reader state to diverge from the writer, leading to EIO on rotated inodes.
  3. Performance penalty — repeated open/close overhead, especially with WAL overhead on each cycle.

Solution

  • Add threading.local() cache (self._kb_tls) to store long-lived per-thread connections.
  • Implement _kb_conn(slug: str) method that returns the cached connection for the calling thread, or creates a new one on first access.
  • Add health check in _kb_conn() to detect and evict stale closed connections before reuse.
  • Remove conn.close() calls from inner poll-loop finally blocks (7 sites) — connections are long-lived per thread.
  • Connections are properly cleaned up only in the outer watcher-exit finally blocks.
  • Add comprehensive concurrent-traffic tests with integrity checks to guard against regression to the unsafe shared-connection pattern.

Changes

gateway/run.py

  • Add _kb_tls: threading.local attribute
  • Implement _kb_conn(slug) with cache lookup and health probing
  • Refactor 7 call sites to use _kb_conn() instead of opening inline connections
  • Remove per-iteration conn.close() from inner loops

tests/hermes_cli/test_kanban_dispatcher_wal.py

  • New test file with 4 concurrent-safety tests:
    • Per-thread connections are distinct objects
    • Per-thread connections are reused within the same thread
    • 4+ threads, 60s mixed read/write workload, passes PRAGMA integrity_check
    • Regression guard: test fails with the old shared-connection design

tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py

  • Additional 5 adversarial tests for edge cases:
    • Closed-connection eviction + re-creation
    • Multi-slug routing
    • Per-instance isolation
    • None/default board handling
    • Concurrent creation race conditions

Related Issues

Addresses #31158 (community report of dispatcher wedge under multi-thread concurrency).
Refs #14598 (connection leak on cache eviction — now fixed).
Refs #31736 (WAL connection pressure — now mitigated by fewer connections).
Supersedes #32226 (prior unsafe shared-connection approach).

Testing

All 4 existing tests pass:

tests/hermes_cli/test_kanban_dispatcher_wal.py::test_per_thread_connections_are_distinct PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_per_thread_connection_reused_within_thread PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_concurrent_mixed_workload_integrity PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_regression_shared_connection_fails PASS

Adversarial tests:

tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_closing_tls_cached_connection_invalidates_cache PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_multi_slug_isolation PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_per_instance_isolation PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_none_board_default_handling PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_concurrent_creation_race PASS

All existing tests pass: python -m pytest tests/ -o 'addopts=' -q

Contributor

Author: Stephen Chin (@steveonjava)

Opens one sqlite3 connection per (thread, board-slug) pair instead of
opening and closing a new connection on each call. Connections are stored
in threading.local() under self._kb_tls, keyed by the resolved board slug.

Previously, helper methods (_kanban_advance, _kanban_unsub, _kanban_rewind,
_ready_nonempty, _sub) opened a connection inside each invocation and closed
it in a finally block. Watcher loops (_kanban_notifier_watcher,
_kanban_dispatcher_watcher) also opened fresh connections per tick. Under
concurrent read/write traffic this pattern risks b-tree corruption (issue
NousResearch#32226) and adds unnecessary open/close overhead.

The new _kb_conn(slug) method checks the TLS cache and creates a connection
only on first access from a given thread. Explicit conn.close() calls are
removed from the seven affected call sites; the thread pool reuses OS threads
so TLS connections persist across calls without leaking.

Adds tests/hermes_cli/test_kanban_dispatcher_wal.py with 4 tests covering
distinct connections across threads, reuse within a thread, concurrent
mixed-workload integrity, and a regression guard against the shared-connection
pattern.
After conn.close() was called on a TLS-cached connection inside the
inner poll-loop finally blocks in _collect() and _tick_once_for_board(),
the cache retained the now-closed connection object. The next _kb_conn()
call from the same thread returned that object and any query raised
ProgrammingError: Cannot operate on a closed database.

Two fixes:
1. _kb_conn() now probes the cached connection with SELECT 1 and evicts
   it on ProgrammingError before opening a fresh one.
2. Removed conn.close() from the inner poll-loop finally blocks in
   _collect() (~line 4954) and _tick_once_for_board() (~line 5501).
   The connection is long-lived per thread; closing it after each
   iteration defeats the TLS cache. Outer watcher-exit finally blocks
   (the correct shutdown path) are unchanged.

Add adversarial test file (test_kanban_dispatcher_wal_adversarial.py)
with 5 tests that probe the closed-connection, multi-slug, per-instance,
None/default consistency, and concurrent-creation scenarios.

Fixes attempt-1 regression caught by verifier test
test_closing_tls_cached_connection_invalidates_cache.
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 26, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to the kanban SQLite WAL connection-churn cluster rooted at #29610. Very similar to #32226 (closed, per-board shared connection cache with threading.Lock) — this uses threading.local() instead. Also overlaps with #31130 (WAL-safe close) and #31768 (connection reuse). Demoted P3 per plugin-provider-demotion.

@steveonjava

steveonjava commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

Closing. I tested this approach locally and saw the same dispatcher wedge symptom we were trying to fix, just at lower frequency. The TLS cache reduces connection count but doesn't address the actual trigger, which is the redundant PRAGMA journal_mode=WAL that fires on every connect() and unlinks the sidecars even when the DB is already in WAL mode. PR #32489 fixes that at the source (and is bundled into the consolidated #32857). Leaving this closed to avoid sending a misleading approach into review.

kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted
via connection-sharing/caching approaches.
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants