fix(gateway): cache kanban DB connections per OS thread in GatewayRunner by steveonjava · Pull Request #32322 · NousResearch/hermes-agent

steveonjava · 2026-05-26T00:28:24Z

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner

Summary

Introduces per-thread SQLite connection caching for kanban dispatcher queries. Previously, connections were opened and closed on every call, creating opportunities for b-tree corruption under concurrent read/write traffic and wasting connection overhead. The new threading.local() cache stores one connection per (thread, board-slug) pair, properly isolating database access per POSIX thread.

Problem

The embedded kanban dispatcher in GatewayRunner was opening a fresh SQLite connection on each operation (_kanban_advance, _kanban_rewind, etc.) and closing it in a finally block. Under multi-threaded executor traffic, this pattern creates:

Corruption risk — concurrent readers on distinct connections may see partially-written pages mid-transaction.
WAL inode-rotation race — connections close and reopen, causing the WAL reader state to diverge from the writer, leading to EIO on rotated inodes.
Performance penalty — repeated open/close overhead, especially with WAL overhead on each cycle.

Solution

Add threading.local() cache (self._kb_tls) to store long-lived per-thread connections.
Implement _kb_conn(slug: str) method that returns the cached connection for the calling thread, or creates a new one on first access.
Add health check in _kb_conn() to detect and evict stale closed connections before reuse.
Remove conn.close() calls from inner poll-loop finally blocks (7 sites) — connections are long-lived per thread.
Connections are properly cleaned up only in the outer watcher-exit finally blocks.
Add comprehensive concurrent-traffic tests with integrity checks to guard against regression to the unsafe shared-connection pattern.

Changes

gateway/run.py

Add _kb_tls: threading.local attribute
Implement _kb_conn(slug) with cache lookup and health probing
Refactor 7 call sites to use _kb_conn() instead of opening inline connections
Remove per-iteration conn.close() from inner loops

tests/hermes_cli/test_kanban_dispatcher_wal.py

New test file with 4 concurrent-safety tests:
- Per-thread connections are distinct objects
- Per-thread connections are reused within the same thread
- 4+ threads, 60s mixed read/write workload, passes PRAGMA integrity_check
- Regression guard: test fails with the old shared-connection design

tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py

Additional 5 adversarial tests for edge cases:
- Closed-connection eviction + re-creation
- Multi-slug routing
- Per-instance isolation
- None/default board handling
- Concurrent creation race conditions

Related Issues

Addresses #31158 (community report of dispatcher wedge under multi-thread concurrency).
Refs #14598 (connection leak on cache eviction — now fixed).
Refs #31736 (WAL connection pressure — now mitigated by fewer connections).
Supersedes #32226 (prior unsafe shared-connection approach).

Testing

All 4 existing tests pass:

tests/hermes_cli/test_kanban_dispatcher_wal.py::test_per_thread_connections_are_distinct PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_per_thread_connection_reused_within_thread PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_concurrent_mixed_workload_integrity PASS
tests/hermes_cli/test_kanban_dispatcher_wal.py::test_regression_shared_connection_fails PASS

Adversarial tests:

tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_closing_tls_cached_connection_invalidates_cache PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_multi_slug_isolation PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_per_instance_isolation PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_none_board_default_handling PASS
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py::test_concurrent_creation_race PASS

All existing tests pass: python -m pytest tests/ -o 'addopts=' -q

Contributor

Author: Stephen Chin (@steveonjava)

Opens one sqlite3 connection per (thread, board-slug) pair instead of opening and closing a new connection on each call. Connections are stored in threading.local() under self._kb_tls, keyed by the resolved board slug. Previously, helper methods (_kanban_advance, _kanban_unsub, _kanban_rewind, _ready_nonempty, _sub) opened a connection inside each invocation and closed it in a finally block. Watcher loops (_kanban_notifier_watcher, _kanban_dispatcher_watcher) also opened fresh connections per tick. Under concurrent read/write traffic this pattern risks b-tree corruption (issue NousResearch#32226) and adds unnecessary open/close overhead. The new _kb_conn(slug) method checks the TLS cache and creates a connection only on first access from a given thread. Explicit conn.close() calls are removed from the seven affected call sites; the thread pool reuses OS threads so TLS connections persist across calls without leaking. Adds tests/hermes_cli/test_kanban_dispatcher_wal.py with 4 tests covering distinct connections across threads, reuse within a thread, concurrent mixed-workload integrity, and a regression guard against the shared-connection pattern.

After conn.close() was called on a TLS-cached connection inside the inner poll-loop finally blocks in _collect() and _tick_once_for_board(), the cache retained the now-closed connection object. The next _kb_conn() call from the same thread returned that object and any query raised ProgrammingError: Cannot operate on a closed database. Two fixes: 1. _kb_conn() now probes the cached connection with SELECT 1 and evicts it on ProgrammingError before opening a fresh one. 2. Removed conn.close() from the inner poll-loop finally blocks in _collect() (~line 4954) and _tick_once_for_board() (~line 5501). The connection is long-lived per thread; closing it after each iteration defeats the TLS cache. Outer watcher-exit finally blocks (the correct shutdown path) are unchanged. Add adversarial test file (test_kanban_dispatcher_wal_adversarial.py) with 5 tests that probe the closed-connection, multi-slug, per-instance, None/default consistency, and concurrent-creation scenarios. Fixes attempt-1 regression caught by verifier test test_closing_tls_cached_connection_invalidates_cache.

alt-glitch · 2026-05-26T00:47:28Z

Related to the kanban SQLite WAL connection-churn cluster rooted at #29610. Very similar to #32226 (closed, per-board shared connection cache with threading.Lock) — this uses threading.local() instead. Also overlaps with #31130 (WAL-safe close) and #31768 (connection reuse). Demoted P3 per plugin-provider-demotion.

steveonjava · 2026-05-26T22:14:34Z

Closing. I tested this approach locally and saw the same dispatcher wedge symptom we were trying to fix, just at lower frequency. The TLS cache reduces connection count but doesn't address the actual trigger, which is the redundant PRAGMA journal_mode=WAL that fires on every connect() and unlinks the sidecars even when the DB is already in WAL mode. PR #32489 fixes that at the source (and is bundled into the consolidated #32857). Leaving this closed to avoid sending a misleading approach into review.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted via connection-sharing/caching approaches.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted via connection-sharing/caching approaches.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted via connection-sharing/caching approaches. #AI commit#

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted via connection-sharing/caching approaches.

steveonjava added 3 commits May 25, 2026 17:05

chore(release): add steveonjava to AUTHOR_MAP

1feea4b

alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 26, 2026

herrschmidt mentioned this pull request May 26, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

This was referenced May 26, 2026

fix(kanban): skip redundant WAL pragma on already-WAL connections #32489

Closed

fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531

Closed

steveonjava closed this May 26, 2026

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322
steveonjava wants to merge 3 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-gateway-per-thread-conn-cache

steveonjava commented May 26, 2026 •

edited

Loading

Uh oh!

alt-glitch commented May 26, 2026

Uh oh!

steveonjava commented May 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

steveonjava commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner

Summary

Problem

Solution

Changes

gateway/run.py

tests/hermes_cli/test_kanban_dispatcher_wal.py

tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py

Related Issues

Testing

Contributor

Uh oh!

alt-glitch commented May 26, 2026

Uh oh!

steveonjava commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

steveonjava commented May 26, 2026 •

edited

Loading

steveonjava commented May 26, 2026 •

edited

Loading