fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322
fix(gateway): cache kanban DB connections per OS thread in GatewayRunner#32322steveonjava wants to merge 3 commits into
Conversation
Opens one sqlite3 connection per (thread, board-slug) pair instead of opening and closing a new connection on each call. Connections are stored in threading.local() under self._kb_tls, keyed by the resolved board slug. Previously, helper methods (_kanban_advance, _kanban_unsub, _kanban_rewind, _ready_nonempty, _sub) opened a connection inside each invocation and closed it in a finally block. Watcher loops (_kanban_notifier_watcher, _kanban_dispatcher_watcher) also opened fresh connections per tick. Under concurrent read/write traffic this pattern risks b-tree corruption (issue NousResearch#32226) and adds unnecessary open/close overhead. The new _kb_conn(slug) method checks the TLS cache and creates a connection only on first access from a given thread. Explicit conn.close() calls are removed from the seven affected call sites; the thread pool reuses OS threads so TLS connections persist across calls without leaking. Adds tests/hermes_cli/test_kanban_dispatcher_wal.py with 4 tests covering distinct connections across threads, reuse within a thread, concurrent mixed-workload integrity, and a regression guard against the shared-connection pattern.
After conn.close() was called on a TLS-cached connection inside the inner poll-loop finally blocks in _collect() and _tick_once_for_board(), the cache retained the now-closed connection object. The next _kb_conn() call from the same thread returned that object and any query raised ProgrammingError: Cannot operate on a closed database. Two fixes: 1. _kb_conn() now probes the cached connection with SELECT 1 and evicts it on ProgrammingError before opening a fresh one. 2. Removed conn.close() from the inner poll-loop finally blocks in _collect() (~line 4954) and _tick_once_for_board() (~line 5501). The connection is long-lived per thread; closing it after each iteration defeats the TLS cache. Outer watcher-exit finally blocks (the correct shutdown path) are unchanged. Add adversarial test file (test_kanban_dispatcher_wal_adversarial.py) with 5 tests that probe the closed-connection, multi-slug, per-instance, None/default consistency, and concurrent-creation scenarios. Fixes attempt-1 regression caught by verifier test test_closing_tls_cached_connection_invalidates_cache.
|
Related to the kanban SQLite WAL connection-churn cluster rooted at #29610. Very similar to #32226 (closed, per-board shared connection cache with threading.Lock) — this uses |
|
Closing. I tested this approach locally and saw the same dispatcher wedge symptom we were trying to fix, just at lower frequency. The TLS cache reduces connection count but doesn't address the actual trigger, which is the redundant |
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted
via connection-sharing/caching approaches.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
#AI commit#
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
fix(gateway): cache kanban DB connections per OS thread in GatewayRunner
Summary
Introduces per-thread SQLite connection caching for kanban dispatcher queries. Previously, connections were opened and closed on every call, creating opportunities for b-tree corruption under concurrent read/write traffic and wasting connection overhead. The new
threading.local()cache stores one connection per (thread, board-slug) pair, properly isolating database access per POSIX thread.Problem
The embedded kanban dispatcher in
GatewayRunnerwas opening a fresh SQLite connection on each operation (_kanban_advance, _kanban_rewind, etc.) and closing it in a finally block. Under multi-threaded executor traffic, this pattern creates:Solution
threading.local()cache (self._kb_tls) to store long-lived per-thread connections._kb_conn(slug: str)method that returns the cached connection for the calling thread, or creates a new one on first access._kb_conn()to detect and evict stale closed connections before reuse.conn.close()calls from inner poll-loop finally blocks (7 sites) — connections are long-lived per thread.Changes
gateway/run.py
_kb_tls: threading.localattribute_kb_conn(slug)with cache lookup and health probing_kb_conn()instead of opening inline connectionsconn.close()from inner loopstests/hermes_cli/test_kanban_dispatcher_wal.py
tests/hermes_cli/test_kanban_dispatcher_wal_adversarial.py
Related Issues
Addresses #31158 (community report of dispatcher wedge under multi-thread concurrency).
Refs #14598 (connection leak on cache eviction — now fixed).
Refs #31736 (WAL connection pressure — now mitigated by fewer connections).
Supersedes #32226 (prior unsafe shared-connection approach).
Testing
All 4 existing tests pass:
Adversarial tests:
All existing tests pass:
python -m pytest tests/ -o 'addopts=' -qContributor
Author: Stephen Chin (@steveonjava)