fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster) by teknium1 · Pull Request #33243 · NousResearch/hermes-agent

teknium1 · 2026-05-27T12:44:00Z

Summary

Stops the kanban DB false-positive corruption + backup storm when gateway and dashboard run as separate processes (#33169) by routing every long-lived caller through a pooled connection and stripping per-tick/per-request open/close churn.

Root cause: _guard_existing_db_is_healthy() opened a separate probe connection on every connect() call. Under WAL with concurrent writers, that probe transiently saw torn pages and declared healthy files corrupt, then moved them to .corrupt.*.bak. Per-tick open/close in the gateway dispatcher and per-request open/close in the dashboard amplified the race.

Changes

Salvaged @herrschmidt's fix(kanban): remove false-positive corruption detection from separate probe connection #32449 onto current main (single commit, authorship preserved): removes the separate probe connection, adds a singleton pooled connection per process+board (mirrors hermes_state.py:SessionDB), adds PRAGMA busy_timeout=30000, atexit PASSIVE checkpoint on shutdown.
Followed up to widen the fix:
- gateway/run.py: switch all 7 _kb.connect() sites (notifier, dispatcher, has_spawnable check, auto-subscribe) to get_connection(). No more per-tick churn.
- plugins/kanban/dashboard/plugin_api.py: switch _conn() to get_connection() and strip the 24 finally: conn.close() blocks that would otherwise evict the pool on every HTTP request.
- tools/kanban_tools.py: remove the dead try: ...; finally: pass blocks fix(kanban): remove false-positive corruption detection from separate probe connection #32449 left behind (9 sites). Cosmetic.
- hermes_cli/kanban_db.py: reorder _open_connection to run apply_wal_with_fallback BEFORE _check_db_health. Running integrity_check first held an implicit read transaction that broke the WAL→DELETE fallback on NFS/SMB. Wrap any WAL-setup DatabaseError in KanbanDbCorruptError for consistent caller error type. Fix init_db() to actually close the connection it opens — the PR's claim that connect() returns pooled was wrong (only get_connection() does).
scripts/release.py: add herrschmidt@users.noreply.github.com to AUTHOR_MAP.

Validation

	Before	After
`tests/hermes_cli/test_kanban_db.py`	n/a	172/172 pass
`tests/plugins/test_kanban_dashboard_plugin.py`	n/a	94/94 pass
`tests/tools/test_kanban_tools.py`	n/a	81/81 pass
`tests/gateway/` (full suite)	n/a	5944/5944 pass
Stress: `--run-stress tests/stress/test_kanban_singleton_pool.py`	n/a	7/7 pass
E2E: 2 processes × 15s concurrent	flood of `.corrupt.*.bak`	0 backups, 2948 tasks, integrity ok

E2E reproduces the exact #33169 scenario: two separate processes each hold a pooled connection to the same kanban.db and hammer it with concurrent reads + writes. Result: zero .corrupt.*.bak files produced, final PRAGMA integrity_check returns ok, both workers exit clean with zero errors.

Closes

[Bug] Kanban DB corruption when gateway and dashboard open the same board DB concurrently (WAL mode) #33169 (this issue), [Bug]: kanban.db corruption when multiple profile gateways share the same board DB #32424, [Bug]: Kanban DB intermittent corruption after worker crash — missing WAL checkpoint #32543, Kanban SQLite database corruption under rapid task creation #31502, kanban.db index corruption after frequent gateway restarts — dispatcher disables board permanently #30908, kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618, feat(kanban): add busy_timeout PRAGMA to prevent WAL corruption under concurrent writers #32532

Credits

Backbone work by @herrschmidt (#32449). Cherry-picked with original authorship preserved; follow-up commit by us widens the fix across the gateway and dashboard call sites, fixes the WAL-fallback ordering, and lints up the residual dead code.

Remove _guard_existing_db_is_healthy() which opened a separate probe connection with timeout=5 to run PRAGMA integrity_check. This probe connection saw transient WAL states during checkpointing and produced false-positive corruption detection, triggering backup storms (73+ .corrupt.*.bak files in minutes). Replace with _check_db_health(conn, path) that runs integrity_check on the MAIN connection — no second connection, no transient WAL races. Add get_connection() singleton pool (one connection per process+board) following hermes_state.py SessionDB pattern. kanban_tools.py handlers now use get_connection() instead of per-call connect(), eliminating connection open/close churn that caused WAL-index initialization races. connect() remains backward-compatible: creates fresh connections for tests and legacy callers that expect independent connections. Add stress tests: 1000 rapid create+complete (integrity_check ok), concurrent connection integrity, singleton identity, thread safety. Related: #30445, #31618, #31502, #31795

Salvage follow-up for #32449 (singleton connection pool): 1. Switch all 7 `_kb.connect()` call sites in gateway/run.py to `get_connection()` so the long-lived gateway dispatcher and notifier stop opening/closing fresh connections on every tick. Per-tick churn is what triggered the false-positive corruption probe under WAL checkpointing in the first place (#33169). 2. Switch the dashboard's `_conn()` helper to `get_connection()` and strip the 24 `finally: conn.close()` blocks that would otherwise evict the pool on every HTTP request — restoring the per-request churn. The dashboard runs as a separate process from the gateway, so its own pool is independent; cross-process safety comes from the `busy_timeout=30000` PRAGMA added in #32449. 3. Clean up the dead `try: ...; finally: pass` blocks #32449 left in tools/kanban_tools.py (9 sites). Pure cosmetic — no behaviour change. 4. Reorder `_open_connection` to call `apply_wal_with_fallback` BEFORE `_check_db_health`. Running integrity_check first held an implicit read transaction that prevented the WAL→DELETE journal-mode fallback from acquiring its exclusive lock on NFS/SMB filesystems ("database is locked" against `PRAGMA journal_mode=DELETE`). Any DatabaseError from WAL setup itself is now wrapped in KanbanDbCorruptError so callers see a consistent error type. 5. Fix the dropped `conn.close()` in `init_db()` — the previous comment claimed connect() returned a pooled connection, but it doesn't (only get_connection does). The leaked fresh connection held WAL locks that broke the kanban_home fixture's downstream tests. E2E verified: two separate processes hammering the same kanban.db for 15s each (concurrent reads + writes), 2948 tasks created, zero `.corrupt.*.bak` files produced, final integrity_check ok. Closes #33169 and the wider kanban-WAL cluster (#32424, #32543, #31502, #30908, #31618).

github-actions · 2026-05-27T12:49:05Z

🔎 Lint report: `hermes/hermes-6bc90445` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9496 on HEAD, 9495 on base (🆕 +1)

🆕 New issues (1):

Rule	Count
`unresolved-import`	1

First entries

tests/stress/test_kanban_singleton_pool.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues: none

Unchanged: 5014 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

+def _try_stat(path: Path) -> Optional[os.stat_result]:
+    """Return ``path.stat()`` or ``None`` on any OSError (missing file, etc.)."""
+    try:
+        return path.stat()


+        path = db_path
+    else:
+        path = kanban_db_path(board=board)
+    resolved = str(path.resolve())


+    else:
+        path = kanban_db_path(board=board)
+    resolved = str(path.resolve())
+    path.parent.mkdir(parents=True, exist_ok=True)


+        path = db_path
+    else:
+        path = kanban_db_path(board=board)
+    resolved = str(path.resolve())


julio-cloudvisor · 2026-05-27T16:08:16Z

Datapoint from a production deployment: we hit exactly #33169 — separate gateway + dispatcher processes against one kanban.db, .corrupt.*.bak flood, real data eviction when the rename caught in-flight writers.

We filed #30973 + #31618 thinking it was a durability race and our synchronous=FULL PRAGMA reduced the symptom rate but didn't fix root cause — exactly the false-positive pattern this PR fixes. The rediagnosis (probe connection seeing torn WAL pages → false corruption → backup storm) matches what we observed: corruption "recurring" was actually the detector firing on healthy DBs that had just been written to.

The pooled singleton + per-tick churn removal + atexit checkpoint is the right shape. Happy to run this in production for a week against heartbeat-heavy concurrent-writer load (multi-process dispatcher + reviewer subprocess writing 50+ events in 5 minutes against one DB) once it's ready for outside test signal. If it lands, we'll drop #30973 from our local patches and close it out.

teknium1 · 2026-05-28T01:02:38Z

Closing in favor of a smaller, more focused approach.

Since this PR opened, a wave of kanban hardening landed on main covering overlapping ground:

dc98314 fix(kanban): skip redundant WAL pragma on already-WAL connections — addresses the same WAL-checkpoint-race surface as this PR's pool, but via pragma idempotence
6416dd5 fix(kanban): harden SQLite against torn-write corruption (synchronous=FULL + secure_delete + cell_size_check)
99c19eb fix(kanban): add post-commit page_count invariant check + wal_autocheckpoint=100
7 other kanban commits touching gateway/run.py and hermes_cli/kanban_db.py

The rebase would require manually merging the singleton pool + probe-removal architecture against all of the above. Given Teknium's pragma-idempotence fix already addresses the primary symptom this PR targeted (false-positive corruption from WAL checkpoint races during reopen), and #33159's FD leak is a much narrower fix, the better path is:

This PR closes (no work lost — herrschmidt's commit is preserved in git log via the salvage branch).
A small targeted PR addresses #33159 directly: add explicit try/finally: conn.close() to the with kb.connect() as conn: sites in hermes_cli/kanban_decompose.py and hermes_cli/kanban_specify.py that are reachable from the long-lived dashboard process.
The full pool architecture can be reconsidered later if FD leaks or WAL races still show up in production after the hardening pragmas land.

Credit to @herrschmidt for #32449's diagnosis and the singleton-pool design — it's the right architectural direction and remains a viable future PR if needed.

herrschmidt and others added 3 commits May 27, 2026 05:27

chore(release): map herrschmidt@users.noreply.github.com to herrschmidt

15d09ef

github-advanced-security AI found potential problems May 27, 2026

View reviewed changes

alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have labels May 27, 2026

julio-cloudvisor mentioned this pull request May 27, 2026

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening #30973

Closed

teknium1 closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243
teknium1 wants to merge 3 commits into
mainfrom
hermes/hermes-6bc90445

teknium1 commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

teknium1 commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

teknium1 commented May 27, 2026

Summary

Changes

Validation

Closes

Credits

Uh oh!

github-actions Bot commented May 27, 2026

🔎 Lint report: hermes/hermes-6bc90445 vs origin/main

ruff

ty (type checker)

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

teknium1 commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

🔎 Lint report: `hermes/hermes-6bc90445` vs `origin/main`