fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243
fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243teknium1 wants to merge 3 commits into
Conversation
Remove _guard_existing_db_is_healthy() which opened a separate probe connection with timeout=5 to run PRAGMA integrity_check. This probe connection saw transient WAL states during checkpointing and produced false-positive corruption detection, triggering backup storms (73+ .corrupt.*.bak files in minutes). Replace with _check_db_health(conn, path) that runs integrity_check on the MAIN connection — no second connection, no transient WAL races. Add get_connection() singleton pool (one connection per process+board) following hermes_state.py SessionDB pattern. kanban_tools.py handlers now use get_connection() instead of per-call connect(), eliminating connection open/close churn that caused WAL-index initialization races. connect() remains backward-compatible: creates fresh connections for tests and legacy callers that expect independent connections. Add stress tests: 1000 rapid create+complete (integrity_check ok), concurrent connection integrity, singleton identity, thread safety. Related: #30445, #31618, #31502, #31795
Salvage follow-up for #32449 (singleton connection pool): 1. Switch all 7 `_kb.connect()` call sites in gateway/run.py to `get_connection()` so the long-lived gateway dispatcher and notifier stop opening/closing fresh connections on every tick. Per-tick churn is what triggered the false-positive corruption probe under WAL checkpointing in the first place (#33169). 2. Switch the dashboard's `_conn()` helper to `get_connection()` and strip the 24 `finally: conn.close()` blocks that would otherwise evict the pool on every HTTP request — restoring the per-request churn. The dashboard runs as a separate process from the gateway, so its own pool is independent; cross-process safety comes from the `busy_timeout=30000` PRAGMA added in #32449. 3. Clean up the dead `try: ...; finally: pass` blocks #32449 left in tools/kanban_tools.py (9 sites). Pure cosmetic — no behaviour change. 4. Reorder `_open_connection` to call `apply_wal_with_fallback` BEFORE `_check_db_health`. Running integrity_check first held an implicit read transaction that prevented the WAL→DELETE journal-mode fallback from acquiring its exclusive lock on NFS/SMB filesystems ("database is locked" against `PRAGMA journal_mode=DELETE`). Any DatabaseError from WAL setup itself is now wrapped in KanbanDbCorruptError so callers see a consistent error type. 5. Fix the dropped `conn.close()` in `init_db()` — the previous comment claimed connect() returned a pooled connection, but it doesn't (only get_connection does). The leaked fresh connection held WAL locks that broke the kanban_home fixture's downstream tests. E2E verified: two separate processes hammering the same kanban.db for 15s each (concurrent reads + writes), 2948 tasks created, zero `.corrupt.*.bak` files produced, final integrity_check ok. Closes #33169 and the wider kanban-WAL cluster (#32424, #32543, #31502, #30908, #31618).
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-import |
1 |
First entries
tests/stress/test_kanban_singleton_pool.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
✅ Fixed issues: none
Unchanged: 5014 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
| def _try_stat(path: Path) -> Optional[os.stat_result]: | ||
| """Return ``path.stat()`` or ``None`` on any OSError (missing file, etc.).""" | ||
| try: | ||
| return path.stat() |
| path = db_path | ||
| else: | ||
| path = kanban_db_path(board=board) | ||
| resolved = str(path.resolve()) |
| else: | ||
| path = kanban_db_path(board=board) | ||
| resolved = str(path.resolve()) | ||
| path.parent.mkdir(parents=True, exist_ok=True) |
| path = db_path | ||
| else: | ||
| path = kanban_db_path(board=board) | ||
| resolved = str(path.resolve()) |
|
Datapoint from a production deployment: we hit exactly #33169 — separate gateway + dispatcher processes against one kanban.db, We filed #30973 + #31618 thinking it was a durability race and our The pooled singleton + per-tick churn removal + atexit checkpoint is the right shape. Happy to run this in production for a week against heartbeat-heavy concurrent-writer load (multi-process dispatcher + reviewer subprocess writing 50+ events in 5 minutes against one DB) once it's ready for outside test signal. If it lands, we'll drop #30973 from our local patches and close it out. |
|
Closing in favor of a smaller, more focused approach. Since this PR opened, a wave of kanban hardening landed on
The rebase would require manually merging the singleton pool + probe-removal architecture against all of the above. Given Teknium's pragma-idempotence fix already addresses the primary symptom this PR targeted (false-positive corruption from WAL checkpoint races during reopen), and
Credit to @herrschmidt for #32449's diagnosis and the singleton-pool design — it's the right architectural direction and remains a viable future PR if needed. |
Summary
Stops the kanban DB false-positive corruption + backup storm when gateway and dashboard run as separate processes (#33169) by routing every long-lived caller through a pooled connection and stripping per-tick/per-request open/close churn.
Root cause:
_guard_existing_db_is_healthy()opened a separate probe connection on everyconnect()call. Under WAL with concurrent writers, that probe transiently saw torn pages and declared healthy files corrupt, then moved them to.corrupt.*.bak. Per-tick open/close in the gateway dispatcher and per-request open/close in the dashboard amplified the race.Changes
hermes_state.py:SessionDB), addsPRAGMA busy_timeout=30000, atexit PASSIVE checkpoint on shutdown.gateway/run.py: switch all 7_kb.connect()sites (notifier, dispatcher, has_spawnable check, auto-subscribe) toget_connection(). No more per-tick churn.plugins/kanban/dashboard/plugin_api.py: switch_conn()toget_connection()and strip the 24finally: conn.close()blocks that would otherwise evict the pool on every HTTP request.tools/kanban_tools.py: remove the deadtry: ...; finally: passblocks fix(kanban): remove false-positive corruption detection from separate probe connection #32449 left behind (9 sites). Cosmetic.hermes_cli/kanban_db.py: reorder_open_connectionto runapply_wal_with_fallbackBEFORE_check_db_health. Running integrity_check first held an implicit read transaction that broke the WAL→DELETE fallback on NFS/SMB. Wrap any WAL-setup DatabaseError inKanbanDbCorruptErrorfor consistent caller error type. Fixinit_db()to actually close the connection it opens — the PR's claim thatconnect()returns pooled was wrong (onlyget_connection()does).scripts/release.py: addherrschmidt@users.noreply.github.comto AUTHOR_MAP.Validation
tests/hermes_cli/test_kanban_db.pytests/plugins/test_kanban_dashboard_plugin.pytests/tools/test_kanban_tools.pytests/gateway/(full suite)--run-stress tests/stress/test_kanban_singleton_pool.py.corrupt.*.bakE2E reproduces the exact #33169 scenario: two separate processes each hold a pooled connection to the same
kanban.dband hammer it with concurrent reads + writes. Result: zero.corrupt.*.bakfiles produced, finalPRAGMA integrity_checkreturnsok, both workers exit clean with zero errors.Closes
Credits
Backbone work by @herrschmidt (#32449). Cherry-picked with original authorship preserved; follow-up commit by us widens the fix across the gateway and dashboard call sites, fixes the WAL-fallback ordering, and lints up the residual dead code.