fix(kanban): remove false-positive corruption detection from separate probe connection#32449
Conversation
Remove _guard_existing_db_is_healthy() which opened a separate probe connection with timeout=5 to run PRAGMA integrity_check. This probe connection saw transient WAL states during checkpointing and produced false-positive corruption detection, triggering backup storms (73+ .corrupt.*.bak files in minutes). Replace with _check_db_health(conn, path) that runs integrity_check on the MAIN connection — no second connection, no transient WAL races. Add get_connection() singleton pool (one connection per process+board) following hermes_state.py SessionDB pattern. kanban_tools.py handlers now use get_connection() instead of per-call connect(), eliminating connection open/close churn that caused WAL-index initialization races. connect() remains backward-compatible: creates fresh connections for tests and legacy callers that expect independent connections. Add stress tests: 1000 rapid create+complete (integrity_check ok), concurrent connection integrity, singleton identity, thread safety. Related: NousResearch#30445, NousResearch#31618, NousResearch#31502, NousResearch#31795
|
Competing with #31795 and #32322 — all address false-positive kanban DB corruption from WAL transients, but with different approaches:
Also complementary with #31798 (dedupe corrupt backups). Part of kanban corruption cluster rooted at #26479. Related issues: #32424, #30445, #31618, #31502, #31891. |
|
Closing as superseded. The singleton-connection-pool approach in this PR was tested in production by @steveonjava and reverted after corruption was confirmed within hours — that work shipped as #32226 (closed) and is documented in #33482's PR body under "Approaches I tried locally and reverted":
Two related approaches (#32322, #32531) were also tested and reverted. The root cause was fixed differently — see #33482 commit dc98314fb ( |
Problem
Every kanban tool call opens two SQLite connections: a probe connection for
PRAGMA integrity_check(timeout=5) and the main connection for actual work (timeout=30). The probe connection sees transient WAL states from other gateway processes during their checkpoint operations. It then incorrectly declares the database corrupt.Evidence
73
.corrupt.*.bakfiles were created across 4 bursts since May 25. Backup sizes within a single burst are identical — same database, copied multiple times. All errors originate in_guard_existing_db_is_healthy()during connection open, never during writes. Four GitHub issues confirm the same pattern: #30445, #31618, #31502, and #31891. During live testing, the database was flagged corrupt 4 times by our own polling queries within minutes.Root Cause
_guard_existing_db_is_healthy()inkanban_db.pyopenssqlite3.connect(path, timeout=5)as an independent probe path. This connection has no relationship to the main connection and observes WAL checkpoint transients from concurrent gateway processes. The chain reaction is:integrity_checkreports "malformed" →_backup_corrupt_db()creates a timestamped backup →KanbanDbCorruptErrorpropagates → the worker retries and hits the same transient again. The retry loop produces ~70 backups in 5 minutes on a database that was never actually corrupt.Before / After
Fix
Two changes, no pragma tuning. First:
_guard_existing_db_is_healthy()becomes_check_db_health(conn, path)— integrity check runs on the main connection, not a separate probe. Second:get_connection()implements a singleton pool (pattern fromhermes_state.pySessionDB), reusing one connection per gateway process and board.connect()remains as a legacy wrapper for tests that need fresh non-pooled connections. Nosynchronous=FULL, nobusy_timeouttuning, no checkpoint management — the elegance is simply: wrong connection architecture → correct connection architecture.Why This PR Over Others
28 open PRs address kanban corruption, none merged. Most set pragma flags —
synchronous=FULL(3+ PRs),wal_autocheckpoint(2 PRs),secure_delete,cell_size_check. Those treat symptoms. This PR addresses the architecture: two connection paths where one suffices, per-call connections where a singleton pool suffices. It complements #31891 (transaction merging) and #31795 (probe retry) without conflict. It works alongside #32322 (connection caching in GatewayRunner).Tests
test_kanban_singleton_pool.py(new, 196 lines): 1000 rapid create+complete operations without corruption, concurrent connection integrity verification, singleton identity assertion, thread safety coverage. Existing suite: 172 passed, 7 skipped.Related
Closes: None (addresses the corruption cluster). Related: #30445, #31618, #31502, #31891, #31795, #32322.