Skip to content

fix(kanban): remove false-positive corruption detection from separate probe connection#32449

Closed
herrschmidt wants to merge 1 commit into
NousResearch:mainfrom
herrschmidt:fix/kanban-probe-connection-false-positive
Closed

fix(kanban): remove false-positive corruption detection from separate probe connection#32449
herrschmidt wants to merge 1 commit into
NousResearch:mainfrom
herrschmidt:fix/kanban-probe-connection-false-positive

Conversation

@herrschmidt

Copy link
Copy Markdown

Problem

Every kanban tool call opens two SQLite connections: a probe connection for PRAGMA integrity_check (timeout=5) and the main connection for actual work (timeout=30). The probe connection sees transient WAL states from other gateway processes during their checkpoint operations. It then incorrectly declares the database corrupt.

Evidence

73 .corrupt.*.bak files were created across 4 bursts since May 25. Backup sizes within a single burst are identical — same database, copied multiple times. All errors originate in _guard_existing_db_is_healthy() during connection open, never during writes. Four GitHub issues confirm the same pattern: #30445, #31618, #31502, and #31891. During live testing, the database was flagged corrupt 4 times by our own polling queries within minutes.

Root Cause

_guard_existing_db_is_healthy() in kanban_db.py opens sqlite3.connect(path, timeout=5) as an independent probe path. This connection has no relationship to the main connection and observes WAL checkpoint transients from concurrent gateway processes. The chain reaction is: integrity_check reports "malformed" → _backup_corrupt_db() creates a timestamped backup → KanbanDbCorruptError propagates → the worker retries and hits the same transient again. The retry loop produces ~70 backups in 5 minutes on a database that was never actually corrupt.

Before / After

BEFORE (two separate connection paths per call):
─────────────────────────────────────────────────
  kanban_tool_call()
    ├── sqlite3.connect(db, timeout=5)  ← PROBE (independent path)
    │     └── PRAGMA integrity_check
    │           ⚠ sees WAL transient from another gateway process
    │           → "malformed" → backup → KanbanDbCorruptError
    │
    └── sqlite3.connect(db, timeout=30) ← MAIN (never reached if probe fails)

AFTER (single pooled connection per process):
─────────────────────────────────────────────
  kanban_tool_call()
    └── get_connection() → POOLED connection (one per process+board)
          └── _check_db_health(conn, path)
                ✓ same connection, consistent state
                ✓ no separate path, no transient visibility

Fix

Two changes, no pragma tuning. First: _guard_existing_db_is_healthy() becomes _check_db_health(conn, path) — integrity check runs on the main connection, not a separate probe. Second: get_connection() implements a singleton pool (pattern from hermes_state.py SessionDB), reusing one connection per gateway process and board. connect() remains as a legacy wrapper for tests that need fresh non-pooled connections. No synchronous=FULL, no busy_timeout tuning, no checkpoint management — the elegance is simply: wrong connection architecture → correct connection architecture.

Why This PR Over Others

28 open PRs address kanban corruption, none merged. Most set pragma flags — synchronous=FULL (3+ PRs), wal_autocheckpoint (2 PRs), secure_delete, cell_size_check. Those treat symptoms. This PR addresses the architecture: two connection paths where one suffices, per-call connections where a singleton pool suffices. It complements #31891 (transaction merging) and #31795 (probe retry) without conflict. It works alongside #32322 (connection caching in GatewayRunner).

Tests

test_kanban_singleton_pool.py (new, 196 lines): 1000 rapid create+complete operations without corruption, concurrent connection integrity verification, singleton identity assertion, thread safety coverage. Existing suite: 172 passed, 7 skipped.

Related

Closes: None (addresses the corruption cluster). Related: #30445, #31618, #31502, #31891, #31795, #32322.

Remove _guard_existing_db_is_healthy() which opened a separate
probe connection with timeout=5 to run PRAGMA integrity_check.
This probe connection saw transient WAL states during checkpointing
and produced false-positive corruption detection, triggering
backup storms (73+ .corrupt.*.bak files in minutes).

Replace with _check_db_health(conn, path) that runs integrity_check
on the MAIN connection — no second connection, no transient WAL races.

Add get_connection() singleton pool (one connection per process+board)
following hermes_state.py SessionDB pattern. kanban_tools.py handlers
now use get_connection() instead of per-call connect(), eliminating
connection open/close churn that caused WAL-index initialization races.

connect() remains backward-compatible: creates fresh connections for
tests and legacy callers that expect independent connections.

Add stress tests: 1000 rapid create+complete (integrity_check ok),
concurrent connection integrity, singleton identity, thread safety.

Related: NousResearch#30445, NousResearch#31618, NousResearch#31502, NousResearch#31795
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/cli CLI entry point, hermes_cli/, setup wizard comp/tools Tool registry, model_tools, toolsets labels May 26, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing with #31795 and #32322 — all address false-positive kanban DB corruption from WAL transients, but with different approaches:

Also complementary with #31798 (dedupe corrupt backups). Part of kanban corruption cluster rooted at #26479. Related issues: #32424, #30445, #31618, #31502, #31891.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Closing as superseded. The singleton-connection-pool approach in this PR was tested in production by @steveonjava and reverted after corruption was confirmed within hours — that work shipped as #32226 (closed) and is documented in #33482's PR body under "Approaches I tried locally and reverted":

Shared per-board _kb_conn cache (#32226). Tested locally and reverted after production corruption was confirmed within hours. Concurrent thread access to one SQLite connection violates SQLite's threading model under any meaningful load.

Two related approaches (#32322, #32531) were also tested and reverted. The root cause was fixed differently — see #33482 commit dc98314fb (skip redundant WAL pragma on already-WAL connections, the root-cause unlink-and-recreate race fix) and #33696 (@MoonRay305's cross-process first-connect flock + busy_timeout knob). Together they avoid the threading-model violation entirely. Thanks for thinking through this space — the safer-by-design approach won out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/plugins Plugin system and bundled plugins comp/tools Tool registry, model_tools, toolsets P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants