fix(kanban): remove false-positive corruption detection from separate probe connection by herrschmidt · Pull Request #32449 · NousResearch/hermes-agent

herrschmidt · 2026-05-26T06:41:26Z

Problem

Every kanban tool call opens two SQLite connections: a probe connection for PRAGMA integrity_check (timeout=5) and the main connection for actual work (timeout=30). The probe connection sees transient WAL states from other gateway processes during their checkpoint operations. It then incorrectly declares the database corrupt.

Evidence

73 .corrupt.*.bak files were created across 4 bursts since May 25. Backup sizes within a single burst are identical — same database, copied multiple times. All errors originate in _guard_existing_db_is_healthy() during connection open, never during writes. Four GitHub issues confirm the same pattern: #30445, #31618, #31502, and #31891. During live testing, the database was flagged corrupt 4 times by our own polling queries within minutes.

Root Cause

_guard_existing_db_is_healthy() in kanban_db.py opens sqlite3.connect(path, timeout=5) as an independent probe path. This connection has no relationship to the main connection and observes WAL checkpoint transients from concurrent gateway processes. The chain reaction is: integrity_check reports "malformed" → _backup_corrupt_db() creates a timestamped backup → KanbanDbCorruptError propagates → the worker retries and hits the same transient again. The retry loop produces ~70 backups in 5 minutes on a database that was never actually corrupt.

Before / After

BEFORE (two separate connection paths per call):
─────────────────────────────────────────────────
  kanban_tool_call()
    ├── sqlite3.connect(db, timeout=5)  ← PROBE (independent path)
    │     └── PRAGMA integrity_check
    │           ⚠ sees WAL transient from another gateway process
    │           → "malformed" → backup → KanbanDbCorruptError
    │
    └── sqlite3.connect(db, timeout=30) ← MAIN (never reached if probe fails)

AFTER (single pooled connection per process):
─────────────────────────────────────────────
  kanban_tool_call()
    └── get_connection() → POOLED connection (one per process+board)
          └── _check_db_health(conn, path)
                ✓ same connection, consistent state
                ✓ no separate path, no transient visibility

Fix

Two changes, no pragma tuning. First: _guard_existing_db_is_healthy() becomes _check_db_health(conn, path) — integrity check runs on the main connection, not a separate probe. Second: get_connection() implements a singleton pool (pattern from hermes_state.py SessionDB), reusing one connection per gateway process and board. connect() remains as a legacy wrapper for tests that need fresh non-pooled connections. No synchronous=FULL, no busy_timeout tuning, no checkpoint management — the elegance is simply: wrong connection architecture → correct connection architecture.

Why This PR Over Others

28 open PRs address kanban corruption, none merged. Most set pragma flags — synchronous=FULL (3+ PRs), wal_autocheckpoint (2 PRs), secure_delete, cell_size_check. Those treat symptoms. This PR addresses the architecture: two connection paths where one suffices, per-call connections where a singleton pool suffices. It complements #31891 (transaction merging) and #31795 (probe retry) without conflict. It works alongside #32322 (connection caching in GatewayRunner).

Tests

test_kanban_singleton_pool.py (new, 196 lines): 1000 rapid create+complete operations without corruption, concurrent connection integrity verification, singleton identity assertion, thread safety coverage. Existing suite: 172 passed, 7 skipped.

Remove _guard_existing_db_is_healthy() which opened a separate probe connection with timeout=5 to run PRAGMA integrity_check. This probe connection saw transient WAL states during checkpointing and produced false-positive corruption detection, triggering backup storms (73+ .corrupt.*.bak files in minutes). Replace with _check_db_health(conn, path) that runs integrity_check on the MAIN connection — no second connection, no transient WAL races. Add get_connection() singleton pool (one connection per process+board) following hermes_state.py SessionDB pattern. kanban_tools.py handlers now use get_connection() instead of per-call connect(), eliminating connection open/close churn that caused WAL-index initialization races. connect() remains backward-compatible: creates fresh connections for tests and legacy callers that expect independent connections. Add stress tests: 1000 rapid create+complete (integrity_check ok), concurrent connection integrity, singleton identity, thread safety. Related: NousResearch#30445, NousResearch#31618, NousResearch#31502, NousResearch#31795

alt-glitch · 2026-05-26T06:48:31Z

Competing with #31795 and #32322 — all address false-positive kanban DB corruption from WAL transients, but with different approaches:

fix(kanban-db): retry integrity probe before flagging DB as corrupt #31795: retries integrity probe 3x with backoff before declaring corrupt
fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322: per-thread connection caching in GatewayRunner to reduce WAL churn
This PR: removes the separate probe connection entirely, uses a singleton pooled connection

Also complementary with #31798 (dedupe corrupt backups). Part of kanban corruption cluster rooted at #26479. Related issues: #32424, #30445, #31618, #31502, #31891.

kshitijk4poor · 2026-05-28T06:39:59Z

Closing as superseded. The singleton-connection-pool approach in this PR was tested in production by @steveonjava and reverted after corruption was confirmed within hours — that work shipped as #32226 (closed) and is documented in #33482's PR body under "Approaches I tried locally and reverted":

Shared per-board _kb_conn cache (#32226). Tested locally and reverted after production corruption was confirmed within hours. Concurrent thread access to one SQLite connection violates SQLite's threading model under any meaningful load.

Two related approaches (#32322, #32531) were also tested and reverted. The root cause was fixed differently — see #33482 commit dc98314fb (skip redundant WAL pragma on already-WAL connections, the root-cause unlink-and-recreate race fix) and #33696 (@MoonRay305's cross-process first-connect flock + busy_timeout knob). Together they avoid the threading-model violation entirely. Thanks for thinking through this space — the safer-by-design approach won out.

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/cli CLI entry point, hermes_cli/, setup wizard comp/tools Tool registry, model_tools, toolsets labels May 26, 2026

This was referenced May 26, 2026

fix(gateway): catch KanbanDbCorruptError in kanban dispatcher corrupt-board path #32490

Closed

[Bug] Kanban DB corruption when gateway and dashboard open the same board DB concurrently (WAL mode) #33169

Closed

This was referenced May 27, 2026

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster) #33243

Closed

fix(kanban): retry corrupt-board dispatch after quarantine (salvage #33263) #33412

Merged

kshitijk4poor closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): remove false-positive corruption detection from separate probe connection#32449

fix(kanban): remove false-positive corruption detection from separate probe connection#32449
herrschmidt wants to merge 1 commit into
NousResearch:mainfrom
herrschmidt:fix/kanban-probe-connection-false-positive

herrschmidt commented May 26, 2026

Uh oh!

alt-glitch commented May 26, 2026

Uh oh!

kshitijk4poor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

herrschmidt commented May 26, 2026

Problem

Evidence

Root Cause

Before / After

Fix

Why This PR Over Others

Tests

Related

Uh oh!

alt-glitch commented May 26, 2026

Uh oh!

kshitijk4poor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants