Skip to content

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243

Closed
teknium1 wants to merge 3 commits into
mainfrom
hermes/hermes-6bc90445
Closed

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster)#33243
teknium1 wants to merge 3 commits into
mainfrom
hermes/hermes-6bc90445

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Stops the kanban DB false-positive corruption + backup storm when gateway and dashboard run as separate processes (#33169) by routing every long-lived caller through a pooled connection and stripping per-tick/per-request open/close churn.

Root cause: _guard_existing_db_is_healthy() opened a separate probe connection on every connect() call. Under WAL with concurrent writers, that probe transiently saw torn pages and declared healthy files corrupt, then moved them to .corrupt.*.bak. Per-tick open/close in the gateway dispatcher and per-request open/close in the dashboard amplified the race.

Changes

  • Salvaged @herrschmidt's fix(kanban): remove false-positive corruption detection from separate probe connection #32449 onto current main (single commit, authorship preserved): removes the separate probe connection, adds a singleton pooled connection per process+board (mirrors hermes_state.py:SessionDB), adds PRAGMA busy_timeout=30000, atexit PASSIVE checkpoint on shutdown.
  • Followed up to widen the fix:
    • gateway/run.py: switch all 7 _kb.connect() sites (notifier, dispatcher, has_spawnable check, auto-subscribe) to get_connection(). No more per-tick churn.
    • plugins/kanban/dashboard/plugin_api.py: switch _conn() to get_connection() and strip the 24 finally: conn.close() blocks that would otherwise evict the pool on every HTTP request.
    • tools/kanban_tools.py: remove the dead try: ...; finally: pass blocks fix(kanban): remove false-positive corruption detection from separate probe connection #32449 left behind (9 sites). Cosmetic.
    • hermes_cli/kanban_db.py: reorder _open_connection to run apply_wal_with_fallback BEFORE _check_db_health. Running integrity_check first held an implicit read transaction that broke the WAL→DELETE fallback on NFS/SMB. Wrap any WAL-setup DatabaseError in KanbanDbCorruptError for consistent caller error type. Fix init_db() to actually close the connection it opens — the PR's claim that connect() returns pooled was wrong (only get_connection() does).
  • scripts/release.py: add herrschmidt@users.noreply.github.com to AUTHOR_MAP.

Validation

Before After
tests/hermes_cli/test_kanban_db.py n/a 172/172 pass
tests/plugins/test_kanban_dashboard_plugin.py n/a 94/94 pass
tests/tools/test_kanban_tools.py n/a 81/81 pass
tests/gateway/ (full suite) n/a 5944/5944 pass
Stress: --run-stress tests/stress/test_kanban_singleton_pool.py n/a 7/7 pass
E2E: 2 processes × 15s concurrent flood of .corrupt.*.bak 0 backups, 2948 tasks, integrity ok

E2E reproduces the exact #33169 scenario: two separate processes each hold a pooled connection to the same kanban.db and hammer it with concurrent reads + writes. Result: zero .corrupt.*.bak files produced, final PRAGMA integrity_check returns ok, both workers exit clean with zero errors.

Closes

Credits

Backbone work by @herrschmidt (#32449). Cherry-picked with original authorship preserved; follow-up commit by us widens the fix across the gateway and dashboard call sites, fixes the WAL-fallback ordering, and lints up the residual dead code.

herrschmidt and others added 3 commits May 27, 2026 05:27
Remove _guard_existing_db_is_healthy() which opened a separate
probe connection with timeout=5 to run PRAGMA integrity_check.
This probe connection saw transient WAL states during checkpointing
and produced false-positive corruption detection, triggering
backup storms (73+ .corrupt.*.bak files in minutes).

Replace with _check_db_health(conn, path) that runs integrity_check
on the MAIN connection — no second connection, no transient WAL races.

Add get_connection() singleton pool (one connection per process+board)
following hermes_state.py SessionDB pattern. kanban_tools.py handlers
now use get_connection() instead of per-call connect(), eliminating
connection open/close churn that caused WAL-index initialization races.

connect() remains backward-compatible: creates fresh connections for
tests and legacy callers that expect independent connections.

Add stress tests: 1000 rapid create+complete (integrity_check ok),
concurrent connection integrity, singleton identity, thread safety.

Related: #30445, #31618, #31502, #31795
Salvage follow-up for #32449 (singleton connection pool):

1. Switch all 7 `_kb.connect()` call sites in gateway/run.py to
   `get_connection()` so the long-lived gateway dispatcher and notifier
   stop opening/closing fresh connections on every tick. Per-tick churn
   is what triggered the false-positive corruption probe under WAL
   checkpointing in the first place (#33169).

2. Switch the dashboard's `_conn()` helper to `get_connection()` and
   strip the 24 `finally: conn.close()` blocks that would otherwise
   evict the pool on every HTTP request — restoring the per-request
   churn. The dashboard runs as a separate process from the gateway, so
   its own pool is independent; cross-process safety comes from the
   `busy_timeout=30000` PRAGMA added in #32449.

3. Clean up the dead `try: ...; finally: pass` blocks #32449 left in
   tools/kanban_tools.py (9 sites). Pure cosmetic — no behaviour change.

4. Reorder `_open_connection` to call `apply_wal_with_fallback` BEFORE
   `_check_db_health`. Running integrity_check first held an implicit
   read transaction that prevented the WAL→DELETE journal-mode fallback
   from acquiring its exclusive lock on NFS/SMB filesystems
   ("database is locked" against `PRAGMA journal_mode=DELETE`).
   Any DatabaseError from WAL setup itself is now wrapped in
   KanbanDbCorruptError so callers see a consistent error type.

5. Fix the dropped `conn.close()` in `init_db()` — the previous comment
   claimed connect() returned a pooled connection, but it doesn't (only
   get_connection does). The leaked fresh connection held WAL locks
   that broke the kanban_home fixture's downstream tests.

E2E verified: two separate processes hammering the same kanban.db
for 15s each (concurrent reads + writes), 2948 tasks created, zero
`.corrupt.*.bak` files produced, final integrity_check ok.

Closes #33169 and the wider kanban-WAL cluster (#32424, #32543,
#31502, #30908, #31618).
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-6bc90445 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9496 on HEAD, 9495 on base (🆕 +1)

🆕 New issues (1):

Rule Count
unresolved-import 1
First entries
tests/stress/test_kanban_singleton_pool.py:21: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues: none

Unchanged: 5014 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Comment thread hermes_cli/kanban_db.py
def _try_stat(path: Path) -> Optional[os.stat_result]:
"""Return ``path.stat()`` or ``None`` on any OSError (missing file, etc.)."""
try:
return path.stat()
Comment thread hermes_cli/kanban_db.py
path = db_path
else:
path = kanban_db_path(board=board)
resolved = str(path.resolve())
Comment thread hermes_cli/kanban_db.py
else:
path = kanban_db_path(board=board)
resolved = str(path.resolve())
path.parent.mkdir(parents=True, exist_ok=True)
Comment thread hermes_cli/kanban_db.py
path = db_path
else:
path = kanban_db_path(board=board)
resolved = str(path.resolve())
@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have labels May 27, 2026
@julio-cloudvisor

Copy link
Copy Markdown
Contributor

Datapoint from a production deployment: we hit exactly #33169 — separate gateway + dispatcher processes against one kanban.db, .corrupt.*.bak flood, real data eviction when the rename caught in-flight writers.

We filed #30973 + #31618 thinking it was a durability race and our synchronous=FULL PRAGMA reduced the symptom rate but didn't fix root cause — exactly the false-positive pattern this PR fixes. The rediagnosis (probe connection seeing torn WAL pages → false corruption → backup storm) matches what we observed: corruption "recurring" was actually the detector firing on healthy DBs that had just been written to.

The pooled singleton + per-tick churn removal + atexit checkpoint is the right shape. Happy to run this in production for a week against heartbeat-heavy concurrent-writer load (multi-process dispatcher + reviewer subprocess writing 50+ events in 5 minutes against one DB) once it's ready for outside test signal. If it lands, we'll drop #30973 from our local patches and close it out.

@teknium1

Copy link
Copy Markdown
Contributor Author

Closing in favor of a smaller, more focused approach.

Since this PR opened, a wave of kanban hardening landed on main covering overlapping ground:

  • dc98314 fix(kanban): skip redundant WAL pragma on already-WAL connections — addresses the same WAL-checkpoint-race surface as this PR's pool, but via pragma idempotence
  • 6416dd5 fix(kanban): harden SQLite against torn-write corruption (synchronous=FULL + secure_delete + cell_size_check)
  • 99c19eb fix(kanban): add post-commit page_count invariant check + wal_autocheckpoint=100
  • 7 other kanban commits touching gateway/run.py and hermes_cli/kanban_db.py

The rebase would require manually merging the singleton pool + probe-removal architecture against all of the above. Given Teknium's pragma-idempotence fix already addresses the primary symptom this PR targeted (false-positive corruption from WAL checkpoint races during reopen), and #33159's FD leak is a much narrower fix, the better path is:

  1. This PR closes (no work lost — herrschmidt's commit is preserved in git log via the salvage branch).
  2. A small targeted PR addresses #33159 directly: add explicit try/finally: conn.close() to the with kb.connect() as conn: sites in hermes_cli/kanban_decompose.py and hermes_cli/kanban_specify.py that are reachable from the long-lived dashboard process.
  3. The full pool architecture can be reconsidered later if FD leaks or WAL races still show up in production after the hardening pragmas land.

Credit to @herrschmidt for #32449's diagnosis and the singleton-pool design — it's the right architectural direction and remains a viable future PR if needed.

@teknium1 teknium1 closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants