fix(kanban): close leaked decompose connections + don't mask I/O error on rollback #32415

Closed
mw777eds wants to merge 1 commit into NousResearch:main from mw777eds:dev/kanban-db-corruption

@mw777eds

Summary

Fixes a SQLite data-corruption bug on Hermes Kanban boards (observed on novadeck-atlas, ending in "database disk image is malformed"). Two compounding bugs in the gateway auto-decompose path caused it:

  • Leaked connections: decompose_task opened the DB with `with kb.connect() as conn:`, but the sqlite3.Connection context manager only commits or rolls back the transaction; it never closes the connection. kb.connect() returns a raw sqlite3.connect(...), so each block leaked a live WAL connection (with its -wal/-shm FDs) until GC. Sustained auto-decompose accumulated open handles, and failures to open FDs or the -shm file surfaced from SQLite as SQLITE_IOERR ("disk I/O error") even with free disk space, which was the observed trigger. Fixed by wrapping all four kb.connect() sites in contextlib.closing.
  • write_txn masked the real error: on SQLITE_IOERR, SQLite has already aborted the transaction, so the unconditional conn.execute("ROLLBACK") raised "cannot rollback - no transaction is active" (the exact production stack), hiding the original I/O error. Fixed by swallowing the rollback failure and re-raising the original exception.
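
The first bug can be reproduced in isolation. The following is a minimal sketch, not the project's code: it uses an in-memory database in place of kb.connect(), and the helper names (connection_is_open, leaky_pattern, closing_pattern) are hypothetical, introduced only for illustration.

```python
import contextlib
import sqlite3


def connection_is_open(conn: sqlite3.Connection) -> bool:
    """Return True if the connection still accepts statements."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        # sqlite3 raises ProgrammingError when a closed connection is used.
        return False


def leaky_pattern() -> sqlite3.Connection:
    # The connection's own context manager commits (or rolls back) on exit,
    # but it does not close the connection.
    with sqlite3.connect(":memory:") as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
    return conn  # still open here


def closing_pattern() -> sqlite3.Connection:
    # contextlib.closing calls conn.close() on exit, releasing the handle.
    with contextlib.closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
    return conn  # already closed here
```

Calling connection_is_open(leaky_pattern()) should report an open connection, while connection_is_open(closing_pattern()) should report a closed one. Note that contextlib.closing does not commit on exit, so it is only a drop-in replacement when every write is committed explicitly elsewhere.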

Ruled out: legacy NovaDeck code (none exists — the name is just a user-created board), WAL/DELETE mixed-mode (ext4 always gets WAL), and write contention (WAL serializes writers; that yields SQLITE_BUSY, not SQLITE_IOERR).

Scope kept tight to the evidenced gateway/decompose path. The dashboard plugin (plugins/kanban/dashboard/plugin_api.py) has a similar non-closing pattern but is not in the production stack — noted as a follow-up, not touched here.

Changes

  • hermes_cli/kanban_db.py: robust write_txn rollback (try/except + re-raise original).
  • hermes_cli/kanban_decompose.py: import contextlib; wrap the four kb.connect() sites in contextlib.closing. Behavior-preserving: all writes already run inside their own write_txn (incl. recompute_ready), so dropping the implicit commit on block exit changes nothing.
  • Regression tests in tests/hermes_cli/test_kanban_db.py and tests/hermes_cli/test_kanban_decompose.py.
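
For readers unfamiliar with the rollback guard, here is a rough sketch of the pattern described for write_txn. It is an assumption-laden stand-in, not the actual helper in hermes_cli/kanban_db.py: the BEGIN IMMEDIATE statement, the generator-based context manager, and the autocommit connection are all illustrative choices.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def write_txn(conn: sqlite3.Connection):
    """Hypothetical stand-in for the project's write_txn helper.

    Assumes conn was opened with isolation_level=None so that the explicit
    BEGIN/COMMIT/ROLLBACK statements below are the only transaction control.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
        conn.execute("COMMIT")
    except BaseException:
        try:
            conn.execute("ROLLBACK")
        except sqlite3.OperationalError:
            # SQLite may already have ended the transaction (for example
            # after an I/O error). Ignore the rollback failure so it cannot
            # replace the exception that actually caused the problem.
            pass
        raise  # re-raise the original exception, not the rollback error
```

A regression test along these lines can simulate the auto-abort by issuing ROLLBACK inside the block and then raising; the caller should see the raised exception rather than "cannot rollback - no transaction is active".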

Test plan

  • pytest tests/hermes_cli/{test_kanban_db,test_kanban_decompose,test_kanban_specify_db,test_kanban_core_functionality}.py — 346 passed
  • New tests verified as genuine guards (fail against un-fixed code)
  • Live repair still required: this prevents recurrence but does not repair the already-malformed on-disk novadeck-atlas/kanban.db — operator-driven recovery (.recover into a fresh DB, or re-init) is separate.
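
Before attempting recovery, an operator may want to confirm the state of a board file. The sketch below is one possible check and is not part of this PR; integrity_report is a hypothetical helper, and it relies only on SQLite's PRAGMA integrity_check, which returns a single "ok" row for a healthy database.

```python
import contextlib
import sqlite3


def integrity_report(db_path: str) -> list[str]:
    """Return ["ok"] for a healthy database, otherwise the reported problems."""
    # Note: sqlite3.connect creates the file if it does not exist yet.
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        except sqlite3.DatabaseError as exc:
            # Badly damaged files can fail before the check produces rows.
            return [f"error: {exc}"]
    return [row[0] for row in rows]
```

Any result other than ["ok"] suggests the file needs the operator-driven recovery mentioned above.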

🤖 Generated with Claude Code

…r on rollback

decompose_task opened the DB with `with kb.connect() as conn:`, but
sqlite3's context manager only ends the transaction and never closes the
connection, leaking WAL -wal/-shm file descriptors until GC. Sustained
auto-decompose exhausted FDs, surfacing as SQLITE_IOERR and corrupting the
board DB. Wrap all four connect() sites in contextlib.closing.

write_txn's unconditional ROLLBACK raised "cannot rollback - no transaction
is active" after SQLite auto-aborted the txn on SQLITE_IOERR, masking the
real error. Swallow the rollback failure and re-raise the original exception.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) labels on May 26, 2026
@alt-glitch

Collaborator

Duplicate — this PR combines two fixes that are already tracked separately:

  1. Connection leak fix → duplicate of fix(kanban): close decomposer SQLite connections to stop fd leak #29525 (also duped by fix(kanban): close decompose sqlite connections #29550, fix(kanban): close decomposer SQLite connections #32135)
  2. ROLLBACK guard fix → duplicate of fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #31310 (also duped by fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #31264, resubmission of closed fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #29839)

The broader kanban SQLite hardening cluster is tracked at #31952, #30969, #31740.

@kshitijk4poor

Collaborator

Closing as already fixed on main. Both halves of this PR landed via separate paths:

  1. write_txn rollback exception swallow → #33482, commit e83252dc4 (fix(kanban): preserve original exception when write_txn rollback fails); same mechanical change, same reasoning.

  2. Close decompose connections → already on main via commit ebe04c66c (fix(kanban): close kanban.db FD after every connect() in long-lived processes), which introduced kb.connect_closing() and converted kanban_decompose.py, kanban_specify.py, and kanban.py to use it.

Thanks for catching both.
