fix(kanban): close leaked decompose connections + don't mask I/O error on rollback by mw777eds · Pull Request #32415 · NousResearch/hermes-agent

mw777eds · 2026-05-26T05:13:32Z

Summary

Fixes a SQLite data-corruption bug on Hermes Kanban boards (observed on novadeck-atlas, ending in "database disk image is malformed"). Two compounding bugs in the gateway auto-decompose path caused it:

Leaked connections: decompose_task opened the DB with `with kb.connect() as conn:`, but sqlite3.Connection's context manager only ends the transaction and never closes the connection. kb.connect() returns a raw sqlite3.connect(...), so each block leaked a live WAL connection (along with its -wal/-shm file descriptors) until GC. Sustained auto-decompose accumulated open handles, and FD/-shm open failures surfaced to SQLite as SQLITE_IOERR ("disk I/O error") even with free disk space, which was the observed trigger. Fixed by wrapping all four kb.connect() sites in contextlib.closing.
write_txn masked the real error: on SQLITE_IOERR SQLite has already aborted the transaction, so the unconditional conn.execute("ROLLBACK") raised "cannot rollback - no transaction is active" (the exact production stack), hiding the original I/O error. Fixed by swallowing the rollback failure and re-raising the original exception.
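
The first bug can be reproduced in isolation. The sketch below is not code from this PR; it uses only the standard library and an in-memory database (an assumption for illustration) to show that a plain `with` on a sqlite3 connection leaves it open, while `contextlib.closing` closes it on exit. The helper name `connection_is_open` is hypothetical.

```python
import contextlib
import sqlite3


def connection_is_open(conn: sqlite3.Connection) -> bool:
    """Return True if the connection still accepts statements."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        # sqlite3 raises ProgrammingError when a closed connection is used.
        return False


# Plain `with`: commits or rolls back on exit, but does NOT close.
with sqlite3.connect(":memory:") as leaked:
    leaked.execute("CREATE TABLE t (x INTEGER)")
still_open_after_with = connection_is_open(leaked)
leaked.close()  # without this, the handle lives until garbage collection

# contextlib.closing: calls conn.close() when the block exits.
with contextlib.closing(sqlite3.connect(":memory:")) as closed:
    closed.execute("CREATE TABLE t (x INTEGER)")
still_open_after_closing = connection_is_open(closed)
```

Note that `contextlib.closing` does not commit on exit, which is why the PR points out that all writes already run inside their own write_txn.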

Ruled out: legacy NovaDeck code (none exists — the name is just a user-created board), WAL/DELETE mixed-mode (ext4 always gets WAL), and write contention (WAL serializes writers; that yields SQLITE_BUSY, not SQLITE_IOERR).
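
To illustrate why write contention was ruled out, here is a minimal sketch (not taken from the Hermes codebase) of two writers competing on a WAL database. The temporary file path, the table name, and `timeout=0` are assumptions chosen so the second writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")

    # isolation_level=None gives explicit transaction control;
    # timeout=0 makes a blocked writer fail at once instead of retrying.
    first = sqlite3.connect(path, isolation_level=None, timeout=0)
    first.execute("PRAGMA journal_mode=WAL")
    first.execute("CREATE TABLE t (x INTEGER)")
    second = sqlite3.connect(path, isolation_level=None, timeout=0)

    first.execute("BEGIN IMMEDIATE")  # first writer takes the write lock
    contention_error = None
    try:
        second.execute("BEGIN IMMEDIATE")  # second writer cannot get it
    except sqlite3.OperationalError as exc:
        contention_error = exc

    first.execute("ROLLBACK")
    first.close()
    second.close()

# On Python 3.11+ the exception also carries the symbolic result code.
error_name = getattr(contention_error, "sqlite_errorname", None)
```

The second writer fails with "database is locked", which is a different failure from the "disk I/O error" seen in production.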

Scope kept tight to the evidenced gateway/decompose path. The dashboard plugin (plugins/kanban/dashboard/plugin_api.py) has a similar non-closing pattern but is not in the production stack — noted as a follow-up, not touched here.

Changes

hermes_cli/kanban_db.py — robust write_txn rollback (try/except + re-raise original).
hermes_cli/kanban_decompose.py — import contextlib; wrap 4 kb.connect() sites in contextlib.closing. Behavior-preserving: all writes already run inside their own write_txn (incl. recompute_ready), so the dropped implicit-commit was a no-op.
Regression tests in tests/hermes_cli/test_kanban_db.py and tests/hermes_cli/test_kanban_decompose.py.
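
The write_txn change could look roughly like the sketch below. This is a hypothetical stand-in, not the actual hermes_cli/kanban_db.py code: the `BEGIN IMMEDIATE`/`COMMIT` structure, the autocommit connection, and the `tasks` table are assumptions. The I/O failure is simulated by rolling back inside the block and then raising.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def write_txn(conn: sqlite3.Connection):
    """Hypothetical write transaction wrapper with a guarded ROLLBACK."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
        conn.execute("COMMIT")
    except BaseException:
        try:
            conn.execute("ROLLBACK")
        except sqlite3.Error:
            # SQLite may already have aborted the transaction, in which
            # case ROLLBACK fails; ignore that so it cannot mask the cause.
            pass
        raise  # re-raise the original exception


# Assumes a connection in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")

caught = None
try:
    with write_txn(conn) as c:
        c.execute("INSERT INTO tasks (id) VALUES (1)")
        c.execute("ROLLBACK")  # simulate SQLite aborting the txn itself
        raise sqlite3.OperationalError("disk I/O error")
except sqlite3.OperationalError as exc:
    caught = exc

row_count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
conn.close()
```

With an unconditional ROLLBACK, the caller would instead see "cannot rollback - no transaction is active" and lose the original message.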

Test plan

pytest tests/hermes_cli/{test_kanban_db,test_kanban_decompose,test_kanban_specify_db,test_kanban_core_functionality}.py — 346 passed
New tests verified as genuine guards (fail against un-fixed code)
Live repair still required: this prevents recurrence but does not repair the already-malformed on-disk novadeck-atlas/kanban.db — operator-driven recovery (.recover into a fresh DB, or re-init) is separate.
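
Before attempting recovery, an operator may want to confirm whether a given board file is actually damaged. The sketch below is an assumption-laden illustration, not part of this PR: the function name `integrity_problems` is hypothetical, and it relies only on SQLite's `PRAGMA integrity_check`, which returns a single row `ok` for a healthy database.

```python
import contextlib
import sqlite3


def integrity_problems(db_path: str) -> list[str]:
    """Return integrity_check messages; an empty list means SQLite said 'ok'."""
    try:
        with contextlib.closing(sqlite3.connect(db_path)) as conn:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can fail before the check even runs.
        return [str(exc)]
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages


healthy_result = integrity_problems(":memory:")
```

A non-empty result indicates the file needs the separate recovery step described above.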

🤖 Generated with Claude Code

…r on rollback decompose_task opened the DB with `with kb.connect() as conn:`, but sqlite3's context manager only ends the transaction and never closes the connection, leaking WAL -wal/-shm file descriptors until GC. Sustained auto-decompose exhausted FDs, surfacing as SQLITE_IOERR and corrupting the board DB. Wrap all four connect() sites in contextlib.closing. write_txn's unconditional ROLLBACK raised "cannot rollback - no transaction is active" after SQLite auto-aborted the txn on SQLITE_IOERR, masking the real error. Swallow the rollback failure and re-raise the original exception. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alt-glitch · 2026-05-26T05:28:49Z

Duplicate — this PR combines two fixes that are already tracked separately:

Connection leak fix → duplicate of fix(kanban): close decomposer SQLite connections to stop fd leak #29525 (also duped by fix(kanban): close decompose sqlite connections #29550, fix(kanban): close decomposer SQLite connections #32135)
ROLLBACK guard fix → duplicate of fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #31310 (also duped by fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #31264, resubmission of closed fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort #29839)

The broader kanban SQLite hardening cluster is tracked at #31952, #30969, #31740.

kshitijk4poor · 2026-05-28T06:39:45Z

Closing as already fixed on main. Both halves of this PR landed via separate paths:

write_txn rollback exception swallow → #33482 commit e83252dc4 (fix(kanban): preserve original exception when write_txn rollback fails) — same mechanical change, same reasoning.
Close decompose connections → already on main via commit ebe04c66c (fix(kanban): close kanban.db FD after every connect() in long-lived processes), which introduced kb.connect_closing() and converted kanban_decompose.py, kanban_specify.py, and kanban.py to use it.
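
For readers unfamiliar with the pattern, a helper in the spirit of kb.connect_closing() might be shaped like the sketch below. This is a guess at the general idea, not the code from commit ebe04c66c: the signature, the `path` parameter, and the use of a bare `sqlite3.connect` are all assumptions.

```python
import contextlib
import sqlite3
from typing import Iterator


@contextlib.contextmanager
def connect_closing(path: str) -> Iterator[sqlite3.Connection]:
    """Hypothetical helper: yield a connection and always close it on exit."""
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        conn.close()  # runs on normal exit and when the block raises


closed_after_error = False
try:
    with connect_closing(":memory:") as conn:
        conn.execute("SELECT 1")
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        # The connection was closed even though the block raised.
        closed_after_error = True
```

Centralizing the close in one helper means call sites cannot forget it, which is the design choice the comment above describes.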

Thanks for catching both.

alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) labels May 26, 2026

Danuselli mentioned this pull request May 26, 2026

feat(kanban): add busy_timeout PRAGMA to prevent WAL corruption under concurrent writers #32532

Closed

alt-glitch mentioned this pull request May 27, 2026

fix: add use_conn() context manager to prevent SQLite connection leak #32967

Open

23 tasks

kshitijk4poor closed this May 28, 2026

mw777eds wants to merge 1 commit into NousResearch:main from mw777eds:dev/kanban-db-corruption
