fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort by steveonjava · Pull Request #31310 · NousResearch/hermes-agent

steveonjava · 2026-05-24T04:55:59Z

Summary

Wraps the explicit ROLLBACK in write_txn with a try/except so a SQLite auto-rollback does not mask the original exception with a "cannot rollback - no transaction is active" error.

Problem

hermes_cli/kanban_db.py:write_txn runs BEGIN IMMEDIATE, yields, and on exception runs a bare conn.execute("ROLLBACK") before re-raising. When the body raises an OperationalError that SQLite has already auto-aborted on (disk I/O error, database is locked, database disk image is malformed), there is no active transaction left to roll back. The bare ROLLBACK then raises its own OperationalError("cannot rollback - no transaction is active") and that replaces the original exception in the traceback. Operators see the misleading rollback error instead of the actual disk or lock or corruption signal.

Fix

Catch sqlite3.OperationalError around the ROLLBACK and swallow it. The original exception still propagates via the outer raise.

Test

Added tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails. Uses a mock connection that raises OperationalError("cannot rollback - no transaction is active") on every ROLLBACK call, then verifies that the original RuntimeError("boom") is the exception the caller sees and that ROLLBACK was attempted.

Targeted run: pytest tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails passes.

Files changed

hermes_cli/kanban_db.py: try/except around ROLLBACK in write_txn. +14/-1.
tests/hermes_cli/test_kanban_db.py: new regression test. +58/-0.
scripts/release.py: AUTHOR_MAP entry. +1/-0.

Branch: steveonjava:fix/kanban-write-txn-rollback-guard
Commit: c2f3ca1eddf2c6cf5acb078612f1303681a76660

When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails

alt-glitch · 2026-05-24T05:13:21Z

Duplicate of #31264 (open) which implements the same write_txn ROLLBACK guard against SQLite auto-abort. Both also resubmit closed #29839.

crayfish-ai · 2026-05-24T05:53:29Z

Thanks for picking this up and adding the regression test! I've closed my #31264 since yours covers the same ground. Glad to see this fix getting attention — it's been biting operators who hit transient SQLite errors and see the wrong exception.

steveonjava · 2026-05-24T06:19:53Z

Duplicate of #31264 (open) which implements the same write_txn ROLLBACK guard against SQLite auto-abort. Both also resubmit closed #29839.

#31264 Was closed by @crayfish-ai and this supercedes his fix with test cases to reproduce the issue. Let's get this merged ASAP. (thanks for the upvote, @crayfish-ai!)

crayfish-ai · 2026-05-24T13:42:34Z

Real-world case study: this bug caused a production outage

We hit this exact bug in our deployment. Sharing the timeline as evidence that this fix is critical:

May 22, 20:10 CST — A kanban task was archived while a worker was still running. This triggered a SQLite deadlock. Because write_txn had no ROLLBACK guard, the original I/O error was silently masked, and the exception propagated without proper rollback. The kanban.db was left with page corruption: a 234-page database referencing a non-existent page 235 (invalid page number 235).

May 24, 10:17 — Gateway restarted. The dispatcher resumed, re-detected the corrupted DB on every tick, and entered a loop:

Every 5 seconds:
  1. Connect to kanban.db
  2. PRAGMA integrity_check → "invalid page number 235"
  3. Create 958KB timestamped backup of the corrupt DB
  4. Raise KanbanDbCorruptError
  5. Repeat

Result:

Thousands of duplicate backup files created
~28 GB of disk space consumed (backup files at 958KB × ~30,000 iterations)
Disk filled to 96%, causing cascading I/O errors
Gateway repeatedly interrupted: "kanban notifier tick failed: disk I/O error"
User received multiple "Gateway shutting down" notifications during active conversation

Root cause: The except Exception: conn.execute("ROLLBACK") in write_txn had no guard against SQLite auto-aborting the transaction before the explicit rollback. When SQLite auto-aborted, the explicit ROLLBACK itself threw, masking the original exception and leaving the DB in an inconsistent state.

This PR fixes that exact gap. The recovery required manually deleting the corrupt DB and recreating it from scratch. The write_txn fix prevents future occurrences — this is not a theoretical concern; it caused real data loss and service disruption.

steveonjava · 2026-05-26T22:36:36Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor · 2026-05-27T21:32:52Z

Merged via #33482 (commit e83252d). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

steveonjava force-pushed the fix/kanban-write-txn-rollback-guard branch from c2f3ca1 to ff5fdd0 Compare May 24, 2026 04:56

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard duplicate This issue or pull request already exists labels May 24, 2026

steveonjava marked this pull request as ready for review May 24, 2026 15:59

Merge branch 'main' into fix/kanban-write-txn-rollback-guard

c90c2c0

crayfish-ai mentioned this pull request May 25, 2026

feat(kanban): auto-subscribe, review-required promotion, unblock dispatch, rerun, parent workspace #31883

Closed

ddblue0 mentioned this pull request May 25, 2026

Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure #31736

Closed

alt-glitch mentioned this pull request May 26, 2026

fix(kanban): close leaked decompose connections + don't mask I/O error on rollback #32415

Closed

3 tasks

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310

fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-write-txn-rollback-guard

steveonjava commented May 24, 2026

Uh oh!

alt-glitch commented May 24, 2026

Uh oh!

crayfish-ai commented May 24, 2026

Uh oh!

steveonjava commented May 24, 2026

Uh oh!

crayfish-ai commented May 24, 2026

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

steveonjava commented May 24, 2026

Summary

Problem

Fix

Test

Related

Files changed

Uh oh!

alt-glitch commented May 24, 2026

Uh oh!

crayfish-ai commented May 24, 2026

Uh oh!

steveonjava commented May 24, 2026

Uh oh!

crayfish-ai commented May 24, 2026

Real-world case study: this bug caused a production outage

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants