Skip to content

fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-write-txn-rollback-guard
Closed

fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-write-txn-rollback-guard

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

Summary

Wraps the explicit ROLLBACK in write_txn with a try/except so a SQLite auto-rollback does not mask the original exception with a "cannot rollback - no transaction is active" error.

Problem

hermes_cli/kanban_db.py:write_txn runs BEGIN IMMEDIATE, yields, and on exception runs a bare conn.execute("ROLLBACK") before re-raising. When the body raises an OperationalError that SQLite has already auto-aborted on (disk I/O error, database is locked, database disk image is malformed), there is no active transaction left to roll back. The bare ROLLBACK then raises its own OperationalError("cannot rollback - no transaction is active") and that replaces the original exception in the traceback. Operators see the misleading rollback error instead of the actual disk or lock or corruption signal.

Fix

Catch sqlite3.OperationalError around the ROLLBACK and swallow it. The original exception still propagates via the outer raise.

Test

Added tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails. Uses a mock connection that raises OperationalError("cannot rollback - no transaction is active") on every ROLLBACK call, then verifies that the original RuntimeError("boom") is the exception the caller sees and that ROLLBACK was attempted.

Targeted run: pytest tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails passes.

Related

Files changed

  • hermes_cli/kanban_db.py: try/except around ROLLBACK in write_txn. +14/-1.
  • tests/hermes_cli/test_kanban_db.py: new regression test. +58/-0.
  • scripts/release.py: AUTHOR_MAP entry. +1/-0.

Branch: steveonjava:fix/kanban-write-txn-rollback-guard
Commit: c2f3ca1eddf2c6cf5acb078612f1303681a76660

When code inside a write_txn block raises an OperationalError that SQLite
has already auto-rolled-back (typical for disk I/O error,
database is locked, and database disk image is malformed), the
explicit ROLLBACK in write_txn.__exit__ itself raises
cannot rollback - no transaction is active and the secondary exception
replaces the original in the traceback. Operators see a misleading error
and lose the diagnostic information they need.

Swallow the rollback-time OperationalError so the caller always sees the
original cause.

Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::
test_write_txn_preserves_original_exception_when_rollback_fails
@steveonjava steveonjava force-pushed the fix/kanban-write-txn-rollback-guard branch from c2f3ca1 to ff5fdd0 Compare May 24, 2026 04:56
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard duplicate This issue or pull request already exists labels May 24, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #31264 (open) which implements the same write_txn ROLLBACK guard against SQLite auto-abort. Both also resubmit closed #29839.

@crayfish-ai

Copy link
Copy Markdown
Contributor

Thanks for picking this up and adding the regression test! I've closed my #31264 since yours covers the same ground. Glad to see this fix getting attention — it's been biting operators who hit transient SQLite errors and see the wrong exception.

@steveonjava

Copy link
Copy Markdown
Contributor Author

Duplicate of #31264 (open) which implements the same write_txn ROLLBACK guard against SQLite auto-abort. Both also resubmit closed #29839.

#31264 Was closed by @crayfish-ai and this supercedes his fix with test cases to reproduce the issue. Let's get this merged ASAP. (thanks for the upvote, @crayfish-ai!)

@crayfish-ai

Copy link
Copy Markdown
Contributor

Real-world case study: this bug caused a production outage

We hit this exact bug in our deployment. Sharing the timeline as evidence that this fix is critical:

May 22, 20:10 CST — A kanban task was archived while a worker was still running. This triggered a SQLite deadlock. Because write_txn had no ROLLBACK guard, the original I/O error was silently masked, and the exception propagated without proper rollback. The kanban.db was left with page corruption: a 234-page database referencing a non-existent page 235 (invalid page number 235).

May 24, 10:17 — Gateway restarted. The dispatcher resumed, re-detected the corrupted DB on every tick, and entered a loop:

Every 5 seconds:
  1. Connect to kanban.db
  2. PRAGMA integrity_check → "invalid page number 235"
  3. Create 958KB timestamped backup of the corrupt DB
  4. Raise KanbanDbCorruptError
  5. Repeat

Result:

  • Thousands of duplicate backup files created
  • ~28 GB of disk space consumed (backup files at 958KB × ~30,000 iterations)
  • Disk filled to 96%, causing cascading I/O errors
  • Gateway repeatedly interrupted: "kanban notifier tick failed: disk I/O error"
  • User received multiple "Gateway shutting down" notifications during active conversation

Root cause: The except Exception: conn.execute("ROLLBACK") in write_txn had no guard against SQLite auto-aborting the transaction before the explicit rollback. When SQLite auto-aborted, the explicit ROLLBACK itself threw, masking the original exception and leaving the DB in an inconsistent state.

This PR fixes that exact gap. The recovery required manually deleting the corrupt DB and recreating it from scratch. The write_txn fix prevents future occurrences — this is not a theoretical concern; it caused real data loss and service disruption.

@steveonjava steveonjava marked this pull request as ready for review May 24, 2026 15:59
@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit e83252d). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard duplicate This issue or pull request already exists P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants