fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310
fix(kanban): guard write_txn ROLLBACK against SQLite auto-abort#31310steveonjava wants to merge 2 commits into
Conversation
When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails
c2f3ca1 to
ff5fdd0
Compare
|
Thanks for picking this up and adding the regression test! I've closed my #31264 since yours covers the same ground. Glad to see this fix getting attention — it's been biting operators who hit transient SQLite errors and see the wrong exception. |
#31264 Was closed by @crayfish-ai and this supercedes his fix with test cases to reproduce the issue. Let's get this merged ASAP. (thanks for the upvote, @crayfish-ai!) |
Real-world case study: this bug caused a production outageWe hit this exact bug in our deployment. Sharing the timeline as evidence that this fix is critical: May 22, 20:10 CST — A kanban task was archived while a worker was still running. This triggered a SQLite deadlock. Because May 24, 10:17 — Gateway restarted. The dispatcher resumed, re-detected the corrupted DB on every tick, and entered a loop: Result:
Root cause: The This PR fixes that exact gap. The recovery required manually deleting the corrupt DB and recreating it from scratch. The |
|
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing. |
|
Merged via #33482 (commit e83252d). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks! |
Summary
Wraps the explicit ROLLBACK in
write_txnwith a try/except so a SQLite auto-rollback does not mask the original exception with a "cannot rollback - no transaction is active" error.Problem
hermes_cli/kanban_db.py:write_txnrunsBEGIN IMMEDIATE, yields, and on exception runs a bareconn.execute("ROLLBACK")before re-raising. When the body raises anOperationalErrorthat SQLite has already auto-aborted on (disk I/O error, database is locked, database disk image is malformed), there is no active transaction left to roll back. The bare ROLLBACK then raises its ownOperationalError("cannot rollback - no transaction is active")and that replaces the original exception in the traceback. Operators see the misleading rollback error instead of the actual disk or lock or corruption signal.Fix
Catch
sqlite3.OperationalErroraround the ROLLBACK and swallow it. The original exception still propagates via the outerraise.Test
Added
tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_fails. Uses a mock connection that raisesOperationalError("cannot rollback - no transaction is active")on every ROLLBACK call, then verifies that the originalRuntimeError("boom")is the exception the caller sees and that ROLLBACK was attempted.Targeted run:
pytest tests/hermes_cli/test_kanban_db.py::test_write_txn_preserves_original_exception_when_rollback_failspasses.Related
Files changed
hermes_cli/kanban_db.py: try/except around ROLLBACK inwrite_txn. +14/-1.tests/hermes_cli/test_kanban_db.py: new regression test. +58/-0.scripts/release.py: AUTHOR_MAP entry. +1/-0.Branch:
steveonjava:fix/kanban-write-txn-rollback-guardCommit:
c2f3ca1eddf2c6cf5acb078612f1303681a76660