fix(kanban-db): retry integrity probe before flagging DB as corrupt #31795

Open
emigal wants to merge 1 commit into NousResearch:main from emigal:fix/kanban-db-guard-retry-transient-malformed

emigal commented May 25, 2026

The kanban gateway dispatcher opens a fresh SQLite connection on every tick, and each new connection runs PRAGMA integrity_check via _guard_existing_db_is_healthy. Under WAL with concurrent worker writes, the probe can transiently observe a torn page and either return a non-'ok' integrity row or raise sqlite3.DatabaseError('database disk image is malformed') -- even though the file is fine and the very next probe succeeds.

The previous guard treated the first such hit as terminal corruption: it copied the file to a timestamped .corrupt.*.bak and raised KanbanDbCorruptError, which the gateway dispatcher then used to disable dispatch on that board until the file mtime changed or the gateway restarted. In practice this caused the dispatcher to silently stop processing tasks on a perfectly healthy DB, and left dozens of spurious .corrupt backup files on disk.

Retry the integrity probe up to 3 times, waiting 250ms and then 500ms between attempts, before declaring corruption. A genuinely corrupt file still gets flagged after 3 consecutive failures; a transient WAL blip from a concurrent worker write now self-heals.
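
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `probe_is_healthy` and `PROBE_BACKOFFS_S` are hypothetical names, and the sketch treats every `sqlite3.DatabaseError` (including lock errors) as a failed attempt, whereas the real guard handles locked-but-healthy DBs separately.

```python
import sqlite3
import time

# Hypothetical constant: the PR states 3 attempts with 250ms and 500ms backoff.
PROBE_BACKOFFS_S = (0.25, 0.5)


def probe_is_healthy(db_path, sleep=time.sleep):
    """Return True as soon as one integrity probe reports 'ok'.

    Makes len(PROBE_BACKOFFS_S) + 1 attempts in total and returns False
    only if every attempt fails.
    """
    attempts = len(PROBE_BACKOFFS_S) + 1
    for attempt in range(attempts):
        try:
            conn = sqlite3.connect(db_path)
            try:
                rows = conn.execute("PRAGMA integrity_check").fetchall()
            finally:
                conn.close()
            if rows == [("ok",)]:
                return True
        except sqlite3.DatabaseError:
            # Simplification: any DatabaseError counts as a failed attempt.
            pass
        # Back off before the next attempt, but not after the last one.
        if attempt < len(PROBE_BACKOFFS_S):
            sleep(PROBE_BACKOFFS_S[attempt])
    return False
```

Passing `sleep` as a parameter is a choice made for this sketch so that a test can record the backoff delays without actually waiting.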

Adds a regression test that injects a single transient DatabaseError on the first probe attempt and asserts:

  • connect() succeeds (retry sees a healthy DB on attempt 2)
  • no .corrupt backup is produced
  • the retry path was actually exercised
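
The shape of such a regression test might look like the sketch below. It is an assumption-laden illustration rather than the test added in this PR: `guard` is a hypothetical stand-in for the retrying guard, the probe is a `unittest.mock.Mock` that fails once and then succeeds, and the check that no .corrupt backup is written is omitted because it depends on the real file-handling code.

```python
import sqlite3
from unittest import mock


def guard(probe, backoffs=(0.25, 0.5), sleep=lambda seconds: None):
    """Hypothetical stand-in for the retrying guard.

    Calls probe() up to len(backoffs) + 1 times and returns True as soon
    as it returns 'ok'; returns False if every attempt fails.
    """
    for attempt in range(len(backoffs) + 1):
        try:
            if probe() == "ok":
                return True
        except sqlite3.DatabaseError:
            pass
        if attempt < len(backoffs):
            sleep(backoffs[attempt])
    return False


def test_transient_malformed_is_retried():
    # First call raises the transient error, second call reports a healthy DB.
    probe = mock.Mock(side_effect=[
        sqlite3.DatabaseError("database disk image is malformed"),
        "ok",
    ])
    sleeps = []
    assert guard(probe, sleep=sleeps.append) is True
    # The retry path was exercised: two probes, one backoff of 250ms.
    assert probe.call_count == 2
    assert sleeps == [0.25]
```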

Existing tests for genuine corruption and locked-but-healthy DBs continue to pass unchanged.


Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/plugins: Plugin system and bundled plugins
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working
