fix(kanban-db): retry integrity probe before flagging DB as corrupt #31795
Open
emigal wants to merge 1 commit into
The kanban gateway dispatcher opens a fresh SQLite connection per tick,
each of which runs PRAGMA integrity_check via _guard_existing_db_is_healthy.
Under WAL with concurrent worker writes, the probe can transiently observe
a torn page and either return a non-'ok' integrity row or raise
sqlite3.DatabaseError('database disk image is malformed') -- even though
the file is fine and the very next probe succeeds.
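For illustration, a single-shot probe of the kind described above might look like the sketch below. The function name `probe_integrity` is hypothetical and is not the body of `_guard_existing_db_is_healthy`; it only shows the two failure shapes mentioned here, a non-'ok' result row and a raised `sqlite3.DatabaseError`.

```python
import sqlite3


def probe_integrity(db_path: str) -> bool:
    """Hypothetical single-shot probe: True only for a lone 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        # A healthy database yields exactly one row containing 'ok'.
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Failure shape 2: the probe itself raises
        # (e.g. 'database disk image is malformed').
        return False
    finally:
        conn.close()
    # Failure shape 1: the pragma returns rows other than a single 'ok'.
    return rows == [("ok",)]
```

A probe like this reports only what it saw on one attempt, so on its own it cannot tell a transient failure from genuine corruption.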
The previous guard treated the first such hit as terminal corruption:
it copied the file to a timestamped .corrupt.*.bak and raised
KanbanDbCorruptError, which the gateway dispatcher then used to disable
dispatch on that board until the file mtime changed or the gateway
restarted. In practice this caused the dispatcher to silently stop
processing tasks on a perfectly healthy DB, and left dozens of spurious
.corrupt backup files on disk.
Retry the integrity probe up to 3 times with a short backoff (250ms,
500ms) before declaring corruption. A genuinely corrupt file still gets
flagged after 3 consistent failures; a transient WAL blip from a
concurrent worker write now self-heals.
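The retry policy can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `probe_with_retry`, its `probe` callable, and the injectable `sleep` parameter are hypothetical names; only the attempt count (3) and the backoff values (250ms, 500ms) come from the description.

```python
import sqlite3
import time
from typing import Callable, Sequence


def probe_with_retry(
    probe: Callable[[], bool],
    backoffs: Sequence[float] = (0.25, 0.5),  # seconds between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one probe succeeds, False after all fail.

    With two backoff values the probe runs at most three times.
    """
    attempts = len(backoffs) + 1
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except sqlite3.DatabaseError:
            # Treat a raised error like a non-'ok' result and try again.
            pass
        if attempt < len(backoffs):
            sleep(backoffs[attempt])
    return False
```

Only a `False` result here would lead to the backup-and-raise path. The worst-case added latency for a genuinely corrupt file is the sum of the backoffs, 750ms, plus the cost of three probes.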
Adds a regression test that injects a single transient DatabaseError on
the first probe attempt and asserts:
- connect() succeeds (retry sees a healthy DB on attempt 2)
- no .corrupt backup is produced
- the retry path was actually exercised
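The shape of such a test can be sketched as below. It is self-contained and therefore uses a stand-in guard (`guard_db`) and a stand-in probe rather than the project's `connect()` and `_guard_existing_db_is_healthy`; those stand-ins, the `RuntimeError`, and the backup file naming are assumptions made for illustration.

```python
import glob
import os
import shutil
import sqlite3
import tempfile
import time


def guard_db(db_path, probe, backoffs=(0.25, 0.5), sleep=time.sleep):
    """Hypothetical guard: back up and raise only after every probe fails."""
    for attempt in range(len(backoffs) + 1):
        try:
            if probe(db_path):
                return
        except sqlite3.DatabaseError:
            pass  # possibly transient; retry
        if attempt < len(backoffs):
            sleep(backoffs[attempt])
    shutil.copy2(db_path, f"{db_path}.corrupt.{int(time.time())}.bak")
    raise RuntimeError("database failed every integrity probe")


def test_transient_error_self_heals():
    calls = []

    def flaky_probe(path):
        # Inject one transient failure, then report a healthy DB.
        calls.append(path)
        if len(calls) == 1:
            raise sqlite3.DatabaseError("database disk image is malformed")
        return True

    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "kanban.db")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        conn.commit()
        conn.close()

        # 1. The guard returns normally (attempt 2 sees a healthy DB).
        guard_db(db_path, flaky_probe, sleep=lambda _seconds: None)
        # 2. No .corrupt backup was produced.
        assert glob.glob(db_path + ".corrupt.*") == []

    # 3. The retry path was exercised: one failure plus one success.
    assert len(calls) == 2
```

Passing a no-op `sleep` keeps the test fast while still walking the retry branch.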
Existing tests for genuine corruption and locked-but-healthy DBs continue
to pass unchanged.