Background
PR #32857 (@steveonjava — Stephen Chin) batch-salvaged 8 kanban.db SQLite corruption hardening fixes. 7 of 8 were merged via #33482. The 8th was deliberately deferred because it had been partly superseded by c94ad89 (@donovan-yohan, fix(kanban): retry corrupt-board dispatch after quarantine) which landed on main between when @steveonjava drafted the batch and when it was reviewed.
This issue tracks the follow-up salvage of that 8th commit, rebased onto the current dispatcher shape so both contributions are preserved.
What @steveonjava's deferred commit does
Commit fefb4617d (fix(gateway): replace permanent corrupt-board latch with exponential backoff) makes two substantive improvements over the original permanent latch:
- Exponential backoff (30s initial; the PR body says a 30 min cap, the code says 15 min) instead of immediate-and-forever latching. New state schema:
disabled_corrupt_boards: dict[str, dict] = {}
# state = {"fingerprint": ..., "disabled_until_ts": ..., "backoff_seconds": ...}
INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0 # 15 min cap (PR body says 30; code says 15 — clarify in salvage)
On repeated same-fingerprint corruption, backoff_seconds = min(prev * 2, MAX_BACKOFF_SEC). On dispatch success, the latch clears.
- PRAGMA quick_check confirmation before latching. _confirm_corruption(slug, exc) opens a read-only URI (file:{path}?mode=ro) and runs PRAGMA quick_check. If the result is 'ok', the original error was transient and the latch is skipped:
if not _confirm_corruption(slug, exc):
    return None
This distinguishes a real corrupt file from a one-tick EIO/lock race that matches the same exception pattern.
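As a rough illustration of the probe described above, here is a minimal sketch, not the contributor's actual code. The function name confirm_corruption, its Path argument, and the one-second timeout are assumptions made for this example (the real helper takes slug and exc). Treating a failed probe as confirmed corruption is also an assumption; the salvage may prefer to treat some probe failures as transient.

```python
import sqlite3
from pathlib import Path


def confirm_corruption(db_path: Path) -> bool:
    """Return True when a read-only PRAGMA quick_check does not report 'ok'.

    Hypothetical stand-in for _confirm_corruption(slug, exc).
    """
    # Read-only URI so the probe can never create or modify the file.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=1.0)
    except sqlite3.Error:
        # Assumption: a file we cannot even open counts as confirmed.
        return True
    try:
        rows = conn.execute("PRAGMA quick_check").fetchall()
    except sqlite3.DatabaseError:
        # e.g. "file is not a database" or "database disk image is malformed"
        return True
    finally:
        conn.close()
    # A healthy database yields exactly one row containing the string 'ok'.
    return rows != [("ok",)]
```

A caller would skip the latch when this returns False, which matches the `return None` early exit shown above.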
The PR also ships tests/hermes_cli/test_kanban_dispatcher_resilience.py (~292 lines) covering both improvements.
What main has now (post-c94ad89818)
c94ad89 introduced a flat 5-minute quarantine timer instead of permanent latching:
CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300
disabled_corrupt_boards: dict[str, tuple[tuple[str, int | None, int | None], float]] = {}
# state = (fingerprint_tuple, disabled_at_monotonic)
It also auto-retries on fingerprint change (size/mtime delta) and recognizes _kb.KanbanDbCorruptError as a corrupt-board signal.
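For comparison with the proposal, the flat quarantine can be pictured with the sketch below. It is an approximation under stated assumptions, not the code on main: the helper names fingerprint, latch, and should_skip are hypothetical, and the fingerprint is assumed to be (path, size, mtime_ns) based on the "size/mtime delta" wording. The now parameter exists only to make the example easy to exercise.

```python
import os
import time
from typing import Optional

CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300

# Assumed layout: (path, size in bytes, mtime in ns); None when stat fails.
Fingerprint = tuple[str, Optional[int], Optional[int]]
disabled_corrupt_boards: dict[str, tuple[Fingerprint, float]] = {}


def fingerprint(path: str) -> Fingerprint:
    try:
        st = os.stat(path)
    except OSError:
        return (path, None, None)
    return (path, st.st_size, st.st_mtime_ns)


def latch(slug: str, path: str, now: Optional[float] = None) -> None:
    """Quarantine a board, remembering what the file looked like."""
    now = time.monotonic() if now is None else now
    disabled_corrupt_boards[slug] = (fingerprint(path), now)


def should_skip(slug: str, path: str, now: Optional[float] = None) -> bool:
    """True while the board is quarantined; clears the latch otherwise."""
    entry = disabled_corrupt_boards.get(slug)
    if entry is None:
        return False
    latched_fp, disabled_at = entry
    now = time.monotonic() if now is None else now
    file_changed = fingerprint(path) != latched_fp
    timer_expired = now - disabled_at >= CORRUPT_BOARD_RETRY_AFTER_SECONDS
    if file_changed or timer_expired:
        del disabled_corrupt_boards[slug]
        return False
    return True
```

Note that every failure waits the same 300 seconds here, and nothing checks whether the file is actually corrupt before latching.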
So main already achieves part of the "don't latch forever" goal, just via a simpler mechanism. What's missing vs @steveonjava's commit:
- Exponential backoff (vs flat 5min)
- PRAGMA quick_check ro-probe before latching (no current way to distinguish transient EIO from real corruption; every match latches)
- The test_kanban_dispatcher_resilience.py test surface
Proposed approach
Rebuild the deferred commit's (fefb4617d) improvements on top of c94ad89 rather than replacing it:
- Migrate state schema from tuple[(fingerprint, disabled_at)] to dict[{"fingerprint", "disabled_until_ts", "backoff_seconds"}]. Preserve the fingerprint-change retry semantics c94ad89 added.
- Replace flat CORRUPT_BOARD_RETRY_AFTER_SECONDS=300 with exponential backoff (INITIAL_BACKOFF_SEC=30, MAX_BACKOFF_SEC=900; confirm cap value with the contributor). Reset on dispatch success.
- Add _confirm_corruption(slug, exc) with the PRAGMA quick_check ro-probe. Wire it into both the sqlite3.DatabaseError and the broader Exception branches (c94ad89 handles both).
- Salvage tests/hermes_cli/test_kanban_dispatcher_resilience.py from the original PR, updating any assertions that depend on the c94ad89 shape we're keeping.
- PR attribution: cherry-pick the original commit fefb4617d if it rebases cleanly enough, otherwise commit our rebuild with --author='Stephen Chin <steveonjava@gmail.com>' and credit @donovan-yohan + @steveonjava in the PR body. Per references/partly-superseded-pr-salvage.md: don't fake authorship if the diff bears no resemblance to the original.
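The first two steps (dict state plus exponential backoff, keeping the fingerprint-change retry) might combine along these lines. This is a sketch under assumptions, not the final dispatcher code: record_corruption, is_quarantined, and record_success are hypothetical helper names, the fingerprint is treated as an opaque comparable value, and the 900-second cap is the value still to be confirmed with the contributor.

```python
from typing import Any, Hashable

INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0  # cap pending confirmation (PR body says 30 min)

# slug -> {"fingerprint", "disabled_until_ts", "backoff_seconds"}
disabled_corrupt_boards: dict[str, dict[str, Any]] = {}


def record_corruption(slug: str, fp: Hashable, now: float) -> dict[str, Any]:
    """Latch a board; double the wait if the same file failed before."""
    prev = disabled_corrupt_boards.get(slug)
    if prev is not None and prev["fingerprint"] == fp:
        backoff = min(prev["backoff_seconds"] * 2, MAX_BACKOFF_SEC)
    else:
        # First failure, or the file changed since the last one.
        backoff = INITIAL_BACKOFF_SEC
    state = {
        "fingerprint": fp,
        "disabled_until_ts": now + backoff,
        "backoff_seconds": backoff,
    }
    disabled_corrupt_boards[slug] = state
    return state


def is_quarantined(slug: str, fp: Hashable, now: float) -> bool:
    """True while the backoff window is open for an unchanged file."""
    state = disabled_corrupt_boards.get(slug)
    if state is None:
        return False
    if state["fingerprint"] != fp:
        # Preserve the c94ad89 behaviour: a changed file retries at once.
        del disabled_corrupt_boards[slug]
        return False
    # State is kept after expiry so a repeat failure can double the wait.
    return now < state["disabled_until_ts"]


def record_success(slug: str) -> None:
    """A successful dispatch clears the latch and resets the backoff."""
    disabled_corrupt_boards.pop(slug, None)
```

With these constants, repeated failures on an unchanged file wait 30, 60, 120, 240, 480, and then 900 seconds, staying at the cap afterwards.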
Files affected
- gateway/run.py (~5400–5620 area around _tick_once_for_board)
- tests/hermes_cli/test_kanban_core_functionality.py (existing test_gateway_dispatcher_retries_corrupt_board_after_quarantine; likely needs assertion updates for the new backoff shape)
- tests/hermes_cli/test_kanban_dispatcher_resilience.py (new, from original PR)
Credit
Refs
fix(kanban): retry corrupt-board dispatch after quarantine