
Follow-up: rebuild PR #32857 commit 5 (exp-backoff + PRAGMA quick_check) on top of c94ad8981 #33486

@kshitijk4poor

Background

PR #32857 (@steveonjava — Stephen Chin) batch-salvaged 8 kanban.db SQLite corruption hardening fixes. 7 of 8 were merged via #33482. The 8th was deliberately deferred because it had been partly superseded by c94ad89 (@donovan-yohan, fix(kanban): retry corrupt-board dispatch after quarantine) which landed on main between when @steveonjava drafted the batch and when it was reviewed.

This issue tracks the follow-up salvage of that 8th commit, rebased onto the current dispatcher shape so both contributions are preserved.

What @steveonjava's deferred commit does

Commit fefb4617d (fix(gateway): replace permanent corrupt-board latch with exponential backoff) makes two substantive improvements over the original permanent latch:

  1. Exponential backoff (30s initial delay, doubling up to a cap; the PR body says 30min, the code says 15min) instead of immediate-and-forever latching. New state schema:

    disabled_corrupt_boards: dict[str, dict] = {}
    # state = {"fingerprint": ..., "disabled_until_ts": ..., "backoff_seconds": ...}
    INITIAL_BACKOFF_SEC = 30.0
    MAX_BACKOFF_SEC = 900.0  # 15 min cap (PR body says 30; code says 15 — clarify in salvage)

    On repeated same-fingerprint corruption, backoff_seconds = min(prev * 2, MAX_BACKOFF_SEC). On dispatch success, the latch clears.

  2. PRAGMA quick_check confirmation before latching. _confirm_corruption(slug, exc) opens a read-only URI (file:{path}?mode=ro) and runs PRAGMA quick_check. If the result is 'ok', the original error was transient and the latch is skipped:

    if not _confirm_corruption(slug, exc):
        return None

    This distinguishes a real corrupt file from a one-tick EIO/lock race that matches the same exception pattern.
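A minimal sketch of the read-only probe described in item 2, not the code from fefb4617d. The name `confirm_corruption`, its path-based signature (the original takes `slug, exc`), the `timeout` parameter, and the policy of treating open failures and lock/I/O errors as "not confirmed" are all assumptions made for illustration.

```python
import sqlite3
from pathlib import Path


def confirm_corruption(db_path: Path, timeout: float = 1.0) -> bool:
    """Return True only if a read-only PRAGMA quick_check confirms corruption.

    Hypothetical helper: the real _confirm_corruption(slug, exc) may resolve
    the path and classify errors differently.
    """
    # Same URI shape the issue describes; assumes the path contains no
    # URI-special characters such as '?' or '#'.
    uri = f"file:{db_path}?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=timeout)
    except sqlite3.Error:
        # File missing or unopenable: not a confirmed corruption (assumed policy).
        return False
    try:
        rows = conn.execute("PRAGMA quick_check").fetchall()
    except sqlite3.OperationalError:
        # Locked, busy, or I/O error: treated as transient (assumed policy).
        return False
    except sqlite3.DatabaseError:
        # e.g. "database disk image is malformed" or "file is not a database".
        return True
    finally:
        conn.close()
    # A healthy database yields exactly one row containing 'ok'.
    return rows != [("ok",)]
```

The caller would latch the board only when this returns True, which matches the `if not _confirm_corruption(slug, exc): return None` guard shown above.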

The PR also ships tests/hermes_cli/test_kanban_dispatcher_resilience.py (~292 lines) covering both improvements.

What main has now (post-c94ad89818)

c94ad89 introduced a flat 5-minute quarantine timer instead of permanent latching:

CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300
disabled_corrupt_boards: dict[str, tuple[tuple[str, int | None, int | None], float]] = {}
# state = (fingerprint_tuple, disabled_at_monotonic)

It also auto-retries on fingerprint change (size/mtime delta) and recognizes _kb.KanbanDbCorruptError as a corrupt-board signal.

So main already meets part of the "don't latch forever" goal, just via a simpler mechanism. What's missing vs @steveonjava's commit:

  • Exponential backoff (vs flat 5min)
  • PRAGMA quick_check ro-probe before latching (no current way to distinguish transient EIO from real corruption — every match latches)
  • The test_kanban_dispatcher_resilience.py test surface

Proposed approach

Rebuild commit 5's improvements on top of c94ad89 rather than replacing it:

  1. Migrate the per-board state from the tuple (fingerprint, disabled_at) to a dict with keys "fingerprint", "disabled_until_ts", and "backoff_seconds". Preserve the fingerprint-change retry semantics c94ad89 added.
  2. Replace flat CORRUPT_BOARD_RETRY_AFTER_SECONDS=300 with exponential backoff (INITIAL_BACKOFF_SEC=30, MAX_BACKOFF_SEC=900 — confirm cap value with the contributor). Reset on dispatch success.
  3. Add _confirm_corruption(slug, exc) with the PRAGMA quick_check ro-probe. Wire it into both the sqlite3.DatabaseError and the broader Exception branches (c94ad89 handles both).
  4. Salvage tests/hermes_cli/test_kanban_dispatcher_resilience.py from the original PR, updating any assertions that depend on the c94ad89 shape we're keeping.
  5. PR attribution: cherry-pick the original commit fefb4617d if it rebases cleanly enough, otherwise commit our rebuild with --author='Stephen Chin <steveonjava@gmail.com>' and credit @donovan-yohan + @steveonjava in the PR body. Per references/partly-superseded-pr-salvage.md: don't fake authorship if the diff bears no resemblance to the original.
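Steps 1 and 2 could be combined along these lines. This is a sketch under assumptions rather than the planned implementation: `record_corruption`, `is_disabled`, and `record_success` are hypothetical names, the caller is assumed to pass a clock value and a precomputed fingerprint, and the 900-second cap is the value from the code, still to be confirmed.

```python
INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0  # cap still to be confirmed with the contributor

# state = {"fingerprint": ..., "disabled_until_ts": ..., "backoff_seconds": ...}
disabled_corrupt_boards: dict[str, dict] = {}


def record_corruption(slug: str, fingerprint: tuple, now: float) -> dict:
    """Latch a board, doubling the backoff on same-fingerprint repeats."""
    prev = disabled_corrupt_boards.get(slug)
    if prev is not None and prev["fingerprint"] == fingerprint:
        backoff = min(prev["backoff_seconds"] * 2, MAX_BACKOFF_SEC)
    else:
        # First failure, or the file changed since the last one: start over.
        backoff = INITIAL_BACKOFF_SEC
    state = {
        "fingerprint": fingerprint,
        "disabled_until_ts": now + backoff,
        "backoff_seconds": backoff,
    }
    disabled_corrupt_boards[slug] = state
    return state


def is_disabled(slug: str, fingerprint: tuple, now: float) -> bool:
    """True while the backoff window is open and the file is unchanged."""
    state = disabled_corrupt_boards.get(slug)
    if state is None:
        return False
    if state["fingerprint"] != fingerprint:
        # Keeps the c94ad89 behavior: a changed file is retried immediately.
        return False
    return now < state["disabled_until_ts"]


def record_success(slug: str) -> None:
    """Clear the latch after a successful dispatch."""
    disabled_corrupt_boards.pop(slug, None)
```

Expired entries are deliberately left in the dict so that the next same-fingerprint failure doubles the delay instead of restarting at 30 seconds; only a successful dispatch or a fingerprint change resets it.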

Files affected

  • gateway/run.py (~5400–5620 area around _tick_once_for_board)
  • tests/hermes_cli/test_kanban_core_functionality.py (existing test_gateway_dispatcher_retries_corrupt_board_after_quarantine — likely needs assertion updates for the new backoff shape)
  • tests/hermes_cli/test_kanban_dispatcher_resilience.py (new, from original PR)

Labels: P3 (Low: cosmetic, nice to have), comp/cron (Cron scheduler and job management), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
