Follow-up: rebuild PR #32857 commit 5 (exp-backoff + PRAGMA quick_check) on top of c94ad8981 #33486

## Background

PR #32857 (@steveonjava, Stephen Chin) batch-salvaged 8 SQLite corruption-hardening fixes for `kanban.db`. 7 of 8 were merged via #33482. **The 8th was deliberately deferred** because it had been partly superseded by c94ad8981 (@donovan-yohan, `fix(kanban): retry corrupt-board dispatch after quarantine`), which landed on main after @steveonjava drafted the batch but before it was reviewed.

This issue tracks the follow-up salvage of that 8th commit, rebased onto the current dispatcher shape so both contributions are preserved.

## What @steveonjava's deferred commit does

[Commit fefb4617d](https://github.com/NousResearch/hermes-agent/pull/32857/commits/fefb4617d) (`fix(gateway): replace permanent corrupt-board latch with exponential backoff`) makes two substantive improvements over the original permanent latch:

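1. **Exponential backoff (30s initial, doubling up to a cap)** instead of immediate-and-forever latching. The PR body describes a 30-minute cap, but the code sets 900 seconds (15 minutes). New state schema: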
   ```python
   disabled_corrupt_boards: dict[str, dict] = {}
   # state = {"fingerprint": ..., "disabled_until_ts": ..., "backoff_seconds": ...}
   INITIAL_BACKOFF_SEC = 30.0
   MAX_BACKOFF_SEC = 900.0  # 15 min cap (PR body says 30; code says 15 — clarify in salvage)
   ```
   On repeated same-fingerprint corruption, `backoff_seconds = min(prev * 2, MAX_BACKOFF_SEC)`. On dispatch success, the latch clears.

2. **`PRAGMA quick_check` confirmation before latching.** `_confirm_corruption(slug, exc)` opens a read-only URI (`file:{path}?mode=ro`) and runs `PRAGMA quick_check`. If the result is `'ok'`, the original error was transient and the latch is skipped:
   ```python
   if not _confirm_corruption(slug, exc):
       return None
   ```
   This distinguishes a real corrupt file from a one-tick EIO/lock race that matches the same exception pattern.
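
A minimal sketch of what a read-only `quick_check` probe along these lines could look like. The function name `confirm_corruption`, its `db_path` parameter (the real helper takes `slug, exc`), the timeout, and the choice to treat `sqlite3.OperationalError` as transient are all assumptions for illustration, not the code in fefb4617d:

```python
import sqlite3
from pathlib import Path


def confirm_corruption(db_path: Path, timeout: float = 1.0) -> bool:
    """Return True only if a read-only quick_check confirms damage."""
    uri = f"file:{db_path}?mode=ro"  # read-only URI form described above
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=timeout)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.OperationalError:
        # Assumption: lock contention, I/O errors, or an unopenable file are
        # treated as transient, so the caller should not latch.
        return False
    except sqlite3.DatabaseError:
        # e.g. "file is not a database" or "database disk image is malformed"
        return True
    # A healthy database yields exactly one row containing 'ok'.
    return rows != [("ok",)]
```

Note that the f-string does no escaping, so a path containing `?` or `#` would need percent-encoding before being used as a SQLite URI.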

The PR also ships [`tests/hermes_cli/test_kanban_dispatcher_resilience.py`](https://github.com/NousResearch/hermes-agent/pull/32857/files#diff-tests-hermes-cli-test-kanban-dispatcher-resilience) (~292 lines) covering both improvements.

## What main has now (post-c94ad8981)

c94ad8981 introduced a **flat 5-minute quarantine timer** instead of permanent latching:

```python
CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300
disabled_corrupt_boards: dict[str, tuple[tuple[str, int | None, int | None], float]] = {}
# state = (fingerprint_tuple, disabled_at_monotonic)
```

It also auto-retries on fingerprint change (size/mtime delta) and recognizes `_kb.KanbanDbCorruptError` as a corrupt-board signal.

So main already covers *part* of the "don't latch forever" goal, just via a simpler mechanism. What's missing relative to @steveonjava's commit:
- Exponential backoff (vs flat 5min)
- `PRAGMA quick_check` ro-probe before latching (no current way to distinguish transient EIO from real corruption — every match latches)
- The `test_kanban_dispatcher_resilience.py` test surface

## Proposed approach

Rebuild commit 5's improvements *on top of* c94ad8981 rather than replacing it:

1. **Migrate state schema** from the `(fingerprint, disabled_at)` tuple to a dict with the keys `"fingerprint"`, `"disabled_until_ts"`, and `"backoff_seconds"`. Preserve the fingerprint-change retry semantics c94ad8981 added.
2. **Replace flat `CORRUPT_BOARD_RETRY_AFTER_SECONDS=300`** with exponential backoff (`INITIAL_BACKOFF_SEC=30`, `MAX_BACKOFF_SEC=900` — confirm cap value with the contributor). Reset on dispatch success.
3. **Add `_confirm_corruption(slug, exc)`** with the `PRAGMA quick_check` ro-probe. Wire it into both the `sqlite3.DatabaseError` and the broader `Exception` branches (c94ad8981 handles both).
4. **Salvage `tests/hermes_cli/test_kanban_dispatcher_resilience.py`** from the original PR, updating any assertions that depend on the c94ad8981 shape we're keeping.
5. **PR attribution**: cherry-pick the original commit `fefb4617d` if it rebases cleanly enough; otherwise commit our rebuild with `--author='Stephen Chin <steveonjava@gmail.com>'` and credit @donovan-yohan + @steveonjava in the PR body. Per `references/partly-superseded-pr-salvage.md`: don't fake authorship if the diff bears no resemblance to the original.
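
Steps 1 and 2 could fit together roughly as follows. This is a sketch under assumptions, not the final patch: the function names are hypothetical, `now` is passed in so the clock choice (c94ad8981 stores a monotonic timestamp) stays open, and the cap uses the 900 s value from the code pending confirmation:

```python
INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0  # pending confirmation with the contributor

# slug -> {"fingerprint": ..., "disabled_until_ts": ..., "backoff_seconds": ...}
disabled_corrupt_boards: dict[str, dict] = {}


def record_corruption(slug: str, fingerprint: tuple, now: float) -> dict:
    """Latch a board, doubling the delay only for same-fingerprint repeats."""
    prev = disabled_corrupt_boards.get(slug)
    if prev is not None and prev["fingerprint"] == fingerprint:
        backoff = min(prev["backoff_seconds"] * 2, MAX_BACKOFF_SEC)
    else:
        backoff = INITIAL_BACKOFF_SEC  # new or changed file starts over
    state = {
        "fingerprint": fingerprint,
        "disabled_until_ts": now + backoff,
        "backoff_seconds": backoff,
    }
    disabled_corrupt_boards[slug] = state
    return state


def is_quarantined(slug: str, fingerprint: tuple, now: float) -> bool:
    """True while the board should be skipped on this tick."""
    state = disabled_corrupt_boards.get(slug)
    if state is None:
        return False
    if state["fingerprint"] != fingerprint:
        return False  # fingerprint-change retry kept from c94ad8981
    return now < state["disabled_until_ts"]


def record_success(slug: str) -> None:
    """Clear the latch after a successful dispatch."""
    disabled_corrupt_boards.pop(slug, None)
```

In this arrangement `_confirm_corruption` from step 3 would run before `record_corruption`, so a transient error never creates or extends a latch.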

## Files affected

- `gateway/run.py` (~5400–5620 area around `_tick_once_for_board`)
- `tests/hermes_cli/test_kanban_core_functionality.py` (existing `test_gateway_dispatcher_retries_corrupt_board_after_quarantine` — likely needs assertion updates for the new backoff shape)
- `tests/hermes_cli/test_kanban_dispatcher_resilience.py` (new, from original PR)

## Credit

- @steveonjava — original exp-backoff + quick_check design (PR #32857 commit 5)
- @donovan-yohan — flat-timer quarantine + fingerprint-change retry foundation (c94ad8981)

## Refs

- Original PR: #32857 (closed, batch-salvaged via #33482)
- Foundation commit: c94ad8981 — `fix(kanban): retry corrupt-board dispatch after quarantine`
- Related: #31158 (root cause issue), #31932 (the original draft this commit was extracted from)

