kanban.db index corruption after frequent gateway restarts: dispatcher disables board permanently #30908

## Summary

The kanban dispatcher's SQLite database (`~/.hermes/kanban.db`) repeatedly suffers index corruption after frequent gateway restarts (SIGTERM). Once the database is corrupt, the dispatcher disables dispatch for the board and retries only when the file's `(path, mtime_ns, size)` fingerprint changes. In practice **the board stays disabled**: a gateway restart that leaves the file untouched sees the same fingerprint, and a retry triggered by a changed fingerprint fails again because the indices are still corrupt.

## Environment

- Hermes Agent v0.14.0 (2026.5.16)
- macOS (APFS sealed volume, journaled)
- kanban.db uses WAL mode

## Reproduction

1. Run gateway with kanban dispatch enabled (`kanban.dispatch_in_gateway: true`)
2. Restart gateway frequently (e.g., during development: `hermes gateway restart` multiple times within minutes)
3. At some point `release_stale_claims()` hits `sqlite3.OperationalError: disk I/O error` during dispatch
4. This causes an incomplete WAL checkpoint, leaving indices out of sync with the table data
5. On next tick, `connect()` raises `sqlite3.DatabaseError: database disk image is malformed`
6. The dispatcher disables the board. The recovery attempt on the next tick sees the same `(path, mtime_ns, size)` fingerprint, so the board stays disabled

## Evidence

### Corruption pattern (integrity check)

```
wrong # of entries in index idx_events_run
wrong # of entries in index idx_events_task
wrong # of entries in index idx_runs_status
wrong # of entries in index idx_runs_task
wrong # of entries in index idx_tasks_status
wrong # of entries in index idx_tasks_assignee_status
row 120 missing from index idx_events_run
... (90+ missing index entries across 6 indices)
```
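
The report above is the kind of output `PRAGMA integrity_check` produces. As a minimal sketch, the same check can be run from Python's standard `sqlite3` module. The helper names `check_integrity` and `is_healthy` are illustrative and are not part of Hermes.

```python
import sqlite3


def check_integrity(db_path: str) -> list[str]:
    """Return the rows reported by PRAGMA integrity_check.

    A healthy database yields exactly one row containing "ok";
    a damaged one yields one row per problem found.
    """
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()


def is_healthy(db_path: str) -> bool:
    """True only when integrity_check reports the single row "ok"."""
    return check_integrity(db_path) == ["ok"]
```

On a badly damaged file the query itself may raise `sqlite3.DatabaseError` instead of returning rows, so a caller would likely want to treat that exception as "not healthy" too.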

### Gateway log timeline

```
18:54:53 kanban dispatcher [default]: spawned=1 reclaimed=0 ...
18:55:53 kanban dispatcher: tick failed on board default
  -> sqlite3.OperationalError: disk I/O error (in release_stale_claims)
18:55:53+ kanban notifier tick failed: cannot rollback - no transaction is active (repeated)
18:56:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
18:57:54 kanban dispatcher: board default database changed; retrying dispatch
18:57:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
```

The retry at 18:57:54 detects a file change but fails again, because the database is still corrupt. A `.dump` + rebuild was required to fix the indices.

### Manual recovery that worked

```bash
sqlite3 kanban.db '.dump' > dump.sql
sqlite3 kanban.db.new < dump.sql
mv kanban.db.new kanban.db
sqlite3 kanban.db 'PRAGMA integrity_check;'  # printed: ok
```

This happened twice, on 5/22 and again on 5/23, with identical symptoms.

## Root Cause Analysis

The corruption chain:

1. **Frequent SIGTERM** -> gateway doesn't always complete WAL checkpoint before exit
2. **APFS + WAL edge case** -> `disk I/O error` during a read query on partially-checkpointed WAL
3. **Index/table desync** -> indices become stale relative to table data
4. **`_is_corrupt_board_db_error()` catches `DatabaseError`** -> disables the board correctly
5. **Recovery heuristic is fragile** -> the fingerprint is `(path, mtime_ns, size)`. A dump/rebuild changes the file, but simply restarting the gateway after the corrupt write may leave the fingerprint unchanged (the file size may be identical), so no retry is triggered

## Suggested Fixes

### 1. Auto-repair on corruption detection (recommended)

When `_is_corrupt_board_db_error()` fires in `_tick_once_for_board()`, instead of permanently disabling the board, attempt a self-heal:

- Try `REINDEX` first (fast, handles most index-only corruption)
- If that fails, fall back to `.dump` + rebuild (handles deeper page-level corruption)
- If repair succeeds, retry dispatch; if it fails, then disable the board
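
The escalation above could be sketched roughly as follows, using only Python's standard `sqlite3` module. The function names (`try_reindex`, `rebuild_from_dump`, `repair_board_db`) are hypothetical, and the sketch assumes no other process has the database open while the repair runs.

```python
import contextlib
import os
import sqlite3


def _integrity_ok(conn: sqlite3.Connection) -> bool:
    rows = [r[0] for r in conn.execute("PRAGMA integrity_check")]
    return rows == ["ok"]


def try_reindex(db_path: str) -> bool:
    """Step 1: rebuild every index in place, then re-check integrity."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("REINDEX")
        return _integrity_ok(conn)
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()


def rebuild_from_dump(db_path: str) -> bool:
    """Step 2: replay a SQL dump into a fresh file and swap it in."""
    new_path = db_path + ".new"
    with contextlib.suppress(FileNotFoundError):
        os.remove(new_path)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(new_path)
    try:
        # iterdump() yields the same SQL text as the CLI's .dump command.
        dst.executescript("\n".join(src.iterdump()))
        ok = _integrity_ok(dst)
    except sqlite3.DatabaseError:
        ok = False
    finally:
        src.close()
        dst.close()
    if not ok:
        with contextlib.suppress(FileNotFoundError):
            os.remove(new_path)
        return False
    # Assumption: nobody else has the old file open, so its leftover
    # WAL/SHM files are stale and must not be paired with the new file.
    for suffix in ("-wal", "-shm"):
        with contextlib.suppress(FileNotFoundError):
            os.remove(db_path + suffix)
    os.replace(new_path, db_path)
    return True


def repair_board_db(db_path: str) -> bool:
    """Try the cheap repair first; fall back to a full rebuild."""
    return try_reindex(db_path) or rebuild_from_dump(db_path)
```

A caller would retry dispatch when `repair_board_db()` returns `True` and disable the board otherwise. Note that the rebuilt file starts in SQLite's default journal mode, so WAL would have to be re-enabled on the next connection.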

### 2. Periodic WAL checkpoint

Add a periodic `PRAGMA wal_checkpoint(TRUNCATE)` in the dispatcher tick (e.g., every N ticks or every M minutes) to keep the WAL file small and reduce the window for corruption on unclean shutdown.
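
One possible shape for this, as a sketch only: a small helper that the dispatcher tick calls with its open connection. The class name `PeriodicCheckpointer` and the five-minute default are assumptions, not existing Hermes code.

```python
import sqlite3
import time


class PeriodicCheckpointer:
    """Runs PRAGMA wal_checkpoint(TRUNCATE) at most once per interval."""

    def __init__(self, interval_s: float = 300.0, clock=time.monotonic):
        self._interval_s = interval_s
        self._clock = clock
        self._last = clock()

    def maybe_checkpoint(self, conn: sqlite3.Connection):
        """Checkpoint if the interval has elapsed.

        Returns the pragma's result row (busy, log_frames,
        checkpointed_frames), or None when the call was skipped.
        """
        now = self._clock()
        if now - self._last < self._interval_s:
            return None
        self._last = now
        return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

The call should happen outside an open transaction on `conn`. A non-zero first element in the returned row means the checkpoint could not complete because another connection was busy, which the tick could log and retry later.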

### 3. Improve recovery fingerprint heuristic

The current `disabled_corrupt_boards` recovery only retries when `(path, mtime_ns, size)` changes. Consider also tracking a generation counter that increments on any gateway restart, so a restart always gets at least one retry attempt before permanently disabling.
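
A rough sketch of the extended fingerprint, assuming the disabled-board record is a mapping from board name to fingerprint. `BoardFingerprint`, `fingerprint`, and `should_retry` are hypothetical names, and how the generation value is produced and stored across restarts is left open here.

```python
import os
from typing import NamedTuple


class BoardFingerprint(NamedTuple):
    path: str
    mtime_ns: int
    size: int
    generation: int  # assumed to change on every gateway restart


def fingerprint(db_path: str, generation: int) -> BoardFingerprint:
    """Capture the existing (path, mtime_ns, size) triple plus a generation."""
    st = os.stat(db_path)
    return BoardFingerprint(
        os.path.abspath(db_path), st.st_mtime_ns, st.st_size, generation
    )


def should_retry(
    disabled: dict[str, BoardFingerprint],
    board: str,
    current: BoardFingerprint,
) -> bool:
    """Retry when the board is not disabled or any fingerprint field changed."""
    recorded = disabled.get(board)
    return recorded is None or recorded != current
```

With this shape, an untouched file still compares unequal after a restart because `generation` differs, so the board gets one retry before being disabled again under the new fingerprint.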

### 4. Add `PRAGMA journal_mode=TRUNCATE` fallback

For systems where WAL is problematic, allow configuring `journal_mode` per-board. WAL is preferred for concurrent read/write but TRUNCATE is more resilient to unclean shutdowns.
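
As an illustration, a connection helper could apply a configured mode and verify that SQLite accepted it. The helper name `connect_with_journal_mode` and the allow-list are assumptions, as is the idea that the value comes from a per-board `journal_mode` setting.

```python
import sqlite3

# Modes this sketch accepts; anything else is rejected up front.
ALLOWED_JOURNAL_MODES = {"WAL", "TRUNCATE", "DELETE", "PERSIST"}


def connect_with_journal_mode(
    db_path: str, journal_mode: str = "WAL"
) -> sqlite3.Connection:
    """Open db_path and switch it to the requested journal mode."""
    mode = journal_mode.upper()
    if mode not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {journal_mode!r}")
    conn = sqlite3.connect(db_path)
    # PRAGMA arguments cannot be bound as parameters, so the value is
    # interpolated only after passing the allow-list check above.
    applied = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    if applied.upper() != mode:
        conn.close()
        raise sqlite3.OperationalError(
            f"journal_mode stayed {applied!r}, wanted {mode!r}"
        )
    return conn
```

The pragma returns the mode actually in effect, which can differ from the request (for example when another connection still holds the database in WAL mode), so checking the returned value avoids silently running in the wrong mode.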

