Summary
The kanban dispatcher's SQLite database (~/.hermes/kanban.db) suffers repeated index corruption after frequent gateway restarts (SIGTERM). Once the database is corrupted, the dispatcher disables dispatch for the board until the file changes or the gateway restarts. A restart alone does not recover the board: the dispatcher detects the same fingerprint and stays disabled because the change in file size/mtime is too subtle for the recovery heuristic.
Environment
- Hermes Agent v0.14.0 (2026.5.16)
- macOS (APFS sealed volume, journaled)
- kanban.db uses WAL mode
Reproduction
- Run gateway with kanban dispatch enabled (kanban.dispatch_in_gateway: true)
- Restart gateway frequently (e.g., during development: hermes gateway restart multiple times within minutes)
- At some point release_stale_claims() hits sqlite3.OperationalError: disk I/O error during dispatch
- This causes an incomplete WAL checkpoint, leaving indices out of sync with the table data
- On next tick, connect() raises sqlite3.DatabaseError: database disk image is malformed
- Dispatcher disables the board. Recovery attempt on next tick sees the same (path, mtime, size) fingerprint and stays disabled
Evidence
Corruption pattern (integrity check)
wrong # of entries in index idx_events_run
wrong # of entries in index idx_events_task
wrong # of entries in index idx_runs_status
wrong # of entries in index idx_runs_task
wrong # of entries in index idx_tasks_status
wrong # of entries in index idx_tasks_assignee_status
row 120 missing from index idx_events_run
... (90+ missing index entries across 6 indices)
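The report above is the kind of output that PRAGMA integrity_check produces. A minimal sketch for collecting it from Python with the standard sqlite3 module follows; the helper name integrity_report is hypothetical and not part of Hermes. On a badly damaged file the call may raise sqlite3.DatabaseError instead of returning rows.

```python
import sqlite3


def integrity_report(db_path: str) -> list[str]:
    """Return the rows of PRAGMA integrity_check for db_path.

    A healthy database yields exactly ["ok"]; otherwise each entry
    describes one problem (for example a missing index entry).
    """
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```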
Gateway log timeline
18:54:53 kanban dispatcher [default]: spawned=1 reclaimed=0 ...
18:55:53 kanban dispatcher: tick failed on board default
-> sqlite3.OperationalError: disk I/O error (in release_stale_claims)
18:55:53+ kanban notifier tick failed: cannot rollback - no transaction is active (repeated)
18:56:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
18:57:54 kanban dispatcher: board default database changed; retrying dispatch
18:57:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
The retry at 18:57:54 detects a file change but still fails, because the database itself is still corrupt. A .dump + rebuild was required to fix the indices.
Manual recovery that worked
sqlite3 kanban.db '.dump' > dump.sql
sqlite3 kanban.db.new < dump.sql
mv kanban.db.new kanban.db
# integrity_check: ok
This happened twice, on 5/22 and 5/23, with identical symptoms.
Root Cause Analysis
The corruption chain:
- Frequent SIGTERM -> gateway doesn't always complete WAL checkpoint before exit
- APFS + WAL edge case -> disk I/O error during a read query on partially-checkpointed WAL
- Index/table desync -> indices become stale relative to table data
- _is_corrupt_board_db_error() catches DatabaseError -> disables the board correctly
- Recovery heuristic is fragile -> fingerprint is (path, mtime_ns, size). After a dump/rebuild the file does change, but simply restarting the gateway after the corrupt write doesn't change the fingerprint enough (file size may be identical)
Suggested Fixes
1. Auto-repair on corruption detection (recommended)
When _is_corrupt_board_db_error() fires in _tick_once_for_board(), instead of permanently disabling the board, attempt a self-heal:
- Try REINDEX first (fast, handles most index-only corruption)
- If that fails, fall back to .dump + rebuild (handles deeper page-level corruption)
- If repair succeeds, retry dispatch; if it fails, then disable the board
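The steps above could be sketched roughly as follows, using only the standard sqlite3 module. The function names (reindex, rebuild_from_dump, try_repair) are hypothetical and not existing Hermes APIs. The sketch assumes no other process has the database open while the repair runs, and it uses Connection.iterdump() as a stand-in for the CLI's .dump.

```python
import os
import sqlite3


def _passes_integrity_check(conn: sqlite3.Connection) -> bool:
    return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


def reindex(db_path: str) -> bool:
    """Rebuild all indices in place; True if the database then checks out."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("REINDEX")
            return _passes_integrity_check(conn)
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False


def rebuild_from_dump(db_path: str) -> bool:
    """Copy everything readable into a fresh file, then swap it into place."""
    new_path = db_path + ".new"
    if os.path.exists(new_path):
        os.remove(new_path)
    try:
        src = sqlite3.connect(db_path)
        dst = sqlite3.connect(new_path)
        try:
            # iterdump() yields SQL text similar to the CLI's .dump output.
            dst.executescript("\n".join(src.iterdump()))
            ok = _passes_integrity_check(dst)
        finally:
            src.close()
            dst.close()
    except sqlite3.DatabaseError:
        ok = False
    if not ok:
        if os.path.exists(new_path):
            os.remove(new_path)
        return False
    # Drop leftover WAL/SHM sidecars so they are not paired with the new file.
    for suffix in ("-wal", "-shm"):
        sidecar = db_path + suffix
        if os.path.exists(sidecar):
            os.remove(sidecar)
    os.replace(new_path, db_path)
    return True


def try_repair(db_path: str) -> bool:
    """REINDEX first; fall back to dump + rebuild. False means give up."""
    return reindex(db_path) or rebuild_from_dump(db_path)
```

A caller in the dispatcher would retry dispatch when try_repair() returns True and disable the board otherwise. Note that a dump does not carry PRAGMA settings, so the rebuilt file would need journal_mode set again if WAL is wanted.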
2. Periodic WAL checkpoint
Add a periodic PRAGMA wal_checkpoint(TRUNCATE) in the dispatcher tick (e.g., every N ticks or every M minutes) to keep the WAL file small and reduce the window for corruption on unclean shutdown.
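One possible shape for this, as a sketch only: the helper name maybe_checkpoint and the every_n_ticks parameter are hypothetical, and the caller is assumed to pass in its own tick counter and an open connection with no transaction in progress.

```python
import sqlite3
from typing import Optional, Tuple


def maybe_checkpoint(
    conn: sqlite3.Connection,
    tick_number: int,
    every_n_ticks: int = 10,
) -> Optional[Tuple[int, int, int]]:
    """Run a TRUNCATE checkpoint on every Nth tick.

    Returns None when no checkpoint was attempted; otherwise returns the
    (busy, log_frames, checkpointed_frames) row reported by SQLite.
    A non-zero busy value means the checkpoint could not run to completion.
    """
    if every_n_ticks <= 0 or tick_number % every_n_ticks != 0:
        return None
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

Checking the busy column lets the dispatcher log a skipped checkpoint rather than treating it as an error.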
3. Improve recovery fingerprint heuristic
The current disabled_corrupt_boards recovery only retries when (path, mtime_ns, size) changes. Consider also tracking a generation counter that increments on any gateway restart, so a restart always gets at least one retry attempt before permanently disabling.
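A sketch of how such a fingerprint might look is shown below. BoardFingerprint, fingerprint, and should_retry are hypothetical names, and how the generation value is produced (a persisted counter, or any value unique to the current gateway process) is left open here.

```python
import os
from typing import Dict, NamedTuple


class BoardFingerprint(NamedTuple):
    path: str
    mtime_ns: int
    size: int
    generation: int  # assumed to change on every gateway restart


def fingerprint(db_path: str, generation: int) -> BoardFingerprint:
    """Capture the file identity plus the current restart generation."""
    st = os.stat(db_path)
    return BoardFingerprint(
        os.path.abspath(db_path), st.st_mtime_ns, st.st_size, generation
    )


def should_retry(
    disabled: Dict[str, BoardFingerprint],
    board: str,
    current: BoardFingerprint,
) -> bool:
    """Retry when the board is not disabled, or when anything in the
    fingerprint (file metadata or generation) differs from the value
    recorded at the time the board was disabled."""
    previous = disabled.get(board)
    return previous is None or previous != current
```

Because the generation is part of the tuple, a restart yields a different fingerprint even when mtime_ns and size are unchanged, so the board gets one retry per restart.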
4. Add PRAGMA journal_mode=TRUNCATE fallback
For systems where WAL is problematic, allow configuring journal_mode per-board. WAL is preferred for concurrent read/write but TRUNCATE is more resilient to unclean shutdowns.
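A minimal sketch of applying a configured mode when a board database is opened is shown below. The helper apply_journal_mode and the set of allowed values are assumptions for illustration; no such per-board setting exists in the configuration described above.

```python
import sqlite3

# Modes this sketch accepts; SQLite supports others, omitted here on purpose.
ALLOWED_JOURNAL_MODES = {"WAL", "TRUNCATE", "DELETE"}


def apply_journal_mode(conn: sqlite3.Connection, mode: str = "WAL") -> str:
    """Set the journal mode and return the mode SQLite actually reports.

    The returned value can differ from the request (for example, an
    in-memory database always reports "memory"), so callers should check it.
    """
    requested = mode.upper()
    if requested not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {mode!r}")
    # PRAGMA arguments cannot be bound as parameters; the allow-list above
    # keeps the interpolated value safe.
    row = conn.execute(f"PRAGMA journal_mode={requested}").fetchone()
    return row[0]
```

Validating against an allow-list keeps a typo in the board configuration from being passed straight into the PRAGMA statement.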