
kanban.db index corruption after frequent gateway restarts — dispatcher disables board permanently #30908

@pbwheel

Summary

The kanban dispatcher's SQLite database (~/.hermes/kanban.db) suffers repeated index corruption after frequent gateway restarts (SIGTERM). Once the database is corrupted, the dispatcher disables itself for that board and is only supposed to retry when the file changes or the gateway restarts. In practice the board stays disabled even after a restart: the dispatcher detects the same fingerprint because the change in file size/mtime is too subtle for the recovery heuristic.

Environment

  • Hermes Agent v0.14.0 (2026.5.16)
  • macOS (APFS sealed volume, journaled)
  • kanban.db uses WAL mode

Reproduction

  1. Run gateway with kanban dispatch enabled (kanban.dispatch_in_gateway: true)
  2. Restart gateway frequently (e.g., during development: hermes gateway restart multiple times within minutes)
  3. At some point release_stale_claims() hits sqlite3.OperationalError: disk I/O error during dispatch
  4. This causes an incomplete WAL checkpoint, leaving indices out of sync with the table data
  5. On next tick, connect() raises sqlite3.DatabaseError: database disk image is malformed
  6. The dispatcher disables the board. The recovery attempt on the next tick sees the same (path, mtime_ns, size) fingerprint, so the board stays disabled

Evidence

Corruption pattern (integrity check)

wrong # of entries in index idx_events_run
wrong # of entries in index idx_events_task
wrong # of entries in index idx_runs_status
wrong # of entries in index idx_runs_task
wrong # of entries in index idx_tasks_status
wrong # of entries in index idx_tasks_assignee_status
row 120 missing from index idx_events_run
... (90+ missing index entries across 6 indices)
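The report above can be collected programmatically. A minimal sketch using Python's standard sqlite3 module; the helper name integrity_problems is hypothetical and not part of Hermes:

```python
import sqlite3


def integrity_problems(db_path: str) -> list[str]:
    """Return the messages from PRAGMA integrity_check, or [] if the database is healthy."""
    conn = sqlite3.connect(db_path)
    try:
        # integrity_check yields a single row "ok" on success,
        # otherwise one row per problem found.
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

Index-only damage like the above is typically reported as rows, whereas a more severely damaged file may instead raise sqlite3.DatabaseError from the execute call, so a caller would likely want to handle both cases.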

Gateway log timeline

18:54:53 kanban dispatcher [default]: spawned=1 reclaimed=0 ...
18:55:53 kanban dispatcher: tick failed on board default
  -> sqlite3.OperationalError: disk I/O error (in release_stale_claims)
18:55:53+ kanban notifier tick failed: cannot rollback - no transaction is active (repeated)
18:56:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
18:57:54 kanban dispatcher: board default database changed; retrying dispatch
18:57:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch

The retry at 18:57:54 already detects a file change but still fails, because the file itself is still corrupt. A .dump + rebuild was required to fix the indices.

Manual recovery that worked

sqlite3 kanban.db '.dump' > dump.sql
sqlite3 kanban.db.new < dump.sql
mv kanban.db.new kanban.db
# integrity_check: ok

This happened twice, on 5/22 and 5/23, with identical symptoms.

Root Cause Analysis

The corruption chain:

  1. Frequent SIGTERM -> gateway doesn't always complete WAL checkpoint before exit
  2. APFS + WAL edge case -> disk I/O error during a read query on partially-checkpointed WAL
  3. Index/table desync -> indices become stale relative to table data
  4. _is_corrupt_board_db_error() catches DatabaseError -> disables the board correctly
  5. Recovery heuristic is fragile -> fingerprint is (path, mtime_ns, size). After a dump/rebuild the file does change, but simply restarting the gateway after the corrupt write doesn't change the fingerprint enough (file size may be identical)

Suggested Fixes

1. Auto-repair on corruption detection (recommended)

When _is_corrupt_board_db_error() fires in _tick_once_for_board(), instead of permanently disabling the board, attempt a self-heal:

  • Try REINDEX first (fast, handles most index-only corruption)
  • If that fails, fall back to .dump + rebuild (handles deeper page-level corruption)
  • If repair succeeds, retry dispatch; if it fails, then disable the board
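The REINDEX-then-rebuild sequence above could look roughly like the following. This is a sketch under stated assumptions, not the dispatcher's actual code: the function names and the ".rebuilt" temp suffix are hypothetical, and it assumes no other connection has the database open while the repair runs.

```python
import os
import sqlite3


def integrity_ok(conn: sqlite3.Connection) -> bool:
    """True if PRAGMA integrity_check reports a single 'ok' row."""
    return [row[0] for row in conn.execute("PRAGMA integrity_check")] == ["ok"]


def reindex_repair(db_path: str) -> bool:
    """Step 1: rebuild all indices in place from the table data."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("REINDEX")
        return integrity_ok(conn)
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()


def rebuild_from_dump(db_path: str) -> bool:
    """Step 2: dump to SQL, load it into a fresh file, then swap the file in."""
    rebuilt = db_path + ".rebuilt"  # hypothetical temp file name
    if os.path.exists(rebuilt):
        os.remove(rebuilt)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(rebuilt)
    try:
        # iterdump() is the Python equivalent of the CLI's .dump
        dst.executescript("\n".join(src.iterdump()))
        ok = integrity_ok(dst)
    except sqlite3.DatabaseError:
        ok = False
    finally:
        src.close()
        dst.close()
    if not ok:
        if os.path.exists(rebuilt):
            os.remove(rebuilt)
        return False
    # Drop stale WAL/SHM sidecars so they are not applied to the new file.
    for suffix in ("-wal", "-shm"):
        try:
            os.remove(db_path + suffix)
        except FileNotFoundError:
            pass
    os.replace(rebuilt, db_path)
    return True


def try_repair(db_path: str) -> bool:
    """Return True if the database passes integrity_check after repair."""
    return reindex_repair(db_path) or rebuild_from_dump(db_path)
```

Two caveats with this sketch: the dump is held in memory as one string, which is fine for a small board database but not for a large one, and the rebuilt file starts in SQLite's default journal mode, so WAL would need to be re-enabled afterwards.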

2. Periodic WAL checkpoint

Add a periodic PRAGMA wal_checkpoint(TRUNCATE) in the dispatcher tick (e.g., every N ticks or every M minutes) to keep the WAL file small and reduce the window for corruption on unclean shutdown.
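A time-based variant of this could be sketched as follows. The class name WalCheckpointer and the 300-second default are assumptions for illustration; the dispatcher would call maybe_checkpoint() once per tick, outside any open transaction.

```python
import sqlite3
import time
from typing import Optional


class WalCheckpointer:
    """Hypothetical helper: run wal_checkpoint(TRUNCATE) at most once per interval."""

    def __init__(self, interval_s: float = 300.0) -> None:
        self.interval_s = interval_s
        self._last: Optional[float] = None  # monotonic time of the last attempt

    def maybe_checkpoint(self, conn: sqlite3.Connection,
                         now: Optional[float] = None) -> Optional[bool]:
        """Return None if skipped, True if the checkpoint completed, False if blocked."""
        now = time.monotonic() if now is None else now
        if self._last is not None and now - self._last < self.interval_s:
            return None
        self._last = now
        # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
        busy, _wal_frames, _done = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE)"
        ).fetchone()
        return busy == 0
```

A False result means another connection blocked the checkpoint; the next interval simply tries again.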

3. Improve recovery fingerprint heuristic

The current disabled_corrupt_boards recovery only retries when (path, mtime_ns, size) changes. Consider also tracking a generation counter that increments on any gateway restart, so a restart always gets at least one retry attempt before permanently disabling.
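One way to picture the extended fingerprint is shown below. This is a sketch only: BoardFingerprint, should_retry, and the per-process generation token are hypothetical, and the generation only matters if the disabled-board record outlives the gateway process.

```python
import os
import uuid
from typing import Dict, NamedTuple

# Assumption: a fresh token per gateway process stands in for the
# "generation counter" suggested above.
PROCESS_GENERATION = uuid.uuid4().hex


class BoardFingerprint(NamedTuple):
    path: str
    mtime_ns: int
    size: int
    generation: str


def fingerprint(db_path: str, generation: str = PROCESS_GENERATION) -> BoardFingerprint:
    """Capture the existing (path, mtime_ns, size) triple plus the generation."""
    st = os.stat(db_path)
    return BoardFingerprint(os.path.abspath(db_path), st.st_mtime_ns, st.st_size, generation)


def should_retry(disabled: Dict[str, BoardFingerprint], board: str, db_path: str,
                 generation: str = PROCESS_GENERATION) -> bool:
    """Retry if the board is not disabled, or if anything in the fingerprint changed."""
    recorded = disabled.get(board)
    return recorded is None or recorded != fingerprint(db_path, generation)
```

With this shape, an unchanged file under a new generation still compares unequal, so each restart gets one retry before the board is disabled again.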

4. Add PRAGMA journal_mode=TRUNCATE fallback

For systems where WAL is problematic, allow configuring journal_mode per-board. WAL is preferred for concurrent read/write but TRUNCATE is more resilient to unclean shutdowns.
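A per-board setting could be applied at connection time roughly like this. The function name and the allowed-mode set are assumptions; how the value would be read from the Hermes config is left open.

```python
import sqlite3

# Rollback-journal modes plus WAL; OFF and MEMORY are deliberately excluded
# because they give up crash safety.
ALLOWED_JOURNAL_MODES = {"WAL", "TRUNCATE", "DELETE", "PERSIST"}


def connect_with_journal_mode(db_path: str, journal_mode: str = "WAL") -> sqlite3.Connection:
    """Open db_path and switch it to the requested journal mode (hypothetical helper)."""
    mode = journal_mode.upper()
    if mode not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {journal_mode!r}")
    conn = sqlite3.connect(db_path)
    # The pragma returns the mode actually in effect, which can differ
    # from the request if the switch was not possible.
    actual = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    if actual.upper() != mode:
        conn.close()
        raise RuntimeError(f"could not switch to {mode}; database reports {actual}")
    return conn
```

Only WAL is recorded persistently in the database file; the other modes apply per connection, so every connection to the board would need to go through the same helper.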

Labels

  • P2 (Medium: degraded but workaround exists)
  • comp/cron (Cron scheduler and job management)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • type/bug (Something isn't working)
