Description
When multiple Hermes profile gateways (e.g., --profile bingge, --profile pixiel, --profile mafei) run concurrently and share the same kanban board DB (~/.hermes/kanban.db), the SQLite database becomes corrupted. The corruption specifically affects the kanban_notify_subs table indexes.
This is not a false positive from PRAGMA integrity_check — the indexes genuinely become inconsistent with the table data.
Environment
- macOS (APFS filesystem)
- SQLite 3.51.0
- Hermes Agent (latest main branch)
- 4 gateway processes: default + 3 named profiles, all sharing kanban.db by design
Root Cause Analysis
Architecture
- kanban_home() intentionally resolves to ~/.hermes/ for all profiles (by design, per the docstring: "The kanban board is shared across profiles")
- Each profile gateway runs its own dispatcher, which opens independent SQLite connections
- CLI commands (hermes kanban create/complete/block/link) also open new connections
The Race
Multiple processes concurrently execute BEGIN IMMEDIATE write transactions against the same DB. SQLite WAL mode supports concurrent readers plus a single writer at a time (per database, not per connection), so the writes themselves should serialize. However, concurrent WAL checkpoints from separate processes appear to corrupt the main DB file.
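To illustrate the single-writer behavior that BEGIN IMMEDIATE relies on, here is a minimal sketch using Python's standard sqlite3 module. It runs against a throwaway database with a made-up `demo` table, not the Hermes schema, and the function name is hypothetical. It only shows that a second write transaction is refused while the first is open; it does not reproduce the corruption.

```python
import sqlite3


def second_writer_is_blocked(db_path):
    """Return True if a second BEGIN IMMEDIATE fails while the first is open."""
    # isolation_level=None: we issue BEGIN/ROLLBACK ourselves.
    # timeout=0: fail immediately instead of waiting for the lock.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("PRAGMA journal_mode=WAL")
        first.execute("CREATE TABLE IF NOT EXISTS demo (x INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        try:
            second.execute("BEGIN IMMEDIATE")  # should fail with "database is locked"
        except sqlite3.OperationalError:
            return True
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()
```

In a real gateway the two connections would live in separate processes and use a non-zero timeout, so the second writer would wait rather than fail.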
Evidence
- 4 gateway processes had open file descriptors on kanban.db at time of corruption
- Last events before corruption show rapid concurrent activity from different profile dispatchers:
  - bingge gateway: completed Sprint 3 PRD (13:04:34)
  - pixiel gateway: spawned Sprint 3 design task (13:04:36)
  - mafei gateway: protocol_violation → gave_up → re-spawned Sprint 2 (13:05:04)
- Corruption was in kanban_notify_subs indexes (idx_notify_task + sqlite_autoindex_kanban_notify_subs_1)
PRAGMA integrity_check returned:
Tree 10 page 10: btreeInitPage() returns error code 11
wrong # of entries in index idx_notify_task
wrong # of entries in index sqlite_autoindex_kanban_notify_subs_1
Steps to Reproduce
- Start 3+ profile gateways: hermes --profile X gateway run --replace
- Run kanban operations that trigger concurrent writes (task create + claim + notify-subscribe)
- Observe corruption after ~30-60 minutes of active use
Suggested Fixes
Short-term
Add a file-level advisory lock (fcntl.flock) around all kanban write operations in kanban_db.py. The existing BEGIN IMMEDIATE handles SQLite-level serialization, but doesn't protect against concurrent WAL checkpoints from separate processes.
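A minimal sketch of what such a lock could look like, assuming a sidecar lock file next to the database. The names `kanban_write_lock` and `locked_write` and the `.lock` suffix are hypothetical and are not taken from kanban_db.py. The lock goes on a separate file so it does not interact with the locks SQLite itself takes on the database file.

```python
import contextlib
import fcntl
import os
import sqlite3


@contextlib.contextmanager
def kanban_write_lock(db_path):
    """Hold an exclusive advisory lock on a sidecar file while the block runs."""
    lock_path = db_path + ".lock"  # hypothetical sidecar file
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def locked_write(db_path, sql, params=()):
    """Run one write statement in BEGIN IMMEDIATE while holding the file lock."""
    with kanban_write_lock(db_path):
        conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
        try:
            conn.execute("BEGIN IMMEDIATE")
            conn.execute(sql, params)
            conn.execute("COMMIT")
        except Exception:
            if conn.in_transaction:
                conn.execute("ROLLBACK")
            raise
        finally:
            # Close inside the lock: closing a connection can trigger a
            # WAL checkpoint, which is the step this lock is meant to cover.
            conn.close()
```

Because the lock is advisory, it only helps if every writer (gateways and CLI commands alike) goes through the same wrapper.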
Medium-term
Serialize kanban writes through a single writer process/thread. Each gateway could send write requests to a central kanban writer instead of opening independent connections.
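The single-writer idea could be sketched as follows. This is an in-process illustration only: `KanbanWriter` is a hypothetical class, and a real multi-gateway setup would need some form of IPC (for example a local socket) in place of the in-memory queue used here.

```python
import queue
import sqlite3
import threading


class KanbanWriter:
    """Hypothetical single writer: one thread owns the only write connection."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue a write and block until the writer thread has handled it."""
        done = threading.Event()
        result = {}
        self._requests.put((sql, params, done, result))
        done.wait()
        if "error" in result:
            raise result["error"]

    def close(self):
        """Ask the writer thread to stop and wait for it to finish."""
        self._requests.put(None)
        self._thread.join()

    def _run(self):
        # The connection is created and used only on this thread.
        conn = sqlite3.connect(self._db_path)
        try:
            while True:
                item = self._requests.get()
                if item is None:
                    break
                sql, params, done, result = item
                try:
                    with conn:  # commits on success, rolls back on exception
                        conn.execute(sql, params)
                except Exception as exc:
                    result["error"] = exc
                finally:
                    done.set()
        finally:
            conn.close()
```

Callers would use `writer.submit("INSERT ...", params)` from any thread; reads can keep using their own connections.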
Long-term
Consider PostgreSQL as an optional backend for multi-profile setups. SQLite permits only one writer at a time, even in WAL mode, so several gateway processes writing to one file all contend for the same write lock.
Workaround
Currently working around by:
- Monitoring for corruption via PRAGMA integrity_check
- Recovering from .recover.*.sql dumps when corruption is detected
- Restarting all gateways after recovery
This is fragile — the recovery SQL can be stale, losing recent task state.
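The monitoring step in this workaround can be sketched roughly as below, using Python's standard sqlite3 module. The function name `check_kanban_db` is hypothetical, and how often to run it and what to do on failure are left to the caller.

```python
import os
import sqlite3


def check_kanban_db(db_path):
    """Return a list of problems from PRAGMA integrity_check (empty if healthy)."""
    if not os.path.exists(db_path):
        # Avoid silently creating an empty database that would report "ok".
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the check produces any rows.
        return [str(exc)]
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

A non-empty result would be the signal to stop the gateways and start recovery.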