Summary
When both hermes-gateway and hermes-dashboard systemd services are running, the SQLite kanban DB (kanban.db) for a board quickly becomes corrupted with integrity-check errors like wrong # of entries in index idx_events_run. The root cause is that two independent processes open the same SQLite DB in WAL mode without any file locking or access serialization.
Environment
- Hermes Agent: v0.13.0
- OS: CentOS 7 (kernel 3.10, glibc 2.17)
- SQLite: 3.50.4 (bundled in venv)
- Setup: two systemd services, hermes-gateway.service + hermes-dashboard.service (PartOf)
- Board: wurenji (but any shared board is affected)
Steps to Reproduce
- Enable both gateway and dashboard as systemd services pointing to the same ~/.hermes directory.
- Have at least one active kanban board (e.g., wurenji) with tasks.
- Wait for the gateway kanban dispatcher tick (60s interval) to overlap with dashboard API requests that read the same board.
- Within minutes to hours, kanban_db.connect() raises KanbanDbCorruptError.
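To confirm the corruption independently of the Hermes services, the board DB can be checked directly with SQLite's integrity_check pragma. The following is a minimal sketch; the helper name integrity_errors is hypothetical and not part of hermes_cli, and it should only be run against a copy of the DB or with both services stopped.

```python
import sqlite3
from typing import List


def integrity_errors(db_path: str) -> List[str]:
    """Return integrity_check messages; an empty list means SQLite reported 'ok'.

    Hypothetical helper for illustration, not part of hermes_cli.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

A non-empty result, such as a message about idx_events_run, would match the error reported in the gateway logs.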
Observed Behavior
Gateway logs:
ERROR gateway.run: kanban dispatcher: tick failed on board wurenji
hermes_cli.kanban_db.KanbanDbCorruptError:
Refusing to open corrupt kanban DB at .../wurenji/kanban.db:
integrity_check returned 'wrong # of entries in index idx_events_run'
Dashboard logs (500 Internal Server Error):
starlette.exceptions.HTTPException: 500 — the same KanbanDbCorruptError propagates to the web API
In our environment the board directory accumulated 300+ corrupt backup files (kanban.db.corrupt.*.bak) in under 30 minutes because both services kept retrying and re-corrupting the DB.
Root Cause Analysis
- hermes_cli/kanban_db.py connect() opens the DB in journal_mode=WAL.
- Both gateway and dashboard processes call connect(board=slug) independently.
- When one process triggers a WAL checkpoint or toggles journal_mode, the other process can be mid-transaction, leaving the DB file and -shm / -wal files in an inconsistent state.
- kanban_db.py only performs post-hoc validation (_guard_existing_db_is_healthy). It detects corruption but does not prevent it.
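If toggling journal_mode on every open is indeed part of the problem, one way to reduce it is to read the current mode first and switch only when the DB is not already in WAL. This is a sketch under that assumption; open_wal_once and its busy_timeout_ms parameter are hypothetical names, and the actual behavior of connect() may differ.

```python
import sqlite3


def open_wal_once(db_path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open db_path and switch to WAL only if it is not already in WAL mode.

    Hypothetical helper for illustration, not the hermes_cli implementation.
    """
    # The timeout argument makes SQLite wait on a locked DB instead of failing at once.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if current.lower() != "wal":
        # WAL mode is persistent in the DB file, so later opens skip this branch.
        conn.execute("PRAGMA journal_mode=WAL")
    return conn
```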
Suggested Fixes
- Single-writer architecture: Let gateway own the kanban DB exclusively. dashboard should read board state via an in-memory cache or an HTTP/internal API provided by gateway, not by opening the SQLite file directly.
- File locking: If direct DB access from both processes is required, use fcntl.flock or SQLite’s built-in busy_timeout + locking_mode to serialize open/close operations.
- Auto-recovery: When KanbanDbCorruptError is detected, automatically perform iterdump → rebuild instead of requiring manual service stop + python script.
- Separate DB paths: As a short-term workaround, allow dashboard to use a read-only replica or a separate DB path so the two processes never share the same -wal / -shm files.
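The file-locking option could look roughly like the sketch below: each process takes an exclusive fcntl.flock on a sidecar lock file before opening the DB and releases it after closing. The locked_connection name and the .lock sidecar path are assumptions made for illustration; neither exists in hermes_cli today.

```python
import contextlib
import fcntl
import os
import sqlite3


@contextlib.contextmanager
def locked_connection(db_path: str, busy_timeout_ms: int = 5000):
    """Yield a SQLite connection while holding an exclusive advisory file lock.

    Hypothetical helper: the '<db>.lock' sidecar file is an assumption.
    """
    lock_path = db_path + ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
        try:
            conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
            yield conn
        finally:
            conn.close()
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Both gateway and dashboard would need to go through the same helper for the lock to have any effect, since flock is advisory. Holding the lock for the whole connection lifetime trades concurrency for safety, so callers should keep each with-block short.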
Workaround (manual recovery)
sudo systemctl stop hermes-dashboard hermes-gateway
cd ~/.hermes/kanban/boards/<board>
# Use venv Python (newer sqlite3) to dump & reload
python -c "
import sqlite3, os
src = 'kanban.db'
conn = sqlite3.connect(src)
conn.execute('PRAGMA journal_mode=DELETE')
conn.execute('PRAGMA wal_checkpoint(TRUNCATE)')
with open('dump.sql','w') as f:
    for line in conn.iterdump(): f.write(line+'\n')
conn.close()
conn = sqlite3.connect('kanban.db.new')
with open('dump.sql') as f: conn.executescript(f.read())
conn.execute('PRAGMA journal_mode=WAL')
conn.close()
"
mv kanban.db kanban.db.corrupt.$(date +%Y%m%d_%H%M%S).bak
mv kanban.db.new kanban.db
sudo systemctl start hermes-gateway
sudo systemctl start hermes-dashboard
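The dump-and-reload steps above could also be wrapped in a single function, which is roughly what the auto-recovery suggestion would need. This is a sketch only: rebuild_db is a hypothetical name, it assumes no other process has the DB open, and iterdump may still fail if the table data itself (rather than just an index) is damaged.

```python
import os
import sqlite3
import time


def rebuild_db(db_path: str) -> str:
    """Dump db_path with iterdump, reload it into a fresh file, and swap it in.

    Returns the path of the backup of the old file. Hypothetical helper;
    assumes both services are stopped while it runs.
    """
    new_path = db_path + ".new"
    if os.path.exists(new_path):
        os.remove(new_path)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(new_path)
    try:
        # Leave WAL mode first so no -wal / -shm files outlive the old DB.
        src.execute("PRAGMA journal_mode=DELETE")
        dst.executescript("\n".join(src.iterdump()))
        dst.execute("PRAGMA journal_mode=WAL")
    finally:
        src.close()
        dst.close()
    backup = f"{db_path}.corrupt.{time.strftime('%Y%m%d_%H%M%S')}.bak"
    os.replace(db_path, backup)
    os.replace(new_path, db_path)
    return backup
```

Because indexes are recreated from their CREATE INDEX statements during the reload, an index-only error such as the one on idx_events_run should not carry over into the rebuilt file.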
Related
This issue is a code-level design/architecture bug, not an OS or SQLite bug.