[Bug] Kanban DB corruption when gateway and dashboard open the same board DB concurrently (WAL mode) #33169

@baofuen

Summary

When both the hermes-gateway and hermes-dashboard systemd services are running, the SQLite kanban DB (kanban.db) for a board quickly becomes corrupted, with integrity-check errors such as 'wrong # of entries in index idx_events_run'. The root cause is that two independent processes open the same SQLite DB in WAL mode without any file locking or access serialization.

Environment

  • Hermes Agent: v0.13.0
  • OS: CentOS 7 (kernel 3.10, glibc 2.17)
  • SQLite: 3.50.4 (bundled in venv)
  • Setup: systemd services — hermes-gateway.service + hermes-dashboard.service (PartOf)
  • Board: wurenji (but any shared board is affected)

Steps to Reproduce

  1. Enable both gateway and dashboard as systemd services pointing to the same ~/.hermes directory.
  2. Have at least one active kanban board (e.g., wurenji) with tasks.
  3. Wait for the gateway kanban dispatcher tick (60s interval) to overlap with dashboard API requests that read the same board.
  4. Within minutes to hours, kanban_db.connect() raises KanbanDbCorruptError.

Observed Behavior

Gateway logs:

ERROR gateway.run: kanban dispatcher: tick failed on board wurenji
hermes_cli.kanban_db.KanbanDbCorruptError:
  Refusing to open corrupt kanban DB at .../wurenji/kanban.db:
  integrity_check returned 'wrong # of entries in index idx_events_run'

Dashboard logs (500 Internal Server Error):

starlette.exceptions.HTTPException: 500 (the same KanbanDbCorruptError propagates to the web API)

In our environment the board directory accumulated 300+ corrupt backup files (kanban.db.corrupt.*.bak) in under 30 minutes because both services kept retrying and re-corrupting the DB.

Root Cause Analysis

  1. hermes_cli/kanban_db.py connect() opens the DB in journal_mode=WAL.
  2. Both gateway and dashboard processes call connect(board=slug) independently.
  3. When one process triggers a WAL checkpoint or toggles journal_mode, the other process can be mid-transaction, leaving the DB file and -shm / -wal files in an inconsistent state.
  4. kanban_db.py only performs post-hoc validation (_guard_existing_db_is_healthy). It detects corruption but does not prevent it.
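
The post-hoc validation described in item 4 can be approximated by hand when triaging a board. This is a minimal sketch, not the actual _guard_existing_db_is_healthy code (which this report does not show). It assumes only that the guard relies on SQLite's PRAGMA integrity_check, as the error message suggests; integrity_problems is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def integrity_problems(db_path: Path) -> list[str]:
    """Return the messages from PRAGMA integrity_check; empty list if healthy.

    Hypothetical helper for manual triage. Note that sqlite3.connect()
    creates an empty DB file if db_path does not exist yet.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # A healthy database reports exactly one row containing 'ok'.
    return [] if messages == ["ok"] else messages
```

Running it against kanban.db with both services stopped shows whether the file on disk is already damaged, which is all a check of this kind can tell you: it reports corruption after the fact and does nothing to prevent it.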

Suggested Fixes

  1. Single-writer architecture: Let gateway own the kanban DB exclusively. dashboard should read board state via an in-memory cache or an HTTP/internal API provided by gateway, not by opening the SQLite file directly.
  2. File locking: If direct DB access from both processes is required, use fcntl.flock or SQLite’s built-in busy_timeout + locking_mode to serialize open/close operations.
  3. Auto-recovery: When KanbanDbCorruptError is detected, automatically perform iterdump → rebuild instead of requiring manual service stop + python script.
  4. Separate DB paths: As a short-term workaround, allow dashboard to use a read-only replica or a separate DB path so the two processes never share the same -wal / -shm files.
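
Fix 2 could look roughly like the following. This is a sketch under stated assumptions, not a patch for kanban_db.py: locked_connection, the ".lock" sidecar file name, and the 5000 ms default are all hypothetical, and fcntl.flock is advisory and Unix-only, so every process that opens the DB would have to go through the same wrapper for it to help.

```python
import fcntl
import sqlite3
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_connection(db_path: Path, busy_timeout_ms: int = 5000):
    """Open db_path while holding an exclusive advisory lock on a sidecar file."""
    # Hypothetical sidecar name: "<db file name>.lock" next to the DB.
    lock_path = db_path.with_name(db_path.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        conn = sqlite3.connect(str(db_path))
        try:
            # Let SQLite wait on its own locks instead of failing immediately.
            conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
            yield conn
        finally:
            conn.close()
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The caller is still responsible for committing before the block exits. Holding the lock for the whole lifetime of the connection serializes open and close across processes, at the cost of making the dashboard wait while the dispatcher tick runs.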

Workaround (manual recovery)

sudo systemctl stop hermes-dashboard hermes-gateway
cd ~/.hermes/kanban/boards/<board>
# Use the venv Python (newer sqlite3) to dump and reload
python -c "
import sqlite3
src = 'kanban.db'
conn = sqlite3.connect(src)
# Fold the WAL into the main file first, then leave WAL mode
conn.execute('PRAGMA wal_checkpoint(TRUNCATE)')
conn.execute('PRAGMA journal_mode=DELETE')
with open('dump.sql', 'w') as f:
    for line in conn.iterdump():
        f.write(line + '\n')
conn.close()
# Rebuild a fresh DB from the dump and put it back into WAL mode
conn = sqlite3.connect('kanban.db.new')
with open('dump.sql') as f:
    conn.executescript(f.read())
conn.execute('PRAGMA journal_mode=WAL')
conn.close()
"
mv kanban.db kanban.db.corrupt.$(date +%Y%m%d_%H%M%S).bak
# Remove any leftover WAL/SHM files so they are not paired with the new DB
rm -f kanban.db-wal kanban.db-shm
mv kanban.db.new kanban.db
sudo systemctl start hermes-gateway
sudo systemctl start hermes-dashboard
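
The same dump-and-reload steps could be wrapped in a function as a starting point for the auto-recovery idea in suggested fix 3. This is a sketch only: rebuild_db is a hypothetical name, it assumes no other process has the DB open while it runs, and iterdump() can itself raise on a badly damaged file, so it is not guaranteed to recover every board.

```python
import sqlite3
import time
from pathlib import Path


def rebuild_db(db_path: Path) -> Path:
    """Dump db_path, reload it into a fresh file, and swap the files.

    Returns the path of the backup of the original DB. Hypothetical helper;
    assumes the gateway and dashboard are not using the DB concurrently.
    """
    new_path = db_path.with_name(db_path.name + ".new")
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = db_path.with_name(f"{db_path.name}.corrupt.{stamp}.bak")

    src = sqlite3.connect(str(db_path))
    try:
        # Fold pending WAL frames into the main file before dumping.
        src.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        dump = "\n".join(src.iterdump())
    finally:
        src.close()

    new_path.unlink(missing_ok=True)  # discard a leftover from an earlier attempt
    dst = sqlite3.connect(str(new_path))
    try:
        dst.executescript(dump)
        dst.execute("PRAGMA journal_mode=WAL")
    finally:
        dst.close()

    db_path.rename(backup)
    # Drop stale WAL/SHM files so they are not paired with the rebuilt DB.
    for suffix in ("-wal", "-shm"):
        db_path.with_name(db_path.name + suffix).unlink(missing_ok=True)
    new_path.rename(db_path)
    return backup
```

Without one of the serialization fixes above, automatic rebuilds would likely repeat the retry loop that produced the 300+ backup files, so this only makes sense once a single process owns the DB.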

Related

This issue is a code-level design/architecture bug, not an OS or SQLite bug.

Metadata

Labels

  • P3: Low (cosmetic, nice to have)
  • comp/plugins: Plugin system and bundled plugins
  • duplicate: This issue or pull request already exists
  • type/bug: Something isn't working
