[Bug] Kanban DB corruption when gateway and dashboard open the same board DB concurrently (WAL mode) #33169

### Summary
When both `hermes-gateway` and `hermes-dashboard` systemd services are running, the SQLite kanban DB (`kanban.db`) for a board quickly becomes corrupted with integrity-check errors like `wrong # of entries in index idx_events_run`. The root cause is that **two independent processes open the same SQLite DB in WAL mode without any file locking or access serialization**.

### Environment
- Hermes Agent: v0.13.0
- OS: CentOS 7 (kernel 3.10, glibc 2.17)
- SQLite: 3.50.4 (bundled in venv)
- Setup: systemd services — `hermes-gateway.service` + `hermes-dashboard.service` (PartOf)
- Board: `wurenji` (but any shared board is affected)

### Steps to Reproduce
1. Enable both gateway and dashboard as systemd services pointing to the same `~/.hermes` directory.
2. Have at least one active kanban board (e.g., `wurenji`) with tasks.
3. Wait for the gateway kanban dispatcher tick (60s interval) to overlap with dashboard API requests that read the same board.
4. Within minutes to hours, `kanban_db.connect()` raises `KanbanDbCorruptError`.

### Observed Behavior
**Gateway logs:**
```
ERROR gateway.run: kanban dispatcher: tick failed on board wurenji
hermes_cli.kanban_db.KanbanDbCorruptError:
  Refusing to open corrupt kanban DB at .../wurenji/kanban.db:
  integrity_check returned 'wrong # of entries in index idx_events_run'
```

**Dashboard logs (500 Internal Server Error):**
```
starlette.exceptions.HTTPException: 500 — the same KanbanDbCorruptError propagates to the web API
```

In our environment the board directory accumulated **300+ corrupt backup files** (`kanban.db.corrupt.*.bak`) in under 30 minutes because both services kept retrying and re-corrupting the DB.

### Root Cause Analysis
1. `hermes_cli/kanban_db.py` `connect()` opens the DB in `journal_mode=WAL`.
2. Both `gateway` and `dashboard` processes call `connect(board=slug)` independently.
3. When one process triggers a WAL checkpoint or toggles `journal_mode`, the other process can be mid-transaction, leaving the DB file and `-shm` / `-wal` files in an inconsistent state.
4. `kanban_db.py` only performs **post-hoc** validation (`_guard_existing_db_is_healthy`). It detects corruption but does not prevent it.
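To make item 4 concrete, here is a minimal sketch of what a post-hoc health check amounts to. It is not the actual body of `_guard_existing_db_is_healthy`; the helper name `integrity_problems` is hypothetical, and the only assumption is that the guard relies on `PRAGMA integrity_check`, as the error message above suggests. A check like this can only report damage after the fact.

```python
import sqlite3
from typing import List


def integrity_problems(db_path: str) -> List[str]:
    """Return integrity_check findings; an empty list means no problems were reported.

    Hypothetical helper for illustration, not part of hermes_cli.
    """
    conn = sqlite3.connect(db_path)
    try:
        # integrity_check yields a single row 'ok' for a healthy DB,
        # otherwise one row per problem found.
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

Calling `integrity_problems('kanban.db')` on an affected board would be expected to return messages such as the `idx_events_run` one quoted in the gateway log.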

### Suggested Fixes
1. **Single-writer architecture**: Let `gateway` own the kanban DB exclusively. `dashboard` should read board state via an in-memory cache or an HTTP/internal API provided by gateway, not by opening the SQLite file directly.
2. **File locking**: If direct DB access from both processes is required, use `fcntl.flock`, or SQLite's own `PRAGMA busy_timeout` together with `PRAGMA locking_mode`, to serialize access to the DB file across the two processes.
3. **Auto-recovery**: When `KanbanDbCorruptError` is detected, automatically perform `iterdump` → rebuild instead of requiring manual service stop + python script.
4. **Separate DB paths**: As a short-term workaround, allow dashboard to use a read-only replica or a separate DB path so the two processes never share the same `-wal` / `-shm` files.
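Suggested fix 2 could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `locked_connection` and the `.lock` sidecar file are hypothetical and do not exist in `hermes_cli`, and both gateway and dashboard would have to go through the same helper for the lock to mean anything.

```python
import contextlib
import fcntl
import os
import sqlite3


@contextlib.contextmanager
def locked_connection(db_path, timeout_s=5.0):
    """Hold an exclusive advisory lock on a sidecar file while the DB is open.

    Hypothetical helper: every process that touches the DB must use it.
    """
    lock_fd = os.open(db_path + ".lock", os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other process (or file descriptor) holds the lock.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        # timeout= is how long SQLite waits on a locked DB before raising.
        conn = sqlite3.connect(db_path, timeout=timeout_s)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            yield conn
        finally:
            conn.close()
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
    finally:
        os.close(lock_fd)
```

Usage would be `with locked_connection(path) as conn: ...` in both services. This is deliberately coarse: it serializes every open/use/close cycle, trading read concurrency for safety, so it fits best as a stopgap until a single-writer design (fix 1) is in place.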

### Workaround (manual recovery)
```bash
sudo systemctl stop hermes-dashboard hermes-gateway
cd ~/.hermes/kanban/boards/<board>
# Use the venv Python (newer sqlite3) to dump and reload
python -c "
import sqlite3
# Dump the old DB. Checkpoint the WAL first, then leave WAL mode so that
# no -wal/-shm files stay behind for the old file.
conn = sqlite3.connect('kanban.db')
conn.execute('PRAGMA wal_checkpoint(TRUNCATE)')
conn.execute('PRAGMA journal_mode=DELETE')
with open('dump.sql', 'w') as f:
    for line in conn.iterdump():
        f.write(line + '\n')
conn.close()
# Rebuild a fresh DB from the dump and switch it back to WAL mode.
conn = sqlite3.connect('kanban.db.new')
with open('dump.sql') as f:
    conn.executescript(f.read())
conn.execute('PRAGMA journal_mode=WAL')
conn.close()
"
mv kanban.db kanban.db.corrupt.$(date +%Y%m%d_%H%M%S).bak
mv kanban.db.new kanban.db
sudo systemctl start hermes-gateway
sudo systemctl start hermes-dashboard
```
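Before the two `mv` commands in the recovery above, it may be worth checking that the rebuilt file actually contains the data. A minimal sketch, assuming only the file names used in the workaround (`table_row_counts` is a hypothetical helper, not part of Hermes):

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for all user tables in the DB."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
            if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
        ]
        counts = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier safely
            counts[name] = conn.execute("SELECT count(*) FROM " + quoted).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Comparing `table_row_counts('kanban.db')` with `table_row_counts('kanban.db.new')` gives a rough sanity check. Counts read from a corrupt source are not fully trustworthy, so a mismatch is a prompt for manual inspection rather than proof that the rebuild lost data.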

### Related
- #33113 — fix(gateway): close kanban DB connection after dispatch tick
- #33159 — [Bug] Kanban plugin: kanban.db file descriptor leak

This issue is a **code-level design/architecture bug**, not an OS or SQLite bug.

