Description
Every Hermes profile gateway opens a SQLite connection to the shared kanban.db on startup, regardless of whether the profile has kanban.dispatch_in_gateway enabled. On a multi-profile setup with 7+ active gateways, this creates 7+ concurrent SQLite connections to the same WAL-mode database.
Root Cause
When one process runs hermes kanban init (which deletes and recreates the file), gateway processes that are already running keep their file handles on the old inode. New processes write to the new inode while the older processes may still write to the stale one, which results in "database disk image is malformed" and "disk I/O error" failures.
lsof output from a typical setup:
7 python processes × ~40 file descriptors each = 280+ open handles on kanban.db
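The stale-handle condition can be checked from inside a process by comparing the inode behind an open descriptor with the inode currently at the path. The following is a minimal sketch that assumes a POSIX host; handle_is_stale and demo are hypothetical names used for illustration and are not part of Hermes.

```python
import os
import tempfile


def handle_is_stale(fd: int, path: str) -> bool:
    """Return True if the open descriptor no longer refers to the file at `path`."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # The path was deleted and not recreated: the handle is orphaned.
        return True
    held = os.fstat(fd)
    return (held.st_dev, held.st_ino) != (on_disk.st_dev, on_disk.st_ino)


def demo() -> tuple:
    """Simulate delete-and-recreate while a descriptor is still open."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kanban.db")  # stand-in file, not a real database
        with open(path, "wb") as f:
            f.write(b"old")
        fd = os.open(path, os.O_RDONLY)  # plays the role of a long-running gateway
        try:
            before = handle_is_stale(fd, path)
            os.remove(path)  # what the issue says `hermes kanban init` does
            with open(path, "wb") as f:
                f.write(b"new")
            after = handle_is_stale(fd, path)
        finally:
            os.close(fd)
        return before, after
```

Because the old descriptor keeps the original inode alive, the recreated file receives a different inode, so demo() reports the handle as fresh before the recreate and stale after it. A gateway could run a check like this periodically and reopen its connection when the handle goes stale.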
Suggested Fixes (any would help)
- Gateways with dispatch_in_gateway: false should skip kanban DB initialization entirely
- Add a per-profile config flag like kanban.enabled: false to prevent DB connection on non-participating profiles
- Single-writer proxy: only one process (the architect gateway) writes to SQLite; other gateways communicate via IPC
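The first two suggestions amount to gating the connection on profile config. Here is a rough sketch, assuming the profile config is available as a nested dict; open_kanban_if_needed is a hypothetical helper, kanban.enabled is the flag proposed above (it does not exist yet), and the defaults chosen here are assumptions, not Hermes behavior.

```python
import sqlite3
from typing import Optional


def open_kanban_if_needed(profile_cfg: dict, db_path: str) -> Optional[sqlite3.Connection]:
    """Open kanban.db only for profiles that actually take part in dispatch."""
    kanban_cfg = profile_cfg.get("kanban", {})
    # Assumed defaults: the proposed `enabled` flag is on unless set to false,
    # and dispatch is off unless explicitly requested.
    enabled = kanban_cfg.get("enabled", True)
    dispatch = kanban_cfg.get("dispatch_in_gateway", False)
    if not (enabled and dispatch):
        return None  # non-participating profile: never touch the database
    return sqlite3.connect(db_path)
```

With this shape, a profile such as a-dev would hold no handle on kanban.db at all, so a later delete-and-recreate could not leave it pointing at a stale inode.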
Workaround Deployed
We wrote a cron-based health checker (kanban_health.py) that runs PRAGMA integrity_check every 6 hours, kills zombie processes holding stale handles, runs hermes kanban init, and restarts all gateways. This is a band-aid, not a proper fix.
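For reference, the integrity-check step of such a health checker can be sketched with the standard sqlite3 module. This is not the actual kanban_health.py, which is not shown here; integrity_ok is a hypothetical name, and the process-killing and restart steps are left out.

```python
import sqlite3


def integrity_ok(db_path: str) -> bool:
    """Return True only if PRAGMA integrity_check reports a single 'ok' row."""
    try:
        # Note: connecting to a path that does not exist creates an empty database.
        conn = sqlite3.connect(db_path, timeout=5.0)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for errors such as "database disk image is malformed"
        # or "file is not a database".
        return False
    return rows == [("ok",)]
```

A cron job could call integrity_ok on the shared kanban.db and trigger the recovery steps only when it returns False.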
Environment
- Profiles: architect (dispatch=true), wikid (dispatch=true), a-dev, a-creative, a-eval, a-view, intel-pilot (dispatch=false, but still holding open handles)
- All gateways running on same Linux host, same user