Summary
The kanban dispatcher embedded in hermes gateway run wedges after 3-4 verifier-subprocess completions, requiring a gateway restart to recover. The evidence points to a race between the gateway's multi-threaded SQLite connection cycling and verifier-subprocess connection lifecycles in WAL mode. Switching the kanban DB to journal_mode=DELETE eliminates the wedge in our environment; we've validated the patch under load. Submitting as an issue first so the maintainer can decide on the right fix shape — a more surgical WAL-preserving fix may be preferable.
Environment
- Hermes Agent v0.14.0 (2026.5.16), commit d61785889
- Linux 6.17.0-29-generic, Ubuntu 24.04
- Filesystem: local XFS (rw,noatime) — not NFS/SMB
- SQLite via the Python 3.11 sqlite3 module
- Affected: any board with multi-process write access (gateway dispatcher + verifier subprocesses)
Symptom (measured)
The kanban dispatcher embedded in hermes gateway run ticks every 60s. After 3-4 verifier-subprocess completions on a single board, every subsequent tick fails with sqlite3.OperationalError: disk I/O error in release_stale_claims (hermes_cli/kanban_db.py:2399). Failures continue indefinitely until hermes gateway restart.
Repro
- Run hermes gateway start against a board with active dispatch (kanban.dispatch_in_gateway: true).
- Dispatch ~4 worker tasks whose verifier subprocesses each open their own kanban DB connection (e.g., a custom verifier profile that calls kanban_block from inside a skill).
- After 3-4 completions, dispatcher ticks start failing with the I/O error.
- Every subsequent tick fails until hermes gateway restart.
Direct evidence (measured)
lsof on wedged gateway:
python <PID> orion DEL-r REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm
python <PID> orion 26u REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)
python <PID> orion 27ur REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm (deleted)
python <PID> orion 31u REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted) [different inode]
python <PID> orion 35u REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted) [different inode]
Three different deleted -wal inodes in one gateway lifetime. Meanwhile:
- PRAGMA integrity_check and quick_check: both ok throughout
- Fresh sqlite3.connect() from a different process: same SELECT runs cleanly
- hermes kanban dispatch --dry-run from a fresh CLI process: works
- Only the gateway's long-lived process's connections wedge
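The lsof check above can also be scripted so it can run on a schedule against the gateway PID. The following is a minimal sketch, assuming Linux and permission to read /proc/<pid>/fd; the function names are hypothetical and not part of Hermes. It only sees open file descriptors, so the mmap'd -shm entry that lsof reports as DEL would have to be looked up in /proc/<pid>/maps instead.

```python
import os
from pathlib import Path


def deleted_wal_shm_targets(fd_targets):
    """Return {fd: target} for entries that point at an unlinked -wal/-shm file.

    fd_targets maps an fd name to its readlink() target. Linux appends
    " (deleted)" to the target once the underlying file has been unlinked.
    """
    suffix = " (deleted)"
    hits = {}
    for fd, target in fd_targets.items():
        if not target.endswith(suffix):
            continue
        real = target[: -len(suffix)]
        if real.endswith(("-wal", "-shm")):
            hits[fd] = target
    return hits


def read_fd_targets(pid):
    """Read the /proc/<pid>/fd symlinks (Linux only)."""
    targets = {}
    for entry in Path(f"/proc/{pid}/fd").iterdir():
        try:
            targets[entry.name] = os.readlink(entry)
        except OSError:
            # The fd was closed between listing and readlink; skip it.
            continue
    return targets
```

Calling deleted_wal_shm_targets(read_fd_targets(pid)) on a healthy gateway should return an empty dict; a non-empty result would correspond to the "(deleted)" rows in the lsof output.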
Root cause (our analysis — measured wedge + lsof evidence + inferred mechanism)
The gateway has ~7 threads. All 7 kanban DB connect sites use proper try/finally: close() patterns — no individual connection leak (we checked all of them):
- gateway/run.py:4590: notifier-watcher (~5s loop)
- gateway/run.py:4842, 4857, 4878: sub/unsub/advance helpers
- gateway/run.py:5164: dispatcher tick (60s)
- gateway/run.py:5241: spawn-budget watcher
- gateway/run.py:9321: auto-subscribe handler
The evidence is consistent with this mechanism: in WAL mode, SQLite shares a per-process unixShmNode and SHM mmap across all connections to the same DB within one process. The lsof output above shows the gateway holding multiple deleted-inode FDs on -wal / -shm — three different deleted -wal inodes per gateway lifetime, matching the 3-4 verifier-subprocess completions per wedge cycle.
Our inferred mechanism: when (a) a verifier subprocess opens its own connection to call kanban_block, (b) all gateway threads happen to be momentarily between connect()s, and (c) the verifier exits as the last DB-holder, SQLite's "last connection cleanup" unlinks -wal and -shm. New gateway connections see fresh files on disk, but the gateway's still-cached unixShmNode references the deleted inode. Subsequent SELECTs in any gateway connection fail with what we believe is SQLITE_IOERR_SHMMAP surfacing as the visible disk I/O error.
The wedge, the deleted-inode FDs, and the fix's effectiveness are measured. The specific unixShmNode poisoning mechanism is inferred from those measurements + SQLite's documented per-process WAL state semantics. We did not directly capture the SQLite internal error code (would require strace + a recompile to expose).
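One possibly cheaper way to capture the internal code: since Python 3.11, sqlite3 exceptions carry sqlite_errorcode and sqlite_errorname attributes holding the SQLite extended error code. The sketch below logs them around a query. The helper names are hypothetical, and we have not run this against a wedged gateway, so whether it would actually report SQLITE_IOERR_SHMMAP is untested.

```python
import sqlite3


def describe_sqlite_error(exc):
    """Return (errorname, errorcode, message) for a sqlite3 exception.

    sqlite_errorname / sqlite_errorcode exist on Python 3.11+; on older
    interpreters both values fall back to None.
    """
    return (
        getattr(exc, "sqlite_errorname", None),
        getattr(exc, "sqlite_errorcode", None),
        str(exc),
    )


def run_logged(conn, sql, log=print):
    """Execute sql; on OperationalError, log the SQLite error details and re-raise."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        name, code, message = describe_sqlite_error(exc)
        log(f"sqlite error name={name} code={code} message={message}")
        raise
```

Wrapping the failing SELECT in release_stale_claims this way would record the extended code next to the visible disk I/O error message, without changing behavior since the exception is re-raised.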
Fix we applied (works in our environment, may not be the right one for upstream)
The change in hermes_cli/kanban_db.py (around line 1049) replaces the apply_wal_with_fallback call with a single PRAGMA:
 with _INIT_LOCK:
-    # WAL doesn't work on network filesystems (NFS/SMB/FUSE). Shared helper
-    # falls back to DELETE with one WARNING so kanban stays usable there.
-    from hermes_state import apply_wal_with_fallback
-    apply_wal_with_fallback(conn, db_label=f"kanban.db ({path.name})")
+    conn.execute("PRAGMA journal_mode=DELETE")
     conn.execute("PRAGMA synchronous=NORMAL")
     conn.execute("PRAGMA foreign_keys=ON")
Other Hermes DBs that use apply_wal_with_fallback (state.db, memory_store.db, response_store.db) are unaffected — only kanban DBs switch.
Tradeoff this fix accepts
DELETE mode trades away WAL's reader/writer concurrency: a writer takes an exclusive lock on the DB, so reads and writes block each other. For the kanban DB this cost looks negligible: the DB is small (~120KB in our case), the write rate is low (60s dispatcher tick + event-driven helpers, all sub-second), and the multi-process write pattern is exactly what makes WAL fragile here. The apply_wal_with_fallback call (which exists to fall back to DELETE on WAL-incompatible filesystems like NFS) is also removed, so DELETE mode is unconditional with this diff and the NFS-fallback log message no longer fires.
A more surgical fix may preserve WAL. Options the maintainer might prefer:
- Shared long-lived gateway connection per board, used by all gateway threads (would need locking) — keeps WAL but eliminates the multi-connection-per-process pattern.
- Checkpoint discipline: pin the WAL/SHM lifecycle so verifier subprocesses can't trigger the "last connection cleanup" unlink (e.g., have the gateway hold a sentinel connection open for the lifetime of the process).
- Detect-and-reopen on I/O error: catch the symptom and refresh connections, but doesn't fix the underlying race.
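To make the sentinel-connection option concrete, here is a minimal sketch. SentinelConnection is a hypothetical helper, not existing Hermes code, and we have not tested whether it prevents the wedge; it only illustrates holding one idle connection per board for the lifetime of the gateway so that a verifier subprocess is never the last DB-holder.

```python
import sqlite3


class SentinelConnection:
    """Hypothetical helper: keep one idle connection open on a board DB."""

    def __init__(self, db_path):
        # check_same_thread=False only so close() may be called from a
        # different thread at shutdown; the sentinel never runs real queries.
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        # A read forces SQLite to open the WAL and SHM files now, so this
        # connection counts as a holder instead of a lazily opened handle.
        self._conn.execute("SELECT count(*) FROM sqlite_master").fetchone()

    def close(self):
        self._conn.close()
```

The gateway would create one sentinel per board at startup and close it at shutdown. The short-lived per-thread connections would stay as they are, which keeps this a smaller change than a shared locked connection.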
We picked DELETE because it removes the failure class by construction; we don't have visibility into upstream's wider design considerations (NFS support, expected write concurrency, etc.). The patch above is offered as one resolution; happy to defer to the maintainer's preferred approach.
Validation (measured)
After applying the patch + gateway restart:
- Existing -wal file auto-removed on first connect (PRAGMA migration worked)
- 3-task stress test: all dispatched in one tick, completed cleanly, zero I/O errors
- lsof shows 0 -wal/-shm FDs on the gateway across multiple verifier-subprocess cycles
- 0 dispatcher tick failures across the validation window
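The on-disk part of these checks can be repeated with a short script. This is a sketch with a hypothetical function name; it reports the journal mode SQLite sees and whether the -wal / -shm side files currently exist next to the DB.

```python
import os
import sqlite3


def journal_state(db_path):
    """Return (journal_mode, wal_exists, shm_exists) for the DB at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA journal_mode with no argument only reports the current mode.
        mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()
    return mode, os.path.exists(f"{db_path}-wal"), os.path.exists(f"{db_path}-shm")
```

After the patch and one gateway connect, the expectation for a kanban DB would be ("delete", False, False).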
Happy to open a PR with the change shown above if the DELETE-mode direction is what you want — or to take a different cut at it if you'd prefer one of the surgical alternatives.