Bug Description
Gateway Kanban dispatcher intermittently reports kanban.db is not a valid SQLite database and disables dispatch for the board. The DB auto-recovers after some time, but during the disabled window, ready tasks sit unprocessed.
Gateway log pattern (every few minutes):
16:31:34 kanban dispatcher: spawned=5 ← dispatch succeeds
16:33:35 kanban.db is not a valid SQLite database; disabling dispatch
16:51:37 kanban dispatcher: spawned=1 ← DB recovers, dispatch resumes
16:53:38 kanban.db is not a valid SQLite database; disabling dispatch ← recurs
The corruption always appears AFTER worker subprocesses complete (spawned=N → next tick: corrupted).
Root Cause
Three contributing factors in hermes_cli/kanban_db.py + gateway/run.py:
1. No explicit WAL checkpoint management. kanban_db.py has zero PRAGMA wal_checkpoint calls anywhere. In contrast, hermes_state.py (SessionDB) properly manages WAL with _try_wal_checkpoint() every 50 writes and in close(). Workers crash without proper connection close → WAL frames partially written → next connect() reads inconsistent WAL → sqlite3.DatabaseError.
2. synchronous=NORMAL in kanban_db.py connect(). With NORMAL in WAL mode, SQLite does not fsync the WAL on every commit; it syncs only around checkpoints. If a worker process crashes between writing WAL frames and the checkpoint, the WAL may contain partially written frames.
3. Fingerprint only tracks the .db file, not -wal (gateway/run.py _board_db_fingerprint()). If only -wal is corrupted but the .db mtime/size are unchanged → fingerprint unchanged → board stays disabled until the .db file changes or the gateway restarts.
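Factor 3 can be illustrated with a minimal sketch. It assumes the fingerprint is a (mtime_ns, size) pair taken from the .db file alone; the real _board_db_fingerprint() may differ, and the helper names below are hypothetical. With automatic checkpoints turned off, a committed write lands in the -wal file and leaves the main file untouched, so a .db-only fingerprint does not move.

```python
import os
import sqlite3
import tempfile


def db_only_fingerprint(path):
    """Hypothetical stand-in: fingerprint built from the .db file only."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def demo_stale_fingerprint():
    """Return (fingerprint_before, fingerprint_after, wal_size) around one write."""
    path = os.path.join(tempfile.mkdtemp(), "kanban.db")
    # isolation_level=None puts the connection in autocommit mode,
    # so every statement below commits on its own.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("PRAGMA journal_mode=WAL").fetchone()
        # Disable automatic checkpoints so frames stay in the WAL.
        conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
        # Move everything written so far into the main file before measuring.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()

        before = db_only_fingerprint(path)
        conn.execute("INSERT INTO tasks (status) VALUES ('ready')")
        after = db_only_fingerprint(path)
        wal_size = os.path.getsize(path + "-wal")
    finally:
        conn.close()
    return before, after, wal_size
```

In this sketch the two fingerprints compare equal even though the WAL has grown, which is the situation in which a dispatcher keyed on the .db file alone would not notice any change.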
Steps to Reproduce
- Run Hermes Gateway with Kanban dispatch enabled
- Create Kanban tasks that cause worker protocol violations (worker exits without kanban_complete())
- Gateway spawns workers → some crash
- Next dispatcher tick: connect() fails with "database disk image is malformed"
- Board disabled until .db file mtime changes or gateway restarts
Expected Behavior
Worker crashes should not leave Kanban DB unreadable. WAL should be checkpointed to prevent partial-frame corruption from blocking the dispatcher.
Actual Behavior
After worker crashes, the Gateway sees the DB as corrupted, disables dispatch, and cannot recover until the .db file is externally modified or the gateway restarts.
Proposed Fix
- Add WAL checkpoint on connection close: in gateway/run.py before conn.close(), run conn.execute("PRAGMA wal_checkpoint(PASSIVE)") (mirrors SessionDB.close() at hermes_state.py:458).
- Include -wal file in fingerprint: track (wal_mtime_ns, wal_size) so the dispatcher auto-recovers when only the WAL is corrupted.
- Consider synchronous=FULL: prevents WAL checkpoint crashes from corrupting main DB (trade-off: slightly slower writes).
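The first two items could look roughly like the sketch below. It is an illustration under assumptions, not the actual gateway code: board_db_fingerprint, _stat_pair, and close_with_checkpoint are hypothetical names, and the real _board_db_fingerprint() may return a different shape.

```python
import os
import sqlite3


def _stat_pair(path):
    """Return (mtime_ns, size) for path, or (0, 0) if the file is absent."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return (0, 0)
    return (st.st_mtime_ns, st.st_size)


def board_db_fingerprint(db_path):
    """Hypothetical fingerprint covering both the main file and its -wal.

    Result: (db_mtime_ns, db_size, wal_mtime_ns, wal_size).
    """
    return _stat_pair(db_path) + _stat_pair(db_path + "-wal")


def close_with_checkpoint(conn):
    """Attempt a PASSIVE WAL checkpoint, then close the connection.

    A checkpoint failure (for example, a still-open transaction) is
    swallowed so that close() always runs.
    """
    try:
        conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchall()
    except sqlite3.Error:
        pass
    finally:
        conn.close()
```

A PASSIVE checkpoint does not wait for readers or writers, so it should not stall the dispatcher, but it may leave some frames in the WAL when other connections are active. The checkpoint also has to run outside an open transaction, so callers would commit or roll back first.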
Environment
- Hermes Agent v0.14.0
- macOS 15.7.4, Python 3.11.11, SQLite 3.47.1