Summary
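When ~/.hermes is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's PRAGMA journal_mode=WAL fails with sqlite3.OperationalError: locking protocol. Every component that opens state.db or kanban.db swallows this error silently, and the user is left with:
- /resume, /title, /history, and /branch all responding "Session database not available." with no explanation
- a hermes update snapshot warning: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
- the duplicate column name: consecutive_failures kanban migration race (#21708 / #21374) firing continuously because the migration is retried on every tick
The user has no way to know why any of this is happening. Hermes does not check for WAL compatibility and does not attempt a fallback.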
Evidence
Real user debug report. Their stat -f ~/.hermes output and mount line:
File: "/home/mormio/.hermes"
Type: nfs
ID: 0 Namelen: 255
172.26.224.200:d2dfac12/home on /home type nfs
(rw, relatime, vers=3, rsize=1048576, wsize=1048576, namelen=255,
hard, forcerdirplus, proto=tcp, nconnect=4, timeo=600, retrans=2,
sec=sys, mountaddr=172.26.224.200, mountvers=3, mountport=20048,
mountproto=udp, local_lock=none, addr=172.26.224.200)
NFSv3 over TCP with local_lock=none — the exact configuration SQLite upstream documents as incompatible with WAL:
SQLite databases in WAL mode do not work over a network filesystem.
The resulting log entries in the same user's session:
2026-05-08 13:41:11 WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
2026-05-08 13:45:05 ERROR gateway.run: kanban dispatcher: tick failed on board default
File "hermes_cli/kanban_db.py", line 878, in connect
conn.execute("PRAGMA journal_mode=WAL")
sqlite3.OperationalError: locking protocol
2026-05-08 13:46:46 WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
2026-05-08 13:46:59 WARNING cli: Failed to initialize SessionDB — session will NOT be indexed for search: locking protocol
2026-05-08 13:47:08 WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
The kanban dispatcher retried this failed migration continuously until the user restarted the gateway.
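To reproduce the failure outside Hermes, a small probe along these lines may help. It is only a sketch: probe_wal is a hypothetical helper, not part of Hermes, and it assumes the directory is writable. It creates a throwaway database in the given directory, attempts the same pragma, and reports either the journal mode SQLite returns or the error text.

```python
import os
import sqlite3
import tempfile


def probe_wal(directory):
    """Try to enable WAL on a throwaway database inside `directory`.

    Returns (ok, detail): ok is True when SQLite reports WAL as active;
    detail is the reported journal mode or the error text.
    """
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            mode = row[0].lower()
            return mode == "wal", mode
        except sqlite3.OperationalError as exc:
            # On the reporter's NFS mount this is where
            # "locking protocol" would be expected to surface.
            return False, str(exc)
        finally:
            conn.close()
    finally:
        # Remove the probe database and any sidecar files SQLite left behind.
        for suffix in ("", "-wal", "-shm", "-journal"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
```

Running probe_wal(os.path.expanduser("~/.hermes")) on an affected machine should show whether the mount itself rejects WAL, independent of any Hermes code path.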
Root cause
Two files hit PRAGMA journal_mode=WAL unconditionally with no fallback:
- hermes_state.py:201: SessionDB.__init__ sets journal_mode=WAL. On failure the caller (SessionDB() in cli.py:2379, gateway/run.py:1194, tui_gateway/server.py) catches the exception and sets _session_db = None, but never tries a different journal mode.
- hermes_cli/kanban_db.py:920: connect() sets journal_mode=WAL. On failure the exception bubbles up to the kanban dispatcher tick, which is retried every 60s forever.
The failure is silent downstream:
- Gateway logs at DEBUG (gateway/run.py:1196): logger.debug("SQLite session store not available: %s", e). This is invisible in errors.log.
- CLI logs at WARNING (correct): visible but still generic.
- The /resume error message hard-codes "Session database not available." with no cause. Nine such sites across cli.py and gateway/run.py:
  - cli.py:5368, 5479, 6755, 6770
  - gateway/run.py:10186, 10224, 10438, 10482, 10569
Who this affects
- Users with ~/.hermes on NFS (shared university clusters, enterprise Linux, cloud dev VMs mounting team home dirs)
- Users with ~/.hermes on SMB/CIFS, some FUSE mounts, or WSL1
- Anyone whose state.db / kanban.db ends up in a container bind-mount where locking semantics differ
The failure mode presents to the user as "/resume just doesn't work" with no actionable diagnostic. Support burden: every affected user has to share logs with a maintainer to figure out what's broken.
Proposed fix
Three changes, all in one PR:
- Fall back to journal_mode=DELETE on WAL failure. DELETE is SQLite's default journal mode and predates WAL; it works on NFS. Concurrency drops (no concurrent readers during writes) but the feature works. Apply the fallback in both hermes_state.py and hermes_cli/kanban_db.py. Log a single WARNING on fallback explaining why.
- Surface the cause in /resume and related error messages. Capture the underlying OperationalError on the failing init and include it in the user-facing string. Instead of "Session database not available.", show "Session database not available: locking protocol (state.db may be on a network filesystem; see <docs>).".
- Bump the gateway/run.py:1196 log level from DEBUG to WARNING so the failure appears in errors.log, matching the CLI path, which already does this correctly.
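Deliberately out of scope for the PR
- Filesystem-type detection via statvfs / /proc/mounts. Fragile across Linux/macOS/WSL/Docker overlay FS. The try/except fallback approach is OS-agnostic and more robust.
- hermes doctor integration. Separate concern, separate PR.
- The duplicate column name: consecutive_failures kanban migration race (#21708 / #21374). It has a separate root cause; it fires because of this bug (WAL failure → migration retried forever), and fixing the WAL issue stops the cascade without fixing the migration itself.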
Acceptance criteria
- SessionDB() succeeds on NFS via DELETE-mode fallback, with a single WARNING logged once per process.
- kanban_db.connect() succeeds on NFS via the same fallback.
- /resume on a system where SessionDB genuinely cannot open returns a message containing the underlying cause.
- New tests cover:
  - WAL pragma raising OperationalError("locking protocol") → DELETE fallback fires, DB is usable.
  - /resume error string includes the captured cause when _session_db is None.
- No regression in existing SessionDB / kanban tests.
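For the /resume criterion, one possible shape is to keep the exception from the failed init and format it into the message. The helpers below are hypothetical (open_session_db and session_db_unavailable_message are illustrative names, not existing Hermes functions), and the hint text is only a suggestion based on the wording proposed above.

```python
import sqlite3


def open_session_db(factory):
    """Call `factory()` and return (db, cause); cause is None on success."""
    try:
        return factory(), None
    except sqlite3.Error as exc:  # OperationalError is a subclass
        return None, exc


def session_db_unavailable_message(cause=None):
    """Build the user-facing string, appending the captured cause if any."""
    base = "Session database not available"
    if cause is None:
        return base + "."
    hint = ""
    if "locking protocol" in str(cause):
        hint = " (state.db may be on a network filesystem; see <docs>)"
    return f"{base}: {cause}{hint}."
```

A test for the criterion could pass a factory that raises sqlite3.OperationalError("locking protocol") and assert that the resulting message contains that text. Routing all nine hard-coded sites through one formatter would also keep the wording consistent.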
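References
- #21708: kanban dispatcher: 'duplicate column name: consecutive_failures' on first tick after gateway restart
- #21374: Race condition in kanban _migrate_add_optional_columns on gateway startup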