SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban #22032

@kshitijk4poor

Summary

When ~/.hermes is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's PRAGMA journal_mode=WAL fails with sqlite3.OperationalError: locking protocol. Every component that opens state.db or kanban.db swallows this error silently, and the user is left with non-working /resume, /title, /history, /branch, and kanban features.

The user has no way to know why any of this is happening. Hermes does not check for WAL compatibility and does not attempt a fallback.
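A compatibility check of the kind Hermes currently lacks could look roughly like the sketch below. It is a minimal illustration, not existing Hermes code: wal_supported is a hypothetical helper name, and it probes a throwaway database next to state.db rather than touching state.db itself.

```python
import sqlite3
import tempfile
from pathlib import Path


def wal_supported(directory: Path) -> bool:
    """Return True if a scratch database in `directory` can switch to WAL.

    Hypothetical helper: it probes a throwaway file so the real
    state.db / kanban.db are never modified by the check.
    """
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        conn = sqlite3.connect(str(Path(scratch) / "probe.db"))
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            # The pragma returns the mode actually in effect, and SQLite can
            # decline the switch without raising, so compare the value.
            return row is not None and str(row[0]).lower() == "wal"
        except sqlite3.OperationalError:
            # For example "locking protocol" on a network filesystem.
            return False
        finally:
            conn.close()
```

Calling wal_supported(Path.home() / ".hermes") once at startup would let Hermes pick a journal mode, or print a diagnostic, before any component opens the real databases.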

Evidence

Real user debug report. Their stat -f ~/.hermes output and mount line:

File: "/home/mormio/.hermes"
Type: nfs
ID: 0  Namelen: 255
172.26.224.200:d2dfac12/home on /home type nfs
  (rw, relatime, vers=3, rsize=1048576, wsize=1048576, namelen=255,
   hard, forcerdirplus, proto=tcp, nconnect=4, timeo=600, retrans=2,
   sec=sys, mountaddr=172.26.224.200, mountvers=3, mountport=20048,
   mountproto=udp, local_lock=none, addr=172.26.224.200)

This is NFSv3 over TCP with local_lock=none, i.e. a network filesystem, which SQLite upstream documents as incompatible with WAL:

SQLite databases in WAL mode do not work over a network filesystem.

The resulting log entries in the same user's session:

2026-05-08 13:41:11  WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
2026-05-08 13:45:05  ERROR gateway.run: kanban dispatcher: tick failed on board default
    File "hermes_cli/kanban_db.py", line 878, in connect
      conn.execute("PRAGMA journal_mode=WAL")
  sqlite3.OperationalError: locking protocol
2026-05-08 13:46:46  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
2026-05-08 13:46:59  WARNING cli: Failed to initialize SessionDB — session will NOT be indexed for search: locking protocol
2026-05-08 13:47:08  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol

The kanban dispatcher retried this failed migration continuously until the user restarted the gateway.

Root cause

Two files hit PRAGMA journal_mode=WAL unconditionally with no fallback:

  • hermes_state.py:201: SessionDB.__init__ sets journal_mode=WAL. On failure the caller (SessionDB() in cli.py:2379, gateway/run.py:1194, tui_gateway/server.py) catches the exception and sets _session_db = None, but never tries a different journal mode.
  • hermes_cli/kanban_db.py:920: connect() sets journal_mode=WAL. On failure the exception bubbles to the kanban dispatcher tick, which is retried every 60s forever.

The failure is silent downstream:

  • Gateway logs at DEBUG (gateway/run.py:1196): logger.debug("SQLite session store not available: %s", e) — invisible in errors.log.
  • CLI logs at WARNING (correct) — visible but still generic.
  • /resume error message hard-codes "Session database not available." with no cause. Nine such sites across cli.py and gateway/run.py:
    • cli.py:5368, 5479, 6755, 6770
    • gateway/run.py:10186, 10224, 10438, 10482, 10569
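One way to stop losing the cause is to store the exception text next to the handle when init fails, so every "not available" site can include it. The sketch below is illustrative only: SessionStore, init_error, and unavailable_message are hypothetical names, not the actual Hermes classes, and the hint text is an assumption.

```python
import sqlite3
from typing import Optional


class SessionStore:
    """Hypothetical wrapper that remembers why the session DB failed to open."""

    def __init__(self) -> None:
        self.db: Optional[sqlite3.Connection] = None
        self.init_error: Optional[str] = None

    def open(self, path: str) -> None:
        conn = None
        try:
            conn = sqlite3.connect(path)
            conn.execute("PRAGMA journal_mode=WAL")
            self.db, self.init_error = conn, None
        except sqlite3.Error as exc:
            if conn is not None:
                conn.close()
            # Keep the cause instead of only setting the handle to None.
            self.db, self.init_error = None, str(exc)

    def unavailable_message(self) -> str:
        """Build the user-facing string, including the cause when known."""
        if self.init_error is None:
            return "Session database not available."
        hint = ""
        if "locking protocol" in self.init_error:
            # Assumed wording for the hint; adjust to the project's docs.
            hint = " (state.db may be on a network filesystem)"
        return f"Session database not available: {self.init_error}{hint}."
```

With a single helper like unavailable_message(), the nine hard-coded strings could call one function instead of each repeating the literal.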

Who this affects

  • Users with ~/.hermes on NFS (shared university clusters, enterprise Linux, cloud dev VMs mounting team home dirs)
  • Users with ~/.hermes on SMB/CIFS, some FUSE mounts, or WSL1
  • Anyone whose state.db / kanban.db ends up in a container bind-mount where locking semantics differ

The failure mode presents to the user as "/resume just doesn't work" with no actionable diagnostic. Support burden: every affected user has to share logs with a maintainer to figure out what's broken.

Proposed fix

Three changes, all in one PR:

  1. Fall back to journal_mode=DELETE on WAL failure. DELETE is SQLite's default rollback-journal mode and predates WAL; it does not need WAL's shared-memory file, so it works on NFS. Concurrency drops (no concurrent readers during writes) but the feature works. Apply the fallback in both hermes_state.py and hermes_cli/kanban_db.py. Log a single WARNING on fallback explaining why.

  2. Surface the cause in /resume and related error messages. Capture the underlying OperationalError on the failing init and include it in the user-facing string. Instead of "Session database not available.", show "Session database not available: locking protocol (state.db may be on a network filesystem — see <docs>).".

  3. Bump gateway/run.py:1196 log level from DEBUG to WARNING so the failure appears in errors.log, matching the CLI path which already does this correctly.
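The fallback in change 1 could be sketched as follows. This is a minimal illustration under assumptions, not the actual patch: connect_with_fallback and the module-level flag are hypothetical names, the factory parameter exists only to make the function testable, and whether the same connection stays usable after the failed pragma on a real NFS mount has not been verified here (closing and reopening before the DELETE pragma may be safer).

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)
_fallback_warned = False  # module-level flag: warn once per process


def connect_with_fallback(path, factory=sqlite3.Connection):
    """Open `path` in WAL mode, or in DELETE mode if WAL is rejected."""
    global _fallback_warned
    conn = sqlite3.connect(path, factory=factory)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        return conn
    except sqlite3.OperationalError as exc:
        wal_error = exc
    try:
        # DELETE is the rollback-journal mode; it needs no -shm file.
        conn.execute("PRAGMA journal_mode=DELETE")
    except sqlite3.Error:
        # Neither mode works: close the handle and let the caller see why.
        conn.close()
        raise
    if not _fallback_warned:
        _fallback_warned = True
        logger.warning(
            "WAL unavailable for %s (%s); using journal_mode=DELETE. "
            "The database may be on a network filesystem.",
            path,
            wal_error,
        )
    return conn
```

Both hermes_state.py and hermes_cli/kanban_db.py could share one helper like this, which keeps the warning text and the once-per-process behaviour in a single place.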

Deliberately out of scope for the PR

Acceptance criteria

  • SessionDB() succeeds on NFS via DELETE-mode fallback, with a single WARNING logged once per process.
  • kanban_db.connect() succeeds on NFS via the same fallback.
  • /resume on a system where SessionDB genuinely cannot open returns a message containing the underlying cause.
  • New tests cover:
    • WAL pragma raising OperationalError("locking protocol") → DELETE fallback fires, DB is usable.
    • /resume error string includes the captured cause when _session_db is None.
  • No regression in existing SessionDB / kanban tests.
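The first new test could be written without an NFS mount by injecting a connection class that rejects the WAL pragma. The sketch below is a hedged example: WalRejectingConnection is a hypothetical test double, and open_db is a stand-in for whatever function ends up holding the fallback in the real PR.

```python
import os
import sqlite3
import tempfile


class WalRejectingConnection(sqlite3.Connection):
    """Test double: rejects the WAL pragma with 'locking protocol'."""

    def execute(self, sql, *args):
        if sql.replace(" ", "").upper() == "PRAGMAJOURNAL_MODE=WAL":
            raise sqlite3.OperationalError("locking protocol")
        return super().execute(sql, *args)


def open_db(path, factory=sqlite3.Connection):
    """Stand-in for the code under test (assumed shape, not the real one)."""
    conn = sqlite3.connect(path, factory=factory)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
    except sqlite3.OperationalError:
        conn.execute("PRAGMA journal_mode=DELETE")
    return conn


def test_wal_failure_falls_back_to_delete():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        conn = open_db(path, factory=WalRejectingConnection)
        try:
            # The fallback fired: the effective mode is DELETE, not WAL.
            mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
            assert mode.lower() == "delete"
            # The database is usable: write a row and read it back.
            conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, title TEXT)")
            conn.execute("INSERT INTO sessions VALUES (?, ?)", ("s1", "demo"))
            conn.commit()
            row = conn.execute("SELECT title FROM sessions").fetchone()
            assert row == ("demo",)
        finally:
            conn.close()
```

Subclassing sqlite3.Connection and passing it as factory avoids monkeypatching the C-level Connection type, whose methods cannot be reassigned directly.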

References

Labels

  • P1 (High: major feature broken, no workaround)
  • area/config (Config system, migrations, profiles)
  • comp/cli (CLI entry point, hermes_cli/, setup wizard)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • type/bug (Something isn't working)
