SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban #22032

## Summary

When `~/.hermes` is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's `PRAGMA journal_mode=WAL` fails with `sqlite3.OperationalError: locking protocol`. Every component that opens `state.db` or `kanban.db` swallows this error silently, and the user is left with:

- `/resume`, `/title`, `/history`, `/branch` all respond `"Session database not available."` with no explanation
- `hermes update` snapshot warning `SQLite safe copy failed for ~/.hermes/state.db: locking protocol`
- Kanban dispatcher tick crashing every 60s with the same error
- TUI session store unavailable warnings
- (Downstream) the known `duplicate column name: consecutive_failures` kanban migration race (#21708 / #21374) firing continuously because the migration is retried on every tick

The user has no way to know why any of this is happening. Hermes does not check for WAL compatibility and does not attempt a fallback.

## Evidence

Real user debug report. Their `stat -f ~/.hermes` output and mount line:

```
File: "/home/mormio/.hermes"
Type: nfs
ID: 0  Namelen: 255
172.26.224.200:d2dfac12/home on /home type nfs
  (rw, relatime, vers=3, rsize=1048576, wsize=1048576, namelen=255,
   hard, forcerdirplus, proto=tcp, nconnect=4, timeo=600, retrans=2,
   sec=sys, mountaddr=172.26.224.200, mountvers=3, mountport=20048,
   mountproto=udp, local_lock=none, addr=172.26.224.200)
```

NFSv3 over TCP with `local_lock=none` — the exact configuration SQLite upstream documents as [incompatible with WAL](https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode):

> SQLite databases in WAL mode do not work over a network filesystem.

The resulting log entries in the same user's session:

```
2026-05-08 13:41:11  WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
2026-05-08 13:45:05  ERROR gateway.run: kanban dispatcher: tick failed on board default
    File "hermes_cli/kanban_db.py", line 878, in connect
      conn.execute("PRAGMA journal_mode=WAL")
  sqlite3.OperationalError: locking protocol
2026-05-08 13:46:46  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
2026-05-08 13:46:59  WARNING cli: Failed to initialize SessionDB — session will NOT be indexed for search: locking protocol
2026-05-08 13:47:08  WARNING tui_gateway.server: TUI session store unavailable — continuing without state.db features: locking protocol
```

The kanban dispatcher retried this failing tick every 60s until the user restarted the gateway.

## Root cause

Two files hit `PRAGMA journal_mode=WAL` unconditionally with no fallback:

- **`hermes_state.py:201`** — `SessionDB.__init__` sets `journal_mode=WAL`. On failure the caller (`SessionDB()` in `cli.py:2379`, `gateway/run.py:1194`, `tui_gateway/server.py`) catches the exception and sets `_session_db = None`, but never tries a different journal mode.
- **`hermes_cli/kanban_db.py:920`** — `connect()` sets `journal_mode=WAL`. On failure the exception bubbles to the kanban dispatcher tick, which is retried every 60s forever.

The failure is silent downstream:

- **Gateway logs at `DEBUG`** (`gateway/run.py:1196`): `logger.debug("SQLite session store not available: %s", e)` — invisible in `errors.log`.
- **CLI logs at `WARNING`** (correct) — visible but still generic.
- **`/resume` error message** hard-codes `"Session database not available."` with no cause. Nine such sites across `cli.py` and `gateway/run.py`:
  - `cli.py:5368, 5479, 6755, 6770`
  - `gateway/run.py:10186, 10224, 10438, 10482, 10569`

## Who this affects

- Users with `~/.hermes` on NFS (shared university clusters, enterprise Linux, cloud dev VMs mounting team home dirs)
- Users with `~/.hermes` on SMB/CIFS, some FUSE mounts, or WSL1
- Anyone whose `state.db` / `kanban.db` ends up in a container bind-mount where locking semantics differ

The failure mode presents to the user as "`/resume` just doesn't work" with no actionable diagnostic. Support burden: every affected user has to share logs with a maintainer to figure out what's broken.
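
A quick way for an affected user (or a maintainer triaging a report) to confirm the cause is to try enabling WAL on a throwaway database in the same directory. The following is a minimal sketch, not part of Hermes; the helper name `wal_supported` is hypothetical, and it only uses the standard-library `sqlite3`, `os`, and `tempfile` modules.

```python
import os
import sqlite3
import tempfile
from typing import Tuple


def wal_supported(directory: str) -> Tuple[bool, str]:
    """Create a throwaway database in `directory` and try to enable WAL.

    Returns (ok, detail): detail is the journal mode SQLite reports,
    or the error text if the pragma raised.
    """
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    conn = None
    try:
        conn = sqlite3.connect(path)
        # The pragma returns the mode actually in effect after the call.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        return mode.lower() == "wal", mode
    except sqlite3.OperationalError as exc:
        return False, str(exc)
    finally:
        if conn is not None:
            conn.close()
        # Remove the probe database and any WAL side files left behind.
        for suffix in ("", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
```

Calling `wal_supported(os.path.expanduser("~/.hermes"))` on the affected machine would be expected to return `False` together with the `locking protocol` text seen in the logs above, though this sketch has not been run against an NFS mount.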

## Proposed fix

Three changes, all in one PR:

1. **Fall back to `journal_mode=DELETE` on WAL failure.** DELETE (rollback journal) is SQLite's default mode and predates WAL; it does not depend on the shared-memory file that WAL requires, so it generally works on NFS. Concurrency drops (no concurrent readers during writes) but the feature works. Apply the fallback in both `hermes_state.py` and `hermes_cli/kanban_db.py`. Log a single `WARNING` on fallback explaining why.

2. **Surface the cause in `/resume` and related error messages.** Capture the underlying `OperationalError` on the failing init and include it in the user-facing string. Instead of `"Session database not available."`, show `"Session database not available: locking protocol (state.db may be on a network filesystem — see <docs>)."`.

3. **Bump `gateway/run.py:1196` log level** from `DEBUG` to `WARNING` so the failure appears in `errors.log`, matching the CLI path which already does this correctly.

## Deliberately out of scope for the PR

- NFS autodetection at startup via `statvfs` / `/proc/mounts`. Fragile across Linux/macOS/WSL/Docker overlay FS. The try/except fallback approach is OS-agnostic and more robust.
- `hermes doctor` integration. Separate concern, separate PR.
- The `duplicate column name: consecutive_failures` kanban migration race (#21708 / #21374). Separate root cause; on NFS it fires continuously *because* of this bug (WAL failure → migration retried forever), so fixing the WAL issue stops the cascade without fixing the migration itself.

## Acceptance criteria

- `SessionDB()` succeeds on NFS via DELETE-mode fallback, with exactly one `WARNING` logged per process.
- `kanban_db.connect()` succeeds on NFS via the same fallback.
- `/resume` on a system where SessionDB genuinely cannot open returns a message containing the underlying cause.
- New tests cover:
  - WAL pragma raising `OperationalError("locking protocol")` → DELETE fallback fires, DB is usable.
  - `/resume` error string includes the captured cause when `_session_db is None`.
- No regression in existing SessionDB / kanban tests.
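
The `/resume` criterion could be checked against a small message builder like the one sketched here. The function name `session_db_unavailable_message` is hypothetical, as is the idea that the init error is kept alongside `_session_db`; the docs pointer from the proposed string is left out because its target is not yet defined.

```python
import sqlite3
from typing import Optional


def session_db_unavailable_message(cause: Optional[BaseException]) -> str:
    """Build the user-facing string for commands that need SessionDB."""
    base = "Session database not available"
    if cause is None:
        # No captured error: keep today's generic wording.
        return base + "."
    hint = ""
    if "locking protocol" in str(cause).lower():
        hint = " (state.db may be on a network filesystem)"
    return f"{base}: {cause}{hint}."


# Shape of the assertion the new test might make.
err = sqlite3.OperationalError("locking protocol")
assert "locking protocol" in session_db_unavailable_message(err)
```

Routing all nine hard-coded sites through one helper would keep the wording consistent between `cli.py` and `gateway/run.py`.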

## References

- SQLite WAL documentation: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
- Related symptom issues (downstream of this bug on NFS):
  - #21708 — kanban `duplicate column name: consecutive_failures`
  - #21374 — race condition in `_migrate_add_optional_columns`
- Prior related PR (TUI degradation only, did not fix root cause): #14135

