fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (fixes /resume on network-mounted HERMES_HOME) by kshitijk4poor · Pull Request #22043 · NousResearch/hermes-agent

kshitijk4poor · 2026-05-08T19:15:47Z

Summary

When ~/.hermes is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's PRAGMA journal_mode=WAL raises sqlite3.OperationalError: locking protocol (SQLITE_PROTOCOL). Every caller currently catches this and swallows it silently, leaving the user with broken /resume, /title, /history, /branch, session search, and the kanban dispatcher — with no diagnostic.

This PR makes PRAGMA journal_mode=WAL fall back to journal_mode=DELETE when it fails on a WAL-incompatible filesystem, logs one WARNING explaining why, and surfaces the underlying cause in slash-command error messages.

Closes #22032.

Why

SQLite upstream documents that WAL mode does not work over a network filesystem. The real-world user in #22032 is on an NFSv3 mount:

172.26.224.200:d2dfac12/home on /home type nfs
  (rw, vers=3, ..., local_lock=none)

local_lock=none routes locks to the server, and WAL's shared-memory coordination breaks down. The same user's logs showed 4 distinct locking protocol failures in a single session (backup, kanban, TUI, CLI), all silently degrading features:

WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
ERROR gateway.run: kanban dispatcher: tick failed ...: locking protocol
WARNING tui_gateway.server: TUI session store unavailable ...: locking protocol
WARNING cli: Failed to initialize SessionDB ...: locking protocol

Downstream, the broken kanban migration kept retrying every 60s, driving #21708 / #21374 continuously.

What Changed

`hermes_state.py` — shared WAL-compatibility fallback

apply_wal_with_fallback(conn, db_label) — attempts PRAGMA journal_mode=WAL, catches OperationalError with a marker in ("locking protocol", "not authorized", "disk i/o error"), logs one WARNING naming the db and cause, then sets journal_mode=DELETE. Unrelated OperationalErrors still propagate.
get_last_init_error() — records the most recent SessionDB() init failure for slash-command messages.
format_session_db_unavailable(prefix="Session database not available") — formats a user-facing error string that includes the cause and adds an NFS/SMB hint + docs link when the cause looks like a WAL-compat failure.
SessionDB.__init__ uses the helper and records success/failure into the last-init-error slot.
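
The fallback described above can be sketched roughly as follows. This is a minimal illustration based only on the description in this PR, not the merged code: the logger name, the `_WAL_INCOMPAT_MARKERS` constant, and the string return value are assumptions made for the example.

```python
import logging
import sqlite3

logger = logging.getLogger("hermes_state_sketch")  # hypothetical logger name

# Substrings listed in the PR description as WAL-incompatibility markers.
_WAL_INCOMPAT_MARKERS = ("locking protocol", "not authorized", "disk i/o error")


def apply_wal_with_fallback(conn: sqlite3.Connection, db_label: str) -> str:
    """Try WAL; fall back to DELETE when the error looks WAL-incompatible.

    Returns the journal mode this sketch requested without error
    ("wal" or "delete"). The return value is an assumption of the sketch.
    """
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        return "wal"
    except sqlite3.OperationalError as exc:
        message = str(exc).lower()
        if not any(marker in message for marker in _WAL_INCOMPAT_MARKERS):
            raise  # unrelated OperationalErrors still propagate
        # One WARNING naming the database and the cause.
        logger.warning(
            "%s: WAL unavailable (%s); falling back to journal_mode=DELETE. "
            "The database may be on NFS/SMB/FUSE.",
            db_label,
            exc,
        )
        conn.execute("PRAGMA journal_mode=DELETE")
        return "delete"
```

Matching on message substrings rather than probing the filesystem keeps the check OS-agnostic, which is the trade-off this PR argues for under "Out of Scope".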

`hermes_cli/kanban_db.py` — use the shared helper

connect() replaces the bare conn.execute("PRAGMA journal_mode=WAL") with apply_wal_with_fallback(conn, db_label=f"kanban.db ({path.name})").

`gateway/run.py` — surface the failure

SessionDB init failure log bumped from DEBUG → WARNING (matches cli.py's existing correct behavior; now appears in errors.log).
5 return "Session database not available." sites replaced with return format_session_db_unavailable() (calls into hermes_state).

`cli.py` — same treatment for 4 sites

4 _cprint(" Session database not available.") sites replaced with _cprint(f" {format_session_db_unavailable()}").

Example new error output

Before:

  Session database not available.

After (on NFS):

  Session database not available: OperationalError: locking protocol (state.db may be on NFS/SMB/FUSE — see https://www.sqlite.org/wal.html).
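
A sketch of how the message above might be assembled, assuming a module-level slot for the last init error. `record_init_result`, `_last_init_error`, and the marker tuple are hypothetical names introduced for illustration; only `get_last_init_error` and `format_session_db_unavailable` (with its `prefix` default) come from this PR's description.

```python
import sqlite3
from typing import Optional

_last_init_error: Optional[str] = None  # hypothetical module-level slot

_WAL_HINT_MARKERS = ("locking protocol", "not authorized", "disk i/o error")
_WAL_DOCS_URL = "https://www.sqlite.org/wal.html"


def record_init_result(exc: Optional[BaseException]) -> None:
    """Hypothetical setter: store 'TypeName: message', or clear on success."""
    global _last_init_error
    _last_init_error = None if exc is None else f"{type(exc).__name__}: {exc}"


def get_last_init_error() -> Optional[str]:
    """Return the most recent init failure, or None if init succeeded."""
    return _last_init_error


def format_session_db_unavailable(prefix: str = "Session database not available") -> str:
    """Build the user-facing string, adding a hint for WAL-compat causes."""
    cause = get_last_init_error()
    if not cause:
        return f"{prefix}."  # bare message when no cause is recorded
    if any(marker in cause.lower() for marker in _WAL_HINT_MARKERS):
        hint = f"(state.db may be on NFS/SMB/FUSE \u2014 see {_WAL_DOCS_URL})"
        return f"{prefix}: {cause} {hint}."
    return f"{prefix}: {cause}."


# Usage: simulate the NFS failure, then render the message.
record_init_result(sqlite3.OperationalError("locking protocol"))
print(format_session_db_unavailable())
```

With the simulated `locking protocol` failure recorded, the final `print` reproduces the "After (on NFS)" line shown above.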

Test Plan

New tests

| File | Tests | What they cover |
| --- | --- | --- |
| `tests/test_hermes_state_wal_fallback.py` (NEW) | 12 | `apply_wal_with_fallback` happy path, 3 WAL-incompat error markers, unrelated-error re-raise, `get_last_init_error` success/failure paths, `format_session_db_unavailable` 4 variants, E2E SessionDB CRUD on simulated NFS |
| `tests/hermes_cli/test_kanban_db.py` (+1) | 1 | `kanban_db.connect()` falls back to DELETE and persists a task |

Validation

# New tests
bash scripts/run_tests.sh tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py
# → 73 passed in 2.35s

# Regression — full state/kanban test scope
bash scripts/run_tests.sh tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py
# → 283 passed in 2.65s

# Wider scope — gateway tests (only unrelated pre-existing failures)
bash scripts/run_tests.sh tests/gateway/
# → 5045 passed, 3 failed — all 3 failures verified pre-existing on origin/main
#   (test_discord_free_channel_skips_auto_thread, test_matrix device_id,
#   one flaky agent_cache concurrency test that passes on rerun)

# CLI tests — 14 of 15 failures verified pre-existing on origin/main
bash scripts/run_tests.sh tests/hermes_cli/
# → 3990 passed, 15 failed (unrelated — systemd/WSL/CI env differences)

E2E validation with real imports

Spun up a temp HERMES_HOME, imported hermes_state and hermes_cli.kanban_db from the worktree, and monkey-patched sqlite3.connect to return a Connection subclass that raises OperationalError("locking protocol") on PRAGMA journal_mode=WAL:

✅ Normal init + session round-trip works (unchanged behavior)
✅ NFS-simulated init falls back to journal_mode=DELETE; create_session + get_session round-trip works; get_last_init_error() returns None (init succeeded via fallback)
✅ Message formatting: bare message when no cause, cause-bearing message + NFS/SMB hint + docs link when cause is locking protocol
✅ kanban.db falls back to DELETE; create_task + list_tasks round-trip works
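
The NFS simulation described above can be approximated like this. It is a self-contained sketch, not the validation script itself: `WalRejectingConnection`, `simulated_nfs`, and `open_with_fallback` are hypothetical names, and `open_with_fallback` stands in for the real `SessionDB` / `kanban_db.connect()` code paths.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


class WalRejectingConnection(sqlite3.Connection):
    """Connection that fails the WAL pragma the way a network mount does."""

    def execute(self, sql, *args):
        if sql.replace(" ", "").lower() == "pragmajournal_mode=wal":
            raise sqlite3.OperationalError("locking protocol")
        return super().execute(sql, *args)


@contextmanager
def simulated_nfs():
    """Temporarily make sqlite3.connect() return WalRejectingConnection."""
    real_connect = sqlite3.connect

    def fake_connect(*args, **kwargs):
        kwargs.setdefault("factory", WalRejectingConnection)
        return real_connect(*args, **kwargs)

    sqlite3.connect = fake_connect
    try:
        yield
    finally:
        sqlite3.connect = real_connect  # always restore the real function


def open_with_fallback(path):
    """Stand-in for the code under test: try WAL, fall back to DELETE."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
    except sqlite3.OperationalError as exc:
        if "locking protocol" not in str(exc).lower():
            raise
        conn.execute("PRAGMA journal_mode=DELETE")
    return conn


with tempfile.TemporaryDirectory() as home, simulated_nfs():
    conn = open_with_fallback(os.path.join(home, "state.db"))
    conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO sessions VALUES (?, ?)", ("s1", "demo"))
    conn.commit()
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    row = conn.execute("SELECT title FROM sessions WHERE id = ?", ("s1",)).fetchone()
    conn.close()
```

Passing a `factory` to `sqlite3.connect` lets the simulation intercept only the WAL pragma while every other statement runs against a real on-disk database, so the round-trip still exercises actual SQLite behavior in DELETE mode.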

Lint

ruff check cli.py gateway/run.py hermes_cli/kanban_db.py hermes_state.py tests/hermes_cli/test_kanban_db.py tests/test_hermes_state_wal_fallback.py → All checks passed!

Out of Scope (deliberately)

NFS autodetection via statvfs / /proc/mounts. Fragile across Linux/macOS/WSL/Docker overlay FS. The try/except fallback is OS-agnostic and more robust.
hermes doctor integration. Separate concern, separate PR.
Telegram polling-conflict reset bug observed in the same user's logs (1137 conflict events, _polling_conflict_count resets to 0 on false success → fatal path unreachable against dual-poller). Different file, different root cause, different test suite. Will be a separate issue + PR.
Fix for #21708 (kanban dispatcher: 'duplicate column name: consecutive_failures' on first tick after gateway restart) / #21374 (race condition in kanban _migrate_add_optional_columns on gateway startup), i.e. the kanban migration race. This is a downstream symptom of NFS: once this PR lands, the kanban dispatcher won't be retrying a doomed migration every 60s, so the race window collapses dramatically. The underlying migration is still worth fixing, but separately.

Checklist

Ruff clean on all changed files
New tests added (13 total), all passing
Existing tests/test_hermes_state.py suite still green
Existing tests/hermes_cli/test_kanban_db.py suite still green
tests/gateway/ regressions identified and verified pre-existing on origin/main
E2E validation with real imports on simulated NFS
AUTHOR_MAP already includes this commit's author email
Commit message references issue #22032 (SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban) via `closes`
PR only touches the files it claims to (no unrelated drift)

github-actions · 2026-05-08T19:16:48Z

🔎 Lint report: `fix/sqlite-wal-fallback-on-nfs` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7872 on HEAD, 7871 on base (🆕 +1)

🆕 New issues (15):

| Rule | Count |
| --- | --- |
| `invalid-argument-type` | 8 |
| `unsupported-operator` | 3 |
| `unresolved-attribute` | 3 |
| `unresolved-import` | 1 |

First entries

tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 4 union elements`
tests/run_agent/test_provider_attribution_headers.py:155: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache"]` and `Unknown | str | dict[str, str] | ... omitted 4 union elements`
run_agent.py:12821: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:12818: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 4 union elements`
run_agent.py:6822: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 4 union elements`
run_agent.py:2562: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
tests/agent/test_codex_cloudflare_headers.py:181: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["originator"]` and `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:2613: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 4 union elements`
run_agent.py:2565: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:2330: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 5 union elements`
tests/test_hermes_state_wal_fallback.py:19: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
run_agent.py:6651: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`

✅ Fixed issues (14):

| Rule | Count |
| --- | --- |
| `invalid-argument-type` | 8 |
| `unresolved-attribute` | 3 |
| `unsupported-operator` | 3 |

First entries

run_agent.py:2330: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:12819: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:12816: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:6822: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:181: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["originator"]` and `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:2562: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:2565: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:2613: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 3 union elements`
run_agent.py:6651: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:155: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`

Unchanged: 4154 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl byte-range locks that don't reliably work on network filesystems. Upstream documents this explicitly: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises 'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before this change, every feature backed by state.db or kanban.db broke silently:

- /resume, /title, /history, /branch returned 'Session database not available.' with no cause
- gateway logged the init failure at DEBUG (invisible in errors.log)
- kanban dispatcher crashed every 60s, driving the known migration race (duplicate column name: consecutive_failures, #21708 / #21374)

Changes:

- hermes_state.apply_wal_with_fallback(): shared helper that tries WAL and falls back to DELETE on SQLITE_PROTOCOL-style errors with one WARNING explaining why
- hermes_state.get_last_init_error() + format_session_db_unavailable(): capture the init failure cause and surface it in user-facing strings (with an NFS/SMB pointer for 'locking protocol')
- hermes_cli/kanban_db.connect(): use the shared helper
- gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING (matches cli.py's existing correct behavior)
- cli.py (4 sites) + gateway/run.py (5 sites): replace bare 'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new test in tests/hermes_cli/test_kanban_db.py. Existing suites (state, kanban, gateway, cli) remain green for all tests unrelated to pre-existing failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home, local_lock=none) reporting 'Session database not available.' on /resume; 'locking protocol' appears in 4 distinct log entries across backup, kanban, TUI, and CLI paths in the same session.

closes #22032


kshitijk4poor force-pushed the fix/sqlite-wal-fallback-on-nfs branch from 14c529a to 6ac7632 Compare May 8, 2026 19:37

kshitijk4poor force-pushed the fix/sqlite-wal-fallback-on-nfs branch from 6ac7632 to c2049b3 Compare May 9, 2026 09:07

kshitijk4poor merged commit 2a7047c into main May 9, 2026
15 of 16 checks passed

kshitijk4poor deleted the fix/sqlite-wal-fallback-on-nfs branch May 9, 2026 09:09

bot-ted mentioned this pull request May 9, 2026

chore: sync with upstream main (2026-05-09) bot-ted/hermes-agent#25

Merged

github-actions Bot mentioned this pull request May 17, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.7 to v2026.5.16 Docker-Hub-sirmark/docker-hermes-agent#6

Merged

jamesleech mentioned this pull request May 22, 2026

[Bug]: Kanban stale claim locks from dead workers have no auto-cleanup — tasks permanently stuck until manual intervention #22926

Open

kshitijk4poor merged 1 commit into main from fix/sqlite-wal-fallback-on-nfs
