Skip to content

fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (fixes /resume on network-mounted HERMES_HOME)#22043

Merged
kshitijk4poor merged 1 commit into
mainfrom
fix/sqlite-wal-fallback-on-nfs
May 9, 2026
Merged

fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (fixes /resume on network-mounted HERMES_HOME)#22043
kshitijk4poor merged 1 commit into
mainfrom
fix/sqlite-wal-fallback-on-nfs

Conversation

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Summary

When ~/.hermes is on a network filesystem (NFS, SMB/CIFS, some FUSE mounts, WSL1), SQLite's PRAGMA journal_mode=WAL raises sqlite3.OperationalError: locking protocol (SQLITE_PROTOCOL). Every caller currently catches this and swallows it silently, leaving the user with broken /resume, /title, /history, /branch, session search, and the kanban dispatcher — with no diagnostic.

This PR makes PRAGMA journal_mode=WAL fall back to journal_mode=DELETE when it fails on a WAL-incompatible filesystem, logs one WARNING explaining why, and surfaces the underlying cause in slash-command error messages.

Closes #22032.

Why

SQLite upstream documents that WAL mode does not work over a network filesystem. The real-world user in #22032 is on an NFSv3 mount:

172.26.224.200:d2dfac12/home on /home type nfs
  (rw, vers=3, ..., local_lock=none)

local_lock=none routes locks to the server, and WAL's shared-memory coordination breaks down. The same user's logs showed 4 distinct locking protocol failures in a single session (backup, kanban, TUI, CLI), all silently degrading features:

WARNING hermes_cli.backup: SQLite safe copy failed for ~/.hermes/state.db: locking protocol
ERROR gateway.run: kanban dispatcher: tick failed ...: locking protocol
WARNING tui_gateway.server: TUI session store unavailable ...: locking protocol
WARNING cli: Failed to initialize SessionDB ...: locking protocol

Downstream, the broken kanban migration kept retrying every 60s, driving #21708 / #21374 continuously.

What Changed

hermes_state.py — shared WAL-compatibility fallback

  • apply_wal_with_fallback(conn, db_label) — attempts PRAGMA journal_mode=WAL, catches OperationalError with a marker in ("locking protocol", "not authorized", "disk i/o error"), logs one WARNING naming the db and cause, then sets journal_mode=DELETE. Unrelated OperationalErrors still propagate.
  • get_last_init_error() — records the most recent SessionDB() init failure for slash-command messages.
  • format_session_db_unavailable(prefix="Session database not available") — formats a user-facing error string that includes the cause and adds an NFS/SMB hint + docs link when the cause looks like a WAL-compat failure.
  • SessionDB.__init__ uses the helper and records success/failure into the last-init-error slot.

hermes_cli/kanban_db.py — use the shared helper

  • connect() replaces the bare conn.execute("PRAGMA journal_mode=WAL") with apply_wal_with_fallback(conn, db_label=f"kanban.db ({path.name})").

gateway/run.py — surface the failure

  • SessionDB init failure log bumped from DEBUGWARNING (matches cli.py's existing correct behavior; now appears in errors.log).
  • 5 return "Session database not available." sites replaced with return format_session_db_unavailable() (calls into hermes_state).

cli.py — same treatment for 4 sites

  • 4 _cprint(" Session database not available.") sites replaced with _cprint(f" {format_session_db_unavailable()}").

Example new error output

Before:

  Session database not available.

After (on NFS):

  Session database not available: OperationalError: locking protocol (state.db may be on NFS/SMB/FUSE — see https://www.sqlite.org/wal.html).

Test Plan

New tests

File Tests What they cover
tests/test_hermes_state_wal_fallback.py (NEW) 12 apply_wal_with_fallback happy path, 3 WAL-incompat error markers, unrelated-error re-raise, get_last_init_error success/failure paths, format_session_db_unavailable 4 variants, E2E SessionDB CRUD on simulated NFS
tests/hermes_cli/test_kanban_db.py (+1) 1 kanban_db.connect() falls back to DELETE and persists a task

Validation

# New tests
bash scripts/run_tests.sh tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py
# → 73 passed in 2.35s

# Regression — full state/kanban test scope
bash scripts/run_tests.sh tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py
# → 283 passed in 2.65s

# Wider scope — gateway tests (only unrelated pre-existing failures)
bash scripts/run_tests.sh tests/gateway/
# → 5045 passed, 3 failed — all 3 failures verified pre-existing on origin/main
#   (test_discord_free_channel_skips_auto_thread, test_matrix device_id,
#   one flaky agent_cache concurrency test that passes on rerun)

# CLI tests — 14 of 15 failures verified pre-existing on origin/main
bash scripts/run_tests.sh tests/hermes_cli/
# → 3990 passed, 15 failed (unrelated — systemd/WSL/CI env differences)

E2E validation with real imports

Spun up a temp HERMES_HOME, imported hermes_state and hermes_cli.kanban_db from the worktree, and monkey-patched sqlite3.connect to return a Connection subclass that raises OperationalError("locking protocol") on PRAGMA journal_mode=WAL:

  • ✅ Normal init + session round-trip works (unchanged behavior)
  • ✅ NFS-simulated init falls back to journal_mode=DELETE; create_session + get_session round-trip works; get_last_init_error() returns None (init succeeded via fallback)
  • ✅ Message formatting: bare message when no cause, cause-bearing message + NFS/SMB hint + docs link when cause is locking protocol
  • kanban.db falls back to DELETE; create_task + list_tasks round-trip works

Lint

ruff check cli.py gateway/run.py hermes_cli/kanban_db.py hermes_state.py tests/hermes_cli/test_kanban_db.py tests/test_hermes_state_wal_fallback.pyAll checks passed!

Out of Scope (deliberately)

  • NFS autodetection via statvfs / /proc/mounts. Fragile across Linux/macOS/WSL/Docker overlay FS. The try/except fallback is OS-agnostic and more robust.
  • hermes doctor integration. Separate concern, separate PR.
  • Telegram polling-conflict reset bug observed in the same user's logs (1137 conflict events, _polling_conflict_count resets to 0 on false success → fatal path unreachable against dual-poller). Different file, different root cause, different test suite. Will be a separate issue + PR.
  • Fix for kanban dispatcher: 'duplicate column name: consecutive_failures' on first tick after gateway restart #21708 / Race condition in kanban _migrate_add_optional_columns on gateway startup #21374 (kanban migration race). Downstream symptom of NFS — once this PR lands, the kanban dispatcher won't be retrying a doomed migration every 60s, so the race window collapses dramatically. The underlying migration is still worth fixing but separately.

Checklist

  • Ruff clean on all changed files
  • New tests added (13 total), all passing
  • Existing tests/test_hermes_state.py suite still green
  • Existing tests/hermes_cli/test_kanban_db.py suite still green
  • tests/gateway/ regressions identified and verified pre-existing on origin/main
  • E2E validation with real imports on simulated NFS
  • AUTHOR_MAP already includes this commit's author email
  • Commit message references issue SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban #22032 via closes
  • PR only touches the files it claims to (no unrelated drift)

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/sqlite-wal-fallback-on-nfs vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7872 on HEAD, 7871 on base (🆕 +1)

🆕 New issues (15):

Rule Count
invalid-argument-type 8
unsupported-operator 3
unresolved-attribute 3
unresolved-import 1
First entries
tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 4 union elements`
tests/run_agent/test_provider_attribution_headers.py:155: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache"]` and `Unknown | str | dict[str, str] | ... omitted 4 union elements`
run_agent.py:12821: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:12818: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 4 union elements`
run_agent.py:6822: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 4 union elements`
run_agent.py:2562: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
tests/agent/test_codex_cloudflare_headers.py:181: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["originator"]` and `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:2613: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 4 union elements`
run_agent.py:2565: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 5 union elements`
run_agent.py:2330: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 5 union elements`
tests/test_hermes_state_wal_fallback.py:19: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
run_agent.py:6651: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`

✅ Fixed issues (14):

Rule Count
invalid-argument-type 8
unresolved-attribute 3
unsupported-operator 3
First entries
run_agent.py:2330: [invalid-argument-type] invalid-argument-type: Argument to function `query_ollama_num_ctx` is incorrect: Expected `str`, found `(str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:12819: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:12816: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:6822: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:181: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["originator"]` and `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:90: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:2562: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:2565: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:2613: [invalid-argument-type] invalid-argument-type: Argument to function `get_model_context_length` is incorrect: Expected `str`, found `str | dict[str, str] | Any | ... omitted 3 union elements`
run_agent.py:6651: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:156: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
tests/run_agent/test_provider_attribution_headers.py:155: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`

Unchanged: 4154 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@kshitijk4poor kshitijk4poor force-pushed the fix/sqlite-wal-fallback-on-nfs branch from 14c529a to 6ac7632 Compare May 8, 2026 19:37
SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, #21708 / #21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes #22032
@kshitijk4poor kshitijk4poor force-pushed the fix/sqlite-wal-fallback-on-nfs branch from 6ac7632 to c2049b3 Compare May 9, 2026 09:07
@kshitijk4poor kshitijk4poor merged commit 2a7047c into main May 9, 2026
15 of 16 checks passed
@kshitijk4poor kshitijk4poor deleted the fix/sqlite-wal-fallback-on-nfs branch May 9, 2026 09:09
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins labels May 9, 2026
JZKK720 pushed a commit to JZKK720/hermes-agent that referenced this pull request May 11, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request May 25, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…search#22043)

SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, NousResearch#21708 / NousResearch#21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes NousResearch#22032
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQLite 'locking protocol' on NFS silently breaks /resume, /title, /history, /branch, and kanban

2 participants