Skip to content

fix(state): wrap DELETE journal_mode fallback in try/except to survive APFS double-failure#30823

Closed
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/hermes-state-delete-fallback-uncaught-30816
Closed

fix(state): wrap DELETE journal_mode fallback in try/except to survive APFS double-failure#30823
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/hermes-state-delete-fallback-uncaught-30816

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

When apply_wal_with_fallback() in hermes_state.py falls back to PRAGMA journal_mode=DELETE on a WAL-incompatible filesystem (NFS / SMB / FUSE), and the DELETE pragma also fails — observed on APFS external SSDs raising disk I/O error for both pragma calls — the inner OperationalError was uncaught and propagated out. Every caller of apply_wal_with_fallback then crashed at init time:

  • SessionDB.__init__ (hermes_state.py:354) → _session_db = None, breaks /resume, /title, /history, /branch, session search
  • hermes_cli/kanban_db.py::connect() → kanban dispatcher crashes every 60s when the dashboard is open
  • gateway/platforms/api_server.py:349 → ResponseStore unavailable
  • plugins/memory/holographic/store.py:134 → holographic memory store fails

The connection's default journal mode is already DELETE (SQLite's pre-WAL default), so the connection is still usable for reads/writes even when the explicit pragma is rejected. This change wraps the DELETE pragma in try/except, logs a dedup'd WARNING per db_label when both fail, and returns "delete" so callers keep running.

Related Issue

Fixes #30816

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_state.py — wrap the inner conn.execute(\"PRAGMA journal_mode=DELETE\") in try/except sqlite3.OperationalError; log a dedup'd WARNING via the new _log_wal_delete_fallback_failed_once helper (mirrors the existing per-db_label dedup pattern used by _log_wal_fallback_once). No behavior change on the WAL-only-fails path.
  • tests/test_hermes_state_wal_fallback.py — adds test_falls_back_when_delete_pragma_also_fails (verifies mode=="delete", both pragmas attempted, two distinct warnings, connection still usable), test_delete_fallback_failure_warning_deduplicated_per_db_label (regression guard for log-spam under kanban_db.connect()'s per-operation pattern), and test_sessiondb_survives_when_both_journal_pragmas_fail (E2E: SessionDB() initializes and round-trips a session despite both pragmas failing). Reworks the existing test_captures_cause_on_failed_init to fail through PRAGMA foreign_keys=ON instead of journal_mode so it still exercises the _set_last_init_error capture path under the new behavior.

How to Test

  1. Repro the original bug by reverting just the try/except around the DELETE pragma:
    uv run --with pytest --with pytest-xdist --with pytest-asyncio --with pytest-timeout python3 -m pytest tests/test_hermes_state_wal_fallback.py -v
    
    The three new tests fail with sqlite3.OperationalError: disk I/O error propagating out of apply_wal_with_fallback.
  2. Restore the fix and re-run — all 18 tests pass.
  3. Adjacent coverage:
    uv run --with pytest --with pytest-xdist --with pytest-asyncio --with pytest-timeout python3 -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py -v
    
    177 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run focused tests for the touched code and all pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.x

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A; the new helper carries its own docstring
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — fix is observed on macOS APFS but applies cross-platform; the WAL→DELETE→default cascade is OS-agnostic
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Related / Positioning

Orthogonal to @SimoKiihamaki's #30654, which targets a different SQLite issue (#30636, SIGTERM-induced WAL corruption) by adding synchronous=FULL after the WAL pragma succeeds and switching close() to wal_checkpoint(TRUNCATE). Both PRs touch apply_wal_with_fallback but on different lines (their addition lives on the WAL-success path; mine lives inside the DELETE-fallback except block), so they merge cleanly.

Audited siblings: apply_wal_with_fallback is the single shared WAL→DELETE entry point used by SessionDB, hermes_cli/kanban_db.py::connect(), gateway/platforms/api_server.py:349, and plugins/memory/holographic/store.py:134. All four callers benefit from the fix without additional changes. No widening needed.

Copilot AI review requested due to automatic review settings May 23, 2026 08:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves SQLite journal-mode fallback handling by tolerating environments where both PRAGMA journal_mode=WAL and PRAGMA journal_mode=DELETE fail, while deduplicating the new warning to avoid log spam.

Changes:

  • Catch and log (once per db_label) failures of the DELETE fallback pragma while continuing to use the (still-usable) connection.
  • Add tests covering the “both pragmas fail” scenario, including SessionDB initialization regression coverage.
  • Update the “captures cause on failed init” test to simulate a non-journal-related init failure (foreign_keys).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/test_hermes_state_wal_fallback.py Adds regression tests for dual-pragmas failure + warning dedup; adjusts init-error test scenario.
hermes_state.py Adds deduped warning + exception handling when DELETE fallback pragma fails.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_hermes_state_wal_fallback.py Outdated
msgs = [r.getMessage() for r in caplog.records if r.levelname == "WARNING"]
wal_warnings = [m for m in msgs if "apfs-test.db" in m and "WAL" in m]
delete_warnings = [
m for m in msgs if "apfs-test.db" in m and "DELETE journal_mode failed" in m
Comment thread hermes_state.py
Comment on lines +161 to 172
try:
conn.execute("PRAGMA journal_mode=DELETE")
except sqlite3.OperationalError as delete_exc:
# Some filesystems (e.g. APFS external SSDs under heavy
# contention) reject BOTH WAL and DELETE journal_mode pragmas
# with "disk I/O error". The connection's default journal
# mode is already DELETE, so propagating this would crash
# every caller (SessionDB, kanban_db, ResponseStore, holographic
# memory store) when in practice the connection is still
# usable for reads/writes. Log once and continue.
_log_wal_delete_fallback_failed_once(db_label, exc, delete_exc)
return "delete"
Comment thread hermes_state.py Outdated
@@ -71,6 +71,7 @@
# hermes_cli/kanban_db.py for ~30 call sites) would re-log the same
# filesystem-incompat warning on every connection, filling errors.log.
_wal_fallback_warned_paths: set[str] = set()
_wal_delete_fallback_failed_paths: set[str] = set()
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists labels May 23, 2026
@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot All three findings addressed in commit e98164390:

  1. Contract violation (line 172, hermes_state.py) — in the both-PRAGMA-fail branch, the function now queries PRAGMA journal_mode and returns the mode actually in effect on the connection. If even the read fails (broken connection), we fall through to "delete" (SQLite's default). Docstring updated to describe this precedence.

  2. Naming (line 74, hermes_state.py) — renamed _wal_delete_fallback_failed_paths_wal_delete_fallback_failed_db_labels, since the set is keyed by db_label, not filesystem paths.

  3. Test substring (line 197, tests/test_hermes_state_wal_fallback.py) — tightened from "DELETE journal_mode failed" (which IS a substring of the actual warning "both WAL and DELETE journal_mode failed", so the existing assertion was passing) to the full "both WAL and DELETE journal_mode failed" so the match can't drift if other warnings ever mention DELETE journal_mode in a different context.

Also added a new regression test test_both_pragmas_fail_but_readback_reports_actual_mode covering the new read-back code path (PRAGMA writes blocked, PRAGMA reads pass through). All 19 tests in the file pass locally.

@briandevans briandevans force-pushed the fix/hermes-state-delete-fallback-uncaught-30816 branch from e981643 to a348bb4 Compare May 23, 2026 11:08
@briandevans briandevans force-pushed the fix/hermes-state-delete-fallback-uncaught-30816 branch from a348bb4 to fe40500 Compare May 24, 2026 14:15
@briandevans briandevans force-pushed the fix/hermes-state-delete-fallback-uncaught-30816 branch from fe40500 to dc24e38 Compare May 26, 2026 22:12
@briandevans briandevans force-pushed the fix/hermes-state-delete-fallback-uncaught-30816 branch from dc24e38 to e00861a Compare May 28, 2026 00:18
…e APFS double-failure

When `apply_wal_with_fallback()` falls back to `PRAGMA journal_mode=DELETE`
on a WAL-incompatible filesystem and the DELETE pragma also fails (observed
on APFS external SSDs raising "disk I/O error"), the uncaught
OperationalError propagated out of `SessionDB.__init__`, `kanban_db.connect()`,
`ResponseStore.__init__`, and the holographic memory store — silently
breaking /resume, /title, /history, kanban, and the API response store.

The connection's default journal mode is already DELETE (the pre-WAL
SQLite default), so the connection remains usable for reads and writes
even when the explicit pragma is rejected.  Catch the inner failure,
log a dedup'd WARNING per db_label, and return "delete" so callers
continue running.

Fixes NousResearch#30816.
…opilot review

Address three Copilot findings on NousResearch#30823:

1. `apply_wal_with_fallback` previously returned a hardcoded `"delete"`
   even when `PRAGMA journal_mode=DELETE` raised, violating the docstring
   contract that the returned value reflects the journal mode actually
   in effect.  On the both-fail path, read `PRAGMA journal_mode` back
   from the connection and return its value; if even the read fails,
   fall through to `"delete"` (SQLite's default) — the previous behavior.

2. Rename `_wal_delete_fallback_failed_paths` to
   `_wal_delete_fallback_failed_db_labels`.  The set is keyed by
   `db_label` (e.g. `"state.db"`, `"kanban.db (foo.db)"`), not by
   filesystem paths.

3. Tighten the test substring from `"DELETE journal_mode failed"` to
   `"both WAL and DELETE journal_mode failed"` so the assertion can't
   accidentally match an unrelated message that happens to contain
   `"DELETE journal_mode"`.

Add a new regression test `test_both_pragmas_fail_but_readback_reports_actual_mode`
that exercises the new read-back path explicitly: PRAGMA writes fail,
PRAGMA reads succeed, function returns whatever the connection reports.

All 19 tests pass locally.
@briandevans briandevans force-pushed the fix/hermes-state-delete-fallback-uncaught-30816 branch from e00861a to 583f2f4 Compare May 31, 2026 09:13
…r not bare EIO

The three both-pragmas-fail tests fed PRAGMA journal_mode=WAL an
OperationalError('disk I/O error'), expecting it to fall through to the
DELETE fallback. But bare transient EIO on the WAL pragma is intentionally
re-raised (Bug D / test_reraises_on_disk_io_error): treating it as a
permanent WAL-incompat marker risks mixed-journal-mode cross-process
corruption. The WAL pragma therefore re-raised before any fallback ran,
failing all four new tests under CI.

Rewrite the test connection factories so WAL fails with a recognized
WAL-incompat marker (locking protocol) and only the DELETE pragma (plus
the bare read-back) fails with disk I/O error. This exercises the genuine
both-fail path the PR targets while preserving the transient-EIO-re-raise
contract. No production logic change; the apply_wal_with_fallback docstring
is corrected to state the WAL pragma re-raises bare EIO.
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up.

@briandevans briandevans closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

apply_wal_with_fallback: DELETE fallback uncaught — crashes on APFS external SSDs

3 participants