fix(state): wrap DELETE journal_mode fallback in try/except to survive APFS double-failure#30823
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves SQLite journal-mode fallback handling by tolerating environments where both PRAGMA journal_mode=WAL and PRAGMA journal_mode=DELETE fail, while deduplicating the new warning to avoid log spam.
Changes:
- Catch and log (once per
db_label) failures of the DELETE fallback pragma while continuing to use the (still-usable) connection. - Add tests covering the “both pragmas fail” scenario, including
SessionDBinitialization regression coverage. - Update the “captures cause on failed init” test to simulate a non-journal-related init failure (
foreign_keys).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_hermes_state_wal_fallback.py | Adds regression tests for dual-pragmas failure + warning dedup; adjusts init-error test scenario. |
| hermes_state.py | Adds deduped warning + exception handling when DELETE fallback pragma fails. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| msgs = [r.getMessage() for r in caplog.records if r.levelname == "WARNING"] | ||
| wal_warnings = [m for m in msgs if "apfs-test.db" in m and "WAL" in m] | ||
| delete_warnings = [ | ||
| m for m in msgs if "apfs-test.db" in m and "DELETE journal_mode failed" in m |
| try: | ||
| conn.execute("PRAGMA journal_mode=DELETE") | ||
| except sqlite3.OperationalError as delete_exc: | ||
| # Some filesystems (e.g. APFS external SSDs under heavy | ||
| # contention) reject BOTH WAL and DELETE journal_mode pragmas | ||
| # with "disk I/O error". The connection's default journal | ||
| # mode is already DELETE, so propagating this would crash | ||
| # every caller (SessionDB, kanban_db, ResponseStore, holographic | ||
| # memory store) when in practice the connection is still | ||
| # usable for reads/writes. Log once and continue. | ||
| _log_wal_delete_fallback_failed_once(db_label, exc, delete_exc) | ||
| return "delete" |
| @@ -71,6 +71,7 @@ | |||
| # hermes_cli/kanban_db.py for ~30 call sites) would re-log the same | |||
| # filesystem-incompat warning on every connection, filling errors.log. | |||
| _wal_fallback_warned_paths: set[str] = set() | |||
| _wal_delete_fallback_failed_paths: set[str] = set() | |||
|
@copilot All three findings addressed in commit e98164390:
Also added a new regression test |
e981643 to
a348bb4
Compare
a348bb4 to
fe40500
Compare
fe40500 to
dc24e38
Compare
dc24e38 to
e00861a
Compare
…e APFS double-failure When `apply_wal_with_fallback()` falls back to `PRAGMA journal_mode=DELETE` on a WAL-incompatible filesystem and the DELETE pragma also fails (observed on APFS external SSDs raising "disk I/O error"), the uncaught OperationalError propagated out of `SessionDB.__init__`, `kanban_db.connect()`, `ResponseStore.__init__`, and the holographic memory store — silently breaking /resume, /title, /history, kanban, and the API response store. The connection's default journal mode is already DELETE (the pre-WAL SQLite default), so the connection remains usable for reads and writes even when the explicit pragma is rejected. Catch the inner failure, log a dedup'd WARNING per db_label, and return "delete" so callers continue running. Fixes NousResearch#30816.
…opilot review Address three Copilot findings on NousResearch#30823: 1. `apply_wal_with_fallback` previously returned a hardcoded `"delete"` even when `PRAGMA journal_mode=DELETE` raised, violating the docstring contract that the returned value reflects the journal mode actually in effect. On the both-fail path, read `PRAGMA journal_mode` back from the connection and return its value; if even the read fails, fall through to `"delete"` (SQLite's default) — the previous behavior. 2. Rename `_wal_delete_fallback_failed_paths` to `_wal_delete_fallback_failed_db_labels`. The set is keyed by `db_label` (e.g. `"state.db"`, `"kanban.db (foo.db)"`), not by filesystem paths. 3. Tighten the test substring from `"DELETE journal_mode failed"` to `"both WAL and DELETE journal_mode failed"` so the assertion can't accidentally match an unrelated message that happens to contain `"DELETE journal_mode"`. Add a new regression test `test_both_pragmas_fail_but_readback_reports_actual_mode` that exercises the new read-back path explicitly: PRAGMA writes fail, PRAGMA reads succeed, function returns whatever the connection reports. All 19 tests pass locally.
e00861a to
583f2f4
Compare
…r not bare EIO
The three both-pragmas-fail tests fed PRAGMA journal_mode=WAL an
OperationalError('disk I/O error'), expecting it to fall through to the
DELETE fallback. But bare transient EIO on the WAL pragma is intentionally
re-raised (Bug D / test_reraises_on_disk_io_error): treating it as a
permanent WAL-incompat marker risks mixed-journal-mode cross-process
corruption. The WAL pragma therefore re-raised before any fallback ran,
failing all four new tests under CI.
Rewrite the test connection factories so WAL fails with a recognized
WAL-incompat marker (locking protocol) and only the DELETE pragma (plus
the bare read-back) fails with disk I/O error. This exercises the genuine
both-fail path the PR targets while preserving the transient-EIO-re-raise
contract. No production logic change; the apply_wal_with_fallback docstring
is corrected to state the WAL pragma re-raises bare EIO.
|
Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up. |
What does this PR do?
When
apply_wal_with_fallback()inhermes_state.pyfalls back toPRAGMA journal_mode=DELETEon a WAL-incompatible filesystem (NFS / SMB / FUSE), and the DELETE pragma also fails — observed on APFS external SSDs raisingdisk I/O errorfor both pragma calls — the innerOperationalErrorwas uncaught and propagated out. Every caller ofapply_wal_with_fallbackthen crashed at init time:SessionDB.__init__(hermes_state.py:354) →_session_db = None, breaks/resume,/title,/history,/branch, session searchhermes_cli/kanban_db.py::connect()→ kanban dispatcher crashes every 60s when the dashboard is opengateway/platforms/api_server.py:349→ ResponseStore unavailableplugins/memory/holographic/store.py:134→ holographic memory store failsThe connection's default journal mode is already DELETE (SQLite's pre-WAL default), so the connection is still usable for reads/writes even when the explicit pragma is rejected. This change wraps the DELETE pragma in
try/except, logs a dedup'd WARNING perdb_labelwhen both fail, and returns"delete"so callers keep running.Related Issue
Fixes #30816
Type of Change
Changes Made
hermes_state.py— wrap the innerconn.execute(\"PRAGMA journal_mode=DELETE\")intry/except sqlite3.OperationalError; log a dedup'd WARNING via the new_log_wal_delete_fallback_failed_oncehelper (mirrors the existing per-db_label dedup pattern used by_log_wal_fallback_once). No behavior change on the WAL-only-fails path.tests/test_hermes_state_wal_fallback.py— addstest_falls_back_when_delete_pragma_also_fails(verifies mode=="delete", both pragmas attempted, two distinct warnings, connection still usable),test_delete_fallback_failure_warning_deduplicated_per_db_label(regression guard for log-spam underkanban_db.connect()'s per-operation pattern), andtest_sessiondb_survives_when_both_journal_pragmas_fail(E2E:SessionDB()initializes and round-trips a session despite both pragmas failing). Reworks the existingtest_captures_cause_on_failed_initto fail throughPRAGMA foreign_keys=ONinstead ofjournal_modeso it still exercises the_set_last_init_errorcapture path under the new behavior.How to Test
try/exceptaround the DELETE pragma:sqlite3.OperationalError: disk I/O errorpropagating out ofapply_wal_with_fallback.Checklist
Code
fix(scope):,feat(scope):, etc.)Documentation & Housekeeping
docs/, docstrings) — N/A; the new helper carries its own docstringcli-config.yaml.exampleif I added/changed config keys — N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/ARelated / Positioning
Orthogonal to @SimoKiihamaki's #30654, which targets a different SQLite issue (#30636, SIGTERM-induced WAL corruption) by adding
synchronous=FULLafter the WAL pragma succeeds and switchingclose()towal_checkpoint(TRUNCATE). Both PRs touchapply_wal_with_fallbackbut on different lines (their addition lives on the WAL-success path; mine lives inside the DELETE-fallbackexceptblock), so they merge cleanly.