Skip to content

Commit 5c49cd0

Browse files
steveonjavakshitijk4poor
authored andcommitted
fix(state): never silently downgrade WAL to DELETE on transient EIO
apply_wal_with_fallback() treated "disk i/o error" as a permanent WAL-incompatibility marker, identical to "locking protocol" (NFS) and "not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is typically TRANSIENT — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property. Treating transient EIO as a permanent downgrade signal produces the mixed-journal-mode-across-processes corruption pattern: 1. Process A opens kanban.db, hits transient EIO on the WAL pragma, silently downgrades to journal_mode=DELETE. 2. Process B (no EIO) opens the same file moments later and successfully sets journal_mode=WAL. 3. A writes rollback-journal frames while B writes WAL frames. SQLite documents this as unsupported and corrupts the file: https://www.sqlite.org/wal.html ("all connections to the same database must use the same locking protocol"). This was the root cause of repeated kanban.db corruption on hosts with multiple gateway processes plus CLI invocations against the same DB (observed pattern: corruption shortly after gateway startup, after the process logged "WAL journal_mode unsupported on this filesystem (disk I/O error) — falling back to journal_mode=DELETE"). The fallback warning told the truth — fallback DID happen — but the premise ("unsupported on this filesystem") was wrong; the EIO was a one-shot event and sibling processes successfully used WAL. Fix has two layers: 1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises so callers can retry instead of silently corrupting the DB. The two remaining markers ("locking protocol", "not authorized") are deterministic per filesystem so they remain safe permanent-downgrade signals. 2. Belt-and-suspenders: before downgrading on ANY marker match, peek the on-disk journal mode. If the header says WAL, refuse to downgrade and re-raise the original error. This guards against any future addition to _WAL_INCOMPAT_MARKERS turning out to be transient in some environment we haven't yet seen. Tests: - tests/test_hermes_state_wal_fallback.py: * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error asserting EIO is re-raised, not silently swallowed. * Added test_does_not_downgrade_when_disk_says_wal covering the on-disk-header safety guard for the existing legitimate markers. - tests/hermes_cli/test_kanban_db.py: * test_connect_falls_back_to_delete_on_locking_protocol now uses a truly-fresh DB (instead of the kanban_home fixture which pre-inits in WAL). On NFS the very first process touching the file legitimately downgrades; on a file already in WAL the new guard correctly refuses. A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py (not committed): without fix the DB silently flips from WAL to DELETE mid-process; with fix the EIO surfaces and the file stays WAL. Refs: Bug D in the kanban-corruption investigation series (Bugs A and C shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every corruption incident this week including those that survived A's single-dispatcher mitigation, because every CLI invocation is a separate process whose WAL pragma can transiently fail.
1 parent 6416dd5 commit 5c49cd0

4 files changed

Lines changed: 112 additions & 6 deletions

File tree

hermes_state.py

Lines changed: 27 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,6 @@
5454
_WAL_INCOMPAT_MARKERS = (
5555
"locking protocol", # SQLITE_PROTOCOL on NFS/SMB
5656
"not authorized", # Some FUSE mounts block WAL pragma outright
57-
"disk i/o error", # Flaky network FS during WAL setup
5857
)
5958

6059
# Last SessionDB() init error, per-process. Surfaced in /resume and
@@ -125,6 +124,27 @@ def format_session_db_unavailable(prefix: str = "Session database not available"
125124
return f"{prefix}: {cause}{hint}."
126125

127126

127+
def _on_disk_journal_mode(conn: sqlite3.Connection) -> Optional[str]:
128+
"""Read the journal mode from the SQLite DB header on disk.
129+
130+
Returns the mode string (e.g. ``"wal"``, ``"delete"``), or ``None``
131+
if the value cannot be determined (new DB, or PRAGMA read failed).
132+
"""
133+
try:
134+
row = conn.execute("PRAGMA journal_mode").fetchone()
135+
except sqlite3.OperationalError:
136+
return None
137+
if row is None:
138+
return None
139+
mode = row[0]
140+
if isinstance(mode, bytes): # defensive: sqlite3 occasionally returns bytes
141+
try:
142+
mode = mode.decode("ascii")
143+
except UnicodeDecodeError:
144+
return None
145+
return str(mode).strip().lower() if mode is not None else None
146+
147+
128148
def apply_wal_with_fallback(
129149
conn: sqlite3.Connection,
130150
*,
@@ -147,6 +167,8 @@ def apply_wal_with_fallback(
147167
148168
Shared by :class:`SessionDB` and ``hermes_cli.kanban_db.connect`` so
149169
both databases get identical fallback behavior.
170+
171+
Never downgrades to DELETE if the on-disk DB header reports WAL — see _on_disk_journal_mode.
150172
"""
151173
try:
152174
conn.execute("PRAGMA journal_mode=WAL")
@@ -156,6 +178,10 @@ def apply_wal_with_fallback(
156178
if not any(marker in msg for marker in _WAL_INCOMPAT_MARKERS):
157179
# Unrelated OperationalError — don't silently swallow.
158180
raise
181+
# Don't downgrade if another process already set WAL on disk.
182+
existing = _on_disk_journal_mode(conn)
183+
if existing == "wal":
184+
raise
159185
_log_wal_fallback_once(db_label, exc)
160186
conn.execute("PRAGMA journal_mode=DELETE")
161187
return "delete"

tests/hermes_cli/test_kanban_core_functionality.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,15 @@ def kanban_home(tmp_path, monkeypatch):
3636
home.mkdir()
3737
monkeypatch.setenv("HERMES_HOME", str(home))
3838
monkeypatch.setattr(Path, "home", lambda: tmp_path)
39+
# Disable the detect_crashed_workers grace period for legacy tests in
40+
# this file that claim a task and immediately expect
41+
# ``detect_crashed_workers`` to act on it. The grace period (30s by
42+
# default, see ``DEFAULT_CRASH_GRACE_SECONDS``) prevents the
43+
# multi-dispatcher reap race in production; setting it to 0 here
44+
# restores the pre-fix instant-reclaim semantics these tests were
45+
# written against. The grace-period itself is covered by dedicated
46+
# tests in tests/hermes_cli/test_kanban_db.py.
47+
monkeypatch.setenv("HERMES_KANBAN_CRASH_GRACE_SECONDS", "0")
3948
kb.init_db()
4049
return home
4150

tests/hermes_cli/test_kanban_db.py

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2097,17 +2097,31 @@ def test_latest_summaries_batch_omits_tasks_without_summary(kanban_home):
20972097
# NFS / network-filesystem fallback (see hermes_state.apply_wal_with_fallback)
20982098
# ---------------------------------------------------------------------------
20992099

2100-
def test_connect_falls_back_to_delete_on_locking_protocol(kanban_home, caplog):
2100+
def test_connect_falls_back_to_delete_on_locking_protocol(tmp_path, monkeypatch, caplog):
21012101
"""kanban_db.connect() must handle ``locking protocol`` on NFS/SMB.
21022102
21032103
Without this fallback, the gateway's kanban dispatcher crashes every
21042104
60s and the kanban migration (``consecutive_failures`` ADD COLUMN) is
21052105
retried forever — which is what the real-world user report shows
21062106
(see hermes-agent issue #22032).
2107+
2108+
NOTE: We do NOT use the ``kanban_home`` fixture here because that
2109+
fixture pre-initializes the DB via ``kb.init_db()`` — putting the
2110+
file in WAL on disk. The Bug D safety guard now refuses to downgrade
2111+
to DELETE when the on-disk header is already WAL, so testing the
2112+
NFS-fallback path requires a truly-fresh DB file (NFS scenario in
2113+
production: first connection of the first process ever to touch the
2114+
file, where downgrading is safe because nobody else has WAL state
2115+
yet).
21072116
"""
21082117
import sqlite3 as _sqlite3
21092118
from unittest.mock import patch as _patch
21102119

2120+
home = tmp_path / ".hermes"
2121+
home.mkdir()
2122+
monkeypatch.setenv("HERMES_HOME", str(home))
2123+
monkeypatch.setattr(Path, "home", lambda: tmp_path)
2124+
21112125
# Clear module cache so a fresh connect() is attempted
21122126
kb._INITIALIZED_PATHS.clear()
21132127

tests/test_hermes_state_wal_fallback.py

Lines changed: 61 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -110,15 +110,72 @@ def test_falls_back_on_not_authorized(self, tmp_path):
110110
assert mode == "delete"
111111
conn.close()
112112

113-
def test_falls_back_on_disk_io_error(self, tmp_path):
114-
"""Flaky network FS → disk I/O error → still fall back."""
113+
def test_reraises_on_disk_io_error(self, tmp_path):
114+
"""Transient EIO from ``PRAGMA journal_mode=WAL`` must NOT silently
115+
downgrade to DELETE.
116+
117+
Regression for "Bug D": treating transient EIO as a permanent
118+
WAL-incompat marker produced the mixed-journal-mode-across-processes
119+
corruption pattern (process A downgrades to DELETE, sibling
120+
processes successfully set WAL, SQLite corrupts the file because
121+
the two locking protocols are documented as incompatible). EIO is
122+
usually transient (page-cache pressure, lock contention, brief
123+
storage hiccups); the right behavior is to re-raise so the caller
124+
can retry, not to walk the DB into a permanently downgraded state.
125+
"""
115126
conn, _ = _open_blocking(
116127
tmp_path / "flaky.db", reason="disk I/O error", isolation_level=None
117128
)
118-
mode = apply_wal_with_fallback(conn)
119-
assert mode == "delete"
129+
with pytest.raises(sqlite3.OperationalError, match="disk I/O error"):
130+
apply_wal_with_fallback(conn)
120131
conn.close()
121132

133+
def test_does_not_downgrade_when_disk_says_wal(self, tmp_path):
134+
"""Belt-and-suspenders: even if a marker matches, refuse to
135+
downgrade when the on-disk DB header is already WAL.
136+
137+
Prevents a future addition to ``_WAL_INCOMPAT_MARKERS`` from
138+
accidentally reintroducing the mixed-journal-mode corruption
139+
pattern. We construct a DB already in WAL on disk, then open a
140+
new connection whose ``PRAGMA journal_mode=WAL`` raises one of
141+
the legit markers — the function must still re-raise (refusing
142+
the downgrade) because the on-disk file is WAL.
143+
"""
144+
# Prime the file in WAL mode using a normal connection
145+
primer = sqlite3.connect(
146+
str(tmp_path / "already-wal.db"), isolation_level=None
147+
)
148+
try:
149+
primer.execute("PRAGMA journal_mode=WAL")
150+
primer.execute("CREATE TABLE t (x INTEGER)")
151+
primer.execute("INSERT INTO t VALUES (1)")
152+
assert (
153+
primer.execute("PRAGMA journal_mode").fetchone()[0].lower() == "wal"
154+
)
155+
finally:
156+
primer.close()
157+
158+
# New connection whose WAL pragma raises "locking protocol" — a
159+
# marker that WOULD normally trigger downgrade. With the on-disk
160+
# guard, we must instead re-raise.
161+
conn, _ = _open_blocking(
162+
tmp_path / "already-wal.db",
163+
reason="locking protocol",
164+
isolation_level=None,
165+
)
166+
with pytest.raises(sqlite3.OperationalError, match="locking protocol"):
167+
apply_wal_with_fallback(conn)
168+
conn.close()
169+
170+
# And the file is STILL WAL on disk — nothing got rewritten
171+
check = sqlite3.connect(str(tmp_path / "already-wal.db"))
172+
try:
173+
assert (
174+
check.execute("PRAGMA journal_mode").fetchone()[0].lower() == "wal"
175+
)
176+
finally:
177+
check.close()
178+
122179
def test_reraises_unrelated_operational_error(self, tmp_path):
123180
"""Non-WAL-compat errors must NOT be silently swallowed by the fallback."""
124181
conn, _ = _open_blocking(

0 commit comments

Comments
 (0)