fix(state): never silently downgrade WAL to DELETE on transient EIO#31294
fix(state): never silently downgrade WAL to DELETE on transient EIO#31294steveonjava wants to merge 2 commits into
Conversation
apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.
Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:
1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
silently downgrades to journal_mode=DELETE.
2. Process B (no EIO) opens the same file moments later and
successfully sets journal_mode=WAL.
3. A writes rollback-journal frames while B writes WAL frames. SQLite
documents this as unsupported and corrupts the file:
https://www.sqlite.org/wal.html ("all connections to the same
database must use the same locking protocol").
This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.
Fix has two layers:
1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
so callers can retry instead of silently corrupting the DB. The two
remaining markers ("locking protocol", "not authorized") are
deterministic per filesystem so they remain safe permanent-downgrade
signals.
2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
on-disk journal mode. If the header says WAL, refuse to downgrade and
re-raise the original error. This guards against any future addition
to _WAL_INCOMPAT_MARKERS turning out to be transient in some
environment we haven't yet seen.
Tests:
- tests/test_hermes_state_wal_fallback.py:
* Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
asserting EIO is re-raised, not silently swallowed.
* Added test_does_not_downgrade_when_disk_says_wal covering the
on-disk-header safety guard for the existing legitimate markers.
- tests/hermes_cli/test_kanban_db.py:
* test_connect_falls_back_to_delete_on_locking_protocol now uses a
truly-fresh DB (instead of the kanban_home fixture which pre-inits
in WAL). On NFS the very first process touching the file legitimately
downgrades; on a file already in WAL the new guard correctly refuses.
A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.
Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
1afc6ef to
c4aea65
Compare
|
Adding live field evidence in support of this fix — the symptom path described here matched our incident almost line for line. Environment
Timeline (most recent occurrence) 6 minutes between the WAL→DELETE downgrade warning and the corruption surfacing. Gateway pid had been running for ~12.5h before this with no OOM, SIGKILL, or restart. The fallback warning was the only abnormal event in the preceding window. Corruption shape (PRAGMA integrity_check): Tree 9 = Notable secondary observation What we'd ask Happy to attach the full integrity_check output, the sqlite_master rootpage mapping, or the systemd journal slice around 09:51–09:57 if useful. (Also tracking #31014 from |
|
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing. |
|
Merged via #33482 (commit 5c49cd0). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks! |
Summary
Removes
"disk i/o error"from_WAL_INCOMPAT_MARKERSso transient EIO fromPRAGMA journal_mode=WALno longer triggers a silent downgrade to DELETE journal mode. Adds an unconditional safety guard that re-raises if the on-disk DB header already reports WAL, defending against any future marker that turns out to be transient.Problem
apply_wal_with_fallbackinhermes_state.pytreatedOperationalError("disk i/o error")as a signal that the filesystem doesn't support WAL, and downgraded the connection to DELETE journal mode. EIO from SQLite is frequently transient — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property.The consequence: one process hits a transient EIO and writes the file in DELETE-mode rollback-journal layout; sibling processes (a separate dispatcher, a gateway, a worker) successfully set WAL on the same file. Per the SQLite docs (https://www.sqlite.org/wal.html), all connections to the same database must use the same locking protocol — mixed-mode access corrupts the file. We've reproduced this corruption pattern in production against the kanban DB.
Fix
Remove
"disk i/o error"from_WAL_INCOMPAT_MARKERS. The two remaining markers ("locking protocol"for NFS/SMB,"not authorized"for restricted FUSE mounts) are deterministic per filesystem — they fire on every attempt, not transiently — so they remain safe permanent-downgrade signals.Add
_on_disk_journal_mode(conn)helper that reads the on-disk journal mode viaPRAGMA journal_mode.In
apply_wal_with_fallback, after marker matching but before downgrade, check the on-disk mode. If the file is already WAL on disk (set by another process), re-raise the originalOperationalErrorinstead of downgrading. The caller can retry; we never walk an established-WAL DB into a mixed-mode state.Verification
tests/test_hermes_state_wal_fallback.py,tests/hermes_cli/test_kanban_db.py,tests/hermes_cli/test_kanban_core_functionality.py— 180/180 passing in our env. The single failure in the broader suite (test_resolve_hermes_argv_module_actually_runs) reproduces on unmodifiedupstream/mainand is unrelated to this change.test_hermes_state_wal_fallback.py: regression test that EIO no longer auto-downgrades to DELETE; safety-guard test confirming on-disk-WAL prevents downgrade even if a marker matches.test_kanban_db.py/test_kanban_core_functionality.py: kanban-side coverage of the no-downgrade path.Related
kazuto-k, open) — concurrent narrower fix that also removes"disk i/o error"from the markers. This PR is a strict superset: same marker removal plus the on-disk-WAL safety guard and 147 lines of test coverage. Happy to defer / collaborate if maintainers prefer fix: remove disk i/o error from WAL incompatibility markers #31014's smaller surface.deepujain, open, Fixes Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic #30576) — adds busy_timeout=30s and 3× retry to WAL setup. Complementary with this PR: fix(state): retry transient SQLite WAL setup failures (Fixes #30576) #30700 reduces the rate at which transient EIO surfaces to the caller; this PR ensures that when EIO does surface after retries, we don't silently corrupt. The two together close the loop.briandevans, open) — wraps the DELETE fallback in try/except for the APFS double-failure case. Orthogonal; touches the fallback path, not the marker classification.SimoKiihamaki, open) —synchronous=FULL+wal_checkpoint(TRUNCATE)on shutdown. Orthogonal; durability, not journal-mode classification.hermes kanban init). Not closed by this PR.This PR closes ONE specific silent-downgrade pathway. It does not close all open kanban-corruption reports.
Files changed
hermes_state.py— remove EIO marker, add_on_disk_journal_mode, add safety-guard branch inapply_wal_with_fallback. +57/−5.scripts/release.py— append AUTHOR_MAP entry (steveonjava@gmail.com→steveonjava) socontributor-check.ymlpasses. +1/−0.tests/test_hermes_state_wal_fallback.py— new regression + safety-guard tests. +65/−0.tests/hermes_cli/test_kanban_db.py,tests/hermes_cli/test_kanban_core_functionality.py— kanban-side coverage. +25/−1.Branch:
steveonjava:feat/kanban-wal-disk-io-error-corruption-v2Commit:
1afc6ef80ee00d60b295dd7a75efb613758c78ea