fix: remove disk i/o error from WAL incompatibility markers #31014
kazuto-k wants to merge 1 commit into
A disk I/O error is not a signal of WAL filesystem incompatibility — it signals genuine storage failure. Treating it as a WAL incompatibility caused the fallback to DELETE journal mode, which compounds damage by writing directly to the main DB file during I/O errors. On 2026-05-23 this caused a corrupted B-tree page pointer in the kanban task_links table, destroying 161 of 166 parent-child link records during recovery (page 62 referenced nonexistent page 289). The correct response to disk I/O errors is to fail fast so the caller can retry or alert. WAL mode is safer during I/O errors because writes land in the WAL sidecar rather than directly mutating B-tree pages. Root cause analysis: https://github.com/kazuto-k/hermes-agent/issues/1
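A minimal sketch of the fail-fast behavior described above, not the actual hermes_state.py code. The marker tuple contents and the function name set_journal_mode_sketch are assumptions made for illustration; the PR only states that "disk i/o error" no longer belongs in the marker list.

```python
import sqlite3

# Illustrative marker list (assumption): the real _WAL_INCOMPAT_MARKERS contents
# are not shown in this PR. "disk i/o error" is deliberately absent.
WAL_INCOMPAT_MARKERS_SKETCH = ("locking protocol",)


def looks_like_wal_incompatibility(exc):
    """Return True only if the error text matches a known incompatibility marker."""
    message = str(exc).lower()
    return any(marker in message for marker in WAL_INCOMPAT_MARKERS_SKETCH)


def set_journal_mode_sketch(conn):
    """Try WAL; fall back to DELETE only for incompatibility, otherwise re-raise."""
    try:
        return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    except sqlite3.OperationalError as exc:
        if not looks_like_wal_incompatibility(exc):
            # Genuine storage failures (e.g. "disk I/O error") propagate to the
            # caller, which can retry or alert.
            raise
        return conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
```

With this shape, a disk I/O error during WAL setup surfaces as an exception instead of triggering a journal mode downgrade.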
b173779 to 1672c77
Hey @kazuto-k, I opened #31294 a few minutes ago with the same marker removal plus a small safety guard and tests. The guard reads the on-disk journal mode before downgrading, so if another process already set WAL, the fallback re-raises instead of writing a rollback journal into a WAL-mode file. Fine with me if maintainers want the smaller surface here, since your marker removal alone fixes the immediate corruption case. Related: #30700 adds retries on transient WAL setup failures, which is complementary either way.
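For readers following along, here is one way such a guard could look. This is a sketch under assumptions, not the code in #31294: the function names are hypothetical, and reading the file header directly is only one possible way to learn the on-disk journal mode. It relies on the SQLite file format, where header bytes 18 and 19 (write and read version) are 2 for a WAL-mode database and 1 for a rollback-journal one.

```python
import sqlite3

SQLITE_MAGIC = b"SQLite format 3\x00"


def on_disk_journal_is_wal(path):
    """Inspect the database header without opening it through SQLite."""
    try:
        with open(path, "rb") as handle:
            header = handle.read(20)
    except FileNotFoundError:
        return False  # no file yet, so nothing on disk is in WAL mode
    if len(header) < 20 or header[:16] != SQLITE_MAGIC:
        return False  # empty or unrecognized file
    # Offsets 18/19 hold the file format write/read versions: 2 means WAL.
    return header[18] == 2 or header[19] == 2


def downgrade_if_safe(conn, path, original_exc):
    """Hypothetical guard: refuse the DELETE fallback on a WAL-mode file."""
    if on_disk_journal_is_wal(path):
        raise original_exc
    return conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
```

The guard re-raises the original error rather than inventing a new one, so callers see the same failure they would have seen without any fallback logic.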
Thanks @steveonjava. I'm in favor of the more comprehensive approach in #31294: the journal mode check before downgrading is a necessary safety guard. That said, if maintainers want to merge the minimal fix here first to stop the immediate corruption while #31294 goes through review, either order works for me. Leaving it to maintainer preference.
Closing as already fixed on main: landed via #33482, commit 5c49cd0ed (@steveonjava's batch-salvage). That commit removes
Summary
_WAL_INCOMPAT_MARKERS in hermes_state.py incorrectly included "disk i/o error" as a signal that WAL journal mode is unsupported on the filesystem. A disk I/O error is not a WAL incompatibility; it signals genuine storage failure.
What Went Wrong
When the SQLite kanban DB experienced a transient disk I/O error on 2026-05-23, apply_wal_with_fallback() interpreted it as "WAL unsupported, fall back to DELETE journal mode." The DELETE fallback then continued writing directly to the main DB file during I/O errors, causing a partial write that corrupted a B-tree page pointer: in the task_links table, B-tree page 62 referenced nonexistent page 289. Running .recover could not salvage the corrupted table data.
The Fix
Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. When a genuine disk I/O error occurs during WAL setup, the code now re-raises the exception (fail-fast) instead of silently switching to a more fragile journal mode. WAL is safer during I/O errors because writes land in the WAL sidecar first, rather than directly mutating B-tree pages.
Impact
Verification
The fix has been running locally with hermes gateway restart and the kanban DB remains healthy (integrity check passes). A backup cron with pre-backup integrity check was also added to prevent undetected corruption from propagating to backups.
Closes #1
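The pre-backup integrity check mentioned under Verification could be sketched roughly as follows. The function names and file paths are hypothetical and the actual cron job is not shown in this PR; the sketch uses only the standard PRAGMA integrity_check and Python's sqlite3 Connection.backup().

```python
import sqlite3


def integrity_ok(db_path):
    """Return True if PRAGMA integrity_check reports a single 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Severe damage (e.g. an unreadable header) raises instead of reporting.
        return False
    finally:
        conn.close()
    return rows == [("ok",)]


def backup_if_healthy(src_path, dest_path):
    """Copy the database only when the source passes the integrity check."""
    if not integrity_ok(src_path):
        return False  # skip the backup so corruption does not propagate
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dest_path)
    try:
        src.backup(dst)  # online backup API, available since Python 3.7
    finally:
        dst.close()
        src.close()
    return True
```

A cron wrapper would call backup_if_healthy() and alert when it returns False, leaving the previous good backup untouched.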