
fix: remove disk i/o error from WAL incompatibility markers #31014

Closed
kazuto-k wants to merge 1 commit into NousResearch:main from kazuto-k:fix/disk-io-error-wal-fallback


@kazuto-k


Summary

_WAL_INCOMPAT_MARKERS in hermes_state.py incorrectly included "disk i/o error" as a signal that WAL journal mode is unsupported on the filesystem. A disk I/O error is not a WAL incompatibility — it signals genuine storage failure.

What Went Wrong

When the SQLite kanban DB experienced a transient disk I/O error on 2026-05-23, apply_wal_with_fallback() interpreted it as "WAL unsupported, fall back to DELETE journal mode." The DELETE fallback then continued writing directly to the main DB file during I/O errors, causing a partial write that corrupted a B-tree page pointer:

  • task_links table B-tree page 62 referenced nonexistent page 289
  • 161 of 166 parent-child link records were destroyed
  • Recovery via .recover could not salvage the corrupted table data

The Fix

Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. When a genuine disk I/O error occurs during WAL setup, the code now re-raises the exception (fail-fast) instead of silently switching to a more fragile journal mode. WAL is safer during I/O errors because writes land in the WAL sidecar first, rather than directly mutating B-tree pages.
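
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_state.py: the marker strings and the helper _is_wal_incompat are assumptions made for the example, and the real apply_wal_with_fallback() may take different arguments.

```python
import sqlite3

# Illustrative markers only; the real list in hermes_state.py is not shown in
# this PR. The point is that "disk i/o error" is deliberately absent.
_WAL_INCOMPAT_MARKERS = ("locking protocol", "not authorized")


def _is_wal_incompat(exc: sqlite3.OperationalError) -> bool:
    """Return True only for errors that match a known incompatibility marker."""
    message = str(exc).lower()
    return any(marker in message for marker in _WAL_INCOMPAT_MARKERS)


def apply_wal_with_fallback(conn) -> str:
    """Try WAL; downgrade to DELETE only on a recognised incompatibility."""
    try:
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
    except sqlite3.OperationalError as exc:
        if not _is_wal_incompat(exc):
            # e.g. "disk I/O error": fail fast instead of downgrading.
            raise
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
    # SQLite reports the journal mode that is actually in effect.
    return row[0]
```

With this shape, an error whose text contains "disk I/O error" propagates to the caller, while an error matching one of the markers still triggers the DELETE fallback.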

Impact

  • Before: Disk I/O error → fall back to DELETE → continued writes on failing disk → data corruption
  • After: Disk I/O error → raise → caller (kanban dispatcher / session DB) retries or alerts → no silent data corruption
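
For the "retries or alerts" half of the After path, a caller might wrap the database open in something like the sketch below. The name open_with_retry, its parameters, and the alert hook are hypothetical; how the kanban dispatcher or session DB actually handles the re-raised error is not shown in this PR.

```python
import sqlite3
import time


def open_with_retry(open_db, attempts=3, base_delay=0.5,
                    alert=print, sleep=time.sleep):
    """Call open_db(); retry on OperationalError, then alert and re-raise.

    open_db is any zero-argument callable that opens the database and
    applies the journal mode (hypothetical; supplied by the caller).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return open_db()
        except sqlite3.OperationalError as exc:
            last_exc = exc
            if attempt < attempts:
                # Linear backoff between attempts.
                sleep(base_delay * attempt)
    # All attempts failed: surface the problem instead of hiding it.
    alert(f"database open failed after {attempts} attempts: {last_exc}")
    raise last_exc
```

The important property is that the final failure is loud: the error is reported and re-raised rather than absorbed by a journal-mode downgrade.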

Verification

The fix has been running locally since a hermes gateway restart, and the kanban DB remains healthy (the integrity check passes). A backup cron job with a pre-backup integrity check was also added so that undetected corruption cannot propagate to backups.
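
A pre-backup integrity check of the kind mentioned above could look like this sketch. The function name backup_if_healthy and the paths are assumptions; the actual cron script is not part of this PR. It uses only PRAGMA integrity_check and the standard-library sqlite3.Connection.backup() method.

```python
import sqlite3


def backup_if_healthy(src_path: str, dst_path: str) -> bool:
    """Copy src to dst only if PRAGMA integrity_check reports a clean database."""
    src = sqlite3.connect(src_path)
    try:
        rows = src.execute("PRAGMA integrity_check").fetchall()
        # A healthy database yields exactly one row containing "ok".
        if rows != [("ok",)]:
            return False
        dst = sqlite3.connect(dst_path)
        try:
            src.backup(dst)  # online backup API, Python 3.7+
        finally:
            dst.close()
        return True
    finally:
        src.close()
```

Returning False rather than overwriting the previous backup keeps the last known-good copy intact when the source is already damaged.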

Closes #1

A disk I/O error is not a signal of WAL filesystem incompatibility —
it signals genuine storage failure.  Treating it as a WAL
incompatibility caused the fallback to DELETE journal mode, which
compounds damage by writing directly to the main DB file during I/O
errors.  On 2026-05-23 this caused a corrupted B-tree page pointer in
the kanban task_links table, destroying 161 of 166 parent-child link
records during recovery (page 62 referenced nonexistent page 289).

The correct response to disk I/O errors is to fail fast so the caller
can retry or alert.  WAL mode is safer during I/O errors because
writes land in the WAL sidecar rather than directly mutating
B-tree pages.

Root cause analysis: https://github.com/kazuto-k/hermes-agent/issues/1
@kazuto-k force-pushed the fix/disk-io-error-wal-fallback branch from b173779 to 1672c77 on May 23, 2026 15:59
@alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/agent (Core agent loop, run_agent.py, prompt builder) on May 23, 2026
@steveonjava

Contributor

Hey @kazuto-k, I opened #31294 a few minutes ago with the same marker removal plus a small safety guard and tests. The guard reads the on-disk journal mode before downgrading, so if another process already set WAL, the fallback re-raises instead of writing a rollback journal into a WAL-mode file. Fine with me if maintainers want the smaller surface here, since your marker removal alone fixes the immediate corruption case. Related: #30700 adds retries on transient WAL setup failures, which is complementary either way.
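
One way such a guard could read the on-disk journal mode is from the SQLite file header, where the format write and read version bytes at offsets 18 and 19 are 1 for rollback-journal modes and 2 for WAL. The sketch below is an assumption about the approach, not the code in #31294; the names on_disk_is_wal and safe_to_downgrade are hypothetical.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"


def on_disk_is_wal(path: str) -> bool:
    """Return True if the database file header marks the file as WAL-mode."""
    try:
        with open(path, "rb") as fh:
            header = fh.read(20)
    except FileNotFoundError:
        return False
    if len(header) < 20 or header[:16] != SQLITE_MAGIC:
        # Empty, truncated, or not a SQLite file: nothing to protect.
        return False
    # Offsets 18/19: file format write/read version (1 = legacy, 2 = WAL).
    return 2 in (header[18], header[19])


def safe_to_downgrade(path: str) -> bool:
    """Only allow a DELETE fallback when the file is not already in WAL mode."""
    return not on_disk_is_wal(path)
```

A fallback path would consult safe_to_downgrade() before issuing PRAGMA journal_mode=DELETE and re-raise the original error when it returns False.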

@kazuto-k

Author

Thanks @steveonjava. I'm in favor of the more comprehensive approach in #31294 — the journal mode check before downgrading is a necessary safety guard. That said, if maintainers want to merge the minimal fix here first to stop the immediate corruption while #31294 goes through review, either order works for me. Leaving it to maintainer preference.

@steveonjava

Contributor

Thanks for the feedback, @kazuto-k. I agree, #31294 is a more comprehensive fix.

@kshitijk4poor

Collaborator

Closing as already fixed on main — landed via #33482 commit 5c49cd0ed (@steveonjava's batch-salvage). That commit removes "disk i/o error" from _WAL_INCOMPAT_MARKERS exactly as you proposed, with a belt-and-suspenders on-disk-header check before downgrading. Your reasoning (EIO is genuine storage failure / DELETE writes are more fragile) is consistent with the analysis in the merged commit. Thanks!

