fix: remove disk i/o error from WAL incompatibility markers by kazuto-k · Pull Request #31014 · NousResearch/hermes-agent

kazuto-k · 2026-05-23T15:57:26Z

Summary

_WAL_INCOMPAT_MARKERS in hermes_state.py incorrectly included "disk i/o error" as a signal that WAL journal mode is unsupported on the filesystem. A disk I/O error is not a WAL incompatibility — it signals genuine storage failure.

What Went Wrong

When the SQLite kanban DB experienced a transient disk I/O error on 2026-05-23, apply_wal_with_fallback() interpreted it as "WAL unsupported, fall back to DELETE journal mode." The DELETE fallback then continued writing directly to the main DB file during I/O errors, causing a partial write that corrupted a B-tree page pointer:

task_links table B-tree page 62 referenced nonexistent page 289
161 of 166 parent-child link records were destroyed
Recovery via .recover could not salvage the corrupted table data

The Fix

Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. When a genuine disk I/O error occurs during WAL setup, the code now re-raises the exception (fail-fast) instead of silently switching to a more fragile journal mode. WAL is safer during I/O errors because writes land in the WAL sidecar first, rather than directly mutating B-tree pages.
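The marker-based decision described above can be sketched roughly as follows. This is an illustration only, not the code in hermes_state.py: the function name `set_wal_or_fallback`, the tuple `EXAMPLE_WAL_INCOMPAT_MARKERS`, and its single entry are assumptions made for the example, and the real contents of _WAL_INCOMPAT_MARKERS are not shown in this PR.

```python
import sqlite3

# Hypothetical marker list for illustration; "disk i/o error" is deliberately absent.
EXAMPLE_WAL_INCOMPAT_MARKERS = ("locking protocol",)


def set_wal_or_fallback(conn, markers=EXAMPLE_WAL_INCOMPAT_MARKERS):
    """Try to enable WAL; fall back to DELETE only for errors matching a marker.

    Any other OperationalError (for example a disk I/O error) is re-raised
    so the caller can retry or alert instead of silently downgrading.
    """
    try:
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
        return row[0].lower()
    except sqlite3.OperationalError as exc:
        message = str(exc).lower()
        if any(marker in message for marker in markers):
            # Treated as "WAL unsupported on this filesystem": downgrade.
            row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
            return row[0].lower()
        # Not a known incompatibility signal: fail fast.
        raise
```

With a marker list like this, an error whose message contains "disk I/O error" propagates to the caller, while an error matching a listed marker still triggers the DELETE fallback.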

Impact

Before: Disk I/O error → fall back to DELETE → continued writes on failing disk → data corruption
After: Disk I/O error → raise → caller (kanban dispatcher / session DB) retries or alerts → no silent data corruption

Verification

The fix has been running locally after a hermes gateway restart, and the kanban DB remains healthy (integrity check passes). A backup cron job that runs an integrity check before each backup was also added, so that undetected corruption does not propagate to backups.
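A pre-backup integrity check of the kind mentioned above might look like the following minimal sketch. The function name `backup_if_healthy` and its arguments are hypothetical; the actual cron job is not shown in this PR. It relies only on SQLite's `PRAGMA integrity_check` and Python's `sqlite3.Connection.backup`.

```python
import sqlite3


def backup_if_healthy(src_path, dest_path):
    """Copy src_path to dest_path only if the integrity check reports 'ok'."""
    src = sqlite3.connect(src_path)
    try:
        rows = src.execute("PRAGMA integrity_check").fetchall()
        # A healthy database yields exactly one row containing 'ok'.
        if rows != [("ok",)]:
            raise RuntimeError(f"integrity check failed: {rows[:5]}")
        dest = sqlite3.connect(dest_path)
        try:
            src.backup(dest)  # online backup API, available since Python 3.7
        finally:
            dest.close()
    finally:
        src.close()
```

Raising before the copy means a damaged database never overwrites the last known-good backup.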

Closes #1

A disk I/O error is not a signal of WAL filesystem incompatibility — it signals genuine storage failure. Treating it as a WAL incompatibility caused the fallback to DELETE journal mode, which compounds damage by writing directly to the main DB file during I/O errors. On 2026-05-23 this caused a corrupted B-tree page pointer in the kanban task_links table, destroying 161 of 166 parent-child link records during recovery (page 62 referenced nonexistent page 289). The correct response to disk I/O errors is to fail fast so the caller can retry or alert. WAL mode is safer during I/O errors because writes land in the WAL sidecar rather than directly mutating B-tree pages. Root cause analysis: https://github.com/kazuto-k/hermes-agent/issues/1

steveonjava · 2026-05-24T04:06:09Z

Hey @kazuto-k, I opened #31294 a few minutes ago with the same marker removal plus a small safety guard and tests. The guard reads the on-disk journal mode before downgrading, so if another process already set WAL, the fallback re-raises instead of writing a rollback journal into a WAL-mode file. Fine with me if maintainers want the smaller surface here, since your marker removal alone fixes the immediate corruption case. Related: #30700 adds retries on transient WAL setup failures, which is complementary either way.
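One way such a guard could be written is sketched below. It is not the code from #31294: the helper names `on_disk_journal_is_wal` and `downgrade_to_delete` are hypothetical, and reading the file header directly is an assumption about how "reads the on-disk journal mode" might be implemented. The sketch relies on the documented SQLite file format, where header bytes 18 and 19 (the write and read format versions) are 2 for a WAL-mode database and 1 for a rollback-journal one.

```python
import sqlite3

SQLITE_MAGIC = b"SQLite format 3\x00"


def on_disk_journal_is_wal(db_path):
    """Return True if the database file header marks it as WAL-mode."""
    with open(db_path, "rb") as fh:
        header = fh.read(20)
    if len(header) < 20 or header[:16] != SQLITE_MAGIC:
        # Empty, brand-new, or non-SQLite file: no WAL state to protect.
        return False
    # Byte 18 = file format write version, byte 19 = read version; 2 means WAL.
    return header[18] == 2 or header[19] == 2


def downgrade_to_delete(conn, db_path):
    """Switch to DELETE journaling unless the file is already WAL on disk."""
    if on_disk_journal_is_wal(db_path):
        raise sqlite3.OperationalError(
            "refusing to downgrade: database is already in WAL mode on disk"
        )
    return conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0].lower()
```

Checking the file rather than the connection matters here because the connection that failed to set WAL cannot be trusted to report what another process has already written.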

kazuto-k · 2026-05-25T01:31:36Z

Thanks @steveonjava. I'm in favor of the more comprehensive approach in #31294 — the journal mode check before downgrading is a necessary safety guard. That said, if maintainers want to merge the minimal fix here first to stop the immediate corruption while #31294 goes through review, either order works for me. Leaving it to maintainer preference.

steveonjava · 2026-05-25T01:37:05Z

Thanks for the feedback, @kazuto-k. I agree, #31294 is a more comprehensive fix.

kshitijk4poor · 2026-05-28T06:39:41Z

Closing as already fixed on main — landed via #33482 commit 5c49cd0ed (@steveonjava's batch-salvage). That commit removes "disk i/o error" from _WAL_INCOMPAT_MARKERS exactly as you proposed, with a belt-and-suspenders on-disk-header check before downgrading. Your reasoning (EIO is genuine storage failure / DELETE writes are more fragile) is consistent with the analysis in the merged commit. Thanks!

kazuto-k force-pushed the fix/disk-io-error-wal-fallback branch from b173779 to 1672c77 on May 23, 2026 15:59

alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on May 23, 2026

alt-glitch mentioned this pull request May 23, 2026: kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158 (Closed)

steveonjava mentioned this pull request May 24, 2026: fix(state): never silently downgrade WAL to DELETE on transient EIO #31294 (Closed)

Tranquil-Flow mentioned this pull request May 24, 2026: fix(state): proactively skip WAL journal mode on BTRFS filesystems (#30846) #31586 (Open)

steveonjava mentioned this pull request May 26, 2026: fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857 (Closed)


kshitijk4poor mentioned this pull request May 27, 2026: fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482 (Merged)

kshitijk4poor closed this May 28, 2026

kazuto-k wants to merge 1 commit into NousResearch:main from kazuto-k:fix/disk-io-error-wal-fallback
