Skip to content

fix(state): never silently downgrade WAL to DELETE on transient EIO#31294

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-wal-disk-io-error-corruption-v2
Closed

fix(state): never silently downgrade WAL to DELETE on transient EIO#31294
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-wal-disk-io-error-corruption-v2

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

Summary

Removes "disk i/o error" from _WAL_INCOMPAT_MARKERS so transient EIO from PRAGMA journal_mode=WAL no longer triggers a silent downgrade to DELETE journal mode. Adds an unconditional safety guard that re-raises if the on-disk DB header already reports WAL, defending against any future marker that turns out to be transient.

Problem

apply_wal_with_fallback in hermes_state.py treated OperationalError("disk i/o error") as a signal that the filesystem doesn't support WAL, and downgraded the connection to DELETE journal mode. EIO from SQLite is frequently transient — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property.

The consequence: one process hits a transient EIO and writes the file in DELETE-mode rollback-journal layout; sibling processes (a separate dispatcher, a gateway, a worker) successfully set WAL on the same file. Per the SQLite docs (https://www.sqlite.org/wal.html), all connections to the same database must use the same locking protocol — mixed-mode access corrupts the file. We've reproduced this corruption pattern in production against the kanban DB.

Fix

  1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. The two remaining markers ("locking protocol" for NFS/SMB, "not authorized" for restricted FUSE mounts) are deterministic per filesystem — they fire on every attempt, not transiently — so they remain safe permanent-downgrade signals.

  2. Add _on_disk_journal_mode(conn) helper that reads the on-disk journal mode via PRAGMA journal_mode.

  3. In apply_wal_with_fallback, after marker matching but before downgrade, check the on-disk mode. If the file is already WAL on disk (set by another process), re-raise the original OperationalError instead of downgrading. The caller can retry; we never walk an established-WAL DB into a mixed-mode state.

Verification

  • Targeted test suites all pass: tests/test_hermes_state_wal_fallback.py, tests/hermes_cli/test_kanban_db.py, tests/hermes_cli/test_kanban_core_functionality.py — 180/180 passing in our env. The single failure in the broader suite (test_resolve_hermes_argv_module_actually_runs) reproduces on unmodified upstream/main and is unrelated to this change.
  • New tests added (+147 lines):
    • test_hermes_state_wal_fallback.py: regression test that EIO no longer auto-downgrades to DELETE; safety-guard test confirming on-disk-WAL prevents downgrade even if a marker matches.
    • test_kanban_db.py / test_kanban_core_functionality.py: kanban-side coverage of the no-downgrade path.

Related

This PR closes ONE specific silent-downgrade pathway. It does not close all open kanban-corruption reports.

Files changed

  • hermes_state.py — remove EIO marker, add _on_disk_journal_mode, add safety-guard branch in apply_wal_with_fallback. +57/−5.
  • scripts/release.py — append AUTHOR_MAP entry (steveonjava@gmail.comsteveonjava) so contributor-check.yml passes. +1/−0.
  • tests/test_hermes_state_wal_fallback.py — new regression + safety-guard tests. +65/−0.
  • tests/hermes_cli/test_kanban_db.py, tests/hermes_cli/test_kanban_core_functionality.py — kanban-side coverage. +25/−1.

Branch: steveonjava:feat/kanban-wal-disk-io-error-corruption-v2
Commit: 1afc6ef80ee00d60b295dd7a75efb613758c78ea

apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.

Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:

  1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
     silently downgrades to journal_mode=DELETE.
  2. Process B (no EIO) opens the same file moments later and
     successfully sets journal_mode=WAL.
  3. A writes rollback-journal frames while B writes WAL frames. SQLite
     documents this as unsupported and corrupts the file:
     https://www.sqlite.org/wal.html ("all connections to the same
     database must use the same locking protocol").

This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.

Fix has two layers:

1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
   so callers can retry instead of silently corrupting the DB. The two
   remaining markers ("locking protocol", "not authorized") are
   deterministic per filesystem so they remain safe permanent-downgrade
   signals.

2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
   on-disk journal mode. If the header says WAL, refuse to downgrade and
   re-raise the original error. This guards against any future addition
   to _WAL_INCOMPAT_MARKERS turning out to be transient in some
   environment we haven't yet seen.

Tests:

- tests/test_hermes_state_wal_fallback.py:
  * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
    asserting EIO is re-raised, not silently swallowed.
  * Added test_does_not_downgrade_when_disk_says_wal covering the
    on-disk-header safety guard for the existing legitimate markers.

- tests/hermes_cli/test_kanban_db.py:
  * test_connect_falls_back_to_delete_on_locking_protocol now uses a
    truly-fresh DB (instead of the kanban_home fixture which pre-inits
    in WAL). On NFS the very first process touching the file legitimately
    downgrades; on a file already in WAL the new guard correctly refuses.

A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.

Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
@steveonjava steveonjava force-pushed the feat/kanban-wal-disk-io-error-corruption-v2 branch from 1afc6ef to c4aea65 Compare May 24, 2026 04:01
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P1 High — major feature broken, no workaround labels May 24, 2026
@neektza

neektza commented May 26, 2026

Copy link
Copy Markdown

Adding live field evidence in support of this fix — the symptom path described here matched our incident almost line for line.

Environment

  • Linux 7.0 (Ubuntu 25.04), ext4 on a single local disk (no NFS/SMB/FUSE)
  • Hermes Agent on main (1 ahead, 74 behind upstream)
  • 4 kanban boards; only the one with high write volume corrupts; the other three stayed clean throughout
  • 15 GiB RAM, no swap
  • WAL works fine on this filesystem when tested directly: sqlite3 /tmp/x.db 'PRAGMA journal_mode=WAL'wal

Timeline (most recent occurrence)

09:51:36  WARNING hermes_state: kanban.db: WAL journal_mode unsupported
           on this filesystem (disk I/O error) — falling back to journal_mode=DELETE
09:57:25  kanban.db.corrupt.<ts>.bak written
09:57:26  WARNING gateway.run: kanban notifier tick failed:
           'int' object has no attribute 'lower'   (loops every 5s)

6 minutes between the WAL→DELETE downgrade warning and the corruption surfacing. Gateway pid had been running for ~12.5h before this with no OOM, SIGKILL, or restart. The fallback warning was the only abnormal event in the preceding window.

Corruption shape (PRAGMA integrity_check):

Tree 11 page 11 cell 48: Rowid 374 out of order
Tree 9 page 9 cell 0:  2nd reference to page 455
Tree 9 page 9 cell 1:  2nd reference to page 442
... 48 cells, all "2nd reference to page N"

Tree 9 = task_runs, Tree 11 = sqlite_autoindex_kanban_notify_subs_1. Same B-tree-page-aliasing pattern that #31014's description calls out for task_links.

Notable secondary observation
The WAL-fallback warning had only fired twice in 30 days of journal history, but the affected board had ~90 corrupt-bak files accumulated over 5 days. So at least one separate corruption pathway exists beyond the fallback (we suspect SIGKILL-during-write from earlier OOM events — the gateway hit the kernel OOM killer twice within 10 minutes before the first malformed-disk-image errors on that earlier day). But for the most recent incident the fallback warning is the only candidate trigger we can correlate, and it sits 6 minutes upstream of the corruption surfacing — consistent with this PR's mechanism.

What we'd ask
+1 on getting this merged. Every recurrence of this corruption class costs ~30 min of .recover + scrub + atomic-swap work. #30700's retry layer plus this PR's no-silent-downgrade guarantee would close the loop for our environment.

Happy to attach the full integrity_check output, the sqlite_master rootpage mapping, or the systemd journal slice around 09:51–09:57 if useful.

(Also tracking #31014 from @kazuto-k — author endorsed this PR as the superset, so commenting here.)

@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit 5c49cd0). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic

4 participants