fix(state): never silently downgrade WAL to DELETE on transient EIO by steveonjava · Pull Request #31294 · NousResearch/hermes-agent

steveonjava · 2026-05-24T04:01:04Z

Summary

Removes "disk i/o error" from _WAL_INCOMPAT_MARKERS so transient EIO from PRAGMA journal_mode=WAL no longer triggers a silent downgrade to DELETE journal mode. Adds an unconditional safety guard that re-raises if the on-disk DB header already reports WAL, defending against any future marker that turns out to be transient.

Problem

apply_wal_with_fallback in hermes_state.py treated OperationalError("disk i/o error") as a signal that the filesystem doesn't support WAL, and downgraded the connection to DELETE journal mode. EIO from SQLite is frequently transient — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property.

The consequence: one process hits a transient EIO and writes the file in DELETE-mode rollback-journal layout; sibling processes (a separate dispatcher, a gateway, a worker) successfully set WAL on the same file. Per the SQLite docs (https://www.sqlite.org/wal.html), all connections to the same database must use the same locking protocol — mixed-mode access corrupts the file. We've reproduced this corruption pattern in production against the kanban DB.

Fix

1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. The two remaining markers ("locking protocol" for NFS/SMB, "not authorized" for restricted FUSE mounts) are deterministic per filesystem (they fire on every attempt, not transiently), so they remain safe permanent-downgrade signals.
2. Add _on_disk_journal_mode(conn) helper that reads the on-disk journal mode via PRAGMA journal_mode.
3. In apply_wal_with_fallback, after marker matching but before downgrade, check the on-disk mode. If the file is already WAL on disk (set by another process), re-raise the original OperationalError instead of downgrading. The caller can retry; we never walk an established-WAL DB into a mixed-mode state.

Verification

Targeted test suites all pass: tests/test_hermes_state_wal_fallback.py, tests/hermes_cli/test_kanban_db.py, tests/hermes_cli/test_kanban_core_functionality.py — 180/180 passing in our env. The single failure in the broader suite (test_resolve_hermes_argv_module_actually_runs) reproduces on unmodified upstream/main and is unrelated to this change.
New tests added (+147 lines):
- test_hermes_state_wal_fallback.py: regression test that EIO no longer auto-downgrades to DELETE; safety-guard test confirming on-disk-WAL prevents downgrade even if a marker matches.
- test_kanban_db.py / test_kanban_core_functionality.py: kanban-side coverage of the no-downgrade path.

Related

- #31014 "fix: remove disk i/o error from WAL incompatibility markers" (kazuto-k, open): concurrent narrower fix that also removes "disk i/o error" from the markers. This PR is a strict superset: same marker removal plus the on-disk-WAL safety guard and 147 lines of test coverage. Happy to defer / collaborate if maintainers prefer #31014's smaller surface.
- #30700 "fix(state): retry transient SQLite WAL setup failures" (deepujain, open, fixes #30576 "Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic"): adds busy_timeout=30s and 3× retry to WAL setup. Complementary with this PR: #30700 reduces the rate at which transient EIO surfaces to the caller; this PR ensures that when EIO does surface after retries, we don't silently corrupt. The two together close the loop.
- #30823 "fix(state): wrap DELETE journal_mode fallback in try/except to survive APFS double-failure" (briandevans, open): wraps the DELETE fallback in try/except for the APFS double-failure case. Orthogonal; touches the fallback path, not the marker classification.
- #30654 "fix(state): add PRAGMA synchronous=FULL + TRUNCATE checkpoint to prevent WAL corruption" (SimoKiihamaki, open): synchronous=FULL + wal_checkpoint(TRUNCATE) on shutdown. Orthogonal; durability, not journal-mode classification.
- Refs #30445 "Kanban DB corruption risk from multi-gateway concurrent SQLite access". Related symptom (kanban corruption); different mechanism (inode confusion from hermes kanban init). Not closed by this PR.
- Refs #31158 "kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning". Related symptom; different mechanism. Not closed by this PR.

This PR closes ONE specific silent-downgrade pathway. It does not close all open kanban-corruption reports.

Files changed

hermes_state.py — remove EIO marker, add _on_disk_journal_mode, add safety-guard branch in apply_wal_with_fallback. +57/−5.
scripts/release.py — append AUTHOR_MAP entry (steveonjava@gmail.com → steveonjava) so contributor-check.yml passes. +1/−0.
tests/test_hermes_state_wal_fallback.py — new regression + safety-guard tests. +65/−0.
tests/hermes_cli/test_kanban_db.py, tests/hermes_cli/test_kanban_core_functionality.py — kanban-side coverage. +25/−1.

Branch: steveonjava:feat/kanban-wal-disk-io-error-corruption-v2
Commit: 1afc6ef80ee00d60b295dd7a75efb613758c78ea

apply_wal_with_fallback() treated "disk i/o error" as a permanent WAL-incompatibility marker, identical to "locking protocol" (NFS) and "not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is typically TRANSIENT (page-cache pressure, brief lock contention, recoverable storage hiccups), not a permanent filesystem property.

Treating transient EIO as a permanent downgrade signal produces the mixed-journal-mode-across-processes corruption pattern:

1. Process A opens kanban.db, hits transient EIO on the WAL pragma, silently downgrades to journal_mode=DELETE.
2. Process B (no EIO) opens the same file moments later and successfully sets journal_mode=WAL.
3. A writes rollback-journal frames while B writes WAL frames. SQLite documents this as unsupported and corrupts the file: https://www.sqlite.org/wal.html ("all connections to the same database must use the same locking protocol").

This was the root cause of repeated kanban.db corruption on hosts with multiple gateway processes plus CLI invocations against the same DB (observed pattern: corruption shortly after gateway startup, after the process logged "WAL journal_mode unsupported on this filesystem (disk I/O error) — falling back to journal_mode=DELETE"). The fallback warning told the truth (fallback DID happen), but the premise ("unsupported on this filesystem") was wrong; the EIO was a one-shot event and sibling processes successfully used WAL.

Fix has two layers:

1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises so callers can retry instead of silently corrupting the DB. The two remaining markers ("locking protocol", "not authorized") are deterministic per filesystem so they remain safe permanent-downgrade signals.
2. Belt-and-suspenders: before downgrading on ANY marker match, peek the on-disk journal mode. If the header says WAL, refuse to downgrade and re-raise the original error. This guards against any future addition to _WAL_INCOMPAT_MARKERS turning out to be transient in some environment we haven't yet seen.

Tests:

- tests/test_hermes_state_wal_fallback.py:
  * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error asserting EIO is re-raised, not silently swallowed.
  * Added test_does_not_downgrade_when_disk_says_wal covering the on-disk-header safety guard for the existing legitimate markers.
- tests/hermes_cli/test_kanban_db.py:
  * test_connect_falls_back_to_delete_on_locking_protocol now uses a truly-fresh DB (instead of the kanban_home fixture which pre-inits in WAL). On NFS the very first process touching the file legitimately downgrades; on a file already in WAL the new guard correctly refuses.

A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py (not committed): without fix the DB silently flips from WAL to DELETE mid-process; with fix the EIO surfaces and the file stays WAL.

Refs: Bug D in the kanban-corruption investigation series (Bugs A and C shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every corruption incident this week including those that survived A's single-dispatcher mitigation, because every CLI invocation is a separate process whose WAL pragma can transiently fail.
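
Since EIO now surfaces to the caller, a caller-side retry might look something like the sketch below. The function name, the attempt count, and the backoff values are illustrative assumptions; this is not the retry layer from #30700 or any code in this PR.

```python
import sqlite3
import time


def set_wal_with_retry(conn, attempts=3, base_delay=0.05):
    """Retry PRAGMA journal_mode=WAL on "disk i/o error"; re-raise anything else."""
    for attempt in range(1, attempts + 1):
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            return str(row[0]).lower()
        except sqlite3.OperationalError as exc:
            transient = "disk i/o error" in str(exc).lower()
            if not transient or attempt == attempts:
                # Give up: the caller sees the original error, never a silent downgrade.
                raise
            # Exponential backoff between attempts (hypothetical tuning).
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The important property is that exhaustion ends in an exception rather than a journal-mode change, so the file is left in whatever mode it already had.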

neektza · 2026-05-26T09:05:00Z

Adding live field evidence in support of this fix — the symptom path described here matched our incident almost line for line.

Environment

Linux 7.0 (Ubuntu 25.04), ext4 on a single local disk (no NFS/SMB/FUSE)
Hermes Agent on main (1 ahead, 74 behind upstream)
4 kanban boards; only the one with high write volume corrupts; the other three stayed clean throughout
15 GiB RAM, no swap
WAL works fine on this filesystem when tested directly: sqlite3 /tmp/x.db 'PRAGMA journal_mode=WAL' → wal
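
The direct WAL check in the last item can also be scripted. The sketch below is a hypothetical helper (the name `probe_wal_support` is made up for illustration) that creates a throwaway database in a given directory, asks for WAL, and reports the mode SQLite grants.

```python
import os
import sqlite3
import tempfile


def probe_wal_support(directory):
    """Create a scratch DB in `directory` and return the journal mode SQLite grants."""
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            return str(row[0]).lower()
        finally:
            conn.close()
    finally:
        # Remove the scratch database and any WAL/SHM sidecar files.
        for suffix in ("", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass
```

A single successful probe only shows that the filesystem supports WAL in general; it says nothing about whether an individual pragma call can fail transiently under load.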

Timeline (most recent occurrence)

09:51:36  WARNING hermes_state: kanban.db: WAL journal_mode unsupported
           on this filesystem (disk I/O error) — falling back to journal_mode=DELETE
09:57:25  kanban.db.corrupt.<ts>.bak written
09:57:26  WARNING gateway.run: kanban notifier tick failed:
           'int' object has no attribute 'lower'   (loops every 5s)

Roughly 6 minutes passed between the WAL→DELETE downgrade warning and the corruption surfacing. The gateway pid had been running for ~12.5h before this with no OOM, SIGKILL, or restart. The fallback warning was the only abnormal event in the preceding window.

Corruption shape (PRAGMA integrity_check):

Tree 11 page 11 cell 48: Rowid 374 out of order
Tree 9 page 9 cell 0:  2nd reference to page 455
Tree 9 page 9 cell 1:  2nd reference to page 442
... 48 cells, all "2nd reference to page N"

Tree 9 = task_runs, Tree 11 = sqlite_autoindex_kanban_notify_subs_1. Same B-tree-page-aliasing pattern that #31014's description calls out for task_links.
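
For anyone repeating this mapping, here is a minimal sketch of pairing integrity_check messages with object names. It assumes the number after "Tree" in a message is the root page of the affected B-tree, which matches the mapping above; the helper names are hypothetical.

```python
import re
import sqlite3


def rootpage_names(conn):
    """Map each root page number to the table or index stored there."""
    rows = conn.execute(
        "SELECT name, rootpage FROM sqlite_master WHERE rootpage > 0"
    ).fetchall()
    return {rootpage: name for name, rootpage in rows}


def integrity_report(conn, limit=100):
    """Return (owner_name_or_None, message) pairs from PRAGMA integrity_check."""
    roots = rootpage_names(conn)
    rows = conn.execute(f"PRAGMA integrity_check({int(limit)})").fetchall()
    report = []
    for (message,) in rows:
        match = re.match(r"Tree (\d+) ", str(message))
        owner = roots.get(int(match.group(1))) if match else None
        report.append((owner, str(message)))
    return report
```

On a healthy database the report is a single `(None, "ok")` entry. On a badly damaged file the sqlite_master query itself may fail, so this is only useful while the schema pages are still readable.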

Notable secondary observation
The WAL-fallback warning had only fired twice in 30 days of journal history, but the affected board had ~90 corrupt-bak files accumulated over 5 days. So at least one separate corruption pathway exists beyond the fallback (we suspect SIGKILL-during-write from earlier OOM events — the gateway hit the kernel OOM killer twice within 10 minutes before the first malformed-disk-image errors on that earlier day). But for the most recent incident the fallback warning is the only candidate trigger we can correlate, and it sits 6 minutes upstream of the corruption surfacing — consistent with this PR's mechanism.

What we'd ask
+1 on getting this merged. Every recurrence of this corruption class costs ~30 min of .recover + scrub + atomic-swap work. #30700's retry layer plus this PR's no-silent-downgrade guarantee would close the loop for our environment.

Happy to attach the full integrity_check output, the sqlite_master rootpage mapping, or the systemd journal slice around 09:51–09:57 if useful.

(Also tracking #31014 from @kazuto-k — author endorsed this PR as the superset, so commenting here.)

steveonjava · 2026-05-26T22:36:35Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor · 2026-05-27T21:32:58Z

Merged via #33482 (commit 5c49cd0). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

steveonjava force-pushed the feat/kanban-wal-disk-io-error-corruption-v2 branch from 1afc6ef to c4aea65 Compare May 24, 2026 04:01

steveonjava mentioned this pull request May 24, 2026

fix: remove disk i/o error from WAL incompatibility markers #31014

Closed

alt-glitch added the type/bug, comp/agent, comp/cli, and P1 labels May 24, 2026

Tranquil-Flow mentioned this pull request May 24, 2026

fix(state): proactively skip WAL journal mode on BTRFS filesystems (#30846) #31586

Open

Merge branch 'main' into feat/kanban-wal-disk-io-error-corruption-v2

15a1deb

steveonjava marked this pull request as ready for review May 25, 2026 01:34

This was referenced May 25, 2026

fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226

Closed

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322

Closed

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this May 27, 2026

kaluluosi mentioned this pull request Jun 12, 2026

[Bug]: _try_wal_checkpoint TRUNCATE silently swallows exceptions, corrupts state.db WAL to zero bytes #44795

Open

liuhao1024 mentioned this pull request Jun 12, 2026

fix(state): log WAL checkpoint failures instead of silently swallowing #44834

Open
