fix: handle transient kanban SQLite disk I/O errors by nuch1011 · Pull Request #31973 · NousResearch/hermes-agent

nuch1011 · 2026-05-25T08:58:12Z

Summary

Stop treating SQLite disk I/O error as proof that WAL is unsupported.
Surface DELETE fallback failures with an actionable warning and propagate the DELETE error.
Throttle repeated gateway embedded kanban dispatcher tracebacks for transient SQLite disk I/O failures until the DB file changes or a dispatch tick succeeds.

Test Plan

git diff --check
/usr/local/lib/hermes-agent/venv/bin/python -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py -q -o 'addopts='
Added-line no-secrets scan over git diff

Kanban: t_1719c3ef

hclsys · 2026-05-25T09:11:17Z

Read through this — the three changes hang together well and the dispatcher-side handling is correct. One behavior-change tradeoff worth calling out explicitly for the maintainer, since it's the crux of the PR:

Removing "disk i/o error" from _WAL_INCOMPAT_MARKERS means a disk-I/O error during WAL setup is no longer auto-classified as 'WAL unsupported → silently fall back to DELETE'. It now flows to the raise on the unrelated-OperationalError path. The old inline comment framed disk-i/o as 'Flaky network FS during WAL setup' — so the previous behavior was: NFS/SMB users hitting a transient I/O blip during WAL init got a silent DELETE-mode downgrade and kept running. After this PR they get the error surfaced instead.
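To make the classification change concrete, here is a minimal sketch, not the PR's actual code. The function name `apply_wal_with_fallback_sketch`, the helper `is_wal_incompat`, and the single marker string are assumptions for illustration; the PR only tells us that "disk i/o error" was removed from `_WAL_INCOMPAT_MARKERS` and that a failing DELETE fallback is now logged and propagated.

```python
import logging
import sqlite3

logger = logging.getLogger("kanban.wal_sketch")

# Illustrative marker only (assumption): the real contents of
# _WAL_INCOMPAT_MARKERS are not shown in this PR. The point is that
# "disk i/o error" is NOT in the tuple.
WAL_INCOMPAT_MARKERS = ("locking protocol",)


def is_wal_incompat(exc: sqlite3.OperationalError) -> bool:
    """Return True only for messages taken as 'WAL is unsupported here'."""
    message = str(exc).lower()
    return any(marker in message for marker in WAL_INCOMPAT_MARKERS)


def apply_wal_with_fallback_sketch(conn) -> str:
    """Try WAL; fall back to DELETE only on a WAL-incompatibility marker."""
    try:
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
        return str(row[0]).lower()
    except sqlite3.OperationalError as exc:
        if not is_wal_incompat(exc):
            # Unrelated errors, now including "disk I/O error", propagate.
            raise
    try:
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
        return str(row[0]).lower()
    except sqlite3.OperationalError:
        # The fallback itself failed: warn with something actionable, re-raise.
        logger.warning(
            "DELETE journal-mode fallback failed; check that the database "
            "directory is writable and the filesystem is healthy"
        )
        raise
```

Under this sketch a transient I/O blip during WAL setup raises immediately and never reaches the DELETE branch, which is the behavior change discussed above.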

That's a defensible call (surface-over-mask — a transient I/O error genuinely isn't a WAL-capability signal, and masking it as 'WAL unsupported' hides a real FS problem), and you've backstopped it well: apply_wal_with_fallback now also propagates a failing DELETE fallback with an actionable log, and the gateway dispatcher's new _is_transient_board_disk_io_error path throttles the repeated tracebacks + resets on DB-change/success so a flaky board doesn't spam or crash the watcher. So the propagation has a catcher.
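The catcher on the dispatcher side can be sketched roughly as follows. This is an illustration under stated assumptions, not the gateway's code: `TransientErrorThrottle`, `run_tick`, and the return strings are hypothetical names, and only the idea (log the traceback once, stay quiet until the DB changes or a tick succeeds) comes from the PR description.

```python
import logging
import sqlite3

logger = logging.getLogger("kanban.dispatcher_sketch")


def is_transient_disk_io_error(exc: BaseException) -> bool:
    """Assumed check: an OperationalError whose text mentions disk I/O."""
    return (
        isinstance(exc, sqlite3.OperationalError)
        and "disk i/o error" in str(exc).lower()
    )


class TransientErrorThrottle:
    """Remembers, per board, the DB fingerprint last logged for."""

    def __init__(self):
        self._last_logged = {}

    def should_log(self, board, fingerprint) -> bool:
        if self._last_logged.get(board) == fingerprint:
            return False  # same failing DB state: suppress the repeat
        self._last_logged[board] = fingerprint
        return True

    def reset(self, board) -> None:
        self._last_logged.pop(board, None)


def run_tick(board, dispatch, fingerprint, throttle) -> str:
    """Run one dispatch tick; returns 'ok', 'logged', or 'suppressed'."""
    try:
        dispatch(board)
    except sqlite3.OperationalError as exc:
        if not is_transient_disk_io_error(exc):
            raise
        if throttle.should_log(board, fingerprint):
            logger.exception("transient disk I/O error on board %s", board)
            return "logged"
        return "suppressed"
    throttle.reset(board)  # a successful tick re-arms logging
    return "ok"
```

Keying the throttle on a fingerprint rather than a timer means a board that stays broken stays quiet, while any change to the DB file produces one fresh traceback.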

The only thing I'd want confirmed: for a user on a genuinely WAL-incapable mount that also surfaces as 'disk i/o error' (some FUSE/network setups conflate these), do they still end up in a working DELETE-mode session, or do they now hit the propagated raise on every init? If the latter, a brief note in the PR on the intended migration for those users would help. Logic itself looks sound.

nuch1011 · 2026-05-26T08:53:32Z

Kanban t_c0d6fa7b reopened this after the 2026-05-26 recurrence.

Update:

Existing PR reused and updated to current origin/main via branch fix-kanban-wal-fallback.
disk I/O error remains outside WAL-incompat fallback markers so transient SQLite I/O failures are not silently downgraded to DELETE fallback.
Dispatcher-side repeated transient disk-I/O tick failures are throttled per board DB fingerprint.
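For reference, one way a per-board DB fingerprint could be computed is sketched below. The PR does not say what the fingerprint contains; using modification time plus size from `os.stat` is an assumption, and `board_db_fingerprint` is a hypothetical name.

```python
import os
from typing import Optional, Tuple


def board_db_fingerprint(path: str) -> Optional[Tuple[int, int]]:
    """Return (mtime in ns, size in bytes) for the DB file, or None.

    None covers a missing or unreadable file, so that state also
    compares as a distinct fingerprint.
    """
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (st.st_mtime_ns, st.st_size)
```

A throttle keyed on this value stays silent while the file is untouched and re-arms as soon as a write changes its size or modification time.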

Verification:

python -m pytest tests/test_hermes_state_wal_fallback.py::TestApplyWalWithFallback tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_throttles_transient_disk_io_errors -o 'addopts=' -q -> 9 passed
python -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_disables_corrupt_board_without_traceback tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_throttles_transient_disk_io_errors -o 'addopts=' -q -> 18 passed
Added-line no-secrets diff scan -> clean
gh pr checks -> no checks reported on this branch

nuch1011 · 2026-05-26T11:24:50Z

Closing per project governance: this work must not be merged into NousResearch/hermes-agent or any upstream/third-party branch from our automation. Keep the branch only in nuch1011/hermes-agent for local/fork reference unless Christian explicitly requests an upstream contribution.

fix: handle transient kanban SQLite disk I/O errors

a0945c1

nuch1011 mentioned this pull request May 25, 2026

fix: handle transient kanban SQLite disk I/O errors nuch1011/hermes-agent#2

Closed

alt-glitch added the type/bug, P3, comp/plugins, comp/gateway, and comp/agent labels May 25, 2026

steveonjava mentioned this pull request May 26, 2026

fix(kanban): skip redundant WAL pragma on already-WAL connections #32489

Closed

19 tasks

Merge remote-tracking branch 'origin/main' into fix-kanban-wal-fallback

f45e095

nuch1011 closed this May 26, 2026

nuch1011 wants to merge 2 commits into NousResearch:main from nuch1011:fix-kanban-wal-fallback

3 participants
