
fix: handle transient kanban SQLite disk I/O errors #31973

Closed
nuch1011 wants to merge 2 commits into NousResearch:main from nuch1011:fix-kanban-wal-fallback


@nuch1011


Summary

  • Stop treating SQLite disk I/O error as proof that WAL is unsupported.
  • Surface DELETE fallback failures with an actionable warning and propagate the DELETE error.
  • Throttle repeated gateway embedded kanban dispatcher tracebacks for transient SQLite disk I/O failures until the DB file changes or a dispatch tick succeeds.
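
For readers without the diff at hand, the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the names `_WAL_INCOMPAT_MARKERS` and `apply_wal_with_fallback` appear in this thread, while the marker strings, the `is_wal_incompat` helper, the return value, and the log wording are assumptions made for the example.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)

# Hypothetical marker strings, chosen only for illustration. The point of
# the PR is that "disk i/o error" is NOT in this tuple.
_WAL_INCOMPAT_MARKERS = ("locking protocol", "not supported")


def is_wal_incompat(exc):
    """Return True if the error message looks like 'WAL is unsupported here'."""
    msg = str(exc).lower()
    return any(marker in msg for marker in _WAL_INCOMPAT_MARKERS)


def apply_wal_with_fallback(conn):
    """Try WAL; fall back to DELETE only on a WAL-capability error.

    Returns the journal mode SQLite reports. Any other OperationalError,
    including a transient disk I/O error, is re-raised unchanged.
    """
    try:
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
        return row[0].lower()
    except sqlite3.OperationalError as exc:
        if not is_wal_incompat(exc):
            raise  # unrelated failure: surface it instead of masking it
    try:
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
        return row[0].lower()
    except sqlite3.OperationalError as delete_exc:
        # Assumed wording; the thread only says the warning is actionable.
        logger.warning(
            "DELETE journal fallback failed (%s); check that the database "
            "directory is writable and the filesystem is healthy",
            delete_exc,
        )
        raise
```

With this shape, a `disk I/O error` raised while enabling WAL reaches the caller, and a failing DELETE fallback is logged once and then propagated rather than swallowed.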

Test Plan

  • git diff --check
  • /usr/local/lib/hermes-agent/venv/bin/python -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py -q -o 'addopts='
  • Added-line no-secrets scan over git diff

Kanban: t_1719c3ef

@hclsys

hclsys commented May 25, 2026


Read through this — the three changes hang together well and the dispatcher-side handling is correct. One behavior-change tradeoff worth calling out explicitly for the maintainer, since it's the crux of the PR:

Removing "disk i/o error" from _WAL_INCOMPAT_MARKERS means a disk-I/O error during WAL setup is no longer auto-classified as 'WAL unsupported → silently fall back to DELETE'. It now flows to the raise on the unrelated-OperationalError path. The old inline comment framed disk-i/o as 'Flaky network FS during WAL setup' — so the previous behavior was: NFS/SMB users hitting a transient I/O blip during WAL init got a silent DELETE-mode downgrade and kept running. After this PR they get the error surfaced instead.

That's a defensible call (surface-over-mask — a transient I/O error genuinely isn't a WAL-capability signal, and masking it as 'WAL unsupported' hides a real FS problem), and you've backstopped it well: apply_wal_with_fallback now also propagates a failing DELETE fallback with an actionable log, and the gateway dispatcher's new _is_transient_board_disk_io_error path throttles the repeated tracebacks + resets on DB-change/success so a flaky board doesn't spam or crash the watcher. So the propagation has a catcher.

The only thing I'd want confirmed: for a user on a genuinely WAL-incapable mount that also surfaces as 'disk i/o error' (some FUSE/network setups conflate these), do they still end up in a working DELETE-mode session, or do they now hit the propagated raise on every init? If the latter, a brief note in the PR on the intended migration for those users would help. Logic itself looks sound.

@alt-glitch added the type/bug, P3, comp/plugins, comp/gateway, and comp/agent labels May 25, 2026

@nuch1011

Author

Kanban t_c0d6fa7b reopened this after the 2026-05-26 recurrence.

Update:

  • Existing PR reused and updated to current origin/main via branch fix-kanban-wal-fallback.
  • disk I/O error remains outside WAL-incompat fallback markers so transient SQLite I/O failures are not silently downgraded to DELETE fallback.
  • Dispatcher-side repeated transient disk-I/O tick failures are throttled per board DB fingerprint.
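
As a rough picture of the per-fingerprint throttling described in the last bullet, something along these lines would give the stated behavior. This is a sketch under assumptions: the class name `DiskIoThrottle`, the helper names, and the choice of `(st_mtime_ns, st_size)` as the fingerprint are hypothetical and not taken from the branch, which names only `_is_transient_board_disk_io_error`.

```python
import os
import sqlite3


def db_fingerprint(path):
    """Cheap identity for the board DB file; None if it cannot be stat'ed."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (st.st_mtime_ns, st.st_size)


def is_transient_disk_io_error(exc):
    """Assumed check: an OperationalError whose message mentions disk I/O."""
    return (
        isinstance(exc, sqlite3.OperationalError)
        and "disk i/o error" in str(exc).lower()
    )


class DiskIoThrottle:
    """Log a transient failure once per (board, DB fingerprint)."""

    def __init__(self):
        # board -> fingerprint recorded when the failure was last logged
        self._seen = {}

    def should_log(self, board, fingerprint):
        """True the first time, and again whenever the DB file changes."""
        if board in self._seen and self._seen[board] == fingerprint:
            return False
        self._seen[board] = fingerprint
        return True

    def reset(self, board):
        """Call after a successful dispatch tick for this board."""
        self._seen.pop(board, None)
```

A dispatcher tick would call `should_log(board, db_fingerprint(path))` when `is_transient_disk_io_error(exc)` holds, emit the traceback only on `True`, and call `reset(board)` once a tick succeeds.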

Verification:

  • python -m pytest tests/test_hermes_state_wal_fallback.py::TestApplyWalWithFallback tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_throttles_transient_disk_io_errors -o 'addopts=' -q -> 9 passed
  • python -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_disables_corrupt_board_without_traceback tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_throttles_transient_disk_io_errors -o 'addopts=' -q -> 18 passed
  • Added-line no-secrets diff scan -> clean
  • gh pr checks -> no checks reported on this branch

@nuch1011

Author

Closing per project governance: this work must not be merged into NousResearch/hermes-agent or any upstream/third-party branch from our automation. Keep the branch only in nuch1011/hermes-agent for local/fork reference unless Christian explicitly requests an upstream contribution.

@nuch1011 closed this May 26, 2026



3 participants