fix(kanban): fail closed and serialize sqlite writes #31740

Open
usmch1134-droid wants to merge 2 commits into NousResearch:main from usmch1134-droid:fix/kanban-sqlite-hardening

Summary

  • Fail closed on a generic SQLite "disk I/O error" instead of treating it as a WAL-incompatibility fallback.
  • Serialize Kanban writes with a per-database interprocess file lock, acquired before BEGIN IMMEDIATE.
  • Disable Kanban boards by DB fingerprint after fatal storage errors, so the gateway stops retrying known-bad DBs.
  • Add regression coverage for fatal error classification, write lock ordering, and multiprocess Kanban integrity.
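
The lock ordering in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `locked_write` and the `.write.lock` suffix are hypothetical, and it assumes a POSIX host (such as WSL) where `fcntl.flock` is available.

```python
import contextlib
import fcntl
import os
import sqlite3


@contextlib.contextmanager
def locked_write(db_path):
    """Hold an exclusive file lock, then run one BEGIN IMMEDIATE transaction.

    Hypothetical helper: the lock file name is an assumption for illustration.
    """
    lock_fd = os.open(db_path + ".write.lock", os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Blocks until no other process holds the lock for this database.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        # isolation_level=None puts sqlite3 in autocommit mode, so the
        # explicit BEGIN/COMMIT/ROLLBACK statements below are honored as-is.
        conn = sqlite3.connect(db_path, isolation_level=None)
        try:
            conn.execute("BEGIN IMMEDIATE")
            try:
                yield conn
            except BaseException:
                conn.execute("ROLLBACK")
                raise
            else:
                conn.execute("COMMIT")
        finally:
            conn.close()
    finally:
        # Closing the descriptor releases the flock.
        os.close(lock_fd)
```

Taking the file lock first means that only one process at a time reaches BEGIN IMMEDIATE, so writers queue on the lock file rather than contending inside SQLite.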

Why

On WSL, under concurrent Kanban dispatcher/worker load, a board hit SQLite B-tree corruption after a generic "disk I/O error" was handled as if it were a safe WAL fallback, which let workers keep writing against degraded storage. This change makes I/O errors (IOERR), malformed-database errors, and not-a-database errors fail closed, and adds app-level write serialization around board mutations.
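
As a rough sketch of the fail-closed behavior described above: the names below (`is_fatal_storage_error`, `db_fingerprint`, `DisabledBoards`, `guarded`) are hypothetical and not taken from this PR, matching on SQLite's standard error message text is an assumed classification strategy, and hashing the resolved path is only one possible fingerprint.

```python
import hashlib
import os
import sqlite3

# Standard SQLite message texts for IOERR, CORRUPT, and NOTADB results.
_FATAL_MESSAGES = (
    "disk i/o error",
    "database disk image is malformed",
    "file is not a database",
)


def is_fatal_storage_error(exc):
    """Return True for errors that should disable the board, not be retried."""
    if not isinstance(exc, sqlite3.DatabaseError):
        return False
    text = str(exc).lower()
    return any(marker in text for marker in _FATAL_MESSAGES)


def db_fingerprint(db_path):
    """Assumed fingerprint: a hash of the resolved database path."""
    return hashlib.sha256(os.path.realpath(db_path).encode("utf-8")).hexdigest()


class DisabledBoards:
    """In-memory record of boards that hit a fatal storage error."""

    def __init__(self):
        self._fingerprints = set()

    def disable(self, db_path):
        self._fingerprints.add(db_fingerprint(db_path))

    def is_disabled(self, db_path):
        return db_fingerprint(db_path) in self._fingerprints


def guarded(registry, db_path, operation):
    """Run operation() unless the board is disabled; disable it on fatal errors."""
    if registry.is_disabled(db_path):
        raise RuntimeError("board disabled after fatal storage error: " + db_path)
    try:
        return operation()
    except sqlite3.DatabaseError as exc:
        if is_fatal_storage_error(exc):
            registry.disable(db_path)
        raise
```

A transient error such as "database is locked" passes through without disabling the board, while a fatal one marks it so later calls are refused instead of retried.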

Test plan

  • PYTHONPATH=. /home/usmc1/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_db.py tests/hermes_cli/test_kanban_multiprocess_integrity.py tests/gateway/test_kanban_sqlite_fatal_errors.py -o addopts= -q
    • Result after rebase onto upstream main: 192 passed, 1 warning in 12.78s
  • Earlier heavy temp-DB stress run on Beast WSL: 8 processes x 300 iterations; PRAGMA integrity_check => ok; row counts: tasks=480, comments=960, events=1846, runs=406

Operational notes

  • Does not attempt to repair existing corrupted boards.
  • Operators should initialize a fresh board after deploying this patch.
  • For Beast WSL recovery, resume dispatch conservatively (--max 1) and run integrity checks before/after batches until confidence is rebuilt.
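
The before/after integrity checks mentioned in the last note could look roughly like this. It is a sketch only: `integrity_ok` and `run_batch_checked` are hypothetical names, and the batch itself is whatever callable the operator supplies.

```python
import sqlite3


def integrity_ok(db_path):
    """Return True when PRAGMA integrity_check reports a single 'ok' row."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]


def run_batch_checked(db_path, batch):
    """Check integrity, run one batch, then check integrity again."""
    if not integrity_ok(db_path):
        raise RuntimeError("integrity check failed before batch: " + db_path)
    result = batch()
    if not integrity_ok(db_path):
        raise RuntimeError("integrity check failed after batch: " + db_path)
    return result
```

Failing the run on either check keeps a damaged board from receiving further batches.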

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working
