fix: harden kanban sqlite corruption handling#30969
Conversation
Add dispatcher lease coordination, corrupt DB detection, safer reconciliation, and Kanban regression coverage.\n\nTask: t_a0e3aea4
|
Production datapoint: we've been running #30973 (synchronous=FULL + wal_autocheckpoint=100, hardcoded variant) for ~11 days under heartbeat-heavy concurrent-writer load — a multi-process dispatcher pattern with reviewer subprocesses writing 50+ events in 5 minutes against the same kanban DB. Filed #31618 after observing that the synchronous=FULL approach is insufficient under our actual failure mode: SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The corruption recurs even with FULL because the WAL+main-DB consistency invariant requires both files to be intact, and SIGKILL can land between the two writes that fsync can't atomically coordinate. The leases + corruption-detection approach in this PR is what would actually close the race for our workload — leases give us a way to detect and reject mid-write SIGKILL victims before they corrupt subsequent reads, and the auto-detection-and-recovery path means we don't lose the whole board on the inevitable rare event. Happy to run this PR against production for a week and report back on the heartbeat-heavy workload once it's ready for outside test signal. Standing by. |
…ULL + tight wal_autocheckpoint durability hardening Heartbeat-heavy concurrent-writer workloads on the kanban DB (a dispatcher process + worker subprocesses both writing event rows and status updates many times per minute) can hit a "database disk image is malformed" race where the WAL header ends up inconsistent with main DB pages. Live repro at issue NousResearch#30896. With the default `PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write leaves the database irreparable. `PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs the SQLite default of 1000) closes the window — fsync after every commit, and the WAL stays small so the recovery surface stays small. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially change end-to-end dispatch latency under the workloads we tested. Default behavior unchanged — opt-in via env var so simpler single-writer deployments do not pay the fsync cost. Set `HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value (including unset) preserves the prior `synchronous=NORMAL` behavior. This replaces the v1 of this PR which changed the default for all deployments. Per reviewer suggestion on the v1 thread, the configurability is the right shape for upstream — it lets the operators who hit the race opt in without forcing the fsync cost on everyone else. Upstream-PR: NousResearch#30973 Upstream-Issue: NousResearch#30896 Related-Issue: NousResearch#31618 Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR superseded by PR NousResearch#30969 (broader kanban hardening with leases + corruption detection) once that lands.
Summary
kanban reconcile-completionsdry-run/apply safeguards, backup/integrity checks, active-claim refusal, and regression coverage.Approved patch scope
This PR is intentionally limited to the reviewed Hermes Kanban SQLite hardening files:
gateway/run.pyhermes_cli/config.pyhermes_cli/kanban.pyhermes_cli/kanban_db.pytests/hermes_cli/test_kanban_cli.pytests/hermes_cli/test_kanban_core_functionality.pytests/hermes_cli/test_kanban_db.pyNo unrelated dirty files from the source checkout are included in the branch diff.
Current verification
env -u HERMES_KANBAN_DB -u HERMES_KANBAN_BOARD -u HERMES_KANBAN_WORKSPACES_ROOT -u HERMES_KANBAN_TASK -u HERMES_KANBAN_WORKSPACE -u HERMES_KANBAN_RUN_ID -u HERMES_KANBAN_CLAIM_LOCK /home/jackdidur/.hermes/hermes-agent/venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_cli.py -q=> 396 passed in 27.93s/home/jackdidur/.hermes/hermes-agent/venv/bin/ruff check gateway/run.py hermes_cli/config.py hermes_cli/kanban.py hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_cli.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_db.py=> All checks passedgit diff --name-only origin/main...ryandidurlabs/cody/kanban-sqlite-hardening-t_a0e3aea4returns only the seven approved files above.Quinn review evidence from task
t_093e1e42Quinn reviewed Cody's remediated patch and gave PASS/no blocking findings:
/tmp/quinn-itoperationshq-reconcile-dryrun-t_093e1e42.json:t_93b829b9remained blocked/review_gate_blocked withproposed_action=skip.quick_check=ok;integrity_check=ok.Notes / residual non-blocking suggestion
dispatch_interval_seconds=120,max_spawn=3,failure_limit=4) to reduce write pressure.hermes gateway restartor equivalent user-systemd restart for this install).