Skip to content

fix: harden kanban sqlite corruption handling#30969

Open
ryandidurlabs wants to merge 1 commit into
NousResearch:mainfrom
ryandidurlabs:cody/kanban-sqlite-hardening-t_a0e3aea4
Open

fix: harden kanban sqlite corruption handling#30969
ryandidurlabs wants to merge 1 commit into
NousResearch:mainfrom
ryandidurlabs:cody/kanban-sqlite-hardening-t_a0e3aea4

Conversation

@ryandidurlabs

@ryandidurlabs ryandidurlabs commented May 23, 2026

Copy link
Copy Markdown

Summary

  • Adds DB-backed Kanban dispatcher leases to reduce multi-gateway write stampedes.
  • Adds early SQLite header/truncation/TLS-clobber detection and fail-closed corrupt DB backup handling.
  • Adds safer kanban reconcile-completions dry-run/apply safeguards, backup/integrity checks, active-claim refusal, and regression coverage.
  • Keeps the approved task-relevant patch isolated from unrelated dirty local files in the original checkout.

Approved patch scope

This PR is intentionally limited to the reviewed Hermes Kanban SQLite hardening files:

  • gateway/run.py
  • hermes_cli/config.py
  • hermes_cli/kanban.py
  • hermes_cli/kanban_db.py
  • tests/hermes_cli/test_kanban_cli.py
  • tests/hermes_cli/test_kanban_core_functionality.py
  • tests/hermes_cli/test_kanban_db.py

No unrelated dirty files from the source checkout are included in the branch diff.

Current verification

  • env -u HERMES_KANBAN_DB -u HERMES_KANBAN_BOARD -u HERMES_KANBAN_WORKSPACES_ROOT -u HERMES_KANBAN_TASK -u HERMES_KANBAN_WORKSPACE -u HERMES_KANBAN_RUN_ID -u HERMES_KANBAN_CLAIM_LOCK /home/jackdidur/.hermes/hermes-agent/venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_cli.py -q => 396 passed in 27.93s
  • /home/jackdidur/.hermes/hermes-agent/venv/bin/ruff check gateway/run.py hermes_cli/config.py hermes_cli/kanban.py hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_cli.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_db.py => All checks passed
  • Branch diff check: git diff --name-only origin/main...ryandidurlabs/cody/kanban-sqlite-hardening-t_a0e3aea4 returns only the seven approved files above.
  • PR status: mergeable as of release-gate prep; no GitHub status checks were reported on the PR.

Quinn review evidence from task t_093e1e42

Quinn reviewed Cody's remediated patch and gave PASS/no blocking findings:

  • Targeted Kanban pytest subset rerun: 394 passed in 32.85s.
  • Ruff rerun on task-relevant files: All checks passed.
  • Live itoperationshq dry-run output /tmp/quinn-itoperationshq-reconcile-dryrun-t_093e1e42.json: t_93b829b9 remained blocked/review_gate_blocked with proposed_action=skip.
  • Live SQLite read-only diagnostics: quick_check=ok; integrity_check=ok.

Notes / residual non-blocking suggestion

  • Positive explicit-override test/doc would improve operator clarity but is not a blocker; default/stale manifests are now refused and override metadata is persisted.
  • Exact byte-level corruption root cause remains unproven; this PR is hardening/containment plus safer recovery tooling.
  • New dispatcher defaults are conservative (dispatch_interval_seconds=120, max_spawn=3, failure_limit=4) to reduce write pressure.
  • No live reconciliation apply, production cutover, or gateway restart was performed by this PR prep.
  • Deploy recommendation after merge: update the Hermes Agent checkout used by the live gateway, then restart the Hermes gateway only at a safe operator window after confirming no active worker will be interrupted (hermes gateway restart or equivalent user-systemd restart for this install).

Add dispatcher lease coordination, corrupt DB detection, safer reconciliation, and Kanban regression coverage.\n\nTask: t_a0e3aea4
@julio-cloudvisor

Copy link
Copy Markdown
Contributor

Production datapoint: we've been running #30973 (synchronous=FULL + wal_autocheckpoint=100, hardcoded variant) for ~11 days under heartbeat-heavy concurrent-writer load — a multi-process dispatcher pattern with reviewer subprocesses writing 50+ events in 5 minutes against the same kanban DB.

Filed #31618 after observing that the synchronous=FULL approach is insufficient under our actual failure mode: SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The corruption recurs even with FULL because the WAL+main-DB consistency invariant requires both files to be intact, and SIGKILL can land between the two writes that fsync can't atomically coordinate.

The leases + corruption-detection approach in this PR is what would actually close the race for our workload — leases give us a way to detect and reject mid-write SIGKILL victims before they corrupt subsequent reads, and the auto-detection-and-recovery path means we don't lose the whole board on the inevitable rare event.

Happy to run this PR against production for a week and report back on the heartbeat-heavy workload once it's ready for outside test signal. Standing by.

jcprz added a commit to devisory-engineering/hermes-agent that referenced this pull request May 27, 2026
…ULL + tight wal_autocheckpoint durability hardening

Heartbeat-heavy concurrent-writer workloads on the kanban DB (a
dispatcher process + worker subprocesses both writing event rows and
status updates many times per minute) can hit a "database disk image is
malformed" race where the WAL header ends up inconsistent with main DB
pages. Live repro at issue NousResearch#30896. With the default
`PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are
durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write
leaves the database irreparable.

`PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs
the SQLite default of 1000) closes the window — fsync after every
commit, and the WAL stays small so the recovery surface stays small.
The perf cost is bounded because the kanban workload is
dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially
change end-to-end dispatch latency under the workloads we tested.

Default behavior unchanged — opt-in via env var so simpler
single-writer deployments do not pay the fsync cost. Set
`HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value
(including unset) preserves the prior `synchronous=NORMAL` behavior.

This replaces the v1 of this PR which changed the default for all
deployments. Per reviewer suggestion on the v1 thread, the
configurability is the right shape for upstream — it lets the operators
who hit the race opt in without forcing the fsync cost on everyone else.

Upstream-PR: NousResearch#30973
Upstream-Issue: NousResearch#30896
Related-Issue: NousResearch#31618
Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR
 superseded by PR NousResearch#30969 (broader kanban hardening with leases +
 corruption detection) once that lands.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants