fix: harden kanban sqlite corruption handling by ryandidurlabs · Pull Request #30969 · NousResearch/hermes-agent

ryandidurlabs · 2026-05-23T13:53:49Z

Summary

Adds DB-backed Kanban dispatcher leases to reduce multi-gateway write stampedes.
Adds early SQLite header/truncation/TLS-clobber detection and fail-closed corrupt DB backup handling.
Adds safer kanban reconcile-completions dry-run/apply safeguards, backup/integrity checks, active-claim refusal, and regression coverage.
Keeps the approved task-relevant patch isolated from unrelated dirty local files in the original checkout.

Approved patch scope

This PR is intentionally limited to the reviewed Hermes Kanban SQLite hardening files:

gateway/run.py
hermes_cli/config.py
hermes_cli/kanban.py
hermes_cli/kanban_db.py
tests/hermes_cli/test_kanban_cli.py
tests/hermes_cli/test_kanban_core_functionality.py
tests/hermes_cli/test_kanban_db.py

No unrelated dirty files from the source checkout are included in the branch diff.

Current verification

env -u HERMES_KANBAN_DB -u HERMES_KANBAN_BOARD -u HERMES_KANBAN_WORKSPACES_ROOT -u HERMES_KANBAN_TASK -u HERMES_KANBAN_WORKSPACE -u HERMES_KANBAN_RUN_ID -u HERMES_KANBAN_CLAIM_LOCK /home/jackdidur/.hermes/hermes-agent/venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_cli.py -q => 396 passed in 27.93s
/home/jackdidur/.hermes/hermes-agent/venv/bin/ruff check gateway/run.py hermes_cli/config.py hermes_cli/kanban.py hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_cli.py tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_db.py => All checks passed
Branch diff check: git diff --name-only origin/main...ryandidurlabs/cody/kanban-sqlite-hardening-t_a0e3aea4 returns only the seven approved files above.
PR status: mergeable as of release-gate prep; no GitHub status checks were reported on the PR.

Quinn review evidence from task `t_093e1e42`

Quinn reviewed Cody's remediated patch and gave PASS/no blocking findings:

Targeted Kanban pytest subset rerun: 394 passed in 32.85s.
Ruff rerun on task-relevant files: All checks passed.
Live itoperationshq dry-run output /tmp/quinn-itoperationshq-reconcile-dryrun-t_093e1e42.json: t_93b829b9 remained blocked/review_gate_blocked with proposed_action=skip.
Live SQLite read-only diagnostics: quick_check=ok; integrity_check=ok.

Notes / residual non-blocking suggestion

Positive explicit-override test/doc would improve operator clarity but is not a blocker; default/stale manifests are now refused and override metadata is persisted.
Exact byte-level corruption root cause remains unproven; this PR is hardening/containment plus safer recovery tooling.
New dispatcher defaults are conservative (dispatch_interval_seconds=120, max_spawn=3, failure_limit=4) to reduce write pressure.
No live reconciliation apply, production cutover, or gateway restart was performed by this PR prep.
Deploy recommendation after merge: update the Hermes Agent checkout used by the live gateway, then restart the Hermes gateway only at a safe operator window after confirming no active worker will be interrupted (hermes gateway restart or equivalent user-systemd restart for this install).

Add dispatcher lease coordination, corrupt DB detection, safer reconciliation, and Kanban regression coverage.\n\nTask: t_a0e3aea4

julio-cloudvisor · 2026-05-27T15:49:45Z

Production datapoint: we've been running #30973 (synchronous=FULL + wal_autocheckpoint=100, hardcoded variant) for ~11 days under heartbeat-heavy concurrent-writer load — a multi-process dispatcher pattern with reviewer subprocesses writing 50+ events in 5 minutes against the same kanban DB.

Filed #31618 after observing that the synchronous=FULL approach is insufficient under our actual failure mode: SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The corruption recurs even with FULL because the WAL+main-DB consistency invariant requires both files to be intact, and SIGKILL can land between the two writes that fsync can't atomically coordinate.

The leases + corruption-detection approach in this PR is what would actually close the race for our workload — leases give us a way to detect and reject mid-write SIGKILL victims before they corrupt subsequent reads, and the auto-detection-and-recovery path means we don't lose the whole board on the inevitable rare event.

Happy to run this PR against production for a week and report back on the heartbeat-heavy workload once it's ready for outside test signal. Standing by.

…ULL + tight wal_autocheckpoint durability hardening Heartbeat-heavy concurrent-writer workloads on the kanban DB (a dispatcher process + worker subprocesses both writing event rows and status updates many times per minute) can hit a "database disk image is malformed" race where the WAL header ends up inconsistent with main DB pages. Live repro at issue NousResearch#30896. With the default `PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write leaves the database irreparable. `PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs the SQLite default of 1000) closes the window — fsync after every commit, and the WAL stays small so the recovery surface stays small. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially change end-to-end dispatch latency under the workloads we tested. Default behavior unchanged — opt-in via env var so simpler single-writer deployments do not pay the fsync cost. Set `HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value (including unset) preserves the prior `synchronous=NORMAL` behavior. This replaces the v1 of this PR which changed the default for all deployments. Per reviewer suggestion on the v1 thread, the configurability is the right shape for upstream — it lets the operators who hit the race opt in without forcing the fsync cost on everyone else. Upstream-PR: NousResearch#30973 Upstream-Issue: NousResearch#30896 Related-Issue: NousResearch#31618 Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR superseded by PR NousResearch#30969 (broader kanban hardening with leases + corruption detection) once that lands.

fix: harden kanban sqlite corruption handling

87520c6

Add dispatcher lease coordination, corrupt DB detection, safer reconciliation, and Kanban regression coverage.\n\nTask: t_a0e3aea4

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery labels May 23, 2026

alt-glitch mentioned this pull request May 23, 2026

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening #30973

Closed

steveonjava mentioned this pull request May 24, 2026

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) #31208

Closed

14 tasks

This was referenced May 24, 2026

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31726

Closed

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731

Closed

crayfish-ai mentioned this pull request May 25, 2026

feat(kanban): auto-subscribe, review-required promotion, unblock dispatch, rerun, parent workspace #31883

Closed

This was referenced May 26, 2026

fix(kanban): close leaked decompose connections + don't mask I/O error on rollback #32415

Closed

fix: stabilize gateway recovery and kanban backups #32493

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden kanban sqlite corruption handling#30969

fix: harden kanban sqlite corruption handling#30969
ryandidurlabs wants to merge 1 commit into
NousResearch:mainfrom
ryandidurlabs:cody/kanban-sqlite-hardening-t_a0e3aea4

ryandidurlabs commented May 23, 2026 •

edited

Loading

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ryandidurlabs commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approved patch scope

Current verification

Quinn review evidence from task t_093e1e42

Notes / residual non-blocking suggestion

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ryandidurlabs commented May 23, 2026 •

edited

Loading

Quinn review evidence from task `t_093e1e42`