Skip to content

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening#30973

Closed
julio-cloudvisor wants to merge 1 commit into
NousResearch:mainfrom
devisory-engineering:cv/wal-corruption-resilience
Closed

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening#30973
julio-cloudvisor wants to merge 1 commit into
NousResearch:mainfrom
devisory-engineering:cv/wal-corruption-resilience

Conversation

@julio-cloudvisor

@julio-cloudvisor julio-cloudvisor commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds an opt-in env-var (HERMES_KANBAN_SYNCHRONOUS_MODE=full) that switches the kanban DB to PRAGMA synchronous=FULL + PRAGMA wal_autocheckpoint=100 for deployments that hit the WAL/main-DB consistency race documented in #30896. Default behavior unchanged — when the env var is unset or set to anything other than "full", the prior synchronous=NORMAL behavior is preserved.

Why opt-in vs default-change (v2 of this PR)

The v1 of this PR changed the default for all deployments. @jsboige's review on the v1 thread suggested making it a config flag instead — that's exactly right for an upstream merge. Simpler single-writer deployments (the common case) shouldn't pay the fsync cost for a race they'll never hit. Operators who DO hit the race (heartbeat-heavy concurrent-writer workloads — multi-process dispatcher + worker subprocesses writing 50+ events in 5 minutes against the same DB) can opt in with one env var.

What the env var does

sync_mode = os.environ.get("HERMES_KANBAN_SYNCHRONOUS_MODE", "normal").strip().lower()
if sync_mode == "full":
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute("PRAGMA wal_autocheckpoint=100")
else:
    conn.execute("PRAGMA synchronous=NORMAL")
  • HERMES_KANBAN_SYNCHRONOUS_MODE=fullsynchronous=FULL + wal_autocheckpoint=100 (durable, slower)
  • anything else or unset → synchronous=NORMAL (current behavior, fast)

Why these two PRAGMAs

synchronous=FULL: with NORMAL, COMMIT returns before WAL frames are durably on disk. Under a multi-process workload where dispatcher + workers both write rows, a SIGKILL or power-loss-like race mid-WAL-write can leave the WAL header inconsistent with main DB pages — producing "database disk image is malformed" on subsequent reads. FULL forces fsync() after every commit. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded.

wal_autocheckpoint=100 (down from default 1000): keeps the WAL small so the durability window stays short. Heartbeat-heavy review workloads can otherwise grow the WAL to many MB before checkpoint, widening the window where a worker crash can leave the WAL irreparable.

Known limitation (filed at #31618)

Even with synchronous=FULL, corruption can still recur under SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The fix-class in #30969 (leases + corruption detection) is the comprehensive answer for that failure mode. This PR closes the easier-to-hit POSIX-fsync race; the harder cgroup-SIGKILL race needs leases.

Backwards compatibility

100%. Default behavior identical to current code. No schema changes, no migration, no startup cost change for the common case. Single new env-var check on connect().

Test plan

Validation

Tested in production for 11 days with the v1 (default-change) shape. Confirmed the WAL-consistency race no longer fires under heartbeat-heavy load. Confirmed via #31618 that the harder SIGKILL-mid-fsync class still happens — but that's #30969's domain, not this PR's.

Related

@jsboige

jsboige commented May 23, 2026

Copy link
Copy Markdown

Review — Technical Assessment

Overall: This is a sound hotfix for a real production incident. The diagnosis matches the documented SQLite behavior: synchronous=NORMAL in WAL mode defers fsync to checkpoint boundaries, which under concurrent multi-process writes (dispatcher + worker heartbeat subprocess) creates a window where a crash or race leaves WAL frames inconsistent with main DB pages.

What looks right

  • synchronous=FULL is the correct choice here. The workload is dispatch-bounded (2-3 writes/tick), so the fsync-per-commit cost (<10ms/tick per the body) is negligible. This is exactly what FULL is designed for: trading write throughput for durability guarantees under concurrent access.
  • wal_autocheckpoint=100 (down from default 1000) keeps the WAL small (~400KB vs ~4MB), shrinking the window where torn pages can accumulate. Reasonable for heartbeat-heavy workloads.
  • No migration concerns: Both are runtime PRAGMAs applied at connect() — no schema change, no data rewrite, backward compatible with existing DBs.

Suggestions (non-blocking)

  1. Consider a config flag. The body mentions this is a Cloudvisor-specific workload pattern. Upstream may prefer synchronous=FULL behind an opt-in setting (env var or kanban config) rather than changing the default for all users, especially those with simpler single-writer setups where NORMAL is sufficient and faster.
  2. Test debt: The absence of a test is understandable for a timing-sensitive multi-process race, but a follow-up issue with a deterministic repro (e.g., multiprocessing.Process + synthetic heartbeat burst) would prevent regression and validate the fix independently of production.
  3. Comment block is verbose: 10 lines of comments for a 2-line functional change. Consider trimming to 3-4 lines referencing the issue number.

Verdict

This fix correctly addresses the symptom (B-tree corruption from torn WAL frames) with the right SQLite primitives. No blockers from my side, but would recommend the config-flag approach for upstream merge to avoid penalizing simpler deployments.


PR review by Claude Code — technical correctness assessment, no prior reviews to conflict with.

@jcprz jcprz force-pushed the cv/wal-corruption-resilience branch 2 times, most recently from 0b6ef71 to 7cd1f6e Compare May 23, 2026 14:11
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Superset of #30645 — both change PRAGMA synchronous=NORMAL to FULL in kanban_db.py. This PR also adds wal_autocheckpoint=100 for tighter checkpoint frequency. Related to #30896 (B-tree corruption repro), #30969 (broader kanban hardening with leases + corruption detection).

…ULL + tight wal_autocheckpoint durability hardening

Heartbeat-heavy concurrent-writer workloads on the kanban DB (a
dispatcher process + worker subprocesses both writing event rows and
status updates many times per minute) can hit a "database disk image is
malformed" race where the WAL header ends up inconsistent with main DB
pages. Live repro at issue NousResearch#30896. With the default
`PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are
durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write
leaves the database irreparable.

`PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs
the SQLite default of 1000) closes the window — fsync after every
commit, and the WAL stays small so the recovery surface stays small.
The perf cost is bounded because the kanban workload is
dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially
change end-to-end dispatch latency under the workloads we tested.

Default behavior unchanged — opt-in via env var so simpler
single-writer deployments do not pay the fsync cost. Set
`HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value
(including unset) preserves the prior `synchronous=NORMAL` behavior.

This replaces the v1 of this PR which changed the default for all
deployments. Per reviewer suggestion on the v1 thread, the
configurability is the right shape for upstream — it lets the operators
who hit the race opt in without forcing the fsync cost on everyone else.

Upstream-PR: NousResearch#30973
Upstream-Issue: NousResearch#30896
Related-Issue: NousResearch#31618
Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR
 superseded by PR NousResearch#30969 (broader kanban hardening with leases +
 corruption detection) once that lands.
@jcprz jcprz force-pushed the cv/wal-corruption-resilience branch from 7cd1f6e to 4f0568c Compare May 27, 2026 16:00
@julio-cloudvisor julio-cloudvisor changed the title fix(kanban): synchronous=FULL + wal_autocheckpoint=100 to prevent B-tree corruption under concurrent writer races feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening May 27, 2026
@julio-cloudvisor

Copy link
Copy Markdown
Contributor Author

Force-pushed v2 of this PR per @jsboige's review suggestion.

What changed: synchronous=FULL + wal_autocheckpoint=100 are now opt-in via HERMES_KANBAN_SYNCHRONOUS_MODE=full env var instead of changing the default for all deployments. Default behavior is identical to current upstream (synchronous=NORMAL).

What didn't change: the underlying PRAGMA logic + the workload characterization. The repro at #30896 still applies; the limitation at #31618 still applies.

@alt-glitch you noted this is related to #30645 + #30969. v2 here is the smaller hotfix shape for the POSIX-fsync race; #30969 (broader leases + corruption detection) is what we'd need for the harder SIGKILL-mid-fsync class. Happy to either land both, or land this one as a stop-gap while #30969 is being reviewed.

PR labeled P3 currently — would appreciate a re-evaluation if the opt-in shape makes this lower-risk to merge. The default-unchanged behavior means zero regression surface for anyone not opting in.

@julio-cloudvisor

Copy link
Copy Markdown
Contributor Author

Heads-up: #33243 (opened today by @teknium1) rediagnoses the corruption class as a false-positive from a probe-connection race rather than a real WAL/durability gap. It closes 7 related issues including our #31618 and validates against the same .corrupt.*.bak symptom we hit.

If #33243 lands, this PR's PRAGMA changes become unnecessary — the false-positive detector going away eliminates the symptom regardless of synchronous mode. We'll close this PR out once #33243 merges.

Leaving open in the meantime as a smaller stop-gap in case Nous prefers to ship the targeted fix first, or for deployments that want belt-and-suspenders on the WAL durability semantics independent of the false-positive bug. (The v2 opt-in shape via env var means there's still zero regression surface for the common case.)

kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
@julio-cloudvisor

Copy link
Copy Markdown
Contributor Author

Closing — upstream main now has the equivalent (and better-targeted) fixes:

  • 6416dd51 fix(kanban): harden SQLite against torn-write corruption (synchronous=FULL + secure_delete + cell_size_check) — does what this PR was trying to do, as the default, plus the additional integrity PRAGMAs.
  • 99c19eb2 fix(kanban): add post-commit page_count invariant check + wal_autocheckpoint=100 — adds the tighter WAL checkpoint cadence this PR also wanted.
  • dc98314f fix(kanban): skip redundant WAL pragma on already-WAL connections — the actual root cause for our kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618 observation. The pragma re-application was the false-positive trigger; our PR mischaracterized this as a durability race.

Combined, these three commits supersede this PR entirely and the opt-in env-var shape from the v2 re-roll is no longer needed. Validated all three are in origin/main and pulled cleanly into our production runtime.

Thanks @jsboige for the v1 review (the config-flag suggestion was right for the wrong reason — the real fix lives further down the stack) and @alt-glitch for the initial triage that pointed at #30969 and the related cluster.

mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants