feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening#30973
Conversation
Review — Technical AssessmentOverall: This is a sound hotfix for a real production incident. The diagnosis matches the documented SQLite behavior: What looks right
Suggestions (non-blocking)
VerdictThis fix correctly addresses the symptom (B-tree corruption from torn WAL frames) with the right SQLite primitives. No blockers from my side, but would recommend the config-flag approach for upstream merge to avoid penalizing simpler deployments. PR review by Claude Code — technical correctness assessment, no prior reviews to conflict with. |
0b6ef71 to
7cd1f6e
Compare
…ULL + tight wal_autocheckpoint durability hardening Heartbeat-heavy concurrent-writer workloads on the kanban DB (a dispatcher process + worker subprocesses both writing event rows and status updates many times per minute) can hit a "database disk image is malformed" race where the WAL header ends up inconsistent with main DB pages. Live repro at issue NousResearch#30896. With the default `PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write leaves the database irreparable. `PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs the SQLite default of 1000) closes the window — fsync after every commit, and the WAL stays small so the recovery surface stays small. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially change end-to-end dispatch latency under the workloads we tested. Default behavior unchanged — opt-in via env var so simpler single-writer deployments do not pay the fsync cost. Set `HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value (including unset) preserves the prior `synchronous=NORMAL` behavior. This replaces the v1 of this PR which changed the default for all deployments. Per reviewer suggestion on the v1 thread, the configurability is the right shape for upstream — it lets the operators who hit the race opt in without forcing the fsync cost on everyone else. Upstream-PR: NousResearch#30973 Upstream-Issue: NousResearch#30896 Related-Issue: NousResearch#31618 Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR superseded by PR NousResearch#30969 (broader kanban hardening with leases + corruption detection) once that lands.
7cd1f6e to
4f0568c
Compare
|
Force-pushed v2 of this PR per @jsboige's review suggestion. What changed: What didn't change: the underlying PRAGMA logic + the workload characterization. The repro at #30896 still applies; the limitation at #31618 still applies. @alt-glitch you noted this is related to #30645 + #30969. v2 here is the smaller hotfix shape for the POSIX-fsync race; #30969 (broader leases + corruption detection) is what we'd need for the harder SIGKILL-mid-fsync class. Happy to either land both, or land this one as a stop-gap while #30969 is being reviewed. PR labeled P3 currently — would appreciate a re-evaluation if the opt-in shape makes this lower-risk to merge. The default-unchanged behavior means zero regression surface for anyone not opting in. |
|
Heads-up: #33243 (opened today by @teknium1) rediagnoses the corruption class as a false-positive from a probe-connection race rather than a real WAL/durability gap. It closes 7 related issues including our #31618 and validates against the same If #33243 lands, this PR's PRAGMA changes become unnecessary — the false-positive detector going away eliminates the symptom regardless of Leaving open in the meantime as a smaller stop-gap in case Nous prefers to ship the targeted fix first, or for deployments that want belt-and-suspenders on the WAL durability semantics independent of the false-positive bug. (The v2 opt-in shape via env var means there's still zero regression surface for the common case.) |
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)
|
Closing — upstream main now has the equivalent (and better-targeted) fixes:
Combined, these three commits supersede this PR entirely and the opt-in env-var shape from the v2 re-roll is no longer needed. Validated all three are in Thanks @jsboige for the v1 review (the config-flag suggestion was right for the wrong reason — the real fix lives further down the stack) and @alt-glitch for the initial triage that pointed at #30969 and the related cluster. |
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports) #AI commit#
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Summary
Adds an opt-in env-var (
HERMES_KANBAN_SYNCHRONOUS_MODE=full) that switches the kanban DB toPRAGMA synchronous=FULL+PRAGMA wal_autocheckpoint=100for deployments that hit the WAL/main-DB consistency race documented in #30896. Default behavior unchanged — when the env var is unset or set to anything other than"full", the priorsynchronous=NORMALbehavior is preserved.Why opt-in vs default-change (v2 of this PR)
The v1 of this PR changed the default for all deployments. @jsboige's review on the v1 thread suggested making it a config flag instead — that's exactly right for an upstream merge. Simpler single-writer deployments (the common case) shouldn't pay the fsync cost for a race they'll never hit. Operators who DO hit the race (heartbeat-heavy concurrent-writer workloads — multi-process dispatcher + worker subprocesses writing 50+ events in 5 minutes against the same DB) can opt in with one env var.
What the env var does
HERMES_KANBAN_SYNCHRONOUS_MODE=full→synchronous=FULL+wal_autocheckpoint=100(durable, slower)synchronous=NORMAL(current behavior, fast)Why these two PRAGMAs
synchronous=FULL: withNORMAL, COMMIT returns before WAL frames are durably on disk. Under a multi-process workload where dispatcher + workers both write rows, a SIGKILL or power-loss-like race mid-WAL-write can leave the WAL header inconsistent with main DB pages — producing "database disk image is malformed" on subsequent reads.FULLforces fsync() after every commit. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded.wal_autocheckpoint=100(down from default 1000): keeps the WAL small so the durability window stays short. Heartbeat-heavy review workloads can otherwise grow the WAL to many MB before checkpoint, widening the window where a worker crash can leave the WAL irreparable.Known limitation (filed at #31618)
Even with
synchronous=FULL, corruption can still recur under SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The fix-class in #30969 (leases + corruption detection) is the comprehensive answer for that failure mode. This PR closes the easier-to-hit POSIX-fsync race; the harder cgroup-SIGKILL race needs leases.Backwards compatibility
100%. Default behavior identical to current code. No schema changes, no migration, no startup cost change for the common case. Single new env-var check on
connect().Test plan
HERMES_KANBAN_SYNCHRONOUS_MODE=full(expect: no corruption); once without (expect: corruption reproduces, matching current behavior).Validation
Tested in production for 11 days with the v1 (default-change) shape. Confirmed the WAL-consistency race no longer fires under heartbeat-heavy load. Confirmed via #31618 that the harder SIGKILL-mid-fsync class still happens — but that's #30969's domain, not this PR's.
Related