feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening by julio-cloudvisor · Pull Request #30973 · NousResearch/hermes-agent

julio-cloudvisor · 2026-05-23T14:02:57Z

Summary

Adds an opt-in env-var (HERMES_KANBAN_SYNCHRONOUS_MODE=full) that switches the kanban DB to PRAGMA synchronous=FULL + PRAGMA wal_autocheckpoint=100 for deployments that hit the WAL/main-DB consistency race documented in #30896. Default behavior unchanged — when the env var is unset or set to anything other than "full", the prior synchronous=NORMAL behavior is preserved.

Why opt-in vs default-change (v2 of this PR)

The v1 of this PR changed the default for all deployments. @jsboige's review on the v1 thread suggested making it a config flag instead — that's exactly right for an upstream merge. Simpler single-writer deployments (the common case) shouldn't pay the fsync cost for a race they'll never hit. Operators who DO hit the race (heartbeat-heavy concurrent-writer workloads — multi-process dispatcher + worker subprocesses writing 50+ events in 5 minutes against the same DB) can opt in with one env var.

What the env var does

sync_mode = os.environ.get("HERMES_KANBAN_SYNCHRONOUS_MODE", "normal").strip().lower()
if sync_mode == "full":
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute("PRAGMA wal_autocheckpoint=100")
else:
    conn.execute("PRAGMA synchronous=NORMAL")

HERMES_KANBAN_SYNCHRONOUS_MODE=full → synchronous=FULL + wal_autocheckpoint=100 (durable, slower)
anything else or unset → synchronous=NORMAL (current behavior, fast)

Why these two PRAGMAs

synchronous=FULL: with NORMAL, COMMIT returns before WAL frames are durably on disk. Under a multi-process workload where dispatcher + workers both write rows, a SIGKILL or power-loss-like race mid-WAL-write can leave the WAL header inconsistent with main DB pages — producing "database disk image is malformed" on subsequent reads. FULL forces fsync() after every commit. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded.

wal_autocheckpoint=100 (down from default 1000): keeps the WAL small so the durability window stays short. Heartbeat-heavy review workloads can otherwise grow the WAL to many MB before checkpoint, widening the window where a worker crash can leave the WAL irreparable.

Known limitation (filed at #31618)

Even with synchronous=FULL, corruption can still recur under SIGKILL-on-reclaim mid-WAL-write (cgroup memory pressure killing a worker between fsync calls). The fix-class in #30969 (leases + corruption detection) is the comprehensive answer for that failure mode. This PR closes the easier-to-hit POSIX-fsync race; the harder cgroup-SIGKILL race needs leases.

Backwards compatibility

100%. Default behavior identical to current code. No schema changes, no migration, no startup cost change for the common case. Single new env-var check on connect().

Test plan

Unit test the env-var parsing: unset → NORMAL, "full" → FULL+autocheckpoint, "normal" → NORMAL, "FULL" (uppercase) → FULL (case-insensitive), unknown value → NORMAL (safe default).
Integration: run the repro from [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896 once with HERMES_KANBAN_SYNCHRONOUS_MODE=full (expect: no corruption); once without (expect: corruption reproduces, matching current behavior).

Validation

Tested in production for 11 days with the v1 (default-change) shape. Confirmed the WAL-consistency race no longer fires under heartbeat-heavy load. Confirmed via #31618 that the harder SIGKILL-mid-fsync class still happens — but that's #30969's domain, not this PR's.

Repro / root cause: [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896
Insufficient-fix datapoint: kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618
Comprehensive alternative: fix: harden kanban sqlite corruption handling #30969
Superseded v1 reviewer note: @jsboige's comment recommending the config-flag shape adopted here.

jsboige · 2026-05-23T14:05:10Z

Review — Technical Assessment

Overall: This is a sound hotfix for a real production incident. The diagnosis matches the documented SQLite behavior: synchronous=NORMAL in WAL mode defers fsync to checkpoint boundaries, which under concurrent multi-process writes (dispatcher + worker heartbeat subprocess) creates a window where a crash or race leaves WAL frames inconsistent with main DB pages.

What looks right

synchronous=FULL is the correct choice here. The workload is dispatch-bounded (2-3 writes/tick), so the fsync-per-commit cost (<10ms/tick per the body) is negligible. This is exactly what FULL is designed for: trading write throughput for durability guarantees under concurrent access.
wal_autocheckpoint=100 (down from default 1000) keeps the WAL small (~400KB vs ~4MB), shrinking the window where torn pages can accumulate. Reasonable for heartbeat-heavy workloads.
No migration concerns: Both are runtime PRAGMAs applied at connect() — no schema change, no data rewrite, backward compatible with existing DBs.

Suggestions (non-blocking)

Consider a config flag. The body mentions this is a Cloudvisor-specific workload pattern. Upstream may prefer synchronous=FULL behind an opt-in setting (env var or kanban config) rather than changing the default for all users, especially those with simpler single-writer setups where NORMAL is sufficient and faster.
Test debt: The absence of a test is understandable for a timing-sensitive multi-process race, but a follow-up issue with a deterministic repro (e.g., multiprocessing.Process + synthetic heartbeat burst) would prevent regression and validate the fix independently of production.
Comment block is verbose: 10 lines of comments for a 2-line functional change. Consider trimming to 3-4 lines referencing the issue number.

Verdict

This fix correctly addresses the symptom (B-tree corruption from torn WAL frames) with the right SQLite primitives. No blockers from my side, but would recommend the config-flag approach for upstream merge to avoid penalizing simpler deployments.

PR review by Claude Code — technical correctness assessment, no prior reviews to conflict with.

alt-glitch · 2026-05-23T14:15:30Z

Superset of #30645 — both change PRAGMA synchronous=NORMAL to FULL in kanban_db.py. This PR also adds wal_autocheckpoint=100 for tighter checkpoint frequency. Related to #30896 (B-tree corruption repro), #30969 (broader kanban hardening with leases + corruption detection).

…ULL + tight wal_autocheckpoint durability hardening Heartbeat-heavy concurrent-writer workloads on the kanban DB (a dispatcher process + worker subprocesses both writing event rows and status updates many times per minute) can hit a "database disk image is malformed" race where the WAL header ends up inconsistent with main DB pages. Live repro at issue NousResearch#30896. With the default `PRAGMA synchronous=NORMAL`, COMMIT returns before the WAL frames are durably on disk, so a SIGKILL or power-loss-like race mid-WAL-write leaves the database irreparable. `PRAGMA synchronous=FULL` plus a tighter `wal_autocheckpoint=100` (vs the SQLite default of 1000) closes the window — fsync after every commit, and the WAL stays small so the recovery surface stays small. The perf cost is bounded because the kanban workload is dispatch-bounded, not write-bounded; ms-scale fsyncs do not materially change end-to-end dispatch latency under the workloads we tested. Default behavior unchanged — opt-in via env var so simpler single-writer deployments do not pay the fsync cost. Set `HERMES_KANBAN_SYNCHRONOUS_MODE=full` to enable; any other value (including unset) preserves the prior `synchronous=NORMAL` behavior. This replaces the v1 of this PR which changed the default for all deployments. Per reviewer suggestion on the v1 thread, the configurability is the right shape for upstream — it lets the operators who hit the race opt in without forcing the fsync cost on everyone else. Upstream-PR: NousResearch#30973 Upstream-Issue: NousResearch#30896 Related-Issue: NousResearch#31618 Remove-when: PR NousResearch#30973 merges into NousResearch/hermes-agent main, OR superseded by PR NousResearch#30969 (broader kanban hardening with leases + corruption detection) once that lands.

julio-cloudvisor · 2026-05-27T16:01:20Z

Force-pushed v2 of this PR per @jsboige's review suggestion.

What changed: synchronous=FULL + wal_autocheckpoint=100 are now opt-in via HERMES_KANBAN_SYNCHRONOUS_MODE=full env var instead of changing the default for all deployments. Default behavior is identical to current upstream (synchronous=NORMAL).

What didn't change: the underlying PRAGMA logic + the workload characterization. The repro at #30896 still applies; the limitation at #31618 still applies.

@alt-glitch you noted this is related to #30645 + #30969. v2 here is the smaller hotfix shape for the POSIX-fsync race; #30969 (broader leases + corruption detection) is what we'd need for the harder SIGKILL-mid-fsync class. Happy to either land both, or land this one as a stop-gap while #30969 is being reviewed.

PR labeled P3 currently — would appreciate a re-evaluation if the opt-in shape makes this lower-risk to merge. The default-unchanged behavior means zero regression surface for anyone not opting in.

julio-cloudvisor · 2026-05-27T16:08:24Z

Heads-up: #33243 (opened today by @teknium1) rediagnoses the corruption class as a false-positive from a probe-connection race rather than a real WAL/durability gap. It closes 7 related issues including our #31618 and validates against the same .corrupt.*.bak symptom we hit.

If #33243 lands, this PR's PRAGMA changes become unnecessary — the false-positive detector going away eliminates the symptom regardless of synchronous mode. We'll close this PR out once #33243 merges.

Leaving open in the meantime as a smaller stop-gap in case Nous prefers to ship the targeted fix first, or for deployments that want belt-and-suspenders on the WAL durability semantics independent of the false-positive bug. (The v2 opt-in shape via env var means there's still zero regression surface for the common case.)

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)

julio-cloudvisor · 2026-05-28T01:05:50Z

Closing — upstream main now has the equivalent (and better-targeted) fixes:

6416dd51 fix(kanban): harden SQLite against torn-write corruption (synchronous=FULL + secure_delete + cell_size_check) — does what this PR was trying to do, as the default, plus the additional integrity PRAGMAs.
99c19eb2 fix(kanban): add post-commit page_count invariant check + wal_autocheckpoint=100 — adds the tighter WAL checkpoint cadence this PR also wanted.
dc98314f fix(kanban): skip redundant WAL pragma on already-WAL connections — the actual root cause for our kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618 observation. The pragma re-application was the false-positive trigger; our PR mischaracterized this as a durability race.

Combined, these three commits supersede this PR entirely and the opt-in env-var shape from the v2 re-roll is no longer needed. Validated all three are in origin/main and pulled cleanly into our production runtime.

Thanks @jsboige for the v1 review (the config-flag suggestion was right for the wrong reason — the real fix lives further down the stack) and @alt-glitch for the initial triage that pointed at #30969 and the related cluster.

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports) #AI commit#

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

julio-cloudvisor mentioned this pull request May 23, 2026

[Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896

Open

jcprz force-pushed the cv/wal-corruption-resilience branch 2 times, most recently from 0b6ef71 to 7cd1f6e Compare May 23, 2026 14:11

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 23, 2026

alt-glitch mentioned this pull request May 23, 2026

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158

Closed

steveonjava mentioned this pull request May 24, 2026

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) #31208

Closed

14 tasks

alt-glitch mentioned this pull request May 24, 2026

Kanban SQLite database corruption under rapid task creation #31502

Closed

1 task

julio-cloudvisor mentioned this pull request May 24, 2026

kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618

Closed

This was referenced May 24, 2026

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31726

Closed

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731

Closed

ddblue0 mentioned this pull request May 25, 2026

Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure #31736

Closed

steveonjava mentioned this pull request May 25, 2026

fix(kanban): add post-commit page_count invariant check to write_txn #32300

Closed

11 tasks

alt-glitch mentioned this pull request May 26, 2026

[Bug]: Kanban DB intermittent corruption after worker crash — missing WAL checkpoint #32543

Closed

julio-cloudvisor mentioned this pull request May 26, 2026

hermes auth add openai-codex writes only to credential_pool; runtime resolver reads providers.tokens — fails with 'missing access_token' #32730

Closed

steveonjava mentioned this pull request May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

8 tasks

julio-cloudvisor mentioned this pull request May 27, 2026

fix: harden kanban sqlite corruption handling #30969

Open

jcprz force-pushed the cv/wal-corruption-resilience branch from 7cd1f6e to 4f0568c Compare May 27, 2026 16:00

julio-cloudvisor changed the title ~~fix(kanban): synchronous=FULL + wal_autocheckpoint=100 to prevent B-tree corruption under concurrent writer races~~ feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening May 27, 2026

julio-cloudvisor mentioned this pull request May 27, 2026

fix(kanban): pooled DB connection across gateway + dashboard (closes #33169 cluster) #33243

Closed

alt-glitch mentioned this pull request May 27, 2026

[Bug]: Database corruption problem with Kanban causing system crash. #33334

Closed

1 task

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

julio-cloudvisor closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening#30973

feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening#30973
julio-cloudvisor wants to merge 1 commit into
NousResearch:mainfrom
devisory-engineering:cv/wal-corruption-resilience

julio-cloudvisor commented May 23, 2026 •

edited

Loading

Uh oh!

jsboige commented May 23, 2026

Uh oh!

alt-glitch commented May 23, 2026

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

julio-cloudvisor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

julio-cloudvisor commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why opt-in vs default-change (v2 of this PR)

What the env var does

Why these two PRAGMAs

Known limitation (filed at #31618)

Backwards compatibility

Test plan

Validation

Related

Uh oh!

jsboige commented May 23, 2026

Review — Technical Assessment

What looks right

Suggestions (non-blocking)

Verdict

Uh oh!

alt-glitch commented May 23, 2026

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

julio-cloudvisor commented May 27, 2026

Uh oh!

julio-cloudvisor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

julio-cloudvisor commented May 23, 2026 •

edited

Loading