fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) by steveonjava · Pull Request #32857 · NousResearch/hermes-agent

steveonjava · 2026-05-26T22:34:43Z

Summary

Eight kanban.db SQLite hardening fixes consolidated into one PR for batch review. Each commit is atomic, originally filed as a separate draft PR (kept open as fallback), and now validated together against a 607-test kanban + state suite (all green).

Pattern follows the precedent in #23791.

Closes #31158. Refs #29610, #30846, #30896, #30973, #31014, #31130.

Why batch this

Over the last two weeks I filed nine separate draft PRs (#30727, #31208, #31294, #31310, #31891, #31932, #32300, #32301, #32489) covering the same broad surface: kanban.db corruption events I hit running a five-profile kanban fleet under sustained writer load. Eleven distinct corruption events surfaced over that window, spanning several bug classes (torn-write on freed pages, single-row mid-write, b-tree split race during concurrent writers, EIO during WAL init, post-WAL-init (deleted) sidecar accumulation, zombie worker accumulation). The investigation pointed at a single upstream root cause for the cluster (commit 8 below) plus seven defense-in-depth contributors.

Shipping them as one PR makes the story reviewable in one pass instead of nine. Each commit is still cherry-pickable if you'd rather land them surgically; the original drafts remain open and pinned to their respective branches.

#31891 (merge complete_task + recompute_ready into one txn) is not in this bundle. I ran a week of production load with all of 1-8 applied and that PR not applied, and saw no idx_tasks_status drift. I'd rather not ship it on speculation.

What's in here, in commit order

The order is defensive primitives first, then EIO handling, then dispatcher hygiene, then write-side detection, then process hygiene, then the headline root cause last. Reading top-to-bottom traces the same story the investigation went through.

1. `fix(kanban): harden SQLite against torn-write corruption` (was #31208)

Sets secure_delete=ON, cell_size_check=ON, synchronous=FULL on every kanban connect(). After a torn write, freed pages contain zeros instead of stale cell text, so corruption is detectable and stale rows can't resurface as live data.

2. `fix(state): never silently downgrade WAL to DELETE on transient EIO` (was #31294)

apply_wal_with_fallback previously caught any OperationalError containing one of the NFS/SMB markers and silently dropped to DELETE. A transient EIO (kernel I/O backoff under memory pressure) was matching one of those markers and downgrading a healthy WAL DB to DELETE in place. Tightens the marker list and adds an on-disk header check before downgrading: if the file is already WAL on disk, re-raise instead of rewriting it.

3. `fix(kanban): preserve original exception when write_txn rollback fails` (was #31310)

On an EIO or auto-rollback inside write_txn, the explicit ROLLBACK statement raises a secondary OperationalError that shadows the original. Caller gets "no transaction is active" instead of "disk I/O error". The wrapper now catches the rollback error and re-raises the original.

4. `fix(kanban): add grace period to detect_crashed_workers` (was #30727)

When two dispatchers race during gateway restart, the new one's detect_crashed_workers sweep can reclaim a task that the old dispatcher is mid-spawning, causing a respawn storm. Adds a CRASH_GRACE_SECONDS window so a task whose claimed event landed within the last N seconds is not reclaimed.

5. `fix(gateway): replace permanent corrupt-board latch with exponential backoff` (was #31932)

The disabled_corrupt_boards dict permanently latched boards on any OperationalError matching "file is not a database". A transient EIO during connect() would short-read the SQLite header, raise that exact message, and strand the board until manual gateway restart. Replaces with exponential backoff (30s → 30min cap) plus a header-magic-byte recheck: if the first 16 bytes are SQLite format 3\0, the file is real and the error was transient.

6. `fix(kanban): add post-commit page_count invariant check to write_txn` (was #32300)

Reads bytes 28-31 of the DB header after every COMMIT and compares against os.path.getsize() / page_size. Raises sqlite3.DatabaseError on torn-extend (header claims more pages than the file has). Also sets PRAGMA wal_autocheckpoint=100 so the WAL doesn't grow unbounded under contention. Surfaces Bug F damage at commit time instead of on the next read.

7. `fix(kanban): hoist zombie reaper out of dispatch_once` (was #32301)

dispatch_once aborted on any per-board connect() failure, but the only os.waitpid(-1, WNOHANG) reaper call lived inside that same function past the connect. One transient EIO → reaper skipped → zombies accumulated → task_runs rows left with worker_pid pointing at defunct PIDs. Moves the reaper into _kanban_dispatcher_watcher so it runs at the top of every tick regardless of per-board outcomes. Also fixes a test-indentation bug from an earlier salvage. (Squashed from three sibling commits in #32301 for review.)

8. `fix(kanban): skip redundant WAL pragma on already-WAL connections` (was #32489)

The root cause. Every kanban connect() was running PRAGMA journal_mode=WAL even when the DB header already reported WAL mode. SQLite's WAL-init path treats that as "set up WAL from scratch": checkpoint, unlink the existing -shm/-wal sidecars, recreate them under the current connection's lock state. If any other process or thread had those sidecar inodes open at that instant (notifier-watcher polling every 5s, dashboard-watcher polling every 10s, dispatcher tick every 60s, any active worker), the unlinked inodes became (deleted) fds in those processes. Their in-process mmap regions kept pointing at vanished disk pages. The next connect() from any of those processes opened fresh inodes for the sidecars, but the long-lived watcher-thread connections still had the old (deleted) mappings pinned. Subsequent PRAGMA journal_mode=WAL on those stale-state connections raised sqlite3.OperationalError("disk I/O error"). Under enough write pressure the unlink-and-recreate dance raced with a concurrent writer's b-tree split and the next commit landed structurally damaged on disk (Class-C corruption).

The fix is a cheap read-only PRAGMA journal_mode probe before the set-pragma. If the result is already 'wal', return early. Mode is durable in the file header (byte 18 = 2 for WAL), so the first connection in the process primes WAL once and every subsequent connection is observation-only, with no unlink and no race. The set-pragma path still runs on first-connection-per-process, which is where the residual rate comes from; see "What this does not fix" below.

Approaches I tried locally and reverted

Worth surfacing so reviewers don't redo this work:

Shared per-board _kb_conn cache (fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226). Tested locally and reverted after production corruption was confirmed within hours. Concurrent thread access to one SQLite connection violates SQLite's threading model under any meaningful load. fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226 is closed.
TLS per-thread connection cache (fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322). Tested locally and reverted: reduced connection count but did not eliminate the unlink itself, any code path that opened a fresh connection still triggered it. Same wedge symptom at lower frequency. fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322 closed.
Held-BEGIN + sqlite_master pinner (fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531). Tested locally and reverted. The "correct pinner pattern" from the prior-art comments. Pinner only helps for already-WAL connections, so it needs the redundant-WAL-pragma skip (fix(kanban): skip redundant WAL pragma on already-WAL connections #32489) to actually prevent the unlink it's trying to block. On its own it doesn't help. An earlier naive variant (bare SELECT 1) was also tested locally and reverted. fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531 closed.

The pattern across all three: they tried to make the unlink "less likely" by changing the connection topology. Commit 8 in this PR removes the unlink at the source instead.

What this PR does not fix

First-connection-per-process still hits the WAL-init path, because WAL must be primed once. Any fresh process spawning under write pressure can still produce one unlink event. Three of the commits above (the on-disk header check in #2, the header-magic peek in #5, the post-commit page_count check in #6) catch and either prevent or surface the resulting damage, but they do not eliminate it.

Empirically, with this entire stack live, my fleet has seen one to two Class-C events per week on the highest-write-volume board, down from one to two events per day before any of this landed. A future fix that serializes first-connection WAL init across processes (a flock around PRAGMA journal_mode=WAL, which I'm running locally as a separate patch) would close the residual gap. Not in this PR.

Coordination with PR #32759

@MoonRay305 filed #32759 (fix(kanban): harden SQLite connection concurrency) covering related territory: first-connect flock, busy_timeout, and refusing auto-init on malformed DBs. There's no file overlap with this PR (#32759 touches hermes_cli/kanban_db.py only, mostly different functions; the AUTHOR_MAP addition won't conflict). Suggested merge order: this PR first as the root-cause fix, then #32759 as defense-in-depth on top. If you prefer the opposite, the only rebase work needed on my side would be on commit 1.

Testing

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_dispatcher_resilience.py tests/hermes_cli/test_kanban_core_functionality.py

Result: 607 passed.

The new tests added by these commits cover each fix individually. Most useful new fixtures: tests/hermes_cli/test_kanban_dispatcher_resilience.py (corrupt-board latch behavior under transient EIO) and the WAL-skip probe tests in tests/test_hermes_state.py.

Production validation

These commits have been running on local/runtime of my fork against five gateway processes (default + writing + admin + coder + researcher) on a Linux LXC for one to seven days each, depending on the commit. The writing board sustains roughly 200-400 commits per hour under typical workloads, mostly from dispatcher tick + worker heartbeat + event log writes. EIO rate dropped from hundreds per day to single digits.

PR scope checklist

Conventional commits
Eight atomic, cherry-pickable commits, each from a previously-filed draft
No new features, no API changes, no schema migrations
All changes are kanban / SQLite hardening; no unrelated drive-by edits
AUTHOR_MAP entry for steveonjava (folded into commit 1)
Closes kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158 in PR body. Refs Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610, fix: BTRFS COW + SQLite WAL incompatibility — disk I/O errors on BTRFS #30846, [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896, feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening #30973, fix: remove disk i/o error from WAL incompatibility markers #31014, fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799) #31130.
Local pytest run green (607 passed)
I've tested on my platform: Linux 6.17.13 (LXC on Proxmox), Python 3.13, ext4 on NVMe.

…te + cell_size_check + synchronous=FULL) Production corruption NousResearch#6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).

apply_wal_with_fallback() treated "disk i/o error" as a permanent WAL-incompatibility marker, identical to "locking protocol" (NFS) and "not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is typically TRANSIENT — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property. Treating transient EIO as a permanent downgrade signal produces the mixed-journal-mode-across-processes corruption pattern: 1. Process A opens kanban.db, hits transient EIO on the WAL pragma, silently downgrades to journal_mode=DELETE. 2. Process B (no EIO) opens the same file moments later and successfully sets journal_mode=WAL. 3. A writes rollback-journal frames while B writes WAL frames. SQLite documents this as unsupported and corrupts the file: https://www.sqlite.org/wal.html ("all connections to the same database must use the same locking protocol"). This was the root cause of repeated kanban.db corruption on hosts with multiple gateway processes plus CLI invocations against the same DB (observed pattern: corruption shortly after gateway startup, after the process logged "WAL journal_mode unsupported on this filesystem (disk I/O error) — falling back to journal_mode=DELETE"). The fallback warning told the truth — fallback DID happen — but the premise ("unsupported on this filesystem") was wrong; the EIO was a one-shot event and sibling processes successfully used WAL. Fix has two layers: 1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises so callers can retry instead of silently corrupting the DB. The two remaining markers ("locking protocol", "not authorized") are deterministic per filesystem so they remain safe permanent-downgrade signals. 2. Belt-and-suspenders: before downgrading on ANY marker match, peek the on-disk journal mode. If the header says WAL, refuse to downgrade and re-raise the original error. This guards against any future addition to _WAL_INCOMPAT_MARKERS turning out to be transient in some environment we haven't yet seen. Tests: - tests/test_hermes_state_wal_fallback.py: * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error asserting EIO is re-raised, not silently swallowed. * Added test_does_not_downgrade_when_disk_says_wal covering the on-disk-header safety guard for the existing legitimate markers. - tests/hermes_cli/test_kanban_db.py: * test_connect_falls_back_to_delete_on_locking_protocol now uses a truly-fresh DB (instead of the kanban_home fixture which pre-inits in WAL). On NFS the very first process touching the file legitimately downgrades; on a file already in WAL the new guard correctly refuses. A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py (not committed): without fix the DB silently flips from WAL to DELETE mid-process; with fix the EIO surfaces and the file stays WAL. Refs: Bug D in the kanban-corruption investigation series (Bugs A and C shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every corruption incident this week including those that survived A's single-dispatcher mitigation, because every CLI invocation is a separate process whose WAL pragma can transiently fail.

When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails

`detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: NousResearch#20015 (`_pid_alive` false-negative behaviour),

…backoff Transient SQLite I/O errors that match the corruption pattern permanently disabled board dispatch. Add PRAGMA quick_check confirmation before latching, and replace the fingerprint-only latch with exponential backoff (30s initial, doubles per failure, 900s cap). Clear latch on fingerprint change or successful dispatch. Refs: NousResearch#30417 (Bug 2) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error. Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged. Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted via connection-sharing/caching approaches.

The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR #32857.

kshitijk4poor · 2026-05-27T21:32:35Z

Merged via #33482 — 7 of 8 commits cherry-picked with per-commit authorship preserved (rebase merge, both steveonjava@gmail.com and steve@steveonjava.com identities credited).

Commit 5 (fix(gateway): replace permanent corrupt-board latch with exponential backoff) was dropped from the salvage because main now has a flat 5-minute quarantine timer from c94ad89 (@donovan-yohan) — the substantive ideas in your commit 5 (exponential backoff 30s → 30min cap + PRAGMA quick_check to distinguish transient EIO from real corruption) are valuable and will be salvaged separately as a follow-up rebased onto the current dispatcher shape. Filing an issue for that work shortly.

Thanks for the thorough investigation and the production-validated batch — the WAL-pragma-skip root-cause fix is especially nice. Closing this PR; the rest of your draft chain (#31294, #31310, #30727, #32300, #32301, #32489) is being closed alongside since each commit is now on main.

The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.

The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857. #AI commit#

The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.

steveonjava and others added 8 commits May 26, 2026 15:18

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery labels May 26, 2026

steveonjava marked this pull request as ready for review May 26, 2026 23:17

steveonjava mentioned this pull request May 26, 2026

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158

Closed

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this in #33482 May 27, 2026

kshitijk4poor mentioned this pull request May 27, 2026

Follow-up: rebuild PR #32857 commit 5 (exp-backoff + PRAGMA quick_check) on top of c94ad8981 #33486

Open

hehehe0803 mentioned this pull request May 28, 2026

fix(kanban): sync allowlisted forced skills before blocking #33640

Open

teknium1 mentioned this pull request May 28, 2026

[Bug]: Database corruption problem with Kanban causing system crash. #33334

Closed

1 task

BROCCOLO1D mentioned this pull request May 29, 2026

fix(kanban): confirm corruption before backoff #34235

Closed

rodboev mentioned this pull request Jun 5, 2026

fix(kanban): confirm corrupt boards before backoff (#33486) #39783

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610)#32857

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610)#32857
steveonjava wants to merge 8 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-corruption-batch-salvage

steveonjava commented May 26, 2026 •

edited

Loading

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

steveonjava commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why batch this

What's in here, in commit order

1. fix(kanban): harden SQLite against torn-write corruption (was #31208)

2. fix(state): never silently downgrade WAL to DELETE on transient EIO (was #31294)

3. fix(kanban): preserve original exception when write_txn rollback fails (was #31310)

4. fix(kanban): add grace period to detect_crashed_workers (was #30727)

5. fix(gateway): replace permanent corrupt-board latch with exponential backoff (was #31932)

6. fix(kanban): add post-commit page_count invariant check to write_txn (was #32300)

7. fix(kanban): hoist zombie reaper out of dispatch_once (was #32301)

8. fix(kanban): skip redundant WAL pragma on already-WAL connections (was #32489)

Approaches I tried locally and reverted

What this PR does not fix

Coordination with PR #32759

Testing

Production validation

PR scope checklist

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

steveonjava commented May 26, 2026 •

edited

Loading

1. `fix(kanban): harden SQLite against torn-write corruption` (was #31208)

2. `fix(state): never silently downgrade WAL to DELETE on transient EIO` (was #31294)

3. `fix(kanban): preserve original exception when write_txn rollback fails` (was #31310)

4. `fix(kanban): add grace period to detect_crashed_workers` (was #30727)

5. `fix(gateway): replace permanent corrupt-board latch with exponential backoff` (was #31932)

6. `fix(kanban): add post-commit page_count invariant check to write_txn` (was #32300)

7. `fix(kanban): hoist zombie reaper out of dispatch_once` (was #32301)

8. `fix(kanban): skip redundant WAL pragma on already-WAL connections` (was #32489)