fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610)#32857
Conversation
…te + cell_size_check + synchronous=FULL) Production corruption NousResearch#6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.
Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:
1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
silently downgrades to journal_mode=DELETE.
2. Process B (no EIO) opens the same file moments later and
successfully sets journal_mode=WAL.
3. A writes rollback-journal frames while B writes WAL frames. SQLite
documents this as unsupported and corrupts the file:
https://www.sqlite.org/wal.html ("all connections to the same
database must use the same locking protocol").
This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.
Fix has two layers:
1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
so callers can retry instead of silently corrupting the DB. The two
remaining markers ("locking protocol", "not authorized") are
deterministic per filesystem so they remain safe permanent-downgrade
signals.
2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
on-disk journal mode. If the header says WAL, refuse to downgrade and
re-raise the original error. This guards against any future addition
to _WAL_INCOMPAT_MARKERS turning out to be transient in some
environment we haven't yet seen.
Tests:
- tests/test_hermes_state_wal_fallback.py:
* Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
asserting EIO is re-raised, not silently swallowed.
* Added test_does_not_downgrade_when_disk_says_wal covering the
on-disk-header safety guard for the existing legitimate markers.
- tests/hermes_cli/test_kanban_db.py:
* test_connect_falls_back_to_delete_on_locking_protocol now uses a
truly-fresh DB (instead of the kanban_home fixture which pre-inits
in WAL). On NFS the very first process touching the file legitimately
downgrades; on a file already in WAL the new guard correctly refuses.
A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.
Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: NousResearch#20015 (`_pid_alive` false-negative behaviour),
…backoff Transient SQLite I/O errors that match the corruption pattern permanently disabled board dispatch. Add PRAGMA quick_check confirmation before latching, and replace the fingerprint-only latch with exponential backoff (30s initial, doubles per failure, 900s cap). Clear latch on fingerprint change or successful dispatch. Refs: NousResearch#30417 (Bug 2) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.
Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.
Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR #32857.
|
Merged via #33482 — 7 of 8 commits cherry-picked with per-commit authorship preserved (rebase merge, both Commit 5 ( Thanks for the thorough investigation and the production-validated batch — the WAL-pragma-skip root-cause fix is especially nice. Closing this PR; the rest of your draft chain (#31294, #31310, #30727, #32300, #32301, #32489) is being closed alongside since each commit is now on main. |
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857. #AI commit#
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.
The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR NousResearch#32857.
Summary
Eight
kanban.dbSQLite hardening fixes consolidated into one PR for batch review. Each commit is atomic, originally filed as a separate draft PR (kept open as fallback), and now validated together against a 607-test kanban + state suite (all green).Pattern follows the precedent in #23791.
Closes #31158. Refs #29610, #30846, #30896, #30973, #31014, #31130.
Why batch this
Over the last two weeks I filed nine separate draft PRs (#30727, #31208, #31294, #31310, #31891, #31932, #32300, #32301, #32489) covering the same broad surface:
kanban.dbcorruption events I hit running a five-profile kanban fleet under sustained writer load. Eleven distinct corruption events surfaced over that window, spanning several bug classes (torn-write on freed pages, single-row mid-write, b-tree split race during concurrent writers, EIO during WAL init, post-WAL-init(deleted)sidecar accumulation, zombie worker accumulation). The investigation pointed at a single upstream root cause for the cluster (commit 8 below) plus seven defense-in-depth contributors.Shipping them as one PR makes the story reviewable in one pass instead of nine. Each commit is still cherry-pickable if you'd rather land them surgically; the original drafts remain open and pinned to their respective branches.
#31891(mergecomplete_task+recompute_readyinto one txn) is not in this bundle. I ran a week of production load with all of 1-8 applied and that PR not applied, and saw noidx_tasks_statusdrift. I'd rather not ship it on speculation.What's in here, in commit order
The order is defensive primitives first, then EIO handling, then dispatcher hygiene, then write-side detection, then process hygiene, then the headline root cause last. Reading top-to-bottom traces the same story the investigation went through.
1.
fix(kanban): harden SQLite against torn-write corruption(was #31208)Sets
secure_delete=ON,cell_size_check=ON,synchronous=FULLon every kanbanconnect(). After a torn write, freed pages contain zeros instead of stale cell text, so corruption is detectable and stale rows can't resurface as live data.2.
fix(state): never silently downgrade WAL to DELETE on transient EIO(was #31294)apply_wal_with_fallbackpreviously caught anyOperationalErrorcontaining one of the NFS/SMB markers and silently dropped to DELETE. A transient EIO (kernel I/O backoff under memory pressure) was matching one of those markers and downgrading a healthy WAL DB to DELETE in place. Tightens the marker list and adds an on-disk header check before downgrading: if the file is already WAL on disk, re-raise instead of rewriting it.3.
fix(kanban): preserve original exception when write_txn rollback fails(was #31310)On an EIO or auto-rollback inside
write_txn, the explicitROLLBACKstatement raises a secondaryOperationalErrorthat shadows the original. Caller gets "no transaction is active" instead of "disk I/O error". The wrapper now catches the rollback error and re-raises the original.4.
fix(kanban): add grace period to detect_crashed_workers(was #30727)When two dispatchers race during gateway restart, the new one's
detect_crashed_workerssweep can reclaim a task that the old dispatcher is mid-spawning, causing a respawn storm. Adds aCRASH_GRACE_SECONDSwindow so a task whoseclaimedevent landed within the last N seconds is not reclaimed.5.
fix(gateway): replace permanent corrupt-board latch with exponential backoff(was #31932)The
disabled_corrupt_boardsdict permanently latched boards on anyOperationalErrormatching "file is not a database". A transient EIO duringconnect()would short-read the SQLite header, raise that exact message, and strand the board until manual gateway restart. Replaces with exponential backoff (30s → 30min cap) plus a header-magic-byte recheck: if the first 16 bytes areSQLite format 3\0, the file is real and the error was transient.6.
fix(kanban): add post-commit page_count invariant check to write_txn(was #32300)Reads bytes 28-31 of the DB header after every COMMIT and compares against
os.path.getsize() / page_size. Raisessqlite3.DatabaseErroron torn-extend (header claims more pages than the file has). Also setsPRAGMA wal_autocheckpoint=100so the WAL doesn't grow unbounded under contention. Surfaces Bug F damage at commit time instead of on the next read.7.
fix(kanban): hoist zombie reaper out of dispatch_once(was #32301)dispatch_onceaborted on any per-boardconnect()failure, but the onlyos.waitpid(-1, WNOHANG)reaper call lived inside that same function past the connect. One transient EIO → reaper skipped → zombies accumulated →task_runsrows left withworker_pidpointing at defunct PIDs. Moves the reaper into_kanban_dispatcher_watcherso it runs at the top of every tick regardless of per-board outcomes. Also fixes a test-indentation bug from an earlier salvage. (Squashed from three sibling commits in #32301 for review.)8.
fix(kanban): skip redundant WAL pragma on already-WAL connections(was #32489)The root cause. Every kanban
connect()was runningPRAGMA journal_mode=WALeven when the DB header already reported WAL mode. SQLite's WAL-init path treats that as "set up WAL from scratch": checkpoint, unlink the existing-shm/-walsidecars, recreate them under the current connection's lock state. If any other process or thread had those sidecar inodes open at that instant (notifier-watcher polling every 5s, dashboard-watcher polling every 10s, dispatcher tick every 60s, any active worker), the unlinked inodes became(deleted)fds in those processes. Their in-process mmap regions kept pointing at vanished disk pages. The nextconnect()from any of those processes opened fresh inodes for the sidecars, but the long-lived watcher-thread connections still had the old(deleted)mappings pinned. SubsequentPRAGMA journal_mode=WALon those stale-state connections raisedsqlite3.OperationalError("disk I/O error"). Under enough write pressure the unlink-and-recreate dance raced with a concurrent writer's b-tree split and the next commit landed structurally damaged on disk (Class-C corruption).The fix is a cheap read-only
PRAGMA journal_modeprobe before the set-pragma. If the result is already'wal', return early. Mode is durable in the file header (byte 18 = 2 for WAL), so the first connection in the process primes WAL once and every subsequent connection is observation-only, with no unlink and no race. The set-pragma path still runs on first-connection-per-process, which is where the residual rate comes from; see "What this does not fix" below.Approaches I tried locally and reverted
Worth surfacing so reviewers don't redo this work:
_kb_conncache (fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226). Tested locally and reverted after production corruption was confirmed within hours. Concurrent thread access to one SQLite connection violates SQLite's threading model under any meaningful load. fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race #32226 is closed.SELECT 1) was also tested locally and reverted. fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531 closed.The pattern across all three: they tried to make the unlink "less likely" by changing the connection topology. Commit 8 in this PR removes the unlink at the source instead.
What this PR does not fix
First-connection-per-process still hits the WAL-init path, because WAL must be primed once. Any fresh process spawning under write pressure can still produce one unlink event. Three of the commits above (the on-disk header check in #2, the header-magic peek in #5, the post-commit page_count check in #6) catch and either prevent or surface the resulting damage, but they do not eliminate it.
Empirically, with this entire stack live, my fleet has seen one to two Class-C events per week on the highest-write-volume board, down from one to two events per day before any of this landed. A future fix that serializes first-connection WAL init across processes (a flock around
PRAGMA journal_mode=WAL, which I'm running locally as a separate patch) would close the residual gap. Not in this PR.Coordination with PR #32759
@MoonRay305 filed #32759 (
fix(kanban): harden SQLite connection concurrency) covering related territory: first-connect flock, busy_timeout, and refusing auto-init on malformed DBs. There's no file overlap with this PR (#32759 toucheshermes_cli/kanban_db.pyonly, mostly different functions; the AUTHOR_MAP addition won't conflict). Suggested merge order: this PR first as the root-cause fix, then #32759 as defense-in-depth on top. If you prefer the opposite, the only rebase work needed on my side would be on commit 1.Testing
.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_dispatcher_resilience.py tests/hermes_cli/test_kanban_core_functionality.pyResult: 607 passed.
The new tests added by these commits cover each fix individually. Most useful new fixtures:
tests/hermes_cli/test_kanban_dispatcher_resilience.py(corrupt-board latch behavior under transient EIO) and the WAL-skip probe tests intests/test_hermes_state.py.Production validation
These commits have been running on
local/runtimeof my fork against five gateway processes (default + writing + admin + coder + researcher) on a Linux LXC for one to seven days each, depending on the commit. The writing board sustains roughly 200-400 commits per hour under typical workloads, mostly from dispatcher tick + worker heartbeat + event log writes. EIO rate dropped from hundreds per day to single digits.PR scope checklist