Skip to content

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610)#32857

Closed
steveonjava wants to merge 8 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-corruption-batch-salvage
Closed

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610)#32857
steveonjava wants to merge 8 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-corruption-batch-salvage

Conversation

@steveonjava

@steveonjava steveonjava commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Eight kanban.db SQLite hardening fixes consolidated into one PR for batch review. Each commit is atomic, originally filed as a separate draft PR (kept open as fallback), and now validated together against a 607-test kanban + state suite (all green).

Pattern follows the precedent in #23791.

Closes #31158. Refs #29610, #30846, #30896, #30973, #31014, #31130.

Why batch this

Over the last two weeks I filed nine separate draft PRs (#30727, #31208, #31294, #31310, #31891, #31932, #32300, #32301, #32489) covering the same broad surface: kanban.db corruption events I hit running a five-profile kanban fleet under sustained writer load. Eleven distinct corruption events surfaced over that window, spanning several bug classes (torn-write on freed pages, single-row mid-write, b-tree split race during concurrent writers, EIO during WAL init, post-WAL-init (deleted) sidecar accumulation, zombie worker accumulation). The investigation pointed at a single upstream root cause for the cluster (commit 8 below) plus seven defense-in-depth contributors.

Shipping them as one PR makes the story reviewable in one pass instead of nine. Each commit is still cherry-pickable if you'd rather land them surgically; the original drafts remain open and pinned to their respective branches.

#31891 (merge complete_task + recompute_ready into one txn) is not in this bundle. I ran a week of production load with all of 1-8 applied and that PR not applied, and saw no idx_tasks_status drift. I'd rather not ship it on speculation.

What's in here, in commit order

The order is defensive primitives first, then EIO handling, then dispatcher hygiene, then write-side detection, then process hygiene, then the headline root cause last. Reading top-to-bottom traces the same story the investigation went through.

1. fix(kanban): harden SQLite against torn-write corruption (was #31208)

Sets secure_delete=ON, cell_size_check=ON, synchronous=FULL on every kanban connect(). After a torn write, freed pages contain zeros instead of stale cell text, so corruption is detectable and stale rows can't resurface as live data.

2. fix(state): never silently downgrade WAL to DELETE on transient EIO (was #31294)

apply_wal_with_fallback previously caught any OperationalError containing one of the NFS/SMB markers and silently dropped to DELETE. A transient EIO (kernel I/O backoff under memory pressure) was matching one of those markers and downgrading a healthy WAL DB to DELETE in place. Tightens the marker list and adds an on-disk header check before downgrading: if the file is already WAL on disk, re-raise instead of rewriting it.

3. fix(kanban): preserve original exception when write_txn rollback fails (was #31310)

On an EIO or auto-rollback inside write_txn, the explicit ROLLBACK statement raises a secondary OperationalError that shadows the original. Caller gets "no transaction is active" instead of "disk I/O error". The wrapper now catches the rollback error and re-raises the original.

4. fix(kanban): add grace period to detect_crashed_workers (was #30727)

When two dispatchers race during gateway restart, the new one's detect_crashed_workers sweep can reclaim a task that the old dispatcher is mid-spawning, causing a respawn storm. Adds a CRASH_GRACE_SECONDS window so a task whose claimed event landed within the last N seconds is not reclaimed.

5. fix(gateway): replace permanent corrupt-board latch with exponential backoff (was #31932)

The disabled_corrupt_boards dict permanently latched boards on any OperationalError matching "file is not a database". A transient EIO during connect() would short-read the SQLite header, raise that exact message, and strand the board until manual gateway restart. Replaces with exponential backoff (30s → 30min cap) plus a header-magic-byte recheck: if the first 16 bytes are SQLite format 3\0, the file is real and the error was transient.

6. fix(kanban): add post-commit page_count invariant check to write_txn (was #32300)

Reads bytes 28-31 of the DB header after every COMMIT and compares against os.path.getsize() / page_size. Raises sqlite3.DatabaseError on torn-extend (header claims more pages than the file has). Also sets PRAGMA wal_autocheckpoint=100 so the WAL doesn't grow unbounded under contention. Surfaces Bug F damage at commit time instead of on the next read.

7. fix(kanban): hoist zombie reaper out of dispatch_once (was #32301)

dispatch_once aborted on any per-board connect() failure, but the only os.waitpid(-1, WNOHANG) reaper call lived inside that same function past the connect. One transient EIO → reaper skipped → zombies accumulated → task_runs rows left with worker_pid pointing at defunct PIDs. Moves the reaper into _kanban_dispatcher_watcher so it runs at the top of every tick regardless of per-board outcomes. Also fixes a test-indentation bug from an earlier salvage. (Squashed from three sibling commits in #32301 for review.)

8. fix(kanban): skip redundant WAL pragma on already-WAL connections (was #32489)

The root cause. Every kanban connect() was running PRAGMA journal_mode=WAL even when the DB header already reported WAL mode. SQLite's WAL-init path treats that as "set up WAL from scratch": checkpoint, unlink the existing -shm/-wal sidecars, recreate them under the current connection's lock state. If any other process or thread had those sidecar inodes open at that instant (notifier-watcher polling every 5s, dashboard-watcher polling every 10s, dispatcher tick every 60s, any active worker), the unlinked inodes became (deleted) fds in those processes. Their in-process mmap regions kept pointing at vanished disk pages. The next connect() from any of those processes opened fresh inodes for the sidecars, but the long-lived watcher-thread connections still had the old (deleted) mappings pinned. Subsequent PRAGMA journal_mode=WAL on those stale-state connections raised sqlite3.OperationalError("disk I/O error"). Under enough write pressure the unlink-and-recreate dance raced with a concurrent writer's b-tree split and the next commit landed structurally damaged on disk (Class-C corruption).

The fix is a cheap read-only PRAGMA journal_mode probe before the set-pragma. If the result is already 'wal', return early. Mode is durable in the file header (byte 18 = 2 for WAL), so the first connection in the process primes WAL once and every subsequent connection is observation-only, with no unlink and no race. The set-pragma path still runs on first-connection-per-process, which is where the residual rate comes from; see "What this does not fix" below.

Approaches I tried locally and reverted

Worth surfacing so reviewers don't redo this work:

The pattern across all three: they tried to make the unlink "less likely" by changing the connection topology. Commit 8 in this PR removes the unlink at the source instead.

What this PR does not fix

First-connection-per-process still hits the WAL-init path, because WAL must be primed once. Any fresh process spawning under write pressure can still produce one unlink event. Three of the commits above (the on-disk header check in #2, the header-magic peek in #5, the post-commit page_count check in #6) catch and either prevent or surface the resulting damage, but they do not eliminate it.

Empirically, with this entire stack live, my fleet has seen one to two Class-C events per week on the highest-write-volume board, down from one to two events per day before any of this landed. A future fix that serializes first-connection WAL init across processes (a flock around PRAGMA journal_mode=WAL, which I'm running locally as a separate patch) would close the residual gap. Not in this PR.

Coordination with PR #32759

@MoonRay305 filed #32759 (fix(kanban): harden SQLite connection concurrency) covering related territory: first-connect flock, busy_timeout, and refusing auto-init on malformed DBs. There's no file overlap with this PR (#32759 touches hermes_cli/kanban_db.py only, mostly different functions; the AUTHOR_MAP addition won't conflict). Suggested merge order: this PR first as the root-cause fix, then #32759 as defense-in-depth on top. If you prefer the opposite, the only rebase work needed on my side would be on commit 1.

Testing

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_dispatcher_resilience.py tests/hermes_cli/test_kanban_core_functionality.py

Result: 607 passed.

The new tests added by these commits cover each fix individually. Most useful new fixtures: tests/hermes_cli/test_kanban_dispatcher_resilience.py (corrupt-board latch behavior under transient EIO) and the WAL-skip probe tests in tests/test_hermes_state.py.

Production validation

These commits have been running on local/runtime of my fork against five gateway processes (default + writing + admin + coder + researcher) on a Linux LXC for one to seven days each, depending on the commit. The writing board sustains roughly 200-400 commits per hour under typical workloads, mostly from dispatcher tick + worker heartbeat + event log writes. EIO rate dropped from hundreds per day to single digits.

PR scope checklist

steveonjava and others added 8 commits May 26, 2026 15:18
…te + cell_size_check + synchronous=FULL)

Production corruption NousResearch#6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect():

- synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume.

- secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data.

- cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns.

All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs.

Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.

Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:

  1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
     silently downgrades to journal_mode=DELETE.
  2. Process B (no EIO) opens the same file moments later and
     successfully sets journal_mode=WAL.
  3. A writes rollback-journal frames while B writes WAL frames. SQLite
     documents this as unsupported and corrupts the file:
     https://www.sqlite.org/wal.html ("all connections to the same
     database must use the same locking protocol").

This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.

Fix has two layers:

1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
   so callers can retry instead of silently corrupting the DB. The two
   remaining markers ("locking protocol", "not authorized") are
   deterministic per filesystem so they remain safe permanent-downgrade
   signals.

2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
   on-disk journal mode. If the header says WAL, refuse to downgrade and
   re-raise the original error. This guards against any future addition
   to _WAL_INCOMPAT_MARKERS turning out to be transient in some
   environment we haven't yet seen.

Tests:

- tests/test_hermes_state_wal_fallback.py:
  * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
    asserting EIO is re-raised, not silently swallowed.
  * Added test_does_not_downgrade_when_disk_says_wal covering the
    on-disk-header safety guard for the existing legitimate markers.

- tests/hermes_cli/test_kanban_db.py:
  * test_connect_falls_back_to_delete_on_locking_protocol now uses a
    truly-fresh DB (instead of the kanban_home fixture which pre-inits
    in WAL). On NFS the very first process touching the file legitimately
    downgrades; on a file already in WAL the new guard correctly refuses.

A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.

Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
When code inside a write_txn block raises an OperationalError that SQLite
has already auto-rolled-back (typical for disk I/O error,
database is locked, and database disk image is malformed), the
explicit ROLLBACK in write_txn.__exit__ itself raises
cannot rollback - no transaction is active and the secondary exception
replaces the original in the traceback. Operators see a misleading error
and lose the diagnostic information they need.

Swallow the rollback-time OperationalError so the caller always sees the
original cause.

Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::
test_write_txn_preserves_original_exception_when_rollback_fails
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.

Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.

`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.

Companion to merged PR NousResearch#23442 (`release_stale_claims`, closes NousResearch#23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: NousResearch#20015 (`_pid_alive` false-negative behaviour),
…backoff

Transient SQLite I/O errors that match the corruption pattern permanently disabled board dispatch. Add PRAGMA quick_check confirmation before latching, and replace the fingerprint-only latch with exponential backoff (30s initial, doubles per failure, 900s cap). Clear latch on fingerprint change or successful dispatch. Refs: NousResearch#30417 (Bug 2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR NousResearch#32301 into one logical change for batch review.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery labels May 26, 2026
@steveonjava steveonjava marked this pull request as ready for review May 26, 2026 23:17
kshitijk4poor added a commit that referenced this pull request May 27, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR #32857.
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 — 7 of 8 commits cherry-picked with per-commit authorship preserved (rebase merge, both steveonjava@gmail.com and steve@steveonjava.com identities credited).

Commit 5 (fix(gateway): replace permanent corrupt-board latch with exponential backoff) was dropped from the salvage because main now has a flat 5-minute quarantine timer from c94ad89 (@donovan-yohan) — the substantive ideas in your commit 5 (exponential backoff 30s → 30min cap + PRAGMA quick_check to distinguish transient EIO from real corruption) are valuable and will be salvaged separately as a follow-up rebased onto the current dispatcher shape. Filing an issue for that work shortly.

Thanks for the thorough investigation and the production-validated batch — the WAL-pragma-skip root-cause fix is especially nice. Closing this PR; the rest of your draft chain (#31294, #31310, #30727, #32300, #32301, #32489) is being closed alongside since each commit is now on main.

mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR NousResearch#32857.
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR NousResearch#32857.

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR NousResearch#32857.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR NousResearch#32857.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR NousResearch#32857.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning

3 participants