fix(kanban): salvage cross-process init flock + busy_timeout (from #32759)#33696
Merged
kshitijk4poor merged 3 commits intoMay 28, 2026
Merged
Conversation
… salvage Adds `squiddy@2rook.ai → MoonRay305` to AUTHOR_MAP so contributor_audit.py passes for the salvaged commits in NousResearch#33482-followup PR.
WenhuaXia
added a commit
to WenhuaXia/hermes-agent
that referenced
this pull request
May 31, 2026
…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Salvages PR #32759 (@MoonRay305 / Squiddy of 2rook.ai) onto current
mainwith a manual conflict resolution that preserves the SQLite pragma hardening from #33482 (just landed) alongside this PR's new cross-process flock + busy_timeout knob. Authorship preserved on both contributor commits via cherry-pick.Closes #32759. Closes #32461 (alternative inter-process flock by @WenhuaXia, weaker design —
fcntlimport at module level breaks Windows; usesid(conn)as dict key with potential leaks). Closes #31965 (@steveonjava's flock-on-every-write_txn variant — see "Why this PR over the alternatives" below). Refs #31158.What this PR adds (defense-in-depth on top of #33482)
hermes_cli/kanban_db.py:_cross_process_init_lock(path)— context manager usingfcntl.flock(LOCK_EX)(Unix) andmsvcrt.locking(LK_LOCK)(Windows, added in commit 2). Wraps the first-connect setup phase (header validation + integrity probe + WAL pragma + schema init + additive migrations) in a process-level exclusive lock via a<db>.init.locksidecar file. Process-local_INIT_LOCK(the existingthreading.RLock) stays nested inside — they're complementary: threading lock for in-process races, file lock for cross-process races._resolve_busy_timeout_ms()+DEFAULT_BUSY_TIMEOUT_MS=120_000— replaces the previously-hardcodedtimeout=30andtimeout=5with an explicit 2-minute busy timeout. Configurable viaHERMES_KANBAN_BUSY_TIMEOUT_MSenv var, following the establishedHERMES_KANBAN_*env-var convention (CLAIM_TTL_SECONDS, CRASH_GRACE_SECONDS, etc.). Lets concurrent writers serialize via WAL rather than failing withdatabase is lockedduring bursts._sqlite_connect(path)— centralizedsqlite3.connectwrapper that applies the busy timeout consistently acrossconnect()and_guard_existing_db_is_healthy()probes (was inconsistent before: 5s vs 30s).Manual conflict resolution (what changed from the original PR)
The PR was authored against pre-#33482 code and a verbatim cherry-pick would have reverted the just-landed SQLite hardening pragmas. Specifically, the PR's
connect()re-write hardcodedsynchronous=NORMAL(the old baseline), missing all four pragmas we just added:The merge resolution keeps both contributions: the PR's new
_cross_process_init_lock+_sqlite_connect+ busy_timeout infrastructure AND #33482's pragma block (FULL, secure_delete, cell_size_check, wal_autocheckpoint). Verified via E2E test (see below).Why this PR over the alternatives
Three different open PRs proposed inter-process locking for the same problem class:
fcntlimport at module levelid(conn)dict key, leaks unless trackedwrite_txn(flock per-transaction)#32759 is the cleanest design. The WAL-init race is the actual scenario; once both processes are on WAL with
synchronous=FULL+wal_autocheckpoint=100,BEGIN IMMEDIATEshould not deadlock — SQLite serializes via WAL with the new busy_timeout knob. #31965's per-write flock would tax every kanban write without addressing the root cause that #33482 commit 8 (skip redundant WAL pragma on already-WAL connections) already fixed at the source.#33482 commit 8 closes the redundant-WAL-pragma path on already-WAL connections. #32759 covers the remaining gap: the first connection from each process still hits the WAL-init path, and two processes racing on a fresh DB can produce the mixed-journal-mode corruption pattern. #32759's flock serializes that first-open across processes.
Tests + E2E verification
Broader sweep (kanban + state + dispatcher functionality):
E2E (4 concurrent subprocesses against a fresh DB):
_cross_process_init_lockfires (<db>.init.locksidecar created)journal_mode=walsynchronous=2(FULL — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)secure_delete=1(ON — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)cell_size_check=1(ON — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)wal_autocheckpoint=100(fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)foreign_keys=1(ON)busy_timeout=60000(this PR'sHERMES_KANBAN_BUSY_TIMEOUT_MSenv var honored)ruff checkon all changed files: clean.Credit
Both substantive commits authored by @MoonRay305 (Squiddy of 2rook.ai) —
squiddy@2rook.aiidentity, preserved per-commit. Cherry-picked withgit cherry-pick(which preserves author); manual merge resolution applied to the pragma block. Follow-upchore(release)commit (AUTHOR_MAP entry) attributed to the salvager.