fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461
Closed
WenhuaXia wants to merge 1 commit into
Closed
fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461WenhuaXia wants to merge 1 commit into
WenhuaXia wants to merge 1 commit into
Conversation
…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN _IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file can't be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.
Collaborator
|
Closing as duplicate — #32759 (@MoonRay305) covers the same inter-process-lock concern with a cleaner design and was merged via #33696. Comparison:
The fcntl-at-module-level Windows break would have been a hard regression on every Windows install. Thanks for tackling the same problem space — #32759's Windows variant added |
WenhuaXia
added a commit
to WenhuaXia/hermes-agent
that referenced
this pull request
May 31, 2026
…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Multiple kanban worker processes (
hermes -p <profile>) open independent sqlite3 connections to the samekanban.db. SQLite WAL +BEGIN IMMEDIATEonly protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick).Symptom:
kanban.dbbecomes 0 bytes or corrupted, forcing automatic recreation and task loss.Evidence: The
_INIT_LOCKinconnect()is athreading.Lock— effective for same-process threads but completely ineffective across independent worker processes.Fix
Acquire an
fcntl.LOCK_EXon a per-DB.db.lockfile around everywrite_txn():connect(): Open a lock file (e.g.kanban.db.lock) alongside the DB, store fd in a module-level dict keyed byid(conn)(sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref)write_txn():flock(LOCK_EX)→BEGIN IMMEDIATE→ ... →flock(LOCK_UN)infinallyOn platforms without fcntl (Windows) or when the lock file can't be opened, the lock is a graceful no-op.
Testing
Changes
hermes_cli/kanban_db.py: +29 lines (fcntl import, _KANBAN_WRITE_LOCKS dict, lock in connect/write_txn)