fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770
fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770WenhuaXia wants to merge 3 commits into
Conversation
…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file cannot be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.
Root cause: connect() runs DDL writes (executescript + migrations) under .init.lock while write_txn() holds .db.lock. These are DIFFERENT files, so a new worker's schema init can race with a dispatching gateway's DML writes, corrupting the WAL. Fix: Acquire .db.lock around the DDL write section in connect() so schema init and normal write_txn() calls use the same lock file and are properly serialized. This fixes the observed corruption pattern (7 DB corruptions in ~10 minutes during concurrent kanban worker dispatch).
|
I found two bugs in this PR that will cause runtime failures. Bug 1: Top-level
The existing codebase already handles this correctly: lines 1155 and 1169 in current # In connect() and write_txn(), use lazy import:
if _IS_WINDOWS:
_KANBAN_WRITE_LOCKS[id(conn)] = None
else:
import fcntl
...Bug 2: At the lock_fd = _KANBAN_WRITE_LOCKS.get(id(conn))
if lock_fd is not None:
fcntl.flock(lock_fd, fcntl.LOCK_EX) # WRONG: lock_fd is TextIO, not int
Fix: fcntl.flock(lock_fd.fileno(), fcntl.LOCK_EX)
# ...
fcntl.flock(lock_fd.fileno(), fcntl.LOCK_UN) |
|
Duplicate of #35787 which supersedes this with fuller root cause analysis and corruption timeline. |
Problem
Multiple gateway worker processes writing concurrently to the same kanban SQLite DB cause corruption (7 corruptions observed in 33 minutes during a single kanban pipeline run).
Root cause: Upstream PR #33696 only added
_cross_process_init_lockaround schema initialization. It explicitly rejected per-write flock with the rationale "SQLite serializes via WAL once busy_timeout is set". Real-world evidence proves this assumption wrong under concurrent multi-worker write pressure (task_events + task_runs + status transitions all within the same dispatcher tick).Users commit
d0bab9b4bon branchfix/kanban-process-write-lockhad the correct approach but was closed as duplicate of #33696 without merging.Fix
connect(): Open a per-DB.db.lockfile aftersqlite3.connect(), stored in_KANBAN_WRITE_LOCKS[id(conn)]write_txn(): Acquirefcntl.flock(LOCK_EX)beforeBEGIN IMMEDIATE, release infinallyblockfcntlis unavailable (Windows) or lock file cannot be opened, the dict entry isNoneand flock is skipped — no disruption to existing callersChanges
hermes_cli/kanban_db.py:import fcntl+_KANBAN_WRITE_LOCKSdictconnect(): lock fd creation after sqlite3.connect()write_txn(): flock/unlock around the entire transactionReferences