Skip to content

fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461

Closed
WenhuaXia wants to merge 1 commit into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock
Closed

fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461
WenhuaXia wants to merge 1 commit into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock

Conversation

@WenhuaXia

Copy link
Copy Markdown

Problem

Multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick).

Symptom: kanban.db becomes 0 bytes or corrupted, forcing automatic recreation and task loss.

Evidence: The _INIT_LOCK in connect() is a threading.Lock — effective for same-process threads but completely ineffective across independent worker processes.

Fix

Acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn():

  1. connect(): Open a lock file (e.g. kanban.db.lock) alongside the DB, store fd in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref)
  2. write_txn(): flock(LOCK_EX)BEGIN IMMEDIATE → ... → flock(LOCK_UN) in finally

On platforms without fcntl (Windows) or when the lock file can't be opened, the lock is a graceful no-op.

Testing

  • All 165 existing kanban DB tests pass
  • Manually verified: DB no longer corrupts after dispatch with parallel workers

Changes

  • hermes_cli/kanban_db.py: +29 lines (fcntl import, _KANBAN_WRITE_LOCKS dict, lock in connect/write_txn)

…er concurrent writes

Root cause: multiple kanban worker processes (hermes -p <profile>) open
independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN
_IMMEDIATE only protects within-process threads — concurrent writers from
different processes can still corrupt the WAL under heavy write pressure
(task_events + task_runs + status transitions within the same tick).

Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every
write_txn().  The lock fd is stored in a module-level dict keyed by
id(conn) (sqlite3.Connection is a C-extension type that rejects both
attribute assignment and weakref).  On platforms without fcntl (Windows)
or when the lock file can't be opened, the lock is a graceful no-op.

This preserves the existing WAL + BEGIN IMMEDIATE strategy while
serialising cross-process writes, which is the common pattern for
SQLite-based multi-process workloads.
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Closing as duplicate — #32759 (@MoonRay305) covers the same inter-process-lock concern with a cleaner design and was merged via #33696. Comparison:

The fcntl-at-module-level Windows break would have been a hard regression on every Windows install. Thanks for tackling the same problem space — #32759's Windows variant added msvcrt.locking in a follow-up commit. Credit appreciated even when the merged design differs.

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 28, 2026
WenhuaXia added a commit to WenhuaXia/hermes-agent that referenced this pull request May 31, 2026
…on under concurrent workers

Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected.
Under concurrent multi-worker writes (task_events + task_runs + status transitions
all within the same gateway tick), BEGIN IMMEDIATE transactions from different
processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min.

Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call.
DDL migrations in connect() also use the same lock file, so schema init
cannot collide with in-flight DML from other processes.

Graceful no-op on platforms without fcntl (Windows) or when lock open fails.

Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but
NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption
was proven wrong by real-world corruption data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants