fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes by WenhuaXia · Pull Request #32461 · NousResearch/hermes-agent

WenhuaXia · 2026-05-26T07:05:33Z

Problem

Multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick).

Symptom: kanban.db becomes 0 bytes or corrupted, forcing automatic recreation and task loss.

Evidence: The _INIT_LOCK in connect() is a threading.Lock — effective for same-process threads but completely ineffective across independent worker processes.

Fix

Acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn():

connect(): Open a lock file (e.g. kanban.db.lock) alongside the DB, store fd in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref)
write_txn(): flock(LOCK_EX) → BEGIN IMMEDIATE → ... → flock(LOCK_UN) in finally

On platforms without fcntl (Windows) or when the lock file can't be opened, the lock is a graceful no-op.

Testing

All 165 existing kanban DB tests pass
Manually verified: DB no longer corrupts after dispatch with parallel workers

Changes

hermes_cli/kanban_db.py: +29 lines (fcntl import, _KANBAN_WRITE_LOCKS dict, lock in connect/write_txn)

…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN _IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file can't be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.

kshitijk4poor · 2026-05-28T06:29:37Z

Closing as duplicate — #32759 (@MoonRay305) covers the same inter-process-lock concern with a cleaner design and was merged via #33696. Comparison:

fix(kanban): harden SQLite connection concurrency #32759 wraps first-connect setup only (header check + integrity probe + WAL pragma + schema init) in a process-level flock. Locks once, lets normal write paths run lock-free under SQLite WAL.
This PR (fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes #32461) imports fcntl at module level (breaks Windows immediately on import) and uses id(conn) as a dict key in _KANBAN_WRITE_LOCKS — connections aren't tracked through their lifecycle, so the dict grows without cleanup; id() collisions on freed connections are also possible.

The fcntl-at-module-level Windows break would have been a hard regression on every Windows install. Thanks for tackling the same problem space — #32759's Windows variant added msvcrt.locking in a follow-up commit. Credit appreciated even when the merged design differs.

…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.

kshitijk4poor mentioned this pull request May 28, 2026

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) #33696

Merged

kshitijk4poor closed this in #33696 May 28, 2026

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 28, 2026

WenhuaXia mentioned this pull request May 31, 2026

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption #35770

Closed

WenhuaXia mentioned this pull request May 31, 2026

fix(kanban): add inter-process flock to prevent SQLite WAL corruption #35787

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461

fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes#32461
WenhuaXia wants to merge 1 commit into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock

WenhuaXia commented May 26, 2026

Uh oh!

kshitijk4poor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

WenhuaXia commented May 26, 2026

Problem

Fix

Testing

Changes

Uh oh!

kshitijk4poor commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants