fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption by WenhuaXia · Pull Request #35770 · NousResearch/hermes-agent

WenhuaXia · 2026-05-31T07:50:37Z

Problem

Multiple gateway worker processes writing concurrently to the same kanban SQLite DB cause corruption (7 corruptions observed in 33 minutes during a single kanban pipeline run).

Root cause: Upstream PR #33696 only added _cross_process_init_lock around schema initialization. It explicitly rejected per-write flock with the rationale "SQLite serializes via WAL once busy_timeout is set". Real-world evidence proves this assumption wrong under concurrent multi-worker write pressure (task_events + task_runs + status transitions all within the same dispatcher tick).

Users commit d0bab9b4b on branch fix/kanban-process-write-lock had the correct approach but was closed as duplicate of #33696 without merging.

Fix

connect(): Open a per-DB .db.lock file after sqlite3.connect(), stored in _KANBAN_WRITE_LOCKS[id(conn)]
write_txn(): Acquire fcntl.flock(LOCK_EX) before BEGIN IMMEDIATE, release in finally block
Graceful fallback: If fcntl is unavailable (Windows) or lock file cannot be opened, the dict entry is None and flock is skipped — no disruption to existing callers

Changes

hermes_cli/kanban_db.py:
- Added import fcntl + _KANBAN_WRITE_LOCKS dict
- connect(): lock fd creation after sqlite3.connect()
- write_txn(): flock/unlock around the entire transaction

References

User PR fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes #32461 (closed as duplicate): per-write flock fix
Upstream PR fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) #33696 (merged): init-only lock, missing write_txn protection
Corruption timeline: 2026-05-31 10:41-11:14, 7 corruptions in 33 min

…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file cannot be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.

Root cause: connect() runs DDL writes (executescript + migrations) under .init.lock while write_txn() holds .db.lock. These are DIFFERENT files, so a new worker's schema init can race with a dispatching gateway's DML writes, corrupting the WAL. Fix: Acquire .db.lock around the DDL write section in connect() so schema init and normal write_txn() calls use the same lock file and are properly serialized. This fixes the observed corruption pattern (7 DB corruptions in ~10 minutes during concurrent kanban worker dispatch).

liuhao1024 · 2026-05-31T08:24:36Z

I found two bugs in this PR that will cause runtime failures.

Bug 1: Top-level import fcntl crashes on Windows

hermes_cli/kanban_db.py line 73 adds import fcntl at module level. The fcntl module does not exist on Windows — this causes ModuleNotFoundError at import time, breaking the entire kanban module on all Windows deployments.

The existing codebase already handles this correctly: lines 1155 and 1169 in current main import fcntl locally inside function bodies. The PR should follow the same pattern:

# In connect() and write_txn(), use lazy import:
if _IS_WINDOWS:
    _KANBAN_WRITE_LOCKS[id(conn)] = None
else:
    import fcntl
    ...

Bug 2: write_txn() passes file object to fcntl.flock instead of file descriptor

At the write_txn hunk:

lock_fd = _KANBAN_WRITE_LOCKS.get(id(conn))
if lock_fd is not None:
    fcntl.flock(lock_fd, fcntl.LOCK_EX)  # WRONG: lock_fd is TextIO, not int

fcntl.flock() requires a file descriptor (int), not a file object. The connect() function correctly uses lock_handle.fileno(), but write_txn() passes the file object directly. This raises TypeError: argument must be an int, or have a fileno() method at runtime. Same issue on the unlock path.

Fix:

fcntl.flock(lock_fd.fileno(), fcntl.LOCK_EX)
# ...
fcntl.flock(lock_fd.fileno(), fcntl.LOCK_UN)

WenhuaXia · 2026-05-31T09:32:47Z

Duplicate of #35787 which supersedes this with fuller root cause analysis and corruption timeline.

WenhuaXia added 3 commits May 31, 2026 13:55

fix(kanban): narrow _KANBAN_WRITE_LOCKS type to fix Pyright diagnostics

89fb99f

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 31, 2026

alt-glitch mentioned this pull request May 31, 2026

fix(kanban): add inter-process flock to prevent SQLite WAL corruption #35787

Closed

WenhuaXia closed this May 31, 2026

hanzckernel mentioned this pull request Jun 2, 2026

fix(kanban): serialize cross-process SQLite writes #37350

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770
WenhuaXia wants to merge 3 commits into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock-v2

WenhuaXia commented May 31, 2026

Uh oh!

liuhao1024 commented May 31, 2026

Uh oh!

WenhuaXia commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

WenhuaXia commented May 31, 2026

Problem

Fix

Changes

References

Uh oh!

liuhao1024 commented May 31, 2026

Uh oh!

WenhuaXia commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants