Skip to content

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770

Closed
WenhuaXia wants to merge 3 commits into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock-v2
Closed

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption#35770
WenhuaXia wants to merge 3 commits into
NousResearch:mainfrom
WenhuaXia:fix/kanban-process-write-lock-v2

Conversation

@WenhuaXia

Copy link
Copy Markdown

Problem

Multiple gateway worker processes writing concurrently to the same kanban SQLite DB cause corruption (7 corruptions observed in 33 minutes during a single kanban pipeline run).

Root cause: Upstream PR #33696 only added _cross_process_init_lock around schema initialization. It explicitly rejected per-write flock with the rationale "SQLite serializes via WAL once busy_timeout is set". Real-world evidence proves this assumption wrong under concurrent multi-worker write pressure (task_events + task_runs + status transitions all within the same dispatcher tick).

Users commit d0bab9b4b on branch fix/kanban-process-write-lock had the correct approach but was closed as duplicate of #33696 without merging.

Fix

  1. connect(): Open a per-DB .db.lock file after sqlite3.connect(), stored in _KANBAN_WRITE_LOCKS[id(conn)]
  2. write_txn(): Acquire fcntl.flock(LOCK_EX) before BEGIN IMMEDIATE, release in finally block
  3. Graceful fallback: If fcntl is unavailable (Windows) or lock file cannot be opened, the dict entry is None and flock is skipped — no disruption to existing callers

Changes

  • hermes_cli/kanban_db.py:
    • Added import fcntl + _KANBAN_WRITE_LOCKS dict
    • connect(): lock fd creation after sqlite3.connect()
    • write_txn(): flock/unlock around the entire transaction

References

WenhuaXia added 3 commits May 31, 2026 13:55
…er concurrent writes

Root cause: multiple kanban worker processes (hermes -p <profile>) open
independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN
IMMEDIATE only protects within-process threads — concurrent writers from
different processes can still corrupt the WAL under heavy write pressure
(task_events + task_runs + status transitions within the same tick).

Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every
write_txn().  The lock fd is stored in a module-level dict keyed by
id(conn) (sqlite3.Connection is a C-extension type that rejects both
attribute assignment and weakref).  On platforms without fcntl (Windows)
or when the lock file cannot be opened, the lock is a graceful no-op.

This preserves the existing WAL + BEGIN IMMEDIATE strategy while
serialising cross-process writes, which is the common pattern for
SQLite-based multi-process workloads.
Root cause: connect() runs DDL writes (executescript + migrations)
under .init.lock while write_txn() holds .db.lock. These are DIFFERENT
files, so a new worker's schema init can race with a dispatching
gateway's DML writes, corrupting the WAL.

Fix: Acquire .db.lock around the DDL write section in connect() so
schema init and normal write_txn() calls use the same lock file and
are properly serialized.

This fixes the observed corruption pattern (7 DB corruptions in ~10
minutes during concurrent kanban worker dispatch).
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 31, 2026
@liuhao1024

Copy link
Copy Markdown
Contributor

I found two bugs in this PR that will cause runtime failures.

Bug 1: Top-level import fcntl crashes on Windows

hermes_cli/kanban_db.py line 73 adds import fcntl at module level. The fcntl module does not exist on Windows — this causes ModuleNotFoundError at import time, breaking the entire kanban module on all Windows deployments.

The existing codebase already handles this correctly: lines 1155 and 1169 in current main import fcntl locally inside function bodies. The PR should follow the same pattern:

# In connect() and write_txn(), use lazy import:
if _IS_WINDOWS:
    _KANBAN_WRITE_LOCKS[id(conn)] = None
else:
    import fcntl
    ...

Bug 2: write_txn() passes file object to fcntl.flock instead of file descriptor

At the write_txn hunk:

lock_fd = _KANBAN_WRITE_LOCKS.get(id(conn))
if lock_fd is not None:
    fcntl.flock(lock_fd, fcntl.LOCK_EX)  # WRONG: lock_fd is TextIO, not int

fcntl.flock() requires a file descriptor (int), not a file object. The connect() function correctly uses lock_handle.fileno(), but write_txn() passes the file object directly. This raises TypeError: argument must be an int, or have a fileno() method at runtime. Same issue on the unlock path.

Fix:

fcntl.flock(lock_fd.fileno(), fcntl.LOCK_EX)
# ...
fcntl.flock(lock_fd.fileno(), fcntl.LOCK_UN)

@WenhuaXia

Copy link
Copy Markdown
Author

Duplicate of #35787 which supersedes this with fuller root cause analysis and corruption timeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants