Skip to content

fix(kanban): 看板跨进程写锁防止 SQLite WAL 损坏 / add inter-process flock#37344

Closed
WenhuaXia wants to merge 2 commits into
NousResearch:mainfrom
WenhuaXia:fix/kanban-flock
Closed

fix(kanban): 看板跨进程写锁防止 SQLite WAL 损坏 / add inter-process flock#37344
WenhuaXia wants to merge 2 commits into
NousResearch:mainfrom
WenhuaXia:fix/kanban-flock

Conversation

@WenhuaXia

Copy link
Copy Markdown

看板 SQLite 跨进程写锁 / Kanban Cross-Process Write Lock

问题 / Problem

多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁 WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。

Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption (7 times in 33 minutes). SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages.

修复 / Fix

  1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务,确保同一时间只有一个进程在写。

  2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。

  3. Windows 或 fcntl 不可用时优雅降级为 no-op。

  4. After connect(), open .db.lock. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction.

  5. Schema init also acquires .db.lock to prevent DDL/DML races.

  6. Graceful fallback on Windows or when fcntl is unavailable.

Changes

  • hermes_cli/kanban_db.py: fcntl flock for inter-process write safety

References

@liuhao1024

Copy link
Copy Markdown
Contributor

I found one issue that looks worth fixing before merge.

FD leak: _KANBAN_WRITE_LOCKS entries are never cleaned up

connect() opens a lock file handle and stores it in _KANBAN_WRITE_LOCKS[id(conn)], but no code path ever closes the handle or removes the entry:

  • connect_closing() calls conn.close() but doesn't touch _KANBAN_WRITE_LOCKS — the lock file FD stays open
  • connect()'s exception handler calls conn.close() + raise — same leak
  • There's no close() / disconnect() helper that cleans up the dict

Since connect_closing() was introduced specifically to fix FD exhaustion (#33159 — "Too many open files"), this reintroduces the same class of bug: one leaked FD per kanban operation.

Suggested fix — add cleanup to both paths:

# In connect_closing(), before conn.close():
lock_handle = _KANBAN_WRITE_LOCKS.pop(id(conn), None)
if lock_handle is not None:
    try:
        lock_handle.close()
    except OSError:
        pass

# In connect()'s exception handler, before conn.close():
lock_handle = _KANBAN_WRITE_LOCKS.pop(id(conn), None)
if lock_handle is not None:
    try:
        lock_handle.close()
    except OSError:
        pass

Secondary concernid(conn) as dict key: if a connection is GC'd and a new sqlite3.Connection reuses the same memory address, it would get the stale lock handle. Since connect_closing() immediately calls conn.close() (releasing the C-level object), address reuse is plausible in long-lived processes with many short-lived connections. Using weakref.ref(conn) or the resolved path string as key would be more robust.

@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have labels Jun 2, 2026
@WenhuaXia

Copy link
Copy Markdown
Author

两个问题都收到了,已经修复:

  1. FD leak — 添加了 _kanban_release_write_lock(db_path) 函数,在 connect() 的 exception handler 和 connect_closing() 的 finally block 中都调用了它。

  2. id(conn) 作为 key — 已改为使用 resolved DB path (str) 作为 key,避免了 GC 后地址复用的问题。

@WenhuaXia

Copy link
Copy Markdown
Author

Thank you @liuhao1024 for the thorough review! Both issues are now fixed:

  1. FD leak — Added _kanban_release_write_lock(db_path, conn) helper. It is now called from connect()'s exception handler and connect_closing()'s finally block, ensuring the lock file handle is always closed and the dict entry removed.

  2. id(conn) as key — Replaced with resolved DB path (str) as the key for _KANBAN_WRITE_LOCKS. Added a reverse mapping _KANBAN_CONN_PATHS: dict[int, str] so write_txn() can still look up the per-DB lock handle from a sqlite3.Connection without needing the path passed in explicitly.

This should fully prevent the #33159 class of FD leak bug from re-introducing itself.

WenhuaXia added 2 commits June 6, 2026 22:31
…vent DB corruption

问题 / Problem: 多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁
WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的
并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。

Problem: Multiple kanban worker processes writing concurrently to the same
SQLite database caused frequent WAL corruption (7 times in 33 minutes).
SQLite WAL + busy_timeout only protects concurrent threads within the same
process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages.

修复 / Fix:
1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX)
   包裹整个事务,确保同一时间只有一个进程在写。
2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。
3. Windows 或 fcntl 不可用时优雅降级为 no-op。

1. After connect(), open a .db.lock file. Every write_txn() acquires
   fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one
   process writes at a time.
2. Schema init also acquires .db.lock (not just .init.lock) to prevent
   DDL/DML races.
3. Graceful fallback on Windows or when fcntl is unavailable.

Changes:
- hermes_cli/kanban_db.py: fcntl flock for inter-process write safety

Related: NousResearch#35787 (closed), NousResearch#37292 (closed, split)
Fixes from @liuhao1024 review on NousResearch#37344:

1. FD leak: _KANBAN_WRITE_LOCKS entries were never cleaned up.
   Added _kanban_release_write_lock() called from connect() exception
   handler and connect_closing() finally block.

2. id(conn) as dict key: if a connection is GC'd and a new
   sqlite3.Connection reuses the same memory address, it would get
   the stale lock handle. Replaced id(conn) key with resolved DB
   path (str) for _KANBAN_WRITE_LOCKS. Added _KANBAN_CONN_PATHS
   reverse mapping so write_txn() can still look up the per-DB lock
   from a sqlite3.Connection alone.

Thanks to @liuhao1024 for the thorough review.
@WenhuaXia

Copy link
Copy Markdown
Author

已在上游 main 中合并,关闭此 PR。

@WenhuaXia WenhuaXia closed this Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants