fix(kanban): 看板跨进程写锁防止 SQLite WAL 损坏 / add inter-process flock#37344
fix(kanban): 看板跨进程写锁防止 SQLite WAL 损坏 / add inter-process flock#37344WenhuaXia wants to merge 2 commits into
Conversation
|
I found one issue that looks worth fixing before merge. FD leak:
Since Suggested fix — add cleanup to both paths: # In connect_closing(), before conn.close():
lock_handle = _KANBAN_WRITE_LOCKS.pop(id(conn), None)
if lock_handle is not None:
try:
lock_handle.close()
except OSError:
pass
# In connect()'s exception handler, before conn.close():
lock_handle = _KANBAN_WRITE_LOCKS.pop(id(conn), None)
if lock_handle is not None:
try:
lock_handle.close()
except OSError:
passSecondary concern — |
|
两个问题都收到了,已经修复:
|
|
Thank you @liuhao1024 for the thorough review! Both issues are now fixed:
This should fully prevent the #33159 class of FD leak bug from re-introducing itself. |
…vent DB corruption 问题 / Problem: 多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁 WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的 并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。 Problem: Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption (7 times in 33 minutes). SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages. 修复 / Fix: 1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务,确保同一时间只有一个进程在写。 2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。 3. Windows 或 fcntl 不可用时优雅降级为 no-op。 1. After connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time. 2. Schema init also acquires .db.lock (not just .init.lock) to prevent DDL/DML races. 3. Graceful fallback on Windows or when fcntl is unavailable. Changes: - hermes_cli/kanban_db.py: fcntl flock for inter-process write safety Related: NousResearch#35787 (closed), NousResearch#37292 (closed, split)
Fixes from @liuhao1024 review on NousResearch#37344: 1. FD leak: _KANBAN_WRITE_LOCKS entries were never cleaned up. Added _kanban_release_write_lock() called from connect() exception handler and connect_closing() finally block. 2. id(conn) as dict key: if a connection is GC'd and a new sqlite3.Connection reuses the same memory address, it would get the stale lock handle. Replaced id(conn) key with resolved DB path (str) for _KANBAN_WRITE_LOCKS. Added _KANBAN_CONN_PATHS reverse mapping so write_txn() can still look up the per-DB lock from a sqlite3.Connection alone. Thanks to @liuhao1024 for the thorough review.
f43f247 to
822c1d7
Compare
|
已在上游 main 中合并,关闭此 PR。 |
看板 SQLite 跨进程写锁 / Kanban Cross-Process Write Lock
问题 / Problem
多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁 WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。
Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption (7 times in 33 minutes). SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages.
修复 / Fix
connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务,确保同一时间只有一个进程在写。
schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。
Windows 或 fcntl 不可用时优雅降级为 no-op。
After connect(), open .db.lock. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction.
Schema init also acquires .db.lock to prevent DDL/DML races.
Graceful fallback on Windows or when fcntl is unavailable.
Changes
References