fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787
fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787WenhuaXia wants to merge 6 commits into
Conversation
…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file cannot be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.
Root cause: connect() runs DDL writes (executescript + migrations) under .init.lock while write_txn() holds .db.lock. These are DIFFERENT files, so a new worker's schema init can race with a dispatching gateway's DML writes, corrupting the WAL. Fix: Acquire .db.lock around the DDL write section in connect() so schema init and normal write_txn() calls use the same lock file and are properly serialized. This fixes the observed corruption pattern (7 DB corruptions in ~10 minutes during concurrent kanban worker dispatch).
…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.
Hermes Agent ReviewThanks for the WAL-corruption hardening work. I found one blocking portability issue before this can merge: Blocking
Evidence from a Windows venv against this PR worktree: Suggested fix: make No formal review submitted; posting as a focused comment from the hourly commander. |
|
Duplicate of #35770 — both PRs add |
Without this guard, connect() unconditionally opened .db.lock even when fcntl = None. Although downstream flock calls are gated by 'if lock_handle is not None', opening the file was unnecessary I/O and could fail on restricted filesystems. Now we set the dict entry to None upfront when fcntl is None, matching the existing graceful-no-op intent.
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved ✅
🔍 What this does
Adds fcntl.flock(LOCK_EX) around all write transactions (write_txn()) and DDL initialization (connect() → schema/migration) in the kanban database layer, using a per-DB .db.lock file. This prevents SQLite WAL corruption when multiple worker processes write concurrently — a race condition that BEGIN IMMEDIATE + busy_timeout cannot fix across processes.
✅ Looks Good
- Root cause analysis: Excellent PR description with clear race-condition diagrams. The bug is real — SQLite WAL mode protects concurrent threads within one process but NOT concurrent writers from different processes.
fcntl.flockis the standard fix. - Correctness:
- Lock opened in
connect()withpath.with_suffix('.db.lock')— deterministic, no collisions. - Lock acquired BEFORE
BEGIN IMMEDIATEinwrite_txn()(ensures serialization before SQLite begins its write transaction). - Lock released in
finallyblock — guaranteed cleanup even on exceptions. - DDL init (schema + migrations) also under
.db.lock— prevents the Race 2 described in the PR.
- Lock opened in
- Edge cases:
fcntlimport wrapped intry/except ImportError— Windows gracefully skips locking.- Lock open wrapped in
try/except OSError— network/NFS failures are no-ops. _KANBAN_WRITE_LOCKS[id(conn)] = Noneas fallback — clean dict entries, no KeyError.
- Existing safety:
_check_file_length_invariantruns after commit but before thefinallyunlocks, preserving its corruption-detection role. - No deadlock risk:
LOCK_EXon the.db.lockfile.fcntl.flockis NOT recursive — a secondLOCK_EXfrom the same process is a no-op (same fd). This is safe because each process has oneconnect()call, one lock fd.
Reviewed by Hermes Agent (cron)
Standalone fallback sends msgtype: "text" which doesn't render markdown tables. The live DingTalk adapter sends msgtype: "markdown" which works. Align the two paths so cron job delivery renders tables correctly. Also adds 20K character truncation (matching live adapter limit) and fallback to plain text if the API rejects the markdown payload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Closing duplicate — cleaner version is in a new PR (will be created). |
…+ DingTalk markdown webhook 看板 SQLite 跨进程写锁(inter-process flock): Kanban SQLite cross-process write lock (inter-process flock): 问题:多个 kanban worker 进程并发写入同一 SQLite 数据库导致 WAL 损坏。SQLite WAL + busy_timeout 只能保护同一进程内的 并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。 Problem: Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption. SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages. 修复: Fix: 1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务,确保同一时间只有一个进程在写。 2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。 3. Windows 或 fcntl 不可用时优雅降级为 no-op。 1. After connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time. 2. Schema init also acquires .db.lock (not just .init.lock) to prevent DDL/DML races. 3. Graceful fallback on Windows or when fcntl is unavailable. 钉钉 webhook Markdown 渲染: DingTalk webhook Markdown rendering: 问题:standalone cron delivery 用 msgtype:"text" 发钉钉消息,markdown 表格中的管道符显示为文字。live adapter 用 msgtype:"markdown" 正确渲染。 Problem: Standalone cron delivery sent msgtype:"text" for DingTalk, which does not render markdown tables — pipes show as literal characters. The live DingTalk adapter sends msgtype:"markdown" which renders tables correctly. 修复:统一使用 msgtype:"markdown",被拒绝时 fallback 到纯文本。 Fix: Use msgtype:"markdown" consistently with fallback to plain text if the API rejects the markdown payload. Changes: - hermes_cli/kanban_db.py: fcntl flock for inter-process write safety - tools/send_message_tool.py: markdown msgtype for DingTalk standalone delivery Related: NousResearch#35787 (closed), NousResearch#34400 (closed)
…vent DB corruption 问题 / Problem: 多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁 WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的 并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。 Problem: Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption (7 times in 33 minutes). SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages. 修复 / Fix: 1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务,确保同一时间只有一个进程在写。 2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。 3. Windows 或 fcntl 不可用时优雅降级为 no-op。 1. After connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time. 2. Schema init also acquires .db.lock (not just .init.lock) to prevent DDL/DML races. 3. Graceful fallback on Windows or when fcntl is unavailable. Changes: - hermes_cli/kanban_db.py: fcntl flock for inter-process write safety Related: NousResearch#35787 (closed), NousResearch#37292 (closed, split)
Problem
Multiple kanban worker processes (5+ gateway instances running
hermes -p <profile>) writing concurrently to the same SQLite database cause frequent WAL corruption.Observed Evidence
7 corruptions in 33 minutes during a single kanban pipeline run on 2026-05-31:
Error:
SQL logic error: wrong # of entries in index idx_runs_statusRoot Cause Analysis
Upstream PR #33696 added
_cross_process_init_lock()around schema initialization and explicitly rejected per-write flock with the rationale: "SQLite serializes via WAL once busy_timeout is set."This assumption is wrong. SQLite WAL + busy_timeout only protects concurrent threads within the same process. When multiple worker processes each open their own
sqlite3.Connection, they can race throughBEGIN IMMEDIATEconcurrently, corrupting the WAL.The corruption happens through two race conditions:
Race 1: Concurrent
write_txn()from different processesEach
write_txn()contains 3-5 write statements (claim + events + runs + status). When two processes execute these concurrently, SQLite's WAL file can end up with torn pages or mismatched index entries.Race 2: DDL schema init collides with in-flight DML
A new worker calling
connect()runsexecutescript(SCHEMA_SQL)and_migrate_add_optional_columns()under.init.lockwhile other workers are mid-write_txn()using.db.lock. These are two different lock files — no mutual exclusion. DDL (ALTER TABLE) while another process is mid-BEGIN IMMEDIATE→ corruption.Fix
Per-write flock: After
sqlite3.connect()inconnect(), open a.db.lockfile. Everywrite_txn()acquiresfcntl.flock(LOCK_EX)around the entire transaction, ensuring only one process writes at a time.DDL init under
.db.lock: Schema initialization also acquires.db.lock(not just.init.lock) so it cannot race with in-flight DML.Graceful fallback: On platforms without
fcntl(Windows) or when lock open fails, the dict entry isNoneand flock is skipped.Changes
hermes_cli/kanban_db.py: Addedimport fcntl+_KANBAN_WRITE_LOCKSdict, flock inconnect()andwrite_txn()References