fix(kanban): add inter-process flock to prevent SQLite WAL corruption by WenhuaXia · Pull Request #35787 · NousResearch/hermes-agent

WenhuaXia · 2026-05-31T08:59:38Z

Problem

Multiple kanban worker processes (5+ gateway instances running hermes -p <profile>) writing concurrently to the same SQLite database cause frequent WAL corruption.

Observed Evidence

7 corruptions in 33 minutes during a single kanban pipeline run on 2026-05-31:

Timestamp	DB Size	Corruption Token
10:41:34	131,072	5b2e1f71
10:42:19	147,456	7298fd2c
10:43:36	143,360	10a46b2f
10:44:19	143,360	ab537cb0
10:53:19	147,456	ef10fae0
10:58:36	147,456	26ad84f1
11:14:37	139,264	e6b5ed72

Error: SQL logic error: wrong # of entries in index idx_runs_status

Root Cause Analysis

Upstream PR #33696 added _cross_process_init_lock() around schema initialization and explicitly rejected per-write flock with the rationale: "SQLite serializes via WAL once busy_timeout is set."

This assumption is wrong. SQLite WAL + busy_timeout only protects concurrent threads within the same process. When multiple worker processes each open their own sqlite3.Connection, they can race through BEGIN IMMEDIATE concurrently, corrupting the WAL.

The corruption happens through two race conditions:

Race 1: Concurrent write_txn() from different processes

Process A (drafter)          Process B (legal-reviewer)
┌─────────────────────────────────────────────────────┐
│ write_txn() → BEGIN IMMEDIATE                       │
│   UPDATE tasks SET claim_lock='...'                 │
│   INSERT INTO task_events (...)                     │
│   INSERT INTO task_runs (...)                       │
│   COMMIT                                            │
└─────────────────────────────────────────────────────┘
                  ↓ WAL pages race ←
┌─────────────────────────────────────────────────────┐
│ write_txn() → BEGIN IMMEDIATE                       │
│   UPDATE tasks SET claim_lock='...'                 │
│   INSERT INTO task_events (...)                     │
│   INSERT INTO task_runs (...)                       │
│   COMMIT                                            │
└─────────────────────────────────────────────────────┘

Each write_txn() contains 3-5 write statements (claim + events + runs + status). When two processes execute these concurrently, SQLite's WAL file can end up with torn pages or mismatched index entries.

Race 2: DDL schema init collides with in-flight DML

A new worker calling connect() runs executescript(SCHEMA_SQL) and _migrate_add_optional_columns() under .init.lock while other workers are mid-write_txn() using .db.lock. These are two different lock files — no mutual exclusion. DDL (ALTER TABLE) while another process is mid-BEGIN IMMEDIATE → corruption.

Fix

Per-write flock: After sqlite3.connect() in connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time.
DDL init under .db.lock: Schema initialization also acquires .db.lock (not just .init.lock) so it cannot race with in-flight DML.
Graceful fallback: On platforms without fcntl (Windows) or when lock open fails, the dict entry is None and flock is skipped.

Changes

hermes_cli/kanban_db.py: Added import fcntl + _KANBAN_WRITE_LOCKS dict, flock in connect() and write_txn()

References

PR fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes #32461 (closed as duplicate): Original per-write flock by @WenhuaXia, closed as dup of fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) #33696
PR fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) #33696 (merged): Init-only lock — proven insufficient by 7 corruptions in 33 min
Contract-team board (isolated DB): 0 corruptions during same period

…er concurrent writes Root cause: multiple kanban worker processes (hermes -p <profile>) open independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN IMMEDIATE only protects within-process threads — concurrent writers from different processes can still corrupt the WAL under heavy write pressure (task_events + task_runs + status transitions within the same tick). Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every write_txn(). The lock fd is stored in a module-level dict keyed by id(conn) (sqlite3.Connection is a C-extension type that rejects both attribute assignment and weakref). On platforms without fcntl (Windows) or when the lock file cannot be opened, the lock is a graceful no-op. This preserves the existing WAL + BEGIN IMMEDIATE strategy while serialising cross-process writes, which is the common pattern for SQLite-based multi-process workloads.

Root cause: connect() runs DDL writes (executescript + migrations) under .init.lock while write_txn() holds .db.lock. These are DIFFERENT files, so a new worker's schema init can race with a dispatching gateway's DML writes, corrupting the WAL. Fix: Acquire .db.lock around the DDL write section in connect() so schema init and normal write_txn() calls use the same lock file and are properly serialized. This fixes the observed corruption pattern (7 DB corruptions in ~10 minutes during concurrent kanban worker dispatch).

…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.

rodriguez46p-ui · 2026-05-31T09:07:53Z

Hermes Agent Review

Thanks for the WAL-corruption hardening work. I found one blocking portability issue before this can merge:

Blocking

hermes_cli/kanban_db.py now imports fcntl at module import time. On Windows, fcntl is unavailable, so any code path that imports kanban fails immediately before the intended graceful no-op can run.

Evidence from a Windows venv against this PR worktree:

PYTHONPATH=C:\Users\yolop\AppData\Local\Temp\hermes-pr-35787-review C:\Users\yolop\.hermes\hermes-agent\venv\Scripts\python.exe -c "import hermes_cli.kanban_db"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "...\hermes_cli\kanban_db.py", line 74, in <module>
    import fcntl
ModuleNotFoundError: No module named 'fcntl'

Suggested fix: make fcntl optional at import time, e.g. try: import fcntl except ImportError: fcntl = None, and guard every fcntl.flock(...) call with both lock_handle is not None and fcntl is not None (or route through a small helper). That would match the PR description's “graceful no-op on platforms without fcntl.”

No formal review submitted; posting as a focused comment from the hourly commander.

alt-glitch · 2026-05-31T09:15:03Z

Duplicate of #35770 — both PRs add fcntl.flock around write_txn() in kanban_db.py to prevent SQLite WAL corruption under concurrent multi-worker writes. Same approach, same file.

Without this guard, connect() unconditionally opened .db.lock even when fcntl = None. Although downstream flock calls are gated by 'if lock_handle is not None', opening the file was unnecessary I/O and could fail on restricted filesystems. Now we set the dict entry to None upfront when fcntl is None, matching the existing graceful-no-op intent.

tonydwb

Code Review Summary

Verdict: Approved ✅

🔍 What this does

Adds fcntl.flock(LOCK_EX) around all write transactions (write_txn()) and DDL initialization (connect() → schema/migration) in the kanban database layer, using a per-DB .db.lock file. This prevents SQLite WAL corruption when multiple worker processes write concurrently — a race condition that BEGIN IMMEDIATE + busy_timeout cannot fix across processes.

✅ Looks Good

Root cause analysis: Excellent PR description with clear race-condition diagrams. The bug is real — SQLite WAL mode protects concurrent threads within one process but NOT concurrent writers from different processes. fcntl.flock is the standard fix.
Correctness:
- Lock opened in connect() with path.with_suffix('.db.lock') — deterministic, no collisions.
- Lock acquired BEFORE BEGIN IMMEDIATE in write_txn() (ensures serialization before SQLite begins its write transaction).
- Lock released in finally block — guaranteed cleanup even on exceptions.
- DDL init (schema + migrations) also under .db.lock — prevents the Race 2 described in the PR.
Edge cases:
- fcntl import wrapped in try/except ImportError — Windows gracefully skips locking.
- Lock open wrapped in try/except OSError — network/NFS failures are no-ops.
- _KANBAN_WRITE_LOCKS[id(conn)] = None as fallback — clean dict entries, no KeyError.
Existing safety: _check_file_length_invariant runs after commit but before the finally unlocks, preserving its corruption-detection role.
No deadlock risk: LOCK_EX on the .db.lock file. fcntl.flock is NOT recursive — a second LOCK_EX from the same process is a no-op (same fd). This is safe because each process has one connect() call, one lock fd.

Reviewed by Hermes Agent (cron)

Standalone fallback sends msgtype: "text" which doesn't render markdown tables. The live DingTalk adapter sends msgtype: "markdown" which works. Align the two paths so cron job delivery renders tables correctly. Also adds 20K character truncation (matching live adapter limit) and fallback to plain text if the API rejects the markdown payload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

WenhuaXia · 2026-06-02T08:04:13Z

Closing duplicate — cleaner version is in a new PR (will be created).

…+ DingTalk markdown webhook 看板 SQLite 跨进程写锁（inter-process flock）： Kanban SQLite cross-process write lock (inter-process flock): 问题：多个 kanban worker 进程并发写入同一 SQLite 数据库导致 WAL 损坏。SQLite WAL + busy_timeout 只能保护同一进程内的并发线程，跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。 Problem: Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption. SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages. 修复： Fix: 1. connect() 后打开 .db.lock 文件，write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务，确保同一时间只有一个进程在写。 2. schema init 也获取 .db.lock（不只是 .init.lock），防止 DDL/DML 竞争。 3. Windows 或 fcntl 不可用时优雅降级为 no-op。 1. After connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time. 2. Schema init also acquires .db.lock (not just .init.lock) to prevent DDL/DML races. 3. Graceful fallback on Windows or when fcntl is unavailable. 钉钉 webhook Markdown 渲染： DingTalk webhook Markdown rendering: 问题：standalone cron delivery 用 msgtype:"text" 发钉钉消息，markdown 表格中的管道符显示为文字。live adapter 用 msgtype:"markdown" 正确渲染。 Problem: Standalone cron delivery sent msgtype:"text" for DingTalk, which does not render markdown tables — pipes show as literal characters. The live DingTalk adapter sends msgtype:"markdown" which renders tables correctly. 修复：统一使用 msgtype:"markdown"，被拒绝时 fallback 到纯文本。 Fix: Use msgtype:"markdown" consistently with fallback to plain text if the API rejects the markdown payload. Changes: - hermes_cli/kanban_db.py: fcntl flock for inter-process write safety - tools/send_message_tool.py: markdown msgtype for DingTalk standalone delivery Related: NousResearch#35787 (closed), NousResearch#34400 (closed)

…vent DB corruption 问题 / Problem: 多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁 WAL 损坏（33 分钟内 7 次）。SQLite WAL + busy_timeout 只能保护同一进程内的并发线程，跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。 Problem: Multiple kanban worker processes writing concurrently to the same SQLite database caused frequent WAL corruption (7 times in 33 minutes). SQLite WAL + busy_timeout only protects concurrent threads within the same process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages. 修复 / Fix: 1. connect() 后打开 .db.lock 文件，write_txn() 用 fcntl.flock(LOCK_EX) 包裹整个事务，确保同一时间只有一个进程在写。 2. schema init 也获取 .db.lock（不只是 .init.lock），防止 DDL/DML 竞争。 3. Windows 或 fcntl 不可用时优雅降级为 no-op。 1. After connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time. 2. Schema init also acquires .db.lock (not just .init.lock) to prevent DDL/DML races. 3. Graceful fallback on Windows or when fcntl is unavailable. Changes: - hermes_cli/kanban_db.py: fcntl flock for inter-process write safety Related: NousResearch#35787 (closed), NousResearch#37292 (closed, split)

WenhuaXia added 4 commits May 31, 2026 16:49

fix(kanban): narrow _KANBAN_WRITE_LOCKS type to fix Pyright diagnostics

c6427cf

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 31, 2026

WenhuaXia mentioned this pull request May 31, 2026

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption #35770

Closed

tonydwb approved these changes May 31, 2026

View reviewed changes

WenhuaXia closed this Jun 2, 2026

WenhuaXia mentioned this pull request Jun 2, 2026

fix(kanban,send_message): 看板进程锁 + 钉钉 Markdown webhook / kanban flock + DingTalk markdown #37292

Closed

WenhuaXia mentioned this pull request Jun 2, 2026

fix(kanban): 看板跨进程写锁防止 SQLite WAL 损坏 / add inter-process flock #37344

Closed

hanzckernel mentioned this pull request Jun 2, 2026

fix(kanban): serialize cross-process SQLite writes #37350

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787

fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787
WenhuaXia wants to merge 6 commits into
NousResearch:mainfrom
WenhuaXia:main

WenhuaXia commented May 31, 2026

Uh oh!

rodriguez46p-ui commented May 31, 2026

Uh oh!

alt-glitch commented May 31, 2026

Uh oh!

tonydwb left a comment

Uh oh!

WenhuaXia commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

WenhuaXia commented May 31, 2026

Problem

Observed Evidence

Root Cause Analysis

Fix

Changes

References

Uh oh!

rodriguez46p-ui commented May 31, 2026

Hermes Agent Review

Blocking

Uh oh!

alt-glitch commented May 31, 2026

Uh oh!

tonydwb left a comment

Choose a reason for hiding this comment

Code Review Summary

🔍 What this does

✅ Looks Good

Uh oh!

WenhuaXia commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants