Skip to content

fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787

Closed
WenhuaXia wants to merge 6 commits into
NousResearch:mainfrom
WenhuaXia:main
Closed

fix(kanban): add inter-process flock to prevent SQLite WAL corruption#35787
WenhuaXia wants to merge 6 commits into
NousResearch:mainfrom
WenhuaXia:main

Conversation

@WenhuaXia

Copy link
Copy Markdown

Problem

Multiple kanban worker processes (5+ gateway instances running hermes -p <profile>) writing concurrently to the same SQLite database cause frequent WAL corruption.

Observed Evidence

7 corruptions in 33 minutes during a single kanban pipeline run on 2026-05-31:

Timestamp DB Size Corruption Token
10:41:34 131,072 5b2e1f71
10:42:19 147,456 7298fd2c
10:43:36 143,360 10a46b2f
10:44:19 143,360 ab537cb0
10:53:19 147,456 ef10fae0
10:58:36 147,456 26ad84f1
11:14:37 139,264 e6b5ed72

Error: SQL logic error: wrong # of entries in index idx_runs_status

Root Cause Analysis

Upstream PR #33696 added _cross_process_init_lock() around schema initialization and explicitly rejected per-write flock with the rationale: "SQLite serializes via WAL once busy_timeout is set."

This assumption is wrong. SQLite WAL + busy_timeout only protects concurrent threads within the same process. When multiple worker processes each open their own sqlite3.Connection, they can race through BEGIN IMMEDIATE concurrently, corrupting the WAL.

The corruption happens through two race conditions:

Race 1: Concurrent write_txn() from different processes

Process A (drafter)          Process B (legal-reviewer)
┌─────────────────────────────────────────────────────┐
│ write_txn() → BEGIN IMMEDIATE                       │
│   UPDATE tasks SET claim_lock='...'                 │
│   INSERT INTO task_events (...)                     │
│   INSERT INTO task_runs (...)                       │
│   COMMIT                                            │
└─────────────────────────────────────────────────────┘
                  ↓ WAL pages race ←
┌─────────────────────────────────────────────────────┐
│ write_txn() → BEGIN IMMEDIATE                       │
│   UPDATE tasks SET claim_lock='...'                 │
│   INSERT INTO task_events (...)                     │
│   INSERT INTO task_runs (...)                       │
│   COMMIT                                            │
└─────────────────────────────────────────────────────┘

Each write_txn() contains 3-5 write statements (claim + events + runs + status). When two processes execute these concurrently, SQLite's WAL file can end up with torn pages or mismatched index entries.

Race 2: DDL schema init collides with in-flight DML

A new worker calling connect() runs executescript(SCHEMA_SQL) and _migrate_add_optional_columns() under .init.lock while other workers are mid-write_txn() using .db.lock. These are two different lock files — no mutual exclusion. DDL (ALTER TABLE) while another process is mid-BEGIN IMMEDIATE → corruption.

Fix

  1. Per-write flock: After sqlite3.connect() in connect(), open a .db.lock file. Every write_txn() acquires fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one process writes at a time.

  2. DDL init under .db.lock: Schema initialization also acquires .db.lock (not just .init.lock) so it cannot race with in-flight DML.

  3. Graceful fallback: On platforms without fcntl (Windows) or when lock open fails, the dict entry is None and flock is skipped.

Changes

  • hermes_cli/kanban_db.py: Added import fcntl + _KANBAN_WRITE_LOCKS dict, flock in connect() and write_txn()

References

WenhuaXia added 4 commits May 31, 2026 16:49
…er concurrent writes

Root cause: multiple kanban worker processes (hermes -p <profile>) open
independent sqlite3 connections to the same kanban.db. SQLite WAL + BEGIN
IMMEDIATE only protects within-process threads — concurrent writers from
different processes can still corrupt the WAL under heavy write pressure
(task_events + task_runs + status transitions within the same tick).

Fix: acquire an fcntl.LOCK_EX on a per-DB .db.lock file around every
write_txn().  The lock fd is stored in a module-level dict keyed by
id(conn) (sqlite3.Connection is a C-extension type that rejects both
attribute assignment and weakref).  On platforms without fcntl (Windows)
or when the lock file cannot be opened, the lock is a graceful no-op.

This preserves the existing WAL + BEGIN IMMEDIATE strategy while
serialising cross-process writes, which is the common pattern for
SQLite-based multi-process workloads.
Root cause: connect() runs DDL writes (executescript + migrations)
under .init.lock while write_txn() holds .db.lock. These are DIFFERENT
files, so a new worker's schema init can race with a dispatching
gateway's DML writes, corrupting the WAL.

Fix: Acquire .db.lock around the DDL write section in connect() so
schema init and normal write_txn() calls use the same lock file and
are properly serialized.

This fixes the observed corruption pattern (7 DB corruptions in ~10
minutes during concurrent kanban worker dispatch).
…on under concurrent workers

Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected.
Under concurrent multi-worker writes (task_events + task_runs + status transitions
all within the same gateway tick), BEGIN IMMEDIATE transactions from different
processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min.

Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call.
DDL migrations in connect() also use the same lock file, so schema init
cannot collide with in-flight DML from other processes.

Graceful no-op on platforms without fcntl (Windows) or when lock open fails.

Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but
NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption
was proven wrong by real-world corruption data.
@rodriguez46p-ui

Copy link
Copy Markdown

Hermes Agent Review

Thanks for the WAL-corruption hardening work. I found one blocking portability issue before this can merge:

Blocking

  • hermes_cli/kanban_db.py now imports fcntl at module import time. On Windows, fcntl is unavailable, so any code path that imports kanban fails immediately before the intended graceful no-op can run.

Evidence from a Windows venv against this PR worktree:

PYTHONPATH=C:\Users\yolop\AppData\Local\Temp\hermes-pr-35787-review C:\Users\yolop\.hermes\hermes-agent\venv\Scripts\python.exe -c "import hermes_cli.kanban_db"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "...\hermes_cli\kanban_db.py", line 74, in <module>
    import fcntl
ModuleNotFoundError: No module named 'fcntl'

Suggested fix: make fcntl optional at import time, e.g. try: import fcntl except ImportError: fcntl = None, and guard every fcntl.flock(...) call with both lock_handle is not None and fcntl is not None (or route through a small helper). That would match the PR description's “graceful no-op on platforms without fcntl.”

No formal review submitted; posting as a focused comment from the hourly commander.

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 31, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #35770 — both PRs add fcntl.flock around write_txn() in kanban_db.py to prevent SQLite WAL corruption under concurrent multi-worker writes. Same approach, same file.

Without this guard, connect() unconditionally opened .db.lock even when
fcntl = None. Although downstream flock calls are gated by
'if lock_handle is not None', opening the file was unnecessary I/O and
could fail on restricted filesystems. Now we set the dict entry to None
upfront when fcntl is None, matching the existing graceful-no-op intent.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

🔍 What this does

Adds fcntl.flock(LOCK_EX) around all write transactions (write_txn()) and DDL initialization (connect() → schema/migration) in the kanban database layer, using a per-DB .db.lock file. This prevents SQLite WAL corruption when multiple worker processes write concurrently — a race condition that BEGIN IMMEDIATE + busy_timeout cannot fix across processes.

✅ Looks Good

  • Root cause analysis: Excellent PR description with clear race-condition diagrams. The bug is real — SQLite WAL mode protects concurrent threads within one process but NOT concurrent writers from different processes. fcntl.flock is the standard fix.
  • Correctness:
    • Lock opened in connect() with path.with_suffix('.db.lock') — deterministic, no collisions.
    • Lock acquired BEFORE BEGIN IMMEDIATE in write_txn() (ensures serialization before SQLite begins its write transaction).
    • Lock released in finally block — guaranteed cleanup even on exceptions.
    • DDL init (schema + migrations) also under .db.lock — prevents the Race 2 described in the PR.
  • Edge cases:
    • fcntl import wrapped in try/except ImportError — Windows gracefully skips locking.
    • Lock open wrapped in try/except OSError — network/NFS failures are no-ops.
    • _KANBAN_WRITE_LOCKS[id(conn)] = None as fallback — clean dict entries, no KeyError.
  • Existing safety: _check_file_length_invariant runs after commit but before the finally unlocks, preserving its corruption-detection role.
  • No deadlock risk: LOCK_EX on the .db.lock file. fcntl.flock is NOT recursive — a second LOCK_EX from the same process is a no-op (same fd). This is safe because each process has one connect() call, one lock fd.

Reviewed by Hermes Agent (cron)

Standalone fallback sends msgtype: "text" which doesn't render markdown
tables. The live DingTalk adapter sends msgtype: "markdown" which works.
Align the two paths so cron job delivery renders tables correctly.

Also adds 20K character truncation (matching live adapter limit) and
fallback to plain text if the API rejects the markdown payload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@WenhuaXia

Copy link
Copy Markdown
Author

Closing duplicate — cleaner version is in a new PR (will be created).

@WenhuaXia WenhuaXia closed this Jun 2, 2026
WenhuaXia added a commit to WenhuaXia/hermes-agent that referenced this pull request Jun 2, 2026
…+ DingTalk markdown webhook

看板 SQLite 跨进程写锁(inter-process flock):
Kanban SQLite cross-process write lock (inter-process flock):

问题:多个 kanban worker 进程并发写入同一 SQLite 数据库导致
WAL 损坏。SQLite WAL + busy_timeout 只能保护同一进程内的
并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。

Problem: Multiple kanban worker processes writing concurrently to the
same SQLite database caused frequent WAL corruption. SQLite WAL +
busy_timeout only protects concurrent threads within the same process;
cross-process races through BEGIN IMMEDIATE can corrupt WAL pages.

修复:
Fix:
1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX)
   包裹整个事务,确保同一时间只有一个进程在写。
2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。
3. Windows 或 fcntl 不可用时优雅降级为 no-op。

1. After connect(), open a .db.lock file. Every write_txn() acquires
   fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one
   process writes at a time.
2. Schema init also acquires .db.lock (not just .init.lock) to prevent
   DDL/DML races.
3. Graceful fallback on Windows or when fcntl is unavailable.

钉钉 webhook Markdown 渲染:
DingTalk webhook Markdown rendering:

问题:standalone cron delivery 用 msgtype:"text" 发钉钉消息,markdown
表格中的管道符显示为文字。live adapter 用 msgtype:"markdown" 正确渲染。

Problem: Standalone cron delivery sent msgtype:"text" for DingTalk, which
does not render markdown tables — pipes show as literal characters. The
live DingTalk adapter sends msgtype:"markdown" which renders tables correctly.

修复:统一使用 msgtype:"markdown",被拒绝时 fallback 到纯文本。
Fix: Use msgtype:"markdown" consistently with fallback to plain text
if the API rejects the markdown payload.

Changes:
- hermes_cli/kanban_db.py: fcntl flock for inter-process write safety
- tools/send_message_tool.py: markdown msgtype for DingTalk standalone delivery

Related: NousResearch#35787 (closed), NousResearch#34400 (closed)
WenhuaXia added a commit to WenhuaXia/hermes-agent that referenced this pull request Jun 6, 2026
…vent DB corruption

问题 / Problem: 多个 kanban worker 进程并发写入同一 SQLite 数据库导致频繁
WAL 损坏(33 分钟内 7 次)。SQLite WAL + busy_timeout 只能保护同一进程内的
并发线程,跨进程仍会竞争 BEGIN IMMEDIATE 导致页面损坏。

Problem: Multiple kanban worker processes writing concurrently to the same
SQLite database caused frequent WAL corruption (7 times in 33 minutes).
SQLite WAL + busy_timeout only protects concurrent threads within the same
process; cross-process races through BEGIN IMMEDIATE can corrupt WAL pages.

修复 / Fix:
1. connect() 后打开 .db.lock 文件,write_txn() 用 fcntl.flock(LOCK_EX)
   包裹整个事务,确保同一时间只有一个进程在写。
2. schema init 也获取 .db.lock(不只是 .init.lock),防止 DDL/DML 竞争。
3. Windows 或 fcntl 不可用时优雅降级为 no-op。

1. After connect(), open a .db.lock file. Every write_txn() acquires
   fcntl.flock(LOCK_EX) around the entire transaction, ensuring only one
   process writes at a time.
2. Schema init also acquires .db.lock (not just .init.lock) to prevent
   DDL/DML races.
3. Graceful fallback on Windows or when fcntl is unavailable.

Changes:
- hermes_cli/kanban_db.py: fcntl flock for inter-process write safety

Related: NousResearch#35787 (closed), NousResearch#37292 (closed, split)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants