Skip to content

fix(kanban): serialize cross-process SQLite writes#37350

Closed
hanzckernel wants to merge 5 commits into
NousResearch:mainfrom
hanzckernel:fix/kanban-bounded-write-lock
Closed

fix(kanban): serialize cross-process SQLite writes#37350
hanzckernel wants to merge 5 commits into
NousResearch:mainfrom
hanzckernel:fix/kanban-bounded-write-lock

Conversation

@hanzckernel

@hanzckernel hanzckernel commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • serialize Kanban DB writes across processes with a bounded sidecar write lock
  • run first-connect/schema/WAL initialization under the same DB write lock
  • avoid WAL-mode false positives from the post-commit main-file length invariant
  • add regression coverage for timeout behavior, schema-init locking, write_txn lock scope, symlinked DB lock canonicalization, Windows byte-range lock setup, and WAL invariant gating

Why

Kanban can have multiple worker processes opening the same board database during dispatcher bursts. SQLite WAL and busy_timeout handle normal lock waiting, but they do not give Hermes a single process-wide critical section around schema initialization and multi-statement write_txn() calls. This leaves DDL/DML and concurrent write bursts vulnerable to corrupt-board outcomes.

This patch adds a bounded DB-scoped sidecar lock that is shared by connect() initialization and write_txn(). It also keeps the existing main-file torn-extend check for non-WAL modes while skipping that check in WAL mode, where valid frames can live in the WAL while the main DB file lags.

Scope

Changed files:

  • hermes_cli/kanban_db.py
  • tests/hermes_cli/test_kanban_db.py

Out of scope:

Related work

Related: #37344, #35770, #35787, #37292, #33482, #33696, #37359

This PR keeps the write-lock fix Kanban-only and includes tests for the lock boundaries and WAL invariant behavior. The unrelated web-server broadcast flake remains split into #37359.

Fix boundary / coverage

This PR intentionally covers the full Kanban DB write boundary rather than only one call site:

  • connect() initialization: header/integrity checks, WAL activation, and schema migration run under the DB write lock so first-connect DDL cannot race ordinary writes.
  • write_txn(): the same DB-scoped lock is held from BEGIN IMMEDIATE through COMMIT and the post-commit invariant check.
  • bounded contention: sidecar lock acquisition times out with a clear SQLite operational error instead of waiting indefinitely.
  • path identity: sidecar lock paths are derived from the resolved DB path, with regression coverage for symlinked DB access.
  • Windows behavior: both .init.lock and .lock sidecars are pre-seeded before byte-range locking, with tests covering the non-empty lock-file requirement.
  • WAL correctness: the main-file length invariant is skipped in WAL mode, where valid committed frames may still live in the WAL while the main DB file lags.

The related PRs and historical fixes are useful context; this patch keeps the scope limited to the Kanban SQLite lock/invariant boundary and its regression tests.

Verification

  • python -m pytest tests/hermes_cli/test_kanban_db.py -q -o 'addopts='
    • 211 passed
  • targeted lock/WAL regression subset
    • 7 passed
  • python -m py_compile hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_db.py
  • git diff --check origin/main...HEAD
  • multi-process smoke: 8 worker processes x 40 create_task calls against one DB
    • quick_check=ok
    • integrity_check=ok
    • tasks=320
    • journal=wal
  • independent review: passed after fixing the Windows zero-byte sidecar issue for both .init.lock and .lock

@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have labels Jun 2, 2026
@hanzckernel

Copy link
Copy Markdown
Contributor Author

Closing this in favor of the narrower replacement PR.

This write-lock branch was useful diagnostic proof, but it serializes all normal Kanban writes through a sidecar lock and sacrifices the worker parallelism Kanban is supposed to preserve. The replacement targets the actual false-positive path instead: the post-commit main-file length invariant is invalid under WAL because committed frames may still live in kanban.db-wal while the main DB file lags.

Replacement PR: #45525

That PR keeps normal writes SQLite/WAL-concurrent, keeps the invariant for rollback-journal modes, reduces checkpoint churn, and adds a real multiprocess WAL smoke test plus local stress evidence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants