fix(kanban): serialize cross-process SQLite writes#37350
Closed
hanzckernel wants to merge 5 commits into
Closed
Conversation
(cherry picked from commit da6e483)
Contributor
Author
|
Closing this in favor of the narrower replacement PR. This write-lock branch was useful diagnostic proof, but it serializes all normal Kanban writes through a sidecar lock and sacrifices the worker parallelism Kanban is supposed to preserve. The replacement targets the actual false-positive path instead: the post-commit main-file length invariant is invalid under WAL because committed frames may still live in Replacement PR: #45525 That PR keeps normal writes SQLite/WAL-concurrent, keeps the invariant for rollback-journal modes, reduces checkpoint churn, and adds a real multiprocess WAL smoke test plus local stress evidence. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Why
Kanban can have multiple worker processes opening the same board database during dispatcher bursts. SQLite WAL and busy_timeout handle normal lock waiting, but they do not give Hermes a single process-wide critical section around schema initialization and multi-statement write_txn() calls. This leaves DDL/DML and concurrent write bursts vulnerable to corrupt-board outcomes.
This patch adds a bounded DB-scoped sidecar lock that is shared by connect() initialization and write_txn(). It also keeps the existing main-file torn-extend check for non-WAL modes while skipping that check in WAL mode, where valid frames can live in the WAL while the main DB file lags.
Scope
Changed files:
Out of scope:
Related work
Related: #37344, #35770, #35787, #37292, #33482, #33696, #37359
This PR keeps the write-lock fix Kanban-only and includes tests for the lock boundaries and WAL invariant behavior. The unrelated web-server broadcast flake remains split into #37359.
Fix boundary / coverage
This PR intentionally covers the full Kanban DB write boundary rather than only one call site:
connect()initialization: header/integrity checks, WAL activation, and schema migration run under the DB write lock so first-connect DDL cannot race ordinary writes.write_txn(): the same DB-scoped lock is held fromBEGIN IMMEDIATEthroughCOMMITand the post-commit invariant check..init.lockand.locksidecars are pre-seeded before byte-range locking, with tests covering the non-empty lock-file requirement.The related PRs and historical fixes are useful context; this patch keeps the scope limited to the Kanban SQLite lock/invariant boundary and its regression tests.
Verification
python -m pytest tests/hermes_cli/test_kanban_db.py -q -o 'addopts='python -m py_compile hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_db.pygit diff --check origin/main...HEAD