fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) #32531
Closed
steveonjava wants to merge 3 commits into
SQLite unlinks wal/shm sidecars when the last connection closes on a WAL-mode DB. Gateway connections (notifier every 5s, two per dispatcher tick) create frequent "last connection" moments, accumulating deleted-inode FDs that corrupt the in-process WAL state and cause sqlite3.OperationalError: disk I/O error in release_stale_claims.
Hold one open BEGIN read transaction per board for the gateway lifetime. SQLite skips the "last connection closing" teardown path while any shared lock exists. The pinner uses isolation_level=None and PRAGMA query_only=ON; it is never committed or rolled back until gateway stop.
Failure to init any pinner is logged at WARNING and does not prevent startup; the pinner is defense-in-depth.
Addresses community report of dispatcher wedge under multi-thread concurrency (issue NousResearch#31158). Complementary to the per-thread cache approach in PR NousResearch#32322; the two are stackable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
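The pinner described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in gateway/run.py: the helper names `open_wal_pinner` and `close_wal_pinner` are hypothetical. One detail the sketch makes explicit is that a plain (deferred) `BEGIN` in SQLite takes no lock until the database is first read, so the sketch issues one read right after `BEGIN`.

```python
import logging
import sqlite3
from typing import Optional

logger = logging.getLogger(__name__)


def open_wal_pinner(db_path: str) -> Optional[sqlite3.Connection]:
    """Open a connection that holds a read transaction (hypothetical helper)."""
    try:
        # isolation_level=None: the sqlite3 module never issues BEGIN/COMMIT
        # on its own, so the transaction below stays open until we end it.
        conn = sqlite3.connect(db_path, isolation_level=None, check_same_thread=False)
        conn.execute("PRAGMA query_only=ON")  # reject writes on this connection
        conn.execute("BEGIN")
        # A deferred BEGIN acquires no lock until the first read,
        # so touch the database once to start the read transaction.
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
        return conn
    except sqlite3.Error as exc:
        # Defense-in-depth only: log and let the caller continue without a pinner.
        logger.warning("WAL pinner init failed for %s: %s", db_path, exc)
        return None


def close_wal_pinner(conn: sqlite3.Connection) -> None:
    """End the pinned read transaction and close the connection."""
    try:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
    finally:
        conn.close()
```

A caller would keep one such connection per board in a dict for the process lifetime and pass each one to `close_wal_pinner` at shutdown.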
Replace the probabilistic sentinel approach in test_no_pinner_accumulates_deleted_fds with raw os.open() FDs. SQLite's internal FDs are closed before sqlite3_close returns, so Python-level sqlite3.Connection objects cannot observe (deleted) inode state. Out-of-band os.open() FDs survive the unlink and produce a deterministic (deleted) symlink in /proc/<pid>/fd, confirming that wal/shm ARE unlinked when no WAL read-mark is held and making the positive pinner test non-vacuous.
Co-authored-by: Cursor <cursoragent@cursor.com>
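The detection technique this commit relies on, an out-of-band FD that outlives the unlink and shows up as `(deleted)` under /proc, can be sketched as follows. This is Linux-only and illustrative: `deleted_fd_targets` is a hypothetical helper, not a function from the PR's test file.

```python
import os


def deleted_fd_targets(pid: str = "self") -> list:
    """Return /proc/<pid>/fd symlink targets that the kernel marks as deleted."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD was closed between listdir() and readlink()
        # Linux appends " (deleted)" once the file's last directory entry is gone.
        if target.endswith(" (deleted)"):
            targets.append(target)
    return targets
```

A test would hold a raw `os.open()` FD on the `-wal` or `-shm` path, trigger the close that is expected to unlink the sidecar, and then check whether that path appears in the returned list.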
Closing. I cherry-picked this onto a local branch on 2026-05-26 and reverted it the same day after the same Class-C corruption pattern reappeared on a high-write board. The held-BEGIN pinner is the right primitive in theory, but on its own it doesn't help: SQLite still unlinks the sidecars on the redundant WAL pragma path regardless of how many readers hold a shared lock. An earlier naive variant (bare |
This was referenced May 26, 2026
What does this PR do?
This PR fixes a resource-management bug in the SQLite WAL sidecar lifecycle inside the gateway process. It does not cross any OS-level isolation boundary, does not change any external-surface authorization model, and does not touch credential handling. Per SECURITY.md §3.2, it is therefore out of scope for private disclosure and is submitted as a regular bug-fix PR.
Root Cause
After the WAL-skip patch, sidecar unlinks still occur at ~0.32/min. When SQLite determines it is the "last connection closing" on a WAL-mode DB, it checkpoints and unlinks the WAL/shm sidecars. Concurrent in-process connections (notifier watcher every 5s, two connections per dispatcher tick) create frequent "last connection" moments. These deleted-inode FDs accumulate, corrupting the in-process WAL state machine and causing `sqlite3.OperationalError: disk I/O error` from the very first SELECT in `release_stale_claims`.
The Fix
Hold one open `BEGIN` read transaction per board for the gateway lifetime. SQLite will not enter the "last connection closing" code path while any shared lock exists. The pinner uses `isolation_level=None` and `PRAGMA query_only=ON`; it is never committed or rolled back until gateway stop.
Key invariant: the pinner must execute `conn.execute("BEGIN")` and hold it for the connection lifetime; a bare `SELECT 1` releases the shared lock immediately after completion.
Behavior Change
For operators: None required. The pinner is initialized automatically per-board and runs transparently. No new config flags or migration steps.
Affected users: Deployers running multi-gateway setups with `dispatch_in_gateway: true` or high concurrent dispatch traffic where notifier connection churn previously caused transient EIO errors.
Related Issue
Fixes #31158 (dispatcher wedge under multi-thread concurrency)
Coordinates with #32322 (complementary per-thread cache approach; the two are stackable)
Type of Change
Changes Made
`gateway/run.py`: Add `self._wal_pinners: dict[str, sqlite3.Connection]` per-board in `GatewayRunner`. Initialize after board list is known; gracefully close on `stop()`.
`tests/hermes_cli/test_kanban_gateway_wal_pinner.py`: Six tests covering pinner behavior, shared-lock holding, query_only enforcement, FD accumulation (Linux-only regression guard).
How to Test
Checklist
`pytest tests/ -q` and all tests pass (6/6 new tests + full suite green)