fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2)#32531

Closed
steveonjava wants to merge 3 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-gateway-wal-pinner-block-sidecar-unlink

Conversation

@steveonjava

Contributor

What does this PR do?

This PR fixes a resource-management bug in the lifecycle of SQLite WAL sidecar files inside the gateway process. It does not cross any OS-level isolation boundary, does not change the authorization model of any external surface, and does not touch credential handling. Per SECURITY.md §3.2 it is out of scope for private disclosure and is therefore submitted as a regular bug-fix PR.

Root Cause

After the WAL-skip patch, sidecar unlinks still occur at roughly 0.32 per minute. When SQLite determines that the last connection to a WAL-mode DB is closing, it checkpoints and unlinks the WAL/shm sidecars. The gateway's concurrent in-process connections (the notifier watcher every 5s, plus two connections per dispatcher tick) create frequent "last connection" moments. File descriptors that are still open on the unlinked sidecars accumulate as deleted-inode FDs, corrupting the in-process WAL state machine and causing sqlite3.OperationalError: disk I/O error from the very first SELECT in release_stale_claims.

The Fix

Hold one open BEGIN read transaction per board for the gateway lifetime. SQLite will not enter the "last connection closing" code path while any shared lock exists. The pinner uses isolation_level=None and PRAGMA query_only=ON; it is never committed or rolled back until gateway stop.

Key invariant: the pinner must execute conn.execute("BEGIN") and hold it for the connection lifetime — a bare SELECT 1 releases the shared lock immediately after completion.
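
A minimal sketch of the pinner primitive described above, not the code in gateway/run.py. The helper names open_wal_pinner and close_wal_pinner are hypothetical. One assumption goes beyond the PR text: a plain BEGIN in SQLite is deferred and takes no lock until the first read, so the sketch issues one SELECT inside the transaction to make sure the read lock is actually held.

```python
import sqlite3


def open_wal_pinner(db_path: str) -> sqlite3.Connection:
    """Open a query-only connection that keeps a read transaction open.

    Hypothetical helper; the name and signature are not taken from the PR.
    """
    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # explicit BEGIN below is the only transaction control in play.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA query_only=ON")  # reject writes on this connection
    conn.execute("BEGIN")  # deferred: no lock is taken yet
    # Assumption: issue one read so the deferred transaction really starts
    # and the shared (read) lock is held from this point on.
    conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
    return conn


def close_wal_pinner(conn: sqlite3.Connection) -> None:
    """End the pinned transaction and close the connection (gateway stop)."""
    if conn.in_transaction:
        conn.execute("ROLLBACK")
    conn.close()
```

The caller would keep the returned connection alive for as long as the pin is wanted and call close_wal_pinner only at shutdown; letting the connection be garbage-collected would drop the transaction early.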

Behavior Change

For operators: None required. The pinner is initialized automatically per-board and runs transparently. No new config flags or migration steps.

Affected users: Deployers running multi-gateway setups with dispatch_in_gateway: true, or with highly concurrent dispatch traffic, where notifier connection churn previously caused transient EIO errors.

Related Issue

Fixes #31158 (dispatcher wedge under multi-thread concurrency)
Coordinates with #32322 (complementary per-thread cache approach; the two are stackable)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature
  • Security fix
  • Documentation update
  • Tests
  • Refactor
  • New skill

Changes Made

  • gateway/run.py: Add self._wal_pinners: dict[str, sqlite3.Connection] per-board in GatewayRunner. Initialize after board list is known; gracefully close on stop().
  • tests/hermes_cli/test_kanban_gateway_wal_pinner.py: Six tests covering pinner behavior, shared-lock holding, query_only enforcement, and FD accumulation (Linux-only regression guard).
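
The per-board bookkeeping listed above could look roughly like the following sketch. PinnerRegistry, init_pinners, close_pinners, and the board_dbs mapping are hypothetical stand-ins for whatever GatewayRunner actually does; only the _wal_pinners attribute name and the log-at-WARNING-and-continue behavior come from the PR and its commit message. The SELECT after BEGIN is an assumption added so that the deferred transaction actually takes its read lock.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


class PinnerRegistry:
    """Hypothetical stand-in for the pinner bookkeeping in GatewayRunner."""

    def __init__(self) -> None:
        self._wal_pinners: dict[str, sqlite3.Connection] = {}

    def init_pinners(self, board_dbs: dict[str, str]) -> None:
        """Open one pinner per board; board_dbs maps board name to DB path."""
        for board, db_path in board_dbs.items():
            conn = None
            try:
                conn = sqlite3.connect(db_path, isolation_level=None)
                conn.execute("PRAGMA query_only=ON")
                conn.execute("BEGIN")
                # Assumption: one read so the deferred BEGIN takes its lock.
                conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
            except sqlite3.Error as exc:
                # Defense-in-depth only: log and keep starting up.
                logger.warning("WAL pinner init failed for %s: %s", board, exc)
                if conn is not None:
                    conn.close()
                continue
            self._wal_pinners[board] = conn

    def close_pinners(self) -> None:
        """Close every pinner; closing ends the open read transaction."""
        for board, conn in list(self._wal_pinners.items()):
            try:
                conn.close()
            except sqlite3.Error as exc:
                logger.warning("WAL pinner close failed for %s: %s", board, exc)
        self._wal_pinners.clear()
```

A board whose database cannot be opened is simply left without a pinner, which matches the commit message's statement that a failed init must not prevent startup.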

How to Test

# Run all tests
python -m pytest tests/ -o 'addopts=' -q

# Run WAL pinner tests specifically
python -m pytest tests/hermes_cli/test_kanban_gateway_wal_pinner.py -v

# Style and cross-platform checks
ruff check . && ruff format --check .
scripts/check-windows-footguns.py gateway/run.py tests/hermes_cli/test_kanban_gateway_wal_pinner.py

# Reproducible verification: 60-second integration test
# (The regression test is deterministic; it verifies no deleted FDs accumulate under simulated notifier traffic)
python -m pytest tests/hermes_cli/test_kanban_gateway_wal_pinner.py::test_wal_pinner_no_deleted_fds_after_dispatcher_traffic -v

Checklist

steveonjava and others added 3 commits May 26, 2026 01:17
SQLite unlinks wal/shm sidecars when the last connection closes on a
WAL-mode DB. Gateway connections (notifier every 5s, two per dispatcher
tick) create frequent "last connection" moments, accumulating
deleted-inode FDs that corrupt the in-process WAL state and cause
sqlite3.OperationalError: disk I/O error in release_stale_claims.

Hold one open BEGIN read transaction per board for the gateway lifetime.
SQLite skips the "last connection closing" teardown path while any
shared lock exists. The pinner uses isolation_level=None and PRAGMA
query_only=ON; it is never committed or rolled back until gateway stop.

Failure to init any pinner is logged at WARNING and does not prevent
startup — the pinner is defense-in-depth.

Addresses community report of dispatcher wedge under multi-thread
concurrency (issue NousResearch#31158). Complementary to the per-thread cache
approach in PR NousResearch#32322 — the two are stackable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the probabilistic sentinel approach in
test_no_pinner_accumulates_deleted_fds with raw os.open() FDs.
SQLite's internal FDs are closed before sqlite3_close returns, so
Python-level sqlite3.Connection objects cannot observe (deleted) inode
state. Out-of-band os.open() FDs survive the unlink and produce a
deterministic (deleted) symlink in /proc/<pid>/fd — confirming that
wal/shm ARE unlinked when no WAL read-mark is held, making the
positive pinner test non-vacuous.

Co-authored-by: Cursor <cursoragent@cursor.com>
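
The /proc-based check that this commit message describes can be sketched as follows. This is an illustration of the technique, not the test file's code; deleted_fd_targets is a hypothetical name, and the approach is Linux-only because it relies on /proc/<pid>/fd symlinks carrying a " (deleted)" suffix once the file has been unlinked.

```python
import os


def deleted_fd_targets(pid: str = "self") -> list[str]:
    """Return symlink targets of open FDs whose file was unlinked (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir() and readlink(), for example
            # the directory handle that listdir() itself used.
            continue
        if target.endswith(" (deleted)"):
            targets.append(target)
    return targets
```

A regression guard built on this would hold a raw os.open() FD on a sidecar path, drive the connection traffic, and then check whether any returned target ends in "-wal (deleted)" or "-shm (deleted)".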
@alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), comp/plugins (Plugin system and bundled plugins), and P3 (Low: cosmetic, nice to have) labels on May 26, 2026
@steveonjava

steveonjava commented May 26, 2026

Contributor Author

Closing. I cherry-picked this onto a local branch on 2026-05-26 and reverted it the same day after the same Class-C corruption pattern reappeared on a high-write board. The held-BEGIN pinner is the right primitive in theory, but on its own it doesn't help: SQLite still unlinks the sidecars on the redundant WAL pragma path regardless of how many readers hold a shared lock. An earlier naive variant (bare SELECT 1) was also reverted. PR #32489 addresses the real trigger (and is bundled into the consolidated #32857). If a pinner makes sense later as defense-in-depth, it should sit on top of #32489, not in place of it.



Development

Successfully merging this pull request may close these issues.

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning
