fix(gateway): add WAL pinner to hold shared lock and prevent sidecar unlink (Bug I.2) by steveonjava · Pull Request #32531 · NousResearch/hermes-agent

steveonjava · 2026-05-26T09:59:41Z

What does this PR do?

This PR fixes a resource-management bug in the SQLite WAL sidecar lifecycle inside the gateway process. It does not cross any OS-level isolation boundary, does not change the authorization model of any external surface, and does not touch credential handling. Per SECURITY.md §3.2 it is out of scope for private disclosure and is submitted as a regular bug-fix PR.

Root Cause

After the WAL-skip patch, sidecar unlinks still occur at ~0.32/min. When SQLite determines it is the "last connection closing" on a WAL-mode DB, it checkpoints and unlinks the WAL/shm sidecars. Concurrent in-process connections (notifier watcher every 5s, two connections per dispatcher tick) create frequent "last connection" moments. These deleted-inode FDs accumulate, corrupting the in-process WAL state machine and causing sqlite3.OperationalError: disk I/O error from the very first SELECT in release_stale_claims.
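The "last connection closing" behavior described above can be reproduced outside the gateway. The following is a minimal sketch, not gateway code: the `board.db` file name and the `tasks` table are hypothetical, and it assumes a default SQLite build in which persistent-WAL mode is not enabled.

```python
import os
import sqlite3
import tempfile


def sidecars_present(db_path: str) -> dict:
    """Report whether the -wal and -shm sidecar files currently exist."""
    return {suffix: os.path.exists(db_path + suffix) for suffix in ("-wal", "-shm")}


def demo_last_close_unlinks_sidecars() -> tuple:
    """Return sidecar presence while the only connection is open, then after it closes."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "board.db")  # hypothetical board DB
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        conn.commit()
        while_open = sidecars_present(db_path)
        # This is the only connection, so closing it is the "last connection
        # closing" case: SQLite checkpoints and removes the sidecars.
        conn.close()
        after_close = sidecars_present(db_path)
    return while_open, after_close
```

In the gateway the same teardown would run whenever a short-lived notifier or dispatcher connection happens to be the only one open at the moment it closes.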

The Fix

Hold one open BEGIN read transaction per board for the gateway lifetime. SQLite will not enter the "last connection closing" code path while any shared lock exists. The pinner uses isolation_level=None and PRAGMA query_only=ON; it is never committed or rolled back until gateway stop.

Key invariant: the pinner must execute conn.execute("BEGIN") and hold it for the connection lifetime — a bare SELECT 1 releases the shared lock immediately after completion.
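A minimal sketch of the pinner primitive and the invariant above, under stated assumptions. The helper names (`open_wal_pinner`, `close_wal_pinner`, `wal_survives_churn`) and the `tasks` table are hypothetical and do not come from gateway/run.py. One detail is an addition of this sketch: SQLite's default `BEGIN` is deferred and takes no lock until the database is first read, so the sketch reads `sqlite_master` once after `BEGIN`. Whether the PR's pinner does the same is not shown in this description.

```python
import os
import sqlite3
import tempfile


def open_wal_pinner(db_path: str) -> sqlite3.Connection:
    """Open a read-only connection and leave a read transaction open on it."""
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN/COMMIT.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA query_only=ON")
    conn.execute("BEGIN")
    # Assumption of this sketch: a deferred BEGIN locks nothing until the
    # first read, so touch the schema table once to start the read transaction.
    conn.execute("SELECT count(*) FROM sqlite_master").fetchall()
    return conn


def close_wal_pinner(conn: sqlite3.Connection) -> None:
    """End the held transaction and close the pinner (gateway stop)."""
    conn.execute("ROLLBACK")
    conn.close()


def wal_survives_churn(pinned: bool, cycles: int = 5) -> bool:
    """Open and close short-lived writers; report whether the -wal file remains."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "board.db")  # hypothetical board DB
        setup = sqlite3.connect(db_path)
        setup.execute("PRAGMA journal_mode=WAL")
        setup.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        setup.commit()
        pinner = open_wal_pinner(db_path) if pinned else None
        setup.close()
        for _ in range(cycles):
            worker = sqlite3.connect(db_path)
            worker.execute("INSERT INTO tasks DEFAULT VALUES")
            worker.commit()
            worker.close()
        survived = os.path.exists(db_path + "-wal")
        if pinner is not None:
            close_wal_pinner(pinner)
    return survived
```

With `pinned=False` every worker close is a last-connection close and the sidecar is removed each time; with `pinned=True` the sidecar stays in place for the duration of the churn in this isolated setup.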

Behavior Change

For operators: no action required. The pinner is initialized automatically per board and runs transparently. No new config flags or migration steps.

Affected users: Deployers running multi-gateway setups with dispatch_in_gateway: true or high concurrent dispatch traffic where notifier connection churn previously caused transient EIO errors.

Related Issue

Fixes #31158 (dispatcher wedge under multi-thread concurrency)
Coordinates with #32322 (complementary per-thread cache approach; the two are stackable)

Type of Change

Changes Made

gateway/run.py: Add self._wal_pinners: dict[str, sqlite3.Connection] per-board in GatewayRunner. Initialize after board list is known; gracefully close on stop().
tests/hermes_cli/test_kanban_gateway_wal_pinner.py: Six tests covering pinner behavior, shared-lock holding, query_only enforcement, and FD accumulation (Linux-only regression guard).
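The per-board bookkeeping could look roughly like the sketch below. `WalPinnerRegistry` is a hypothetical stand-in for the relevant part of GatewayRunner, and the `board_dbs` mapping of board name to DB path is an assumption; only the `_wal_pinners` attribute name, the close-on-stop behavior, and the WARNING-on-failure behavior (from the commit message) come from this PR.

```python
from __future__ import annotations

import logging
import sqlite3

logger = logging.getLogger(__name__)


class WalPinnerRegistry:
    """Hypothetical holder for one pinner connection per board."""

    def __init__(self) -> None:
        self._wal_pinners: dict[str, sqlite3.Connection] = {}

    def start(self, board_dbs: dict[str, str]) -> None:
        """Open a pinner for each board; a failure is logged and skipped."""
        for board, db_path in board_dbs.items():
            conn = None
            try:
                conn = sqlite3.connect(db_path, isolation_level=None)
                conn.execute("PRAGMA query_only=ON")
                conn.execute("BEGIN")
                # Assumption: read once so the deferred BEGIN takes its lock.
                conn.execute("SELECT count(*) FROM sqlite_master").fetchall()
            except sqlite3.Error as exc:
                # Defense-in-depth only: never block gateway startup.
                logger.warning("WAL pinner init failed for board %s: %s", board, exc)
                if conn is not None:
                    conn.close()
                continue
            self._wal_pinners[board] = conn

    def stop(self) -> None:
        """Release every held read transaction and close the connections."""
        for board, conn in list(self._wal_pinners.items()):
            try:
                conn.execute("ROLLBACK")
                conn.close()
            except sqlite3.Error as exc:
                logger.warning("WAL pinner close failed for board %s: %s", board, exc)
        self._wal_pinners.clear()
```

Keeping the failure path to a logged warning mirrors the stated intent that a missing pinner degrades protection but does not prevent the gateway from starting.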

How to Test

# Run all tests
python -m pytest tests/ -o 'addopts=' -q

# Run WAL pinner tests specifically
python -m pytest tests/hermes_cli/test_kanban_gateway_wal_pinner.py -v

# Style and cross-platform checks
ruff check . && ruff format --check .
scripts/check-windows-footguns.py gateway/run.py tests/hermes_cli/test_kanban_gateway_wal_pinner.py

# Reproducible verification: 60-second integration test
# (The regression test is deterministic; it verifies no deleted FDs accumulate under simulated notifier traffic)
python -m pytest tests/hermes_cli/test_kanban_gateway_wal_pinner.py::test_wal_pinner_no_deleted_fds_after_dispatcher_traffic -v

Checklist

I've read the Contributing Guide
My commit messages follow Conventional Commits
I searched for existing PRs to make sure this isn't a duplicate. Searched: #32322 (fix(gateway): cache kanban DB connections per OS thread in GatewayRunner), a complementary, stackable per-thread cache; #32226 (fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race), closed/superseded; #31130 (fix(kanban-db): WAL file descriptor leak on connect/close cycles (fixes #30799)), a different symptom. No duplicates.
My PR contains only changes related to this fix (three commits: pinner implementation + regression tests + AUTHOR_MAP chore)
I've run pytest tests/ -q and all tests pass (6/6 new tests + full suite green)
I've added tests for my changes (required for bug fixes: 6 tests covering positive behavior, negative regression guard, edge cases)
I've tested on my platform (to be confirmed by Stephen)

SQLite unlinks wal/shm sidecars when the last connection closes on a WAL-mode DB. Gateway connections (notifier every 5s, two per dispatcher tick) create frequent "last connection" moments, accumulating deleted-inode FDs that corrupt the in-process WAL state and cause sqlite3.OperationalError: disk I/O error in release_stale_claims.

Hold one open BEGIN read transaction per board for the gateway lifetime. SQLite skips the "last connection closing" teardown path while any shared lock exists. The pinner uses isolation_level=None and PRAGMA query_only=ON; it is never committed or rolled back until gateway stop.

Failure to init any pinner is logged at WARNING and does not prevent startup; the pinner is defense-in-depth.

Addresses community report of dispatcher wedge under multi-thread concurrency (issue NousResearch#31158). Complementary to the per-thread cache approach in PR NousResearch#32322; the two are stackable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the probabilistic sentinel approach in test_no_pinner_accumulates_deleted_fds with raw os.open() FDs. SQLite's internal FDs are closed before sqlite3_close returns, so Python-level sqlite3.Connection objects cannot observe (deleted) inode state. Out-of-band os.open() FDs survive the unlink and produce a deterministic (deleted) symlink in /proc/<pid>/fd, confirming that wal/shm ARE unlinked when no WAL read-mark is held and making the positive pinner test non-vacuous.

Co-authored-by: Cursor <cursoragent@cursor.com>
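The out-of-band sentinel technique from this commit can be sketched as follows. This is an illustration rather than the test from the PR: the function names are hypothetical, a plain file stands in for the real sidecar, and the code works only on Linux because it relies on /proc/self/fd.

```python
import os
import tempfile


def deleted_fd_targets() -> list:
    """List /proc/self/fd symlink targets that the kernel marks as deleted."""
    targets = []
    fd_dir = "/proc/self/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD was closed between listdir() and readlink()
        if target.endswith(" (deleted)"):
            targets.append(target)
    return targets


def sentinel_sees_unlink() -> bool:
    """Hold a raw FD on a file, unlink the file, and check /proc reports it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Resolve symlinks so the path matches what the kernel reports.
        path = os.path.join(os.path.realpath(tmp), "board.db-wal")
        with open(path, "wb") as fh:
            fh.write(b"sentinel")
        fd = os.open(path, os.O_RDONLY)  # raw FD, independent of sqlite3
        try:
            os.unlink(path)
            return any(t.startswith(path) for t in deleted_fd_targets())
        finally:
            os.close(fd)
```

Because the sentinel FD is owned by the test rather than by SQLite, it stays open across the unlink and makes the deleted inode observable.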

steveonjava · 2026-05-26T22:14:36Z

Closing. I cherry-picked this onto a local branch on 2026-05-26 and reverted it the same day after the same Class-C corruption pattern reappeared on a high-write board. The held-BEGIN pinner is the right primitive in theory, but on its own it doesn't help: SQLite still unlinks the sidecars on the redundant WAL pragma path regardless of how many readers hold a shared lock. An earlier naive variant (bare SELECT 1) was also reverted. PR #32489 addresses the real trigger (and is bundled into the consolidated #32857). If a pinner makes sense later as defense-in-depth, it should sit on top of #32489, not in place of it.

steveonjava and others added 3 commits May 26, 2026 01:17

chore(release): add steveonjava to AUTHOR_MAP

935b0a4

alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), comp/plugins (Plugin system and bundled plugins), and P3 (Low: cosmetic, nice to have) labels May 26, 2026

steveonjava closed this May 26, 2026

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

kshitijk4poor mentioned this pull request May 28, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

steveonjava wants to merge 3 commits into NousResearch:main from steveonjava:fix/kanban-gateway-wal-pinner-block-sidecar-unlink
