fix(kanban): skip redundant WAL pragma on already-WAL connections by steveonjava · Pull Request #32489 · NousResearch/hermes-agent

steveonjava · 2026-05-26T08:05:01Z

What does this PR do?

This fix adds a read-only PRAGMA journal_mode probe to skip redundant WAL initialization on already-WAL connections in apply_wal_with_fallback(). It does not affect any security boundary: no new user-facing surface, no credential or path handling, no trust-model change. Per the upstream SECURITY.md (scope: shell injection, prompt injection, path traversal, privilege escalation), this change is out of scope for private advisory and appropriate for a public PR.

Root cause: apply_wal_with_fallback() unconditionally issues PRAGMA journal_mode=WAL on every connection, including connections to DBs already in WAL mode. This triggers the WAL-init code path and, under the _wal_init_flock (Bug H mitigation on fork), causes SQLite to acquire an EXCLUSIVE lock, checkpoint, and unlink kanban.db-{wal,shm}. Other still-open connections are left holding file descriptors to the unlinked files (reported as (deleted) FDs), and their subsequent PRAGMA calls raise sqlite3.OperationalError: disk I/O error.

The fix: insert a cheap read probe (PRAGMA journal_mode — read-only, no flock) before the set-pragma path. If already wal, return early. 99%+ of calls hit this fast path.
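The probe-then-set ordering described above can be sketched as follows. This is a minimal illustration using only the standard sqlite3 module, not the code in hermes_state.py: the function name apply_wal_probe_sketch, its return strings, and the demo helper are assumptions made for the example, and the real function's flock and DELETE-fallback handling are omitted.

```python
import os
import sqlite3
import tempfile


def apply_wal_probe_sketch(conn: sqlite3.Connection) -> str:
    """Hypothetical stand-in for apply_wal_with_fallback(); not the project's code."""
    # Read-only probe: PRAGMA journal_mode without an argument only reports the mode.
    row = conn.execute("PRAGMA journal_mode").fetchone()
    if row is not None and str(row[0]).lower() == "wal":
        return "already-wal"  # fast path: skip the set-pragma entirely
    # Slow path: a connection to a non-WAL database asks SQLite to switch it.
    row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
    # The real function would fall back to DELETE here; this sketch only reports it.
    return "set-wal" if str(row[0]).lower() == "wal" else "fallback"


def demo() -> list:
    """Two connections to one scratch DB: the first sets WAL, the second skips it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kanban.db")
        first = sqlite3.connect(path)
        second = sqlite3.connect(path)
        try:
            return [apply_wal_probe_sketch(first), apply_wal_probe_sketch(second)]
        finally:
            first.close()
            second.close()
```

On a filesystem that supports WAL, the second connection should take the fast path because the journal mode is persisted in the database file rather than per connection.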

Verification: Deployed on fork local/runtime as e147588e2. Live metrics: 0 EIO events over 9+ minutes vs ~1500/min pre-patch.

Related Issue

Fixes #31158

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

hermes_state.py — apply_wal_with_fallback(): add early-return read probe for already-WAL connections
tests/test_hermes_state.py and/or tests/hermes_cli/test_kanban_dispatcher_wal.py: regression + adversarial tests covering:
- Early-return behavior when connection is already WAL
- Set-pragma path still fires on fresh connections
- Fallback to DELETE still triggers on incompatible filesystems
- Concurrent connects under high churn produce no EIO errors
- Probe failures fall through (don't swallow errors)
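A rough idea of how the first, second, and last cases in this list might be asserted without touching a real database is sketched below. RecordingConn and apply_wal_sketch are hypothetical names invented for this example, and the sketch assumes that "fall through" means a failed probe proceeds to the set-pragma path; the actual tests in the PR may be structured differently.

```python
import sqlite3


class RecordingConn:
    """Hypothetical test double: records SQL and can make the probe fail."""

    def __init__(self, mode, probe_error=None):
        self.mode = mode
        self.probe_error = probe_error
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)
        if sql == "PRAGMA journal_mode" and self.probe_error is not None:
            raise self.probe_error
        if sql == "PRAGMA journal_mode=WAL":
            self.mode = "wal"
        return self

    def fetchone(self):
        return (self.mode,)


def apply_wal_sketch(conn):
    """Stand-in for the function under test (assumed behavior, not the real code)."""
    try:
        if conn.execute("PRAGMA journal_mode").fetchone()[0].lower() == "wal":
            return
    except sqlite3.OperationalError:
        pass  # assumption: a failed probe falls through to the set-pragma path
    conn.execute("PRAGMA journal_mode=WAL").fetchone()


def test_already_wal_skips_set_pragma():
    conn = RecordingConn("wal")
    apply_wal_sketch(conn)
    assert conn.statements == ["PRAGMA journal_mode"]


def test_fresh_connection_still_sets_wal():
    conn = RecordingConn("delete")
    apply_wal_sketch(conn)
    assert conn.statements == ["PRAGMA journal_mode", "PRAGMA journal_mode=WAL"]


def test_probe_failure_falls_through():
    conn = RecordingConn("delete", sqlite3.OperationalError("disk I/O error"))
    apply_wal_sketch(conn)
    assert conn.statements[-1] == "PRAGMA journal_mode=WAL"
```

Recording the issued statements keeps the assertions about which path ran independent of filesystem behavior.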

How to Test

Run scripts/run_tests.sh — all 223 implementer tests + 7 adversarial verifier tests pass
Under concurrent traffic (5+ short-lived connect() cycles per second per board), observe:
- No sqlite3.OperationalError: disk I/O error raised from apply_wal_with_fallback
- No (deleted) FD accumulation (check via /proc/self/fd on Linux or fcntl probe on macOS)
Verify _log_wal_fallback_once dedup still fires once per process on actual fallback (not on early-return)
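One possible way to perform the /proc/self/fd check mentioned above is sketched here. It is Linux-only and relies on the kernel appending " (deleted)" to the symlink target of a descriptor whose file has been unlinked; the helper names deleted_fd_targets and demo are made up for this example and are not part of the PR.

```python
import os
import sys
import tempfile


def deleted_fd_targets(fd_dir="/proc/self/fd"):
    """Return symlink targets of this process's FDs whose file was unlinked."""
    if not sys.platform.startswith("linux") or not os.path.isdir(fd_dir):
        return []  # no procfs view available; nothing to report
    targets = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD was closed between listdir() and readlink()
        if target.endswith(" (deleted)"):
            targets.append(target)
    return targets


def demo():
    """Open a scratch file, unlink it while open, and check the scan sees it."""
    fd, path = tempfile.mkstemp(suffix="-wal")
    try:
        os.unlink(path)
        return any(os.path.basename(path) in t for t in deleted_fd_targets())
    finally:
        os.close(fd)
```

Calling deleted_fd_targets() periodically inside the process under load and checking that no kanban.db-wal or kanban.db-shm entries accumulate would approximate the observation described in the list.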

Checklist

Code

I've read the Contributing Guide — verified by policy-auditor
My commit messages follow Conventional Commits — fix(kanban): ..., chore(release): ...
I searched for existing PRs — prior-art refresh at packaging time shows complementary fix PR #31973 ("fix: handle transient kanban SQLite disk I/O errors"), which addresses the symptom, not the root cause
My PR contains only changes related to this fix (early-return probe + regression tests + AUTHOR_MAP)
I've run pytest tests/ -q and all tests pass (223 implementer + 7 adversarial = 230 total)
I've added tests for my changes (required for bug fixes) — tests/test_hermes_state.py + tests/test_verifier_wal_probe.py
I've tested on my platform:

Documentation & Housekeeping

I've updated relevant documentation — or N/A (function docstring already covers behavior; no config keys added)
I've updated cli-config.yaml.example — or N/A (no config keys added)
I've updated CONTRIBUTING.md or AGENTS.md — or N/A (no architecture change)
I've considered cross-platform impact — verified: sqlite3.Connection.execute() only (no POSIX-specific syscalls); /proc/self/fd probe guarded with @pytest.mark.skipif(sys.platform != "linux"); scripts/check-windows-footguns.py passes
I've updated tool descriptions/schemas — or N/A (no tool changes)

Related Work

This PR addresses the root cause; PR #31973 handles the symptom:

PR #31973 ("fix: handle transient kanban SQLite disk I/O errors"; open, nuch1011) removes EIO from _WAL_INCOMPAT_MARKERS so transient EIO doesn't trigger fallback to DELETE. That's a necessary safety check, but it doesn't prevent the unlink-and-EIO cycle.
This PR prevents the unlink trigger entirely by skipping the set-pragma on already-WAL connections.
Both are needed; this PR supersedes the per-thread-cache attempt, PR #32322 ("fix(gateway): cache kanban DB connections per OS thread in GatewayRunner"), which should be closed.
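To illustrate why the two changes are complementary, here is a hedged sketch of marker-based fallback classification. The marker tuple and the function should_fall_back_to_delete are hypothetical; the real contents of _WAL_INCOMPAT_MARKERS are not shown in this PR, so the strings below are placeholders chosen only to make the example run.

```python
import sqlite3

# Hypothetical marker list; the real _WAL_INCOMPAT_MARKERS is not shown in this PR.
WAL_INCOMPAT_MARKERS_SKETCH = ("locking protocol",)


def should_fall_back_to_delete(exc, markers=WAL_INCOMPAT_MARKERS_SKETCH):
    """Treat an error as 'filesystem cannot do WAL' only if it matches a marker."""
    message = str(exc).lower()
    return any(marker in message for marker in markers)


# With "disk i/o error" absent from the markers, a transient EIO is not
# classified as WAL incompatibility, so it would not trigger the DELETE fallback.
eio = sqlite3.OperationalError("disk I/O error")
print(should_fall_back_to_delete(eio))
print(should_fall_back_to_delete(eio, markers=("locking protocol", "disk i/o error")))
```

Under these assumptions, dropping the EIO marker changes how an EIO is classified after it occurs, while the early-return probe aims to stop the EIO from being produced in the first place.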

Cross-references:

PR #32226 ("fix(gateway): use shared per-board kanban connection to prevent WAL inode-rotation race"; closed by Stephen): "tested approach, reverted — use shared per-board connection." Caused 3 production corruption events in 45 minutes. This PR takes the opposite approach: early-return on already-WAL, no new connections.
PR #32322 ("fix(gateway): cache kanban DB connections per OS thread in GatewayRunner"; open, draft): superseded by this PR. Recommend closure.
PR #32301 ("fix(kanban): hoist zombie reaper out of dispatch_once") and PR #32300 ("fix(kanban): add post-commit page_count invariant check to write_txn"): orthogonal, leave open.

apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call, including connections to DBs already in WAL mode. This triggered the WAL init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other open connections received (deleted) FDs and raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink) before the set-pragma path. If already wal, return early. The set-pragma and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses the root cause that PRs NousResearch#32226 and NousResearch#32322 attempted to work around via connection-sharing/caching approaches.

steveonjava · 2026-05-26T22:36:42Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor · 2026-05-27T21:32:55Z

Merged via #33482 (commit dc98314). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

steveonjava added 4 commits May 25, 2026 23:18

style(kanban): pre-existing ruff format pass on hermes_state.py

3e59ce4

Format-only pass on lines outside the feature scope. Separates pre-existing whitespace drift from the WAL probe fix to keep the feature diff reviewable.

style(kanban): pre-existing ruff format pass on tests/test_hermes_state.py

f77c930

Format-only pass on pre-existing lines, separate from the WAL probe feature commit.

chore(release): add steveonjava to AUTHOR_MAP

5173e5d

alt-glitch added the labels type/perf (Performance improvement or optimization), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) on May 26, 2026

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this May 27, 2026

valhir1 mentioned this pull request May 31, 2026

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning #31158

Closed

steveonjava wants to merge 4 commits into NousResearch:main from steveonjava:feat/kanban-wal-skip-redundant-pragma-on-already-wal


3 participants
