Skip to content

fix(kanban): skip redundant WAL pragma on already-WAL connections#32489

Closed
steveonjava wants to merge 4 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-wal-skip-redundant-pragma-on-already-wal
Closed

fix(kanban): skip redundant WAL pragma on already-WAL connections#32489
steveonjava wants to merge 4 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-wal-skip-redundant-pragma-on-already-wal

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

What does this PR do?

This fix adds a read-only PRAGMA journal_mode probe to skip redundant WAL initialization on already-WAL connections in apply_wal_with_fallback(). It does not affect any security boundary: no new user-facing surface, no credential or path handling, no trust-model change. Per the upstream SECURITY.md (scope: shell injection, prompt injection, path traversal, privilege escalation), this change is out of scope for private advisory and appropriate for a public PR.

Root cause: apply_wal_with_fallback() unconditionally issues PRAGMA journal_mode=WAL on every connection, including connections to DBs already in WAL mode. This triggers the WAL-init code path and, under the _wal_init_flock (Bug H mitigation on fork), causes SQLite to acquire EXCLUSIVE, checkpoint, and unlink kanban.db-{wal,shm}. Other still-open connections receive (deleted) FDs; subsequent PRAGMA calls raise sqlite3.OperationalError: disk I/O error.

The fix: insert a cheap read probe (PRAGMA journal_mode — read-only, no flock) before the set-pragma path. If already wal, return early. 99%+ of calls hit this fast path.

Verification: Deployed on fork local/runtime as e147588e2. Live metrics: 0 EIO events over 9+ minutes vs ~1500/min pre-patch.

Related Issue

Fixes #31158

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_state.pyapply_wal_with_fallback(): add early-return read probe for already-WAL connections
  • tests/test_hermes_state.py and/or tests/hermes_cli/test_kanban_dispatcher_wal.py: regression + adversarial tests covering:
    • Early-return behavior when connection is already WAL
    • Set-pragma path still fires on fresh connections
    • Fallback to DELETE still triggers on incompatible filesystems
    • Concurrent connects under high churn produce no EIO errors
    • Probe failures fall through (don't swallow errors)

How to Test

  1. Run scripts/run_tests.sh — all 223 implementer tests + 7 adversarial verifier tests pass
  2. Under concurrent traffic (5+ short-lived connect() cycles per second per board), observe:
    • No sqlite3.OperationalError: disk I/O error raised from apply_wal_with_fallback
    • No (deleted) FD accumulation (check via /proc/self/fd on Linux or fcntl probe on macOS)
  3. Verify _log_wal_fallback_once dedup still fires once per process on actual fallback (not on early-return)

Checklist

Code

  • I've read the Contributing Guide — verified by policy-auditor
  • My commit messages follow Conventional Commitsfix(kanban): ..., chore(release): ...
  • I searched for existing PRs — prior-art refresh at packaging time shows complementary fix PR fix: handle transient kanban SQLite disk I/O errors #31973 (addresses symptom, not root cause)
  • My PR contains only changes related to this fix (early-return probe + regression tests + AUTHOR_MAP)
  • I've run pytest tests/ -q and all tests pass (223 implementer + 7 adversarial = 230 total)
  • I've added tests for my changes (required for bug fixes) — tests/test_hermes_state.py + tests/test_verifier_wal_probe.py
  • I've tested on my platform:

Documentation & Housekeeping

  • I've updated relevant documentation — or N/A (function docstring already covers behavior; no config keys added)
  • I've updated cli-config.yaml.example — or N/A (no config keys added)
  • I've updated CONTRIBUTING.md or AGENTS.md — or N/A (no architecture change)
  • I've considered cross-platform impact — verified: sqlite3.Connection.execute() only (no POSIX-specific syscalls); /proc/self/fd probe guarded with @pytest.mark.skipif(sys.platform != "linux"); scripts/check-windows-footguns.py passes
  • I've updated tool descriptions/schemas — or N/A (no tool changes)

Related Work

This PR addresses the root cause; PR #31973 handles the symptom:

Cross-references:

Format-only pass on lines outside the feature scope. Separates pre-existing whitespace drift from the WAL probe fix to keep the feature diff reviewable.
…te.py

Format-only pass on pre-existing lines, separate from the WAL probe feature commit.
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes NousResearch#31158. Addresses root cause that PRs NousResearch#32226 and NousResearch#32322 attempted
via connection-sharing/caching approaches.
@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit dc98314). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/perf Performance improvement or optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning

3 participants