fix(state): retry transient SQLite WAL setup failures (Fixes #30576) #30700

Open
deepujain wants to merge 1 commit into NousResearch:main from deepujain:fix/30576-sqlite-busy-timeout-retry

Conversation

@deepujain

Summary

Fixes #30576.

  • Set the state database SQLite busy timeout to 30 seconds when SessionDB opens its connection.
  • Retry WAL setup up to three times for transient busy/locked and WAL setup disk I/O failures.
  • Keep the existing WAL-to-DELETE fallback for known incompatible filesystems, and leave checkpoint/synchronous durability behavior to the separate WAL corruption work.
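
As a rough illustration of the first bullet, the sketch below opens a connection with a 30-second busy timeout using Python's standard `sqlite3` module. The helper names `open_state_connection` and `current_busy_timeout_ms` are hypothetical and are not taken from `hermes_state.py`; the actual `SessionDB` code may set the timeout differently.

```python
import sqlite3

# 30 seconds, matching the value stated in the summary above.
BUSY_TIMEOUT_MS = 30_000


def open_state_connection(path: str) -> sqlite3.Connection:
    """Hypothetical helper: open a connection that waits up to 30 s on locks."""
    # sqlite3.connect takes its timeout in seconds.
    conn = sqlite3.connect(path, timeout=BUSY_TIMEOUT_MS / 1000)
    # Set the pragma explicitly as well, so the value is visible to readers
    # of the code and can be read back for verification.
    conn.execute(f"PRAGMA busy_timeout = {BUSY_TIMEOUT_MS}")
    return conn


def current_busy_timeout_ms(conn: sqlite3.Connection) -> int:
    """Read back the busy timeout (in milliseconds) for this connection."""
    return conn.execute("PRAGMA busy_timeout").fetchone()[0]
```

With a busy timeout set, SQLite waits for a competing lock to clear for up to that long before returning a busy error, which is why a 1-second value is easier to exceed under contention than a 30-second one.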

Why

The state database already retries write transactions after SQLite returns busy or locked errors. However, the connection itself still used a 1-second busy timeout, and WAL setup treated some transient failures as grounds for immediate fallback or as hard failures.

On BTRFS with CoW, those short-lived WAL setup failures can clear on retry. Retrying the idempotent PRAGMA journal_mode=WAL path gives SQLite a chance to settle without broad changes to durability policy.
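
The bounded retry described above might look roughly like the following. This is a minimal sketch, not the code in this PR: the function names, the message-based transient check, and the linear backoff delay are all assumptions made for illustration. The WAL-to-DELETE fallback is deliberately left out.

```python
import sqlite3
import time


def is_transient(exc: sqlite3.OperationalError) -> bool:
    """Assumed heuristic: treat busy/locked messages as retryable."""
    message = str(exc).lower()
    return "locked" in message or "busy" in message


def enable_wal_with_retry(conn, attempts: int = 3, delay_s: float = 0.05):
    """Try PRAGMA journal_mode=WAL up to `attempts` times.

    Returns the journal mode SQLite reports. Re-raises non-transient
    errors immediately, and the last transient error once the attempt
    budget is exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
            return row[0] if row else None
        except sqlite3.OperationalError as exc:
            if not is_transient(exc) or attempt == attempts:
                raise
            # Short, growing pause before the next attempt.
            time.sleep(delay_s * attempt)
```

Because `PRAGMA journal_mode=WAL` is idempotent, repeating it after a transient failure has no side effect beyond the extra attempt, so the retry only needs a small attempt budget rather than any rollback logic.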

Overlap Check

Before changing code, I checked issue #30576 and gh search prs for #30576, SQLite WAL, BTRFS, busy_timeout, retry, and state DB lock wording. I found adjacent open WAL maintenance/corruption PRs (#30654, #16510, #30011, #24033), but none covering this busy-timeout plus WAL setup retry path. This PR avoids their checkpoint, synchronous, and doctor changes.

Validation

  • scripts/run_tests.sh tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py - 232 tests passed, 0 failed
  • uv tool run ruff check hermes_state.py tests/test_hermes_state.py tests/test_hermes_state_wal_fallback.py - all checks passed
  • git diff --check - clean

@deepujain

deepujain commented Jun 2, 2026

Author

@teknium1 Updated this PR against current main and resolved the WAL/state conflicts. The branch now keeps current main’s read-only WAL probe and disk I/O re-raise behavior, while adding bounded retry only for transient SQLite busy/locked WAL setup errors. GitHub now reports the PR as mergeable. Focused validation: scripts/run_tests.sh tests/test_hermes_state_wal_fallback.py tests/test_hermes_state.py passed, 275/275. I still cannot formally request review because GitHub says this account lacks RequestReviewsByLogin permission on this repo. Could you approve any pending workflow run and review when you have a chance?

deepujain force-pushed the fix/30576-sqlite-busy-timeout-retry branch from bc2d1e9 to 73041de on June 7, 2026 20:39
Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • P2: Medium, degraded but workaround exists
  • type/bug: Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic

2 participants