fix(session-db): survive CLI/gateway concurrent write contention #3180
Closed
Mibayy wants to merge 1 commit into
Closes NousResearch#3139

Four layered fixes for the scenario where CLI and gateway write to state.db concurrently, causing create_session() to fail with 'database is locked' and permanently disabling session_search on the gateway side.

1. Increase SQLite connection timeout: 10s -> 30s
   hermes_state.py: longer window for the WAL writer to finish a batch flush before the other process gives up entirely.

2. INSERT OR IGNORE in create_session
   hermes_state.py: prevents IntegrityError on duplicate session IDs (e.g. gateway restarts while CLI session is still alive).

3. Don't null out _session_db on create_session failure (main fix)
   run_agent.py: a transient lock at agent startup must not permanently disable session_search for the lifetime of that agent instance. _session_db now stays alive so subsequent flushes and searches work once the lock clears.

4. New ensure_session() helper + call it during flush
   hermes_state.py: INSERT OR IGNORE for a minimal session row.
   run_agent.py _flush_messages_to_session_db: calls ensure_session() before appending messages, so the FK constraint is satisfied even when create_session() failed at startup. No-op when the row exists.
Summary
Closes #3139

Four layered fixes for the scenario where the CLI (`hermes --resume`) and the gateway (`hermes gateway`) write to `state.db` concurrently, causing `create_session()` to fail with `database is locked` and permanently disabling `session_search` on the gateway side.

Root Cause (rephrased)
The original bug chain:

1. `create_session()` → times out after 10s
2. `run_agent.py:897` sets `self._session_db = None`
3. `session_search` returns `"Session database not available"` for the rest of the session

Fixes
1. Increase SQLite connection timeout: 10s → 30s (`hermes_state.py`)

Gives the WAL writer more time to finish a batch flush. Most lock contention resolves within seconds; 30s covers bursts of CLI flushes without being excessive.
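
To illustrate what the timeout governs, here is a minimal sketch of two connections contending for the write lock on one WAL-mode database. It is not the code in `hermes_state.py`: the `open_db`, `try_write`, and `demo_contention` helpers and the one-column `sessions` table are assumptions made for the example, and the "gateway" side uses a deliberately short timeout so the failure shows up quickly.

```python
import os
import sqlite3
import tempfile


def open_db(path: str, timeout: float) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # example controls BEGIN/COMMIT explicitly. `timeout` is how long SQLite
    # keeps retrying when another connection holds the lock.
    conn = sqlite3.connect(path, timeout=timeout, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def try_write(conn: sqlite3.Connection, session_id: str) -> bool:
    """Attempt one write; return False if the database stayed locked."""
    try:
        conn.execute("BEGIN IMMEDIATE")  # ask for the write lock up front
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc):
            return False
        raise
    conn.execute("INSERT INTO sessions (id) VALUES (?)", (session_id,))
    conn.execute("COMMIT")
    return True


def demo_contention() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        cli = open_db(path, timeout=30.0)
        cli.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")
        # Hypothetical short timeout, only to keep the demo fast.
        gateway = open_db(path, timeout=0.2)

        # The "CLI" holds the write lock in the middle of a flush.
        cli.execute("BEGIN IMMEDIATE")
        cli.execute("INSERT INTO sessions (id) VALUES ('cli-1')")
        blocked = try_write(gateway, "gw-1")  # gives up after ~0.2s

        # Once the flush commits, the same write goes through.
        cli.execute("COMMIT")
        succeeded = try_write(gateway, "gw-1")

        cli.close()
        gateway.close()
        return blocked, succeeded


if __name__ == "__main__":
    print(demo_contention())
```

A longer timeout only widens the window in which the second writer keeps retrying; it does not remove the contention, which is why the remaining fixes handle the case where the lock outlasts the timeout.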
2. `INSERT OR IGNORE` in `create_session` (`hermes_state.py`)

Prevents `IntegrityError` on duplicate session IDs (e.g. the gateway restarts while a CLI session with the same ID is still alive in the DB).

3. Don't null out `_session_db` on `create_session` failure (`run_agent.py`, main fix)

A transient lock at agent startup must not permanently disable `session_search` for the lifetime of that agent. `_session_db` now stays alive so subsequent flushes and searches work once the lock clears.

4. New `ensure_session()` helper + call it during flush (`hermes_state.py`, `run_agent.py`)

`ensure_session()` uses `INSERT OR IGNORE` to create a minimal session row if it doesn't exist. `_flush_messages_to_session_db` calls it before appending messages, satisfying the FK constraint even when `create_session()` failed at startup. When the row already exists it's a no-op.

Changes

| File | Changes |
| --- | --- |
| `hermes_state.py` | `timeout=30.0`, `INSERT OR IGNORE` in `create_session`, new `ensure_session()` |
| `run_agent.py` | Remove `_session_db = None` nullification, call `ensure_session()` in `_flush_messages_to_session_db` |
| `tests/test_hermes_state.py` | `TestConcurrentWriteSafety` |

Tests
New tests cover: idempotent `create_session`, `ensure_session` creates/no-ops, FK-safe flush after failed `create_session`, timeout value check.
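
For reviewers who want to see how fixes 2 and 4 fit together, the following is a rough sketch of the pattern, not the actual diff. The two-table schema, column names, and the free functions `connect`, `create_session`, `ensure_session`, and `flush_messages` are assumptions for illustration; the real code lives in `hermes_state.py` and `run_agent.py` and may differ in shape.

```python
import sqlite3
import time

# Hypothetical minimal schema; the real state.db schema is not shown in this PR.
SCHEMA = """
CREATE TABLE sessions (
    id TEXT PRIMARY KEY,
    source TEXT,
    started_at REAL
);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES sessions(id),
    role TEXT,
    content TEXT
);
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute("PRAGMA foreign_keys=ON")  # FK enforcement is off by default
    conn.executescript(SCHEMA)
    return conn


def create_session(conn: sqlite3.Connection, session_id: str, source: str) -> None:
    # INSERT OR IGNORE: a second call with the same ID keeps the existing row
    # instead of raising IntegrityError.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO sessions (id, source, started_at) VALUES (?, ?, ?)",
            (session_id, source, time.time()),
        )


def ensure_session(conn: sqlite3.Connection, session_id: str) -> None:
    # Creates a minimal row only if none exists; otherwise a no-op.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO sessions (id, source, started_at) VALUES (?, ?, ?)",
            (session_id, "unknown", time.time()),
        )


def flush_messages(conn: sqlite3.Connection, session_id: str, messages: list) -> None:
    # Guarantee the parent row first so the FK on messages.session_id holds
    # even if create_session() never succeeded.
    ensure_session(conn, session_id)
    with conn:
        conn.executemany(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            [(session_id, role, content) for role, content in messages],
        )


if __name__ == "__main__":
    db = connect()
    create_session(db, "s1", "cli")
    create_session(db, "s1", "gateway")  # duplicate ID: ignored, no exception
    # "s2" stands in for a session whose create_session() failed at startup.
    flush_messages(db, "s2", [("user", "hello")])
    print(db.execute("SELECT id, source FROM sessions ORDER BY id").fetchall())
```

Calling `ensure_session()` on every flush costs one extra statement per batch, which seems a reasonable trade for not depending on startup having succeeded.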