fix(session-db): survive CLI/gateway concurrent write contention #3180

Closed
Mibayy wants to merge 1 commit into NousResearch:main from Mibayy:fix/session-db-lock-3139

@Mibayy (Contributor) commented Mar 26, 2026

Summary

Closes #3139

Four layered fixes for the scenario where the CLI (hermes --resume) and the gateway (hermes gateway) write to state.db concurrently, causing create_session() to fail with "database is locked" and permanently disabling session_search on the gateway side.

Root Cause (rephrased)

The original bug chain:

  1. CLI is doing frequent WAL flushes → SQLite writer lock held
  2. Telegram message arrives → gateway tries create_session() → times out after 10s
  3. Exception handler in run_agent.py:897 sets self._session_db = None
  4. Agent is cached → all subsequent messages in that session have no session_search
  5. session_search returns "Session database not available" for the rest of the session
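Steps 1 and 2 of the chain can be reproduced outside Hermes with the standard library alone. The following is a minimal sketch, not the project's code: the `sessions` table, the temporary `state.db` path, and the 0.1s timeout are assumptions chosen to keep the demo fast.

```python
import os
import sqlite3
import tempfile


def demo_lock_contention() -> str:
    """Return the error text a second writer sees while the write lock is held."""
    path = os.path.join(tempfile.mkdtemp(), "state.db")  # throwaway file, not the real state.db

    # First connection stands in for the CLI: autocommit mode so we control BEGIN/ROLLBACK.
    cli = sqlite3.connect(path, isolation_level=None)
    cli.execute("PRAGMA journal_mode=WAL")
    cli.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")

    # Second connection stands in for the gateway, with a deliberately short timeout.
    gateway = sqlite3.connect(path, timeout=0.1, isolation_level=None)

    cli.execute("BEGIN IMMEDIATE")  # take the single writer lock and keep it
    try:
        gateway.execute("INSERT INTO sessions (id) VALUES ('s1')")
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        cli.execute("ROLLBACK")
        cli.close()
        gateway.close()


print(demo_lock_contention())
```

WAL mode allows concurrent readers, but there is still only one writer at a time, which is why the second INSERT has to wait and eventually gives up.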

Fixes

1. Increase SQLite connection timeout: 10s → 30s (hermes_state.py)

Gives a blocked connection more time to wait while the process holding the write lock finishes a batch flush. Most lock contention resolves within seconds; 30s covers bursts of CLI flushes without being excessive.
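The effect of the `timeout` argument to `sqlite3.connect` can be sketched as follows. This is an illustration under assumed names (`try_write`, a one-column `sessions` table) and uses sub-second durations in place of the real 10s and 30s values so that it runs quickly.

```python
import os
import sqlite3
import tempfile
import threading
import time


def try_write(timeout_s: float, hold_s: float = 0.3) -> bool:
    """Return True if an INSERT succeeds while another connection holds the write lock for hold_s."""
    path = os.path.join(tempfile.mkdtemp(), "state.db")
    setup = sqlite3.connect(path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")
    setup.commit()
    setup.close()

    lock_taken = threading.Event()

    def hold_write_lock() -> None:
        # Plays the role of the process doing a batch flush.
        holder = sqlite3.connect(path, isolation_level=None)
        holder.execute("BEGIN IMMEDIATE")
        lock_taken.set()
        time.sleep(hold_s)
        holder.execute("COMMIT")
        holder.close()

    thread = threading.Thread(target=hold_write_lock)
    thread.start()
    lock_taken.wait()

    # The waiting side: SQLite retries internally until timeout_s elapses.
    waiter = sqlite3.connect(path, timeout=timeout_s, isolation_level=None)
    try:
        waiter.execute("INSERT INTO sessions (id) VALUES ('s1')")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        waiter.close()
        thread.join()


print(try_write(timeout_s=0.05))  # timeout shorter than the lock hold
print(try_write(timeout_s=5.0))   # timeout longer than the lock hold
```

A longer timeout only helps when the lock holder eventually releases; it does not remove the contention, which is why the remaining fixes are still needed.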

2. INSERT OR IGNORE in create_session (hermes_state.py)

Prevents IntegrityError on duplicate session IDs (e.g. gateway restarts while a CLI session with the same ID is still alive in the DB).
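A minimal sketch of the idempotent insert, assuming a simplified two-column `sessions` table (the real schema in hermes_state.py has more fields and is not shown in this PR description):

```python
import sqlite3


def create_session(conn: sqlite3.Connection, session_id: str, source: str) -> None:
    """Insert the session row, silently skipping it if the ID already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO sessions (id, source) VALUES (?, ?)",
        (session_id, source),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, source TEXT)")

create_session(conn, "abc", "cli")
create_session(conn, "abc", "gateway")  # duplicate ID: ignored, no IntegrityError

rows = conn.execute("SELECT id, source FROM sessions").fetchall()
print(rows)

# For comparison, a plain INSERT of the same ID raises.
try:
    conn.execute("INSERT INTO sessions (id, source) VALUES ('abc', 'gateway')")
    plain_insert_failed = False
except sqlite3.IntegrityError:
    plain_insert_failed = True
```

Note that OR IGNORE keeps the existing row untouched, so the first writer's values win.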

3. Don't null out _session_db on create_session failure (run_agent.py) — main fix

A transient lock at agent startup must not permanently disable session_search for the lifetime of that agent. _session_db now stays alive so subsequent flushes and searches work once the lock clears.
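The difference between the old and new exception handling can be sketched with stand-in classes. `FlakyDB`, `Agent`, and the `null_on_failure` flag are hypothetical names for illustration only; the actual handler lives in run_agent.py and is not reproduced here.

```python
import sqlite3


class FlakyDB:
    """Hypothetical stand-in for the session DB: create_session fails while 'locked'."""

    def __init__(self) -> None:
        self.locked = True
        self.sessions: set[str] = set()

    def create_session(self, session_id: str) -> None:
        if self.locked:
            raise sqlite3.OperationalError("database is locked")
        self.sessions.add(session_id)


class Agent:
    """Hypothetical agent showing both behaviours via null_on_failure."""

    def __init__(self, db: FlakyDB, session_id: str, null_on_failure: bool) -> None:
        self._session_db = db
        self.session_id = session_id
        try:
            db.create_session(session_id)
        except sqlite3.OperationalError:
            if null_on_failure:
                # Old behaviour: one transient failure drops the handle for good.
                self._session_db = None
            # New behaviour: keep the handle; the row is created later.

    def session_search(self, query: str) -> str:
        if self._session_db is None:
            return "Session database not available"
        return f"searching for {query!r}"


old_agent = Agent(FlakyDB(), "s1", null_on_failure=True)
new_agent = Agent(FlakyDB(), "s1", null_on_failure=False)
print(old_agent.session_search("notes"))
print(new_agent.session_search("notes"))
```

Because the agent is cached, the old branch affects every later message in the session, not only the one that hit the lock.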

4. New ensure_session() helper + call it during flush (hermes_state.py, run_agent.py)

ensure_session() uses INSERT OR IGNORE to create a minimal session row if it doesn't exist. _flush_messages_to_session_db calls it before appending messages, satisfying the FK constraint even when create_session() failed at startup. When the row already exists it's a no-op.
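A rough sketch of how the helper and the flush could fit together, assuming a simplified schema with a foreign key from `messages.session_id` to `sessions.id`. The `flush_messages` function, the column names, and the `"unknown"` default are illustrative assumptions, not the signatures used in hermes_state.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")  # SQLite does not enforce FKs unless asked to
conn.executescript(
    """
    CREATE TABLE sessions (id TEXT PRIMARY KEY, source TEXT);
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL REFERENCES sessions(id),
        content TEXT
    );
    """
)


def ensure_session(conn: sqlite3.Connection, session_id: str, source: str = "unknown") -> None:
    """Create a minimal session row if missing; no-op when it already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO sessions (id, source) VALUES (?, ?)",
        (session_id, source),
    )


def flush_messages(conn: sqlite3.Connection, session_id: str, messages: list[str]) -> None:
    """Append messages, making sure the parent session row exists first."""
    ensure_session(conn, session_id)
    conn.executemany(
        "INSERT INTO messages (session_id, content) VALUES (?, ?)",
        [(session_id, m) for m in messages],
    )
    conn.commit()


# Simulates the case where create_session() never succeeded at startup.
flush_messages(conn, "s1", ["hello"])
flush_messages(conn, "s1", ["again"])  # second call: ensure_session is a no-op

session_count = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
message_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(session_count, message_count)
```

Without the `ensure_session` call, the first message insert would fail the foreign key check because no parent row exists.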

Changes

| File | Change |
| --- | --- |
| hermes_state.py | timeout=30.0, INSERT OR IGNORE in create_session, new ensure_session() |
| run_agent.py | Remove _session_db = None nullification, call ensure_session() in _flush_messages_to_session_db |
| tests/test_hermes_state.py | 5 new tests in TestConcurrentWriteSafety |

Tests

125 passed  (120 pre-existing + 5 new)

New tests cover: idempotent create_session, ensure_session creates/no-ops, FK-safe flush after failed create_session, timeout value check.

Closes NousResearch#3139

Four layered fixes for the scenario where CLI and gateway write to
state.db concurrently, causing create_session() to fail with
'database is locked' and permanently disabling session_search on the
gateway side.

1. Increase SQLite connection timeout: 10s -> 30s
   hermes_state.py: longer window for the WAL writer to finish a batch
   flush before the other process gives up entirely.

2. INSERT OR IGNORE in create_session
   hermes_state.py: prevents IntegrityError on duplicate session IDs
   (e.g. gateway restarts while CLI session is still alive).

3. Don't null out _session_db on create_session failure  (main fix)
   run_agent.py: a transient lock at agent startup must not permanently
   disable session_search for the lifetime of that agent instance.
   _session_db now stays alive so subsequent flushes and searches work
   once the lock clears.

4. New ensure_session() helper + call it during flush
   hermes_state.py: INSERT OR IGNORE for a minimal session row.
   run_agent.py _flush_messages_to_session_db: calls ensure_session()
   before appending messages, so the FK constraint is satisfied even
   when create_session() failed at startup. No-op when the row exists.

@teknium1 (Contributor)

Merged via PR #3249. Your substantive commit (cb08454f) was cherry-picked onto current main with authorship preserved — all three fixes (30s timeout, INSERT OR IGNORE, stop nullifying _session_db) plus ensure_session ship together. Thanks @Mibayy!

@teknium1 teknium1 closed this Mar 26, 2026

Development

Successfully merging this pull request may close these issues.

session_search permanently disabled when CLI and gateway write state.db concurrently