fix(state): SQLite concurrency hardening + session transcript integrity by teknium1 · Pull Request #3249 · NousResearch/hermes-agent

teknium1 · 2026-03-26T20:29:13Z

Summary

Salvages three complementary SQLite concurrency and session integrity fixes into a single PR.

Fix 1: Release lock between context queries in search_messages (PR #3035 by @Kewe63)

search_messages() held a Python threading lock for the entire FTS5 query + all N per-match context fetches (O(N) sequential I/O). This blocked all other threads (message writes, session updates) for the full duration of a multi-result search.

Fix: move per-match context queries outside the outer lock, each acquiring its own short lock independently.
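
The narrowed lock scope can be sketched as follows. This is a minimal illustration, not the hermes_state code: the class name, the schema, and the LIKE query standing in for the FTS5 MATCH query are all assumptions made to keep the example self-contained.

```python
import sqlite3
import threading


class SketchSessionDB:
    """Hypothetical stand-in for the real session store."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False lets several threads share the connection;
        # the lock below serializes every access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS messages ("
                "id INTEGER PRIMARY KEY, session_id TEXT, content TEXT)"
            )
            self._conn.commit()

    def append_message(self, session_id, content):
        with self._lock:
            self._conn.execute(
                "INSERT INTO messages (session_id, content) VALUES (?, ?)",
                (session_id, content),
            )
            self._conn.commit()

    def search_messages(self, term, window=1):
        # First critical section: the primary match query only.
        # (LIKE is a stand-in for the FTS5 MATCH query.)
        with self._lock:
            matches = self._conn.execute(
                "SELECT id, session_id, content FROM messages "
                "WHERE content LIKE ? ORDER BY id",
                (f"%{term}%",),
            ).fetchall()

        results = []
        for msg_id, session_id, content in matches:
            # One short critical section per match, so writers waiting on
            # the lock can interleave between context fetches.
            with self._lock:
                rows = self._conn.execute(
                    "SELECT content FROM messages "
                    "WHERE session_id = ? AND id BETWEEN ? AND ? ORDER BY id",
                    (session_id, msg_id - window, msg_id + window),
                ).fetchall()
            results.append(
                {"id": msg_id, "match": content, "context": [r[0] for r in rows]}
            )
        return results
```

One consequence of this design: a context window may include a message written after the match query ran, since the lock is released between the two steps.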

Fix 2: Survive CLI/gateway concurrent write contention (PR #3180 by @Mibayy, closes #3139)

When the CLI and the gateway write to state.db concurrently, create_session() can fail with "database is locked". The exception handler set _session_db = None, permanently disabling session_search for the rest of that session.

Layered fix:

SQLite timeout 10s → 30s: gives the WAL writer time to finish batch flushes
INSERT OR IGNORE in create_session(): idempotent on duplicate session IDs
Stop nullifying _session_db on transient failures: keeps session_search alive
ensure_session() helper: lazily creates the session row during flush if startup creation failed
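
A rough sketch of how these layers could fit together, assuming the standard-library sqlite3 module. The schema, the free functions, and the SketchAgent class are hypothetical; only the names create_session, ensure_session, and _session_db come from the PR.

```python
import sqlite3


def connect(path):
    # timeout = seconds sqlite3 waits on a locked database before raising
    # "database is locked" (30s here, per the PR).
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute("PRAGMA foreign_keys=ON")
    conn.execute("PRAGMA journal_mode=WAL")
    # Assumed minimal schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, source TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, "
        "session_id TEXT NOT NULL REFERENCES sessions(id), "
        "content TEXT)"
    )
    conn.commit()
    return conn


def create_session(conn, session_id, source="cli"):
    # OR IGNORE turns a repeated insert of the same id into a no-op
    # instead of raising sqlite3.IntegrityError.
    conn.execute(
        "INSERT OR IGNORE INTO sessions (id, source) VALUES (?, ?)",
        (session_id, source),
    )
    conn.commit()


def ensure_session(conn, session_id):
    # Minimal row so the messages foreign key is satisfied; no-op if the
    # row already exists.
    conn.execute("INSERT OR IGNORE INTO sessions (id) VALUES (?)", (session_id,))
    conn.commit()


def flush_messages(conn, session_id, new_messages):
    # Lazily create the session row in case startup creation failed.
    ensure_session(conn, session_id)
    conn.executemany(
        "INSERT INTO messages (session_id, content) VALUES (?, ?)",
        [(session_id, m) for m in new_messages],
    )
    conn.commit()


class SketchAgent:
    """Hypothetical agent wrapper showing the 'keep the handle' rule."""

    def __init__(self, conn, session_id):
        self._session_db = conn
        self.session_id = session_id
        try:
            create_session(conn, session_id)
        except sqlite3.OperationalError:
            # Transient lock: keep self._session_db so later flushes and
            # searches still work. The pre-fix handler set it to None here.
            pass
```

With this shape, a failed create_session() at startup costs nothing permanent: the first successful flush recreates the missing row through ensure_session().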

Fix 3: Prefer longer source in load_transcript (PR #3221 by @Mibayy, closes #3212)

load_transcript() trusted SQLite unconditionally when it had any rows, even if JSONL had a more complete history. This caused silent context truncation for:

Sessions pre-dating the SQLite layer
Sessions where _session_db was nulled (the bug Fix 2 addresses)
Sessions after a DB reset/replacement

Fix: load both sources, return whichever has more messages. For fully-migrated sessions SQLite ≥ JSONL, so this is a no-op. The extra JSONL read (sequential, in page cache for active sessions) is negligible.
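
The selection rule can be written in a few lines. This sketch assumes both sources have already been read into lists; the function name pick_transcript is hypothetical, and the tie-break toward SQLite follows the test description in this PR (equal length prefers SQLite).

```python
def pick_transcript(sqlite_messages, jsonl_messages):
    """Return whichever source holds more messages; ties go to SQLite."""
    sqlite_messages = sqlite_messages or []
    jsonl_messages = jsonl_messages or []
    # Strictly-greater comparison: on equal length SQLite wins, since its
    # rows carry the richer reasoning fields.
    if len(jsonl_messages) > len(sqlite_messages):
        return jsonl_messages
    return sqlite_messages
```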

How the three fixes interlock for #3212

Two independent failure paths caused the same symptom (context lost mid-conversation):

Path A (Fix 2): Concurrent writes → create_session() fails → _session_db = None → no SQLite flushes → the next agent instance writes only the new turn → SQLite has 4 rows → load_transcript returns 4 instead of 994.

Path B (Fix 3): Legacy session pre-dates SQLite → _flush_messages_to_session_db skips conversation_history (assumes it is already in SQLite) → writes only 2 new messages → on the next turn SQLite has 2 rows → load_transcript returns 2 instead of 994.

Fix 2 prevents Path A. Fix 3 prevents Path B. Together they fully resolve #3212.
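
Path B can be replayed as a small simulation using the numbers above. Everything here is a simplified model under stated assumptions: plain lists stand in for the two stores, the flush writes only the two new messages to SQLite, and the JSONL file is assumed to be appended to each turn.

```python
def load_old(sqlite_rows, jsonl_rows):
    # Pre-fix policy: trust SQLite whenever it has any rows.
    return sqlite_rows if sqlite_rows else jsonl_rows


def load_new(sqlite_rows, jsonl_rows):
    # Post-fix policy: return the longer source, ties going to SQLite.
    return jsonl_rows if len(jsonl_rows) > len(sqlite_rows) else sqlite_rows


def history_seen_over_two_turns(load):
    jsonl = [f"m{i}" for i in range(994)]  # legacy JSONL-only history
    sqlite = []                            # SQLite has no rows for this session
    seen = []
    for turn in range(2):
        history = load(sqlite, jsonl)
        seen.append(len(history))
        new = [f"turn{turn}-user", f"turn{turn}-assistant"]
        sqlite = sqlite + new  # flush skips history, writes only the new pair
        jsonl = jsonl + new    # assumption: JSONL is appended each turn
    return seen


print(history_seen_over_two_turns(load_old))  # [994, 2]
print(history_seen_over_two_turns(load_new))  # [994, 996]
```

Under the old policy the second turn sees only the 2 rows that the first flush wrote; under the new policy JSONL (996) outranks SQLite (2) and the full history is returned.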

Test plan

125 hermes_state tests pass (including 5 new concurrency safety tests from PR #3180)
53 gateway/session tests pass (including 5 new load_transcript source-preference tests)
Full suite: 6214 passed (1 pre-existing failure in test_429_exhausts_all_retries)

Attribution

Cherry-picked with original authorship preserved:

58a17fca by @Kewe63 (PR #3035, fix(state): release lock between context queries in search_messages)
cb08454f by @Mibayy (PR #3180, fix(session-db): survive CLI/gateway concurrent write contention)
a9466c46 by @Mibayy (PR #3221, fix(session): prefer longer source in load_transcript to prevent legacy truncation)

github-actions · 2026-03-26T20:29:25Z

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

70:+        with urllib.request.urlopen(req, timeout=10) as resp:

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

Closes #3139

Three layered fixes for the scenario where CLI and gateway write to state.db concurrently, causing create_session() to fail with 'database is locked' and permanently disabling session_search on the gateway side.

1. Increase SQLite connection timeout: 10s -> 30s
   hermes_state.py: longer window for the WAL writer to finish a batch flush before the other process gives up entirely.

2. INSERT OR IGNORE in create_session
   hermes_state.py: prevents IntegrityError on duplicate session IDs (e.g. gateway restarts while CLI session is still alive).

3. Don't null out _session_db on create_session failure (main fix)
   run_agent.py: a transient lock at agent startup must not permanently disable session_search for the lifetime of that agent instance. _session_db now stays alive so subsequent flushes and searches work once the lock clears.

4. New ensure_session() helper + call it during flush
   hermes_state.py: INSERT OR IGNORE for a minimal session row.
   run_agent.py _flush_messages_to_session_db: calls ensure_session() before appending messages, so the FK constraint is satisfied even when create_session() failed at startup. No-op when the row exists.

The context-window queries (one per FTS5 match) were running inside the same lock acquisition as the primary FTS5 query, holding the lock for O(N) sequential SQLite round-trips. Move per-match context fetches outside the outer lock block so each acquires the lock independently, keeping critical sections short and allowing other threads to interleave.

…cy truncation

When a long-lived session pre-dates SQLite storage (e.g. sessions created before the DB layer was introduced, or after a clean deployment that reset the DB), _flush_messages_to_session_db only writes the *new* messages from the current turn to SQLite: it skips messages already present in conversation_history, assuming they are already persisted.

That assumption fails for legacy JSONL-only sessions:

  Turn N (first after DB migration):
    load_transcript(id) → SQLite: 0 → falls back to JSONL: 994 ✓
    _flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2

  Turn N+1:
    load_transcript(id) → SQLite: 2 → returns immediately ✗
    Agent sees 2 messages of history instead of 996

The same pattern causes the reported symptom: session JSON truncated to 4 messages (_save_session_log writes agent.messages which only has 2 history + 2 new = 4).

Fix: always load both sources and return whichever is longer. For a fully-migrated session SQLite will always be ≥ JSONL, so there is no regression. For a legacy session that hasn't been bootstrapped yet, JSONL wins and the full history is restored.

Closes #3212

Covers: JSONL longer returns JSONL, SQLite longer returns SQLite, SQLite empty falls back to JSONL, both empty returns empty, equal length prefers SQLite (richer reasoning fields).

…ty (NousResearch#3249)

* fix(session-db): survive CLI/gateway concurrent write contention
* fix(state): release lock between context queries in search_messages
* fix(session): prefer longer source in load_transcript to prevent legacy truncation
* test: add load_transcript source preference tests for NousResearch#3212

Co-authored-by: Mibayy <mibayy@hermes.ai>
Co-authored-by: kewe63 <kewe.3217@gmail.com>
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>

teknium1 changed the title ~~fix(state): SQLite concurrency hardening — lock scope + write contention survival~~ fix(state): SQLite concurrency hardening + session transcript integrity Mar 26, 2026

Mibayy and others added 4 commits March 26, 2026 13:43

test: add load_transcript source preference tests for #3212

869399e

teknium1 force-pushed the hermes/hermes-ad9511d6 branch from 895641e to 869399e March 26, 2026 20:43

teknium1 merged commit b81d49d into main Mar 26, 2026
3 of 4 checks passed

This was referenced Mar 26, 2026

fix(state): release lock between context queries in search_messages #3035

Closed

fix(session-db): survive CLI/gateway concurrent write contention #3180

Closed

fix(session): prefer longer source in load_transcript to prevent legacy truncation #3221

Closed

DoubleDD mentioned this pull request May 11, 2026

RFC: Pluggable SessionDB Provider — PostgreSQL, MySQL, and Beyond #23717

Open
