session_search permanently disabled when CLI and gateway write state.db concurrently #3139

@chencheng-li

Bug

When running CLI (hermes --resume) and the Telegram gateway simultaneously, session_search becomes permanently unavailable on the gateway side.

Root Cause

state.db uses WAL mode, which allows concurrent readers but only one writer at a time. When the CLI and the gateway write concurrently, SQLite lock contention causes create_session() to fail with "database is locked" once the 10s timeout expires. The error handler in run_agent.py:895-897 then sets self._session_db = None, which permanently disables session_search for that agent instance. The gateway agent cache (gateway/run.py:5050) reuses this broken agent, so all subsequent messages in that session also lack session_search.

Steps to Reproduce

  1. Start the gateway: hermes gateway
  2. Start a CLI session: hermes --resume <session_id> (or just hermes and do some work)
  3. While CLI is actively making tool calls (frequent DB writes), send a message on Telegram
  4. The Telegram agent tries create_session() → SQLite returns database is locked
  5. _session_db is set to None → session_search returns "Session database not available"
  6. Due to agent cache, all subsequent Telegram messages in this session also fail

Evidence

  • Gateway log shows 🔍 recall "today" 0.0s [error] after the failure
  • New Telegram sessions appear in file-based ~/.hermes/sessions/*.json but NOT in state.db
  • Restarting the gateway does not fix it if the CLI is still writing
  • Direct test confirms the lock: sqlite3 state.db "INSERT INTO sessions ..." hangs when CLI is active
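The lock contention itself can be reproduced outside hermes with nothing but Python's standard sqlite3 module. The sketch below is not hermes code: the sessions schema, the function name and the short timeout are assumptions chosen to keep the demo fast. One connection plays the CLI and holds the write lock, while a second plays the gateway and tries to insert.

```python
import os
import sqlite3
import tempfile


def demo_lock_contention(timeout_s: float = 0.2) -> str:
    """Return the error a second writer sees while a first holds the write lock."""
    path = os.path.join(tempfile.mkdtemp(), "state.db")

    # isolation_level=None puts the connection in autocommit mode so that
    # transactions are controlled explicitly with BEGIN / ROLLBACK.
    writer_a = sqlite3.connect(path, timeout=timeout_s, isolation_level=None)
    writer_a.execute("PRAGMA journal_mode=WAL")
    # Hypothetical minimal schema; the real sessions table has more columns.
    writer_a.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")

    writer_b = sqlite3.connect(path, timeout=timeout_s, isolation_level=None)

    # "CLI" side: take the write lock and keep the transaction open.
    writer_a.execute("BEGIN IMMEDIATE")
    writer_a.execute("INSERT INTO sessions (id) VALUES ('cli-session')")
    try:
        # "Gateway" side: waits up to timeout_s for the lock, then fails.
        writer_b.execute("INSERT INTO sessions (id) VALUES ('gateway-session')")
        return "no error"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        writer_a.execute("ROLLBACK")
        writer_a.close()
        writer_b.close()


if __name__ == "__main__":
    print(demo_lock_contention())
```

With the real timeout=10.0 the second writer would block for ten seconds before raising the same error, which matches the hang seen in the direct sqlite3 test.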

Suggested Fix

Several options (not mutually exclusive):

  1. Don't null out _session_db on create_session failure — retry or use INSERT OR IGNORE/INSERT OR REPLACE instead of bare INSERT
  2. Increase SQLite timeout — 10s is too short when CLI is doing frequent flushes; 30-60s would help
  3. Retry with backoff in create_session before giving up
  4. Document the limitation — the sessions doc says WAL "suits the gateway's multi-platform architecture" but doesn't mention CLI concurrent usage
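Options 1 and 3 could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the actual hermes implementation: create_session_with_retry, AgentSketch, ensure_session and the one-column sessions schema are all hypothetical names introduced for illustration. The point is that a lock error leads to a bounded retry, and a final failure leaves the database handle in place so a later call can try again.

```python
import sqlite3
import time


def create_session_with_retry(conn, session_id, attempts=5, base_delay=0.05):
    """Insert a session row, retrying on lock contention.

    Returns True once the row exists, False if every attempt hit a lock.
    """
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on error
                # INSERT OR IGNORE makes a repeated call for the same id a no-op.
                conn.execute(
                    "INSERT OR IGNORE INTO sessions (id) VALUES (?)",
                    (session_id,),
                )
            return True
        except sqlite3.OperationalError as exc:
            message = str(exc).lower()
            if "locked" not in message and "busy" not in message:
                raise  # unrelated error, do not hide it
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False


class AgentSketch:
    """Hypothetical stand-in for the agent; keeps its DB handle on failure."""

    def __init__(self, conn, session_id):
        self._session_db = conn  # never set to None because of a lock error
        self._session_id = session_id
        self._session_registered = create_session_with_retry(conn, session_id)

    def ensure_session(self):
        """Retry registration lazily, e.g. before each session_search call."""
        if not self._session_registered:
            self._session_registered = create_session_with_retry(
                self._session_db, self._session_id
            )
        return self._session_registered
```

Because the handle survives a failed registration, a cached agent would recover on its own once the other writer releases the lock, instead of staying broken for the rest of the session.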

Environment

  • macOS, hermes-agent from git
  • CLI and gateway running simultaneously (common workflow)
  • SQLite WAL mode, timeout=10.0s

Relevant Code

  • run_agent.py:884-897 — create_session failure nulls _session_db
  • hermes_state.py:257 — bare INSERT INTO sessions (no conflict handling)
  • gateway/run.py:5044-5052 — agent cache reuses broken agent
  • hermes_state.py:124-128 — timeout=10.0
