You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
RFC: Pluggable SessionDB Provider — PostgreSQL, MySQL, and Beyond
Tracking Issue: (to be created)
1. Problem Statement
1.1 The Hot-Update Death Spiral
When running Hermes and simultaneously updating it (git pull/hermes update), the shared state.db SQLite file inevitably suffers:
Process A (old version) has an open SQLite connection with WAL mode, mid-write
Process B (new version) starts, tries to connect → database is locked
WAL checkpoint is interrupted mid-flight by SIGTERM
Result: state.db is corrupted, session_search permanently broken until manual recovery
This is not a theoretical edge case — it's the normal development workflow for anyone running Hermes from source.
1.2 SQLite's Fundamental Limits in Multi-Process Hermes
Hermes runs multiple processes sharing one state.db:
Process
Writes
Reads
CLI session
✅ Every turn
✅ session_search
Gateway
✅ Every turn
✅ session_search
Cron scheduler
✅ Job results
✅ Job config
TUI
✅ Session list
✅ Session list
API server
✅ Session creation
✅ Query
SQLite WAL mode allows concurrent reads, but writes are serialized at the OS level. Even with the current jitter-retry + BEGIN IMMEDIATE design, contention causes:
SessionDB is lazy-imported and instantiated in 6+ call sites:
File
Pattern
cli.py (×6)
from hermes_state import SessionDB; self._session_db = SessionDB()
gateway/session.py
self._db = SessionDB()
gateway/mirror.py
db = SessionDB()
gateway/platforms/api_server.py
self._session_db = SessionDB()
cron/scheduler.py
_session_db = SessionDB()
Each call site creates a new independent connection to the same SQLite file — this is the source of cross-process write contention.
2.3 Contention Mitigations (Already Applied)
The current code has sophisticated mitigations from PR #3249 (@teknium1, merged):
WAL mode with passive checkpoint every 50 writes
Jitter retry: 15 attempts, 20-150ms random backoff per attempt
BEGIN IMMEDIATE to surface lock contention at transaction start
INSERT OR IGNORE in create_session for idempotency
ensure_session() lazy helper for flush-time recovery
Never nullify _session_db on transient failures
These are band-aids, not a cure. A real RDBMS solves the problem at the architecture level.
3. Proposed Solution
3.1 Pluggable SessionDB Provider (Following the MemoryProvider Pattern)
Following the existing MemoryProvider ABC pattern (agent/memory_provider.py, used by Holographic, Honcho, Hindsight, Mem0, etc.), introduce SessionDBProvider:
The current SessionDB API is synchronous. To minimize refactoring blast radius, the first iteration of SessionDBProvider keeps a synchronous interface. PostgreSQL/MySQL async drivers (asyncpg, aiomysql) can be wrapped synchronously or introduced in v2.
Decision 2: Search Adapter Strategy (the hardest part)
SQLite FTS5 is the most SQLite-specific feature. Mapping to PostgreSQL full-text search:
PostgreSQL pg_trgm handles CJK natively without special tokenization:
# Search with trigram similarityquery=""" SELECT m.*, s.source, s.title FROM messages m JOIN sessions s ON m.session_id = s.id WHERE m.content %(ilike)s %(param)s ORDER BY m.timestamp DESC LIMIT %(limit)s OFFSET %(offset)s"""# For short CJK queries: ILIKE with % wildcards# For longer queries: similarity() ranking
5. Migration Strategy
Phase 1: ABC + SQLite Provider (Minimal Change)
Extract SessionDBProvider ABC from existing SessionDB
These decisions are locked for v1 of the implementation:
Decision 1: Dual Search — pg_trgm + tsvector
PostgreSQL provider uses both search strategies:
pg_trgm GIN index — primary search backend. Handles CJK natively via trigram similarity (ILIKE, % wildcards), fuzzy matching, and substring search. Covers the full FTS5 trigram tokenizer use case.
tsvector GIN index — secondary search for English. to_tsvector('english', content) @@ plainto_tsquery('english', query) for exact phrase matching and linguistic stemming.
The provider picks the best index based on query content: if the query contains CJK codepoints, use pg_trgm; otherwise use tsvector. This matches the current _contains_cjk() heuristic in hermes_state.py.
Decision 2: One-Shot CLI Migration
hermes sessiondb migrate — explicit, safe, auditable. Never auto-migrate on startup (too much risk of accidental data loss). The command:
Reads all data from the sourcestate.db
Ensures the target PostgreSQL schema exists (CREATE TABLE IF NOT EXISTS)
Prints a summary: "Migrated 142 sessions, 15,863 messages, 44 FTS entries in 2.3s"
Decision 3: Synchronous API (v1), Async Later
SessionDBProvider exposes a synchronous interface. All 6+ current call sites (cli.py, gateway/session.py, etc.) are synchronous — forcing async on v1 would require rewriting the entire call chain with asyncio.run() wrappers, adding complexity with zero immediate benefit.
Async support (async def append_message(...)) is a v2 concern, tracked separately.
Decision 4: Provider-Owned Connection Pool
The PostgreSQL provider owns its connection pool via DSN. Simple, self-contained, no external lifecycle management:
Callers who want custom pool management can pass a pre-created pool via initialize(pool=...) — the provider checks for this first, falls back to DSN-based pool creation.
Decision 5: Kanban.db is Phase 2
This RFC covers SessionDB only (state.db). Kanban migration is tracked as a follow-up issue. Rationale:
SessionDB has the highest contention and data-loss impact (43MB, every-turn writes)
Kanban is read-mostly with fewer concurrent writers
Splitting the scope keeps the first PR reviewable (~2-3K lines vs ~5K+)
RFC: Pluggable SessionDB Provider — PostgreSQL, MySQL, and Beyond
Tracking Issue: (to be created)
1. Problem Statement
1.1 The Hot-Update Death Spiral
When running Hermes and simultaneously updating it (
git pull/hermes update), the sharedstate.dbSQLite file inevitably suffers:database is lockedstate.dbis corrupted,session_searchpermanently broken until manual recoveryThis is not a theoretical edge case — it's the normal development workflow for anyone running Hermes from source.
1.2 SQLite's Fundamental Limits in Multi-Process Hermes
Hermes runs multiple processes sharing one
state.db:SQLite WAL mode allows concurrent reads, but writes are serialized at the OS level. Even with the current jitter-retry +
BEGIN IMMEDIATEdesign, contention causes:session_searchpermanently disabled on lock timeoutstate.dbcorruption during normal use/resume1.3 The Hard Update Problem (Undocumented)
When
hermes updateruns:Between steps 3 and 4:
state.dbis at risk of corruption on every updateRelated issues:
hermes updatewipedstate.dbentirely (git clone instead of pull)hermes updateauto-restart kills in-process cron worker with no opt-out #6702 (open):hermes updatekills cron workers mid-write with no opt-out2. Current Architecture
2.1 SessionDB Class (hermes_state.py)
Single monolithic class:
SessionDBinhermes_state.py(~2,966 lines, ~127KB).Schema (single SQLite file):
Public interface (~60 methods):
2.2 Usage Across Codebase
SessionDBis lazy-imported and instantiated in 6+ call sites:cli.py(×6)from hermes_state import SessionDB; self._session_db = SessionDB()gateway/session.pyself._db = SessionDB()gateway/mirror.pydb = SessionDB()gateway/platforms/api_server.pyself._session_db = SessionDB()cron/scheduler.py_session_db = SessionDB()Each call site creates a new independent connection to the same SQLite file — this is the source of cross-process write contention.
2.3 Contention Mitigations (Already Applied)
The current code has sophisticated mitigations from PR #3249 (@teknium1, merged):
BEGIN IMMEDIATEto surface lock contention at transaction startINSERT OR IGNOREincreate_sessionfor idempotencyensure_session()lazy helper for flush-time recovery_session_dbon transient failuresThese are band-aids, not a cure. A real RDBMS solves the problem at the architecture level.
3. Proposed Solution
3.1 Pluggable SessionDB Provider (Following the MemoryProvider Pattern)
Following the existing
MemoryProviderABC pattern (agent/memory_provider.py, used by Holographic, Honcho, Hindsight, Mem0, etc.), introduceSessionDBProvider:3.2 SessionDBProvider ABC
3.3 Key Design Decisions
Decision 1: Synchronous API (for now)
The current
SessionDBAPI is synchronous. To minimize refactoring blast radius, the first iteration ofSessionDBProviderkeeps a synchronous interface. PostgreSQL/MySQL async drivers (asyncpg,aiomysql) can be wrapped synchronously or introduced in v2.Decision 2: Search Adapter Strategy (the hardest part)
SQLite FTS5 is the most SQLite-specific feature. Mapping to PostgreSQL full-text search:
MATCH 'foo bar'to_tsvector('english', content) @@ plainto_tsquery('english', 'foo bar')MATCH '"exact phrase"'phraseto_tsquery('english', 'exact phrase')pg_trgmextension +ILIKE/similarity()tsvectorgenerated column)Fallback: Providers that cannot implement FTS5-equivalent search may use
ILIKE/LIKEas a degraded mode.Decision 3: Connection Pooling
Unlike SQLite (single file, one writer), PostgreSQL/MySQL use connection pools. The provider should accept an existing pool or DSN:
Decision 4: Schema Migration
Each provider manages its own schema. Initial migration is DDL + data import from SQLite:
This is a one-shot CLI command, not auto-migration on startup.
4. PostgreSQL Provider Design
The primary target. PostgreSQL solves every SQLite pain point:
READ COMMITTED+ connection pool = N concurrent writerssynchronous_committunablepg_trgmextension —SELECT ... ILIKE '%中文%'Schema (PostgreSQL)
CJK Search Strategy
PostgreSQL
pg_trgmhandles CJK natively without special tokenization:5. Migration Strategy
Phase 1: ABC + SQLite Provider (Minimal Change)
SessionDBProviderABC from existingSessionDBSessionDB→SQLiteSessionDBProviderPhase 2: PostgreSQL Provider
PostgreSQLSessionDBProvidersessiondb.provider: postgresqlin confighermes sessiondb migratecommandPhase 3: MySQL Provider (Nice-to-Have)
MySQLSessionDBProviderPhase 4: Async API (Future)
async def append_message(...)etc.6. Configuration
7. Implementation Plan
SessionDBhermes_state.pySessionDB→SQLiteSessionDBProvidercli.py,gateway/,cron/sessiondb.providerconfig keyhermes_cli/config.pyplugins/sessiondb/postgresql/hermes sessiondb migrateCLIhermes_cli/commands.pyplugins/sessiondb/mysql/website/docs/Estimated Total: ~2-3 weeks (part-time)
8. Benefits
pg_trgmnativepg_dump,pgBackRest, Barman9. Design Decisions (Final)
These decisions are locked for v1 of the implementation:
Decision 1: Dual Search —
pg_trgm+tsvectorPostgreSQL provider uses both search strategies:
pg_trgmGIN index — primary search backend. Handles CJK natively via trigram similarity (ILIKE,%wildcards), fuzzy matching, and substring search. Covers the full FTS5 trigram tokenizer use case.tsvectorGIN index — secondary search for English.to_tsvector('english', content) @@ plainto_tsquery('english', query)for exact phrase matching and linguistic stemming.The provider picks the best index based on query content: if the query contains CJK codepoints, use
pg_trgm; otherwise usetsvector. This matches the current_contains_cjk()heuristic inhermes_state.py.Decision 2: One-Shot CLI Migration
hermes sessiondb migrate— explicit, safe, auditable. Never auto-migrate on startup (too much risk of accidental data loss). The command:state.dbDecision 3: Synchronous API (v1), Async Later
SessionDBProviderexposes a synchronous interface. All 6+ current call sites (cli.py,gateway/session.py, etc.) are synchronous — forcing async on v1 would require rewriting the entire call chain withasyncio.run()wrappers, adding complexity with zero immediate benefit.Async support (
async def append_message(...)) is a v2 concern, tracked separately.Decision 4: Provider-Owned Connection Pool
The PostgreSQL provider owns its connection pool via DSN. Simple, self-contained, no external lifecycle management:
Callers who want custom pool management can pass a pre-created pool via
initialize(pool=...)— the provider checks for this first, falls back to DSN-based pool creation.Decision 5: Kanban.db is Phase 2
This RFC covers SessionDB only (
state.db). Kanban migration is tracked as a follow-up issue. Rationale:10. Related Issues
hermes updateauto-restart kills in-process cron worker with no opt-out #6702 (open): hermes update kills cron workers mid-writeDrafted by the community. All hands on deck. 🏴☠️