state.db FTS trigram index bloat: 70% of DB size is full-text indexes #22478

@DDC113

Problem

The state.db file grows rapidly due to dual FTS5 indexes on the messages table. On a moderately used instance (42K messages, 2.4K sessions):

| Component | Size | % of DB |
|---|---|---|
| messages data (content, tool_calls, reasoning) | 99MB | 19.6% |
| sessions data (system_prompt) | 45MB | 8.9% |
| FTS indexes (fts + fts_trigram) | 358MB | 70.8% |
| Other (indexes, overhead) | 3MB | 0.7% |
| Total | 505MB | 100% |

The messages_fts_trigram table alone consumes 247MB (49% of the entire DB), about 2.2x the size of the porter-tokenized messages_fts index (111MB).
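
The per-index figures above can be approximated directly from the database. The sketch below sums the blob lengths in each FTS5 `_data` shadow table, where FTS5 keeps its index segments. It counts payload bytes only and ignores the other shadow tables and page overhead, so treat the result as a lower bound. The table names come from this issue; the function names are illustrative.

```python
import sqlite3


def fts_index_bytes(conn: sqlite3.Connection, fts_table: str) -> int:
    """Approximate an FTS5 index's payload size from its `_data` shadow table."""
    # Identifiers cannot be bound as SQL parameters, so quote the name by hand.
    shadow = '"' + (fts_table + "_data").replace('"', '""') + '"'
    row = conn.execute(
        f"SELECT COALESCE(SUM(LENGTH(block)), 0) FROM {shadow}"
    ).fetchone()
    return int(row[0])


def compare_fts_sizes(conn, tables=("messages_fts", "messages_fts_trigram")):
    """Return {table_name: payload_bytes} for each FTS5 table."""
    return {name: fts_index_bytes(conn, name) for name in tables}
```

Usage would be `compare_fts_sizes(sqlite3.connect("state.db"))`; dividing the two values gives the trigram-to-porter ratio.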

Root Cause

  1. Dual FTS indexes: Every message is indexed twice:

    • messages_fts (porter tokenizer) — 111MB
    • messages_fts_trigram (trigram tokenizer) — 247MB
  2. Trigram tokenizer is expensive for CJK: Chinese text produces significantly more trigram tokens than English, inflating the index. The trigram index is 2.2x larger than the porter stemmer index despite indexing the same data.

  3. system_prompt stored per-session: 2.4K sessions × ~17KB system_prompt ≈ 38.6MB. Many sessions share nearly identical prompts (same model + similar config), but each stores a full copy.
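
The duplication described in item 3 could be avoided by storing each distinct prompt once and referencing it from the session row. This is a minimal sketch, not the project's schema: the column names (`sha256`, `body`, `system_prompt_id`) and the helper functions are hypothetical. Hashing only collapses byte-identical prompts, so "nearly identical" prompts would still be stored separately.

```python
import hashlib
import sqlite3

# Hypothetical schema: one row per distinct prompt, referenced by sessions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS system_prompts (
    id     INTEGER PRIMARY KEY,
    sha256 TEXT NOT NULL UNIQUE,
    body   TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sessions (
    id               TEXT PRIMARY KEY,
    system_prompt_id INTEGER NOT NULL REFERENCES system_prompts(id)
);
"""


def intern_prompt(conn: sqlite3.Connection, body: str) -> int:
    """Store `body` once and return the id of its row."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    # The UNIQUE constraint on sha256 turns a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO system_prompts (sha256, body) VALUES (?, ?)",
        (digest, body),
    )
    row = conn.execute(
        "SELECT id FROM system_prompts WHERE sha256 = ?", (digest,)
    ).fetchone()
    return row[0]


def create_session(conn: sqlite3.Connection, session_id: str, system_prompt: str) -> int:
    """Insert a session that points at the shared prompt row."""
    prompt_id = intern_prompt(conn, system_prompt)
    conn.execute(
        "INSERT INTO sessions (id, system_prompt_id) VALUES (?, ?)",
        (session_id, prompt_id),
    )
    return prompt_id


def load_prompt(conn: sqlite3.Connection, session_id: str) -> str:
    """Read a session's prompt back through the foreign key."""
    row = conn.execute(
        "SELECT p.body FROM sessions s "
        "JOIN system_prompts p ON p.id = s.system_prompt_id WHERE s.id = ?",
        (session_id,),
    ).fetchone()
    return row[0]
```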

Growth Rate

  • Daily: ~150 new sessions + ~2000 messages → +20MB/day
  • At this rate: 600MB/month, 7.3GB/year
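
The projection above is plain linear extrapolation from the observed daily growth. The sketch below reproduces the arithmetic and adds a helper for estimating when a size limit would be reached. The constant-rate assumption and the helper names are illustrative, not measurements.

```python
import math

CURRENT_SIZE_MB = 505   # state.db size reported in this issue
DAILY_GROWTH_MB = 20    # ~150 new sessions + ~2000 messages per day


def growth_mb(days, daily_mb=DAILY_GROWTH_MB):
    """Growth after `days`, assuming the daily rate stays constant."""
    return daily_mb * days


def projected_size_mb(days, current_mb=CURRENT_SIZE_MB, daily_mb=DAILY_GROWTH_MB):
    """Expected database size after `days` of linear growth."""
    return current_mb + growth_mb(days, daily_mb)


def days_until(limit_mb, current_mb=CURRENT_SIZE_MB, daily_mb=DAILY_GROWTH_MB):
    """Whole days until the database reaches `limit_mb`."""
    if limit_mb <= current_mb:
        return 0
    return math.ceil((limit_mb - current_mb) / daily_mb)


print(growth_mb(30))     # 600  (MB per 30-day month)
print(growth_mb(365))    # 7300 (MB per year, i.e. 7.3GB)
print(days_until(1024))  # 26   (days until the file passes 1GB)
```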

Suggested Fixes

  1. Make trigram FTS optional: The porter stemmer FTS handles most English queries well. The trigram index is only needed for CJK substring search (3+ chars). Consider:

    • Adding a config option to disable trigram indexing
    • Or only building it on-demand when CJK search is used
  2. Normalize system_prompt storage: Store a deduplicated system_prompts table with a foreign key from sessions, eliminating redundant ~38MB.

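  3. Add VACUUM/PRAGMA: Consider PRAGMA auto_vacuum = INCREMENTAL (paired with periodic PRAGMA incremental_vacuum calls, since the mode on its own only tracks free pages) or a periodic VACUUM to reclaim space after session deletion.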

  4. Add a session retention/cleanup mechanism: Currently sessions grow indefinitely. A configurable TTL or max session count would help long-running instances.
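
Fix 1 could take the form of a config flag that decides whether the trigram table exists at all. The following is a sketch under several assumptions: the option name `enable_trigram_fts` is hypothetical, the `messages` table is assumed to have `id` and `content` columns, and the index is built as a standalone FTS5 table rather than however the project actually wires it (triggers to keep it in sync are omitted). The trigram tokenizer requires SQLite 3.34 or newer, so the function reports failure instead of raising when it is unavailable.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SearchConfig:
    enable_trigram_fts: bool = False  # hypothetical option name


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def sync_trigram_index(conn: sqlite3.Connection, config: SearchConfig) -> bool:
    """Create or drop messages_fts_trigram to match the config.

    Returns True if the trigram index exists afterwards.
    """
    if not config.enable_trigram_fts:
        # Dropping the virtual table also drops its shadow tables.
        conn.execute("DROP TABLE IF EXISTS messages_fts_trigram")
        return False
    if table_exists(conn, "messages_fts_trigram"):
        return True
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE messages_fts_trigram "
            "USING fts5(content, tokenize='trigram')"
        )
    except sqlite3.OperationalError:
        # FTS5 is not compiled in, or this SQLite predates the trigram tokenizer.
        return False
    # Backfill from the assumed messages(id, content) columns.
    conn.execute(
        "INSERT INTO messages_fts_trigram(rowid, content) "
        "SELECT id, content FROM messages"
    )
    return True
```

Building on demand would amount to calling `sync_trigram_index` with the flag set the first time a CJK substring query arrives. Note that disabling the flag only frees pages inside the file; the file itself shrinks after a VACUUM.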

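Fixes 3 and 4 work together: retention deletes rows, and vacuuming hands the freed pages back to the filesystem. The sketch below assumes hypothetical columns `sessions(id, created_at)` with Unix timestamps and `messages(session_id)`, and a connection opened with `isolation_level=None`, because VACUUM cannot run inside a transaction. On an existing database, switching `auto_vacuum` only takes effect after one full VACUUM.

```python
import sqlite3
import time


def enable_incremental_vacuum(conn: sqlite3.Connection) -> None:
    """One-time switch; the VACUUM rewrites the file so the new mode applies."""
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("VACUUM")


def prune_sessions(conn, ttl_seconds, max_sessions, now=None) -> int:
    """Delete sessions older than the TTL or beyond the newest `max_sessions`."""
    now = time.time() if now is None else now
    doomed = conn.execute(
        """
        SELECT id FROM sessions WHERE created_at < :cutoff
        UNION
        SELECT id FROM (
            SELECT id FROM sessions ORDER BY created_at DESC LIMIT -1 OFFSET :keep
        )
        """,
        {"cutoff": now - ttl_seconds, "keep": max_sessions},
    ).fetchall()
    ids = [(row[0],) for row in doomed]
    conn.executemany("DELETE FROM messages WHERE session_id = ?", ids)
    conn.executemany("DELETE FROM sessions WHERE id = ?", ids)
    return len(ids)


def reclaim_space(conn: sqlite3.Connection, max_pages: int = 0) -> int:
    """Release free pages (0 means all of them); returns pages released."""
    before = conn.execute("PRAGMA freelist_count").fetchone()[0]
    # fetchall() steps the pragma to completion so every page is processed.
    conn.execute(f"PRAGMA incremental_vacuum({int(max_pages)})").fetchall()
    after = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return before - after
```

A periodic job could call `prune_sessions` followed by `reclaim_space`. Passing a small `max_pages` keeps each run short on large databases. Any FTS rows that reference deleted messages would also need to be removed, which this sketch does not cover.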
Environment

  • Hermes Agent v0.13.0 (2026.5.7)
  • macOS, Python 3.11.15
  • state.db: 505MB, 42155 messages, 2375 sessions
  • Primary models: MiniMax-M2.7 (1915 sessions), glm-5-turbo (458 sessions)

Metadata

Labels

  • P2: Medium (degraded but workaround exists)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/perf: Performance improvement or optimization