Skip to content

feat(state): make trigram FTS5 index optional#27770

Open
cotrelllucia wants to merge 2 commits into
NousResearch:mainfrom
cotrelllucia:devin/1778345035-fts-trigram-optional
Open

feat(state): make trigram FTS5 index optional#27770
cotrelllucia wants to merge 2 commits into
NousResearch:mainfrom
cotrelllucia:devin/1778345035-fts-trigram-optional

Conversation

@cotrelllucia

Copy link
Copy Markdown

What does this PR do?

Makes the messages_fts_trigram virtual table optional. The trigram FTS5 index is only used to serve CJK substring queries with three or more characters. On instances that never run such queries it typically accounts for ~50 % of state.db size because trigram tokens expand more aggressively for CJK text than porter stemming does for English. Pure-English deployments paid the storage cost for a feature they did not use, and there was no clean way to disable or reclaim it.

This change adds an explicit opt-out path — without breaking the existing CJK search behavior for users who want it — and exposes the maintenance hooks needed to reclaim space on databases that have already been bloated by the index.

The approach is intentionally minimal:

  • a single env var (HERMES_DISABLE_FTS_TRIGRAM) gates creation of the virtual table and triggers in both v10/v11 migration paths and the post-migration existence check;
  • SessionDB.drop_fts_trigram() lets operators reclaim space on existing databases by dropping the table and triggers and running VACUUM;
  • search_messages() already had a LIKE-fallback path for short (1–2 char) CJK queries, so the new behavior reuses it for longer CJK queries when the trigram table is absent — no new query path was introduced.

Related Issue

Refs #22478

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_state.py
    • Added module-level helpers _env_flag() and _fts_trigram_disabled() that read the new HERMES_DISABLE_FTS_TRIGRAM env var (1/true/yes/on).
    • In _init_schema():
      • the v10 backfill block is gated on _fts_trigram_disabled();
      • the v11 re-index block runs the porter-FTS recreation unconditionally and only recreates+backfills the trigram FTS when the flag is off;
      • the post-migration existence check creates the trigram virtual table only when the flag is off.
    • Added SessionDB._fts_trigram_available cache + _has_fts_trigram() runtime probe so search_messages() can route around a missing table without raising.
    • Added SessionDB.drop_fts_trigram() — drops the trigram triggers and table, then VACUUMs; idempotent on a database that already has the index dropped.
    • Added SessionDB.vacuum() — plain VACUUM for callers that want to defragment after large deletions.
    • In search_messages() the long-CJK branch now requires _has_fts_trigram(); otherwise it falls through to the existing LIKE substring path that was already used for 1–2 char CJK queries (no new query code path was introduced).
  • tests/test_hermes_state.py — new TestFTS5TrigramOptional class with 10 regression tests covering: env-var skip on fresh DB (table + triggers), porter FTS still works, long CJK queries fall back to LIKE when disabled, short CJK queries unaffected, INSERTs don't fail when triggers are missing, drop_fts_trigram() removes table + triggers, drop_fts_trigram() is idempotent, search_messages() routes long CJK to LIKE after the index is dropped, vacuum() runs cleanly.
  • website/docs/developer-guide/session-storage.md — new "Disabling the trigram FTS5 index" section explaining the env var, drop_fts_trigram(), and re-enable semantics.
  • .env.example — documented the new env var under a new SESSION DATABASE (state.db) section.

How to Test

# 1. Full hermes_state regression suite (220 tests, including 10 new ones)
scripts/run_tests.sh tests/test_hermes_state.py -q

# 2. Just the new tests
scripts/run_tests.sh tests/test_hermes_state.py::TestFTS5TrigramOptional -v

# 3. End-to-end on a fresh DB
HERMES_DISABLE_FTS_TRIGRAM=1 hermes -q "Test message"
sqlite3 ~/.hermes/state.db ".tables" | tr ' ' '\n' | grep fts
#   → only `messages_fts` is listed; `messages_fts_trigram` is absent.

# 4. Reclaim space on an existing bloated DB
python -c "from hermes_state import SessionDB; db = SessionDB(); db.drop_fts_trigram(); db.close()"
ls -la ~/.hermes/state.db
#   → file size drops by ~50 % on CJK-heavy English-only deployments.

CJK substring search still works after the env var is set (1–2 char and long queries both use the LIKE substring path) — see test_cjk_search_falls_back_to_like_when_trigram_disabled and test_short_cjk_search_works_when_trigram_disabled.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (feat(state): make trigram FTS5 index optional)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this feature (no unrelated commits)
  • I've run pytest tests/test_hermes_state.py -q and all 220 tests pass
  • I've added tests for my changes (10 new regression tests in TestFTS5TrigramOptional)
  • I've tested on my platform: Ubuntu 24.04, Python 3.11.12

Documentation & Housekeeping

  • I've updated relevant documentation (website/docs/developer-guide/session-storage.md, .env.example)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (env var, not config key)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — env var reads via os.environ, no platform-specific calls; ruff check . and scripts/check-windows-footguns.py --all both pass.
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/test_hermes_state.py::TestFTS5TrigramOptional -v
[gw0] PASSED tests/.../test_env_var_skips_trigram_table
[gw1] PASSED tests/.../test_porter_fts_still_works_when_trigram_disabled
[gw2] PASSED tests/.../test_short_cjk_search_works_when_trigram_disabled
[gw3] PASSED tests/.../test_drop_fts_trigram_removes_table_and_triggers
[gw0] PASSED tests/.../test_env_var_skips_trigram_triggers
[gw2] PASSED tests/.../test_inserts_dont_write_to_missing_trigram_table
[gw1] PASSED tests/.../test_cjk_search_falls_back_to_like_when_trigram_disabled
[gw3] PASSED tests/.../test_drop_fts_trigram_is_idempotent
[gw0] PASSED tests/.../test_search_after_drop_fts_trigram_routes_cjk_to_like
[gw2] PASSED tests/.../test_vacuum_runs_without_error

============================== 10 passed in 3.81s ==============================
$ scripts/run_tests.sh tests/test_hermes_state.py -q
220 passed in 2.69s

Adds an opt-out path for the messages_fts_trigram virtual table, which
roughly doubles state.db size on top of the porter index but is only
useful for CJK substring queries with three or more characters.

* HERMES_DISABLE_FTS_TRIGRAM=1 skips the trigram virtual table, its
  triggers, and the v10 backfill on fresh databases.
* SessionDB.drop_fts_trigram() drops the index, its triggers, and runs
  VACUUM to reclaim freed pages on existing databases. Idempotent.
* SessionDB.vacuum() exposes plain VACUUM for callers that just want
  to defragment after large deletions.
* search_messages() automatically falls back to LIKE for CJK queries
  with 3+ characters when the trigram table is unavailable, matching
  the existing 1-2 char path.

Documented in website/docs/developer-guide/session-storage.md and
.env.example. Adds 10 regression tests covering the env-var path, the
drop_fts_trigram() path, and CJK fallback behavior.

Refs NousResearch#22478
Adds the email→GitHub username mapping required by the
contributor-attribution CI check.
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/agent Core agent loop, run_agent.py, prompt builder labels May 18, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Successor of closed #22710 (same feature, same title). Addresses #22478 (state.db FTS trigram index bloat). Also related to open #20239 (external-content FTS5 mode).

teknium1 pushed a commit that referenced this pull request May 29, 2026
…imize'

The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of
incremental b-tree segments — one per trigger-driven insert batch. SQLite's
automerge caps at ~16 segments, so a long-lived store keeps scanning many
segments per MATCH and never collapses them unless the special 'optimize'
command runs. Nothing in the codebase ever ran it: vacuum() only fired after
a prune that deleted rows, and even then never merged FTS segments.

Changes:
- SessionDB.optimize_fts(): merges each FTS5 index to a single segment,
  probing for the (optional/lazy) trigram table first so it is safe to call
  unconditionally. Layout-only — search results and snippet() are unchanged.
- vacuum() now calls optimize_fts() before VACUUM so freed index pages are
  returned to the OS in the same pass.
- 'hermes sessions optimize' CLI subcommand for on-demand reclamation +
  segment compaction (previously there was no way to compact the store
  without a prune deleting rows), with before/after size reporting.

Benchmark (8000 msgs, fragmented to 8 segments/index):
- segments 8 -> 1 on both indexes
- porter MATCH 5.5x faster (0.449 -> 0.081 ms/q)
- trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q)
- 8000 matches before == 8000 after, identical row ids (no functional change)

Orthogonal to the structural FTS-size PRs (#20239 external-content,
#27770 optional trigram) — segment merge helps regardless of those.

Tests: TestOptimizeFts covers index count, search+snippet preservation,
missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
sradetzky pushed a commit to sradetzky/hermes-agent that referenced this pull request May 30, 2026
…imize'

The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of
incremental b-tree segments — one per trigger-driven insert batch. SQLite's
automerge caps at ~16 segments, so a long-lived store keeps scanning many
segments per MATCH and never collapses them unless the special 'optimize'
command runs. Nothing in the codebase ever ran it: vacuum() only fired after
a prune that deleted rows, and even then never merged FTS segments.

Changes:
- SessionDB.optimize_fts(): merges each FTS5 index to a single segment,
  probing for the (optional/lazy) trigram table first so it is safe to call
  unconditionally. Layout-only — search results and snippet() are unchanged.
- vacuum() now calls optimize_fts() before VACUUM so freed index pages are
  returned to the OS in the same pass.
- 'hermes sessions optimize' CLI subcommand for on-demand reclamation +
  segment compaction (previously there was no way to compact the store
  without a prune deleting rows), with before/after size reporting.

Benchmark (8000 msgs, fragmented to 8 segments/index):
- segments 8 -> 1 on both indexes
- porter MATCH 5.5x faster (0.449 -> 0.081 ms/q)
- trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q)
- 8000 matches before == 8000 after, identical row ids (no functional change)

Orthogonal to the structural FTS-size PRs (NousResearch#20239 external-content,
NousResearch#27770 optional trigram) — segment merge helps regardless of those.

Tests: TestOptimizeFts covers index count, search+snippet preservation,
missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…imize'

The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of
incremental b-tree segments — one per trigger-driven insert batch. SQLite's
automerge caps at ~16 segments, so a long-lived store keeps scanning many
segments per MATCH and never collapses them unless the special 'optimize'
command runs. Nothing in the codebase ever ran it: vacuum() only fired after
a prune that deleted rows, and even then never merged FTS segments.

Changes:
- SessionDB.optimize_fts(): merges each FTS5 index to a single segment,
  probing for the (optional/lazy) trigram table first so it is safe to call
  unconditionally. Layout-only — search results and snippet() are unchanged.
- vacuum() now calls optimize_fts() before VACUUM so freed index pages are
  returned to the OS in the same pass.
- 'hermes sessions optimize' CLI subcommand for on-demand reclamation +
  segment compaction (previously there was no way to compact the store
  without a prune deleting rows), with before/after size reporting.

Benchmark (8000 msgs, fragmented to 8 segments/index):
- segments 8 -> 1 on both indexes
- porter MATCH 5.5x faster (0.449 -> 0.081 ms/q)
- trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q)
- 8000 matches before == 8000 after, identical row ids (no functional change)

Orthogonal to the structural FTS-size PRs (NousResearch#20239 external-content,
NousResearch#27770 optional trigram) — segment merge helps regardless of those.

Tests: TestOptimizeFts covers index count, search+snippet preservation,
missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for tackling the state.db FTS bloat. The premise still holds on current main, but this needs rework before salvage.

Problems

  • The user-facing opt-out is a new non-secret HERMES_* env var in .env.example, but AGENTS.md:102-106 says behavioral settings should go in config.yaml, not .env.
  • Current main has newer FTS maintenance paths this old diff does not cover: _fts_trigger_count() and _rebuild_fts_indexes() still expect both messages_fts and messages_fts_trigram at hermes_state.py:811-839.
  • Current main already has SessionDB.optimize_fts() and a stronger SessionDB.vacuum() at hermes_state.py:4529-4604 from 3869525, so the simple vacuum() addition here is stale.

Suggested changes

  • Make the opt-out a config.yaml setting (or existing config mechanism) rather than a documented new HERMES_* env var.
  • Wire the disabled-trigram behavior through current main’s v10/v11 migrations, startup schema creation, trigger repair/counting, _rebuild_fts_indexes(), and CJK search_messages() fallback.
  • Reuse the existing optimize/vacuum machinery for reclaiming space.

Automated hermes-sweeper review.

alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
…imize'

The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of
incremental b-tree segments — one per trigger-driven insert batch. SQLite's
automerge caps at ~16 segments, so a long-lived store keeps scanning many
segments per MATCH and never collapses them unless the special 'optimize'
command runs. Nothing in the codebase ever ran it: vacuum() only fired after
a prune that deleted rows, and even then never merged FTS segments.

Changes:
- SessionDB.optimize_fts(): merges each FTS5 index to a single segment,
  probing for the (optional/lazy) trigram table first so it is safe to call
  unconditionally. Layout-only — search results and snippet() are unchanged.
- vacuum() now calls optimize_fts() before VACUUM so freed index pages are
  returned to the OS in the same pass.
- 'hermes sessions optimize' CLI subcommand for on-demand reclamation +
  segment compaction (previously there was no way to compact the store
  without a prune deleting rows), with before/after size reporting.

Benchmark (8000 msgs, fragmented to 8 segments/index):
- segments 8 -> 1 on both indexes
- porter MATCH 5.5x faster (0.449 -> 0.081 ms/q)
- trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q)
- 8000 matches before == 8000 after, identical row ids (no functional change)

Orthogonal to the structural FTS-size PRs (#20239 external-content,
#27770 optional trigram) — segment merge helps regardless of those.

Tests: TestOptimizeFts covers index count, search+snippet preservation,
missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P3 Low — cosmetic, nice to have type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants