
feat(state): make trigram FTS5 index optional #22710

Closed
cotrelllucia wants to merge 2 commits into NousResearch:main from cotrelllucia:devin/1778345035-fts-trigram-optional

Conversation

@cotrelllucia

What does this PR do?

Makes the messages_fts_trigram virtual table optional. The trigram FTS5 index is only used to serve CJK substring queries with three or more characters. On instances that never run such queries it typically accounts for ~50 % of state.db size because trigram tokens expand more aggressively for CJK text than porter stemming does for English. Pure-English deployments paid the storage cost for a feature they did not use, and there was no clean way to disable or reclaim it.

This change adds an explicit opt-out path — without breaking the existing CJK search behavior for users who want it — and exposes the maintenance hooks needed to reclaim space on databases that have already been bloated by the index.

The approach is intentionally minimal:

  • a single env var (HERMES_DISABLE_FTS_TRIGRAM) gates creation of the virtual table and its triggers in both the v10 and v11 migration paths and in the post-migration existence check;
  • SessionDB.drop_fts_trigram() lets operators reclaim space on existing databases by dropping the table and triggers and running VACUUM;
  • search_messages() already had a LIKE-fallback path for short (1–2 char) CJK queries, so the new behavior reuses it for longer CJK queries when the trigram table is absent — no new query path was introduced.

Related Issue

Refs #22478

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_state.py
    • Added module-level helpers _env_flag() and _fts_trigram_disabled() that read the new HERMES_DISABLE_FTS_TRIGRAM env var (1/true/yes/on).
    • In _init_schema():
      • the v10 backfill block is skipped when _fts_trigram_disabled() returns true;
      • the v11 re-index block runs the porter-FTS recreation unconditionally and only recreates+backfills the trigram FTS when the flag is off;
      • the post-migration existence check creates the trigram virtual table only when the flag is off.
    • Added SessionDB._fts_trigram_available cache + _has_fts_trigram() runtime probe so search_messages() can route around a missing table without raising.
    • Added SessionDB.drop_fts_trigram() — drops the trigram triggers and table, then VACUUMs; idempotent on a database that already has the index dropped.
    • Added SessionDB.vacuum() — plain VACUUM for callers that want to defragment after large deletions.
    • In search_messages() the long-CJK branch now requires _has_fts_trigram(); otherwise it falls through to the existing LIKE substring path that was already used for 1–2 char CJK queries (no new query code path was introduced).
  • tests/test_hermes_state.py — new TestFTS5TrigramOptional class with 10 regression tests covering: env-var skip on fresh DB (table + triggers), porter FTS still works, long CJK queries fall back to LIKE when disabled, short CJK queries unaffected, INSERTs don't fail when triggers are missing, drop_fts_trigram() removes table + triggers, drop_fts_trigram() is idempotent, search_messages() routes long CJK to LIKE after the index is dropped, vacuum() runs cleanly.
  • website/docs/developer-guide/session-storage.md — new "Disabling the trigram FTS5 index" section explaining the env var, drop_fts_trigram(), and re-enable semantics.
  • .env.example — documented the new env var under a new SESSION DATABASE (state.db) section.
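
The runtime probe and the drop path listed above can be sketched roughly as follows. This is a minimal stand-in, not the actual SessionDB code: the class name `SessionDBSketch`, the trigger names, and the `make_demo_db()` helper are hypothetical, and a plain table stands in for the FTS5 virtual table so the sketch runs on SQLite builds without trigram support.

```python
import sqlite3

TRIGRAM_TABLE = "messages_fts_trigram"
# Hypothetical trigger names; the real ones are defined in hermes_state.py.
TRIGRAM_TRIGGERS = (
    "messages_fts_trigram_ai",
    "messages_fts_trigram_ad",
    "messages_fts_trigram_au",
)


class SessionDBSketch:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._fts_trigram_available = None  # None means "not probed yet"

    def _has_fts_trigram(self) -> bool:
        """Probe sqlite_master once and cache the answer."""
        if self._fts_trigram_available is None:
            row = self._conn.execute(
                "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
                (TRIGRAM_TABLE,),
            ).fetchone()
            self._fts_trigram_available = row is not None
        return self._fts_trigram_available

    def drop_fts_trigram(self) -> None:
        """Drop triggers first, then the table, then VACUUM. Safe to repeat."""
        for trigger in TRIGRAM_TRIGGERS:
            self._conn.execute(f"DROP TRIGGER IF EXISTS {trigger}")
        self._conn.execute(f"DROP TABLE IF EXISTS {TRIGRAM_TABLE}")
        self._conn.commit()  # VACUUM cannot run inside an open transaction
        self._conn.execute("VACUUM")
        self._fts_trigram_available = False


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a stand-in trigram table and trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT);
        -- Plain table standing in for the FTS5 virtual table.
        CREATE TABLE messages_fts_trigram (content TEXT);
        CREATE TRIGGER messages_fts_trigram_ai AFTER INSERT ON messages BEGIN
            INSERT INTO messages_fts_trigram (rowid, content)
            VALUES (new.id, new.content);
        END;
        """
    )
    return conn
```

Dropping the triggers before the table matters in this sketch: a trigger on messages that still writes to a missing table would make every later INSERT fail. The `IF EXISTS` clauses are what make a second call a no-op.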

How to Test

# 1. Full hermes_state regression suite (220 tests, including 10 new ones)
scripts/run_tests.sh tests/test_hermes_state.py -q

# 2. Just the new tests
scripts/run_tests.sh tests/test_hermes_state.py::TestFTS5TrigramOptional -v

# 3. End-to-end on a fresh DB
HERMES_DISABLE_FTS_TRIGRAM=1 hermes -q "Test message"
sqlite3 ~/.hermes/state.db ".tables" | tr ' ' '\n' | grep fts
#   → only `messages_fts` is listed; `messages_fts_trigram` is absent.

# 4. Reclaim space on an existing bloated DB
python -c "from hermes_state import SessionDB; db = SessionDB(); db.drop_fts_trigram(); db.close()"
ls -la ~/.hermes/state.db
#   → file size typically drops by ~50 % (the trigram index's usual share of state.db).

CJK substring search still works after the env var is set (1–2 char and long queries both use the LIKE substring path) — see test_cjk_search_falls_back_to_like_when_trigram_disabled and test_short_cjk_search_works_when_trigram_disabled.
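
The routing behind that fallback can be sketched as below. This is an illustration under stated assumptions, not the code in search_messages(): the function names `choose_search_path()` and `like_search()`, the code point ranges used to detect CJK text, and the `messages(id, content)` schema are all hypothetical.

```python
import sqlite3


def _is_cjk(ch: str) -> bool:
    # Rough CJK test (an assumption): CJK ideographs, kana, Hangul syllables.
    return (
        "\u4e00" <= ch <= "\u9fff"
        or "\u3040" <= ch <= "\u30ff"
        or "\uac00" <= ch <= "\ud7af"
    )


def choose_search_path(query: str, trigram_available: bool) -> str:
    """Pick 'porter', 'trigram', or 'like' for a query."""
    cjk_chars = sum(1 for ch in query if _is_cjk(ch))
    if cjk_chars == 0:
        return "porter"  # non-CJK text keeps using the porter FTS index
    if cjk_chars >= 3 and trigram_available:
        return "trigram"  # long CJK query with the index present
    return "like"  # short CJK query, or trigram table absent


def like_search(conn: sqlite3.Connection, query: str) -> list:
    """Substring search via LIKE, escaping the LIKE wildcards in the query."""
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id FROM messages WHERE content LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    ).fetchall()
    return [row[0] for row in rows]
```

A LIKE substring scan cannot use an index and reads every row, so it trades query speed for the storage saved by dropping the trigram table.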

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (feat(state): make trigram FTS5 index optional)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this feature (no unrelated commits)
  • I've run pytest tests/test_hermes_state.py -q and all 220 tests pass
  • I've added tests for my changes (10 new regression tests in TestFTS5TrigramOptional)
  • I've tested on my platform: Ubuntu 24.04, Python 3.11.12

Documentation & Housekeeping

  • I've updated relevant documentation (website/docs/developer-guide/session-storage.md, .env.example)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (env var, not config key)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — env var reads via os.environ, no platform-specific calls; ruff check . and scripts/check-windows-footguns.py --all both pass.
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/test_hermes_state.py::TestFTS5TrigramOptional -v
[gw0] PASSED tests/.../test_env_var_skips_trigram_table
[gw1] PASSED tests/.../test_porter_fts_still_works_when_trigram_disabled
[gw2] PASSED tests/.../test_short_cjk_search_works_when_trigram_disabled
[gw3] PASSED tests/.../test_drop_fts_trigram_removes_table_and_triggers
[gw0] PASSED tests/.../test_env_var_skips_trigram_triggers
[gw2] PASSED tests/.../test_inserts_dont_write_to_missing_trigram_table
[gw1] PASSED tests/.../test_cjk_search_falls_back_to_like_when_trigram_disabled
[gw3] PASSED tests/.../test_drop_fts_trigram_is_idempotent
[gw0] PASSED tests/.../test_search_after_drop_fts_trigram_routes_cjk_to_like
[gw2] PASSED tests/.../test_vacuum_runs_without_error

============================== 10 passed in 3.81s ==============================
$ scripts/run_tests.sh tests/test_hermes_state.py -q
220 passed in 2.69s

@alt-glitch added the type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on May 9, 2026
@devin-ai-integration (bot) force-pushed the devin/1778345035-fts-trigram-optional branch from 14bb3bc to 439f6bf on May 10, 2026 16:10
@cotrelllucia force-pushed the devin/1778345035-fts-trigram-optional branch from 439f6bf to c8c14e7 on May 10, 2026 16:43
Adds an opt-out path for the messages_fts_trigram virtual table, which
roughly doubles state.db size on top of the porter index but is only
useful for CJK substring queries with three or more characters.

* HERMES_DISABLE_FTS_TRIGRAM=1 skips the trigram virtual table, its
  triggers, and the v10 backfill on fresh databases.
* SessionDB.drop_fts_trigram() drops the index, its triggers, and runs
  VACUUM to reclaim freed pages on existing databases. Idempotent.
* SessionDB.vacuum() exposes plain VACUUM for callers that just want
  to defragment after large deletions.
* search_messages() automatically falls back to LIKE for CJK queries
  with 3+ characters when the trigram table is unavailable, matching
  the existing 1-2 char path.

Documented in website/docs/developer-guide/session-storage.md and
.env.example. Adds 10 regression tests covering the env-var path, the
drop_fts_trigram() path, and CJK fallback behavior.

Refs NousResearch#22478

Adds the email→GitHub username mapping required by the
contributor-attribution CI check.
@cotrelllucia force-pushed the devin/1778345035-fts-trigram-optional branch from c8c14e7 to f8fb15c on May 10, 2026 16:48