fix(state): honour HERMES_DISABLE_FTS_TRIGRAM to skip CJK index#42190

Open: bilawalriaz wants to merge 1 commit into NousResearch:main from bilawalriaz:fix/optional-trigram-fts

What & why

Closes #22478.

The CJK trigram FTS5 index is currently created unconditionally when FTS5 is available. On large histories it accounts for roughly half of state.db: the user-reported numbers in #22478 show messages_fts_trigram alone at 247 MB (49% of a 505 MB DB), 2.2× the size of the porter-stemmer index over the same data.

The docstring on optimize_fts() already promised a HERMES_DISABLE_FTS_TRIGRAM opt-out, but no code path actually read the variable, so it was dead documentation. This PR wires the gate up everywhere it matters.

Changes

  • Add _trigram_fts_disabled() helper reading HERMES_DISABLE_FTS_TRIGRAM (accepts 1/true/yes/on, case-insensitive, whitespace-trimmed).
  • Split _FTS_TRIGGERS into _PORTER_FTS_TRIGGERS and _TRIGRAM_FTS_TRIGGERS. _drop_fts_triggers still iterates the full union so a user who flips the var back on doesn't leave stale triggers on disk.
  • Gate the v10 and v11 schema migrations, the normal-startup schema creation, _rebuild_fts_indexes, and _fts_trigger_count on the helper.
  • search_messages() now falls through to the LIKE-based CJK path when the trigram table is absent, so English / short-CJK / mixed queries still work after disabling the index.
  • New TestTrigramFtsDisabled class with 7 tests covering: env-var truthy parsing, default behaviour, table non-creation, trigger non-creation, English search still functional, and optimize_fts() returning 1 (porter only) instead of 2.

How to test

# 1. Confirm the new behaviour
HERMES_DISABLE_FTS_TRIGRAM=1 scripts/run_tests.sh tests/test_hermes_state.py -- -v -k TestTrigramFtsDisabled

# 2. Regressions on the FTS path
scripts/run_tests.sh tests/test_hermes_state.py
scripts/run_tests.sh tests/test_hermes_state.py tests/tools/test_session_search.py

To verify end-to-end space reclamation on a real DB:

cp ~/.hermes/state.db ~/.hermes/state.db.pre-trigram
echo "HERMES_DISABLE_FTS_TRIGRAM=1" >> ~/.hermes/.env
hermes   # trigram table gets dropped on startup
sqlite3 ~/.hermes/state.db "VACUUM"
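
To confirm the steps above had an effect, one option is to compare the file size and the presence of the trigram table before and after. The following is a rough sketch, not part of this PR: `trigram_report` is a hypothetical name, and it assumes only that the index table is called messages_fts_trigram, as reported in #22478.

```python
import os
import sqlite3


def trigram_report(db_path):
    """Return (file_size_bytes, trigram_table_present) for a state.db file."""
    conn = sqlite3.connect(db_path)
    try:
        # FTS5 virtual tables are listed in sqlite_master with type 'table'.
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            ("messages_fts_trigram",),
        ).fetchone()
    finally:
        conn.close()
    return os.path.getsize(db_path), row is not None
```

Running it against the state.db.pre-trigram backup and then against the vacuumed state.db should show the table gone and the file smaller, if the drop and VACUUM behaved as described.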

Platforms tested

  • Linux (Ubuntu)

Notes

  • Pre-existing failure in tests/agent/test_auxiliary_client.py (8 tests) is unrelated and reproduces on main before this branch.
  • No new dependencies. import os added to hermes_state.py.

The CJK trigram FTS5 index is currently created unconditionally when
FTS5 is available. On large histories the full-text indexes account for
the majority of state.db size (~70% per NousResearch#22478), and the
trigram index is the larger of the two. The docstring on optimize_fts
already referenced an env-var opt-out, but no code path checked it.

Introduce _trigram_fts_disabled() reading HERMES_DISABLE_FTS_TRIGRAM
(accepts 1/true/yes/on, case-insensitive). Gate the v10 / v11 schema
migrations, the normal-startup schema creation, _rebuild_fts_indexes,
and _fts_trigger_count on the helper. Split _FTS_TRIGGERS into the
porter and trigram halves so _drop_fts_triggers still cleans up
stale trigram triggers for users who flip the var back on.

search_messages() now falls through to the LIKE-based CJK path when
the trigram table is absent, so non-CJK and short-CJK queries still
work after disabling the index.
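
The fall-through described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `search_cjk` and `table_exists` are hypothetical names, the messages(id, content) schema is assumed, and the three-character threshold reflects the FTS5 trigram tokenizer's minimum match length rather than anything confirmed in the diff.

```python
import sqlite3


def table_exists(conn, name):
    """True if a table (including an FTS5 virtual table) named `name` exists."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def search_cjk(conn, query):
    """Use the trigram index when it exists, otherwise a LIKE scan."""
    if len(query) >= 3 and table_exists(conn, "messages_fts_trigram"):
        # Quote the query as a single FTS5 string literal.
        phrase = '"' + query.replace('"', '""') + '"'
        sql = ("SELECT rowid FROM messages_fts_trigram "
               "WHERE messages_fts_trigram MATCH ? ORDER BY rowid")
        return [r[0] for r in conn.execute(sql, (phrase,))]
    # Fallback: substring match, escaping LIKE wildcards in user input.
    escaped = (query.replace("\\", "\\\\")
                    .replace("%", "\\%")
                    .replace("_", "\\_"))
    sql = ("SELECT id FROM messages "
           "WHERE content LIKE ? ESCAPE '\\' ORDER BY id")
    return [r[0] for r in conn.execute(sql, (f"%{escaped}%",))]
```

The LIKE path is a full scan, so it trades query speed for disk space; that is the cost a user accepts by setting the variable.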

Fixes NousResearch#22478

Labels: P3 (Low: cosmetic, nice to have); type/feature (New feature or request)

Development

Successfully merging this pull request may close these issues.

state.db FTS trigram index bloat: 70% of DB size is full-text indexes
