feat(state): make trigram FTS5 index optional#22710
Closed
cotrelllucia wants to merge 2 commits into
Closed
Conversation
14bb3bc to
439f6bf
Compare
439f6bf to
c8c14e7
Compare
Adds an opt-out path for the messages_fts_trigram virtual table, which roughly doubles state.db size on top of the porter index but is only useful for CJK substring queries with three or more characters. * HERMES_DISABLE_FTS_TRIGRAM=1 skips the trigram virtual table, its triggers, and the v10 backfill on fresh databases. * SessionDB.drop_fts_trigram() drops the index, its triggers, and runs VACUUM to reclaim freed pages on existing databases. Idempotent. * SessionDB.vacuum() exposes plain VACUUM for callers that just want to defragment after large deletions. * search_messages() automatically falls back to LIKE for CJK queries with 3+ characters when the trigram table is unavailable, matching the existing 1-2 char path. Documented in website/docs/developer-guide/session-storage.md and .env.example. Adds 10 regression tests covering the env-var path, the drop_fts_trigram() path, and CJK fallback behavior. Refs NousResearch#22478
Adds the email→GitHub username mapping required by the contributor-attribution CI check.
c8c14e7 to
f8fb15c
Compare
19 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Makes the
messages_fts_trigramvirtual table optional. The trigram FTS5 index is only used to serve CJK substring queries with three or more characters. On instances that never run such queries it typically accounts for ~50 % ofstate.dbsize because trigram tokens expand more aggressively for CJK text than porter stemming does for English. Pure-English deployments paid the storage cost for a feature they did not use, and there was no clean way to disable or reclaim it.This change adds an explicit opt-out path — without breaking the existing CJK search behavior for users who want it — and exposes the maintenance hooks needed to reclaim space on databases that have already been bloated by the index.
The approach is intentionally minimal:
HERMES_DISABLE_FTS_TRIGRAM) gates creation of the virtual table and triggers in both v10/v11 migration paths and the post-migration existence check;SessionDB.drop_fts_trigram()lets operators reclaim space on existing databases by dropping the table and triggers and runningVACUUM;search_messages()already had a LIKE-fallback path for short (1–2 char) CJK queries, so the new behavior reuses it for longer CJK queries when the trigram table is absent — no new query path was introduced.Related Issue
Refs #22478
Type of Change
Changes Made
hermes_state.py_env_flag()and_fts_trigram_disabled()that read the newHERMES_DISABLE_FTS_TRIGRAMenv var (1/true/yes/on)._init_schema():_fts_trigram_disabled();SessionDB._fts_trigram_availablecache +_has_fts_trigram()runtime probe sosearch_messages()can route around a missing table without raising.SessionDB.drop_fts_trigram()— drops the trigram triggers and table, thenVACUUMs; idempotent on a database that already has the index dropped.SessionDB.vacuum()— plainVACUUMfor callers that want to defragment after large deletions.search_messages()the long-CJK branch now requires_has_fts_trigram(); otherwise it falls through to the existing LIKE substring path that was already used for 1–2 char CJK queries (no new query code path was introduced).tests/test_hermes_state.py— newTestFTS5TrigramOptionalclass with 10 regression tests covering: env-var skip on fresh DB (table + triggers), porter FTS still works, long CJK queries fall back to LIKE when disabled, short CJK queries unaffected, INSERTs don't fail when triggers are missing,drop_fts_trigram()removes table + triggers,drop_fts_trigram()is idempotent,search_messages()routes long CJK to LIKE after the index is dropped,vacuum()runs cleanly.website/docs/developer-guide/session-storage.md— new "Disabling the trigram FTS5 index" section explaining the env var,drop_fts_trigram(), and re-enable semantics..env.example— documented the new env var under a newSESSION DATABASE (state.db)section.How to Test
CJK substring search still works after the env var is set (1–2 char and long queries both use the LIKE substring path) — see
test_cjk_search_falls_back_to_like_when_trigram_disabledandtest_short_cjk_search_works_when_trigram_disabled.Checklist
Code
feat(state): make trigram FTS5 index optional)pytest tests/test_hermes_state.py -qand all 220 tests passTestFTS5TrigramOptional)Documentation & Housekeeping
website/docs/developer-guide/session-storage.md,.env.example)cli-config.yaml.exampleif I added/changed config keys — N/A (env var, not config key)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/Aos.environ, no platform-specific calls;ruff check .andscripts/check-windows-footguns.py --allboth pass.Screenshots / Logs