perf(state): reduce FTS storage with external-content indexes (schema v15)#20239
perf(state): reduce FTS storage with external-content indexes (schema v15)#20239JabberELF wants to merge 1 commit into
Conversation
f72b1f6 to
ad232f3
Compare
|
Refreshed this PR to make it easier to review. What changed:
Verification added/updated:
Local verification:
Synthetic SQLite benchmark on 2,500 generated messages:
Also adjusted the rollback wording: since the FTS tables are derived indexes over Happy to rebase again or adjust around #27770 if that lands first. |
…imize' The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of incremental b-tree segments — one per trigger-driven insert batch. SQLite's automerge caps at ~16 segments, so a long-lived store keeps scanning many segments per MATCH and never collapses them unless the special 'optimize' command runs. Nothing in the codebase ever ran it: vacuum() only fired after a prune that deleted rows, and even then never merged FTS segments. Changes: - SessionDB.optimize_fts(): merges each FTS5 index to a single segment, probing for the (optional/lazy) trigram table first so it is safe to call unconditionally. Layout-only — search results and snippet() are unchanged. - vacuum() now calls optimize_fts() before VACUUM so freed index pages are returned to the OS in the same pass. - 'hermes sessions optimize' CLI subcommand for on-demand reclamation + segment compaction (previously there was no way to compact the store without a prune deleting rows), with before/after size reporting. Benchmark (8000 msgs, fragmented to 8 segments/index): - segments 8 -> 1 on both indexes - porter MATCH 5.5x faster (0.449 -> 0.081 ms/q) - trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q) - 8000 matches before == 8000 after, identical row ids (no functional change) Orthogonal to the structural FTS-size PRs (#20239 external-content, #27770 optional trigram) — segment merge helps regardless of those. Tests: TestOptimizeFts covers index count, search+snippet preservation, missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
…imize' The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of incremental b-tree segments — one per trigger-driven insert batch. SQLite's automerge caps at ~16 segments, so a long-lived store keeps scanning many segments per MATCH and never collapses them unless the special 'optimize' command runs. Nothing in the codebase ever ran it: vacuum() only fired after a prune that deleted rows, and even then never merged FTS segments. Changes: - SessionDB.optimize_fts(): merges each FTS5 index to a single segment, probing for the (optional/lazy) trigram table first so it is safe to call unconditionally. Layout-only — search results and snippet() are unchanged. - vacuum() now calls optimize_fts() before VACUUM so freed index pages are returned to the OS in the same pass. - 'hermes sessions optimize' CLI subcommand for on-demand reclamation + segment compaction (previously there was no way to compact the store without a prune deleting rows), with before/after size reporting. Benchmark (8000 msgs, fragmented to 8 segments/index): - segments 8 -> 1 on both indexes - porter MATCH 5.5x faster (0.449 -> 0.081 ms/q) - trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q) - 8000 matches before == 8000 after, identical row ids (no functional change) Orthogonal to the structural FTS-size PRs (NousResearch#20239 external-content, NousResearch#27770 optional trigram) — segment merge helps regardless of those. Tests: TestOptimizeFts covers index count, search+snippet preservation, missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
ad232f3 to
6ba4830
Compare
|
Refreshed again onto current What changed since the previous refresh:
Verification:
I also ran an independent read-only review of the rebased diff. It found no Critical/Must Fix issues; the only suggestions were comment wording cleanups, which are included in this update. |
Rebase the external-content FTS change onto current main and move the schema bump to v13, since upstream main already uses schema v12. Drop the unrelated fork workflow cleanup from the PR branch. The migration rebuilds both FTS indexes from messages using external-content tables and keeps the messages table as the source of truth. Users who need rollback should restore a pre-upgrade state.db backup.
6ba4830 to
903b6f9
Compare
|
Refreshed this PR onto current What changed in the refresh:
Local verification:
|
…imize' The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of incremental b-tree segments — one per trigger-driven insert batch. SQLite's automerge caps at ~16 segments, so a long-lived store keeps scanning many segments per MATCH and never collapses them unless the special 'optimize' command runs. Nothing in the codebase ever ran it: vacuum() only fired after a prune that deleted rows, and even then never merged FTS segments. Changes: - SessionDB.optimize_fts(): merges each FTS5 index to a single segment, probing for the (optional/lazy) trigram table first so it is safe to call unconditionally. Layout-only — search results and snippet() are unchanged. - vacuum() now calls optimize_fts() before VACUUM so freed index pages are returned to the OS in the same pass. - 'hermes sessions optimize' CLI subcommand for on-demand reclamation + segment compaction (previously there was no way to compact the store without a prune deleting rows), with before/after size reporting. Benchmark (8000 msgs, fragmented to 8 segments/index): - segments 8 -> 1 on both indexes - porter MATCH 5.5x faster (0.449 -> 0.081 ms/q) - trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q) - 8000 matches before == 8000 after, identical row ids (no functional change) Orthogonal to the structural FTS-size PRs (NousResearch#20239 external-content, NousResearch#27770 optional trigram) — segment merge helps regardless of those. Tests: TestOptimizeFts covers index count, search+snippet preservation, missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
…imize' The FTS5 indexes (messages_fts, messages_fts_trigram) grow as a series of incremental b-tree segments — one per trigger-driven insert batch. SQLite's automerge caps at ~16 segments, so a long-lived store keeps scanning many segments per MATCH and never collapses them unless the special 'optimize' command runs. Nothing in the codebase ever ran it: vacuum() only fired after a prune that deleted rows, and even then never merged FTS segments. Changes: - SessionDB.optimize_fts(): merges each FTS5 index to a single segment, probing for the (optional/lazy) trigram table first so it is safe to call unconditionally. Layout-only — search results and snippet() are unchanged. - vacuum() now calls optimize_fts() before VACUUM so freed index pages are returned to the OS in the same pass. - 'hermes sessions optimize' CLI subcommand for on-demand reclamation + segment compaction (previously there was no way to compact the store without a prune deleting rows), with before/after size reporting. Benchmark (8000 msgs, fragmented to 8 segments/index): - segments 8 -> 1 on both indexes - porter MATCH 5.5x faster (0.449 -> 0.081 ms/q) - trigram MATCH 3.0x faster (0.632 -> 0.207 ms/q) - 8000 matches before == 8000 after, identical row ids (no functional change) Orthogonal to the structural FTS-size PRs (#20239 external-content, #27770 optional trigram) — segment merge helps regardless of those. Tests: TestOptimizeFts covers index count, search+snippet preservation, missing-trigram path, and idempotency. Full test_hermes_state.py green (227).
Summary
Switch
messages_ftsandmessages_fts_trigramfrom inline FTS5 storage to external-content FTS5 (schema v14 → v15 on currentmain). This removes duplicated message text from FTS shadow storage while keepingmessagesas the source of truth.Changes
hermes_state.pySCHEMA_VERSION14 → 15.messages_ftsandmessages_fts_trigramas external-content tables withcontent='messages'andcontent_rowid='id'.content,tool_name, andtool_callsindexed.messages.snippet(..., -1, ...)so matches in non-contentcolumns can produce useful snippets.tests/test_hermes_state.pycontent,tool_name,tool_calls, and CJK trigram text.Compatibility
messagesremains the source of truth and is not mutated by the migration.state.dbbackup.VACUUM/ the existing maintenance path.Validation
python -m pytest -o addopts='' tests/test_hermes_state.py -q --tb=short231 passedpython -m ruff check hermes_state.py tests/test_hermes_state.pyAll checks passed!python -m compileall -q hermes_state.py tests/test_hermes_state.pygit diff --checkIndependent read-only review of the rebased diff found no Critical/Must Fix issues.