fix(session_search): supplement FTS5 with LIKE for CJK partial results #14842
Closed
kagura-agent wants to merge 2 commits into
afaef94 to dd67359
Rebased on latest main to resolve conflicts.
dd67359 to a95cf49
Rebased on latest main to resolve merge conflict.
369b711 to f1d6cfd
NousResearch#14829)

FTS5 unicode61 tokenizer drops certain CJK characters, causing queries like '昨晚' to return only 12.5% of actual matches. The existing LIKE fallback only triggers when FTS5 returns zero results, missing the common case where FTS5 returns *some* but not all matches.

Change the LIKE path from a fallback (only on empty results) to a supplement (always runs for CJK queries). Results are merged with deduplication by message id, so FTS5 results are preserved and LIKE fills in what FTS5 missed.

- Always run LIKE for CJK queries, not just on zero FTS5 results
- Deduplicate merged results by message id
- Add regression test for partial-result supplementation
- Add deduplication correctness test
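For readers skimming the thread, the supplement-and-deduplicate idea in the commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `hermes_state.py`: the `messages` table, its `id` and `content` columns, and the helpers `contains_cjk`, `like_search`, `merge_by_id`, and `search` are all hypothetical names, and the FTS5 step is stood in for by a precomputed `fts_rows` list.

```python
import sqlite3


def contains_cjk(text):
    """Rough CJK check (assumption: these three Unicode ranges are enough)."""
    return any(
        "\u4e00" <= ch <= "\u9fff"      # CJK Unified Ideographs
        or "\u3040" <= ch <= "\u30ff"   # Hiragana and Katakana
        or "\uac00" <= ch <= "\ud7af"   # Hangul syllables
        for ch in text
    )


def like_search(conn, query):
    """Substring search with LIKE; wildcards in the query are escaped."""
    escaped = (
        query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cur = conn.execute(
        "SELECT id, content FROM messages "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    )
    return [{"id": row[0], "content": row[1]} for row in cur]


def merge_by_id(fts_rows, like_rows):
    """Keep FTS5 rows first, then append LIKE rows with unseen ids."""
    seen = set()
    merged = []
    for row in list(fts_rows) + list(like_rows):
        if row["id"] not in seen:
            seen.add(row["id"])
            merged.append(row)
    return merged


def search(conn, query, fts_rows):
    """fts_rows stands in for whatever the FTS5 query returned."""
    if contains_cjk(query):
        # Supplement: LIKE always runs for CJK queries, not only on zero hits.
        return merge_by_id(fts_rows, like_search(conn, query))
    return list(fts_rows)


# Hypothetical data: pretend FTS5 matched message 1 but missed message 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO messages (id, content) VALUES (?, ?)",
    [(1, "昨晚 开会"), (2, "我昨晚睡得很晚"), (3, "hello world")],
)
partial_fts = [{"id": 1, "content": "昨晚 开会"}]
results = search(conn, "昨晚", partial_fts)
print([r["id"] for r in results])  # [1, 2]
```

Listing FTS5 rows first keeps their ranking intact, and the `seen` set ensures a message matched by both paths appears only once.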
f1d6cfd to 3ee95d1
Closing: superseded by #16651, which implemented a trigram FTS5 index approach for CJK search, replacing the LIKE fallback strategy from this PR. The issue is fully resolved.
Summary
Fixes #14829
FTS5's `unicode61` tokenizer silently drops certain CJK characters, causing queries like `昨晚` to return only a fraction of actual matches. The existing LIKE fallback (added in `8826d9c` for #11511) only triggers when FTS5 returns zero results, but the more common case is FTS5 returning some results while missing many others.

Changes
- `hermes_state.py`: Change the LIKE path from a zero-result fallback to an always-run supplement for CJK queries. Results are merged with deduplication by message id, preserving FTS5 results while LIKE fills in the gaps.
- `tests/test_hermes_state.py`: Add two regression tests:
  - `test_cjk_partial_fts5_results_supplemented_by_like`: verifies LIKE supplements partial FTS5 results
  - `test_cjk_like_dedup_no_duplicates`: verifies no duplicate results when both FTS5 and LIKE match

Before / After
Testing