fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries#11516
Closed
vominh1919 wants to merge 1 commit into
Closed
fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries#11516vominh1919 wants to merge 1 commit into
vominh1919 wants to merge 1 commit into
Conversation
FTS5 default tokenizer splits CJK text character-by-character, causing multi-character queries like '记忆断裂' to return 0 results. This fix adds a LIKE fallback: when FTS5 returns no results and the query contains CJK characters, retry with WHERE content LIKE '%query%'. Preserves FTS5 performance for English queries. Fixes NousResearch#11511
teknium1
pushed a commit
that referenced
this pull request
Apr 18, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs #11511, #11516, #11517, #11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
teknium1
pushed a commit
that referenced
this pull request
Apr 18, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs #11511, #11516, #11517, #11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
Contributor
|
Salvaged into #12075 (merged to main as commit 8826d9c). Your implementation was the cleanest of the three concurrent fixes for #11511 and was used as the base — your commit lands on main with your authorship preserved in git log. Thank you for the fix! 🙏 Context: three PRs were submitted within 35 minutes of each other for this issue (#11516 yours, #11517 @iamagenius00, #11541 @gongli0929). Picked yours as the base because it kept the LIKE fallback minimal and correctly preserved all filters. Added @iamagenius00's centered-snippet idea from #11517 as a follow-up commit, plus 12 regression tests. |
This was referenced Apr 18, 2026
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR NousResearch#11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR NousResearch#11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR NousResearch#11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR NousResearch#11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
Twelve tests under TestCJKSearchFallback guarding: - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs) - Substring match for multi-char Chinese, Japanese, Korean queries - Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path — guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET - Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content - English fast-path untouched - Empty/no-match cases - Mixed CJK+English queries Also: - hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR NousResearch#11517. - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly. Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541. Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
session_searchuses SQLite FTS5 for full-text search. FTS5's default tokenizer splits CJK text character-by-character (no spaces between words). Multi-character Chinese queries like "记忆断裂" become记 AND 忆 AND 断 AND 裂, requiring all 4 characters to match — returning 0 results despite data existing.This affects all CJK (Chinese, Japanese, Korean) users.
Solution
Add a LIKE fallback in
SessionDB.search_messages(): when FTS5 returns no results and the query contains CJK characters, retry withWHERE content LIKE ?.Changes
_contains_cjk()static method — detects CJK Unicode ranges (Chinese, Hiragana, Katakana, Hangul)search_messages()— when FTS5 returns empty and query has CJK, retries with LIKE-based query preserving all source/role filtersDesign decisions
timestamp DESC(most recent first) sincerankis unavailableFixes #11511
Related: #9135, #9651