fix(session-search): LIKE fallback for CJK queries (salvages #11516, #11517, #11541) #12075
Merged
The FTS5 default tokenizer (unicode61) does not segment CJK text into words: it indexes a contiguous CJK run as a single token, so multi-character queries like '记忆断裂' return 0 results when the term sits inside a longer run. This fix adds a LIKE fallback: when FTS5 returns no results and the query contains CJK characters, retry with `WHERE content LIKE '%query%'`. English queries keep the FTS5 fast path. Fixes #11511
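A minimal sketch of the miss and the fallback, run against an in-memory SQLite database. The table names (`messages`, `messages_fts`) and the helpers `has_cjk`, `make_demo_db`, and `search` are hypothetical and are not taken from hermes_state.py; the sketch also assumes the Python build ships SQLite with FTS5 enabled.

```python
import sqlite3


def has_cjk(text):
    # Simplified illustrative check: CJK ideographs, kana, Hangul syllables.
    return any(
        "\u4e00" <= ch <= "\u9fff"
        or "\u3040" <= ch <= "\u30ff"
        or "\uac00" <= ch <= "\ud7af"
        for ch in text
    )


def make_demo_db():
    # Hypothetical schema: a plain table plus an FTS5 index with matching rowids.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
    conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(content)")
    rows = [
        (1, "我和其他Agent的聊天记录记忆断裂问题很严重"),
        (2, "docker compose restart fails on boot"),
    ]
    conn.executemany("INSERT INTO messages (id, content) VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO messages_fts (rowid, content) VALUES (?, ?)", rows)
    return conn


def search(conn, query, limit=10):
    # Fast path: FTS5 MATCH, with the query quoted as an FTS5 string literal.
    fts_query = '"' + query.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT rowid, content FROM messages_fts "
        "WHERE messages_fts MATCH ? LIMIT ?",
        (fts_query, limit),
    ).fetchall()
    if rows or not has_cjk(query):
        return rows
    # Fallback: substring scan, only reached for CJK queries with no FTS hits.
    # Note: % and _ inside the query act as LIKE wildcards in this sketch.
    return conn.execute(
        "SELECT id, content FROM messages WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    ).fetchall()
```

With this data, `search(conn, "记忆断裂")` gets no FTS5 hits and is answered by the LIKE scan, while `search(conn, "docker")` never leaves the FTS5 path.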
Twelve tests under TestCJKSearchFallback guarding:
- CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs)
- Substring match for multi-char Chinese, Japanese, Korean queries
- Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path; guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET
- Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content
- English fast-path untouched
- Empty/no-match cases
- Mixed CJK+English queries

Also:
- hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517.
- scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
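For readers who want to see what the detection tests exercise, here is a rough sketch of a range-based CJK check. The name `contains_cjk` and the exact range table are assumptions for illustration; the real `_contains_cjk()` in hermes_state.py may cover a different set of blocks.

```python
# Hypothetical range table (inclusive code points); not copied from the PR.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables, full block
)


def contains_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)
```

A syllable near the end of the Hangul block, such as "힣" (U+D7A3), is a handy probe for a truncated upper bound, and a mixed query like "docker 部署" should still be detected as CJK.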
This was referenced Apr 18, 2026
Summary
`session_search` now finds Chinese, Japanese, and Korean content instead of returning `[]`.

Root cause: SQLite FTS5's default tokenizer (`unicode61`) treats a contiguous CJK run as a single token, so `search_messages("记忆断裂")` against a message like "…的聊天记录记忆断裂问题…" runs `MATCH '记忆断裂'` against the indexed token `'的聊天记录记忆断裂问题'` and returns zero, despite the substring being right there. This affects every CJK user.

Fix: when FTS5 returns no results and the query contains any CJK character, retry with `WHERE content LIKE '%query%'`, preserving all filters. English queries are untouched and keep the FTS5 fast path.

Salvages the substantive work from three duplicate PRs (#11516, #11517, #11541), submitted within 35 minutes of each other, all against #11511. Picks #11516's cleaner structure as the base (commit authored by @vominh1919), adds @iamagenius00's centered-snippet idea from #11517, and adds regression coverage that also guards against two bugs observed in #11541 (SQL filter clauses landing after `LIMIT/OFFSET`, truncated Hangul range).

Changes
- `hermes_state.py`: `_contains_cjk()` helper + LIKE fallback in `search_messages()` preserving `source_filter`, `exclude_sources`, `role_filter`. Snippet is `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match.
- `tests/test_hermes_state.py`: new `TestCJKSearchFallback` class with 12 tests covering CJK detection ranges, Chinese/Japanese/Korean queries, filter preservation, centered snippets, English fast-path, and the no-match case.
- `scripts/release.py`: add `iamagenius00` to AUTHOR_MAP.

Validation
- `search_messages("记忆断裂")` on data containing it
- `search_messages("안녕")` on Korean content
- `search_messages("docker")` (English fast-path)
- `tests/test_hermes_state.py`
- `tests/tools/test_session_search.py`

E2E verified with real `SessionDB` + real SQLite against the exact Twitter-thread query ("和其他Agent的聊天记录"): finds it. Filter preservation verified with `source_filter=["telegram"]` on a CJK query. Centered snippet verified: 164-char content returns a 120-char snippet with the matched term in the middle.

Credits
Closes #11511. Supersedes #11516, #11517, #11541.
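For reference, the LIKE-path query shape described under Changes can be sketched as below: every filter clause is appended to `WHERE` before `LIMIT`/`OFFSET`, and the snippet uses the `instr`-based `substr` window. The `messages` schema, the `like_fallback` name, and its signature are assumptions for illustration, not the actual `search_messages()` code.

```python
import sqlite3


def _in_clause(column, values, negate=False):
    # Build "column [NOT] IN (?, ?, ...)" with one placeholder per value.
    marks = ", ".join("?" for _ in values)
    return f"{column} {'NOT IN' if negate else 'IN'} ({marks})"


def like_fallback(conn, query, source_filter=None, exclude_sources=None,
                  role_filter=None, limit=10, offset=0):
    # Hypothetical schema: messages(id, source, role, content).
    where = ["content LIKE ?"]
    # The first parameter feeds instr() in the SELECT list, the second the LIKE.
    params = [query, f"%{query}%"]
    if source_filter:
        where.append(_in_clause("source", source_filter))
        params.extend(source_filter)
    if exclude_sources:
        where.append(_in_clause("source", exclude_sources, negate=True))
        params.extend(exclude_sources)
    if role_filter:
        where.append(_in_clause("role", role_filter))
        params.extend(role_filter)
    # All filters are joined into WHERE first; LIMIT/OFFSET always come last.
    sql = (
        "SELECT id, source, "
        "substr(content, max(1, instr(content, ?) - 40), 120) AS snippet "
        "FROM messages WHERE " + " AND ".join(where) +
        " ORDER BY id LIMIT ? OFFSET ?"
    )
    params.extend([limit, offset])
    return conn.execute(sql, params).fetchall()
```

The window starts up to 40 characters before the match and spans 120 characters, so a match deep inside a long message still shows up in the snippet. In this sketch `%` and `_` in the query are not escaped and therefore behave as LIKE wildcards.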
Follow-up (not in this PR)
A proper long-term fix is to switch the FTS5 virtual table to the `trigram` tokenizer (SQLite 3.34+), which handles CJK substring matching natively without needing LIKE. That requires a schema migration (DROP + CREATE + reindex) and a minimum-SQLite check, which is worth its own PR.
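A rough sketch of what that follow-up could look like, including the minimum-version check. The `messages_fts` table and the `trigram_search` helper are hypothetical; this builds a throwaway in-memory index and is not a migration for the real schema.

```python
import sqlite3

# The trigram tokenizer first shipped in SQLite 3.34.0.
MIN_TRIGRAM_VERSION = (3, 34, 0)


def trigram_search(rows, query):
    """Index `rows` with the trigram tokenizer and return matching contents.

    Returns None when the linked SQLite is too old for tokenize='trigram'.
    """
    if sqlite3.sqlite_version_info < MIN_TRIGRAM_VERSION:
        return None
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE messages_fts USING fts5(content, tokenize='trigram')"
    )
    conn.executemany(
        "INSERT INTO messages_fts (content) VALUES (?)", [(r,) for r in rows]
    )
    # Quote the query as an FTS5 string so it is matched as a substring phrase.
    fts_query = '"' + query.replace('"', '""') + '"'
    hits = conn.execute(
        "SELECT content FROM messages_fts WHERE messages_fts MATCH ? ORDER BY rowid",
        (fts_query,),
    ).fetchall()
    conn.close()
    return [h[0] for h in hits]
```

One caveat worth checking before the migration: as far as I know, a trigram `MATCH` needs at least three characters in the query, so two-character queries such as "안녕" might still need the LIKE path.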