
fix(session-search): LIKE fallback for CJK queries (salvages #11516, #11517, #11541) #12075

Merged: teknium1 merged 2 commits into main from hermes/hermes-e16ba93d on Apr 18, 2026
@teknium1

Summary

session_search now finds Chinese, Japanese, and Korean content instead of returning [].

Root cause: SQLite FTS5's default tokenizer (unicode61) treats a contiguous CJK run as a single token. As a result, search_messages("记忆断裂") against a message like "…的聊天记录记忆断裂问题…" runs MATCH '记忆断裂' against the indexed token '的聊天记录记忆断裂问题' and returns zero rows, even though the substring is present. This affects every CJK user.
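The mismatch can be reproduced outside Hermes with a few lines of Python. This is a minimal sketch, assuming the local `sqlite3` build ships FTS5; the table name `demo_fts` is hypothetical and is not the schema used by `hermes_state.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column table; FTS5 uses the unicode61 tokenizer by default.
conn.execute("CREATE VIRTUAL TABLE demo_fts USING fts5(content)")
conn.execute(
    "INSERT INTO demo_fts (content) VALUES (?)",
    ("的聊天记录记忆断裂问题",),
)

query = "记忆断裂"

# Quote the term so FTS5 parses it as a single phrase.
fts_rows = conn.execute(
    "SELECT content FROM demo_fts WHERE demo_fts MATCH ?",
    (f'"{query}"',),
).fetchall()

# Plain substring scan over the same column.
like_rows = conn.execute(
    "SELECT content FROM demo_fts WHERE content LIKE ?",
    (f"%{query}%",),
).fetchall()

print(len(fts_rows), len(like_rows))
```

The MATCH query compares whole tokens, so it misses the row, while the LIKE scan finds it.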

Fix: when FTS5 returns no results and the query contains any CJK character, retry with WHERE content LIKE '%query%' preserving all filters. English queries are untouched and keep the FTS5 fast path.
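The control flow described above might look roughly like the following. This is an illustrative sketch, not the code in `hermes_state.py`: the names `contains_cjk`, `search`, and `messages_fts` are hypothetical, and the code point ranges are an assumption based on the ranges named in the test description (CJK ideographs, Hiragana, Katakana, Hangul syllables).

```python
import sqlite3

# Assumed code point ranges, not copied from hermes_state.py.
_CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables block
)


def contains_cjk(text):
    """Return True if any character falls in one of the assumed CJK ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in _CJK_RANGES)


def search(conn, query, limit=10):
    """Try FTS5 first; retry with LIKE only for CJK queries that found nothing."""
    phrase = '"' + query.replace('"', '""') + '"'  # escape embedded quotes
    rows = conn.execute(
        "SELECT content FROM messages_fts WHERE messages_fts MATCH ? LIMIT ?",
        (phrase, limit),
    ).fetchall()
    if rows or not contains_cjk(query):
        return rows  # fast path: FTS5 hit, or a non-CJK query
    return conn.execute(
        "SELECT content FROM messages_fts WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO messages_fts (content) VALUES (?)",
    [
        ("restart the docker daemon",),
        ("的聊天记录记忆断裂问题",),
        ("안녕하세요 여러분",),
    ],
)

print(search(conn, "docker"))
print(search(conn, "记忆断裂"))
print(search(conn, "안녕"))
```

Because the fallback runs only after an empty FTS5 result, non-CJK queries never pay for the full scan that LIKE requires.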

Salvages the substantive work from three duplicate PRs (#11516, #11517, #11541), submitted within 35 minutes of each other, all against #11511. Uses #11516's cleaner structure as the base (commit authored by @vominh1919), adds @iamagenius00's centered-snippet idea from #11517, and adds regression coverage that also guards against two bugs observed in #11541: SQL filter clauses landing after LIMIT/OFFSET, and a truncated Hangul range.

Changes

  • hermes_state.py: _contains_cjk() helper + LIKE fallback in search_messages() preserving source_filter, exclude_sources, role_filter. Snippet is substr(content, max(1, instr(content, ?) - 40), 120) — centered on the match.
  • tests/test_hermes_state.py: new TestCJKSearchFallback class with 12 tests covering CJK detection ranges, Chinese/Japanese/Korean queries, filter preservation, centered snippets, English fast-path, and the no-match case.
  • scripts/release.py: add iamagenius00 to AUTHOR_MAP.
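To illustrate the snippet expression and the clause ordering that the regression tests guard, here is a hedged sketch of a LIKE-path query builder. The `messages` table, its columns, and the `like_search` function are hypothetical stand-ins; only the `substr(content, max(1, instr(content, ?) - 40), 120)` expression is taken from this PR.

```python
import sqlite3


def like_search(conn, query, source_filter=None, role_filter=None,
                limit=10, offset=0):
    """Substring search with a snippet window that starts 40 chars before the match."""
    where = ["content LIKE ?"]
    params = [query, f"%{query}%"]  # first ? feeds instr() in the SELECT list
    if source_filter:
        where.append(f"source IN ({','.join('?' * len(source_filter))})")
        params.extend(source_filter)
    if role_filter:
        where.append(f"role IN ({','.join('?' * len(role_filter))})")
        params.extend(role_filter)
    # All filters belong in WHERE; LIMIT/OFFSET must come last.
    sql = (
        "SELECT source, role, "
        "substr(content, max(1, instr(content, ?) - 40), 120) AS snippet "
        "FROM messages WHERE " + " AND ".join(where) + " LIMIT ? OFFSET ?"
    )
    params.extend([limit, offset])
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (source TEXT, role TEXT, content TEXT)")
content = "a" * 80 + "记忆断裂" + "b" * 80  # 164 characters
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [("telegram", "user", content), ("cli", "assistant", content)],
)

rows = like_search(conn, "记忆断裂", source_filter=["telegram"])
source, role, snippet = rows[0]
print(source, len(snippet), snippet.index("记忆断裂"))
```

With a match at character 81, the window starts at character 41, so the 120-character snippet carries 40 characters of leading context before the matched term.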

Validation

| Case | Before | After |
| --- | --- | --- |
| search_messages("记忆断裂") on data containing it | 0 results | finds it |
| search_messages("안녕") on Korean content | 0 results | finds it |
| search_messages("docker") (English fast-path) | works | works (unchanged) |
| tests/test_hermes_state.py | 137 pass | 149 pass (12 new) |
| tests/tools/test_session_search.py | 32 pass | 32 pass |

E2E verified with real SessionDB + real SQLite against the exact Twitter-thread query ("和其他Agent的聊天记录") — finds it. Filter preservation verified with source_filter=["telegram"] on CJK query. Centered snippet verified — 164-char content returns a 120-char snippet with the matched term in the middle.

Credits

Closes #11511. Supersedes #11516, #11517, #11541.

Follow-up (not in this PR)

A proper long-term fix is to switch the FTS5 virtual table to the trigram tokenizer (SQLite 3.34+), which handles CJK substring matching natively without needing LIKE. That requires a schema migration (DROP + CREATE + reindex) and a minimum-SQLite check — worth its own PR.
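As a rough illustration of that follow-up, the sketch below builds a throwaway FTS5 table with `tokenize='trigram'` behind a version check. The function name `trigram_search` and the table name `demo_trigram` are hypothetical, and the sketch does not attempt the schema migration itself.

```python
import sqlite3


def trigram_search(text, query):
    """Index `text` with the trigram tokenizer and run a phrase MATCH.

    Returns None when the linked SQLite library predates the trigram tokenizer.
    """
    if sqlite3.sqlite_version_info < (3, 34, 0):
        return None  # minimum-SQLite check
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE demo_trigram USING fts5(content, tokenize='trigram')"
    )
    conn.execute("INSERT INTO demo_trigram (content) VALUES (?)", (text,))
    phrase = '"' + query.replace('"', '""') + '"'
    return conn.execute(
        "SELECT content FROM demo_trigram WHERE demo_trigram MATCH ?",
        (phrase,),
    ).fetchall()


result = trigram_search("的聊天记录记忆断裂问题", "记忆断裂")
print(result)
```

One caveat to weigh in that follow-up: the SQLite documentation states that substrings shorter than three characters do not match in a trigram full-text query, so two-character queries such as "안녕" would probably still need a LIKE path.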

vominh1919 and others added 2 commits April 18, 2026 01:52
FTS5's default tokenizer (unicode61) indexes a contiguous run of CJK
text as a single token, so multi-character queries like '记忆断裂' that
fall inside a longer run return 0 results.

This fix adds a LIKE fallback: when FTS5 returns no results and the
query contains CJK characters, retry with WHERE content LIKE '%query%'.
Preserves FTS5 performance for English queries.

Fixes #11511
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR #11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

session_search: FTS5 returns empty results for Chinese/CJK queries