fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries by vominh1919 · Pull Request #11516 · NousResearch/hermes-agent

vominh1919 · 2026-04-17T09:35:19Z

Problem

session_search uses SQLite FTS5 for full-text search. FTS5's default tokenizer splits CJK text character-by-character (no spaces between words). Multi-character Chinese queries like "记忆断裂" become 记 AND 忆 AND 断 AND 裂, requiring all 4 characters to match — returning 0 results despite data existing.

This affects all CJK (Chinese, Japanese, Korean) users.

Solution

Add a LIKE fallback in SessionDB.search_messages(): when FTS5 returns no results and the query contains CJK characters, retry with WHERE content LIKE ?.

Changes

- New _contains_cjk() static method: detects CJK Unicode ranges (Chinese, Hiragana, Katakana, Hangul)
- LIKE fallback in search_messages(): when FTS5 returns empty and the query has CJK, retries with a LIKE-based query, preserving all source/role filters
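A minimal sketch of what a CJK detector along these lines could look like. The function name and the exact ranges are assumptions for illustration (the PR diff is not shown here); only the Hangul syllables block `\uac00-\ud7af` is named explicitly in the follow-up commit message.

```python
def contains_cjk(text: str) -> bool:
    """Return True if text contains at least one CJK character.

    Hypothetical helper; the ranges below are illustrative, not copied
    from the PR.
    """
    cjk_ranges = (
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Chinese, Kanji, Hanja)
        (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A (assumption)
        (0x3040, 0x309F),  # Hiragana
        (0x30A0, 0x30FF),  # Katakana
        (0xAC00, 0xD7AF),  # Hangul Syllables (full block)
    )
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in cjk_ranges)


# A mixed query counts as CJK as soon as one character falls in a range.
has_cjk = contains_cjk("debug 记忆断裂")
```

Checking code points against explicit ranges keeps the test cheap, so it can run on every query without affecting the English path.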

Design decisions

- FTS5 is tried first for all queries (preserves English performance)
- LIKE fallback only triggers when FTS5 returns 0 results AND the query contains CJK
- LIKE results are ordered by timestamp DESC (most recent first) since rank is unavailable
- All existing filters (source, exclude_sources, role) are preserved in the LIKE query
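The fallback half of these decisions can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the single `messages` table, its columns, and the `like_fallback` name and signature are hypothetical. The point it shows is that every filter clause is appended to the WHERE list before ORDER BY and LIMIT/OFFSET are added.

```python
import sqlite3


def like_fallback(conn, query, source_filter=None, exclude_sources=None,
                  role_filter=None, limit=20, offset=0):
    """Substring search used only after FTS5 returned nothing (hypothetical)."""
    where = ["content LIKE ?"]
    # Note: % and _ inside the query still act as LIKE wildcards here.
    params = [f"%{query}%"]
    if source_filter:
        where.append(f"source IN ({','.join('?' * len(source_filter))})")
        params.extend(source_filter)
    if exclude_sources:
        where.append(f"source NOT IN ({','.join('?' * len(exclude_sources))})")
        params.extend(exclude_sources)
    if role_filter:
        where.append(f"role IN ({','.join('?' * len(role_filter))})")
        params.extend(role_filter)
    # No FTS5 rank is available, so order by recency instead.
    sql = (
        "SELECT id, source, role, content, timestamp FROM messages "
        f"WHERE {' AND '.join(where)} "
        "ORDER BY timestamp DESC LIMIT ? OFFSET ?"
    )
    params.extend([limit, offset])
    return conn.execute(sql, params).fetchall()


# Assumed toy schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, source TEXT, "
    "role TEXT, content TEXT, timestamp REAL)"
)
conn.executemany(
    "INSERT INTO messages (source, role, content, timestamp) VALUES (?, ?, ?, ?)",
    [
        ("cli", "user", "昨天出现了记忆断裂的问题", 1.0),
        ("telegram", "assistant", "记忆断裂已经修复", 2.0),
        ("cli", "assistant", "unrelated English text", 3.0),
    ],
)
rows = like_fallback(conn, "记忆断裂", source_filter=["cli"])
```

In a caller, this would run only when the FTS5 query came back empty and the query contains CJK characters, so English queries never pay for the table scan that LIKE '%...%' implies.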

Fixes #11511

Related: #9135, #9651

FTS5 default tokenizer splits CJK text character-by-character, causing multi-character queries like '记忆断裂' to return 0 results. This fix adds a LIKE fallback: when FTS5 returns no results and the query contains CJK characters, retry with WHERE content LIKE '%query%'. Preserves FTS5 performance for English queries. Fixes NousResearch#11511

@iamagenius00

Twelve tests under TestCJKSearchFallback guarding:
- CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs)
- Substring match for multi-char Chinese, Japanese, Korean queries
- Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path; guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET
- Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content
- English fast-path untouched
- Empty/no-match cases
- Mixed CJK+English queries

Also:
- hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517.
- scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
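To make the snippet expression concrete, here is a small sketch that runs `substr(content, max(1, instr(content, ?) - 40), 120)` against an in-memory SQLite table. The table layout and the `like_snippets` helper are assumptions for illustration; only the SQL expression itself comes from the commit message.

```python
import sqlite3

# The substr/instr expression is quoted from the commit message; the
# surrounding query and table are hypothetical.
SNIPPET_SQL = (
    "SELECT substr(content, max(1, instr(content, ?) - 40), 120) "
    "FROM messages WHERE content LIKE ?"
)


def like_snippets(conn, term):
    """Return one snippet per matching row, windowed around the first match."""
    return [row[0] for row in conn.execute(SNIPPET_SQL, (term, f"%{term}%"))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (content TEXT)")
term = "记忆断裂"
# Match buried deep in a long message: the window starts 40 chars before it.
conn.execute("INSERT INTO messages VALUES (?)", ("x" * 300 + term + "y" * 300,))
# Match at the very start: max(1, ...) clamps the start position to 1.
conn.execute("INSERT INTO messages VALUES (?)", (term + "z" * 300,))

deep, leading = like_snippets(conn, term)
```

`instr` returns a 1-based character position, so the window holds up to 40 characters of leading context, the match, and whatever fits in the remaining characters of the 120-character budget. The `max(1, ...)` guard matters when the match sits within the first 40 characters of the content.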

teknium1 · 2026-04-18T08:58:29Z

Salvaged into #12075 (merged to main as commit 8826d9c). Your implementation was the cleanest of the three concurrent fixes for #11511 and was used as the base — your commit lands on main with your authorship preserved in git log. Thank you for the fix! 🙏

Context: three PRs were submitted within 35 minutes of each other for this issue (#11516 yours, #11517 @iamagenius00, #11541 @gongli0929). Picked yours as the base because it kept the LIKE fallback minimal and correctly preserved all filters. Added @iamagenius00's centered-snippet idea from #11517 as a follow-up commit, plus 12 regression tests.

teknium1 mentioned this pull request Apr 18, 2026

fix(session-search): LIKE fallback for CJK queries (salvages #11516, #11517, #11541) #12075

Merged

teknium1 closed this Apr 18, 2026

This was referenced Apr 18, 2026

fix: add LIKE fallback for CJK queries in session_search #11517

Closed

session_search: FTS5 returns empty results for Chinese/CJK queries #11511

Closed

vominh1919 wants to merge 1 commit into NousResearch:main from vominh1919:fix-cjk-fts5
