Skip to content

fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries#11516

Closed
vominh1919 wants to merge 1 commit into
NousResearch:mainfrom
vominh1919:fix-cjk-fts5
Closed

fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries#11516
vominh1919 wants to merge 1 commit into
NousResearch:mainfrom
vominh1919:fix-cjk-fts5

Conversation

@vominh1919

Copy link
Copy Markdown
Contributor

Problem

session_search uses SQLite FTS5 for full-text search. FTS5's default tokenizer splits CJK text character-by-character (no spaces between words). Multi-character Chinese queries like "记忆断裂" become 记 AND 忆 AND 断 AND 裂, requiring all 4 characters to match — returning 0 results despite data existing.

This affects all CJK (Chinese, Japanese, Korean) users.

Solution

Add a LIKE fallback in SessionDB.search_messages(): when FTS5 returns no results and the query contains CJK characters, retry with WHERE content LIKE ?.

Changes

  1. New _contains_cjk() static method — detects CJK Unicode ranges (Chinese, Hiragana, Katakana, Hangul)
  2. LIKE fallback in search_messages() — when FTS5 returns empty and query has CJK, retries with LIKE-based query preserving all source/role filters

Design decisions

  • FTS5 is tried first for all queries (preserves English performance)
  • LIKE fallback only triggers when FTS5 returns 0 results AND query contains CJK
  • LIKE results ordered by timestamp DESC (most recent first) since rank is unavailable
  • All existing filters (source, exclude_sources, role) are preserved in the LIKE query

Fixes #11511

Related: #9135, #9651

FTS5 default tokenizer splits CJK text character-by-character, causing
multi-character queries like '记忆断裂' to return 0 results.

This fix adds a LIKE fallback: when FTS5 returns no results and the
query contains CJK characters, retry with WHERE content LIKE '%query%'.
Preserves FTS5 performance for English queries.

Fixes NousResearch#11511
teknium1 pushed a commit that referenced this pull request Apr 18, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR #11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
teknium1 pushed a commit that referenced this pull request Apr 18, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR #11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
@teknium1

Copy link
Copy Markdown
Contributor

Salvaged into #12075 (merged to main as commit 8826d9c). Your implementation was the cleanest of the three concurrent fixes for #11511 and was used as the base — your commit lands on main with your authorship preserved in git log. Thank you for the fix! 🙏

Context: three PRs were submitted within 35 minutes of each other for this issue (#11516 yours, #11517 @iamagenius00, #11541 @gongli0929). Picked yours as the base because it kept the LIKE fallback minimal and correctly preserved all filters. Added @iamagenius00's centered-snippet idea from #11517 as a follow-up commit, plus 12 regression tests.

#12075

@teknium1 teknium1 closed this Apr 18, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR NousResearch#11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR NousResearch#11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR NousResearch#11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR NousResearch#11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR NousResearch#11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs NousResearch#11511, NousResearch#11516, NousResearch#11517, NousResearch#11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

session_search: FTS5 returns empty results for Chinese/CJK queries

2 participants