fix: add LIKE fallback for CJK queries in session_search #11517

Closed
iamagenius00 wants to merge 1 commit into NousResearch:main from iamagenius00:fix/cjk-fts5-fallback

Conversation

@iamagenius00

Contributor

Summary

Fix session_search returning empty results for Chinese/Japanese/Korean queries.

Problem

FTS5's default tokenizer splits CJK text character-by-character (no word boundaries). Searching "记忆断裂" becomes 记 AND 忆 AND 断 AND 裂 — requiring all 4 individual characters in the same message. Despite data existing (LIKE finds 20+ matches), FTS5 returns 0 results.

This affects all CJK users.

Changes

1. CJK-aware query rewriting (_sanitize_fts5_query)

  • Detect CJK characters in query
  • Strip common Chinese stop-words (的、了、是、在、etc.)
  • Build OR-connected bigram pairs for 3+ character queries (better recall than single-char OR, better precision than full AND)
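The rewriting steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names `contains_cjk` and `bigram_or_query`, the stop-word set, and the exact character ranges are assumptions made for the example.

```python
import re

# Assumed ranges: CJK Unified Ideographs, Hiragana, Katakana, Hangul syllables.
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]")

# Hypothetical stop-word set, taken from the examples listed in the PR text.
STOP_CHARS = set("的了是在")


def contains_cjk(text: str) -> bool:
    """Return True if any character falls in the assumed CJK ranges."""
    return bool(_CJK_RE.search(text))


def bigram_or_query(query: str) -> str:
    """Build an OR-connected bigram query from the CJK characters in `query`."""
    chars = [c for c in query if _CJK_RE.match(c) and c not in STOP_CHARS]
    if len(chars) < 3:
        # Too short for bigrams to add anything: keep the run as one phrase.
        return '"' + "".join(chars) + '"'
    bigrams = [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
    return " OR ".join(f'"{b}"' for b in bigrams)


print(bigram_or_query("记忆断裂"))
```

Each bigram is quoted so that it reaches FTS5 as a phrase rather than as bare operators or column filters.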

2. LIKE fallback (search_messages)

  • When FTS5 returns 0 results and query contains CJK characters, retry with WHERE content LIKE ?
  • Preserves all existing filters (source exclusion, role filter)
  • Respects existing limit parameter
  • Groups results by session_id consistent with FTS5 path
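A LIKE fallback with these properties might look like the sketch below. The table layout (`messages` with `session_id`, `source`, `role`, `content`) and the function name `like_fallback` are assumptions for illustration; the real `search_messages` signature is not shown in this PR text.

```python
import sqlite3
from collections import defaultdict


def like_fallback(conn, query, exclude_sources=(), role_filter=None, limit=20):
    """Substring search that keeps filters ahead of LIMIT and groups by session."""
    # Escape LIKE wildcards so the query is matched as a literal substring.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    where = ["content LIKE ? ESCAPE '\\'"]
    params = [f"%{escaped}%"]
    if exclude_sources:
        marks = ",".join("?" * len(exclude_sources))
        where.append(f"source NOT IN ({marks})")
        params.extend(exclude_sources)
    if role_filter:
        where.append("role = ?")
        params.append(role_filter)
    sql = (
        "SELECT session_id, content FROM messages WHERE "
        + " AND ".join(where)
        + " ORDER BY id LIMIT ?"
    )
    params.append(limit)
    grouped = defaultdict(list)
    for session_id, content in conn.execute(sql, params):
        grouped[session_id].append(content)
    return dict(grouped)


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id TEXT, "
    "source TEXT, role TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO messages (session_id, source, role, content) VALUES (?, ?, ?, ?)",
    [
        ("s1", "cli", "user", "我的记忆断裂了"),
        ("s1", "cli", "assistant", "记忆断裂是什么意思"),
        ("s2", "tool", "user", "记忆断裂 log"),
        ("s3", "cli", "user", "unrelated"),
    ],
)

filtered = like_fallback(conn, "记忆断裂", exclude_sources=("tool",), role_filter="user")
unfiltered = like_fallback(conn, "记忆断裂")
```

All filter clauses are appended to the WHERE list before `LIMIT` is added, so the limit applies to rows that have already been filtered.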

Performance

LIKE is slower (full table scan), but for Hermes's data volume (thousands of messages) there's no perceptible difference.

Testing

Before fix:

session_search("记忆断裂") → 0 results
session_search("Friday beetle conatus") → 0 results (mixed CJK+English)

After fix:

session_search("记忆断裂") → 20+ results
session_search("Friday beetle conatus") → 3 results

Fixes #11511

FTS5 default tokenizer splits CJK text character-by-character, causing
multi-character Chinese/Japanese/Korean queries to return 0 results
despite matching data existing in the database.

Changes:
- Add CJK-aware query rewriting in _sanitize_fts5_query(): strip
  stop-words, build OR-connected bigram pairs for better recall
- Add LIKE substring fallback in search_messages(): when FTS5 returns
  no results and query contains CJK characters, retry with LIKE
- Preserves FTS5 performance for English/Latin queries

Fixes NousResearch#11511
teknium1 pushed a commit that referenced this pull request Apr 18, 2026
Twelve tests under TestCJKSearchFallback guarding:
 - CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges
   (including the full Hangul syllables block \uac00-\ud7af, to catch
   the shorter-range typo from one of the duplicate PRs)
 - Substring match for multi-char Chinese, Japanese, Korean queries
 - Filter preservation (source_filter, exclude_sources, role_filter)
   in the LIKE path — guards against the SQL-builder bug from another
   duplicate PR where filter clauses landed after LIMIT/OFFSET
 - Snippet centered on the matched term (instr-based substr window),
   not the leading 200 chars of content
 - English fast-path untouched
 - Empty/no-match cases
 - Mixed CJK+English queries

Also:
 - hermes_state.py: LIKE-fallback snippet is now
   `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on
   the match instead of the whole-content default. Credit goes to
   @iamagenius00 for the snippet idea in PR #11517.
 - scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future
   release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
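The snippet expression quoted in the commit message can be tried in isolation. The sketch below evaluates `substr(content, max(1, instr(content, ?) - 40), 120)` against a made-up string; the sample content is an assumption, while the SQL expression itself is the one from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

SNIPPET_SQL = "SELECT substr(?, max(1, instr(?, ?) - 40), 120)"


def snippet(content: str, term: str) -> str:
    """Return a 120-character window starting up to 40 characters before the match."""
    (result,) = conn.execute(SNIPPET_SQL, (content, content, term)).fetchone()
    return result


term = "记忆断裂"

# Match deep inside the content: the window starts 40 characters before it.
deep = snippet("x" * 100 + term + "y" * 100, term)

# Match at the very start: max(1, ...) clamps the window to position 1.
early = snippet(term + "z" * 200, term)
```

For text values, `instr` and `substr` count characters rather than bytes, so the window is not thrown off by multi-byte CJK characters.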
@teknium1

Contributor

Salvaged into #12075 (merged to main). The centered-snippet idea from your PR — substr(content, max(1, instr(content, ?) - 40), 120) — was incorporated as a follow-up commit with you as co-author.

Context: three concurrent PRs for #11511 (#11516 @vominh1919, #11517 yours, #11541 @gongli0929). Picked @vominh1919's cleaner base for the LIKE fallback, then added your centered-snippet improvement on top. Regarding the FTS5 bigram-OR rewriting in _sanitize_fts5_query — empirically it doesn't actually help with SQLite's default unicode61 tokenizer (which indexes whole CJK runs as single tokens, so neither bigrams nor single-char OR can match), so only the LIKE fallback logic was carried over. The real long-term fix for that is a trigram tokenizer migration, worth its own PR.
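The tokenizer behaviour described above can be reproduced with a small script. This is a sketch that assumes the local SQLite build includes FTS5; the table name `msgs` and the sample rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The default FTS5 tokenizer is unicode61; an unbroken CJK run becomes one token.
conn.execute("CREATE VIRTUAL TABLE msgs USING fts5(content)")
conn.executemany(
    "INSERT INTO msgs (content) VALUES (?)",
    [("我的记忆断裂了",), ("记忆断裂 again",)],
)


def fts(query: str) -> list:
    rows = conn.execute(
        "SELECT content FROM msgs WHERE msgs MATCH ? ORDER BY rowid", (query,)
    )
    return [r[0] for r in rows]


def like(term: str) -> list:
    rows = conn.execute(
        "SELECT content FROM msgs WHERE content LIKE ? ORDER BY rowid",
        (f"%{term}%",),
    )
    return [r[0] for r in rows]


# Matches only where the query is a whole space-delimited token.
whole_run = fts('"记忆断裂"')
# Bigrams never equal an indexed token, so OR-ing them finds nothing.
bigram_or = fts('"记忆" OR "忆断" OR "断裂"')
# Plain substring search finds both rows.
substring = like("记忆断裂")
```

Here the first row is indexed as the single token `我的记忆断裂了`, which is why only the LIKE path reaches it.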

Thank you for contributing! 🙏

#12075

@teknium1 teknium1 closed this Apr 18, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Development

Successfully merging this pull request may close these issues.

session_search: FTS5 returns empty results for Chinese/CJK queries

2 participants