fix: add LIKE fallback for CJK queries in session_search by iamagenius00 · Pull Request #11517 · NousResearch/hermes-agent

iamagenius00 · 2026-04-17T09:35:57Z

Summary

Fix session_search returning empty results for Chinese/Japanese/Korean queries.

Problem

FTS5's default tokenizer splits CJK text character-by-character (no word boundaries). Searching "记忆断裂" becomes 记 AND 忆 AND 断 AND 裂 — requiring all 4 individual characters in the same message. Despite data existing (LIKE finds 20+ matches), FTS5 returns 0 results.

This affects all CJK users.

Changes

1. CJK-aware query rewriting (`_sanitize_fts5_query`)

- Detect CJK characters in the query
- Strip common Chinese stop-words (的、了、是、在, etc.)
- Build OR-connected bigram pairs for queries of 3+ characters (better recall than full AND, better precision than single-character OR)
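
The rewriting steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_sanitize_fts5_query` code: the function names, the character ranges, and the four-entry stop-word set are assumptions, and whether bigram terms can match at all depends on how the FTS5 index tokenizes CJK text.

```python
import re

# Assumed ranges: CJK ideographs, Hiragana, Katakana, Hangul syllables.
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]")

# Illustrative stop-word set only; the PR's full list is not shown in the description.
_STOP_WORDS = frozenset("的了是在")


def contains_cjk(text: str) -> bool:
    """Return True if any character falls in the assumed CJK ranges."""
    return bool(_CJK_RE.search(text))


def rewrite_cjk_term(term: str) -> str:
    """Rewrite a pure-CJK term into OR-connected, quoted bigrams.

    Terms that contain non-CJK characters, or that have fewer than three
    characters left after stop-word removal, are returned unchanged.
    """
    if not term or not all(_CJK_RE.match(c) for c in term):
        return term
    chars = [c for c in term if c not in _STOP_WORDS]
    if len(chars) < 3:
        return term
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return " OR ".join(f'"{bg}"' for bg in bigrams)


print(rewrite_cjk_term("记忆断裂"))  # "记忆" OR "忆断" OR "断裂"
```

Quoting each bigram keeps it a single FTS5 phrase instead of letting the query parser treat it as separate terms.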

2. LIKE fallback (`search_messages`)

- When FTS5 returns 0 results and the query contains CJK characters, retry with `WHERE content LIKE ?`
- Preserves all existing filters (source exclusion, role filter)
- Respects the existing limit parameter
- Groups results by session_id, consistent with the FTS5 path
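
A rough sketch of the fallback path, assuming a hypothetical `messages` table with `session_id`, `role`, `source`, and `content` columns. The real schema and the `search_messages` signature in `hermes_state.py` may differ. Note that every filter clause is appended to the `WHERE` list before `LIMIT` is added.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id TEXT, "
        "role TEXT, source TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
        [
            (1, "s1", "user", "cli", "关于记忆断裂的讨论"),
            (2, "s1", "assistant", "cli", "记忆断裂可以这样处理"),
            (3, "s2", "user", "tool", "记忆断裂 again"),
            (4, "s2", "user", "cli", "100% unrelated"),
        ],
    )
    return conn


def like_fallback(conn, query, limit=20, role_filter=None, exclude_sources=None):
    """Substring search via LIKE, grouped by session_id."""
    # Escape LIKE wildcards so the user's query is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    where = ["content LIKE ? ESCAPE '\\'"]
    params = [f"%{escaped}%"]
    if role_filter:
        where.append("role = ?")
        params.append(role_filter)
    if exclude_sources:
        marks = ", ".join("?" for _ in exclude_sources)
        where.append(f"source NOT IN ({marks})")
        params.extend(exclude_sources)
    # All filters go into WHERE; LIMIT is appended last.
    sql = (
        "SELECT session_id, role, content FROM messages "
        f"WHERE {' AND '.join(where)} ORDER BY id DESC LIMIT ?"
    )
    params.append(limit)
    grouped = {}
    for session_id, role, content in conn.execute(sql, params):
        grouped.setdefault(session_id, []).append((role, content))
    return grouped
```

For example, `like_fallback(make_demo_db(), "记忆断裂", exclude_sources=["tool"])` returns only the two messages from session `s1`.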

Performance

LIKE is slower (full table scan), but for Hermes's data volume (thousands of messages) there's no perceptible difference.

Testing

Before fix:

session_search("记忆断裂") → 0 results
session_search("Friday beetle conatus") → 0 results (mixed CJK+English)

After fix:

session_search("记忆断裂") → 20+ results
session_search("Friday beetle conatus") → 3 results

Fixes #11511

FTS5 default tokenizer splits CJK text character-by-character, causing multi-character Chinese/Japanese/Korean queries to return 0 results despite matching data existing in the database.

Changes:
- Add CJK-aware query rewriting in _sanitize_fts5_query(): strip stop-words, build OR-connected bigram pairs for better recall
- Add LIKE substring fallback in search_messages(): when FTS5 returns no results and query contains CJK characters, retry with LIKE
- Preserves FTS5 performance for English/Latin queries

Fixes NousResearch#11511

@iamagenius00

Twelve tests under TestCJKSearchFallback guarding:
- CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs)
- Substring match for multi-char Chinese, Japanese, Korean queries
- Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path; this guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET
- Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content
- English fast-path untouched
- Empty/no-match cases
- Mixed CJK+English queries

Also:
- hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517.
- scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>
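
To illustrate how the snippet expression quoted in the commit message behaves, here is a small sketch. The `messages` table and the `centered_snippets` helper are hypothetical; only the `substr(...)` expression comes from the commit message.

```python
import sqlite3

# The SELECT expression is the one quoted in the commit message; the table
# name and the surrounding query are assumptions for illustration.
SNIPPET_SQL = (
    "SELECT substr(content, max(1, instr(content, ?) - 40), 120) "
    "FROM messages WHERE content LIKE ?"
)


def centered_snippets(conn, term):
    """Return a window of up to 120 characters starting 40 before the match."""
    return [row[0] for row in conn.execute(SNIPPET_SQL, (term, f"%{term}%"))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (content TEXT)")
conn.execute(
    "INSERT INTO messages VALUES (?)", ("a" * 100 + "记忆断裂" + "b" * 100,)
)
snippet = centered_snippets(conn, "记忆断裂")[0]
print(len(snippet))  # 120
```

`instr` returns a 1-based character position, so a match at position 101 gives a window that starts at position 61. When the match sits within the first 40 characters, `max(1, ...)` clamps the start to the beginning of the content. `instr` is also case-sensitive while `LIKE` ignores ASCII case by default, so a case-mismatched Latin match makes `instr` return 0 and the window falls back to the start of the content.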

teknium1 · 2026-04-18T08:58:31Z

Salvaged into #12075 (merged to main). The centered-snippet idea from your PR — substr(content, max(1, instr(content, ?) - 40), 120) — was incorporated as a follow-up commit with you as co-author.

Context: three concurrent PRs for #11511 (#11516 @vominh1919, #11517 yours, #11541 @gongli0929). Picked @vominh1919's cleaner base for the LIKE fallback, then added your centered-snippet improvement on top. Regarding the FTS5 bigram-OR rewriting in _sanitize_fts5_query — empirically it doesn't actually help with SQLite's default unicode61 tokenizer (which indexes whole CJK runs as single tokens, so neither bigrams nor single-char OR can match), so only the LIKE fallback logic was carried over. The real long-term fix for that is a trigram tokenizer migration, worth its own PR.
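
The tokenizer behaviour described here can be checked with a small experiment. This is a sketch with a hypothetical helper name, assuming the local SQLite build includes FTS5 and, for the second call, the `trigram` tokenizer (SQLite 3.34 or later). The helper returns `None` when either is unavailable.

```python
import sqlite3


def count_matches(tokenizer: str, doc: str, query: str):
    """Index one document with the given FTS5 tokenizer and count phrase matches.

    Returns None if FTS5 or the requested tokenizer is not available.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            f"CREATE VIRTUAL TABLE t USING fts5(content, tokenize='{tokenizer}')"
        )
        conn.execute("INSERT INTO t(content) VALUES (?)", (doc,))
        # Quote the query so FTS5 treats it as a single phrase.
        quoted = '"' + query.replace('"', '""') + '"'
        row = conn.execute(
            "SELECT count(*) FROM t WHERE t MATCH ?", (quoted,)
        ).fetchone()
        return row[0]
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


doc = "今天讨论了记忆断裂的问题"
print(count_matches("unicode61", doc, "记忆断裂"))  # 0 when FTS5 is available
print(count_matches("trigram", doc, "记忆断裂"))    # 1 when trigram is available
```

With `unicode61`, the unbroken CJK run is indexed as one token, so a query for a sub-run finds nothing. With `trigram`, any substring of three or more characters can match.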

Thank you for contributing! 🙏

#12075

teknium1 mentioned this pull request Apr 18, 2026

fix(session-search): LIKE fallback for CJK queries (salvages #11516, #11517, #11541) #12075

Merged

teknium1 mentioned this pull request Apr 18, 2026

fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries #11516

Closed

teknium1 closed this Apr 18, 2026

teknium1 mentioned this pull request Apr 18, 2026

session_search: FTS5 returns empty results for Chinese/CJK queries #11511

Closed

iamagenius00 wants to merge 1 commit into NousResearch:main from iamagenius00:fix/cjk-fts5-fallback
