fix(session_search): bypass FTS5 for CJK queries, escape LIKE wildcards by alt-glitch · Pull Request #16276 · NousResearch/hermes-agent

alt-glitch · 2026-04-27T01:15:16Z

Salvage of #15509

Closes #15500 (duplicate of #14829). Supersedes #15509 and #14842.

Problem

session_search returns zero or very few results for CJK (Chinese/Japanese/Korean) queries despite session files clearly containing matching content. grep finds 148 matches but session_search returns only 1.

Root cause: FTS5's unicode61 tokenizer splits CJK characters into individual tokens. A query like "大别山项目" becomes 大 AND 别 AND 山 AND 项 AND 目. The existing LIKE fallback only triggered when FTS5 returned zero results — if FTS5 returned even 1-2 accidental matches from scattered character hits, the more accurate LIKE fallback never ran.

Fix

Cherry-picked from #15509 (author: @vominh1919) with improvements on top:

From #15509

For CJK queries, bypass FTS5 entirely and use LIKE '%query%' for accurate phrase matching

Added in this PR

Cache _contains_cjk() result in a local var — avoids redundant O(n) scans (was called twice per CJK query)
Escape LIKE wildcards (%, _) in user queries so they are treated as literals, consistent with other LIKE queries in hermes_state.py that use ESCAPE '\\'
Fix misleading comment — 'or CJK fallback' no longer describes the new code path
3 regression tests — partial FTS5 results ([BUG] session_search returns zero/partial results despite session files containing matching content #15500/FTS5 unicode61 tokenizer silently drops CJK characters, LIKE fallback only triggers on zero results #14829), dedup, and wildcard escaping

Not included (from #15509)

Hindsight shutdown guard (commit cfb22a0) — unrelated fix, should be a separate PR

Before / After

Scenario	Before	After
CJK query, FTS5 returns 0	✅ LIKE fallback	✅ LIKE runs (primary)
CJK query, FTS5 returns partial	❌ LIKE skipped	✅ LIKE runs (primary)
CJK query with `%` or `_`	❌ Treated as SQL wildcards	✅ Escaped, literal match
English query	✅ FTS5 only	✅ FTS5 only (unchanged)
Mixed CJK + English	✅ LIKE fallback on 0	✅ LIKE always for CJK

Testing

pytest tests/test_hermes_state.py::TestCJKSearchFallback -v
# 15 passed (12 existing + 3 new)

pytest tests/test_hermes_state.py -v
# 181 passed

FTS5 default tokenizer splits CJK characters into individual tokens, so multi-character queries like "大别山项目" become AND of single chars. This produces few/no results compared to LIKE substring search. For CJK queries, skip FTS5 entirely and use LIKE for accurate phrase matching. Fixes #15500

On top of the CJK FTS5 bypass from #15509: - Cache _contains_cjk() result in a local var to avoid redundant O(n) scans on every CJK query - Escape %, _ in LIKE queries so literal wildcards in user input are not treated as SQL wildcards (consistent with other LIKE queries in hermes_state.py that use ESCAPE '\') - Fix misleading comment ('or CJK fallback' → accurate description) - Add 3 regression tests: - test_cjk_partial_fts5_results_supplemented_by_like (#15500 / #14829) - test_cjk_like_dedup_no_duplicates - test_cjk_like_escapes_wildcards (new wildcard escaping)

alt-glitch · 2026-04-27T01:52:01Z

@BugBot review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit c93f1c8. Configure here.}

vominh1919 · 2026-04-27T03:54:36Z

alt-glitch
Can it be merged, bro?

alt-glitch

Code Review

Verdict: Approve

Clean, focused fix. The root cause analysis is correct — FTS5 unicode61 splits CJK into individual character tokens, and the old LIKE fallback only triggered on zero FTS5 results, missing the partial-hit case.

What's right

Bypassing FTS5 entirely for CJK (vs. the old "fallback on zero") is the correct fix. LIKE substring search is the only reliable path for CJK phrase matching with the default tokenizer.
LIKE wildcard escaping (\ → \\, % → \%, _ → \_) is correct and consistent with existing patterns elsewhere in hermes_state.py (lines 601, 720, 750). Escape order is right — backslash first.
is_cjk caching avoids redundant _contains_cjk() O(n) scan.
instr() snippet parameter correctly uses unescaped raw_query (literal string matching, not LIKE).
All 3 new tests target specific failure modes from the bug report. 15/15 CJK tests pass, no regressions in the full test_hermes_state.py suite.
Hindsight shutdown guard exclusion keeps the PR scoped to the bug fix.

No issues found. LGTM.

Reviewed by Hermes Agent

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with a proper trigram FTS5 index (messages_fts_trigram). The trigram tokenizer creates overlapping 3-byte sequences so substring matching works natively for any script — CJK, Thai, etc. For queries with 3+ CJK characters: uses the trigram FTS5 table with proper ranking, snippets, and indexed lookups. For shorter queries (1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs ≥9 UTF-8 bytes (3 CJK chars) minimum. Schema v10 migration creates the trigram table and backfills existing messages. Triggers keep the index in sync on INSERT/UPDATE/DELETE. Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards).

* fix: bypass FTS5 for CJK queries in session_search FTS5 default tokenizer splits CJK characters into individual tokens, so multi-character queries like "大别山项目" become AND of single chars. This produces few/no results compared to LIKE substring search. For CJK queries, skip FTS5 entirely and use LIKE for accurate phrase matching. Fixes #15500 * fix: cache _contains_cjk, escape LIKE wildcards, add regression tests On top of the CJK FTS5 bypass from #15509: - Cache _contains_cjk() result in a local var to avoid redundant O(n) scans on every CJK query - Escape %, _ in LIKE queries so literal wildcards in user input are not treated as SQL wildcards (consistent with other LIKE queries in hermes_state.py that use ESCAPE '\') - Fix misleading comment ('or CJK fallback' → accurate description) - Add 3 regression tests: - test_cjk_partial_fts5_results_supplemented_by_like (#15500 / #14829) - test_cjk_like_dedup_no_duplicates - test_cjk_like_escapes_wildcards (new wildcard escaping) * feat: trigram FTS5 index for CJK search, replace LIKE fallback Replace the LIKE '%query%' full-table-scan fallback for CJK queries with a proper trigram FTS5 index (messages_fts_trigram). The trigram tokenizer creates overlapping 3-byte sequences so substring matching works natively for any script — CJK, Thai, etc. For queries with 3+ CJK characters: uses the trigram FTS5 table with proper ranking, snippets, and indexed lookups. For shorter queries (1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs ≥9 UTF-8 bytes (3 CJK chars) minimum. Schema v10 migration creates the trigram table and backfills existing messages. Triggers keep the index in sync on INSERT/UPDATE/DELETE. Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards). --------- Co-authored-by: vominh1919 <vominh1919@gmail.com>

alt-glitch · 2026-04-27T18:47:44Z

closed in favour of #16651

alt-glitch · 2026-04-27T18:47:50Z

closed in favour of #16651

…esearch#16651) * fix: bypass FTS5 for CJK queries in session_search FTS5 default tokenizer splits CJK characters into individual tokens, so multi-character queries like "大别山项目" become AND of single chars. This produces few/no results compared to LIKE substring search. For CJK queries, skip FTS5 entirely and use LIKE for accurate phrase matching. Fixes NousResearch#15500 * fix: cache _contains_cjk, escape LIKE wildcards, add regression tests On top of the CJK FTS5 bypass from NousResearch#15509: - Cache _contains_cjk() result in a local var to avoid redundant O(n) scans on every CJK query - Escape %, _ in LIKE queries so literal wildcards in user input are not treated as SQL wildcards (consistent with other LIKE queries in hermes_state.py that use ESCAPE '\') - Fix misleading comment ('or CJK fallback' → accurate description) - Add 3 regression tests: - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829) - test_cjk_like_dedup_no_duplicates - test_cjk_like_escapes_wildcards (new wildcard escaping) * feat: trigram FTS5 index for CJK search, replace LIKE fallback Replace the LIKE '%query%' full-table-scan fallback for CJK queries with a proper trigram FTS5 index (messages_fts_trigram). The trigram tokenizer creates overlapping 3-byte sequences so substring matching works natively for any script — CJK, Thai, etc. For queries with 3+ CJK characters: uses the trigram FTS5 table with proper ranking, snippets, and indexed lookups. For shorter queries (1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs ≥9 UTF-8 bytes (3 CJK chars) minimum. Schema v10 migration creates the trigram table and backfills existing messages. Triggers keep the index in sync on INSERT/UPDATE/DELETE. Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards). --------- Co-authored-by: vominh1919 <vominh1919@gmail.com>

vominh1919 and others added 2 commits April 27, 2026 06:43

alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 27, 2026

alt-glitch mentioned this pull request Apr 27, 2026

fix: bypass FTS5 for CJK queries in session_search #15509

Closed

cursor Bot reviewed Apr 27, 2026

View reviewed changes

alt-glitch added the P2 Medium — degraded but workaround exists label Apr 27, 2026

alt-glitch commented Apr 27, 2026

View reviewed changes

alt-glitch mentioned this pull request Apr 27, 2026

feat: trigram FTS5 index for CJK search, replace LIKE fallback #16651

Merged

alt-glitch closed this Apr 27, 2026

alt-glitch deleted the fix/session-search-cjk-bypass branch April 27, 2026 18:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(session_search): bypass FTS5 for CJK queries, escape LIKE wildcards#16276

fix(session_search): bypass FTS5 for CJK queries, escape LIKE wildcards#16276
alt-glitch wants to merge 2 commits into
mainfrom
fix/session-search-cjk-bypass

alt-glitch commented Apr 27, 2026

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

vominh1919 commented Apr 27, 2026

Uh oh!

alt-glitch left a comment

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alt-glitch commented Apr 27, 2026

Salvage of #15509

Problem

Fix

From #15509

Added in this PR

Not included (from #15509)

Before / After

Testing

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

vominh1919 commented Apr 27, 2026

Uh oh!

alt-glitch left a comment

Choose a reason for hiding this comment

Code Review

What's right

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

alt-glitch commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants