Skip to content

fix(session_search): bypass FTS5 for CJK queries, escape LIKE wildcards#16276

Closed
alt-glitch wants to merge 2 commits into
mainfrom
fix/session-search-cjk-bypass
Closed

fix(session_search): bypass FTS5 for CJK queries, escape LIKE wildcards#16276
alt-glitch wants to merge 2 commits into
mainfrom
fix/session-search-cjk-bypass

Conversation

@alt-glitch

Copy link
Copy Markdown
Collaborator

Salvage of #15509

Closes #15500 (duplicate of #14829). Supersedes #15509 and #14842.

Problem

session_search returns zero or very few results for CJK (Chinese/Japanese/Korean) queries despite session files clearly containing matching content. grep finds 148 matches but session_search returns only 1.

Root cause: FTS5's unicode61 tokenizer splits CJK characters into individual tokens. A query like "大别山项目" becomes 大 AND 别 AND 山 AND 项 AND 目. The existing LIKE fallback only triggered when FTS5 returned zero results — if FTS5 returned even 1-2 accidental matches from scattered character hits, the more accurate LIKE fallback never ran.

Fix

Cherry-picked from #15509 (author: @vominh1919) with improvements on top:

From #15509

  • For CJK queries, bypass FTS5 entirely and use LIKE '%query%' for accurate phrase matching

Added in this PR

Not included (from #15509)

  • Hindsight shutdown guard (commit cfb22a0) — unrelated fix, should be a separate PR

Before / After

Scenario Before After
CJK query, FTS5 returns 0 ✅ LIKE fallback ✅ LIKE runs (primary)
CJK query, FTS5 returns partial ❌ LIKE skipped ✅ LIKE runs (primary)
CJK query with % or _ ❌ Treated as SQL wildcards ✅ Escaped, literal match
English query ✅ FTS5 only ✅ FTS5 only (unchanged)
Mixed CJK + English ✅ LIKE fallback on 0 ✅ LIKE always for CJK

Testing

pytest tests/test_hermes_state.py::TestCJKSearchFallback -v
# 15 passed (12 existing + 3 new)

pytest tests/test_hermes_state.py -v
# 181 passed

vominh1919 and others added 2 commits April 27, 2026 06:43
FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes #15500
On top of the CJK FTS5 bypass from #15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (#15500 / #14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 27, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit c93f1c8. Configure here.

@alt-glitch alt-glitch added the P2 Medium — degraded but workaround exists label Apr 27, 2026
@vominh1919

Copy link
Copy Markdown
Contributor

alt-glitch
Can it be merged, bro?

@alt-glitch alt-glitch left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Verdict: Approve

Clean, focused fix. The root cause analysis is correct — FTS5 unicode61 splits CJK into individual character tokens, and the old LIKE fallback only triggered on zero FTS5 results, missing the partial-hit case.

What's right

  • Bypassing FTS5 entirely for CJK (vs. the old "fallback on zero") is the correct fix. LIKE substring search is the only reliable path for CJK phrase matching with the default tokenizer.
  • LIKE wildcard escaping (\\\, %\%, _\_) is correct and consistent with existing patterns elsewhere in hermes_state.py (lines 601, 720, 750). Escape order is right — backslash first.
  • is_cjk caching avoids redundant _contains_cjk() O(n) scan.
  • instr() snippet parameter correctly uses unescaped raw_query (literal string matching, not LIKE).
  • All 3 new tests target specific failure modes from the bug report. 15/15 CJK tests pass, no regressions in the full test_hermes_state.py suite.
  • Hindsight shutdown guard exclusion keeps the PR scoped to the bug fix.

No issues found. LGTM.


Reviewed by Hermes Agent

alt-glitch added a commit that referenced this pull request Apr 27, 2026
Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards).
alt-glitch added a commit that referenced this pull request Apr 27, 2026
Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards).
alt-glitch added a commit that referenced this pull request Apr 27, 2026
* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes #15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from #15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (#15500 / #14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
@alt-glitch

Copy link
Copy Markdown
Collaborator Author

closed in favour of #16651

1 similar comment
@alt-glitch

Copy link
Copy Markdown
Collaborator Author

closed in favour of #16651

@alt-glitch alt-glitch closed this Apr 27, 2026
@alt-glitch alt-glitch deleted the fix/session-search-cjk-bypass branch April 27, 2026 18:48
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] session_search returns zero/partial results despite session files containing matching content

2 participants