Skip to content

fix: bypass FTS5 for CJK queries in session_search#15509

Closed
vominh1919 wants to merge 2 commits into
NousResearch:mainfrom
vominh1919:fix/session-search-cjk-fallback
Closed

fix: bypass FTS5 for CJK queries in session_search#15509
vominh1919 wants to merge 2 commits into
NousResearch:mainfrom
vominh1919:fix/session-search-cjk-fallback

Conversation

@vominh1919

Copy link
Copy Markdown
Contributor

Fixes #15500

Problem

session_search returns zero or very few results for CJK (Chinese/Japanese/Korean) queries despite session files clearly containing matching content. grep finds 148 matches but session_search returns only 1.

Root Cause

FTS5's default tokenizer splits CJK characters into individual tokens. A query like "大别山项目" becomes 大 AND 别 AND 山 AND 项 AND 目. The existing LIKE fallback only triggers when FTS5 returns zero results. If FTS5 returns even 1-2 matches, the LIKE fallback never runs.

Fix

For CJK queries, bypass FTS5 entirely and use LIKE '%query%' for accurate phrase matching.

Files Changed

  • hermes_state.py — restructured CJK search path

Add shutdown guard to sync_turn() and defense-in-depth loop.is_closed()
check in _run_sync() to prevent RuntimeError during interpreter teardown.

Fixes NousResearch#15497
FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 25, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #14842 — both fix CJK partial-result issue in session_search FTS5 for #14829. This PR bypasses FTS5 entirely for CJK while #14842 supplements with LIKE.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Salvaged into #16276 with improvements on top:

  • Cached _contains_cjk() result to avoid redundant O(n) scans
  • Added LIKE wildcard escaping (%, _) consistent with other LIKE queries in hermes_state.py
  • Fixed misleading comment
  • Added 3 regression tests (partial FTS5 results, dedup, wildcard escaping)

The hindsight shutdown guard (commit cfb22a0) was left out since it's an unrelated fix — should go in a separate PR.

Thank you @vominh1919 for the fix — the core approach (bypassing FTS5 entirely for CJK) is clean and correct.

alt-glitch added a commit that referenced this pull request Apr 27, 2026
* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes #15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from #15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (#15500 / #14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of #16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
@alt-glitch

Copy link
Copy Markdown
Collaborator

closed in favour of #16651

@alt-glitch alt-glitch closed this Apr 27, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…esearch#16651)

* fix: bypass FTS5 for CJK queries in session_search

FTS5 default tokenizer splits CJK characters into individual tokens,
so multi-character queries like "大别山项目" become AND of single chars.
This produces few/no results compared to LIKE substring search.

For CJK queries, skip FTS5 entirely and use LIKE for accurate
phrase matching.

Fixes NousResearch#15500

* fix: cache _contains_cjk, escape LIKE wildcards, add regression tests

On top of the CJK FTS5 bypass from NousResearch#15509:

- Cache _contains_cjk() result in a local var to avoid redundant O(n)
  scans on every CJK query
- Escape %, _ in LIKE queries so literal wildcards in user input are
  not treated as SQL wildcards (consistent with other LIKE queries in
  hermes_state.py that use ESCAPE '\')
- Fix misleading comment ('or CJK fallback' → accurate description)
- Add 3 regression tests:
  - test_cjk_partial_fts5_results_supplemented_by_like (NousResearch#15500 / NousResearch#14829)
  - test_cjk_like_dedup_no_duplicates
  - test_cjk_like_escapes_wildcards (new wildcard escaping)

* feat: trigram FTS5 index for CJK search, replace LIKE fallback

Replace the LIKE '%query%' full-table-scan fallback for CJK queries with
a proper trigram FTS5 index (messages_fts_trigram).  The trigram tokenizer
creates overlapping 3-byte sequences so substring matching works natively
for any script — CJK, Thai, etc.

For queries with 3+ CJK characters: uses the trigram FTS5 table with
proper ranking, snippets, and indexed lookups.  For shorter queries
(1-2 CJK chars): falls back to LIKE since the trigram tokenizer needs
≥9 UTF-8 bytes (3 CJK chars) minimum.

Schema v10 migration creates the trigram table and backfills existing
messages.  Triggers keep the index in sync on INSERT/UPDATE/DELETE.

Builds on top of NousResearch#16276 (bypass FTS5 for CJK, escape LIKE wildcards).

---------

Co-authored-by: vominh1919 <vominh1919@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] session_search returns zero/partial results despite session files containing matching content

2 participants