fix(session_search): supplement FTS5 with LIKE for CJK partial results#14842

Closed
kagura-agent wants to merge 2 commits into
NousResearch:mainfrom
kagura-agent:fix/cjk-fts5-partial-results

@kagura-agent

Contributor

Summary

Fixes #14829

FTS5's unicode61 tokenizer silently drops certain CJK characters, causing queries like 昨晚 to return only a fraction of actual matches. The existing LIKE fallback (added in 8826d9c for #11511) only triggers when FTS5 returns zero results, but the more common case is FTS5 returning some results while missing many others.
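To make the LIKE path concrete, here is a minimal sketch of a substring search against an in-memory SQLite database. The `messages` table, its `content` column, and the `like_search` helper are hypothetical names chosen for illustration; they are not taken from hermes_state.py.

```python
import sqlite3


def like_search(conn, term):
    """Return (id, content) rows whose content contains `term` literally."""
    # Escape LIKE wildcards so user input is matched as plain text.
    escaped = (
        term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cursor = conn.execute(
        "SELECT id, content FROM messages "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    )
    return cursor.fetchall()


# Hypothetical sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO messages (content) VALUES (?)",
    [("昨晚我们讨论了部署",), ("今天天气不错",), ("他昨晚没睡好",)],
)
matching_ids = [row[0] for row in like_search(conn, "昨晚")]
```

Because LIKE scans the stored text directly, it does not depend on how a tokenizer split the content, at the cost of a full scan of the table.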

Changes

  • hermes_state.py: Change the LIKE path from a zero-result fallback to an always-run supplement for CJK queries. Results are merged with deduplication by message id, preserving FTS5 results while LIKE fills in the gaps.
  • tests/test_hermes_state.py: Add two regression tests:
    • test_cjk_partial_fts5_results_supplemented_by_like: verifies LIKE supplements partial FTS5 results
    • test_cjk_like_dedup_no_duplicates: verifies no duplicate results when both FTS5 and LIKE match
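The merge step described above can be sketched as a pure function. This is an illustration under assumptions, not the code from this PR: `merge_results` is a hypothetical name, and rows are assumed to be dict-like with an `id` key holding the message id.

```python
def merge_results(fts_rows, like_rows, limit=None):
    """Merge FTS5 and LIKE rows, keeping the first occurrence of each id.

    FTS5 rows come first so their order is preserved; LIKE rows only
    contribute ids that FTS5 did not already return.
    """
    seen_ids = set()
    merged = []
    for row in list(fts_rows) + list(like_rows):
        message_id = row["id"]
        if message_id in seen_ids:
            continue  # already returned by FTS5 or an earlier LIKE row
        seen_ids.add(message_id)
        merged.append(row)
    return merged if limit is None else merged[:limit]


# Hypothetical rows: FTS5 found two messages, LIKE found one overlap and one new.
fts_rows = [{"id": 1, "content": "a"}, {"id": 2, "content": "b"}]
like_rows = [{"id": 2, "content": "b"}, {"id": 3, "content": "c"}]
merged_ids = [row["id"] for row in merge_results(fts_rows, like_rows)]
```

Placing FTS5 rows first means any relevance ordering from FTS5 survives the merge, while rows found only by LIKE are appended after them.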

Before / After

| Scenario | Before | After |
| --- | --- | --- |
| CJK query, FTS5 returns 0 | ✅ LIKE fallback | ✅ LIKE runs |
| CJK query, FTS5 returns partial | ❌ LIKE skipped | ✅ LIKE supplements |
| English query | ✅ FTS5 only | ✅ FTS5 only (unchanged) |
| CJK + English mixed | ✅ LIKE fallback on 0 | ✅ LIKE always supplements |
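The scenarios above hinge on deciding whether a query contains CJK text. One way to make that decision is a code-point range check, sketched below. The `contains_cjk` helper and the specific ranges are assumptions for illustration; the detection logic actually used in hermes_state.py may differ.

```python
# Assumed code-point ranges: CJK Unified Ideographs, Extension A,
# Hiragana and Katakana, and Hangul Syllables.
CJK_RANGES = (
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x3040, 0x30FF),
    (0xAC00, 0xD7AF),
)


def contains_cjk(text):
    """Return True if any character of `text` falls in an assumed CJK range."""
    for ch in text:
        code_point = ord(ch)
        if any(low <= code_point <= high for low, high in CJK_RANGES):
            return True
    return False


def should_run_like(query):
    """After this change, LIKE runs for every CJK query, whatever FTS5 returned."""
    return contains_cjk(query)
```

Under this sketch, a pure English query such as `deploy` skips the LIKE pass, while both `昨晚` and the mixed query `deploy 昨晚` trigger it.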

Testing

pytest tests/test_hermes_state.py::TestCJKSearchFallback -v
# 14 passed (12 existing + 2 new)

pytest tests/test_hermes_state.py -v
# 176 passed

@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) on Apr 24, 2026
@kagura-agent force-pushed the fix/cjk-fts5-partial-results branch from afaef94 to dd67359 on April 25, 2026 00:43
@kagura-agent

Contributor Author

Rebased on latest main to resolve conflicts.

@kagura-agent

Contributor Author

Rebased on latest main to resolve merge conflict.

NousResearch#14829)

FTS5 unicode61 tokenizer drops certain CJK characters, causing queries
like '昨晚' to return only 12.5% of actual matches. The existing LIKE
fallback only triggers when FTS5 returns zero results, missing the
common case where FTS5 returns *some* but not all matches.

Change the LIKE path from a fallback (only on empty results) to a
supplement (always runs for CJK queries). Results are merged with
deduplication by message id, so FTS5 results are preserved and LIKE
fills in what FTS5 missed.

- Always run LIKE for CJK queries, not just on zero FTS5 results
- Deduplicate merged results by message id
- Add regression test for partial-result supplementation
- Add deduplication correctness test
@kagura-agent force-pushed the fix/cjk-fts5-partial-results branch from f1d6cfd to 3ee95d1 on April 27, 2026 16:16
@kagura-agent

Contributor Author

Closing: superseded by #16651, which implemented a trigram FTS5 index for CJK search and replaces the LIKE fallback strategy from this PR. The issue is fully resolved.


Development

Successfully merging this pull request may close these issues.

FTS5 unicode61 tokenizer silently drops CJK characters, LIKE fallback only triggers on zero results

2 participants