Bug: CJK full-text search returns incomplete results
Summary
The search_messages() method in hermes_state.py uses FTS5 with the default unicode61 tokenizer for session search. This tokenizer silently drops many CJK characters, causing Chinese/Japanese/Korean queries to return incomplete results. The existing LIKE fallback only activates when FTS5 returns zero matches, so it misses the common case where FTS5 returns some results but misses many others.
Root Cause
Two compounding issues:
1. FTS5 unicode61 drops CJK characters
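The unicode61 tokenizer does not properly tokenize CJK characters: from the search side, many behave as if they had been silently discarded like punctuation. The likely mechanism is that unicode61 has no word segmentation for these scripts, so an unbroken run of CJK characters is indexed as one long token and a query for only part of that run does not match. This is a known SQLite limitation. Example from a real database: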
| Query | FTS5 matches | LIKE matches | Coverage |
|---|---|---|---|
| 昨晚 | 2 | 16 | 12.5% |
| 半夜 | 0 | 2 | 0% |
| 中欧红利 | 37 | 211 | 17.5% |
Individual character analysis shows certain CJK chars are completely absent from the FTS5 index:
| Character | FTS5 hits | LIKE hits | Status |
|---|---|---|---|
| 守 | 0 | 169 | ❌ Dropped |
| 昨 | 0 | 133 | ❌ Dropped |
| 利 | 0 | 1358 | ❌ Dropped |
| 晚 | 25 | 266 | ⚠️ Partial |
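The gap between FTS5 and LIKE counts can be reproduced outside Hermes. The following is a minimal sketch, assuming Python's `sqlite3` module is built with FTS5; the table name `messages_fts`, the sample rows, and the helpers `fts_count()` and `like_count()` are all made up for illustration and are not taken from `hermes_state.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical FTS5 table using the default unicode61 tokenizer.
conn.execute(
    "CREATE VIRTUAL TABLE messages_fts USING fts5(content, tokenize='unicode61')"
)
rows = [
    "昨晚我们讨论了中欧红利基金",  # unbroken CJK run, no spaces or punctuation
    "昨晚 开会",                    # 昨晚 is delimited by a space
    "hello world",
]
conn.executemany(
    "INSERT INTO messages_fts(content) VALUES (?)", [(r,) for r in rows]
)


def fts_count(term: str) -> int:
    """Count rows matched by FTS5 for a quoted phrase query."""
    sql = "SELECT count(*) FROM messages_fts WHERE messages_fts MATCH ?"
    return conn.execute(sql, (f'"{term}"',)).fetchone()[0]


def like_count(term: str) -> int:
    """Count rows that contain the term as a plain substring."""
    sql = "SELECT count(*) FROM messages_fts WHERE content LIKE ?"
    return conn.execute(sql, (f"%{term}%",)).fetchone()[0]


print("FTS5:", fts_count("昨晚"), "LIKE:", like_count("昨晚"))
```

With these three rows, LIKE finds both messages containing 昨晚, while FTS5 should find only the one where 昨晚 stands alone between separators. That matches the pattern in the tables: the character is indexed, but only as part of a longer token.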
2. LIKE fallback condition is too narrow
Current logic (line 1248):
if not matches and self._contains_cjk(query):
    # LIKE fallback
This only triggers when FTS5 returns zero results. But as shown above, FTS5 often returns some results for CJK queries, just far fewer than it should. The fallback is never reached in those cases.
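The effect of that condition can be modelled in a few lines. This is a simplified sketch rather than the real `search_messages()` code: `current_search()` is a hypothetical stand-in, and the result lists are placeholders sized to the 昨晚 and 半夜 counts reported earlier.

```python
def current_search(fts_matches, like_matches, is_cjk):
    """Simplified model of the current gating.

    LIKE results are used only when FTS5 found nothing at all.
    """
    matches = list(fts_matches)
    if not matches and is_cjk:
        matches = list(like_matches)
    return matches


# 昨晚: FTS5 finds 2 of the 16 messages, so the fallback never runs.
partial = current_search(range(2), range(16), is_cjk=True)

# 半夜: FTS5 finds 0, so the fallback runs and both messages are returned.
empty = current_search([], range(2), is_cjk=True)

print(len(partial), len(empty))  # prints "2 2"
```

The query with zero FTS5 hits ends up with complete results, while the query with a few FTS5 hits is stuck at 2 of 16.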
Impact
- Users in CJK locales (Chinese, Japanese, Korean) get unreliable session_search results
- The agent reports "no matching sessions found" for conversations that clearly exist
- This is especially impactful for Feishu/WeChat/DingTalk users whose messages are predominantly CJK
Suggested Fix
For CJK queries, skip FTS5 entirely and go straight to LIKE (or always run LIKE as a supplement). Example:
# Option A: CJK queries bypass FTS5 entirely
if self._contains_cjk(original_query):
    # go straight to LIKE fallback
    ...
# Option B: Always supplement FTS5 with LIKE for CJK queries
if self._contains_cjk(original_query):
    # merge FTS5 + LIKE results (dedup by message id)
    ...
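Option B could look roughly like the sketch below. It assumes a plain `messages(id, content)` table, which may not match the real schema, and `contains_cjk()`, `like_search()`, and `merge_by_id()` are hypothetical helpers written for this example, not the existing methods in `hermes_state.py`. The FTS5 result is hard-coded so the sketch runs without an FTS5 index.

```python
import sqlite3

# Common CJK code point ranges; an assumption, not the ranges _contains_cjk() uses.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul syllables
)


def contains_cjk(text: str) -> bool:
    """Return True if any character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def like_search(conn, query: str, limit: int = 50):
    """Substring search with LIKE, escaping wildcards so the query is literal."""
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = (
        "SELECT id, content FROM messages "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY id LIMIT ?"
    )
    return conn.execute(sql, (f"%{escaped}%", limit)).fetchall()


def merge_by_id(fts_rows, like_rows):
    """Keep FTS5 rows first, then append LIKE rows whose id was not seen yet."""
    seen = set()
    merged = []
    for row in list(fts_rows) + list(like_rows):
        if row[0] not in seen:
            seen.add(row[0])
            merged.append(row)
    return merged


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO messages (content) VALUES (?)",
    [("昨晚我们讨论了中欧红利",), ("昨晚 开会",), ("hello world",)],
)

query = "昨晚"
fts_rows = [(2, "昨晚 开会")]  # stand-in for an incomplete FTS5 result
if contains_cjk(query):
    results = merge_by_id(fts_rows, like_search(conn, query))
else:
    results = fts_rows

print([row[0] for row in results])  # prints [2, 1]
```

Putting FTS5 rows first preserves whatever ranking FTS5 produced; LIKE rows carry no relevance score, so they are appended in id order. Option A is the same code with `fts_rows` left empty for CJK queries.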
Environment
- Hermes Agent v0.11.0 (2026.4.23)
- SQLite 3.x with FTS5 (default unicode61 tokenizer)
- Affects all platforms where CJK session content is stored
Related Code
hermes_state.py: search_messages() (line 1164), _contains_cjk() (line 1150), _sanitize_fts5_query() (line 1096)