FTS5 unicode61 tokenizer silently drops CJK characters, LIKE fallback only triggers on zero results #14829

@vincentdongsheng-Dstrom

Bug: CJK full-text search returns incomplete results

Summary

The search_messages() method in hermes_state.py uses FTS5 with the default unicode61 tokenizer for session search. This tokenizer silently drops many CJK characters, causing Chinese/Japanese/Korean queries to return incomplete results. The existing LIKE fallback only activates when FTS5 returns zero matches, so it misses the common case where FTS5 returns some results but misses many others.

Root Cause

Two compounding issues:

1. FTS5 unicode61 drops CJK characters

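The unicode61 tokenizer does not segment CJK text properly, so many characters behave as if they were silently discarded like punctuation. This is a known SQLite limitation. A likely mechanism: unicode61 treats CJK characters as ordinary token characters and splits only on whitespace and punctuation, so an unbroken run of CJK text is indexed as one long token and a shorter query matches only where it happens to stand alone. Example from a real database: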

| Query | FTS5 matches | LIKE matches | Coverage (FTS5 / LIKE) |
|---|---|---|---|
| 昨晚 | 2 | 16 | 12.5% |
| 半夜 | 0 | 2 | 0% |
| 中欧红利 | 37 | 211 | 17.5% |

Per-character analysis shows that some CJK characters return no FTS5 hits at all, even though LIKE finds them in hundreds of messages:

| Character | FTS5 hits | LIKE hits | Status |
|---|---|---|---|
|  | 0 | 169 | ❌ Dropped |
|  | 0 | 133 | ❌ Dropped |
|  | 0 | 1358 | ❌ Dropped |
|  | 25 | 266 | ⚠️ Partial |
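The gap between FTS5 and LIKE can be reproduced in isolation. The following is a minimal sketch, not the Hermes code: the table name `demo_fts`, the helper `compare_fts_and_like`, and the sample rows are all made up for illustration, and it assumes the local Python `sqlite3` build has FTS5 compiled in.

```python
import sqlite3

def compare_fts_and_like(rows, query):
    """Return (fts5_hits, like_hits) for `query` over `rows` in a throwaway database."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table; unicode61 is spelled out to mirror the default tokenizer.
        conn.execute(
            "CREATE VIRTUAL TABLE demo_fts USING fts5(content, tokenize='unicode61')"
        )
        conn.executemany(
            "INSERT INTO demo_fts(content) VALUES (?)", [(r,) for r in rows]
        )
        # Quote the query so FTS5 reads it as a string, not as query syntax.
        phrase = '"' + query.replace('"', '""') + '"'
        fts_hits = conn.execute(
            "SELECT count(*) FROM demo_fts WHERE demo_fts MATCH ?", (phrase,)
        ).fetchone()[0]
        like_hits = conn.execute(
            "SELECT count(*) FROM demo_fts WHERE content LIKE ?", (f"%{query}%",)
        ).fetchone()[0]
        return fts_hits, like_hits
    finally:
        conn.close()

rows = [
    "昨晚 我们讨论了基金",   # query stands alone, bounded by a space
    "我昨晚没睡好",          # query embedded in an unbroken CJK run
    "昨晚，服务器重启了",    # query bounded by full-width punctuation
    "no CJK text here",
]
fts_hits, like_hits = compare_fts_and_like(rows, "昨晚")
print(f"FTS5 hits: {fts_hits}, LIKE hits: {like_hits}")
```

LIKE should find every row that contains the substring, while MATCH is expected to miss the row where the query sits inside a longer unbroken run of CJK text. That is the same shape as the coverage numbers in the table above.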

2. LIKE fallback condition is too narrow

Current logic (line 1248):

if not matches and self._contains_cjk(query):
    # LIKE fallback

This only triggers when FTS5 returns zero results. As shown above, FTS5 often returns some results for CJK queries, just far fewer than it should, so the fallback is never reached in those cases.
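To make the gap concrete, here is a small sketch comparing the current condition with a widened one. `contains_cjk` is a hypothetical stand-in for `_contains_cjk()` (the real implementation is not shown in this issue), and the Unicode ranges it checks are an assumption, not an exhaustive list.

```python
def contains_cjk(text):
    """Hypothetical stand-in for _contains_cjk(); the ranges below are assumed."""
    ranges = (
        (0x3040, 0x30FF),  # Hiragana and Katakana
        (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0xAC00, 0xD7AF),  # Hangul Syllables
    )
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in ranges)

def current_fallback(matches, query):
    # Mirrors the existing condition: LIKE runs only when FTS5 found nothing.
    return not matches and contains_cjk(query)

def widened_fallback(matches, query):
    # Widened condition: LIKE runs for every CJK query, whatever FTS5 returned.
    return contains_cjk(query)

partial = [{"id": 1}]  # FTS5 found some rows, but not all of them
print(current_fallback(partial, "昨晚"))   # fallback skipped
print(widened_fallback(partial, "昨晚"))   # fallback runs
```

With a partial FTS5 result, the current condition skips LIKE and the widened one does not. Non-CJK queries are unaffected either way.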

Impact

  • Users in CJK locales (Chinese, Japanese, Korean) get unreliable session_search results
  • The agent reports "no matching sessions found" for conversations that clearly exist
  • This is especially impactful for Feishu/WeChat/DingTalk users whose messages are predominantly CJK

Suggested Fix

For CJK queries, skip FTS5 entirely and go straight to LIKE (or always run LIKE as a supplement). Example:

# Option A: CJK queries bypass FTS5 entirely
if self._contains_cjk(original_query):
    # go straight to LIKE fallback
    ...

# Option B: Always supplement FTS5 with LIKE for CJK queries
if self._contains_cjk(original_query):
    # merge FTS5 + LIKE results (dedup by message id)
    ...
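Option B could look roughly like the sketch below. It is an illustration under assumptions, not a patch: the `messages` table with `id` and `content` columns is a made-up schema, and `like_search` and `merge_by_id` are hypothetical helpers that do not exist in hermes_state.py.

```python
import sqlite3

def like_search(conn, query, limit=50):
    """Substring search over an assumed messages(id, content) table."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return conn.execute(
        "SELECT id, content FROM messages "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY id LIMIT ?",
        (f"%{escaped}%", limit),
    ).fetchall()

def merge_by_id(fts_rows, like_rows):
    """Keep FTS5 rows first (they carry ranking), then append unseen LIKE rows."""
    merged, seen = [], set()
    for row in list(fts_rows) + list(like_rows):
        if row[0] not in seen:  # row[0] is the message id
            seen.add(row[0])
            merged.append(row)
    return merged

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(1, "昨晚 开会"), (2, "我昨晚没睡好"), (3, "100%_done")],
)
fts_rows = [(1, "昨晚 开会")]  # pretend FTS5 only found this row
merged = merge_by_id(fts_rows, like_search(conn, "昨晚"))
print([row[0] for row in merged])
```

Putting FTS5 rows first preserves whatever ranking they already have. LIKE rows have no relevance score, so they are appended in id order here; ordering them by timestamp would be another reasonable choice.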

Environment

  • Hermes Agent v0.11.0 (2026.4.23)
  • SQLite 3.x with FTS5 (default unicode61 tokenizer)
  • Affects all platforms where CJK session content is stored
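For anyone reproducing this, the SQLite version and FTS5 availability of the local Python build can be checked with a snippet like the following. `sqlite_fts5_info` is a hypothetical helper name. The check relies on `PRAGMA compile_options`, which can come back empty on builds that omit compile-option diagnostics, so a `False` result is not conclusive.

```python
import sqlite3

def sqlite_fts5_info():
    """Return (sqlite_version, fts5_listed_in_compile_options)."""
    conn = sqlite3.connect(":memory:")
    try:
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    # Option names are reported without the SQLITE_ prefix.
    return sqlite3.sqlite_version, "ENABLE_FTS5" in options

version, has_fts5 = sqlite_fts5_info()
print(f"SQLite {version}, FTS5 listed in compile options: {has_fts5}")
```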

Related Code

  • hermes_state.py: search_messages() (line 1164), _contains_cjk() (line 1150), _sanitize_fts5_query() (line 1096)

Labels

  • P2: Medium, degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working
