fix(session-search): LIKE fallback for CJK queries (salvages #11516, #11517, #11541) by teknium1 · Pull Request #12075 · NousResearch/hermes-agent

teknium1 · 2026-04-18T08:56:52Z

Summary

session_search now finds Chinese, Japanese, and Korean content instead of returning [].

Root cause: SQLite FTS5's default tokenizer (unicode61) treats a contiguous CJK run as a single token, so search_messages("记忆断裂") against a message like "…的聊天记录记忆断裂问题…" runs MATCH '记忆断裂' against the indexed token '的聊天记录记忆断裂问题' and returns zero — despite the substring being right there. This affects every CJK user.
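The behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming an SQLite build with FTS5 enabled; the table name `messages_fts` and its single `content` column are hypothetical and not taken from hermes_state.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 table with the default unicode61 tokenizer (hypothetical schema).
conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(content)")
conn.execute(
    "INSERT INTO messages_fts (content) VALUES (?)",
    ("的聊天记录记忆断裂问题",),
)


def count(sql: str, param: str) -> int:
    """Run a single-parameter COUNT query and return the integer result."""
    return conn.execute(sql, (param,)).fetchone()[0]


# MATCH compares whole tokens, and the CJK run was indexed as one token.
fts_hits = count(
    "SELECT count(*) FROM messages_fts WHERE messages_fts MATCH ?", "记忆断裂"
)
# LIKE scans the raw text, so the substring is found.
like_hits = count(
    "SELECT count(*) FROM messages_fts WHERE content LIKE '%' || ? || '%'",
    "记忆断裂",
)

print(fts_hits, like_hits)
```

The MATCH query only succeeds when the query equals the whole indexed token, which is why substring searches inside a CJK run come back empty.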

Fix: when FTS5 returns no results and the query contains any CJK character, retry with WHERE content LIKE '%query%' preserving all filters. English queries are untouched and keep the FTS5 fast path.
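The fallback logic can be sketched as follows. This is an illustration under stated assumptions, not the code from hermes_state.py: the function names `contains_cjk` and `search`, the `messages_fts` schema, and the exact Unicode ranges are all hypothetical, and filters are omitted for brevity.

```python
import sqlite3

# Assumed ranges for illustration; the real helper may cover more blocks.
_CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables (full block)
)


def contains_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in _CJK_RANGES)


def search(conn: sqlite3.Connection, query: str, limit: int = 20) -> list:
    """Try FTS5 first; retry with LIKE only for CJK queries with no hits."""
    rows = conn.execute(
        "SELECT rowid, content FROM messages_fts "
        "WHERE messages_fts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()
    if rows or not contains_cjk(query):
        return rows  # fast path: FTS5 result stands, even if empty
    return conn.execute(
        "SELECT rowid, content FROM messages_fts "
        "WHERE content LIKE '%' || ? || '%' LIMIT ?",
        (query, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO messages_fts (content) VALUES (?)",
    [
        ("的聊天记录记忆断裂问题",),
        ("안녕하세요 반갑습니다",),
        ("restart the docker container",),
    ],
)

chinese = search(conn, "记忆断裂")   # found via the LIKE fallback
korean = search(conn, "안녕")        # found via the LIKE fallback
english = search(conn, "docker")     # found via FTS5 directly
missing = search(conn, "kubernetes") # no CJK, so no fallback is attempted
```

Because the fallback only runs after an empty FTS5 result, queries that FTS5 can already answer never pay for the full-table LIKE scan.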

Salvages the substantive work from three duplicate PRs (#11516, #11517, #11541) — submitted within 35 minutes of each other, all against #11511. Picks #11516's cleaner structure as the base (commit authored by @vominh1919), adds @iamagenius00's centered-snippet idea from #11517, adds regression coverage that also guards against two bugs observed in #11541 (SQL filter clauses landing after LIMIT/OFFSET, truncated Hangul range).

Changes

- hermes_state.py: _contains_cjk() helper + LIKE fallback in search_messages() preserving source_filter, exclude_sources, role_filter. Snippet is substr(content, max(1, instr(content, ?) - 40), 120), centered on the match.
- tests/test_hermes_state.py: new TestCJKSearchFallback class with 12 tests covering CJK detection ranges, Chinese/Japanese/Korean queries, filter preservation, centered snippets, English fast-path, and the no-match case.
- scripts/release.py: add iamagenius00 to AUTHOR_MAP.
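To make the snippet expression and the filter placement concrete, here is a small sketch against a plain table. The `messages(source, role, content)` schema and the `like_search` function are hypothetical; only the `substr(...)` expression and the filter names come from the change list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (source TEXT, role TEXT, content TEXT)")
filler = "x" * 80  # 80 + 4 + 80 = 164 characters of content
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("telegram", "user", filler + "记忆断裂" + filler),
        ("cli", "user", "记忆断裂 at the start"),
    ],
)


def like_search(conn, query, source_filter=None, role_filter=None,
                limit=20, offset=0):
    """LIKE search with optional filters and a snippet centered on the match."""
    where = ["content LIKE '%' || ? || '%'"]
    # First parameter feeds instr() in the SELECT list, second feeds LIKE.
    params = [query, query]
    if source_filter:
        where.append("source IN (%s)" % ",".join("?" * len(source_filter)))
        params.extend(source_filter)
    if role_filter:
        where.append("role IN (%s)" % ",".join("?" * len(role_filter)))
        params.extend(role_filter)
    # All filter clauses are joined into WHERE before LIMIT/OFFSET is appended.
    sql = (
        "SELECT source, "
        "substr(content, max(1, instr(content, ?) - 40), 120) AS snippet "
        "FROM messages WHERE " + " AND ".join(where) + " LIMIT ? OFFSET ?"
    )
    params.extend([limit, offset])
    return conn.execute(sql, params).fetchall()


filtered = like_search(conn, "记忆断裂", source_filter=["telegram"])
unfiltered = like_search(conn, "记忆断裂")
snippet = filtered[0][1]
```

The `max(1, ...)` guard keeps the window start valid when the match sits within the first 40 characters, as in the second row.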

Validation

| Case | Before | After |
|---|---|---|
| `search_messages("记忆断裂")` on data containing it | 0 results | finds it |
| `search_messages("안녕")` on Korean content | 0 results | finds it |
| `search_messages("docker")` (English fast-path) | works | works (unchanged) |
| `tests/test_hermes_state.py` | 137 pass | 149 pass (12 new) |
| `tests/tools/test_session_search.py` | 32 pass | 32 pass |

E2E verified with real SessionDB + real SQLite against the exact Twitter-thread query ("和其他Agent的聊天记录") — finds it. Filter preservation verified with source_filter=["telegram"] on CJK query. Centered snippet verified — 164-char content returns a 120-char snippet with the matched term in the middle.

Credits

- @vominh1919: base implementation (PR #11516, "fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries"; first submitted, authorship preserved in commit c7a1b37)
- @iamagenius00: centered-snippet idea from PR #11517, "fix: add LIKE fallback for CJK queries in session_search" (co-authored the follow-up)
- @gongli0929: submitted a third parallel fix (PR #11541, "fix: FTS5 LIKE fallback for CJK queries")
- @viviennn on X/Twitter: original bug report and write-up

Closes #11511. Supersedes #11516, #11517, #11541.

Follow-up (not in this PR)

A proper long-term fix is to switch the FTS5 virtual table to the trigram tokenizer (SQLite 3.34+), which handles CJK substring matching natively without needing LIKE. That requires a schema migration (DROP + CREATE + reindex) and a minimum-SQLite check — worth its own PR.
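For reference, a rough sketch of what the trigram approach looks like, including the minimum-version check. The table name `messages_tri` and the helper `trigram_match_count` are hypothetical, and this does not attempt the schema migration itself.

```python
import sqlite3


def trigram_match_count(query: str) -> int:
    """Count MATCH hits in a one-row FTS5 table built with the trigram tokenizer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE messages_tri USING fts5(content, tokenize='trigram')"
    )
    conn.execute(
        "INSERT INTO messages_tri (content) VALUES (?)",
        ("的聊天记录记忆断裂问题",),
    )
    return conn.execute(
        "SELECT count(*) FROM messages_tri WHERE messages_tri MATCH ?",
        (query,),
    ).fetchone()[0]


# Minimum-SQLite check: the trigram tokenizer needs SQLite 3.34 or newer.
TRIGRAM_SUPPORTED = sqlite3.sqlite_version_info >= (3, 34, 0)

if TRIGRAM_SUPPORTED:
    print(trigram_match_count("记忆断裂"))
else:
    print("trigram tokenizer unavailable; keep the LIKE fallback")
```

One caveat worth checking before migrating: the SQLite documentation states that full-text queries shorter than three characters match nothing under the trigram tokenizer, so two-character queries such as "안녕" would likely still need the LIKE path.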

FTS5's default tokenizer (unicode61) indexes a contiguous run of CJK characters as a single token, so multi-character queries like '记忆断裂' that appear inside a longer run return 0 results. This fix adds a LIKE fallback: when FTS5 returns no results and the query contains CJK characters, retry with WHERE content LIKE '%query%'. Preserves FTS5 performance for English queries. Fixes #11511

@iamagenius00

Twelve tests under TestCJKSearchFallback guarding:
- CJK detection across Chinese/Japanese/Korean/Hiragana/Katakana ranges (including the full Hangul syllables block \uac00-\ud7af, to catch the shorter-range typo from one of the duplicate PRs)
- Substring match for multi-char Chinese, Japanese, Korean queries
- Filter preservation (source_filter, exclude_sources, role_filter) in the LIKE path; guards against the SQL-builder bug from another duplicate PR where filter clauses landed after LIMIT/OFFSET
- Snippet centered on the matched term (instr-based substr window), not the leading 200 chars of content
- English fast-path untouched
- Empty/no-match cases
- Mixed CJK+English queries

Also:
- hermes_state.py: LIKE-fallback snippet is now `substr(content, max(1, instr(content, ?) - 40), 120)`, centered on the match instead of the whole-content default. Credit goes to @iamagenius00 for the snippet idea in PR #11517.
- scripts/release.py: add @iamagenius00 to AUTHOR_MAP so future release attribution resolves cleanly.

Refs #11511, #11516, #11517, #11541.

Co-authored-by: iamagenius00 <iamagenius00@users.noreply.github.com>

vominh1919 and others added 2 commits April 18, 2026 01:52

teknium1 merged commit 3b69b2f into main Apr 18, 2026
4 of 5 checks passed

teknium1 deleted the hermes/hermes-e16ba93d branch April 18, 2026 08:58

This was referenced Apr 18, 2026

fix: FTS5 LIKE fallback for CJK (Chinese/Japanese/Korean) queries #11516

Closed

fix: add LIKE fallback for CJK queries in session_search #11517

Closed

session_search: FTS5 returns empty results for Chinese/CJK queries #11511

Closed

tznthou mentioned this pull request Apr 19, 2026

FTS5 queries return 0 results for CJK (Chinese/Japanese/Korean) keywords — all recall_query paths effectively broken for non-English users tznthou/ccRecall#10

Closed

6 tasks

alt-glitch mentioned this pull request May 27, 2026

Bug: session_search FTS5 不支持中文搜索 #33069

Open

teknium1 merged 2 commits into main from hermes/hermes-e16ba93d
