feat: CJK trigram FTS search with OR semantics by catgodtwno4 · Pull Request #219 · Martian-Engineering/lossless-claw

catgodtwno4 · 2026-03-31T06:37:02Z

Problem

The FTS5 unicode61 (porter) tokenizer cannot segment CJK ideographs. When a query contains Chinese, Japanese, or Korean characters, searchSummaries() falls back to a LIKE path that combines terms with AND logic via buildLikeSearchPlan().

The AND logic fails when the user's phrasing doesn't exactly match the summary text:

| Query | Summary contains | Result |
| --- | --- | --- |
| `"端到端测试结果"` | `"端到端测试"` | ❌ 0 candidates (third term `"端到端测试结果"` not verbatim) |
| `"配置检查结果"` | `"配置"` | ❌ 0 candidates |
| `"流水线部署结果"` | `"流水线"` | ❌ 0 candidates |

This affects all CJK users — Chinese, Japanese, and Korean — and is the primary reason lcm_expand_query returns zero candidates for CJK queries (even when matching summaries exist).

Western/Latin users are less affected because space-separated words align naturally with the \S+ regex tokenizer.
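
The failure mode can be reproduced with a simplified model of the LIKE path. The sketch below is an illustration only: `likeAndMatches` is a hypothetical stand-in for the behavior described above (whitespace tokenization plus AND), not the project's `buildLikeSearchPlan()`, and it uses a case-sensitive substring check where SQL `LIKE` would be case-insensitive for ASCII.

```typescript
// Hypothetical model of the LIKE fallback: split on whitespace, then require
// every term to appear verbatim in the summary (AND logic).
function likeAndMatches(query: string, summary: string): boolean {
  const terms = query.match(/\S+/g) || [];
  return (
    terms.length > 0 &&
    terms.every((term) => summary.indexOf(term) !== -1)
  );
}

// A Latin query splits into short words, and each word can match on its own.
const latinHit = likeAndMatches(
  "test results",
  "end-to-end test results for v0.5.2"
);

// A CJK query has no spaces, so the whole phrase becomes a single term that
// must appear verbatim in the summary.
const cjkHit = likeAndMatches("端到端测试结果", "端到端测试已通过");

console.log(latinHit, cjkHit); // true false
```

Under this model, any extra word appended to a CJK phrase is enough to drop the candidate count to zero.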

Solution

1. Trigram FTS table (`summaries_fts_cjk`)

A new FTS5 virtual table with tokenize='trigram' that indexes every 3-character substring. This enables native CJK substring matching via FTS5 MATCH.
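
To make "every 3-character substring" concrete, here is a small conceptual sketch. The `trigrams` helper is hypothetical and only mimics what a trigram index stores; the real tokenization happens inside SQLite. It counts UTF-16 units, which is adequate for CJK characters in the Basic Multilingual Plane.

```typescript
// Conceptual illustration: list the overlapping 3-character windows of a
// string, roughly what a trigram tokenizer would index for it.
function trigrams(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i + 3 <= text.length; i++) {
    out.push(text.slice(i, i + 3));
  }
  return out;
}

const indexed = trigrams("端到端测试");
console.log(indexed); // ["端到端", "到端测", "端测试"]
```

A text shorter than three characters yields no trigrams at all, so very short queries cannot be answered by a trigram MATCH alone.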

2. `searchCjkTrigram()` — primary CJK search path

- Splits CJK segments into overlapping 4-char chunks
- Combines with OR semantics via FTS5 MATCH
- Non-CJK tokens (English words, version numbers) searched in existing porter FTS table
- Results unioned and sorted by recency
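
The chunking and OR-combining steps above can be sketched roughly as follows. This is not the code from the PR: the function names, the CJK character ranges, and the handling of runs shorter than four characters are all assumptions made for illustration. Only the FTS5 query syntax (double-quoted strings joined with `OR`) is standard.

```typescript
// Assumed definition of "CJK": BMP ranges for kana, Han ideographs, and
// Hangul syllables. The PR may detect CJK text differently.
const CJK_RUN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

// Overlapping windows of `size` characters; a shorter run is kept whole.
function slidingChunks(run: string, size: number): string[] {
  if (run.length <= size) return [run];
  const chunks: string[] = [];
  for (let i = 0; i + size <= run.length; i++) {
    chunks.push(run.slice(i, i + size));
  }
  return chunks;
}

// Build an FTS5 MATCH expression: each chunk becomes a quoted string and the
// chunks are joined with OR, so any single chunk is enough for a hit.
function buildCjkMatchQuery(query: string): string {
  const runs = query.match(CJK_RUN) || [];
  const chunks: string[] = [];
  for (const run of runs) {
    for (const chunk of slidingChunks(run, 4)) {
      if (chunks.indexOf(chunk) === -1) chunks.push(chunk); // de-duplicate
    }
  }
  return chunks.map((c) => '"' + c + '"').join(" OR ");
}

const matchExpr = buildCjkMatchQuery("端到端测试结果");
console.log(matchExpr);
// "端到端测" OR "到端测试" OR "端测试结" OR "测试结果"
```

With this expression, a summary containing only "端到端测试" still matches, because it contains the first two chunks.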

3. `searchLikeCjk()` — fallback when trigram table unavailable

- Splits CJK text into bigrams (2-char sliding window)
- Uses LIKE with OR instead of AND
- Ensures partial matches return results even without the trigram table
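
A minimal sketch of a bigram LIKE plan with OR semantics, under stated assumptions: `LikePlan`, `escapeLike`, `buildCjkLikeOrPlan`, and the column name passed in are hypothetical and not taken from the PR. The sketch produces a parameterized WHERE fragment rather than executing SQL.

```typescript
interface LikePlan {
  where: string; // SQL fragment with ? placeholders
  params: string[]; // one LIKE pattern per placeholder
}

// Escape LIKE wildcards so user text is matched literally.
function escapeLike(term: string): string {
  return term.replace(/[\\%_]/g, "\\$&");
}

function buildCjkLikeOrPlan(cjkText: string, column: string): LikePlan {
  if (cjkText.length === 0) return { where: "0", params: [] }; // match nothing
  const bigrams: string[] = [];
  for (let i = 0; i + 2 <= cjkText.length; i++) {
    const bigram = cjkText.slice(i, i + 2);
    if (bigrams.indexOf(bigram) === -1) bigrams.push(bigram);
  }
  // A single character has no bigram, so search for it directly.
  const terms = bigrams.length > 0 ? bigrams : [cjkText];
  return {
    where: terms.map(() => column + " LIKE ? ESCAPE '\\'").join(" OR "),
    params: terms.map((t) => "%" + escapeLike(t) + "%"),
  };
}

const likePlan = buildCjkLikeOrPlan("配置检查", "content");
console.log(likePlan.params); // ["%配置%", "%置检%", "%检查%"]
```

Because the patterns are joined with OR, a summary that contains only "配置" still qualifies as a candidate.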

4. Auto-migration

- Creates summaries_fts_cjk and backfills from existing summaries on first run
- New summaries indexed on saveSummary()
- Graceful degradation: if table doesn't exist, falls through to LIKE OR
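
The create-then-backfill step might look roughly like the sketch below. The `SqlExecutor` interface, the `migrateCjkFts` function, the column names (`summary_id`, `content`), and the source table name (`summaries`) are assumptions for illustration; the actual schema lives in src/db/migration.ts. The `tokenize='trigram'` option itself is standard FTS5 and requires SQLite 3.34.0 or later.

```typescript
// Minimal executor interface so the sketch does not depend on a specific
// SQLite binding (hypothetical; the project has its own database layer).
interface SqlExecutor {
  exec(sql: string): void;
}

// Returns true when the migration ran, false when the table already existed.
// Table and column names other than summaries_fts_cjk are assumed.
function migrateCjkFts(
  db: SqlExecutor,
  tableExists: (name: string) => boolean
): boolean {
  if (tableExists("summaries_fts_cjk")) return false; // already migrated

  // Trigram-tokenized FTS5 table; summary_id is stored but not indexed.
  db.exec(
    "CREATE VIRTUAL TABLE summaries_fts_cjk USING fts5(" +
      "summary_id UNINDEXED, content, tokenize='trigram')"
  );

  // One-time backfill from rows that existed before the migration.
  db.exec(
    "INSERT INTO summaries_fts_cjk (summary_id, content) " +
      "SELECT summary_id, content FROM summaries"
  );
  return true;
}
```

Guarding on table existence keeps the migration additive and safe to run on every startup.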

Testing

Tested on 4 machines with Chinese query workloads:

Before: "端到端测试结果" → 0 candidates (AND logic, exact match required)
After:  "端到端测试结果" → correct matches (trigram OR, substring matching)

Also verified:

- Exact CJK matches still work
- Mixed CJK + Latin queries (e.g. "lossless-claw v0.5.2 端到端测试") return correct results
- Non-CJK queries unaffected (existing porter FTS path unchanged)
- No regressions on lcm_grep or lcm_describe

Files changed

- `src/db/migration.ts`: add summaries_fts_cjk virtual table creation
- `src/store/summary-store.ts`: add searchCjkTrigram(), searchLikeCjk(), update CJK routing, index new summaries into trigram table

Related: #208 (search path for lcm_expand_query candidate resolution)

FTS5 unicode61 tokenizer cannot segment CJK ideographs (Chinese, Japanese, Korean), so CJK queries fall back to a LIKE path with AND logic. When the user's phrasing doesn't exactly match the summary text (e.g. querying "端到端测试结果" when the summary contains "端到端测试"), ALL terms must match and the query returns zero candidates.

This commit adds:

1. A new FTS5 trigram-tokenized virtual table (summaries_fts_cjk) that indexes every 3-character substring, enabling native CJK substring matching.
2. searchCjkTrigram(): splits CJK segments into overlapping 4-char chunks and combines them with OR semantics via FTS5 MATCH. Non-CJK tokens (English, version numbers) are searched in the existing porter FTS table. Results are unioned and sorted by recency.
3. searchLikeCjk(): a fallback when the trigram table is unavailable. Splits CJK text into bigrams (2-char sliding window) and uses LIKE with OR instead of AND, so partial matches return results.
4. Auto-migration: creates summaries_fts_cjk and backfills from existing summaries on first run. New summaries are indexed on save.

Tested on 4 machines with Chinese query workloads:

- Before: "端到端测试结果" → 0 candidates
- After: "端到端测试结果" → correct matches via trigram OR

Fixes CJK zero-result bug affecting all Chinese/Japanese/Korean users.

Related: Martian-Engineering#208 (search path for lcm_expand_query candidate resolution)

Keep mixed CJK and Latin summary queries on full-intent matching while preserving the new CJK-specific recall improvements. Route short CJK segments through the LIKE fallback so one- and two-character queries do not regress, and update fallback coverage plus a release note.

Regeneration-Prompt: |
  Address review feedback on the PR that added trigram-backed CJK summary search. Preserve the additive migration and the improved recall for CJK phrasing differences, but fix the cases where mixed-language queries were broadened from implicit AND to OR and where very short CJK queries could return no results. Keep the work localized to summary search behavior, add regression tests for mixed CJK plus Latin queries and single-character CJK queries, and include a changeset because this is user-facing search behavior.
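
One way to read the routing described in this commit message is sketched below. It is an interpretation, not the merged code: the path names, the `chooseSearchPath` function, the CJK character ranges, and the three-character threshold (taken from the trigram width) are all assumptions.

```typescript
// Hypothetical labels for the search strategies discussed in this PR.
type SearchPath = "porter-fts" | "cjk-trigram" | "cjk-like" | "mixed-and";

// Assumed definition of "CJK": BMP kana, Han ideographs, Hangul syllables.
const CJK_CHARS = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

function chooseSearchPath(query: string): SearchPath {
  const cjkRuns = query.match(CJK_CHARS) || [];
  if (cjkRuns.length === 0) return "porter-fts"; // no CJK: existing path

  // Anything left after removing CJK runs means the query is mixed-language,
  // which keeps full-intent (AND) matching rather than being broadened to OR.
  const hasNonCjk = query.replace(CJK_CHARS, " ").trim().length > 0;
  if (hasNonCjk) return "mixed-and";

  // A trigram index cannot answer runs shorter than three characters, so
  // one- and two-character segments go through the LIKE fallback.
  const hasShortRun = cjkRuns.some((run) => run.length < 3);
  return hasShortRun ? "cjk-like" : "cjk-trigram";
}

console.log(chooseSearchPath("端到端测试结果")); // "cjk-trigram"
console.log(chooseSearchPath("配置")); // "cjk-like"
console.log(chooseSearchPath("lossless-claw v0.5.2 端到端测试")); // "mixed-and"
```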

scott and others added 3 commits March 31, 2026 14:32

docs: add Chinese README (README_zh.md)

176aae2

docs: update related repository links (new naming)

0a1f7e7

smallmj mentioned this pull request Apr 3, 2026

[Bug] CJK token estimation uses length/4 causing severe underestimation #250

Closed

jalehman merged commit 69e5f6a into Martian-Engineering:main Apr 3, 2026
1 check passed

github-actions Bot mentioned this pull request Apr 3, 2026

chore: version packages #231

Merged

jalehman merged 4 commits into Martian-Engineering:main from catgodtwno4:feat/cjk-trigram-fts-search


3 participants
