feat: CJK trigram FTS search with OR semantics#219

Merged
jalehman merged 4 commits into
Martian-Engineering:mainfrom
catgodtwno4:feat/cjk-trigram-fts-search
Apr 3, 2026

@catgodtwno4

Problem

The FTS5 unicode61 (porter) tokenizer cannot segment CJK ideographs. When a query contains Chinese, Japanese, or Korean characters, searchSummaries() falls back to a LIKE path that buildLikeSearchPlan() builds with AND logic.

The AND logic fails when the user's phrasing doesn't exactly match the summary text:

| Query | Summary contains | Result |
| --- | --- | --- |
| "端到端测试结果" | "端到端测试" | ❌ 0 candidates (third term "端到端测试结果" not verbatim) |
| "配置检查结果" | "配置" | ❌ 0 candidates |
| "流水线部署结果" | "流水线" | ❌ 0 candidates |

This affects all CJK users — Chinese, Japanese, and Korean — and is the primary reason lcm_expand_query returns zero candidates for CJK queries (even when matching summaries exist).

Western/Latin users are less affected because space-separated words align naturally with the \S+ regex tokenizer.
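
To make the difference concrete, here is a minimal sketch of whitespace term splitting combined with AND matching. `splitTerms` and `matchesAllTerms` are hypothetical helpers written for illustration; the actual buildLikeSearchPlan() is not shown in this PR description and may differ.

```typescript
// Illustrative whitespace splitter, assumed to mirror the \S+ tokenization
// described above. Not the project's real implementation.
function splitTerms(query: string): string[] {
  return query.match(/\S+/g) ?? [];
}

// AND semantics: every term must appear verbatim in the summary.
function matchesAllTerms(summary: string, query: string): boolean {
  return splitTerms(query).every((term) => summary.includes(term));
}

// Latin text splits into one term per word.
const latinTerms = splitTerms("end to end test results");
// CJK text has no spaces, so the whole phrase becomes a single term.
const cjkTerms = splitTerms("端到端测试结果");

// Each Latin word is found independently in the summary.
const latinHit = matchesAllTerms("end to end test results passed", "test results");
// The single CJK term is longer than what the summary contains, so AND fails.
const cjkHit = matchesAllTerms("端到端测试 通过", "端到端测试结果");
```

Because the CJK phrase is one indivisible term, any extra character in the query makes the verbatim check fail.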

Solution

1. Trigram FTS table (summaries_fts_cjk)

A new FTS5 virtual table with tokenize='trigram' that indexes every 3-character substring. This enables native CJK substring matching via FTS5 MATCH.
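
As a rough illustration of what a trigram index contains, the sketch below enumerates the 3-character substrings of a string. `trigramsOf` is a hypothetical helper; SQLite's tokenizer does this internally, and this code only mimics the idea.

```typescript
// Enumerate every 3-character substring, iterating by code point so that
// characters outside the Basic Multilingual Plane are not split in half.
function trigramsOf(text: string): string[] {
  const chars = Array.from(text);
  const grams: string[] = [];
  for (let i = 0; i + 3 <= chars.length; i++) {
    grams.push(chars.slice(i, i + 3).join(""));
  }
  return grams;
}

// A five-character summary fragment yields three trigrams.
const indexedGrams = trigramsOf("端到端测试");

// A two-character string yields none, so it cannot be found through
// a trigram MATCH and needs a different search path.
const shortGrams = trigramsOf("配置");
```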

2. searchCjkTrigram() — primary CJK search path

  • Splits CJK segments into overlapping 4-char chunks
  • Combines with OR semantics via FTS5 MATCH
  • Non-CJK tokens (English words, version numbers) searched in existing porter FTS table
  • Results unioned and sorted by recency
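
The chunking and OR-combination steps above could look roughly like the following. This is a sketch under assumptions: the CJK character ranges are simplified, and `overlappingChunks` and `buildCjkMatchQuery` are hypothetical names rather than the functions in src/store/summary-store.ts.

```typescript
// Simplified CJK ranges (kana, CJK ideographs incl. Extension A, Hangul
// syllables). The real detection logic may cover more blocks.
const CJK_RUN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

// Slide a window of `size` characters over the segment. Segments that are
// not longer than the window are returned unchanged.
function overlappingChunks(segment: string, size = 4): string[] {
  const chars = Array.from(segment);
  if (chars.length <= size) return [segment];
  const chunks: string[] = [];
  for (let i = 0; i + size <= chars.length; i++) {
    chunks.push(chars.slice(i, i + size).join(""));
  }
  return chunks;
}

// Build an FTS5 MATCH expression: each chunk becomes a quoted phrase
// (embedded double quotes are doubled), and phrases are joined with OR.
function buildCjkMatchQuery(query: string): string {
  const segments: string[] = query.match(CJK_RUN) ?? [];
  const unique = new Set<string>();
  segments.forEach((segment) => {
    overlappingChunks(segment).forEach((chunk) => unique.add(chunk));
  });
  return Array.from(unique)
    .map((chunk) => `"${chunk.replace(/"/g, '""')}"`)
    .join(" OR ");
}

const matchExpr = buildCjkMatchQuery("端到端测试结果");
```

With OR semantics, a summary that contains only "端到端测试" still matches, because it contains the chunks "端到端测" and "到端测试".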

3. searchLikeCjk() — fallback when trigram table unavailable

  • Splits CJK text into bigrams (2-char sliding window)
  • Uses LIKE with OR instead of AND
  • Ensures partial matches return results even without the trigram table
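
A minimal sketch of the bigram fallback follows. The column name `content`, the `LikePlan` shape, and the function names are assumptions made for illustration; they are not taken from the PR's code.

```typescript
// Hypothetical plan shape: a WHERE fragment plus its bound parameters.
interface LikePlan {
  where: string;
  params: string[];
}

// 2-character sliding window over the text, iterating by code point.
// A single character is kept as-is so it can still be searched.
function bigramsOf(text: string): string[] {
  const chars = Array.from(text);
  if (chars.length < 2) return chars.length === 1 ? [text] : [];
  const grams: string[] = [];
  for (let i = 0; i + 2 <= chars.length; i++) {
    grams.push(chars.slice(i, i + 2).join(""));
  }
  return grams;
}

// One LIKE clause per distinct bigram, joined with OR instead of AND.
// LIKE wildcards in the input are escaped with a backslash.
function buildLikeOrPlan(cjkText: string, column = "content"): LikePlan {
  const terms = Array.from(new Set(bigramsOf(cjkText)));
  const where = terms.map(() => `${column} LIKE ? ESCAPE '\\'`).join(" OR ");
  const params = terms.map(
    (term) => `%${term.replace(/[\\%_]/g, (ch) => "\\" + ch)}%`
  );
  return { where, params };
}

const likePlan = buildLikeOrPlan("配置检查结果");
```

Here a summary containing only "配置" satisfies the first clause, so the query returns it even though the remaining bigrams are absent.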

4. Auto-migration

  • Creates summaries_fts_cjk and backfills from existing summaries on first run
  • New summaries indexed on saveSummary()
  • Graceful degradation: if table doesn't exist, falls through to LIKE OR
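
The migration and degradation steps might be sketched as below. Everything except the table name summaries_fts_cjk and the trigram tokenizer is an assumption: the `MinimalDb` interface, the column names, and the source table name `summaries` are hypothetical stand-ins for the project's real schema and driver.

```typescript
// Hypothetical minimal database interface; the project's real driver API
// is not shown in the PR description.
interface MinimalDb {
  exec(sql: string): void;
  hasTable(name: string): boolean;
}

// Column names (summary_id, content) are assumptions.
const CREATE_CJK_FTS = `
  CREATE VIRTUAL TABLE IF NOT EXISTS summaries_fts_cjk
  USING fts5(summary_id UNINDEXED, content, tokenize='trigram')`;

// Backfill from an assumed source table named "summaries".
const BACKFILL_CJK_FTS = `
  INSERT INTO summaries_fts_cjk (summary_id, content)
  SELECT summary_id, content FROM summaries`;

type MigrationResult = "created" | "already-present" | "unavailable";

// Create and backfill once. If the SQLite build lacks the trigram
// tokenizer, exec() is expected to throw and the caller degrades.
// A real migration would likely wrap both statements in a transaction.
function migrateCjkFts(db: MinimalDb): MigrationResult {
  if (db.hasTable("summaries_fts_cjk")) return "already-present";
  try {
    db.exec(CREATE_CJK_FTS);
    db.exec(BACKFILL_CJK_FTS);
    return "created";
  } catch {
    return "unavailable";
  }
}

// Graceful degradation: use the trigram path only when the table exists.
function chooseCjkSearchPath(db: MinimalDb): "trigram" | "like-or" {
  return db.hasTable("summaries_fts_cjk") ? "trigram" : "like-or";
}
```

Keeping the path decision in one small function makes the fallback explicit: a database without the trigram table simply keeps using LIKE with OR.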

Testing

Tested on 4 machines with Chinese query workloads:

Before: "端到端测试结果" → 0 candidates (AND logic, exact match required)
After:  "端到端测试结果" → correct matches (trigram OR, substring matching)

Also verified:

  • Exact CJK matches still work
  • Mixed CJK + Latin queries (e.g. "lossless-claw v0.5.2 端到端测试") return correct results
  • Non-CJK queries unaffected (existing porter FTS path unchanged)
  • No regressions on lcm_grep or lcm_describe

Files changed

  • src/db/migration.ts — add summaries_fts_cjk virtual table creation
  • src/store/summary-store.ts — add searchCjkTrigram(), searchLikeCjk(), update CJK routing, index new summaries into trigram table

Related: #208 (search path for lcm_expand_query candidate resolution)

scott and others added 3 commits March 31, 2026 14:32
FTS5 unicode61 tokenizer cannot segment CJK ideographs (Chinese, Japanese,
Korean), so CJK queries fall back to a LIKE path with AND logic. When the
user's phrasing doesn't exactly match the summary text (e.g. querying
"端到端测试结果" when the summary contains "端到端测试"), ALL terms
must match and the query returns zero candidates.

This commit adds:

1. A new FTS5 trigram-tokenized virtual table (summaries_fts_cjk) that
   indexes every 3-character substring, enabling native CJK substring
   matching.

2. searchCjkTrigram() — splits CJK segments into overlapping 4-char
   chunks and combines them with OR semantics via FTS5 MATCH. Non-CJK
   tokens (English, version numbers) are searched in the existing porter
   FTS table. Results are unioned and sorted by recency.

3. searchLikeCjk() — a fallback when the trigram table is unavailable.
   Splits CJK text into bigrams (2-char sliding window) and uses LIKE
   with OR instead of AND, so partial matches return results.

4. Auto-migration: creates summaries_fts_cjk and backfills from existing
   summaries on first run. New summaries are indexed on save.

Tested on 4 machines with Chinese query workloads:
- Before: "端到端测试结果" → 0 candidates
- After:  "端到端测试结果" → correct matches via trigram OR

Fixes CJK zero-result bug affecting all Chinese/Japanese/Korean users.
Related: Martian-Engineering#208 (search path for lcm_expand_query candidate resolution)

Keep mixed CJK and Latin summary queries on full-intent matching while
preserving the new CJK-specific recall improvements. Route short CJK
segments through the LIKE fallback so one- and two-character queries do
not regress, and update fallback coverage plus a release note.

Regeneration-Prompt: |
  Address review feedback on the PR that added trigram-backed CJK summary
  search. Preserve the additive migration and the improved recall for CJK
  phrasing differences, but fix the cases where mixed-language queries were
  broadened from implicit AND to OR and where very short CJK queries could
  return no results. Keep the work localized to summary search behavior,
  add regression tests for mixed CJK plus Latin queries and single-character
  CJK queries, and include a changeset because this is user-facing search
  behavior.
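
One possible reading of the routing described in this follow-up commit is sketched below. The route names, the CJK ranges, and the rule that any CJK segment shorter than three characters triggers the LIKE fallback are all assumptions; the commit message does not spell out the exact conditions.

```typescript
// Simplified CJK ranges (kana, CJK ideographs incl. Extension A, Hangul
// syllables); an assumption, not the project's actual detection logic.
const CJK_SEGMENT = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

// Hypothetical route labels for illustration.
type SummarySearchRoute =
  | "porter-fts" // no CJK: existing porter FTS path, unchanged
  | "mixed-and" // CJK plus Latin: keep implicit AND ("full-intent") matching
  | "like-fallback" // one- or two-character CJK segments
  | "trigram-or"; // CJK-only queries long enough for the trigram index

function routeSummaryQuery(query: string): SummarySearchRoute {
  const cjkSegments: string[] = query.match(CJK_SEGMENT) ?? [];
  if (cjkSegments.length === 0) return "porter-fts";

  // Anything alphanumeric left after removing CJK runs counts as Latin.
  const hasLatin = /[A-Za-z0-9]/.test(query.replace(CJK_SEGMENT, " "));
  if (hasLatin) return "mixed-and";

  // The trigram index cannot match fewer than three characters.
  const hasShortSegment = cjkSegments.some(
    (segment) => Array.from(segment).length < 3
  );
  return hasShortSegment ? "like-fallback" : "trigram-or";
}
```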
@jalehman jalehman merged commit 69e5f6a into Martian-Engineering:main Apr 3, 2026
1 check passed
@github-actions github-actions Bot mentioned this pull request Apr 3, 2026