feat: CJK trigram FTS search with OR semantics #219
Merged
jalehman merged 4 commits on Apr 3, 2026
The FTS5 unicode61 tokenizer cannot segment CJK ideographs (Chinese, Japanese, Korean), so CJK queries fall back to a LIKE path with AND logic. When the user's phrasing doesn't exactly match the summary text (e.g. querying "端到端测试结果" when the summary contains "端到端测试"), ALL terms must match and the query returns zero candidates.

This commit adds:

1. A new FTS5 trigram-tokenized virtual table (summaries_fts_cjk) that indexes every 3-character substring, enabling native CJK substring matching.
2. searchCjkTrigram(): splits CJK segments into overlapping 4-char chunks and combines them with OR semantics via FTS5 MATCH. Non-CJK tokens (English, version numbers) are searched in the existing porter FTS table. Results are unioned and sorted by recency.
3. searchLikeCjk(): a fallback when the trigram table is unavailable. Splits CJK text into bigrams (2-char sliding window) and uses LIKE with OR instead of AND, so partial matches return results.
4. Auto-migration: creates summaries_fts_cjk and backfills from existing summaries on first run. New summaries are indexed on save.

Tested on 4 machines with Chinese query workloads:
- Before: "端到端测试结果" → 0 candidates
- After: "端到端测试结果" → correct matches via trigram OR

Fixes CJK zero-result bug affecting all Chinese/Japanese/Korean users.

Related: Martian-Engineering#208 (search path for lcm_expand_query candidate resolution)
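The chunk-and-OR construction in item 2 can be illustrated with a minimal sketch. The helper names `cjkChunks` and `buildTrigramMatch` and the character ranges in `CJK_RUN` are assumptions made for illustration; only the 4-character chunk size, the OR combination, and the use of FTS5 MATCH come from the commit message, and the real `searchCjkTrigram()` may differ.

```typescript
// Illustrative CJK ranges (kana, CJK ideographs, Hangul syllables); an assumption, not the PR's regex.
const CJK_RUN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

// Split one CJK segment into overlapping chunks of `size` characters.
// Segments no longer than `size` are returned whole.
function cjkChunks(segment: string, size: number = 4): string[] {
  const chars = Array.from(segment);
  if (chars.length <= size) return [segment];
  const chunks: string[] = [];
  for (let i = 0; i + size <= chars.length; i++) {
    chunks.push(chars.slice(i, i + size).join(""));
  }
  return chunks;
}

// Build an FTS5 MATCH expression that ORs every distinct chunk as a quoted string.
// Note: a trigram index cannot match strings shorter than 3 characters,
// so very short segments need a different path.
function buildTrigramMatch(query: string): string {
  const segments: string[] = query.match(CJK_RUN) ?? [];
  const seen: string[] = [];
  for (const segment of segments) {
    for (const chunk of cjkChunks(segment)) {
      if (seen.indexOf(chunk) === -1) seen.push(chunk);
    }
  }
  return seen.map((c) => `"${c.replace(/"/g, '""')}"`).join(" OR ");
}

// buildTrigramMatch("端到端测试结果")
// → "端到端测" OR "到端测试" OR "端测试结" OR "测试结果"
```

Because the first two chunks are substrings of "端到端测试", a summary containing only that shorter phrase would still satisfy the OR expression when it is bound to a `summaries_fts_cjk MATCH ?` clause.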
Keep mixed CJK and Latin summary queries on full-intent matching while preserving the new CJK-specific recall improvements. Route short CJK segments through the LIKE fallback so one- and two-character queries do not regress, and update fallback coverage plus a release note.

Regeneration-Prompt: |
  Address review feedback on the PR that added trigram-backed CJK summary search. Preserve the additive migration and the improved recall for CJK phrasing differences, but fix the cases where mixed-language queries were broadened from implicit AND to OR and where very short CJK queries could return no results. Keep the work localized to summary search behavior, add regression tests for mixed CJK plus Latin queries and single-character CJK queries, and include a changeset because this is user-facing search behavior.
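One way to picture the routing this follow-up commit describes is the sketch below. The `Route` labels, the `routeQuery` name, the character ranges, and the exact precedence of the checks are all hypothetical; the commit message only states that mixed CJK and Latin queries keep full-intent (implicit AND) matching and that one- and two-character CJK segments go through the LIKE fallback.

```typescript
// Illustrative CJK ranges; an assumption, not the PR's regex.
const CJK_RUN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]+/g;

// Hypothetical route labels for the summary search paths.
type Route = "default-fts" | "full-intent" | "like-fallback" | "trigram-or";

function routeQuery(query: string): Route {
  const cjkSegments: string[] = query.match(CJK_RUN) ?? [];
  // No CJK at all: the existing (porter) FTS path is unchanged.
  if (cjkSegments.length === 0) return "default-fts";
  // Mixed CJK + Latin: keep implicit-AND matching so results are not broadened.
  if (/[A-Za-z0-9]/.test(query)) return "full-intent";
  // A trigram index cannot match fewer than 3 characters, so short segments use LIKE.
  if (cjkSegments.some((s) => Array.from(s).length < 3)) return "like-fallback";
  // Pure CJK with segments of 3+ characters: trigram MATCH with OR semantics.
  return "trigram-or";
}
```

Under these assumptions, `routeQuery("测")` takes the LIKE fallback, `routeQuery("端到端测试结果")` takes the trigram path, and `routeQuery("lossless-claw v0.5.2 端到端测试")` stays on full-intent matching.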
Problem

The FTS5 `unicode61` (porter) tokenizer cannot segment CJK ideographs. When a query contains Chinese/Japanese/Korean characters, `searchSummaries()` falls back to a LIKE path with AND logic via `buildLikeSearchPlan()`.

The AND logic fails when the user's phrasing doesn't exactly match the summary text:

- Query "端到端测试结果" (end-to-end test results), summary contains "端到端测试" (end-to-end test): "端到端测试结果" does not appear verbatim
- Query "配置检查结果" (configuration check results), summary contains "配置" (configuration)
- Query "流水线部署结果" (pipeline deployment results), summary contains "流水线" (pipeline)

This affects all CJK users (Chinese, Japanese, and Korean) and is the primary reason `lcm_expand_query` returns zero candidates for CJK queries, even when matching summaries exist.

Western/Latin users are less affected because space-separated words align naturally with the `\S+` regex tokenizer.

Solution
1. Trigram FTS table (`summaries_fts_cjk`)

A new FTS5 virtual table with `tokenize='trigram'` that indexes every 3-character substring. This enables native CJK substring matching via FTS5 MATCH.

2. `searchCjkTrigram()`: primary CJK search path

3. `searchLikeCjk()`: fallback when trigram table unavailable

4. Auto-migration

- Creates `summaries_fts_cjk` and backfills from existing summaries on first run
- New summaries are indexed on save (`saveSummary()`)

Testing
Tested on 4 machines with Chinese query workloads.

Also verified:

- Queries mixing CJK and Latin text (e.g. "lossless-claw v0.5.2 端到端测试") return correct results
- […] `lcm_grep` or `lcm_describe`

Files changed

- `src/db/migration.ts`: add `summaries_fts_cjk` virtual table creation
- `src/store/summary-store.ts`: add `searchCjkTrigram()`, `searchLikeCjk()`, update CJK routing, index new summaries into trigram table

Related: #208 (search path for `lcm_expand_query` candidate resolution)
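For readers unfamiliar with the two remaining pieces of the Solution, here is a rough sketch of what the trigram table migration and the bigram LIKE fallback could look like. The column names (`summary_id`, `content`), the source table name `summaries`, and the helpers `cjkBigrams` and `buildLikeOr` are hypothetical; the PR only names the `summaries_fts_cjk` table, the `tokenize='trigram'` option, and the 2-character sliding window combined with OR. The trigram tokenizer requires SQLite 3.34.0 or later.

```typescript
// Hypothetical schema: `summary_id`, `content`, and the `summaries` table are assumptions.
const CREATE_CJK_FTS = `
  CREATE VIRTUAL TABLE IF NOT EXISTS summaries_fts_cjk
  USING fts5(summary_id UNINDEXED, content, tokenize='trigram')`;

// Run only when the table was just created; repeating it would duplicate rows.
const BACKFILL_CJK_FTS = `
  INSERT INTO summaries_fts_cjk (summary_id, content)
  SELECT summary_id, content FROM summaries`;

// Split a CJK segment into distinct bigrams using a 2-character sliding window.
// A single character is returned as-is so one-character queries still produce a term.
function cjkBigrams(segment: string): string[] {
  const chars = Array.from(segment);
  if (chars.length < 2) return chars.length === 1 ? [segment] : [];
  const out: string[] = [];
  for (let i = 0; i + 2 <= chars.length; i++) {
    const bigram = chars.slice(i, i + 2).join("");
    if (out.indexOf(bigram) === -1) out.push(bigram);
  }
  return out;
}

// Build a parameterized "col LIKE ? OR col LIKE ? ..." clause.
// LIKE wildcards in the terms are escaped; callers must handle an empty term list.
function buildLikeOr(column: string, terms: string[]): { sql: string; params: string[] } {
  const escapeLike = (t: string): string => t.replace(/[\\%_]/g, (m) => "\\" + m);
  return {
    sql: terms.map(() => `${column} LIKE ? ESCAPE '\\'`).join(" OR "),
    params: terms.map((t) => `%${escapeLike(t)}%`),
  };
}
```

With OR in place of AND, a query such as "配置检查结果" yields the bigram "配置" among its terms, so a summary that contains only "配置" is still returned as a candidate.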