
fix: CJK word count in query expansion#98

Closed
YIING99 wants to merge 1 commit into garrytan:master from YIING99:fix/cjk-word-count

Conversation

@YIING99 (Contributor) commented Apr 13, 2026

Summary

  • expandQuery() skips queries with fewer than 3 words, but counts words by splitting on whitespace
  • CJK text (Chinese/Japanese/Korean) is not space-delimited — a query like "向量搜索优化" is counted as 1 "word" and silently skipped
  • This means all short CJK queries never get expanded, degrading search quality for CJK users

Fix

Detect CJK characters (Unicode ranges for Han, Hiragana, Katakana, Hangul) and count non-whitespace characters instead of space-separated tokens when CJK is present.

// Before: always space-based
const wordCount = (query.match(/\S+/g) || []).length;

// After: CJK-aware
const hasCJK = /[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]/.test(query);
const wordCount = hasCJK ? query.replace(/\s/g, '').length : (query.match(/\S+/g) || []).length;

Impact

  • 3 lines changed, 1 file (src/core/search/expansion.ts)
  • Zero behavior change for non-CJK queries
  • CJK queries that were silently skipped now get properly expanded

Test plan

  • Verify expandQuery("向量搜索") no longer returns early (was counted as 1 word, now 4 chars)
  • Verify expandQuery("hello world") still returns early (2 words < 3)
  • Verify expandQuery("machine learning models") still works (3 words ≥ 3)
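The patched counting logic and the test plan above can be combined into a small runnable sketch. Note this is a hypothetical standalone extraction for illustration — in the real src/core/search/expansion.ts this logic sits inside expandQuery(), and MIN_WORDS is assumed from the PR description:

```typescript
// CJK_RE covers Han, Hiragana, Katakana, and Hangul Syllables —
// the same ranges as the regex in the diff above.
const CJK_RE = /[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]/;

// Hypothetical standalone version of the patched word-count logic.
function countWords(query: string): number {
  // For CJK text, count non-whitespace characters; otherwise count
  // space-separated tokens, exactly as before.
  return CJK_RE.test(query)
    ? query.replace(/\s/g, "").length
    : (query.match(/\S+/g) || []).length;
}

const MIN_WORDS = 3; // expansion threshold assumed from the PR summary

console.log(countWords("向量搜索"));                // 4 — no longer skipped
console.log(countWords("hello world"));             // 2 — still below threshold
console.log(countWords("machine learning models")); // 3 — still expanded
```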

The expandQuery() function uses space-separated token count to decide
whether to expand a query (skipping queries < 3 words). CJK text
(Chinese, Japanese, Korean) is not space-delimited, so a query like
"向量搜索优化" was counted as 1 word and silently skipped.

Fix: detect CJK characters and count non-whitespace characters instead
of space-separated tokens when CJK is present.
@garrytan (Owner)

Hey @YIING99 — this is a great catch. CJK users were silently getting zero query expansion, which is a real search quality regression for non-Latin languages.

Your fix is clean and correct. We've pulled it into our search quality branch (garrytan/search-quality-boost, PR #64) with full attribution via Co-Authored-By. It shipped alongside compiled truth ranking, cosine re-scoring, and the eval harness.

Thank you for the contribution! 🙏

Closing in favor of #64 which includes this fix.

@garrytan garrytan closed this Apr 13, 2026
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
The whitespace-based countWords() treats a whole CJK paragraph as 1
"word" (no spaces between Chinese/Japanese/Korean chars), so chunkText()
never splits it. The paragraph then gets sent as one >8192-token
embedding request and fails with OpenAI's max input length error.

This is the chunker-side counterpart to PR garrytan#98 (fix: CJK word count in
query expansion), which fixed the same root cause in expandQuery.

Changes:
- countWords(): detect CJK and count non-whitespace characters instead
  of space-separated tokens. Mirrors the detection in PR garrytan#98 for
  consistent semantics.
- DELIMITERS L2/L3: add CJK full-width punctuation (。!?;:,、)
  so the chunker can find semantic split points in Chinese/Japanese.
- splitOnWhitespace(): fall back to character-slicing when CJK text
  has no whitespace (otherwise a whole paragraph collapses to 1 piece).

Impact: pure Chinese/Japanese/Korean documents now produce properly
sized chunks (≤ target chars ≈ tokens) instead of one giant chunk
that exceeds OpenAI's 8192-token embedding limit.

Tests: 3 new cases cover long Chinese paragraphs, Japanese/Korean
text, and CJK+English mixed text. All 18 chunker tests pass; no
pre-existing test regressions.
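The character-slicing fallback described in the last bullet can be sketched as follows. The function name mirrors the commit message, but this is an illustrative approximation, not the actual chunker source; maxChars stands in for whatever size limit the real chunker uses:

```typescript
// Same CJK ranges used throughout these fixes: Han, Hiragana,
// Katakana, Hangul Syllables.
const CJK_RE = /[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]/;

// Sketch of the whitespace splitter with the CJK fallback: if a
// piece exceeds the limit, has no internal whitespace, and contains
// CJK, slice it into fixed-size character chunks so a whole CJK
// paragraph can no longer collapse into one oversized piece.
function splitOnWhitespace(text: string, maxChars: number): string[] {
  const pieces = text.match(/\S+/g) || [];
  return pieces.flatMap((piece) => {
    if (piece.length <= maxChars || !CJK_RE.test(piece)) return [piece];
    const slices: string[] = [];
    for (let i = 0; i < piece.length; i += maxChars) {
      slices.push(piece.slice(i, i + maxChars));
    }
    return slices;
  });
}

const para = "向量".repeat(10); // 20 CJK chars, no whitespace
console.log(splitOnWhitespace(para, 8).map((p) => p.length)); // [8, 8, 4]
```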
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
slugifySegment's filter regex /[^a-z0-9.\s_-]/g strips every non-ASCII
character, so a pure-CJK filename (e.g., "品牌圣经.md", "销售论证文档.md")
collapses to an empty string and gets filtered out by slugifyPath's
.filter(Boolean). Both files then collapse to the parent directory slug
(e.g., "inbox") and silently overwrite each other during gbrain import,
which still reports "N imported, 0 errors".

This is the slug-side counterpart to PR garrytan#98 (query expansion) and
PR garrytan#114 (chunker) — same root cause (ASCII-only text handling) in a
third part of the codebase.

Changes:
- slugifySegment(): add CJK ranges (Han, Hiragana, Katakana, Hangul)
  to the character-class whitelist. Mirrors the CJK range constant used
  in the chunker fix for consistent semantics.
- Add .normalize('NFC') after the NFD+accent-strip step so Hangul
  syllables, which NFD decomposes into conjoining Jamo, get recomposed
  before the filter runs. Without this Korean names still collapse to
  empty because Jamo are outside the Hangul Syllables block.

Impact: pure/mixed CJK filenames now produce meaningful, non-colliding
slugs. ASCII-only behavior is unchanged.

Tests: 8 new cases cover Chinese, Japanese (Hiragana + Katakana),
Korean, mixed CJK+ASCII, CJK directories, and the collision-regression
scenario. All 46 slug tests pass. No new regressions in full suite
(the 4 PGLiteEngine failures pre-exist on master).
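The whitelist + NFC change described above can be sketched like this. This is a hypothetical reconstruction from the commit message — the real slugifySegment likely has more steps — but it shows why the NFC recomposition must run before the filter:

```typescript
// Hypothetical sketch of the described fix. The character-class
// whitelist now admits Han, Hiragana, Katakana, and Hangul Syllables
// alongside the original a-z0-9 . whitespace _ - set.
function slugifySegment(segment: string): string {
  return segment
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining accents (é -> e)
    .normalize("NFC") // recompose Hangul Jamo back into syllable blocks
    .toLowerCase()
    .replace(/[^a-z0-9.\s_\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af-]/g, "")
    .trim()
    .replace(/[\s_]+/g, "-");
}

console.log(slugifySegment("品牌圣经.md"));    // "品牌圣经.md" — no longer empty
console.log(slugifySegment("한국어 노트.md")); // "한국어-노트.md" — Hangul survives
```

Without the `.normalize("NFC")` step, the NFD pass leaves Korean text as conjoining Jamo (U+1100–U+11FF), which fall outside the Hangul Syllables range in the whitelist and get stripped.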
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
When a git repository contains files with non-ASCII names (common for
Chinese/Japanese/Korean users, or for files exported from Apple Notes
with spaces + CJK like "2026-04-14 22_38 记录.md"), `git diff
--name-status` wraps those paths in double quotes and octal-escapes
each byte:

    A   "inbox/2026-04-14 22_38 \350\256\260\345\275\225.md"

buildSyncManifest then treats that literal quoted-escaped string as
the path, downstream filesystem lookups fail, and the file is
silently dropped from the sync manifest. The user sees "added: 0"
in the sync result even though git has those files committed, and
`gbrain search` can't find the content. The cron log shows success
because nothing technically errored.

This is the sync-layer counterpart to the same CJK root cause class
fixed in garrytan#98 (query expansion), garrytan#114 (chunker), and garrytan#115 (slugify):
ASCII-only assumptions baked into a fourth part of the codebase.

Reproduction:
    cd some-brain-repo
    echo "# test" > "inbox/测试文件.md"
    git add . && git commit -m test
    gbrain sync --repo .
    # -> "added: 0, chunksCreated: 0"  ← bug
    # -> But git log clearly shows the commit added the file.

Fix:
- Add `-c core.quotepath=false` to the `git()` helper in
  src/commands/sync.ts. This config tells git to emit paths as-is
  (UTF-8) in diff/log output instead of the default double-quoted
  octal-escaped form. The fix is at the call site so all future git
  invocations through this helper are covered, not just `diff`.

Impact:
- 2 files changed, +18 / -1 lines (1-line code fix + comment + tests)
- Zero behavior change for ASCII-only paths
- CJK filenames (with or without spaces) now sync correctly

Test plan:
- [x] 3 new tests in test/sync.test.ts cover pure-CJK paths (Chinese
      + Japanese + Korean), CJK-with-spaces (Apple Notes pattern),
      and CJK rename entries.
- [x] All 35 sync tests pass (32 existing + 3 new).
- [x] Full `bun test` suite: no new regressions (the 4 pre-existing
      PGLiteEngine failures are unrelated and exist on master).

Companion to garrytan#114 (chunker CJK) and garrytan#115 (slugify CJK). Third in
the series; all three can merge independently.
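The call-site fix can be sketched as below. The actual git() helper in src/commands/sync.ts may be shaped differently; gitArgs is a hypothetical helper introduced here only to make the config injection explicit and testable:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical helper: builds the argv so the quotepath override is
// applied to every git invocation through git(), not just `diff`.
function gitArgs(repo: string, args: string[]): string[] {
  // core.quotepath=false makes git emit non-ASCII paths as raw UTF-8
  // instead of double-quoted octal escapes like "\350\256\260".
  return ["-c", "core.quotepath=false", "-C", repo, ...args];
}

async function git(repo: string, ...args: string[]): Promise<string> {
  const { stdout } = await execFileAsync("git", gitArgs(repo, args));
  return stdout;
}

// Usage: diff output now contains "inbox/测试文件.md" verbatim.
// await git(".", "diff", "--name-status", "HEAD~1", "HEAD");
```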
@YIING99 YIING99 deleted the fix/cjk-word-count branch April 19, 2026 10:29