fix: preserve CJK characters in slugify, prevent silent collision by vinsew · Pull Request #115 · garrytan/gbrain

vinsew · 2026-04-14T12:18:30Z

Summary

slugifySegment() in src/core/sync.ts uses /[^a-z0-9.\s_-]/g to strip "special" chars. That regex strips every non-ASCII character, including all CJK. Consequence:

品牌圣经.md → slug segment "" (empty)
销售论证文档.md → slug segment "" (empty)

Both empty segments are filtered out by slugifyPath's .filter(Boolean), and both files collapse to the parent directory slug (e.g., inbox). The second import silently overwrites the first. Worse, gbrain import reports N imported, 0 errors because UPSERT doesn't distinguish overwrite from insert.

This is the slug-layer counterpart to #98 (query expansion) and #114 (chunker) — the same ASCII-only assumption baked into a third part of the codebase.
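
The collapse can be reproduced in isolation. The sketch below is a minimal stand-in, not the code in src/core/sync.ts: only the filter regex and the .filter(Boolean) step come from the description above, while the function names, the extension stripping, and the other steps are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the pre-fix behavior. Only the regex and
// .filter(Boolean) are taken from the PR description; the rest is assumed.
function oldSlugifySegment(segment: string): string {
  return segment
    .toLowerCase()
    .replace(/[^a-z0-9.\s_-]/g, "") // strips every non-ASCII character
    .trim()
    .replace(/\s+/g, "-");
}

function oldSlugifyPath(filePath: string): string {
  return filePath
    .replace(/\.md$/i, "") // assumed: extension is dropped before slugifying
    .split("/")
    .map(oldSlugifySegment)
    .filter(Boolean) // empty segments vanish here
    .join("/");
}

const first = oldSlugifyPath("inbox/品牌圣经.md");
const second = oldSlugifyPath("inbox/销售论证文档.md");
console.log(first, second, first === second); // prints "inbox inbox true"
```

Both paths reduce to the parent directory slug, which is why the second import lands on the same row as the first.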

Reproduction

mkdir repro && cd repro
echo '# A' > 品牌圣经.md
echo '# B' > 销售论证文档.md
gbrain init
gbrain import .
# → "2 pages imported, 0 errors"
gbrain list
# → Only 1 page! One file was silently overwritten.

Fix

Two small changes in src/core/sync.ts:

1. Add CJK ranges to the character whitelist: Han (\u4e00-\u9fff), Hiragana (\u3040-\u309f), Katakana (\u30a0-\u30ff), Hangul Syllables (\uac00-\ud7af). This is the same set used by the chunker fix (#114) and the expansion fix (#98), for consistency.
2. Re-normalize to NFC after the accent-strip step. NFD decomposes Hangul syllables (e.g., 한) into conjoining Jamo (ㅎ + ㅏ + ㄴ) in the Hangul Jamo block (U+1100 to U+11FF), which sits outside the Hangul Syllables block. Without re-composing, Korean names would still collapse to empty. The NFD+NFC dance is the same trick Unicode libraries use to apply combining-mark operations on Latin without mangling Hangul.
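
The two changes can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: the CJK ranges and the NFD, accent-strip, NFC order come from the description above, while the function name, the `recompose` switch (added only to show what happens without step 2), and the remaining steps are hypothetical.

```typescript
// Sketch only. Ranges and step order follow the PR description; the names
// and the surrounding steps are assumptions.
const COMBINING_MARKS = /[\u0300-\u036f]/g;
const NOT_ALLOWED =
  /[^a-z0-9.\s_\-\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]/g;

function slugifySegmentSketch(segment: string, recompose = true): string {
  // NFD splits accented Latin letters into base + combining mark,
  // and also splits Hangul syllables into conjoining Jamo.
  let s = segment.normalize("NFD").replace(COMBINING_MARKS, "");
  // NFC puts the Jamo back together so they fall inside \uac00-\ud7af.
  if (recompose) s = s.normalize("NFC");
  return s.toLowerCase().replace(NOT_ALLOWED, "").trim().replace(/\s+/g, "-");
}

console.log(slugifySegmentSketch("品牌圣经"));         // 品牌圣经
console.log(slugifySegmentSketch("한글테스트"));        // 한글테스트
console.log(slugifySegmentSketch("한글테스트", false)); // empty string: Jamo are filtered out
console.log(slugifySegmentSketch("Café Notes"));      // cafe-notes
```

With `recompose` set to false the Korean input still collapses to an empty string, which is the case the NFC step exists to prevent.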

Impact

- 2 files changed, +54 / -2 lines
- Zero behavior change for ASCII-only text (same regex shape, same steps, same output)
- Pure and mixed CJK filenames now produce meaningful, non-colliding slugs
- Closes a silent-data-loss class of bug for CJK users

Test plan

8 new cases in test/slug-validation.test.ts:
- slugifySegment preserves Chinese (品牌圣经 → 品牌圣经)
- slugifySegment preserves Japanese (Hiragana ひらがな, Katakana カタカナ, mixed テスト文書)
- slugifySegment preserves Korean (한글테스트 → 한글테스트) — exercises the NFC recomposition
- slugifySegment mixed CJK + ASCII: lowercases ASCII, preserves CJK, space → hyphen (ICP-理想客户画像 → icp-理想客户画像)
- slugifySegment collision regression: two different CJK names produce different slugs
- slugifyPath pure-CJK files keep their characters (inbox/品牌圣经.md → inbox/品牌圣经)
- slugifyPath collision regression at the path level
- slugifyPath CJK directory names preserved
All 46 slug tests pass (8 new + 38 existing)
bun test shows no new regressions (the 4 pre-existing PGLiteEngine failures are unrelated and exist on master)

Companion to #114 (chunker CJK fix). Either can be merged independently.

When a git repository contains files with non-ASCII names (common for Chinese/Japanese/Korean users, or for files exported from Apple Notes with spaces + CJK like "2026-04-14 22_38 记录.md"), `git diff --name-status` wraps those paths in double quotes and octal-escapes each non-ASCII byte:

A "inbox/2026-04-14 22_38 \350\256\260\345\275\225.md"

buildSyncManifest then treats that literal quoted-escaped string as the path, downstream filesystem lookups fail, and the file is silently dropped from the sync manifest. The user sees "added: 0" in the sync result even though git has those files committed, and `gbrain search` can't find the content. The cron log shows success because nothing technically errored.

This is the sync-layer counterpart to the same CJK root cause class fixed in garrytan#98 (query expansion), garrytan#114 (chunker), and garrytan#115 (slugify): ASCII-only assumptions baked into a fourth part of the codebase.

Reproduction:

cd some-brain-repo
echo "# test" > "inbox/测试文件.md"
git add . && git commit -m test
gbrain sync --repo .
# -> "added: 0, chunksCreated: 0" ← bug
# -> But git log clearly shows the commit added the file.

Fix:
- Add `-c core.quotepath=false` to the `git()` helper in src/commands/sync.ts. This config tells git to emit paths as-is (UTF-8) in diff/log output instead of the default double-quoted octal-escaped form. The fix is at the call site so all future git invocations through this helper are covered, not just `diff`.

Impact:
- 2 files changed, +18 / -1 lines (1-line code fix + comment + tests)
- Zero behavior change for ASCII-only paths
- CJK filenames (with or without spaces) now sync correctly

Test plan:
- [x] 3 new tests in test/sync.test.ts cover pure-CJK paths (Chinese + Japanese + Korean), CJK-with-spaces (Apple Notes pattern), and CJK rename entries.
- [x] All 35 sync tests pass (32 existing + 3 new).
- [x] Full `bun test` suite: no new regressions (the 4 pre-existing PGLiteEngine failures are unrelated and exist on master).

Companion to garrytan#114 (chunker CJK) and garrytan#115 (slugify CJK). Third in the series; all three can merge independently.
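
As a rough illustration of the quoting problem and of the call-site fix, the sketch below does two things. `withQuotePathDisabled` shows where the `-c core.quotepath=false` override sits relative to the subcommand; it is a hypothetical helper, not the `git()` function in src/commands/sync.ts. `decodeQuotedPath` is not part of the fix at all: it is an illustration-only decoder (octal escapes only, no handling of `\"`, `\\`, or `\t`) that shows which bytes the quoted form encodes.

```typescript
// Hypothetical helper: git config overrides passed with -c must come
// before the subcommand, so they are prepended to the argument list.
function withQuotePathDisabled(args: string[]): string[] {
  return ["-c", "core.quotepath=false", ...args];
}

// Illustration only, NOT the PR's approach: turn git's default quoted,
// octal-escaped form back into a UTF-8 path. Handles octal escapes only.
function decodeQuotedPath(quoted: string): string {
  if (quoted.length < 2 || quoted[0] !== '"' || quoted[quoted.length - 1] !== '"') {
    return quoted; // unquoted paths are already literal
  }
  const inner = quoted.slice(1, -1);
  let percentEncoded = "";
  for (let i = 0; i < inner.length; ) {
    const octal = /^\\([0-7]{3})/.exec(inner.slice(i));
    if (octal) {
      // Each \NNN escape is one raw byte; re-express it as %XX.
      const hex = parseInt(octal[1], 8).toString(16);
      percentEncoded += "%" + ("0" + hex).slice(-2);
      i += 4;
    } else {
      percentEncoded += encodeURIComponent(inner[i]);
      i += 1;
    }
  }
  return decodeURIComponent(percentEncoded); // reassembles the UTF-8 bytes
}

console.log(withQuotePathDisabled(["diff", "--name-status", "HEAD~1"]).join(" "));
// -c core.quotepath=false diff --name-status HEAD~1
console.log(decodeQuotedPath('"inbox/2026-04-14 22_38 \\350\\256\\260\\345\\275\\225.md"'));
// inbox/2026-04-14 22_38 记录.md
```

Disabling the quoting at the source avoids maintaining a decoder like this one, which is the design choice the fix makes.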

GBrain stores internal cross-page references in slug form (e.g. `[Alice](./alice)`) because the slug is the canonical identifier in the DB. That works inside GBrain's own resolution layer. But when those pages are exported as `.md` files on disk and opened in standard markdown viewers (Obsidian, VS Code preview, GitHub web view, typical mkdocs/jekyll renderers), the viewers look for a literal file at `./alice`, which doesn't exist. The actual file is `./alice.md`.

Result: every internal link in an exported brain is silently broken on disk. The user clicks `[小龙]` in `龙虾群.md`, sees a 404 / empty page, and cannot navigate the brain outside of GBrain itself. This defeats half the value of having the brain stored as portable markdown.

Fix: Add `normalizeInternalLinks(content)` that runs over each page's serialized markdown right before `writeFileSync` and rewrites slug-form internal links to filename-form by appending `.md`:

[Alice](./alice) -> [Alice](./alice.md)
[Alice](alice) -> [Alice](alice.md)
[Alice](../people/alice) -> [Alice](../people/alice.md)
[小龙](../people/小龙) -> [小龙](../people/小龙.md)

Conservative: leaves untouched anything that looks external or already extended:
- URL schemes (http:, https:, mailto:, ftp:, file:, tel:, ...): skip
- Anchors (#section): skip
- Empty targets: skip
- Trailing slash (directory references): skip
- Already has any extension (.md, .png, .pdf, .MD, ...): skip
- Preserves query strings and anchors when appending:
  [Section](./alice#bio) -> [Section](./alice.md#bio)
  [Search](./alice?q=t) -> [Search](./alice.md?q=t)

The DB content stays slug-form (GBrain's internal convention is unchanged). Only the on-disk export gets the `.md` annotation, so the exported markdown is viewable as-is by any standard renderer.

Real-world reproduction this fix addresses:

$ gbrain put 龙虾群 < <(echo '[小龙](./小龙)')
$ gbrain export --dir /tmp/out
$ cat /tmp/out/龙虾群.md
# before this PR: contains [小龙](./小龙); clicking 404s
# after this PR: contains [小龙](./小龙.md); clicking opens the file

Impact:
- 2 files changed, +149 / -1 lines (1 line of helper invocation + ~40 lines of helper + comment + 26 tests)
- Zero behavior change for external URLs, anchors, or already-extended links
- DB content unchanged: only the on-disk export representation gains the `.md` annotation
- Existing exports remain valid (re-running export on an already-exported brain is idempotent because already-extended links are skipped)

Tests:
- 26 new tests covering: same-dir slug, parent-dir slug, deep nesting, CJK slugs, multiple links per line, multi-line markdown, all 6 external schemes (http/https/mailto/file/ftp/tel), all 4 extension cases (md/png/pdf/uppercase), anchor preservation, query preservation, empty/trailing-slash/no-link edge cases.
- All 26 tests pass.
- Full suite: 612 pass / no new regressions (4 pre-existing PGLiteEngine failures are unrelated and exist on master).

Fifth in a series of practical PRs from a real Chinese-speaking deploy. Companion to:
- garrytan#114 (chunker CJK)
- garrytan#115 (slugify CJK)
- garrytan#119 (sync git quotepath CJK)
- garrytan#121 (self-contained API keys)

Same theme: GBrain is meaningfully more useful when the markdown export is a first-class deliverable, not a half-broken side-effect.
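
The rewrite rules listed above could look roughly like the sketch below. It is a hedged approximation written from the rule list, not the helper from that PR: the function name carries a `Sketch` suffix, and the link regex, the scheme test, and the extension test are all assumptions. It also does not distinguish image links from ordinary links.

```typescript
// Assumed link pattern: [text](target) with no whitespace in the target.
const LINK_RE = /\[([^\]]*)\]\(([^)\s]*)\)/g;

function normalizeInternalLinksSketch(content: string): string {
  return content.replace(LINK_RE, (match, text: string, target: string) => {
    if (target === "" || target.startsWith("#")) return match; // empty or pure anchor
    if (/^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(target)) return match; // URL scheme
    // Split off ?query or #anchor so .md lands before it.
    const cut = target.search(/[?#]/);
    const path = cut === -1 ? target : target.slice(0, cut);
    const suffix = cut === -1 ? "" : target.slice(cut);
    if (path === "" || path.endsWith("/")) return match; // directory reference
    const lastSegment = path.slice(path.lastIndexOf("/") + 1);
    if (lastSegment === "." || lastSegment === "..") return match;
    if (/\.[A-Za-z0-9]+$/.test(lastSegment)) return match; // already has an extension
    return `[${text}](${path}.md${suffix})`;
  });
}

console.log(normalizeInternalLinksSketch("[小龙](./小龙) and [site](https://example.com)"));
// [小龙](./小龙.md) and [site](https://example.com)
```

Because links that already carry an extension are skipped, running the function over its own output changes nothing, which matches the idempotence claim in the Impact list.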

slugifySegment's filter regex /[^a-z0-9.\s_-]/g strips every non-ASCII character, so a pure-CJK filename (e.g., "品牌圣经.md", "销售论证文档.md") collapses to an empty string and gets filtered out by slugifyPath's .filter(Boolean). Both files then collapse to the parent directory slug (e.g., "inbox") and silently overwrite each other during gbrain import, which still reports "N imported, 0 errors".

This is the slug-side counterpart to PR garrytan#98 (query expansion) and PR garrytan#114 (chunker): same root cause (ASCII-only text handling) in a third part of the codebase.

Changes:
- slugifySegment(): add CJK ranges (Han, Hiragana, Katakana, Hangul) to the character-class whitelist. Mirrors the CJK range constant used in the chunker fix for consistent semantics.
- Add .normalize('NFC') after the NFD+accent-strip step so Hangul syllables, which NFD decomposes into conjoining Jamo, get recomposed before the filter runs. Without this Korean names still collapse to empty because Jamo are outside the Hangul Syllables block.

Impact: pure/mixed CJK filenames now produce meaningful, non-colliding slugs. ASCII-only behavior is unchanged.

Tests: 8 new cases cover Chinese, Japanese (Hiragana + Katakana), Korean, mixed CJK+ASCII, CJK directories, and the collision-regression scenario. All 46 slug tests pass. No new regressions in full suite (the 4 PGLiteEngine failures pre-exist on master).


This was referenced Apr 14, 2026

fix: preserve CJK paths in gbrain sync (core.quotepath=false) #119

Open

feat: self-contained API keys (read from gbrain's own config, not just env) #121

Open

vinsew mentioned this pull request Apr 14, 2026

fix: gbrain export auto-appends .md to internal slug-form links #123

Open

3 tasks

garrytan mentioned this pull request Apr 18, 2026

fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1) #196

Merged

7 tasks

vinsew mentioned this pull request Apr 24, 2026

fix: CJK word count and delimiters in recursive chunker #114

Open

3 tasks

vinsew force-pushed the fix/cjk-slug branch from dcd5d84 to 0bf8f7d on April 27, 2026 09:17

vinsew mentioned this pull request Apr 27, 2026

fix(sync): reconcile sync-failures.jsonl with reality #479

Open

5 tasks

tamagodo-fu mentioned this pull request May 9, 2026

fix: Unicode slug support across all 4 validators (closes #738, generalizes #115) #782

Open

1 task

vinsew wants to merge 1 commit into garrytan:master from vinsew:fix/cjk-slug
