fix: preserve CJK paths in gbrain sync (core.quotepath=false) #119

Open
vinsew wants to merge 1 commit into garrytan:master from vinsew:fix/cjk-sync-quotepath

vinsew commented Apr 14, 2026

Summary

gbrain sync silently drops files with non-ASCII names from the sync manifest. Git's default diff --name-status output wraps CJK/unicode paths in double quotes with octal byte escapes:

A	"inbox/2026-04-14 22_38 \350\256\260\345\275\225.md"

buildSyncManifest treats that entire string as the literal path, downstream filesystem lookups fail, and the file is silently dropped. The user sees added: 0, chunksCreated: 0 in the sync result even though git has the file committed, and gbrain search can't find the content. The cron log reports success because nothing technically errored.
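To make the failure concrete, here is a minimal sketch of what the quoted form encodes. `unquoteGitPath` is a hypothetical helper written for illustration only; it is not part of this PR and not a function in the gbrain codebase. It reverses git's C-style quoting by turning each three-digit octal escape back into a byte and decoding the result as UTF-8.

```typescript
// Hypothetical helper for illustration; not part of the PR.
// Single-letter escapes that C-style quoting can emit, mapped to byte values.
const SIMPLE_ESCAPES = new Map<string, number>([
  ['a', 0x07], ['b', 0x08], ['t', 0x09], ['n', 0x0a], ['v', 0x0b],
  ['f', 0x0c], ['r', 0x0d], ['"', 0x22], ['\\', 0x5c],
]);

function unquoteGitPath(field: string): string {
  // Paths that git did not quote are returned unchanged.
  if (field.length < 2 || !field.startsWith('"') || !field.endsWith('"')) {
    return field;
  }
  const inner = field.slice(1, -1);
  const bytes: number[] = [];
  const encoder = new TextEncoder();
  let i = 0;
  while (i < inner.length) {
    if (inner[i] !== '\\') {
      // Literal character: append its UTF-8 bytes (code point aware).
      const ch = String.fromCodePoint(inner.codePointAt(i)!);
      encoder.encode(ch).forEach((b) => bytes.push(b));
      i += ch.length;
      continue;
    }
    // Three octal digits after the backslash encode one raw byte.
    const octal = inner.slice(i + 1, i + 4);
    if (/^[0-7]{3}$/.test(octal)) {
      bytes.push(parseInt(octal, 8) & 0xff);
      i += 4;
      continue;
    }
    const code = SIMPLE_ESCAPES.get(inner[i + 1] ?? '');
    if (code === undefined) {
      throw new Error(`Unsupported escape at offset ${i} in ${field}`);
    }
    bytes.push(code);
    i += 2;
  }
  return new TextDecoder('utf-8').decode(Uint8Array.from(bytes));
}
```

Applied to the example line above, the six octal escapes decode to the two characters 记录, so the quoted string and the real on-disk name differ byte for byte. That difference is why a filesystem lookup on the raw diff output fails.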

This is the sync-layer counterpart to the same CJK root cause class fixed in #98 (query expansion), #114 (chunker), and #115 (slugify): ASCII-only assumptions baked into a fourth part of the codebase.

Reproduction

mkdir /tmp/brain && cd /tmp/brain && git init
gbrain init
mkdir -p inbox   # no-op if the directory already exists
echo "# test content" > "inbox/测试文件.md"
git add . && git commit -m "add test"
gbrain sync --repo .
# → Output: { "added": 0, "chunksCreated": 0 }     ← bug
# → But `git log --stat` clearly shows the file was added.
# → `gbrain list` shows no new page; `gbrain search "测试"` returns nothing.

Real-world case: Apple Notes exports produce names like 2026-04-14 22_38 记录-个人智能体_原文.md. Any non-ASCII byte in a path triggers git's default quoting, so these space-plus-CJK names are always quoted, and those files never make it into the brain via the live-sync cron.

Fix

One-line change to the git() helper in src/commands/sync.ts: prepend -c core.quotepath=false before the subcommand.

// before
execFileSync('git', ['-C', repoPath, ...args], ...)

// after
execFileSync('git', ['-c', 'core.quotepath=false', '-C', repoPath, ...args], ...)

core.quotepath=false tells git to emit paths as-is (UTF-8) in diff/log output. Setting it at the helper level means all current and future git invocations through this code path are covered — not just diff.
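A minimal sketch of what the patched helper could look like as a whole. The split into `buildGitArgs` plus `git`, the function signatures, and the `{ encoding: 'utf8' }` option are assumptions made for illustration, since the snippet above elides the options argument. Only the argument order is taken from the diff.

```typescript
import { execFileSync } from 'node:child_process';

// Hypothetical helper: builds the argv passed to git.
// `-c key=value` is a global option, so it has to precede the subcommand in `args`.
function buildGitArgs(repoPath: string, args: string[]): string[] {
  return ['-c', 'core.quotepath=false', '-C', repoPath, ...args];
}

// Assumed shape of the helper; the real options object in src/commands/sync.ts
// is not shown in this PR description.
function git(repoPath: string, args: string[]): string {
  return execFileSync('git', buildGitArgs(repoPath, args), { encoding: 'utf8' });
}
```

A call such as `git('.', ['diff', '--name-status', 'HEAD~1', 'HEAD'])` would then return CJK paths as plain UTF-8 text. Note that git's documentation says double quotes, backslashes, and control characters in a path are still escaped regardless of this setting, so names containing those characters remain quoted.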

Impact

  • 2 files changed, +38 / -1 lines (1-line code fix + explanatory comment + 3 tests)
  • Zero behavior change for ASCII-only paths (same flag, same output format)
  • CJK filenames — with or without spaces — now sync correctly
  • Unblocks Chinese/Japanese/Korean users whose incremental sync has been silently partial
  • Also fixes Apple Notes export (spaces + CJK) which is a common ingest pattern

Test plan

  • 3 new tests in test/sync.test.ts:
    • Pure CJK filenames (Chinese Han, Japanese Hiragana, Korean Hangul)
    • CJK filenames with spaces (Apple Notes export pattern)
    • CJK rename entries
  • All 35 sync tests pass (32 existing + 3 new)
  • Full bun test suite runs clean — no new regressions (the 4 pre-existing PGLiteEngine failures are unrelated and exist on master)
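The three cases in the test plan can be pictured with a small parser for `git diff --name-status` output. This is a sketch under stated assumptions: `parseNameStatus` and `NameStatusEntry` are hypothetical names, not the actual `buildSyncManifest` implementation, and the input is assumed to have been produced with `core.quotepath=false` so paths arrive as plain UTF-8.

```typescript
// Hypothetical types and parser for illustration; not the PR's code.
interface NameStatusEntry {
  status: string;   // first letter of the status field: A, M, D, R, C, ...
  path: string;     // current path (the new path for renames and copies)
  oldPath?: string; // previous path, present only for renames and copies
}

function parseNameStatus(output: string): NameStatusEntry[] {
  const entries: NameStatusEntry[] = [];
  for (const line of output.split('\n')) {
    if (line === '') continue;
    // Fields are tab-separated; spaces inside a path are part of the path.
    const [status, first, second] = line.split('\t');
    if (!status || first === undefined) continue;
    const kind = status[0];
    if ((kind === 'R' || kind === 'C') && second !== undefined) {
      // Renames and copies look like "R100<TAB>old<TAB>new".
      entries.push({ status: kind, oldPath: first, path: second });
    } else {
      entries.push({ status: kind, path: first });
    }
  }
  return entries;
}
```

Because the split is on tabs only, a name such as `inbox/2026-04-14 22_38 记录.md` survives intact, and a rename keeps both its old and new CJK paths.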

Note on manual workaround

Existing users affected by this bug can recover by running gbrain sync --full once, which uses a filesystem walker instead of git-diff parsing and correctly picks up all CJK files. After this PR lands, incremental sync handles them natively.
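For intuition on why the full sync is unaffected, here is a rough sketch of a filesystem walk. `walkMarkdownFiles` is a hypothetical name, and the real `gbrain sync --full` walker may be structured quite differently. The point is only that names read from the filesystem arrive as ordinary strings, with no git quoting layer in between.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical walker for illustration; not the gbrain implementation.
// Returns paths of .md files relative to `root`, sorted for stable output.
function walkMarkdownFiles(root: string): string[] {
  const found: string[] = [];
  const visit = (dir: string): void => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      if (entry.name === '.git') continue; // skip git metadata
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        visit(full);
      } else if (entry.isFile() && entry.name.toLowerCase().endsWith('.md')) {
        found.push(path.relative(root, full));
      }
    }
  };
  visit(root);
  return found.sort();
}
```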

Third in the CJK series after #114 (chunker) and #115 (slugify). All three are independent, small, and can merge in any order.

vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
GBrain stores internal cross-page references in slug form (e.g.
`[Alice](./alice)`) because the slug is the canonical identifier in the
DB. That works inside GBrain's own resolution layer.

But when those pages are exported as `.md` files on disk and opened in
standard markdown viewers (Obsidian, VS Code preview, GitHub web view,
typical mkdocs/jekyll renderers), the viewers look for a literal file
at `./alice` — which doesn't exist. The actual file is `./alice.md`.

Result: every internal link in an exported brain is silently broken on
disk. The user clicks `[小龙]` in `龙虾群.md`, sees a 404 / empty page,
and cannot navigate the brain outside of GBrain itself. This defeats
half the value of having the brain stored as portable markdown.

Fix:

Add `normalizeInternalLinks(content)` that runs over each page's
serialized markdown right before `writeFileSync` and rewrites slug-form
internal links to filename-form by appending `.md`:

  [Alice](./alice)            -> [Alice](./alice.md)
  [Alice](alice)              -> [Alice](alice.md)
  [Alice](../people/alice)    -> [Alice](../people/alice.md)
  [小龙](../people/小龙)        -> [小龙](../people/小龙.md)

Conservative: leaves untouched anything that looks external or already
extended:

- URL schemes (http:, https:, mailto:, ftp:, file:, tel:, ...) — skip
- Anchors (#section)                                            — skip
- Empty targets                                                 — skip
- Trailing slash (directory references)                         — skip
- Already has any extension (.md, .png, .pdf, .MD, ...)         — skip
- Preserves query strings and anchors when appending:
  [Section](./alice#bio) -> [Section](./alice.md#bio)
  [Search](./alice?q=t)  -> [Search](./alice.md?q=t)

The DB content stays slug-form (GBrain's internal convention is
unchanged). Only the on-disk export gets the `.md` annotation, so the
exported markdown is viewable as-is by any standard renderer.
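
The rules above can be sketched as a single regex-based pass. This is an illustrative approximation, not the helper added by the commit: the name `normalizeInternalLinksSketch`, the link regex, and the treatment of `.` and `..` segments are all assumptions. Links whose target contains whitespace (for example a quoted title) are simply left alone by this version.

```typescript
// Matches inline markdown links of the form [text](target) with no whitespace in target.
const INLINE_LINK = /\[([^\]]*)\]\(([^)\s]*)\)/g;

// Hypothetical sketch of the rewrite rules described in the commit message.
function normalizeInternalLinksSketch(content: string): string {
  return content.replace(INLINE_LINK, (match: string, text: string, target: string) => {
    // Empty targets and pure anchors are left untouched.
    if (target === '' || target.startsWith('#')) return match;
    // Anything with a URL scheme (http:, mailto:, tel:, ...) is external.
    if (/^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(target)) return match;
    // Split off a query string or anchor so it can be re-attached after ".md".
    const cut = target.search(/[?#]/);
    const pathPart = cut === -1 ? target : target.slice(0, cut);
    const suffix = cut === -1 ? '' : target.slice(cut);
    // Directory references are skipped.
    if (pathPart === '' || pathPart.endsWith('/')) return match;
    const lastSegment = pathPart.slice(pathPart.lastIndexOf('/') + 1);
    if (lastSegment === '.' || lastSegment === '..') return match;
    // Skip targets that already carry any extension (.md, .png, .MD, ...).
    if (/\.[^./]+$/.test(lastSegment)) return match;
    return `[${text}](${pathPart}.md${suffix})`;
  });
}
```

Running the function twice gives the same result as running it once, since every rewritten link now has an extension and is skipped on the second pass.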

Real-world reproduction this fix addresses:

  $ gbrain put 龙虾群 < <(echo '[小龙](./小龙)')
  $ gbrain export --dir /tmp/out
  $ cat /tmp/out/龙虾群.md
  # before this PR: contains [小龙](./小龙)  — clicking 404s
  # after this PR:  contains [小龙](./小龙.md) — clicking opens the file

Impact:
- 2 files changed, +149 / -1 lines (1 line of helper invocation +
  ~40 lines of helper + comment + 26 tests)
- Zero behavior change for external URLs, anchors, or already-extended
  links
- DB content unchanged — only the on-disk export representation gains
  the `.md` annotation
- Existing exports remain valid (re-running export on an already-exported
  brain is idempotent because already-extended links are skipped)

Tests:
- 26 new tests covering: same-dir slug, parent-dir slug, deep nesting,
  CJK slugs, multiple links per line, multi-line markdown, all 6
  external schemes (http/https/mailto/file/ftp/tel), all 4 extension
  cases (md/png/pdf/uppercase), anchor preservation, query preservation,
  empty/trailing-slash/no-link edge cases.
- All 26 tests pass.
- Full suite: 612 pass / no new regressions (4 pre-existing PGLiteEngine
  failures are unrelated and exist on master).

Fifth in a series of practical PRs from a real Chinese-speaking deploy.
Companion to:
- garrytan#114 (chunker CJK)
- garrytan#115 (slugify CJK)
- garrytan#119 (sync git quotepath CJK)
- garrytan#121 (self-contained API keys)

Same theme: GBrain is meaningfully more useful when the markdown export
is a first-class deliverable, not a half-broken side-effect.

When a git repository contains files with non-ASCII names (common for
Chinese/Japanese/Korean users, or for files exported from Apple Notes
with spaces + CJK like "2026-04-14 22_38 记录.md"), `git diff
--name-status` wraps those paths in double quotes and octal-escapes
each byte:

    A   "inbox/2026-04-14 22_38 \350\256\260\345\275\225.md"

buildSyncManifest then treats that literal quoted-escaped string as
the path, downstream filesystem lookups fail, and the file is
silently dropped from the sync manifest. The user sees "added: 0"
in the sync result even though git has those files committed, and
`gbrain search` can't find the content. The cron log shows success
because nothing technically errored.

This is the sync-layer counterpart to the same CJK root cause class
fixed in garrytan#98 (query expansion), garrytan#114 (chunker), and garrytan#115 (slugify):
ASCII-only assumptions baked into a fourth part of the codebase.

Reproduction:
    cd some-brain-repo
    echo "# test" > "inbox/测试文件.md"
    git add . && git commit -m test
    gbrain sync --repo .
    # -> "added: 0, chunksCreated: 0"  ← bug
    # -> But git log clearly shows the commit added the file.

Fix:
- Add `-c core.quotepath=false` to the `git()` helper in
  src/commands/sync.ts. This config tells git to emit paths as-is
  (UTF-8) in diff/log output instead of the default double-quoted
  octal-escaped form. The fix is at the call site so all future git
  invocations through this helper are covered, not just `diff`.

Impact:
- 2 files changed, +18 / -1 lines (1-line code fix + comment + tests)
- Zero behavior change for ASCII-only paths
- CJK filenames (with or without spaces) now sync correctly

Test plan:
- [x] 3 new tests in test/sync.test.ts cover pure-CJK paths (Chinese
      + Japanese + Korean), CJK-with-spaces (Apple Notes pattern),
      and CJK rename entries.
- [x] All 35 sync tests pass (32 existing + 3 new).
- [x] Full `bun test` suite: no new regressions (the 4 pre-existing
      PGLiteEngine failures are unrelated and exist on master).

Companion to garrytan#114 (chunker CJK) and garrytan#115 (slugify CJK). Third in
the series; all three can merge independently.
vinsew force-pushed the fix/cjk-sync-quotepath branch from 2bb3241 to 86eb7f3 on April 27, 2026 09:17
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 27, 2026