fix: CJK + opening bracket line break segmentation#148

Closed
mayrang wants to merge 6 commits into chenglou:main from mayrang:explore/new-korean-bugs

@mayrang (Contributor) commented Apr 18, 2026

Closes #145

What broke

The engine produces different line breaks than the browser when CJK text is followed by opening brackets:

Input: "서울(Seoul)과 부산(Busan)" at 180px

Browser:   서울           ✓  ( goes to next line with its content
           (Seoul)과

Pretext:   서울(          ✗  ( stuck to preceding CJK
           Seoul)과

Same bug for all CJK languages: 東京(Tokyo), 北京(Beijing), 인공지능(AI).
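
To make the difference concrete, here is a toy greedy line breaker that may only break between segments. It is a sketch, not Pretext's layout code: `breakLines`, `unitWidth`, and the width budget of 8 are all hypothetical, and the one-unit-per-code-point metric stands in for real text measurement.

```typescript
// Toy greedy line breaker: a line may only end at a segment boundary.
function breakLines(
  segments: string[],
  maxWidth: number,
  width: (s: string) => number,
): string[] {
  const lines: string[] = [];
  let current = "";
  for (const seg of segments) {
    if (current !== "" && width(current + seg) > maxWidth) {
      // The segment does not fit: close the line and start a new one.
      lines.push(current);
      current = seg;
    } else {
      current += seg;
    }
  }
  if (current !== "") lines.push(current);
  return lines;
}

// Assumed metric: every code point is 1 unit wide (real measurement differs).
const unitWidth = (s: string): number => Array.from(s).length;

// Browser-like segmentation: the bracket travels with its content.
const browserLike = breakLines(["서울", "(Seoul)과"], 8, unitWidth);
// Buggy segmentation: the bracket is glued to the preceding CJK text.
const buggy = breakLines(["서울(", "Seoul)과"], 8, unitWidth);

console.log(browserLike); // [ "서울", "(Seoul)과" ]
console.log(buggy);       // [ "서울(", "Seoul)과" ]
```

The breaker is identical in both runs. Only the segment boundaries change, which is why the fix belongs in the analysis step rather than in the line breaker.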

Why this only breaks CJK, not English

English and CJK go through different segmentation paths after the analysis merge:

  • English: AB(CD) → [AB(] [CD)]. Even though ( is merged backward, the line breaker can still break within AB( at character boundaries, so there is no visible problem.
  • CJK: 서울( enters buildBaseCjkUnits() which splits by character → [서] [울] [(] — the ( becomes an isolated unit, disconnected from Seoul). The meaningful group (Seoul) is broken apart, and the line breaker can't reconnect them.
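
The two paths can be sketched roughly as follows. `splitIntoUnits` is a hypothetical stand-in, not the real buildBaseCjkUnits(); it only models the per-character split described above, with the CJK flag passed in by hand.

```typescript
// Hypothetical model of the split step: CJK segments become one unit per
// code point, while non-CJK segments are kept as a single unit.
function splitIntoUnits(segment: string, containsCjk: boolean): string[] {
  return containsCjk ? Array.from(segment) : [segment];
}

// After the backward merge, the CJK segment already carries the bracket.
const cjkUnits = splitIntoUnits("서울(", true);
// The English segment carries it too, but stays in one piece.
const latinUnits = splitIntoUnits("AB(", false);

console.log(cjkUnits);   // [ "서", "울", "(" ]
console.log(latinUnits); // [ "AB(" ]
```

In this model the ( ends up as its own unit next to 울, with nothing tying it to the Seoul) segment that follows.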

Why it broke

The first-pass merge in src/analysis.ts (~line 1006) has a rule: "non-word-like punctuation sticks to the previous text segment." This correctly handles closing punctuation such as ), ., and , but it also catches the opening brackets (, [, and {, because isEscapedQuoteClusterSegment() matches all kinsokuEnd characters.

This merge runs before the forward-sticky pass, so ( gets consumed backward before it can be attached forward to Seoul):

Word segmenter:     서울 | ( | Seoul | ) | 과
leftSticky merge:   서울 | ( | Seoul) | 과       ← ) sticks to Seoul (correct)
first-pass merge:   서울( | Seoul) | 과           ← BUG: ( sticks backward
forward-sticky:     (nothing — ( already gone)

How it was fixed

Added !mergedContainsCJK[prevIndex] guard at two merge points in src/analysis.ts:

  1. First-pass merge (~line 1008) — the actual fix
  2. Escaped-quote merge (~line 1057) — belt-and-suspenders on the same path

When the previous segment contains CJK, skip the backward merge. The bracket reaches the forward-sticky pass which correctly attaches it forward:

first-pass merge:   SKIPPED (prev is CJK)
forward-sticky:     서울 | (Seoul) | 과           ✓
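
The pass ordering and the guard can be modeled with a simplified sketch. This is not the code in src/analysis.ts: only `mergedContainsCJK` and `prevIndex` are names taken from this PR, while the pass functions, the character sets, the rough CJK regex, and the `guardCjk` flag are assumptions made for illustration. The backward merge is reduced to the opening-bracket case.

```typescript
// Simplified model of the three passes; helper names are hypothetical.
const OPENING = new Set(["(", "[", "{"]);
const CLOSING = new Set([")", "]", "}", ".", ","]);
// Rough CJK test: kana, Hangul Compatibility Jamo, CJK ideographs, Hangul syllables.
const CJK = /[\u3040-\u30ff\u3130-\u318f\u4e00-\u9fff\uac00-\ud7af]/;

// leftSticky merge: closing punctuation attaches to the previous segment.
function leftStickyMerge(segs: string[]): string[] {
  const out: string[] = [];
  for (const s of segs) {
    if (CLOSING.has(s) && out.length > 0) out[out.length - 1] += s;
    else out.push(s);
  }
  return out;
}

// First-pass merge: leftover punctuation sticks backward. With guardCjk set,
// the merge is skipped when the previous merged segment contains CJK.
function firstPassMerge(segs: string[], guardCjk: boolean): string[] {
  const merged: string[] = [];
  const mergedContainsCJK: boolean[] = [];
  for (const s of segs) {
    const prevIndex = merged.length - 1;
    const blocked = guardCjk && prevIndex >= 0 && mergedContainsCJK[prevIndex];
    if (OPENING.has(s) && prevIndex >= 0 && !blocked) {
      merged[prevIndex] += s;
    } else {
      merged.push(s);
      mergedContainsCJK.push(CJK.test(s));
    }
  }
  return merged;
}

// Forward-sticky pass: a standalone opening bracket attaches to the next segment.
function forwardSticky(segs: string[]): string[] {
  const out: string[] = [];
  let pending = "";
  for (const s of segs) {
    if (OPENING.has(s)) {
      pending += s;
    } else {
      out.push(pending + s);
      pending = "";
    }
  }
  if (pending !== "") out.push(pending);
  return out;
}

function segment(words: string[], guardCjk: boolean): string[] {
  return forwardSticky(firstPassMerge(leftStickyMerge(words), guardCjk));
}

const input = ["서울", "(", "Seoul", ")", "과"]; // word segmenter output from above
const before = segment(input, false);                  // [ "서울(", "Seoul)", "과" ]
const after = segment(input, true);                    // [ "서울", "(Seoul)", "과" ]
const english = segment(["AB", "(", "CD", ")"], true); // [ "AB(", "CD)" ]
```

With the guard enabled, the English case still merges ( backward, matching the "Non-CJK unchanged" point below.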

Why this approach

  • 2 lines of guard conditions — not a restructure, minimal risk
  • Non-CJK unchanged — English AB(CD) still works as before
  • Root cause fix — the backward merge was the wrong operation for brackets after CJK; the forward-sticky pass already does the right thing when brackets reach it

Tests

  • bun test — 88/88 pass
  • bun run scripts/cjk-bracket-check.ts --browser=chrome — 10/10 pass
  ✓ A1: Korean parenthesized English       ✓ A6: Chinese abbreviation bracket
  ✓ A2: Japanese parenthesized English     ✓ A7: Korean square brackets
  ✓ A3: Chinese parenthesized English      ✓ A8: Japanese square brackets
  ✓ A4: Korean abbreviation bracket        ✓ A9: Korean curly braces
  ✓ A5: Japanese abbreviation bracket      ✓ A10: Mixed CJK + nested brackets

chenglou and others added 6 commits April 18, 2026 00:53
Normalize chunked batch line starts through the same segment-kind policy used by streaming so layoutWithLines(), walkLineRanges(), layoutNextLine(), and layout() stay aligned after zero-width break opportunities and collapsible spaces.

Classify Hangul Compatibility Jamo (U+3130..U+318F) as CJK so common Korean compatibility jamo runs break like browser text.

Closes chenglou#121

Closes chenglou#142

Co-authored-by: mayrang <pkss0626@naver.com>

Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>

Co-authored-by: lttlin <lttlin@gmail.com>

Add the dedicated chenglou#135 regression shape where a pending soft-hyphen break must not preempt the closer preserved-space break that batch layout chooses.

Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>

The escaped-quote backward merge was catching bare opening brackets
like ( after CJK text, producing '서울(' instead of keeping '('
separate. Skip the merge when the preceding segment contains CJK
so brackets flow through to the forward-sticky pass and correctly
attach to the following text: '서울' | '(Seoul)'.

10 test cases covering opening brackets after Korean, Japanese, and
Chinese text — parentheses, square brackets, curly braces, abbreviations,
and nested brackets. All 10/10 pass on Chrome.

@mayrang force-pushed the explore/new-korean-bugs branch from 45c4140 to 8036cca on April 18, 2026 01:13
@chenglou force-pushed the main branch 2 times, most recently from c12ce6a to 3f0a4e6 on April 18, 2026 05:27
chenglou added a commit that referenced this pull request Apr 18, 2026
Skip backward punctuation merges when the previous segment contains CJK so opening brackets can flow through to the forward-sticky pass and attach to the following annotation text.

Closes #145
Refs #148.

Co-authored-by: mayrang <pkss0626@naver.com>

@chenglou (Owner) commented

Thanks! I implemented the minimal fix in d9f2dff and credited you as co-author.

@chenglou closed this Apr 18, 2026
nice-hang pushed a commit to nice-hang/pretext that referenced this pull request Apr 18, 2026
Skip backward punctuation merges when the previous segment contains CJK so opening brackets can flow through to the forward-sticky pass and attach to the following annotation text.

Closes chenglou#145
Refs chenglou#148.

Co-authored-by: mayrang <pkss0626@naver.com>
Development

Successfully merging this pull request may close these issues.

Bug: Opening brackets after CJK text merge into wrong segment

2 participants