Bug: Opening brackets after CJK text merge into wrong segment #145

@mayrang

Summary

The engine produces different line breaks than the browser when CJK text is followed by opening brackets. The bracket sticks to the preceding CJK text instead of the following content, causing visually wrong line wrapping.

Input: "서울(Seoul)과 부산(Busan)" at 180px

Browser:   서울          ← breaks here, ( goes to next line with Seoul
           (Seoul)과

Pretext:   서울(         ← ( stuck to 서울
           Seoul)과

This affects all CJK languages: 東京(Tokyo), 北京(Beijing), 인공지능(AI), etc.

Why this matters

Parenthesized annotations after CJK text are everyday patterns in Korean, Japanese, and Chinese — brand names, technical terms, romanizations, abbreviations. In narrow containers (mobile chat bubbles, card layouts), the bracket hanging at the wrong line end is clearly visible.

Reproduction

Minimal code to see the bug (before fix):

import { prepareWithSegments } from '@chenglou/pretext'

const prepared = prepareWithSegments('서울(Seoul)', '20px serif')
console.log(prepared.segments)
// Before fix: ['서', '울(', 'Seoul)']  ← ( merged into CJK segment
// After fix:  ['서', '울', '(Seoul)']  ✓

Or run the oracle checker against Chrome:

bun run scripts/cjk-bracket-check.ts --browser=chrome

Why this only breaks CJK, not English

English and CJK text go through different segmentation paths:

  • English: the word segmenter result stays as-is. AB(CD) → [AB(] [CD)]. Even though ( is merged backward, the line breaker can still break within AB( at character boundaries, so the result works out.
  • CJK: the word segmenter result goes through buildBaseCjkUnits(), which splits further by character. 서울( → [서] [울] [(]. The ( becomes an isolated unit, disconnected from Seoul), so the (Seoul) group is broken apart.
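
The difference between the two paths can be sketched with a toy model. This is only an illustration: splitCjkUnits is a hypothetical stand-in for the per-character split described above, not the real buildBaseCjkUnits(), and the character ranges are a simplifying assumption.

```typescript
// Toy model only: splitCjkUnits is a hypothetical stand-in, not the real
// buildBaseCjkUnits() from src/analysis.ts.
// Assumed ranges: Hangul syllables, CJK unified ideographs, Hiragana/Katakana.
const CJK_CHAR = /[\uAC00-\uD7A3\u4E00-\u9FFF\u3040-\u30FF]/;

function splitCjkUnits(segment: string): string[] {
  // Segments without CJK pass through untouched.
  if (!CJK_CHAR.test(segment)) return [segment];
  // Segments with CJK are split into one unit per UTF-16 code unit,
  // which is enough for the BMP characters used in this example.
  return segment.split('');
}

function splitAll(segments: string[]): string[] {
  const out: string[] = [];
  for (const seg of segments) out.push(...splitCjkUnits(seg));
  return out;
}

const english = splitAll(['AB(', 'CD)']);
const korean = splitAll(['서울(', 'Seoul)']);

console.log(english); // ['AB(', 'CD)']
console.log(korean);  // ['서', '울', '(', 'Seoul)']
```

In this model the English segment keeps its bracket inside one unit, while the Korean segment leaves ( as a standalone unit that no longer belongs to Seoul).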

Root cause

The analysis pipeline in src/analysis.ts has a first-pass merge step (~line 1006) with a rule: "punctuation that isn't word-like should stick to the previous text segment." This rule correctly handles closing punctuation such as ) . and , but it also catches the opening brackets ( [ and { because isEscapedQuoteClusterSegment() matches all kinsokuEnd characters.

The merge runs before the forward-sticky pass, so ( gets consumed backward before it can be attached forward to Seoul).

Word segmenter:     서울 | ( | Seoul | ) | 과
leftSticky merge:   서울 | ( | Seoul) | 과      ← ) sticks to Seoul (correct)
first-pass merge:   서울( | Seoul) | 과          ← ( sticks backward (BUG)
forward-sticky:     (nothing to do — ( already gone)

Fix

Added a !mergedContainsCJK[prevIndex] guard at two merge points. When the previous segment contains CJK, the backward merge is skipped. The bracket then reaches the forward-sticky pass, which correctly attaches it to the next segment:

After fix:
first-pass merge:   SKIPPED (prev is CJK)
forward-sticky:     서울 | (Seoul) | 과          ✓

Non-CJK behavior unchanged. PR: #148
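
As a rough illustration of the guard, here is a minimal toy model of the two passes. It is a sketch under stated assumptions, not the code in src/analysis.ts: mergePasses, containsCjk, isPunctuation and isOpeningBracket are hypothetical names, and the input is assumed to be the segment list after the leftSticky merge.

```typescript
// Toy model only: all names here are hypothetical, not taken from the library.
const containsCjk = (s: string): boolean =>
  /[\uAC00-\uD7A3\u4E00-\u9FFF\u3040-\u30FF]/.test(s);
const isPunctuation = (s: string): boolean => /^[()\[\]{}.,]+$/.test(s);
const isOpeningBracket = (s: string): boolean => /^[(\[{]$/.test(s);

function mergePasses(segments: string[], guardCjk: boolean): string[] {
  // First pass: punctuation-only segments stick to the previous segment,
  // unless the guard is on and the previous segment contains CJK.
  const backward: string[] = [];
  for (const seg of segments) {
    const hasPrev = backward.length > 0;
    const blocked =
      guardCjk && hasPrev && containsCjk(backward[backward.length - 1]);
    if (hasPrev && isPunctuation(seg) && !blocked) {
      backward[backward.length - 1] += seg;
    } else {
      backward.push(seg);
    }
  }
  // Forward-sticky pass: a lone opening bracket attaches to the next segment.
  const out: string[] = [];
  let pending = '';
  for (const seg of backward) {
    if (isOpeningBracket(seg)) {
      pending += seg;
    } else {
      out.push(pending + seg);
      pending = '';
    }
  }
  if (pending !== '') out.push(pending);
  return out;
}

const input = ['서울', '(', 'Seoul)', '과'];
const withoutGuard = mergePasses(input, false);
const withGuard = mergePasses(input, true);
const nonCjk = mergePasses(['AB', '(', 'CD)'], true);

console.log(withoutGuard); // ['서울(', 'Seoul)', '과']
console.log(withGuard);    // ['서울', '(Seoul)', '과']
console.log(nonCjk);       // ['AB(', 'CD)']
```

With the guard off, the toy reproduces the buggy grouping. With it on, ( survives the first pass and is attached forward, while the non-CJK input is grouped exactly as before.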

Test results after fix

CJK Bracket Check — Chrome
────────────────────────────────────────────────────────────
  ✓ PASS  A1: Korean parenthesized English         [4 lines]
  ✓ PASS  A2: Japanese parenthesized English       [4 lines]
  ✓ PASS  A3: Chinese parenthesized English        [3 lines]
  ✓ PASS  A4: Korean abbreviation bracket          [3 lines]
  ✓ PASS  A5: Japanese abbreviation bracket        [3 lines]
  ✓ PASS  A6: Chinese abbreviation bracket         [3 lines]
  ✓ PASS  A7: Korean square brackets               [3 lines]
  ✓ PASS  A8: Japanese square brackets             [3 lines]
  ✓ PASS  A9: Korean curly braces                  [3 lines]
  ✓ PASS  A10: Mixed CJK + nested brackets         [3 lines]

Summary: chrome 10/10 pass
