fix: CJK + opening bracket line break segmentation #148
Closed
mayrang wants to merge 6 commits into chenglou:main
Normalize chunked batch line starts through the same segment-kind policy used by streaming so layoutWithLines(), walkLineRanges(), layoutNextLine(), and layout() stay aligned after zero-width break opportunities and collapsible spaces.
Classify Hangul Compatibility Jamo (U+3130..U+318F) as CJK so common Korean compatibility jamo runs break like browser text.
Closes chenglou#121
Closes chenglou#142
Co-authored-by: mayrang <pkss0626@naver.com>
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
Co-authored-by: lttlin <lttlin@gmail.com>
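The jamo classification in the commit above comes down to a code point range check. A minimal sketch, assuming a hypothetical helper name (`isHangulCompatibilityJamo` is illustrative and may not match how the repository classifies CJK characters):

```typescript
// Hypothetical helper, not taken from the repository.
// Returns true when the first code point of `ch` lies in the
// Hangul Compatibility Jamo block (U+3130..U+318F).
function isHangulCompatibilityJamo(ch: string): boolean {
  const cp = ch.codePointAt(0);
  return cp !== undefined && cp >= 0x3130 && cp <= 0x318f;
}

console.log(isHangulCompatibilityJamo("ㅋ")); // U+314B, inside the block
console.log(isHangulCompatibilityJamo("가")); // U+AC00, a Hangul syllable, outside the block
```

Precomposed syllables such as 가 sit in a separate block, so a classifier that only covers syllables would miss runs like ㅋㅋㅋ unless this range is added.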
Add the dedicated chenglou#135 regression shape where a pending soft-hyphen break must not preempt the closer preserved-space break that batch layout chooses.
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
The escaped-quote backward merge was catching bare opening brackets
like ( after CJK text, producing '서울(' instead of keeping '('
separate. Skip the merge when the preceding segment contains CJK
so brackets flow through to the forward-sticky pass and correctly
attach to the following text: '서울' | '(Seoul)'.
10 test cases covering opening brackets after Korean, Japanese, and Chinese text — parentheses, square brackets, curly braces, abbreviations, and nested brackets. All 10/10 pass on Chrome.
45c4140 to 8036cca
c12ce6a to 3f0a4e6
Owner
Thanks! I implemented the minimal fix in d9f2dff and credited you as co-author.
nice-hang pushed a commit to nice-hang/pretext that referenced this pull request Apr 18, 2026
Skip backward punctuation merges when the previous segment contains CJK so opening brackets can flow through to the forward-sticky pass and attach to the following annotation text.
Closes chenglou#145
Refs chenglou#148.
Co-authored-by: mayrang <pkss0626@naver.com>
Closes #145
What broke
The engine produces different line breaks than the browser when CJK text is followed by an opening bracket. For example, 서울(Seoul) is segmented as '서울(' rather than '서울' | '(Seoul)'.
The same bug affects all CJK languages: 東京(Tokyo), 北京(Beijing), 인공지능(AI).
Why this only breaks CJK, not English
English and CJK go through different segmentation paths after the analysis merge:
AB(CD) → [AB(][CD)]: even though ( is merged backward, the line breaker can still break within AB( at character boundaries. No visible problem.
서울( enters buildBaseCjkUnits(), which splits by character → [서][울][(]. The ( becomes an isolated unit, disconnected from Seoul). The meaningful group (Seoul) is broken apart, and the line breaker can't reconnect them.
Why it broke
The first-pass merge in src/analysis.ts (~line 1006) has a rule: "non-word-like punctuation sticks to the previous text segment." This correctly handles closing punctuation (')', '.', ',') but also catches opening brackets ('(', '[', '{') via isEscapedQuoteClusterSegment() matching all kinsokuEnd characters.
This merge runs before the forward-sticky pass, so ( gets consumed backward before it can be attached forward to Seoul).
How it was fixed
Added a !mergedContainsCJK[prevIndex] guard at two merge points in src/analysis.ts.
When the previous segment contains CJK, skip the backward merge. The bracket reaches the forward-sticky pass, which correctly attaches it forward: '서울' | '(Seoul)'.
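The interaction between the two passes can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in src/analysis.ts: the function names (`mergeSegments`, `containsCJK`, `isOpeningRun`), the punctuation sets, and the CJK ranges are all hypothetical simplifications, and the guard is modeled as a boolean flag so both behaviors can be compared.

```typescript
// Simplified, hypothetical model of the two merge passes. All names here are
// illustrative; this is not the code in src/analysis.ts.
const OPENING_BRACKETS = new Set(["(", "[", "{"]);
const CLOSING_PUNCTUATION = new Set([")", "]", "}", ".", ","]);

function isPunctuation(seg: string): boolean {
  return OPENING_BRACKETS.has(seg) || CLOSING_PUNCTUATION.has(seg);
}

// Rough CJK test (assumed ranges): kana, Hangul Compatibility Jamo,
// CJK Unified Ideographs, Hangul syllables.
function containsCJK(text: string): boolean {
  return /[\u3040-\u30ff\u3130-\u318f\u4e00-\u9fff\uac00-\ud7a3]/.test(text);
}

// True when the segment consists only of opening brackets, e.g. "(" or "([".
function isOpeningRun(seg: string): boolean {
  return seg.length > 0 && seg.split("").every((ch) => OPENING_BRACKETS.has(ch));
}

function mergeSegments(segments: string[], guardCJK: boolean): string[] {
  // Pass 1: backward merge. Punctuation sticks to the previous segment,
  // unless the guard is on and the previous segment contains CJK.
  const merged: string[] = [];
  for (const seg of segments) {
    const prev = merged.pop();
    if (prev === undefined) {
      merged.push(seg);
      continue;
    }
    const blockedByCJK = guardCJK && containsCJK(prev);
    if (isPunctuation(seg) && !blockedByCJK) {
      merged.push(prev + seg);
    } else {
      merged.push(prev, seg);
    }
  }

  // Pass 2: forward-sticky. A segment made only of opening brackets
  // attaches to whatever follows it.
  const out: string[] = [];
  let pending = "";
  for (const seg of merged) {
    if (isOpeningRun(seg)) {
      pending += seg;
      continue;
    }
    out.push(pending + seg);
    pending = "";
  }
  if (pending !== "") out.push(pending);
  return out;
}

console.log(mergeSegments(["서울", "(", "Seoul", ")"], false)); // ["서울(", "Seoul)"]
console.log(mergeSegments(["서울", "(", "Seoul", ")"], true));  // ["서울", "(Seoul)"]
console.log(mergeSegments(["AB", "(", "CD", ")"], true));      // ["AB(", "CD)"]
```

In this model the guard also leaves closing punctuation after CJK unmerged, which is broader than the two specific merge points described above; it is kept that way only to keep the sketch short.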
Why this approach
AB(CD) still works as before
Tests
bun test (88/88 pass)
bun run scripts/cjk-bracket-check.ts --browser=chrome (10/10 pass)