Accept one word at a time in space-less scripts by FuJacob · Pull Request #515 · FuJacob/cotabby

FuJacob · 2026-06-02T00:45:47Z

Summary

Tab acceptance scanned to the next whitespace boundary to find "a word", so in scripts written
without spaces between words (Chinese, Japanese, Thai, Lao, Khmer, ...) a single Tab swallowed the
entire suggestion run instead of advancing one word. When an acceptance token begins with such a
script, it is now segmented with ICU word breaking and only the first segment is accepted, matching
the one-word-per-Tab feel of space-delimited text. Space-delimited scripts never enter the new
branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged.

Validation

xcodebuild ... test ... CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/SuggestionSessionReconcilerTests \
  -only-testing:CotabbyTests/SuggestionCoordinatorAcceptanceTests
# ** TEST SUCCEEDED ** — 72 reconciler + 6 acceptance tests, 0 failures
#   incl. a Latin-unchanged regression guard and CJK / Japanese / Thai segmentation cases

swiftlint lint --quiet   # exit 0 (clean)

Linked issues

None. Closes a confirmed gap where one Tab accepted an entire space-less-script suggestion.

Risk / rollout notes

Latin/space-delimited acceptance is unchanged — the whitespace-run path is untouched; the new
ICU branch only runs when a token begins with a space-less script (detected by leading scalar).
Segmentation granularity follows the OS ICU word breaker (per-character or dictionary-based
depending on the script/OS); tests assert the robust property (one segment, not the whole run)
rather than a pinned word, so they don't flake on ICU differences.
Full-suggestion acceptance (the full-accept key) still takes everything at once.

Greptile Summary

This PR fixes single-Tab over-acceptance in space-less scripts (Chinese, Japanese, Thai, Khmer, etc.) by routing the acceptance chunk through ICU's enumerateSubstrings(.byWords) whenever the leading character of a non-whitespace token is detected as belonging to a space-less script. Latin/Cyrillic/Arabic paths are untouched.

beginsSpacelessScriptWord on Character performs a cheap scalar-value range check to gate the new branch, keeping the common space-delimited path allocation-free.
firstSegmentedWordEnd wraps enumerateSubstrings with an early-exit and min(..., limit) clamp, correctly bounding the result to the whitespace-scanned token.
Six new unit tests cover a Latin regression guard, CJK/Japanese/Thai run segmentation, cross-space boundary containment, and leading-whitespace preservation, with intentionally OS-agnostic "less than whole run" assertions.

Confidence Score: 4/5

Safe to merge — the new ICU branch is correctly gated and cannot affect Latin/Cyrillic/Arabic acceptance paths.

The core logic is correct and well-tested. The only gap is that CJK Unified Ideograph Extensions C, D, E, and F are missing from the range table, so rare archaic Han characters starting a token would still trigger the old whole-run acceptance. The impact is narrow and the fallback is the previous behavior rather than a crash or data loss.

The beginsSpacelessScriptWord switch in SuggestionSessionReconciler.swift needs the four missing CJK extension ranges added.

Important Files Changed

Filename	Overview
Cotabby/Support/SuggestionSessionReconciler.swift	Adds ICU word-break segmentation for space-less scripts; logic is sound but the `beginsSpacelessScriptWord` switch misses CJK Extensions C–F (U+2A700–U+2EBEF)
CotabbyTests/SuggestionSessionReconcilerTests.swift	Good test coverage for the new branch with Latin regression guard, CJK/Japanese/Thai run tests, and leading-whitespace preservation; the "less than whole run" assertions are intentionally OS-agnostic

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nextAcceptanceChunk called] --> B[Skip leading whitespace to tokenStart]
    B --> C[Scan forward to next whitespace to index]
    C --> D{tokenStart < index AND leading char is space-less script?}
    D -- No --> F{autoAcceptTrailingPunctuation = false?}
    D -- Yes --> E[firstSegmentedWordEnd: enumerateSubstrings byWords stop after first segment]
    E --> E2{wordEnd found and wordEnd > tokenStart?}
    E2 -- Yes --> E3[index = min wordEnd limit]
    E2 -- No --> F
    E3 --> F
    F -- Yes --> G[wordEndTrimmingTrailingPunctuation peel trailing non-word chars]
    G --> H[Return remainingText up to index]
    F -- No --> H

_{Reviews (1): Last reviewed commit: "Accept one word at a time in space-less ..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

Tab acceptance scanned to the next whitespace boundary to find a word, so in scripts written without spaces between words (Chinese, Japanese, Thai, ...) a single Tab swallowed the entire suggestion run. When a token begins with such a script, segment it with ICU word breaking and accept only the first segment, matching the one-word-per-Tab behavior of space-delimited text. Space-delimited scripts never enter the new branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged (covered by a regression test).

greptile-apps · 2026-06-02T00:50:12Z

+             0x20000...0x2A6DF, // CJK Unified Ideographs Extension B
+             0x30000...0x3134F: // CJK Unified Ideographs Extension G


CJK Unified Ideograph Extensions C, D, E, and F (U+2A700–U+2EBEF) are absent from the switch. Any suggestion that starts with a character from those blocks (rare/archaic Han characters appearing in scholarly texts) would pass beginsSpacelessScriptWord as false and fall through to the old whitespace-scan path, accepting the whole run on a single Tab. Extension B ends at U+2A6DF and Extension G starts at U+30000, so the gap is currently silent.

Suggested change

0x20000...0x2A6DF, // CJK Unified Ideographs Extension B

0x30000...0x3134F: // CJK Unified Ideographs Extension G

0x20000...0x2A6DF, // CJK Unified Ideographs Extension B

0x2A700...0x2B73F, // CJK Unified Ideographs Extension C

0x2B740...0x2B81F, // CJK Unified Ideographs Extension D

0x2B820...0x2CEAF, // CJK Unified Ideographs Extension E

0x2CEB0...0x2EBEF, // CJK Unified Ideographs Extension F

0x30000...0x3134F: // CJK Unified Ideographs Extension G

greptile-apps · 2026-06-02T00:50:13Z

+        let afterSpace = String(chunk.dropFirst())
+        XCTAssertFalse(afterSpace.isEmpty)
+        XCTAssertTrue("你好世界".hasPrefix(afterSpace))
+        XCTAssertLessThan(afterSpace.count, 4, "only the first segment is accepted, not the whole run")


Fragile magic-number bound in leading-whitespace test

The assertion afterSpace.count < 4 is tied to the length of the literal run "你好世界" (4 grapheme clusters). If a future maintainer changes the run to something shorter — say 3 characters — the guard would vacuously pass even if the entire run were accepted in one chunk. Expressing the bound relative to the actual run length (e.g. afterSpace.count < "你好世界".count) makes the intent self-documenting and resilient to edits.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

FuJacob merged commit 4cca73d into main Jun 2, 2026
4 checks passed

FuJacob deleted the feat/unicode-word-acceptance branch June 2, 2026 00:49

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Accept one word at a time in space-less scripts#515

Accept one word at a time in space-less scripts#515
FuJacob merged 1 commit into
mainfrom
feat/unicode-word-acceptance

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		0x20000...0x2A6DF, // CJK Unified Ideographs Extension B
		0x30000...0x3134F: // CJK Unified Ideographs Extension G

Uh oh!

Conversation

FuJacob commented Jun 2, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading