Skip to content

Accept one word at a time in space-less scripts#515

Merged
FuJacob merged 1 commit into
mainfrom
feat/unicode-word-acceptance
Jun 2, 2026
Merged

Accept one word at a time in space-less scripts#515
FuJacob merged 1 commit into
mainfrom
feat/unicode-word-acceptance

Conversation

@FuJacob

@FuJacob FuJacob commented Jun 2, 2026

Copy link
Copy Markdown
Owner

Summary

Tab acceptance scanned to the next whitespace boundary to find "a word", so in scripts written
without spaces between words (Chinese, Japanese, Thai, Lao, Khmer, ...) a single Tab swallowed the
entire suggestion run instead of advancing one word. When an acceptance token begins with such a
script, it is now segmented with ICU word breaking and only the first segment is accepted, matching
the one-word-per-Tab feel of space-delimited text. Space-delimited scripts never enter the new
branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged.

Validation

xcodebuild ... test ... CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/SuggestionSessionReconcilerTests \
  -only-testing:CotabbyTests/SuggestionCoordinatorAcceptanceTests
# ** TEST SUCCEEDED ** — 72 reconciler + 6 acceptance tests, 0 failures
#   incl. a Latin-unchanged regression guard and CJK / Japanese / Thai segmentation cases

swiftlint lint --quiet   # exit 0 (clean)

Linked issues

None. Closes a confirmed gap where one Tab accepted an entire space-less-script suggestion.

Risk / rollout notes

  • Latin/space-delimited acceptance is unchanged — the whitespace-run path is untouched; the new
    ICU branch only runs when a token begins with a space-less script (detected by leading scalar).
  • Segmentation granularity follows the OS ICU word breaker (per-character or dictionary-based
    depending on the script/OS); tests assert the robust property (one segment, not the whole run)
    rather than a pinned word, so they don't flake on ICU differences.
  • Full-suggestion acceptance (the full-accept key) still takes everything at once.

Greptile Summary

This PR fixes single-Tab over-acceptance in space-less scripts (Chinese, Japanese, Thai, Khmer, etc.) by routing the acceptance chunk through ICU's enumerateSubstrings(.byWords) whenever the leading character of a non-whitespace token is detected as belonging to a space-less script. Latin/Cyrillic/Arabic paths are untouched.

  • beginsSpacelessScriptWord on Character performs a cheap scalar-value range check to gate the new branch, keeping the common space-delimited path allocation-free.
  • firstSegmentedWordEnd wraps enumerateSubstrings with an early-exit and min(..., limit) clamp, correctly bounding the result to the whitespace-scanned token.
  • Six new unit tests cover a Latin regression guard, CJK/Japanese/Thai run segmentation, cross-space boundary containment, and leading-whitespace preservation, with intentionally OS-agnostic "less than whole run" assertions.

Confidence Score: 4/5

Safe to merge — the new ICU branch is correctly gated and cannot affect Latin/Cyrillic/Arabic acceptance paths.

The core logic is correct and well-tested. The only gap is that CJK Unified Ideograph Extensions C, D, E, and F are missing from the range table, so rare archaic Han characters starting a token would still trigger the old whole-run acceptance. The impact is narrow and the fallback is the previous behavior rather than a crash or data loss.

The beginsSpacelessScriptWord switch in SuggestionSessionReconciler.swift needs the four missing CJK extension ranges added.

Important Files Changed

Filename Overview
Cotabby/Support/SuggestionSessionReconciler.swift Adds ICU word-break segmentation for space-less scripts; logic is sound but the beginsSpacelessScriptWord switch misses CJK Extensions C–F (U+2A700–U+2EBEF)
CotabbyTests/SuggestionSessionReconcilerTests.swift Good test coverage for the new branch with Latin regression guard, CJK/Japanese/Thai run tests, and leading-whitespace preservation; the "less than whole run" assertions are intentionally OS-agnostic

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nextAcceptanceChunk called] --> B[Skip leading whitespace to tokenStart]
    B --> C[Scan forward to next whitespace to index]
    C --> D{tokenStart < index AND leading char is space-less script?}
    D -- No --> F{autoAcceptTrailingPunctuation = false?}
    D -- Yes --> E[firstSegmentedWordEnd: enumerateSubstrings byWords stop after first segment]
    E --> E2{wordEnd found and wordEnd > tokenStart?}
    E2 -- Yes --> E3[index = min wordEnd limit]
    E2 -- No --> F
    E3 --> F
    F -- Yes --> G[wordEndTrimmingTrailingPunctuation peel trailing non-word chars]
    G --> H[Return remainingText up to index]
    F -- No --> H
Loading

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Accept one word at a time in space-less ..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

Tab acceptance scanned to the next whitespace boundary to find a word, so in scripts written without spaces between words (Chinese, Japanese, Thai, ...) a single Tab swallowed the entire suggestion run. When a token begins with such a script, segment it with ICU word breaking and accept only the first segment, matching the one-word-per-Tab behavior of space-delimited text.

Space-delimited scripts never enter the new branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged (covered by a regression test).
@FuJacob FuJacob merged commit 4cca73d into main Jun 2, 2026
4 checks passed
@FuJacob FuJacob deleted the feat/unicode-word-acceptance branch June 2, 2026 00:49
Comment on lines +544 to +545
0x20000...0x2A6DF, // CJK Unified Ideographs Extension B
0x30000...0x3134F: // CJK Unified Ideographs Extension G

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 CJK Unified Ideograph Extensions C, D, E, and F (U+2A700–U+2EBEF) are absent from the switch. Any suggestion that starts with a character from those blocks (rare/archaic Han characters appearing in scholarly texts) would pass beginsSpacelessScriptWord as false and fall through to the old whitespace-scan path, accepting the whole run on a single Tab. Extension B ends at U+2A6DF and Extension G starts at U+30000, so the gap is currently silent.

Suggested change
0x20000...0x2A6DF, // CJK Unified Ideographs Extension B
0x30000...0x3134F: // CJK Unified Ideographs Extension G
0x20000...0x2A6DF, // CJK Unified Ideographs Extension B
0x2A700...0x2B73F, // CJK Unified Ideographs Extension C
0x2B740...0x2B81F, // CJK Unified Ideographs Extension D
0x2B820...0x2CEAF, // CJK Unified Ideographs Extension E
0x2CEB0...0x2EBEF, // CJK Unified Ideographs Extension F
0x30000...0x3134F: // CJK Unified Ideographs Extension G

Fix in Codex Fix in Claude Code

let afterSpace = String(chunk.dropFirst())
XCTAssertFalse(afterSpace.isEmpty)
XCTAssertTrue("你好世界".hasPrefix(afterSpace))
XCTAssertLessThan(afterSpace.count, 4, "only the first segment is accepted, not the whole run")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile magic-number bound in leading-whitespace test

The assertion afterSpace.count < 4 is tied to the length of the literal run "你好世界" (4 grapheme clusters). If a future maintainer changes the run to something shorter — say 3 characters — the guard would vacuously pass even if the entire run were accepted in one chunk. Expressing the bound relative to the actual run length (e.g. afterSpace.count < "你好世界".count) makes the intent self-documenting and resilient to edits.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Codex Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant