Accept one word at a time in space-less scripts#515
Conversation
Tab acceptance scanned to the next whitespace boundary to find a word, so in scripts written without spaces between words (Chinese, Japanese, Thai, ...) a single Tab swallowed the entire suggestion run. When a token begins with such a script, segment it with ICU word breaking and accept only the first segment, matching the one-word-per-Tab behavior of space-delimited text. Space-delimited scripts never enter the new branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged (covered by a regression test).
| 0x20000...0x2A6DF, // CJK Unified Ideographs Extension B | ||
| 0x30000...0x3134F: // CJK Unified Ideographs Extension G |
There was a problem hiding this comment.
CJK Unified Ideograph Extensions C, D, E, and F (U+2A700–U+2EBEF) are absent from the switch. Any suggestion that starts with a character from those blocks (rare/archaic Han characters appearing in scholarly texts) would pass
beginsSpacelessScriptWord as false and fall through to the old whitespace-scan path, accepting the whole run on a single Tab. Extension B ends at U+2A6DF and Extension G starts at U+30000, so the gap is currently silent.
| 0x20000...0x2A6DF, // CJK Unified Ideographs Extension B | |
| 0x30000...0x3134F: // CJK Unified Ideographs Extension G | |
| 0x20000...0x2A6DF, // CJK Unified Ideographs Extension B | |
| 0x2A700...0x2B73F, // CJK Unified Ideographs Extension C | |
| 0x2B740...0x2B81F, // CJK Unified Ideographs Extension D | |
| 0x2B820...0x2CEAF, // CJK Unified Ideographs Extension E | |
| 0x2CEB0...0x2EBEF, // CJK Unified Ideographs Extension F | |
| 0x30000...0x3134F: // CJK Unified Ideographs Extension G |
| let afterSpace = String(chunk.dropFirst()) | ||
| XCTAssertFalse(afterSpace.isEmpty) | ||
| XCTAssertTrue("你好世界".hasPrefix(afterSpace)) | ||
| XCTAssertLessThan(afterSpace.count, 4, "only the first segment is accepted, not the whole run") |
There was a problem hiding this comment.
Fragile magic-number bound in leading-whitespace test
The assertion afterSpace.count < 4 is tied to the length of the literal run "你好世界" (4 grapheme clusters). If a future maintainer changes the run to something shorter — say 3 characters — the guard would vacuously pass even if the entire run were accepted in one chunk. Expressing the bound relative to the actual run length (e.g. afterSpace.count < "你好世界".count) makes the intent self-documenting and resilient to edits.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Summary
Tab acceptance scanned to the next whitespace boundary to find "a word", so in scripts written
without spaces between words (Chinese, Japanese, Thai, Lao, Khmer, ...) a single Tab swallowed the
entire suggestion run instead of advancing one word. When an acceptance token begins with such a
script, it is now segmented with ICU word breaking and only the first segment is accepted, matching
the one-word-per-Tab feel of space-delimited text. Space-delimited scripts never enter the new
branch, so Latin / Cyrillic / Arabic acceptance is byte-for-byte unchanged.
Validation
Linked issues
None. Closes a confirmed gap where one Tab accepted an entire space-less-script suggestion.
Risk / rollout notes
ICU branch only runs when a token begins with a space-less script (detected by leading scalar).
depending on the script/OS); tests assert the robust property (one segment, not the whole run)
rather than a pinned word, so they don't flake on ICU differences.
Greptile Summary
This PR fixes single-Tab over-acceptance in space-less scripts (Chinese, Japanese, Thai, Khmer, etc.) by routing the acceptance chunk through ICU's
enumerateSubstrings(.byWords)whenever the leading character of a non-whitespace token is detected as belonging to a space-less script. Latin/Cyrillic/Arabic paths are untouched.beginsSpacelessScriptWordonCharacterperforms a cheap scalar-value range check to gate the new branch, keeping the common space-delimited path allocation-free.firstSegmentedWordEndwrapsenumerateSubstringswith an early-exit andmin(..., limit)clamp, correctly bounding the result to the whitespace-scanned token.Confidence Score: 4/5
Safe to merge — the new ICU branch is correctly gated and cannot affect Latin/Cyrillic/Arabic acceptance paths.
The core logic is correct and well-tested. The only gap is that CJK Unified Ideograph Extensions C, D, E, and F are missing from the range table, so rare archaic Han characters starting a token would still trigger the old whole-run acceptance. The impact is narrow and the fallback is the previous behavior rather than a crash or data loss.
The
beginsSpacelessScriptWordswitch inSuggestionSessionReconciler.swiftneeds the four missing CJK extension ranges added.Important Files Changed
beginsSpacelessScriptWordswitch misses CJK Extensions C–F (U+2A700–U+2EBEF)Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[nextAcceptanceChunk called] --> B[Skip leading whitespace to tokenStart] B --> C[Scan forward to next whitespace to index] C --> D{tokenStart < index AND leading char is space-less script?} D -- No --> F{autoAcceptTrailingPunctuation = false?} D -- Yes --> E[firstSegmentedWordEnd: enumerateSubstrings byWords stop after first segment] E --> E2{wordEnd found and wordEnd > tokenStart?} E2 -- Yes --> E3[index = min wordEnd limit] E2 -- No --> F E3 --> F F -- Yes --> G[wordEndTrimmingTrailingPunctuation peel trailing non-word chars] G --> H[Return remainingText up to index] F -- No --> HReviews (1): Last reviewed commit: "Accept one word at a time in space-less ..." | Re-trigger Greptile