Should the Segmenter types accept a locale? #3284
Closed
Labels
- C-segmentation (Component: Segmentation)
- S-medium (Size: Less than a week (larger bug fix or enhancement))
- T-core (Type: Required functionality)
Description
In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:
- Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr
- fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt
- el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt
Why don't we support these in ICU4X Segmenter, and should we add them?
For 1.2 purposes, we have a few choices:
- Add the locale parameter now and don't use it for anything yet
- Don't add the locale parameter, but add something like `_invariant` to the constructor names, so that in the future `try_new_auto_invariant()` creates the locale-invariant segmenter and `try_new_auto(locale!("el"))` creates the locale-specific segmenter
- Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
- Add the parameter to Word and Sentence, but not Line or Grapheme
Thoughts?
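
To make the `_invariant` naming option concrete, here is a minimal sketch of how the two constructor flavors could relate. Everything in it is hypothetical: `Kind`, `Rules`, `Segmenter`, `new_invariant`, and `new_for_language` are illustration-only names, not ICU4X API, and the language is passed as a plain string instead of a real locale type. The only details taken from the ICU data above are the `word_fi_sv` and `sent_el` override names.

```rust
// Hypothetical sketch only: none of these types or constructors are ICU4X API.

/// The four segmenter kinds discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Grapheme,
    Word,
    Sentence,
    Line,
}

/// Which rule set a segmenter would load.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rules {
    /// Locale-invariant (root) rules.
    Root,
    /// A language-specific override, named after the ICU rule file.
    Tailored(&'static str),
}

#[derive(Debug)]
pub struct Segmenter {
    kind: Kind,
    rules: Rules,
}

impl Segmenter {
    /// Locale-invariant constructor (the `_invariant` naming style).
    pub fn new_invariant(kind: Kind) -> Self {
        Segmenter { kind, rules: Rules::Root }
    }

    /// Locale-aware constructor. `language` is a bare language subtag here
    /// (an assumption made to keep the sketch dependency-free).
    pub fn new_for_language(kind: Kind, language: &str) -> Self {
        let rules = match (kind, language) {
            // fi/sv override for word break (word_fi_sv.txt in ICU).
            (Kind::Word, "fi") | (Kind::Word, "sv") => Rules::Tailored("word_fi_sv"),
            // el override for sentence break (sent_el.txt in ICU).
            (Kind::Sentence, "el") => Rules::Tailored("sent_el"),
            // Everything else falls back to the root rules.
            _ => Rules::Root,
        };
        Segmenter { kind, rules }
    }

    pub fn kind(&self) -> Kind {
        self.kind
    }

    pub fn rules(&self) -> Rules {
        self.rules
    }
}
```

In this sketch only the Word and Sentence kinds ever pick up a tailoring, so the locale argument is a no-op for Line and Grapheme; that is the asymmetry behind the last option in the list.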