Should the Segmenter types accept a locale? #3284

@sffc

In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:

Why don't we support these in ICU4X Segmenter, and should we add them?

For 1.2 purposes, we have a few choices:

  1. Add the locale parameter now and don't use it for anything yet
  2. Don't add the locale parameter but add something like _invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
  3. Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
  4. Add the parameter to Word and Sentence, but not Line or Grapheme
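To make the naming in option 2 concrete, here is a minimal sketch of how the two constructors could sit side by side. `SketchSegmenter`, `SketchError`, and the `locale()` accessor are hypothetical stand-ins and not the ICU4X API; the locale is passed as a plain string instead of a `locale!` value so the snippet has no dependencies, and no real segmentation rules are loaded.

```rust
/// Hypothetical stand-in for a segmenter type; NOT the real ICU4X API.
#[derive(Debug, PartialEq)]
pub struct SketchSegmenter {
    /// `None` means locale-invariant rules; `Some` names the requested tailoring.
    locale: Option<String>,
}

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    EmptyLocale,
}

impl SketchSegmenter {
    /// Option 2: the locale-invariant constructor carries `_invariant` in its name,
    /// which leaves the shorter name free for a locale-taking constructor.
    pub fn try_new_auto_invariant() -> Result<Self, SketchError> {
        Ok(Self { locale: None })
    }

    /// The locale-specific constructor that could be added later.
    /// Assumption: the locale is a plain string here to keep the sketch dependency-free.
    pub fn try_new_auto(locale: &str) -> Result<Self, SketchError> {
        if locale.is_empty() {
            return Err(SketchError::EmptyLocale);
        }
        Ok(Self {
            locale: Some(locale.to_string()),
        })
    }

    /// Reports which tailoring, if any, was requested at construction time.
    pub fn locale(&self) -> Option<&str> {
        self.locale.as_deref()
    }
}
```

With this shape, `SketchSegmenter::try_new_auto_invariant()` and `SketchSegmenter::try_new_auto("el")` can coexist, so adding the locale-specific constructor later would not require renaming the existing one.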

Thoughts?

@aethanyc @makotokato @Manishearth

Labels

C-segmentation - Component: Segmentation
S-medium - Size: Less than a week (larger bug fix or enhancement)
T-core - Type: Required functionality
