Skip to content
This repository was archived by the owner on Aug 2, 2023. It is now read-only.
This repository was archived by the owner on Aug 2, 2023. It is now read-only.

Unicode Case Folding #2610

@iSazonov

Description

@iSazonov

Come from https://github.com/dotnet/corefx/issues/17233
Related discussion PowerShell/PowerShell#8120
First alfa code https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode
Beta code and help utils https://github.com/iSazonov/SCF

Initial PR #2637.

Suggestion

S1. Implement internal API for case-insensitive language-neitral comparisions and convertions based on Unicode Simple case folding.
S2. Expose public API like System.StringComparison.SimpleCaseFolding for String.Compare() and String.IndexOf(). And similar public API for Span.
Update: Improve OrdinalIgnoreCase comparisons everywhere.
S3. Expose public API for System.String to convert to simple case folding format. And similar public API for Span.

Additional considerations

Initially all suggestions is only for Utf16. Utf8 (and others) could be considered too.

Initially all suggestions is only for Simple Case Folding. Full Case Folding could be useful for culture-sensetive comparisions (specially for Regex).

Expected benefits

Increase performance of case-insensitive language-neitral comparisions for strings and Span.

Get benefits in follow core APIs:

  • Regex case-insensitive search (invariant culture option)
  • string case-insensitive language-neitral comparisions (speed up OrdinalIgnoreCase comparisons everywhere)
  • Span buffer case-insensitive language-neitral comparisions
  • Speed up Marvin.ComputeHash32OrdinalIgnoreCase()

Get benefits for case-insensetive languages like PowerShell, HTTP, HTML, CSS, URL, and others, based on Unicode standard definition of Equivalent Case-Insensitive Identifiers.

Performance measures

First results from a PowerShell repo prototype showed that the simple case folding comparision performance is comparable to String.Ordinal comparision performance.

Justification

Unicode standard introduces the concept of Simple case folding that is intended to compare strings in case-insensetive and language-neitral manner.

Short quotes from Unicode standard

From: http://unicode.org/faq/casemap_charprop.html

Q: What is the difference between case mapping and case folding?
A: Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.

From http://www.unicode.org/reports/tr31/tr31-29.html#R5

Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.

From http://www.unicode.org/reports/tr21/tr21-5.html

1.3 Caseless Matching
Caseless matching is implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.

From http://unicode.org/reports/tr18/#Folded_Matching

3.10 Folded Matching (Retracted)
RL3.10 Folded Matching
Previous versions of RL3.10 described tailored folding. However, for most full-featured regular expression engines, it is quite difficult to match under folding equivalences that are not 1:1. For more discussion of this, see 1.5 Simple Loose Matches and 2.1 Canonical Equivalents. Thus RL3.10 has been retracted.

From http://unicode.org/reports/tr18/#Simple_Loose_Matches

1.5 Simple Loose Matches
Most regular expression engines offer caseless matching as the only loose matching. If the engine does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not.
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.

From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

3.13 Default Case Algorithms

Default Case Folding
Default Case Folding is based on the full case conversion operations without the contextdependent
mappings sensitive to the casing context.
R4 toCasefold(X): Map each character C in X to Case_Folding(C).
• Case_Folding(C) uses the mappings with the status field value “C” or “F” in the
data file CaseFolding.txt in the Unicode Character Database.

A modified form of Default Case Folding is designed for best behavior when doing caseless
matching of strings interpreted as identifiers.
(See above http://www.unicode.org/reports/tr31/tr31-29.html#R5)

Default Caseless Matching
Default caseless matching is the process of comparing two strings for case-insensitive
equality. The definitions of Unicode Default Caseless Matching build on the definitions of
Unicode Default Case Folding.
Default Caseless Matching uses full case folding:
D144 A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

/cc @tarekgh @ahsonkhan @KrzysztofCwalina (Sorry if my ping is wrong)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Design ReviewOpenBeforeArchivingThese issues were open before the repo was archived. For re-open them, file them in the new repo

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions