Come from https://github.com/dotnet/corefx/issues/17233
Related discussion PowerShell/PowerShell#8120
First alfa code https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode
Beta code and help utils https://github.com/iSazonov/SCF
Initial PR #2637.
Suggestion
S1. Implement internal API for case-insensitive language-neitral comparisions and convertions based on Unicode Simple case folding.
S2. Expose public API like System.StringComparison.SimpleCaseFolding for String.Compare() and String.IndexOf(). And similar public API for Span.
Update: Improve OrdinalIgnoreCase comparisons everywhere.
S3. Expose public API for System.String to convert to simple case folding format. And similar public API for Span.
Additional considerations
Initially all suggestions is only for Utf16. Utf8 (and others) could be considered too.
Initially all suggestions is only for Simple Case Folding. Full Case Folding could be useful for culture-sensetive comparisions (specially for Regex).
Expected benefits
Increase performance of case-insensitive language-neitral comparisions for strings and Span.
Get benefits in follow core APIs:
- Regex case-insensitive search (invariant culture option)
- string case-insensitive language-neitral comparisions (speed up OrdinalIgnoreCase comparisons everywhere)
- Span buffer case-insensitive language-neitral comparisions
- Speed up Marvin.ComputeHash32OrdinalIgnoreCase()
Get benefits for case-insensetive languages like PowerShell, HTTP, HTML, CSS, URL, and others, based on Unicode standard definition of Equivalent Case-Insensitive Identifiers.
Performance measures
First results from a PowerShell repo prototype showed that the simple case folding comparision performance is comparable to String.Ordinal comparision performance.
Justification
Unicode standard introduces the concept of Simple case folding that is intended to compare strings in case-insensetive and language-neitral manner.
Short quotes from Unicode standard
From: http://unicode.org/faq/casemap_charprop.html
Q: What is the difference between case mapping and case folding?
A: Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
From http://www.unicode.org/reports/tr31/tr31-29.html#R5
Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.
From http://www.unicode.org/reports/tr21/tr21-5.html
1.3 Caseless Matching
Caseless matching is implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.
From http://unicode.org/reports/tr18/#Folded_Matching
3.10 Folded Matching (Retracted)
RL3.10 Folded Matching
Previous versions of RL3.10 described tailored folding. However, for most full-featured regular expression engines, it is quite difficult to match under folding equivalences that are not 1:1. For more discussion of this, see 1.5 Simple Loose Matches and 2.1 Canonical Equivalents. Thus RL3.10 has been retracted.
From http://unicode.org/reports/tr18/#Simple_Loose_Matches
1.5 Simple Loose Matches
Most regular expression engines offer caseless matching as the only loose matching. If the engine does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches
To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not.
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding.
From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
3.13 Default Case Algorithms
Default Case Folding
Default Case Folding is based on the full case conversion operations without the contextdependent
mappings sensitive to the casing context.
R4 toCasefold(X): Map each character C in X to Case_Folding(C).
• Case_Folding(C) uses the mappings with the status field value “C” or “F” in the
data file CaseFolding.txt in the Unicode Character Database.
A modified form of Default Case Folding is designed for best behavior when doing caseless
matching of strings interpreted as identifiers.
(See above http://www.unicode.org/reports/tr31/tr31-29.html#R5)
Default Caseless Matching
Default caseless matching is the process of comparing two strings for case-insensitive
equality. The definitions of Unicode Default Caseless Matching build on the definitions of
Unicode Default Case Folding.
Default Caseless Matching uses full case folding:
D144 A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
/cc @tarekgh @ahsonkhan @KrzysztofCwalina (Sorry if my ping is wrong)
Come from https://github.com/dotnet/corefx/issues/17233
Related discussion PowerShell/PowerShell#8120
First alfa code https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicodeBeta code and help utils https://github.com/iSazonov/SCF
Initial PR #2637.
Suggestion
S1. Implement internal API for case-insensitive language-neitral comparisions and convertions based on Unicode Simple case folding.
S2. Expose public API like
System.StringComparison.SimpleCaseFoldingforString.Compare()andString.IndexOf(). And similar public API for Span.Update: Improve OrdinalIgnoreCase comparisons everywhere.
S3. Expose public API for
System.Stringto convert to simple case folding format. And similar public API for Span.Additional considerations
Initially all suggestions is only for Utf16. Utf8 (and others) could be considered too.
Initially all suggestions is only for Simple Case Folding. Full Case Folding could be useful for culture-sensetive comparisions (specially for Regex).
Expected benefits
Increase performance of case-insensitive language-neitral comparisions for strings and Span.
Get benefits in follow core APIs:
Get benefits for case-insensetive languages like PowerShell, HTTP, HTML, CSS, URL, and others, based on Unicode standard definition of Equivalent Case-Insensitive Identifiers.
Performance measures
First results from a PowerShell repo prototype showed that the simple case folding comparision performance is comparable to String.Ordinal comparision performance.
Justification
Unicode standard introduces the concept of Simple case folding that is intended to compare strings in case-insensetive and language-neitral manner.
Short quotes from Unicode standard
From: http://unicode.org/faq/casemap_charprop.html
From http://www.unicode.org/reports/tr31/tr31-29.html#R5
From http://www.unicode.org/reports/tr21/tr21-5.html
From http://unicode.org/reports/tr18/#Folded_Matching
From http://unicode.org/reports/tr18/#Simple_Loose_Matches
From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf
A modified form of Default Case Folding is designed for best behavior when doing caseless
matching of strings interpreted as identifiers.
(See above http://www.unicode.org/reports/tr31/tr31-29.html#R5)
/cc @tarekgh @ahsonkhan @KrzysztofCwalina (Sorry if my ping is wrong)