Unicode Case Folding

Originates from https://github.com/dotnet/corefx/issues/17233
Related discussion: https://github.com/PowerShell/PowerShell/issues/8120
~~First alpha code: https://github.com/iSazonov/PowerShell/tree/add-unicode3/src/System.Management.Automation/utils/unicode~~
Beta code and helper utilities: https://github.com/iSazonov/SCF

Initial PR #2637.


# Suggestion

S1. Implement an internal API for case-insensitive, language-neutral comparisons and conversions based on Unicode simple case folding.
S2. Expose a public API such as `System.StringComparison.SimpleCaseFolding` for `String.Compare()` and `String.IndexOf()`, along with a similar public API for `Span<char>`.
Update: Improve `OrdinalIgnoreCase` comparisons everywhere.
S3. Expose a public API on `System.String` to convert a string to its simple case-folded form, along with a similar public API for `Span<char>`.


## Additional considerations

Initially, all suggestions apply only to UTF-16. UTF-8 (and other encodings) could be considered too.

Initially, all suggestions apply only to simple case folding. Full case folding could be useful for culture-sensitive comparisons (especially for Regex).


# Expected benefits

Improve the performance of case-insensitive, language-neutral comparisons for strings and `Span<char>`.

The following core APIs would benefit:
- Regex case-insensitive search (invariant culture option)
- Case-insensitive, language-neutral string comparisons (speeding up `OrdinalIgnoreCase` comparisons everywhere)
- Case-insensitive, language-neutral `Span<char>` buffer comparisons
- `Marvin.ComputeHash32OrdinalIgnoreCase()` (speed-up)

Case-insensitive languages and protocols such as PowerShell, HTTP, HTML, CSS, URL, and others would benefit too, based on the Unicode standard's definition of _Equivalent Case-Insensitive Identifiers_.

# Performance measures

First results from a [PowerShell repo prototype](https://github.com/PowerShell/PowerShell/issues/8120#issuecomment-435045448) showed that simple case folding comparison performance is comparable to ordinal string comparison performance.

# Justification

The Unicode standard introduces the concept of _simple case folding_, which is intended for comparing strings in a case-insensitive and language-neutral manner.
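To make the idea concrete, here is a minimal sketch in Python (not the proposed .NET API). The names `SIMPLE_FOLD_EXCERPT`, `fold_char`, and `equals_simple_fold` are hypothetical, and the table is a hand-picked excerpt of a few simple mappings rather than the full Unicode data. It also works on Python code points, not UTF-16 code units, so surrogate handling is out of scope here.

```python
# Hypothetical excerpt of simple case folding mappings (status C or S in
# CaseFolding.txt). A real implementation would load the complete table.
SIMPLE_FOLD_EXCERPT = {
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\u017f": "s",       # LATIN SMALL LETTER LONG S -> s
    "\u03a3": "\u03c3",  # GREEK CAPITAL LETTER SIGMA -> small sigma
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> small sigma
    "\u1e9e": "\u00df",  # LATIN CAPITAL LETTER SHARP S -> sharp s (status S)
    "\u212a": "k",       # KELVIN SIGN -> k
}


def fold_char(ch: str) -> str:
    """Fold one code point: ASCII fast path first, then the excerpt table."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return SIMPLE_FOLD_EXCERPT.get(ch, ch)


def equals_simple_fold(a: str, b: str) -> bool:
    """Compare two strings under simple case folding without folded copies."""
    # Simple folding maps one code point to exactly one code point, so the
    # lengths must already agree and positions can be compared pairwise.
    if len(a) != len(b):
        return False
    return all(fold_char(x) == fold_char(y) for x, y in zip(a, b))
```

Because every mapping is 1:1, the comparison needs no allocation and no culture data, which is what makes it a plausible candidate for an ordinal-like fast path.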

## Short quotes from Unicode standard

From: http://unicode.org/faq/casemap_charprop.html

> Q: What is the difference between case mapping and case folding?
A: Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.

From http://www.unicode.org/reports/tr31/tr31-29.html#R5

> Equivalent Case-Insensitive Identifiers: To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation. 


From http://www.unicode.org/reports/tr21/tr21-5.html

> 1.3 Caseless Matching
Caseless matching is implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that  "Μάϊος" and "ΜΆΪΟΣ" will match correctly.
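The sigma example from this quote can be checked with Python's built-in `str.casefold()`, which applies Unicode full case folding (for the characters below, the full and simple mappings coincide). This is only an illustration of the concept, not of any .NET behavior; the variable names are made up for the sketch, and the strings are written with escapes so that they are guaranteed to be in precomposed form.

```python
lower_mixed = "\u039c\u03ac\u03ca\u03bf\u03c2"  # "Μάϊος", ends in final sigma U+03C2
upper_all = "\u039c\u0386\u03aa\u039f\u03a3"    # "ΜΆΪΟΣ", ends in capital sigma U+03A3

# After folding, equality is a plain binary comparison.
print(lower_mixed.casefold() == upper_all.casefold())  # True

# Final sigma and ordinary small sigma are both already lowercase, so
# lowercasing alone does not unify them, while case folding does.
print("\u03c2".lower() == "\u03c3".lower())        # False
print("\u03c2".casefold() == "\u03c3".casefold())  # True

# Folded strings can also serve directly as hash keys for caseless lookups.
index = {upper_all.casefold(): "found"}
print(index.get(lower_mixed.casefold()))  # found
```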

From http://unicode.org/reports/tr18/#Folded_Matching

> 3.10 Folded Matching (Retracted) 
RL3.10 Folded Matching 
Previous versions of RL3.10 described tailored folding. However, for most full-featured regular expression engines, it is quite difficult to match under folding equivalences that are not 1:1. For more discussion of this, see 1.5 Simple Loose Matches and 2.1 Canonical Equivalents. Thus RL3.10 has been retracted. 

From http://unicode.org/reports/tr18/#Simple_Loose_Matches

> 1.5  Simple Loose Matches 
Most regular expression engines offer caseless matching as the only loose matching. If the engine does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII.
RL1.5 Simple Loose Matches 
 To meet this requirement, if an implementation provides for case-insensitive matching, then it shall provide at least the simple, default Unicode case-insensitive matching, and specify which properties are closed and which are not. 
To meet this requirement, if an implementation provides for case conversions, then it shall provide at least the simple, default Unicode case folding. 

From http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf

> 3.13 Default Case Algorithms

>Default Case Folding
Default Case Folding is based on the full case conversion operations without the context-dependent
mappings sensitive to the casing context.
R4 toCasefold(X): Map each character C in X to Case_Folding(C).
• Case_Folding(C) uses the mappings with the status field value “C” or “F” in the
data file CaseFolding.txt in the Unicode Character Database.

A modified form of Default Case Folding is designed for best behavior when doing caseless
matching of strings interpreted as identifiers.
(See above http://www.unicode.org/reports/tr31/tr31-29.html#R5)

>Default Caseless Matching
Default caseless matching is the process of comparing two strings for case-insensitive
equality. The definitions of Unicode Default Caseless Matching build on the definitions of
Unicode Default Case Folding.
Default Caseless Matching uses full case folding:
D144 A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
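The difference between the status values can be sketched as follows, in Python and with hypothetical helper names (`load_folding`, `to_casefold`). The embedded text is a small excerpt in the format of `CaseFolding.txt`, not the full file. Status `C` plus `F` gives the full folding used by `toCasefold` above, while `C` plus `S` gives the simple, one-to-one folding this proposal targets. `T` (Turkic) entries are skipped in both.

```python
CASE_FOLDING_EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_folding(text: str, statuses: str) -> dict:
    """Build a code point -> folded string map from the chosen status values."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


def to_casefold(s: str, table: dict) -> str:
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


FULL = load_folding(CASE_FOLDING_EXCERPT, "CF")    # C + F: full case folding
SIMPLE = load_folding(CASE_FOLDING_EXCERPT, "CS")  # C + S: 1:1 mappings only

print(to_casefold("\u1e9e", FULL) == "ss")        # True: one character becomes two
print(to_casefold("\u1e9e", SIMPLE) == "\u00df")  # True: length is preserved
```

Under the full table, "ß" and "ss" are a caseless match; under the simple table they are not, but every string keeps its length, which is what allows in-place conversion and position-by-position comparison.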

/cc @tarekgh @ahsonkhan @KrzysztofCwalina  (Sorry if my ping is wrong)

