Correct and improve Unicode escape sequence info (F#) #13168

srutzky · 2019-06-28T16:34:15Z

"Literals" page

1. Remove the erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
   * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
   * Runnable example code on [IDE One](https://ideone.com/0viKI5) shows that a valid code point (U+1F47E) works via `\U0001F47E`, and that its surrogate pair via `\UD83DDC7E` does not.
2. Show the exact hex value range for `\u` and `\U` to be more readable / helpful. This not only reduces confusion (especially for `\U`), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).

"Strings" page

1. Correctly indicate that `\u` is for a 2-byte UTF-16 value, and `\U` is for a 4-byte UTF-32 value.
2. Show a more accurate pattern for `\U` to be more readable / helpful. Please note that `\U00XXXXXX` has two permanent zeros and only 6 user-supplied hex digits. This is not only being completely honest (since those first two zeros can only ever be zeros), it removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros), hence reducing confusion.
   * Runnable example code on [IDE One](https://ideone.com/0viKI5) shows that a valid code point (U+1F47E) works via `\U0001F47E`, and that its surrogate pair via `\UD83DDC7E` does not.

**FYI:** I found an undocumented escape sequence, `\xXX`, that accepts two hex digits and produces an ISO-8859-1 character (same as the first 256 Unicode code points). Leaving it undocumented for now as there might be a specific reason that it's undocumented.

For more info on all of this, please see: [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)
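
The relationship between U+1F47E and the pair D83D / DC7E mentioned above is plain UTF-16 arithmetic, so it can be sketched outside F#. The following is a minimal Python illustration, not a reproduction of the F# compiler's behavior; the helper names `to_surrogate_pair` and `is_unicode_code_point` are hypothetical and exist only for this example. It shows why an 8-hex-digit value formed by concatenating a surrogate pair (`D83DDC7E`) cannot be a code point, and why ISO-8859-1 bytes coincide with the first 256 code points.

```python
import struct
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into its
    UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def is_unicode_code_point(value: int) -> bool:
    """True if the value lies in the Unicode code space, 0x000000..0x10FFFF.
    This is only the code-space check; it makes no claim about what any
    particular compiler accepts."""
    return 0 <= value <= 0x10FFFF


alien = "\U0001F47E"  # Python happens to share the \UXXXXXXXX escape form

# The code point and the surrogate pair name the same character.
print([hex(u) for u in to_surrogate_pair(ord(alien))])

# Encoding to UTF-16 yields exactly those two 16-bit code units.
print([hex(u) for u in struct.unpack("<2H", alien.encode("utf-16-le"))])

# The concatenated surrogate pair is far outside the code space.
print(is_unicode_code_point(0xD83DDC7E))

# ISO-8859-1 byte values map one-to-one onto the first 256 code points.
print(bytes([0xE9]).decode("latin-1") == chr(0xE9))
```

Because the largest code point is 0x10FFFF, the first two of the eight hex digits can only ever be `00`, which is the point of writing the pattern as `\U00XXXXXX`.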

cartermp

Thank you @srutzky!

cartermp · 2019-06-28T17:36:00Z

Regarding this:

> I found an undocumented escape sequence, `\xXX`, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.

We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

srutzky · 2019-06-28T17:56:31Z

@cartermp

Re:

> We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

Yes, I figured that if it was something to deal with, then it would be dealt with separately. I just didn't want to add it in now since it could have been something "experimental" and never completed, or something intentionally hidden. I dunno, maybe I am just conditioned by doing most of my work with SQL Server where there is quite a bit of "undocumented" stuff ;-). If nobody knows of a reason why `\x` shouldn't be documented, then I can submit another update for that early next week...

srutzky · 2019-07-01T14:15:58Z

I forgot to mention that this update has a companion C# update: #13162

srutzky added 2 commits June 28, 2019 11:57

srutzky requested a review from cartermp as a code owner June 28, 2019 16:34

srutzky changed the title ~~Correct and improve Unicode escape sequence info~~ Correct and improve Unicode escape sequence info (F#) Jun 28, 2019

cartermp approved these changes Jun 28, 2019

cartermp merged commit 3c8367a into dotnet:master Jun 28, 2019

srutzky mentioned this pull request Jul 1, 2019

Fix and improve Unicode escape sequence info (C#) #13162

Merged

srutzky deleted the patch-2 branch July 9, 2019 16:21
