@srutzky srutzky commented Jun 28, 2019

"Literals" page

  1. Remove erroneous note regarding \U being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.

    • Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the \U escape can also be used for BMP characters.
    • Runnable example code showing that a valid code point (U+1F47E) works via \U0001F47E, and its surrogate pair via \UD83DDC7E does not, on IDE One
  2. Show the exact hex value range for \u and \U to be more readable / helpful. This not only reduces confusion (especially for \U), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).

"Strings" page

  1. Correctly indicate that \u takes a 2-byte UTF-16 code unit and \U takes a 4-byte UTF-32 code unit.

  2. Show a more accurate pattern for \U to be more readable / helpful. Please note that \U00XXXXXX has two fixed zeros and only 6 user-supplied hex digits. This is more accurate (those first two digits can only ever be zeros), and it removes any possibility of interpreting the 8 hex digits as a surrogate pair (which can never start with two zeros), hence reducing confusion. Runnable example code on IDE One shows that a valid code point (U+1F47E) works via \U0001F47E, and its surrogate pair via \UD83DDC7E does not.
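
The relationship between a code point and its surrogate pair can be sketched numerically. This is a minimal illustration in Python rather than F#, chosen only so the arithmetic is easy to run; the helper name `to_surrogate_pair` is hypothetical and not part of any documented API. It shows that U+1F47E corresponds to the pair D83D / DC7E, and that reading those eight hex digits as one value gives a number far above U+10FFFF, which is consistent with \UD83DDC7E being rejected.

```python
MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point


def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogates. Hypothetical helper."""
    if not 0x10000 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F47E)
print(f"{high:04X} {low:04X}")             # D83D DC7E

# Reading the pair as a single 8-digit hex value, as \UD83DDC7E would:
packed = (high << 16) | low
print(f"{packed:08X}", packed > MAX_CODE_POINT)  # D83DDC7E True
```

Because every high surrogate lies in D800..DBFF, a packed pair always starts with the hex digit D, so it can never fit the 00XXXXXX pattern.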

FYI: I found an undocumented escape sequence, \xXX, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.


For more info on all of this, please see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

srutzky added 2 commits June 28, 2019 11:57
1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
    * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
    * Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/0viKI5)

2. Show the exact hex value range for `\u` and `\U` to be more readable / helpful. This not only reduces confusion (especially for `\U`), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).

For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)

1. Correctly indicate that `\u` takes a 2-byte UTF-16 code unit and `\U` takes a 4-byte UTF-32 code unit.

2. Show a more accurate pattern for `\U` to be more readable / helpful. Please note that `\U00XXXXXX` has two fixed zeros and only 6 user-supplied hex digits. This is more accurate (those first two digits can only ever be zeros), and it removes any possibility of interpreting the 8 hex digits as a surrogate pair (which can never start with two zeros), hence reducing confusion. Runnable example code on [IDE One](https://ideone.com/0viKI5) shows that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not.

**FYI:** I found an undocumented escape sequence, `\xXX`, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.

For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)
@srutzky srutzky requested a review from cartermp as a code owner June 28, 2019 16:34
@srutzky srutzky changed the title Correct and improve Unicode escape sequence info Correct and improve Unicode escape sequence info (F#) Jun 28, 2019

@cartermp cartermp left a comment

Thank you @srutzky!

@cartermp cartermp merged commit 3c8367a into dotnet:master Jun 28, 2019
@cartermp

Regarding this:

> I found an undocumented escape sequence, \xXX, that accepts two hex digits and produces an ISO-8859-1 character (same as first 256 Unicode code points). Leaving as undocumented for now as there might be a specific reason that it's undocumented.

We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

srutzky commented Jun 28, 2019

@cartermp

Re:

> We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.

Yes, I figured if it was something to deal with, then it would be dealt with separately. I just didn't want to add it in now since it could have been something "experimental" and never completed, or something intentionally hidden. I dunno, maybe I am just conditioned by doing most of my work with SQL Server where there is quite a bit of "undocumented" stuff ;-). If nobody knows of a reason why \x shouldn't be documented, then I can submit another update for that early next week...
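
For reference, the statement that ISO-8859-1 matches the first 256 Unicode code points can be checked with a short sketch. This uses Python, not F#, and assumes nothing about how the F# compiler handles \xXX; it only verifies the encoding relationship and that two hex digits cover exactly that range.

```python
# ISO-8859-1 (Latin-1) maps each byte value 0x00-0xFF to the Unicode
# code point with the same number.
latin1_text = bytes(range(256)).decode("iso-8859-1")
code_points = "".join(chr(n) for n in range(256))
print(latin1_text == code_points)   # True

# Two hex digits can express values 0x00 through 0xFF, i.e. 0..255.
max_two_hex_digits = int("FF", 16)
print(max_two_hex_digits)           # 255
```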

srutzky commented Jul 1, 2019

I forgot to mention that this update has a companion C# update: #13162

@srutzky srutzky deleted the patch-2 branch July 9, 2019 16:21