Correct and improve Unicode escape sequence info (F#) #13168
"Literals" page
1. Remove the erroneous note about `\U` being used for specifying surrogate pairs. That note was false, given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point (UTF-32 code unit) returns the correct Unicode character for that code point.
* Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
* Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/0viKI5)
2. Show the exact hex value range for `\u` and `\U` to be more readable / helpful. This not only reduces confusion (especially for `\U`), it also removes any possibility of interpreting the 8 hex digits as being for a surrogate pair (which can never start with two zeros).
For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)
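
To make the point about `\U` concrete, here is a minimal F# sketch (not taken from the docs being edited; the binding names are purely illustrative). It assumes standard .NET behavior, where strings are stored as UTF-16, so a supplementary character written as one `\U` code point occupies two code units.

```fsharp
// \U takes eight hex digits that name ONE code point (a UTF-32 value),
// not the two halves of a surrogate pair.
let alien = "\U0001F47E"      // U+1F47E, a supplementary character
let letterA = "\U00000041"    // U+0041, a BMP character: \U is not limited to supplementary characters

// .NET strings are UTF-16, so the single code point above is stored as two code units.
printfn "%d" alien.Length                              // 2
printfn "%X" (System.Char.ConvertToUtf32(alien, 0))    // 1F47E
printfn "%b" (letterA = "A")                           // true

// Writing the surrogate pair directly after \U is rejected, as described above:
// let bad = "\UD83DDC7E"     // compiler error (value is above U+10FFFF)
```

The commented-out line is left disabled on purpose, so the rest of the sketch still compiles.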
"Strings" page
1. Correctly indicate that `\u` is for a 2-byte UTF-16 value, and `\U` is for a 4-byte UTF-32 value.
2. Show a more accurate pattern for `\U` to be more readable / helpful. Please note that `\U00XXXXXX` has two permanent zeros and only 6 user-supplied hex digits. This is not only completely honest (since those first two digits can only ever be zeros), it also removes any possibility of interpreting the 8 hex digits as a surrogate pair (which can never start with two zeros), hence reducing confusion.

Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/0viKI5).

**FYI:** I found an undocumented escape sequence, `\xXX`, that accepts two hex digits and produces an ISO-8859-1 character (same as the first 256 Unicode code points). Leaving it undocumented for now, as there might be a specific reason that it's undocumented.

For more info on this, please see:
[Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/#fsharp)
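
The three escape forms discussed above can be compared side by side. This is a hedged sketch rather than text proposed for the docs: the binding names are made up for illustration, and it assumes that a supplementary character can also be spelled as two consecutive `\u` code units, as the linked article describes for F#.

```fsharp
// \uXXXX: exactly four hex digits, one UTF-16 code unit.
let eAcute = "\u00E9"                 // U+00E9

// \U00XXXXXX: eight hex digits, of which the first two are always 00.
let eAcuteLong = "\U000000E9"         // the same code point, written as a UTF-32 value

// \xXX: the undocumented two-hex-digit form mentioned above (first 256 code points).
let eAcuteShort = "\xE9"

// A supplementary character written two ways: one code point via \U,
// or its two UTF-16 code units via a pair of \u escapes (assumption, see lead-in).
let alienCodePoint = "\U0001F47E"
let alienCodeUnits = "\uD83D\uDC7E"

// Each comparison is expected to hold if the escapes behave as described.
printfn "%b" (eAcute = eAcuteLong)
printfn "%b" (eAcute = eAcuteShort)
printfn "%b" (alienCodePoint = alienCodeUnits)
```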
cartermp
left a comment
Thank you @srutzky!
Regarding this:
We'd certainly be happy to accept documentation for this. But I definitely didn't want to block the correction of outright errors on this.
Re:
Yes, I figured if it was something to deal with, then it would be dealt with separately. I just didn't want to add it in now since it could have been something "experimental" and never completed, or something intentionally hidden. I dunno, maybe I am just conditioned by doing most of my work with SQL Server where there is quite a bit of "undocumented" stuff ;-). If nobody knows of a reason why
I forgot to mention that this update has a companion C# update: #13162