Fix and improve Unicode escape sequence info (C#) #13162

srutzky · 2019-06-28T13:42:50Z

1. Remove erroneous note regarding `\U` being used for specifying surrogate pairs. That note was patently false given that a) specifying a surrogate pair results in a compiler error, and b) specifying any valid code point / UTF-32 code unit returns the correct Unicode character for that code point.
   * Even if the original author meant "supplementary characters" instead of "surrogate pairs", that would still be incorrect as the `\U` escape can also be used for BMP characters.
   * Runnable example code showing that a valid code point (U+1F47E) works via `\U0001F47E`, and its surrogate pair via `\UD83DDC7E` does not, on [IDE One](https://ideone.com/deoylQ)
   * In creating the test noted above, I found a bug in the Mono C# compiler, so I submitted that here: ["\U" Unicode escape sequence for strings accepts invalid value instead of raising error mono/mono#15456](mono/mono#15456)
   * Runnable example code showing that an invalid code point (U+110000) results in an error, on [IDE One](https://ideone.com/jpVxL4)
2. Correctly indicated that `\U` is for a 4-byte UTF-32 value, and `\u` is for a 2-byte UTF-16 value.
3. Show the pattern _and_ an example to be more readable / helpful. Please note that `\U00nnnnnn` has two permanent zeros and only 6 user-supplied hex digits. This is not only completely honest (since those first two zeros can only ever be zeros), it also removes any possibility of interpreting the 8 hex digits as a surrogate pair (which can never start with two zeros), hence reducing confusion.
4. Properly formatted escape sequences as inline code.
5. Added warning about using the `\x` escape with fewer than 4 hex digits. For more info on this, please see: [Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)](https://sqlquantumleap.wordpress.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/#csharp)
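For readers who want to try the points above locally, here is a minimal sketch. It is not the code behind the IDE One links; the class name `UnicodeEscapeDemo` and its constant names are made up for illustration, and the two commented-out literals are the cases this PR describes as being rejected by the C# compiler.

```csharp
using System;

public static class UnicodeEscapeDemo
{
    // \U takes exactly 8 hex digits naming one code point (a UTF-32 value).
    // U+1F47E is a supplementary character, so it is stored as two UTF-16 chars.
    public const string Alien = "\U0001F47E";

    // \u takes exactly 4 hex digits naming one UTF-16 code unit, so the same
    // character needs both halves of its surrogate pair.
    public const string AlienViaSurrogates = "\uD83D\uDC7E";

    // \U is not limited to supplementary characters: this is the BMP letter 'A'.
    public const string LatinA = "\U00000041";

    // \x is variable length (1 to 4 hex digits) and consumes as many as it can.
    // 'B' and 'A' are hex digits, so this parses as \x41BA followed by "D".
    public const string GreedyX = "\x41BAD";

    // Padding to 4 digits removes the ambiguity: 'A' followed by "BAD".
    public const string PaddedX = "\x0041BAD";

    // Left commented out because they do not compile:
    // string bad1 = "\UD83DDC7E"; // a surrogate pair packed into \U
    // string bad2 = "\U00110000"; // above the U+10FFFF maximum

    public static void Main()
    {
        Console.WriteLine(Alien == AlienViaSurrogates);                 // True
        Console.WriteLine(Alien.Length);                                // 2
        Console.WriteLine(char.ConvertToUtf32(Alien, 0).ToString("X")); // 1F47E
        Console.WriteLine(LatinA == "A");                               // True
        Console.WriteLine(GreedyX.Length);                              // 2
        Console.WriteLine(PaddedX);                                     // ABAD
    }
}
```

The `GreedyX` case is the reason for the `\x` warning: whether the letters after the escape are treated as text or as more hex digits depends entirely on what those letters happen to be.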

BillWagner

Thank you for adding these clarifying comments, @srutzky.
We appreciate it.

I’ve reviewed the changes, and I’ll merge now.

Thanks again!

srutzky · 2019-07-01T14:11:11Z

@BillWagner You are welcome.

I forgot to mention that this update has a companion F# update: #13168

srutzky requested a review from BillWagner as a code owner June 28, 2019 13:42

srutzky mentioned this pull request Jun 28, 2019

"\U" Unicode escape sequence for strings accepts invalid value instead of raising error mono/mono#15456

Open

srutzky changed the title ~~Fix and improve Unicode escape sequence info~~ Fix and improve Unicode escape sequence info (C#) Jun 28, 2019

BillWagner approved these changes Jul 1, 2019

BillWagner merged commit 9b6f355 into dotnet:master Jul 1, 2019

srutzky mentioned this pull request Jul 1, 2019

Correct and improve Unicode escape sequence info (F#) #13168

Merged

srutzky deleted the patch-1 branch July 9, 2019 16:21
