Add support for colorizing escape sequences in string literals. by CyrusNajmabadi · Pull Request #28927 · dotnet/roslyn

CyrusNajmabadi · 2018-07-28T02:20:11Z

Looks like this:

Note, this color comes from VSCode:

Also note how VSCode doesn't really understand C# strings and messes this up in many ways. :-/

CyrusNajmabadi · 2018-07-28T02:56:42Z

Tagging @dotnet/roslyn-ide @kuhlenh @sharwell

Note: the majority of this PR was already reviewed in the following feature-branch PRs by @jmarolf and @jcouv:

The only additions this change makes are to actually just turn on classification for string literals for normal C# or VB escape sequences. Effectively, it takes the "virtual chars" we produce (which are intended to be used by Regex/Json) and visualizes them in the IDE.

CyrusNajmabadi · 2018-07-28T03:45:38Z

Let the bikeshedding begin!

CyrusNajmabadi · 2018-07-30T20:57:29Z

@jinujoseph Can i get a buddy on this?

CyrusNajmabadi · 2018-07-30T20:58:11Z

This could be "String - Escape Character" or "String - Escape Sequence". This is a user-visible string, so i don't mind some bike-shedding.

@CyrusNajmabadi Is this representing the \ part, or the full sequence \u0001? If it is the former, then use character; otherwise use sequence.

CyrusNajmabadi · 2018-07-30T21:00:31Z

@jcouv This code is the same as when you originally reviewed, except that i've added support for interpolated strings. This is so that we can effectively classify escape sequences in something like $"{foo}\r\n{bar}". Note: interpolations are actually simpler to handle because the interpolation text is only the \r\n part, and doesn't include things like $" that you then have to ignore.

CyrusNajmabadi · 2018-07-30T21:05:56Z

@jcouv can you take a look? Thanks!

kuhlenh · 2018-07-30T21:47:45Z

@CyrusNajmabadi is this color way different than the regex colors?

CyrusNajmabadi · 2018-07-30T22:27:52Z

@kuhlenh The intention would likely be to keep this the same as regex-escapes. The color here was just picked since it's what vscode already uses.

kuhlenh · 2018-07-30T22:45:32Z

@CyrusNajmabadi makes sense. I think this is great! Definitely will provide a better user experience.

sharwell

I like the feature, but find the implementation immensely and overwhelmingly complex. I would expect this to appear as a simple syntactic classifier that operates directly on string literal tokens coming from the compiler before reporting the classification tags back to the consumer.

sharwell · 2018-08-01T14:21:48Z

❓ Why did this file change?

hrmm.. not sure. reverting.

sharwell · 2018-08-01T14:22:13Z

📝 Incorrect change to indentation

sharwell · 2018-08-01T14:22:41Z

❓ Why are changes to this file appearing in this pull request

Reverting. sorry!

sharwell · 2018-08-01T14:23:50Z

📝 String escape sequences are not an embedded language. The entire use of the embedded language subsystem makes classification of a core syntactical element of C# overly complex.

sharwell · 2018-08-01T14:32:53Z

I'll follow up with @jcouv and @jmarolf today before going further in the review here.

CyrusNajmabadi · 2018-08-01T15:27:07Z

Good feedback, let's start with the major points:

I like the feature, but find the implementation immensely and overwhelmingly complex. I would expect this to appear as a simple syntactic classifier that operates directly on string literal tokens coming from the compiler before reporting the classification tags back to the consumer.

So, I need to know which part of the impl you find immensely complex. There are currently two main parts to this review that are worth understanding.

1. The Virtual-Char system. This is the module that can take a string-literal token (or interpolated-text-token) and produce a series of interpreted chars, along with their original source location. i.e. it is the part of the system that can see a sequence like \n and say this is a line-feed char that occupies two chars in the original source. Such a system would be necessary for this feature no matter what, because the entire purpose is to know which chars in the .ValueText of the literal were actually multiple chars in source.
2. The Embedded-Lang system. In my mind 'embedded lang' is simply a synonym for "something that wants to operate on and process the 'virtual chars' in a string literal token". That will hopefully include regex in the future (and possibly json as well). However, it also falls out that there can just be a "Default" (or "Fallback") embedded-lang system that processes the virtual-chars without much interpretation, doing things such as classifying them, as in this PR.
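To make the virtual-char idea concrete, here is a minimal sketch, not the code from this PR. The names `SketchVirtualChar` and `VirtualCharSketch.Convert` are hypothetical, and the sketch assumes a regular (non-verbatim) C# literal with only a handful of escapes. Each interpreted char records where it came from in the literal's source text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical type for illustration; this is not Roslyn's VirtualChar.
public readonly struct SketchVirtualChar
{
    public readonly char Value;  // the interpreted character
    public readonly int Start;   // offset into the literal's source text
    public readonly int Length;  // how many source chars produced it

    public SketchVirtualChar(char value, int start, int length)
    {
        Value = value;
        Start = start;
        Length = length;
    }
}

public static class VirtualCharSketch
{
    // Takes the source text of a regular literal, quotes included.
    // Returns null for anything this sketch does not understand.
    public static List<SketchVirtualChar> Convert(string literalText)
    {
        if (literalText.Length < 2 || literalText[0] != '"' ||
            literalText[literalText.Length - 1] != '"')
        {
            return null;
        }

        var result = new List<SketchVirtualChar>();
        int end = literalText.Length - 1; // index of the closing quote
        int i = 1;                        // skip the opening quote
        while (i < end)
        {
            char c = literalText[i];
            if (c != '\\')
            {
                result.Add(new SketchVirtualChar(c, i, 1));
                i++;
                continue;
            }

            if (i + 1 >= end) return null; // dangling backslash
            char next = literalText[i + 1];
            if (next == 'u')
            {
                // \uXXXX occupies six source chars.
                if (i + 6 > end) return null;
                string hex = literalText.Substring(i + 2, 4);
                if (!ushort.TryParse(hex, NumberStyles.AllowHexSpecifier,
                        CultureInfo.InvariantCulture, out ushort code))
                {
                    return null;
                }
                result.Add(new SketchVirtualChar((char)code, i, 6));
                i += 6;
                continue;
            }

            char value;
            switch (next)
            {
                case 'n': value = '\n'; break;
                case 'r': value = '\r'; break;
                case 't': value = '\t'; break;
                case '0': value = '\0'; break;
                case '\\': value = '\\'; break;
                case '"': value = '"'; break;
                default: return null; // escape not covered by this sketch
            }
            result.Add(new SketchVirtualChar(value, i, 2));
            i += 2;
        }
        return result;
    }
}
```

A classifier built on this could colorize every entry whose `Length` is greater than 1, since those are exactly the chars that were written as escapes.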

String escape sequences are not an embedded language. The entire use of the embedded language subsystem makes classification of a core syntactical element of C# overly complex.

I can see that being the feeling looking at this PR in isolation. It's likely true that the embedded-lang portion could have been optional here. However, i would still like to maintain this because it will then keep the 'regex' PR much simpler. An important aspect of the embedded-lang system is that it provides an ordering for all the modules that want to process the string literal. i.e. it will try the regex-provider first, and if that produces nothing, it will fall back to this default/fallback language provider. If this was a core part of the classifier, and we also had the embedded-language providers for regex/json, we would need to coordinate between the two in some fashion to ensure there was no issue. Furthermore, these embedded languages all behave the same way. So we'd have different parts of our stack conceptually doing the same thing, but in a different fashion.

Because of that, i felt it was worthwhile and effective to just use the stack that had been built for handling these more complex string cases, and to use it for this simpler string case. I personally don't think it adds that much complexity. And it means later work slots in nicely here instead of involving a lot more refactoring.
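The ordering argument can be sketched roughly as follows. This is an illustration only: `ProviderOrderingSketch`, its two providers, and the made-up `regex:` prefix used to mark a literal as a regex are all assumptions, not the embedded-language API from this PR.

```csharp
using System;
using System.Collections.Generic;

public static class ProviderOrderingSketch
{
    // Hypothetical regex provider: it only claims literals that carry the
    // made-up "regex:" marker, and produces nothing otherwise.
    public static List<string> RegexProvider(string text)
    {
        var result = new List<string>();
        if (text.StartsWith("regex:", StringComparison.Ordinal))
        {
            result.Add("regex");
        }
        return result;
    }

    // Hypothetical fallback provider: reports plain string escapes.
    public static List<string> FallbackProvider(string text)
    {
        var result = new List<string>();
        if (text.IndexOf('\\') >= 0)
        {
            result.Add("string-escape");
        }
        return result;
    }

    // Providers are tried in order; the first one that produces
    // anything wins, so the fallback only runs when regex declines.
    public static List<string> Classify(string text)
    {
        var providers = new Func<string, List<string>>[]
        {
            RegexProvider,
            FallbackProvider,
        };

        foreach (var provider in providers)
        {
            var result = provider(text);
            if (result.Count > 0)
            {
                return result;
            }
        }
        return new List<string>();
    }
}
```

With a single ordered list, adding a json provider later would only mean inserting one more entry ahead of the fallback, which is the coordination benefit argued for above.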

--

You can think of this PR as being three PRs together:

1. The addition of the virtual-char subsystem.
2. The addition of the embedded-lang system for allowing interesting services to light up on string literals.
3. The addition of the 'default' embedded lang which classifies just escape chars in a string literal.

Had '1' and '2' gone in (and i've tried in the past to have those as separate PRs) this PR here would be tiny. It would feel incredibly simple and easy to understand. However, it's because '1' and '2' can't go in without a use case that is causing this to feel like a lot.

CyrusNajmabadi · 2018-08-01T15:28:27Z

I'm going to try to break this PR up into smaller commits to make things easier to reason about.

…rs in a string-literal token

From the dotnet#23984 PR:

The first subsystem, called the VirtualCharService, deals with the following issue. To the final .NET regex engine, the following snippets of code appear completely identical:

    "\\1"       // In a normal string, we have to escape the \
    @"\1"       // But not in a verbatim string
    "\\\u0031"  // The '1' could be escaped
    "\u005c1"   // Even the backslash *itself* may be escaped

These are all ways of writing the \1 regex. In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue, because any data reported by the regex engine must be mapped back accurately to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right span is to highlight as the problem. i.e. https://user-images.githubusercontent.com/4564579/34459671-5bb785b2-edab-11e7-8413-79c331ef373f.png and https://user-images.githubusercontent.com/4564579/34459672-6deb88dc-edab-11e7-8236-7ba7cd331247.png

So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the third example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while keeping a mapping from each character in the ValueText back to the original source span of the document that formed it. With the VirtualCharService, user code can be translated into a common format that can then be processed uniformly.

This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any Roslyn language (for example, F#) if that is so desired.
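As a quick check of the equivalence described in that commit message, the following sketch declares the four spellings as ordinary C# literals and compares their values. The class and member names are hypothetical. The last form is written as "\u005c1", which is the spelling that yields a backslash followed by 1.

```csharp
using System;

public static class EquivalentLiterals
{
    // Four source spellings that compile to the same two-char value: \ and 1.
    public static readonly string Escaped = "\\1";            // escaped backslash
    public static readonly string Verbatim = @"\1";           // verbatim string
    public static readonly string EscapedDigit = "\\\u0031";  // the '1' is escaped
    public static readonly string EscapedSlash = "\u005c1";   // the backslash is escaped

    // True when every spelling has the same ValueText-style value.
    public static bool AllSameValue()
    {
        return Escaped == Verbatim
            && Verbatim == EscapedDigit
            && EscapedDigit == EscapedSlash;
    }
}
```

The values are identical, but the source spans are not: the `1` occupies one source char in the first form and six in the third, which is the information a value-only view of the string loses.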

CyrusNajmabadi · 2018-08-01T15:54:59Z

@sharwell i've broken this up into 4 conceptual commits to hopefully help make it clearer how this all fits together.

CyrusNajmabadi · 2018-08-02T15:20:57Z

@sharwell have your concerns been addressed here?

CyrusNajmabadi · 2018-08-02T15:22:37Z

I've looked through this and, at a conceptual level, this honestly doesn't seem complex to me. There is complexity in things like the virtual-char service, where we have to figure out all the escape chars and whatnot. But overall, the way things plug into the classification stack is pretty darn simple.

CyrusNajmabadi · 2018-08-05T17:47:16Z

@kuhlenh Using the actual colors we decided on for regex, you get this:

IMO, this may be a bit too subtle for the default. Note: i think these colors are good for Regex. That's because for regex you really want the actual regex operators to stand out, and you just want to see that one of these textual escapes is just a simple string escape, but you don't need to really distinguish them greatly from normal text.

However, when you just have normal text, then being able to distinguish the escapes from the normal text is far more helpful.

I'm curious what you and @sharwell think about this. Should we try to have a consistent 'escaped text' color across Regex and normal strings? Or should we have two separate colors for the different cases? I personally like having the escapes really stand out. But i would be ok with it only getting a subtle treatment.

jcouv · 2018-08-08T22:36:46Z

+            _language = language;
+        }
+
+        public void AddClassifications(


It seems this PR doesn't add tests to verify classification. Do we have some infrastructure for that?

You are correct. I will add a couple to validate. The majority of the tests validate the virtual char service. But we do need at least 1-2 tests at the classification level to make sure it hooked up properly.

Tests added

jcouv · 2018-08-08T22:46:04Z

+
+            var result = TryConvertToVirtualCharsWorker(token);
+
+#if DEBUG


nit: Consider moving this block to a helper method.

jcouv · 2018-08-08T22:58:09Z

+            {
+                return default;
+            }
+        }


I filed #29172 to follow-up when the alternate interpolated verbatim strings feature (@$" instead of $@") gets merged. This code may have to be updated.

Could you add a test to CSharpVirtualCharServiceTests.cs that @$" produces specific diagnostics? Those diagnostics will change when C# 8 is merged in and that can be used as a reminder. Thanks

test added.

jcouv

Done with review pass (iteration 14).

CyrusNajmabadi · 2018-08-08T23:52:44Z

Tests added.

CyrusNajmabadi · 2018-08-09T00:12:15Z

@jcouv Have tried to address all your feedback. Thanks!

jcouv · 2018-08-09T00:18:24Z

            Test("$@\"{{\"", "['{',[3,5]]");
        }

+        [Fact]


Thanks!
Link to #29172 would be great.

jcouv

LGTM Thanks! (iteration 20).
Can you also file an issue for the theming follow-up work?

CyrusNajmabadi · 2018-08-09T00:23:56Z

Theming issue is: #29173

CyrusNajmabadi · 2018-08-09T00:26:53Z

#29174 Tracks the 16bit vs 32bit virtual-char limitation.
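For context on that limitation, here is a small sketch under one assumption: that the limit comes from storing each virtual char in a single 16-bit `char`. A code point above U+FFFF needs two UTF-16 code units, so it cannot fit in one `char`. `WideCharSketch` is a hypothetical name; `char.ConvertFromUtf32` is the standard .NET method.

```csharp
using System;

public static class WideCharSketch
{
    // Number of UTF-16 code units (C# chars) needed for one code point.
    public static int CodeUnits(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // True when the code point is stored as a surrogate pair.
    public static bool NeedsSurrogatePair(int codePoint)
    {
        string text = char.ConvertFromUtf32(codePoint);
        return text.Length == 2 && char.IsSurrogatePair(text[0], text[1]);
    }
}
```

For example, `CodeUnits(0x41)` is 1, while `CodeUnits(0x1F600)` is 2, so an escape such as \U0001F600 maps one source span to two chars.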

CyrusNajmabadi · 2018-08-09T00:29:41Z

Ok. All feedback addressed (i think).

tmat · 2018-08-09T01:06:41Z

+    /// </summary>
+    internal readonly struct VirtualChar : IEquatable<VirtualChar>
+    {
+        public readonly char Char;


Seems like a rather inefficient representation for long strings that have just a couple of escape characters.

I also wonder how is this gonna work with UTF8 string feature.

I also wonder how is this gonna work with UTF8 string feature.

Can you link me to more info on this, including how Roslyn intends to expose this feature? Based on that, i can give you an assessment.

@jaredpar Is UTF8 string literal proposal being worked on?

jcouv · 2018-08-09T01:56:25Z

Merged. Thanks!

CyrusNajmabadi · 2018-08-09T02:21:41Z

Thanks!

CyrusNajmabadi requested a review from a team as a code owner July 28, 2018 02:20

CyrusNajmabadi force-pushed the classifyEscapes branch from 12b63c4 to 99b26d6 Compare July 28, 2018 02:54

jcouv added the Area-IDE and Community (the pull request was submitted by a contributor who is not a Microsoft employee) labels Jul 28, 2018

jcouv added this to the 16.0 milestone Jul 28, 2018

jcouv self-assigned this Jul 30, 2018

jinujoseph assigned ivanbasov Jul 31, 2018

sharwell suggested changes Aug 1, 2018

CyrusNajmabadi added 4 commits August 1, 2018 11:39

Add an extension point to allow services to light up on string-literals (classification only for now)

58103f6

Hook up the C#/VB classifiers to defer to embedded languages for handling string-literals.

208cf00

Provide a 'fallback' classifier for string-literals that will classify escape sequences.

e1c408e

CyrusNajmabadi force-pushed the classifyEscapes branch from a54894d to e1c408e Compare August 1, 2018 15:53

Merge remote-tracking branch 'upstream/master' into classifyEscapes

f2e2ddf

Make name plural

522e044

jcouv reviewed Aug 8, 2018

jcouv mentioned this pull request Aug 8, 2018

Verify alt interpolated verbatim strings with VirtualChars #29172

Closed

jcouv reviewed Aug 8, 2018

CyrusNajmabadi added 2 commits August 8, 2018 16:45

Add tests.

501f704

tests added.

4a68619

CyrusNajmabadi added 4 commits August 8, 2018 17:04

Add test for reversed string.

fad17b9

update test.

7b6d70a

Reduce accessibility

5adcb72

Move checking into helper method.

5cdc6c9

jcouv reviewed Aug 9, 2018

jcouv approved these changes Aug 9, 2018

Add tracking info.

6d2a187

CyrusNajmabadi mentioned this pull request Aug 9, 2018

Escape character classification needs to be bikeshedded. #29173

Closed

CyrusNajmabadi mentioned this pull request Aug 9, 2018

VirtualChar system cannot handle 32bit wide characters #29174

Closed

Add comment.

b62b60c

tmat reviewed Aug 9, 2018

jcouv merged commit acd3186 into dotnet:master Aug 9, 2018

CyrusNajmabadi deleted the classifyEscapes branch August 9, 2018 02:21

This was referenced Aug 9, 2018

Merge master into features/embeddedJson #29178

Merged

Merge master into features/embeddedRegex #29179

Merged
