Skip to content

Add support for colorizing escape sequences in string literals.#28927

Merged
jcouv merged 22 commits intodotnet:masterfrom
CyrusNajmabadi:classifyEscapes
Aug 9, 2018
Merged

Add support for colorizing escape sequences in string literals.#28927
jcouv merged 22 commits intodotnet:masterfrom
CyrusNajmabadi:classifyEscapes

Conversation

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor

@CyrusNajmabadi CyrusNajmabadi commented Jul 28, 2018

Looks like this:

image

image

Note, this color comes from VSCode:

image

Also note how VSCode doesn't really understand C# strings and messes this up in many ways. :-/

@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner July 28, 2018 02:20
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Tagging @dotnet/roslyn-ide @kuhlenh @sharwell

Note: teh majority of this PR was already reviewed in the following feature-branch PRs by @jmarolf and @jcouv:

#25541
#24111
#25985

The only additions this change makes are to actually just turn on classification for string literals for normal C# or VB escape sequences. Effectively, it takes the "virtual chars" we produce (which are intended to be used by Regex/Json) and visualizes them in the IDE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let the bikeshedding begin!

@jcouv jcouv added Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee. labels Jul 28, 2018
@jcouv jcouv added this to the 16.0 milestone Jul 28, 2018
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jinujoseph Can i get a buddy on this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could "String - Escape Character" or "String - Escape Sequence". This is a user visible string, so i don't mind some bike-shedding.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CyrusNajmabadi Is this representing the \ part, or the full sequence \u0001. If it the former then use character, otherwise sequence.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcouv This code is the same as when you originally reviewed except that i've added support for interpolated strings. this is so that we can effectively classify escape sequences in something like $"{foo}\r\n{bar}". Note: interpolations are actually simpler to handle because the interpolation text is only the \r\n art, and doesn't include things like $" that you then have to ignore.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jcouv can you take a look? Thanks!

@kuhlenh
Copy link
Copy Markdown

kuhlenh commented Jul 30, 2018

@CyrusNajmabadi is this color way different than the regex colors?

@jcouv jcouv self-assigned this Jul 30, 2018
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@kuhlenh The intention would likely be to keep this the same as regex-escapes. The color here was just picked since it's what vscode already uses.

@kuhlenh
Copy link
Copy Markdown

kuhlenh commented Jul 30, 2018

@CyrusNajmabadi makes sense. I think this is great! Definitely will provide a better user experience.

Copy link
Copy Markdown
Contributor

@sharwell sharwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the feature, but find the implementation immensely and overwhelmingly complex. I would expect this to appear as a simple syntactic classifier that operates directly on string literal tokens coming from the compiler before reporting the classification tags back to the consumer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Why did this file change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrmm.. not sure. reverting.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Incorrect change to indentation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Why are changes to this file appearing in this pull request

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting. sorry!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 String escape sequences are not an embedded language. The entire use of the embedded language subsystem makes classification of a core syntactical element of C# overly complex.

@sharwell
Copy link
Copy Markdown
Contributor

sharwell commented Aug 1, 2018

I'll follow up with @jcouv and @jmarolf today before going further in the review here.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Good feedback, let's start with the major points:

I like the feature, but find the implementation immensely and overwhelmingly complex. I would expect this to appear as a simple syntactic classifier that operates directly on string literal tokens coming from the compiler before reporting the classification tags back to the consumer.

So, I need to know which part of hte impl you find immensely complex. There are currently two main parts to this review that are worth understanding.

  1. The Virtual-Char system. This is the module that can take a string-literal token (or interpolated-text-token) and produce a series of interpretted chars, along with their original source location. i.e. it is the part of the system that can see a sequence like \n and say this is a line-feed char, that occupies two chars in the original source. Such a system would be necessary for this feature no matter what, because the entire purpose is to know what chars in the .ValueText of the literal actually were multiple chars in source.
  2. The Embedded-Lang system. In my mind 'embedded lang' is simply a synonym for "something that wants to operate and process the 'virtual chars' in a string literal token". That will hopefully include regex in the future (and possibly json as well). However, it also falls out that there can just be a "Default" (or "Fallback") embedded-lang system that just processes the virtual-chars without much interpretation doing things like classifying them like in this PR.

String escape sequences are not an embedded language. The entire use of the embedded language subsystem makes classification of a core syntactical element of C# overly complex.

I can see that being hte feeling looking at this PR in isolation. It's likely true that the embedded-lang portion could have been optional here. However, i would still like to maintain this because it will then keep the 'regex' PR much simpler. An important aspect of the embedded-lang system is that it provides an ordering for all the modules that want to process the string literal. i.e. it will try the regex-provider first, and if that produces nothing, it will fall back to this default/fallback language provider. If this was a core part of the classifier, and we also had the embedded-language providers for regex/json, we would need to coordinate between the two in some fashion to ensure there was no issue. Furthermore, these embedded languages all behave the same way. So we'd have different parts of our stack conceptually doing the same thing, but in a different fashion.

Because of that, i felt it was worthwhile and effective to just use the stack that had been built for handling these more complex string cases, and to use it for this simpler string case. I personally don't think it adds that much complexity. And it means later works slots in nicely here instead of involving a lot more refactoring.

--

You can think of this PR as being three PRs together:

  1. The addition of the virtual-char subssytem.
  2. The addition of the embedded-lang system for allowing interesting services to light up on string literals.
  3. The addition of the 'default' embedded lang which classifies just escape chars in a string literal.

Had '1' and '2' gone in (and i've tried in the past to have those as separate PRs) this PR here would be tiny. It would feel incredibly simple and easy to understand. However, it's because '1' and '2' can't go in without a use case that is causing this to feel like a lot.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

I'm going to try to break this PR up into smaller commits to make things easier to reason about.

…rs in a string-literal token

From the dotnet#23984 PR:

The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:

"\\1"       // In a normal string, we have to escape the \
@"\1"       // But not in a verbatim string
"\\\u0031"  // The '1' could be escaped
"\\u005c1"  // Even the backslash *itself* may be escaped

These are all ways of writing the \1 regex.

In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.

https://user-images.githubusercontent.com/4564579/34459671-5bb785b2-edab-11e7-8413-79c331ef373f.png

and

https://user-images.githubusercontent.com/4564579/34459672-6deb88dc-edab-11e7-8236-7ba7cd331247.png

So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.

With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@sharwell i've broken this up into 4 conceptual commits to hopefully help make it clearer how this all fits together.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@sharwell have your concerns been addressed here?

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

I've looked through this and, at a conceptual level, this honestly doesn't seem complex to me. There is complexity in things like the virtual-char service, where we have to figure out all the escape chars and whatnot. But overall, the way things plug into the classification stack is pretty darn simple.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@kuhlenh Using the actual colors we decided on for regex, you get this:

image

image

IMO, this may be a bit too subtle for the default. Note: i think these colors are good for Regex. That's because for regex you really want the actual regex operators to stand out, and you just want to see that one of these textual escapes is just a simple string escape, but you don't need to really distinguish them greatly from normal text.

However, when you just have normal text, then being able to distinguish the escapes from the normal text is far more helpful.

I'm curious what you and @sharwell think about this. Should we try to have a consistent 'escaped text' color across Regex and normal strings? Or should we have two separate colors for the different cases? I personally like having the escapes really stand out. But i would be ok with it only getting a subtle treatment.

_language = language;
}

public void AddClassifications(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddClassifications [](start = 20, length = 18)

It seems this PR doesn't add tests to verify classification. Do we have some infrastructure for that?

Copy link
Copy Markdown
Contributor Author

@CyrusNajmabadi CyrusNajmabadi Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. I will add a couple to validate. The majority of the tests validate the virtual char service. But we do need at least 1-2 tests at the classification level to make sure it hooked up properly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests added


var result = TryConvertToVirtualCharsWorker(token);

#if DEBUG
Copy link
Copy Markdown
Member

@jcouv jcouv Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEBUG [](start = 4, length = 5)

nit: Consider moving this block to a helper method. #Closed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

{
return default;
}
}
Copy link
Copy Markdown
Member

@jcouv jcouv Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} [](start = 8, length = 1)

I filed #29172 to follow-up when the alternate interpolated verbatim strings feature (@$" instead of $@") gets merged. This code may have to be updated. #Closed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test to CSharpVirtualCharServiceTests.cs that @$" produces specific diagnostics? Those diagnostics will change when C# 8 is merged in and that can be used as a reminder. Thanks


In reply to: 208761728 [](ancestors = 208761728)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test added.

Copy link
Copy Markdown
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with review pass (iteration 14).

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Tests added.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jcouv Have tried to address all your feedback. Thanks!

Test("$@\"{{\"", "['{',[3,5]]");
}

[Fact]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Link to #29172 would be great.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Copy Markdown
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks! (iteration 20).
Can you also file an issue for the themeing follow-up work?

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Theming issue is: #29173

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

#29174 Tracks the 16bit vs 32bit virtual-char limitation.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Ok. All feedback addressed (i think).

/// </summary>
internal readonly struct VirtualChar : IEquatable<VirtualChar>
{
public readonly char Char;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like rather inefficient representation for long strings that have just a couple of escape characters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder how is this gonna work with UTF8 string feature.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder how is this gonna work with UTF8 string feature.

Can you link me to more info on this, including how Roslyn intends to expose this feature? Based on that, i can give you an assessment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaredpar Is UTF8 string literal proposal being worked on?

@jcouv jcouv merged commit acd3186 into dotnet:master Aug 9, 2018
@jcouv
Copy link
Copy Markdown
Member

jcouv commented Aug 9, 2018

Merged. Thanks!

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants