Add support for colorizing escape sequences in string literals.#28927
Add support for colorizing escape sequences in string literals.#28927jcouv merged 22 commits intodotnet:masterfrom
Conversation
12b63c4 to
99b26d6
Compare
|
Tagging @dotnet/roslyn-ide @kuhlenh @sharwell Note: teh majority of this PR was already reviewed in the following feature-branch PRs by @jmarolf and @jcouv: The only additions this change makes are to actually just turn on classification for string literals for normal C# or VB escape sequences. Effectively, it takes the "virtual chars" we produce (which are intended to be used by Regex/Json) and visualizes them in the IDE. |
There was a problem hiding this comment.
Let the bikeshedding begin!
|
@jinujoseph Can i get a buddy on this? |
There was a problem hiding this comment.
This could "String - Escape Character" or "String - Escape Sequence". This is a user visible string, so i don't mind some bike-shedding.
There was a problem hiding this comment.
@CyrusNajmabadi Is this representing the \ part, or the full sequence \u0001. If it the former then use character, otherwise sequence.
There was a problem hiding this comment.
@jcouv This code is the same as when you originally reviewed except that i've added support for interpolated strings. this is so that we can effectively classify escape sequences in something like $"{foo}\r\n{bar}". Note: interpolations are actually simpler to handle because the interpolation text is only the \r\n art, and doesn't include things like $" that you then have to ignore.
|
@jcouv can you take a look? Thanks! |
|
@CyrusNajmabadi is this color way different than the regex colors? |
|
@kuhlenh The intention would likely be to keep this the same as regex-escapes. The color here was just picked since it's what vscode already uses. |
|
@CyrusNajmabadi makes sense. I think this is great! Definitely will provide a better user experience. |
sharwell
left a comment
There was a problem hiding this comment.
I like the feature, but find the implementation immensely and overwhelmingly complex. I would expect this to appear as a simple syntactic classifier that operates directly on string literal tokens coming from the compiler before reporting the classification tags back to the consumer.
There was a problem hiding this comment.
❓ Why did this file change?
There was a problem hiding this comment.
hrmm.. not sure. reverting.
There was a problem hiding this comment.
📝 Incorrect change to indentation
There was a problem hiding this comment.
❓ Why are changes to this file appearing in this pull request
There was a problem hiding this comment.
Reverting. sorry!
There was a problem hiding this comment.
📝 String escape sequences are not an embedded language. The entire use of the embedded language subsystem makes classification of a core syntactical element of C# overly complex.
|
Good feedback, let's start with the major points:
So, I need to know which part of hte impl you find immensely complex. There are currently two main parts to this review that are worth understanding.
I can see that being hte feeling looking at this PR in isolation. It's likely true that the embedded-lang portion could have been optional here. However, i would still like to maintain this because it will then keep the 'regex' PR much simpler. An important aspect of the embedded-lang system is that it provides an ordering for all the modules that want to process the string literal. i.e. it will try the regex-provider first, and if that produces nothing, it will fall back to this default/fallback language provider. If this was a core part of the classifier, and we also had the embedded-language providers for regex/json, we would need to coordinate between the two in some fashion to ensure there was no issue. Furthermore, these embedded languages all behave the same way. So we'd have different parts of our stack conceptually doing the same thing, but in a different fashion. Because of that, i felt it was worthwhile and effective to just use the stack that had been built for handling these more complex string cases, and to use it for this simpler string case. I personally don't think it adds that much complexity. And it means later works slots in nicely here instead of involving a lot more refactoring. -- You can think of this PR as being three PRs together:
Had '1' and '2' gone in (and i've tried in the past to have those as separate PRs) this PR here would be tiny. It would feel incredibly simple and easy to understand. However, it's because '1' and '2' can't go in without a use case that is causing this to feel like a lot. |
|
I'm going to try to break this PR up into smaller commits to make things easier to reason about. |
…rs in a string-literal token From the dotnet#23984 PR: The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it: "\\1" // In a normal string, we have to escape the \ @"\1" // But not in a verbatim string "\\\u0031" // The '1' could be escaped "\\u005c1" // Even the backslash *itself* may be escaped These are all ways of writing the \1 regex. In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e. https://user-images.githubusercontent.com/4564579/34459671-5bb785b2-edab-11e7-8413-79c331ef373f.png and https://user-images.githubusercontent.com/4564579/34459672-6deb88dc-edab-11e7-8236-7ba7cd331247.png So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this. With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.
…ls (classification only for now)
…ling string-literals.
…y escape sequences.
a54894d to
e1c408e
Compare
|
@sharwell i've broken this up into 4 conceptual commits to hopefully help make it clearer how this all fits together. |
|
@sharwell have your concerns been addressed here? |
|
I've looked through this and, at a conceptual level, this honestly doesn't seem complex to me. There is complexity in things like the virtual-char service, where we have to figure out all the escape chars and whatnot. But overall, the way things plug into the classification stack is pretty darn simple. |
|
@kuhlenh Using the actual colors we decided on for regex, you get this: IMO, this may be a bit too subtle for the default. Note: i think these colors are good for Regex. That's because for regex you really want the actual regex operators to stand out, and you just want to see that one of these textual escapes is just a simple string escape, but you don't need to really distinguish them greatly from normal text. However, when you just have normal text, then being able to distinguish the escapes from the normal text is far more helpful. I'm curious what you and @sharwell think about this. Should we try to have a consistent 'escaped text' color across Regex and normal strings? Or should we have two separate colors for the different cases? I personally like having the escapes really stand out. But i would be ok with it only getting a subtle treatment. |
| _language = language; | ||
| } | ||
|
|
||
| public void AddClassifications( |
There was a problem hiding this comment.
AddClassifications [](start = 20, length = 18)
It seems this PR doesn't add tests to verify classification. Do we have some infrastructure for that?
There was a problem hiding this comment.
You are correct. I will add a couple to validate. The majority of the tests validate the virtual char service. But we do need at least 1-2 tests at the classification level to make sure it hooked up properly.
|
|
||
| var result = TryConvertToVirtualCharsWorker(token); | ||
|
|
||
| #if DEBUG |
There was a problem hiding this comment.
DEBUG [](start = 4, length = 5)
nit: Consider moving this block to a helper method. #Closed
| { | ||
| return default; | ||
| } | ||
| } |
There was a problem hiding this comment.
} [](start = 8, length = 1)
I filed #29172 to follow-up when the alternate interpolated verbatim strings feature (@$" instead of $@") gets merged. This code may have to be updated. #Closed
There was a problem hiding this comment.
Could you add a test to CSharpVirtualCharServiceTests.cs that @$" produces specific diagnostics? Those diagnostics will change when C# 8 is merged in and that can be used as a reminder. Thanks
In reply to: 208761728 [](ancestors = 208761728)
jcouv
left a comment
There was a problem hiding this comment.
Done with review pass (iteration 14).
|
Tests added. |
|
@jcouv Have tried to address all your feedback. Thanks! |
| Test("$@\"{{\"", "['{',[3,5]]"); | ||
| } | ||
|
|
||
| [Fact] |
jcouv
left a comment
There was a problem hiding this comment.
LGTM Thanks! (iteration 20).
Can you also file an issue for the themeing follow-up work?
|
Theming issue is: #29173 |
|
#29174 Tracks the 16bit vs 32bit virtual-char limitation. |
|
Ok. All feedback addressed (i think). |
| /// </summary> | ||
| internal readonly struct VirtualChar : IEquatable<VirtualChar> | ||
| { | ||
| public readonly char Char; |
There was a problem hiding this comment.
Seems like rather inefficient representation for long strings that have just a couple of escape characters.
There was a problem hiding this comment.
I also wonder how is this gonna work with UTF8 string feature.
There was a problem hiding this comment.
I also wonder how is this gonna work with UTF8 string feature.
Can you link me to more info on this, including how Roslyn intends to expose this feature? Based on that, i can give you an assessment.
There was a problem hiding this comment.
@jaredpar Is UTF8 string literal proposal being worked on?
|
Merged. Thanks! |
|
Thanks! |


Looks like this:
Note, this color comes from VSCode:
Also note how VSCode doesn't really understand C# strings and messes this up in many ways. :-/