Extract out the VirtualChar portion of the Regex and JSON editor work.#24111
Extract out the VirtualChar portion of the Regex and JSON editor work.#24111jcouv merged 27 commits intodotnet:features/embeddedRegexfrom
Conversation
|
@jasonmalinowski Pulled out 700 lines. That makes the other PR only 37k large. :D |
|
Tagging @dotnet/roslyn-ide . Review plz. |
|
Also, i'll likely refactor the regex and json PRs to pull out the tests into their own PR. That will help make the review more manageable (and less painful in the browser). |
|
Personally I'd like to see Document comments on all of the methods, both public and internal ones. |
|
@AdamSpeight2008 Anything in particular? My personal approach is to put on doc comments when i think it will provide substantive value for the consumer or maintainer of the code. If there's areas you feel would benefit, i'd be happy to add. But i'm not really going to take an approach of just adding doc comments to everything. |
| return (c >= '0' && c <= '9') ? c - '0' : (c & 0xdf) - 'A' + 10; | ||
| } | ||
|
|
||
| private static bool IsHexDigit(char c) |
There was a problem hiding this comment.
Doesn't cover fullwidth characters that are valid hex character in VB.net
There was a problem hiding this comment.
This is the C# impl.
There was a problem hiding this comment.
My apologise, thought it was in the base class used for both, ie the VirtualCharService
| { | ||
| protected abstract ImmutableArray<VirtualChar> TryConvertToVirtualCharsWorker(SyntaxToken token); | ||
|
|
||
| public ImmutableArray<VirtualChar> TryConvertToVirtualChars(SyntaxToken token) |
There was a problem hiding this comment.
@CyrusNajmabadi
Add a doc comment, so the consumer understands what aspect of the syntax node is being worked on. eg. Is it the actual text from source, or value text, or displayed text?
There was a problem hiding this comment.
Comment added to IVirtualCharService.
| { | ||
| var builder = PooledStringBuilder.GetInstance(); | ||
|
|
||
| foreach (var vc in chars) |
There was a problem hiding this comment.
Wouldn't the following remove an instance of the enumerator.
for(var idx = 0; idx < chars.length; idx++)
{
builder.Builder.Append(chars[idx].char);
}.There was a problem hiding this comment.
the enumerator is a struct. so removing it is unnecessary.
| { | ||
| if (span.IsEmpty) | ||
| { | ||
| throw new ArgumentException(); |
There was a problem hiding this comment.
Use ArguementOutOfRangeException( $"{ nameof( span ) } was expected to be non empty.") or at least add which parameter it is.
| return true; | ||
| } | ||
|
|
||
| private bool TryAddComplexEscape( |
There was a problem hiding this comment.
Add a <summary> giving a quick overview of what a "complex escape" is.
There was a problem hiding this comment.
Sure. I can doc that.
There was a problem hiding this comment.
Changed the name to be something liek TryAddSingleCharEscape and TryAddMultiCharEscape.
|
@CyrusNajmabadi I've added a few comments, mainly around clarifying usage. |
| namespace Microsoft.CodeAnalysis.VirtualChars | ||
| { | ||
| /// <summary> | ||
| /// The Regex parser wants to work over an array of characters, however this array of characters |
There was a problem hiding this comment.
Maybe the references to the Regex parser should be altered, as VirtualChar is know independent for the other pull request?
|
@CyrusNajmabadi Thank you that much better. |
|
@jmarolf @dotnet/roslyn-ide Anyone want to take a looksie? Thanks! |
| /// multiple chars, <see langword="default"/> will also be returned in those cases. All these | ||
| /// cases could be relaxed in the future. But they greatly simplify the implementation. | ||
| /// </summary> | ||
| ImmutableArray<VirtualChar> TryConvertToVirtualChars(SyntaxToken token); |
There was a problem hiding this comment.
I am normally fine with the pattern of returning null for failure cases but returning a default for a struct seems odd. Perhaps use optional to indicate whether the conversion succeeded or failed?
There was a problem hiding this comment.
I'm very torn about this issue with ImmutableArray's. I tried going that route one time and it felt pretty weird given that ImmutableArray then still has a state needed to check for. Given that ImmutableArray<T> is really just a nicer form of T[] and checking T[] against default is pretty sane, i'm not sure i want to change this. But if you feel strongly, i can.
There was a problem hiding this comment.
It stood out as odd to me on first pass. I don't feel too strongly about it, though. ImmutableArray is a weird class so if we just call out that we return default for error cases in the doc comments I think I'm satisfied.
There was a problem hiding this comment.
I'll leave as is. I think the name (TryXXX) and docs explain things acceptably well :)
|
@jcouv Could you take a look if you have time? Tnx! |
| /// ( '\' and 'z' ) back to the ranges of characters the user wrote. | ||
| /// | ||
| /// VirtualChar serves this purpose. It contains the interpretted value of any language character/ | ||
| /// character-escape-sequence, as well as the original SourceText span where that interpretted |
| } | ||
|
|
||
| var lastVC = result.Last(); | ||
| Debug.Assert(lastVC.Span.End == token.Span.End - 1, "Last span has to end right before the end of hte string token (including its trailing delimeter)."); |
| /// Helper to convert simple string literals that escape quotes by doubling them. This is | ||
| /// how normal VB literals and c# verbatim string literals work. | ||
| /// </summary> | ||
| protected ImmutableArray<VirtualChar> TryConvertSimpleDoubleQuoteString( |
| { | ||
| Debug.Assert(!token.ContainsDiagnostics); | ||
|
|
||
| var tokenText = token.Text; |
There was a problem hiding this comment.
Why .Text here, but .ValueText above?
There was a problem hiding this comment.
Very intentional. We use .ValueText because we want to know that the interpretted VirtualChar chars sum up to the .ValueText. However, we want to use .Text here becaue we want to ensure that the uninterpretted VirtualChar spans completely span it and hold for certain invariants.
In a very real sense VirtualChar is the bridge API between .Text and .ValueText. The Spans of the VirtualChars come from us breaking up .Text into pieces. And each VirtualChar has a char that is then each char of .ValueText in order.
Does that make sense?
There was a problem hiding this comment.
Makes sense. Thanks for the explanation
There was a problem hiding this comment.
@CyrusNajmabadi Could you document this in the code, so that future you can remember,
There was a problem hiding this comment.
Added to docs on hte interface.
| /// how normal VB literals and c# verbatim string literals work. | ||
| /// </summary> | ||
| protected ImmutableArray<VirtualChar> TryConvertSimpleDoubleQuoteString( | ||
| SyntaxToken token, string startDelimeter, string endDelimeter) |
There was a problem hiding this comment.
Typo: startDelimiter and endDelimiter
There was a problem hiding this comment.
Consider documenting the delimiter parameters.
| { | ||
| if (!TryAddEscape(result, tokenText, offset, index)) | ||
| { | ||
| return default; |
There was a problem hiding this comment.
Should we release the result array builder here?
Never mind :-)
| } | ||
| } | ||
|
|
||
| return result.ToImmutable(); |
There was a problem hiding this comment.
.ToImmutableAndFree()
Never mind ;-)
| Return "'" + c + "'" | ||
| End Function | ||
| End Class | ||
| End Namespace |
There was a problem hiding this comment.
Since the whole feature is split into a few PRs, it feels like this should go into a feature branch.
There was a problem hiding this comment.
sure. i can do that.
|
|
||
| private ImmutableArray<VirtualChar> TryConvertStringToVirtualChars(SyntaxToken token) | ||
| { | ||
| const string StartDelimeter = "\""; |
There was a problem hiding this comment.
More places with typo for "delimiter"
jcouv
left a comment
There was a problem hiding this comment.
LGTM modulo remaining "delimiter" spelling.
Check with IDE team on creating a feature branch (since multiple PRs for this feature).
|
@jcouv I can't make a branch in roslyn. Could you make a branch for me off of master called "features/embeddedRegex"? I can then make PRs against that. Thanks! |
|
@jcouv This now targets the new feature branch. This PR has two signoffs, so can you merge in? |
From the #23984 PR:
The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:
These are all ways of writing the
\1regex.In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.
and
So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically
\and1) while also maintaining the knowledge of where those characters came from (for example, that1came from\u0031in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between
@""and""strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.