Skip to content

Extract out the VirtualChar portion of the Regex and JSON editor work.#24111

Merged
jcouv merged 27 commits intodotnet:features/embeddedRegexfrom
CyrusNajmabadi:virtualChars
Apr 1, 2018
Merged

Extract out the VirtualChar portion of the Regex and JSON editor work.#24111
jcouv merged 27 commits intodotnet:features/embeddedRegexfrom
CyrusNajmabadi:virtualChars

Conversation

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor

From the #23984 PR:

The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:

"\\1"       // In a normal string, we have to escape the \
@"\1"       // But not in a verbatim string
"\\\u0031"  // The '1' could be escaped
"\\u005c1"  // Even the backslash *itself* may be escaped

These are all ways of writing the \1 regex.

In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.

image

and

image

So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.

With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.

@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner January 9, 2018 00:50
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jasonmalinowski Pulled out 700 lines. That makes the other PR only 37k large. :D

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Tagging @dotnet/roslyn-ide . Review plz.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

Also, i'll likely refactor the regex and json PRs to pull out the tests into their own PR. That will help make the review more manageable (and less painful in the browser).

@sharwell sharwell self-requested a review January 9, 2018 15:53
@AdamSpeight2008
Copy link
Copy Markdown
Contributor

Personally I'd like to see Document comments on all of the methods, both public and internal ones.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@AdamSpeight2008 Anything in particular? My personal approach is to put on doc comments when i think it will provide substantive value for the consumer or maintainer of the code. If there's areas you feel would benefit, i'd be happy to add. But i'm not really going to take an approach of just adding doc comments to everything.

return (c >= '0' && c <= '9') ? c - '0' : (c & 0xdf) - 'A' + 10;
}

private static bool IsHexDigit(char c)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't cover fullwidth characters that are valid hex character in VB.net

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the C# impl.

Copy link
Copy Markdown
Contributor

@AdamSpeight2008 AdamSpeight2008 Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologise, thought it was in the base class used for both, ie the VirtualCharService

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#wontfix

{
protected abstract ImmutableArray<VirtualChar> TryConvertToVirtualCharsWorker(SyntaxToken token);

public ImmutableArray<VirtualChar> TryConvertToVirtualChars(SyntaxToken token)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CyrusNajmabadi
Add a doc comment, so the consumer understands what aspect of the syntax node is being worked on. eg. Is it the actual text from source, or value text, or displayed text?

Copy link
Copy Markdown
Contributor Author

@CyrusNajmabadi CyrusNajmabadi Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added to IVirtualCharService.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#fixed

{
var builder = PooledStringBuilder.GetInstance();

foreach (var vc in chars)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the following remove an instance of the enumerator.

for(var idx = 0; idx < chars.length; idx++)
{
  builder.Builder.Append(chars[idx].char);
}.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the enumerator is a struct. so removing it is unnecessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#wontfix

{
if (span.IsEmpty)
{
throw new ArgumentException();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ArguementOutOfRangeException( $"{ nameof( span ) } was expected to be non empty.") or at least add which parameter it is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

return true;
}

private bool TryAddComplexEscape(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a <summary> giving a quick overview of what a "complex escape" is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can doc that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name to be something liek TryAddSingleCharEscape and TryAddMultiCharEscape.

@AdamSpeight2008
Copy link
Copy Markdown
Contributor

@CyrusNajmabadi I've added a few comments, mainly around clarifying usage.
A simple <summary> on methods, giving a quick summary of what its purpose is, would help the consumer determine if that method is the one need. Intellisense discovery and exploration

namespace Microsoft.CodeAnalysis.VirtualChars
{
/// <summary>
/// The Regex parser wants to work over an array of characters, however this array of characters
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the references to the Regex parser should be altered, as VirtualChar is know independent for the other pull request?

@AdamSpeight2008
Copy link
Copy Markdown
Contributor

@CyrusNajmabadi Thank you that much better.

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jmarolf @dotnet/roslyn-ide Anyone want to take a looksie? Thanks!

/// multiple chars, <see langword="default"/> will also be returned in those cases. All these
/// cases could be relaxed in the future. But they greatly simplify the implementation.
/// </summary>
ImmutableArray<VirtualChar> TryConvertToVirtualChars(SyntaxToken token);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am normally fine with the pattern of returning null for failure cases but returning a default for a struct seems odd. Perhaps use optional to indicate whether the conversion succeeded or failed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very torn about this issue with ImmutableArray's. I tried going that route one time and it felt pretty weird given that ImmutableArray then still has a state needed to check for. Given that ImmutableArray<T> is really just a nicer form of T[] and checking T[] against default is pretty sane, i'm not sure i want to change this. But if you feel strongly, i can.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It stood out as odd to me on first pass. I don't feel too strongly about it, though. ImmutableArray is a weird class so if we just call out that we return default for error cases in the doc comments I think I'm satisfied.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave as is. I think the name (TryXXX) and docs explain things acceptably well :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#wontfix

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jcouv Could you take a look if you have time? Tnx!

/// ( '\' and 'z' ) back to the ranges of characters the user wrote.
///
/// VirtualChar serves this purpose. It contains the interpretted value of any language character/
/// character-escape-sequence, as well as the original SourceText span where that interpretted
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: interpreted

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

var lastVC = result.Last();
Debug.Assert(lastVC.Span.End == token.Span.End - 1, "Last span has to end right before the end of hte string token (including its trailing delimeter).");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: the

/// Helper to convert simple string literals that escape quotes by doubling them. This is
/// how normal VB literals and c# verbatim string literals work.
/// </summary>
protected ImmutableArray<VirtualChar> TryConvertSimpleDoubleQuoteString(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

{
Debug.Assert(!token.ContainsDiagnostics);

var tokenText = token.Text;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why .Text here, but .ValueText above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very intentional. We use .ValueText because we want to know that the interpretted VirtualChar chars sum up to the .ValueText. However, we want to use .Text here becaue we want to ensure that the uninterpretted VirtualChar spans completely span it and hold for certain invariants.

In a very real sense VirtualChar is the bridge API between .Text and .ValueText. The Spans of the VirtualChars come from us breaking up .Text into pieces. And each VirtualChar has a char that is then each char of .ValueText in order.

Does that make sense?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks for the explanation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CyrusNajmabadi Could you document this in the code, so that future you can remember,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to docs on hte interface.

/// how normal VB literals and c# verbatim string literals work.
/// </summary>
protected ImmutableArray<VirtualChar> TryConvertSimpleDoubleQuoteString(
SyntaxToken token, string startDelimeter, string endDelimeter)
Copy link
Copy Markdown
Member

@jcouv jcouv Mar 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: startDelimiter and endDelimiter

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider documenting the delimiter parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

{
if (!TryAddEscape(result, tokenText, offset, index))
{
return default;
Copy link
Copy Markdown
Member

@jcouv jcouv Mar 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we release the result array builder here?
Never mind :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

}
}

return result.ToImmutable();
Copy link
Copy Markdown
Member

@jcouv jcouv Mar 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ToImmutableAndFree()
Never mind ;-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

Return "'" + c + "'"
End Function
End Class
End Namespace
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the whole feature is split into a few PRs, it feels like this should go into a feature branch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. i can do that.


private ImmutableArray<VirtualChar> TryConvertStringToVirtualChars(SyntaxToken token)
{
const string StartDelimeter = "\"";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More places with typo for "delimiter"

Copy link
Copy Markdown
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo remaining "delimiter" spelling.
Check with IDE team on creating a feature branch (since multiple PRs for this feature).

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

CyrusNajmabadi commented Mar 31, 2018

@jcouv I can't make a branch in roslyn. Could you make a branch for me off of master called "features/embeddedRegex"? I can then make PRs against that. Thanks!

@jcouv jcouv added the Area-IDE label Apr 1, 2018
@CyrusNajmabadi CyrusNajmabadi changed the base branch from master to features/embeddedRegex April 1, 2018 20:37
@CyrusNajmabadi
Copy link
Copy Markdown
Contributor Author

@jcouv This now targets the new feature branch. This PR has two signoffs, so can you merge in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants