Skip to content

Sg embedded sources#44768

Merged
chsienki merged 19 commits intodotnet:masterfrom
chsienki:sg_embedded_sources
Jun 5, 2020
Merged

Sg embedded sources#44768
chsienki merged 19 commits intodotnet:masterfrom
chsienki:sg_embedded_sources

Conversation

@chsienki
Copy link
Member

@chsienki chsienki commented Jun 1, 2020

Embed generated source texts in the portable PDB. Allows stepping in etc. in the IDE as well as a dump debugging after compilation.

Also some minor code cleanup that I noticed as I was implementing.

@chsienki chsienki added this to the 16.7.P3 milestone Jun 2, 2020
@chsienki chsienki marked this pull request as ready for review June 2, 2020 17:51
@chsienki chsienki requested review from a team as code owners June 2, 2020 17:51
@chsienki
Copy link
Member Author

chsienki commented Jun 2, 2020

@dotnet/roslyn-compiler for review please

private readonly Action<SourceGeneratorContext> _onExecute;
private readonly string _sourceOpt;

public CallbackGenerator(Action<InitializationContext> onInit, Action<SourceGeneratorContext> onExecute, string sourceOpt = "")
Copy link
Member

@jaredpar jaredpar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than sourceOpt why not use string? source = "" and let nullability enforce the optional ness? #Resolved

{
private readonly Action<InitializationContext> _onInit;
private readonly Action<SourceGeneratorContext> _onExecute;
private readonly string _sourceOpt;
Copy link
Member

@jaredpar jaredpar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than take the source as a parameter to the type why not change _onExecute to be Func<SourceGeneratorContext, string?>? #WontFix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the uses of this generator don't actually add source. We just increment a counter or whatever to check it was called. Making the callback return source just makes it noiser


In reply to: 434194957 [](ancestors = 434194957)

public InMemoryAdditionalText(string path, string content)
{
Path = path;
_content = SourceText.From(content);
Copy link
Member

@jaredpar jaredpar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all other SourceText.From you explicitly passed Encoding.Utf8. Why is this usage different? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oversight. I'll fix it, but it only actually matters when we're emitting, so we don't hit this in the tests. We have an open tracking note on the board about how to simplify this.


In reply to: 434195219 [](ancestors = 434195219)

using System.Collections.Generic;
using System.Text;
using Microsoft.CodeAnalysis.Text;
using System.IO;
Copy link
Member

@jaredpar jaredpar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem uneeded #Resolved

// We pass it to the generators, which will realize any symbols they require.
compilation = RunGenerators(compilation, Arguments.ParseOptions, generators, additionalTexts, diagnostics);

// https://github.com/dotnet/roslyn/issues/44087
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we remove this comment? Feel like this code still needs to be cleaned up once we have a way to determine if SyntaxTree instances are generated or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I started down that path, but it turns out we don't actually need it. @agocke work removing the diagnostics from the syntax tree means we won't have to parse with the config present. Because the syntax trees are deterministically ordered, we should always just be able to count beyond the length of the Arguments.SyntaxTrees to know which were added by generators?


In reply to: 434196389 [](ancestors = 434196389)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly, It feels like there is a lot of 'spaghetti' tangle in the way we handle this, various different levels of the compiler adjust the embedded texts and the config options based on their own rules and knowledge of what's coming later. I don't particularly like it, and it feels ripe for a refactor/cleanup pass but this PR at least doesn't make it worse.


In reply to: 434210846 [](ancestors = 434210846,434196389)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K. Have we checked with @jasonmalinowski on whether or not he feels the IDE team will need to be able to recognize generated SyntaxTree instances over non-generated ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll almost definitely need to do it in the future, its just not needed for this particular feature. (The IDE already works with the embedded texts and doesn't need to know they're generated)


In reply to: 434221150 [](ancestors = 434221150)

if (!sourceFileAnalyzerConfigOptions.IsDefault)
{
sourceFileAnalyzerConfigOptions = sourceFileAnalyzerConfigOptions.AddRange(generatedSyntaxTrees.Select(f => analyzerConfigSet.GetOptionsForSourcePath(f.FilePath)));
}
Copy link
Member

@jaredpar jaredpar Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}
``` #Resolved


// Clean up temp files
CleanupAllGeneratedFiles(src.Path);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to test some more cases:

  • Multiple generated files
  • Multiple generated files with identical hint values
  • Case where an empty file is returned

Generally do we have tests where the hint path contains known invalid file name characters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more tests, but ran into #42628, which I've now addresses as part of this PR too.

We aren't currently putting these down on disk, just using them as identifiers in the PDB, so we should be good from a characters perspective. We'll need to address it when we put them to disk though (presumably some form of encoding?)


In reply to: 434220765 [](ancestors = 434220765)

// We pass it to the generators, which will realize any symbols they require.
compilation = RunGenerators(compilation, Arguments.ParseOptions, generators, additionalTexts, diagnostics);

// https://github.com/dotnet/roslyn/issues/44087
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K. Have we checked with @jasonmalinowski on whether or not he feels the IDE team will need to be able to recognize generated SyntaxTree instances over non-generated ones?

[InlineData("partial class D {}", "file1.cs", "partial class D {}", "file2.cs")] // same files, different names
[InlineData("partial class D {}", "file1.cs", "partial class D {}", "file1.cs")] // same files, same names
[InlineData("partial class D {}", "file1.cs", "", "file2.cs")] // empty second file
[InlineData("partial class D {}", "file1.cs", "", " \\ / : * ? \" < > | ")] // illegal path name chars
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we just silently drop the illegal path name characters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they're allows as a valid hint name. We're just embedding it into the PDB not putting it to disk at the moment. We'll need to encode when we put to disk (filenames, especially on windows, look like a minefield :/).

I'll add text to the second file to make it clear we're using it.


In reply to: 434736894 [](ancestors = 434736894)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmat will portable PDB's handle path names with illegal characters okay?

Copy link
Member Author

@chsienki chsienki Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmat can confirm, but the PDB itself seems fine. Interestingly VS chokes a bit when hitting a file with them: it steps in, and you can actually step over each line and see the output, but no file opens in the IDE itself (no crash or other visible errors though)


In reply to: 434767181 [](ancestors = 434767181)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should file a follow up item here to resolve for 16.8. Need to take into account the following:

  • How does hint path impact the physical file on disk when we're in dump to disk mode
  • How do we deal with generators where added files hint paths differ by only illegal file system characters
  • How do illegal characters in the path impact other parts of the experience

Don't think we necessarily need to solve this for 16.7 though. One simple approach though is just ban illegal file system characters in hint paths because at the end of the day this is meant to represent a file name on disk 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Portable PDB has no requirements on path strings. Note that all characters except for \0 are valid path characters on Linux.

Copy link
Member

@tmat tmat Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, interesting test could be unmatched Unicode surrogates. PDB should handle them as well (and preserve their value). Not sure what happens elsewhere in the system though.

}
}

internal class SimpleGenerator2 : SimpleGenerator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we need this class here? It doesn't seem to add anything over SimpleGenerator


internal GeneratorDriver(GeneratorDriverState state)
{
Debug.Assert(state.Generators.GroupBy(s => s.GetType()).Count() == state.Generators.Length); // ensure we don't have duplicate generator types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to restrict generator types to be unique? Consider the case where someone authors a really handy generator library and two different generators consume it. That seems like a valid case where we would have duplicate generator types.

Copy link
Member Author

@chsienki chsienki Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the rationale here was how they're actually loaded from the command line: we pass in a set of references and create the instance ourselves. That means there can only ever be one type per driver instance. That means that we can safely use the type name of the generator as a prefix for disambiguation.

Now we can of course allow multiple of the same type when constructing in tests, but it doesn't actually reflect the way user consume it from the outside. I put it in as a Debug.Assert to ensure that what we're testing is actually what happens in the real case. SourceGeneratorDriver is public and users are free to put the same type in twice if they're using it as an API.

I wasn't sure about this tbh: I'd obviously prefer that we allow multiple of the same type in the tests (it make it much easier) but doesn't reflect the real situation, and gets into weirdness with the prefix. We could make generators provide a unique prefix? But still doesn't stop collisions.


In reply to: 434739560 [](ancestors = 434739560)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although writing that I realize that the full name of the type isn't actually enough to be unique right? I can have two types with the same full name in two different assemblies?


In reply to: 434750141 [](ancestors = 434750141,434739560)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means there can only ever be one type per driver instance.

Consider the case where I have authored a library Generator.Utility.dll which has an accessible type SimpleGenerator that implements ISourceGenerator. Two different generators could reference this library and expose that type.

Hmmm ... Or do we scan the generator assembly directly and look for derivations of ISourceGenerator? That wouldn't have this problem but if we had a factory model we would have this issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We scan for a combo of derived form ISourceGenerator and marked with the SourceGeneratorAttribute. I'll have to double check, but I believe in your example we'd still only end up with one generator (we use the same mechanism as analyzer discovery, so its what ever that does)


In reply to: 434766641 [](ancestors = 434766641)


if (AdditionalSources.Contains(hintName))
{
throw new ArgumentException(CodeAnalysisResources.HintNameUniquePerGenerator, nameof(hintName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we file a bug to track this? I don't think this is sustainable long term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That hint names must be unique within a generator?


In reply to: 434739910 [](ancestors = 434739910)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Generators essentially represent a form of distributed development. Any aspect of the implementation which requires them to coordinate, such as not having conflicting hint names, makes authoring reliable generators difficult. The only safe approach would be to always use a GUID at which point it's not much of a hint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they only have to be unique within a generator. Different generators can have the same hint name.


In reply to: 434765041 [](ancestors = 434765041)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Assume we have tests for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah Generator_HintName_MustBe_Unique and Generator_HintName_Is_Appended_With_GeneratorName


In reply to: 434770864 [](ancestors = 434770864)

Ensure ordering of syntax trees
@chsienki
Copy link
Member Author

chsienki commented Jun 4, 2020

Well, that ended up being a bigger rabbit hole than I intended to go down, but it did end up fixing a couple of extra bugs (fixes #42627 and #42593).

I've ended up going the route that we limit hint names to a specific set of characters (IdentifierName + [.,-_ []{}()]). I moved the logic into the AdditionalSourcesCollection and added a bunch of tests. That ended up surfacing issue (#42627), which was fixed by ensuring we maintain the ordering source texts are added and also the order in which generators are run. Fixing the ordering of generators in turn removed the dictionary in #42593.

for (int i = 0; i < _sourcesAdded.Count; i++)
{
// check the hashcodes, as thats the comparison we're using to put the names into _hintNames
if (_sourcesAdded[i].HintName.GetHashCode() == hintName.GetHashCode())
Copy link
Member Author

@chsienki chsienki Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered making the GeneratedSourceTexts implement equals and hashcode in terms of the hintName, but they're only equal in the context of a single generator (which is equivalent to a single AdditionalSourcesCollection)

I suppose we could use a custom comparer just in this class and check the hintName? That would allows us to get rid of the two collections and locks? EDIT: we can, but it means we don't get O(1) lookup for things in the collection, because we have to iterate over each time (and thus still need to lock anyway) so i'm not sure it's worth it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit complicated. Why not just use a Dictionary<string, GeneratedSourceText> instead of having parallel collections here? Could event use a ConcurrentDictionary and avoid all of the lock usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason was that Dictionary isn't ordered. This ensures that we get syntax trees back in the order in which they were added. Presumably we could use an ordered dictionary and sort by the hintName, which would at least be consistent, but not the order they were added in. Do you think it matters, as long as its deterministic?


In reply to: 435565199 [](ancestors = 435565199)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up simplifying and just keeping an array builder. That way we'll keep the ordering as they're added, but don't need the locks of the second collection. It means Remove() and Contains() (and by extension Add()) are all O(N) operations now, as we have to check the whole collection for collisions. If it turns out to be a perf bottleneck we can fix it later on.

This also makes us not-thread safe as a generator could theoretically create a race condition that gets two additions into the collection with the same hintName. This could cause the compiler to crash further on (when adding to the PDBs for instance), but for now I think we're willing to live with that. I mean they could just Environment.FailFast if that was their aim.


In reply to: 435637087 [](ancestors = 435637087,435565199)

internal AdditionalSourcesCollection()
{
_sourcesAdded = PooledDictionary<string, SourceText>.GetInstance();
_hintNames = PooledHashSet<string>.GetInstance();
Copy link
Member

@jaredpar jaredpar Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide an explicit string comparer here? #Resolved

@@ -13,44 +14,108 @@ namespace Microsoft.CodeAnalysis
{
internal sealed class AdditionalSourcesCollection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a per generator collection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we create a new one for every invocation of every generator.


In reply to: 435563892 [](ancestors = 435563892)

for (int i = 0; i < _sourcesAdded.Count; i++)
{
// check the hashcodes, as thats the comparison we're using to put the names into _hintNames
if (_sourcesAdded[i].HintName.GetHashCode() == hintName.GetHashCode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit complicated. Why not just use a Dictionary<string, GeneratedSourceText> instead of having parallel collections here? Could event use a ConcurrentDictionary and avoid all of the lock usage.

for (int i = 0; i < _sourcesAdded.Count; i++)
{
// check the hashcodes, as thats the comparison we're using to put the names into _hintNames
if (_sourcesAdded[i].HintName.GetHashCode() == hintName.GetHashCode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't compare HashCode here because you can get false matches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we're comparing the hash of the string, it'll always be the same if they're considered equal, no? Otherwise the HashSet wouldn't work either?


In reply to: 435565433 [](ancestors = 435565433)


private static string AppendExtensionIfRequired(string hintName)
{
if (!Path.GetExtension(hintName).Equals(".cs", StringComparison.OrdinalIgnoreCase))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for comparing hint names in general is ordinal equality but the path is ignore case. Should be consistent here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I've made all the hint path comparison be case insensitive as we'll need them to be to support windows anyway. Added more test case theory data to cover it.


In reply to: 435565786 [](ancestors = 435565786)

- Simplify AdditionalSourceCollection
- Ensure we always compare case-insensitively for hintPaths
private readonly PooledDictionary<string, SourceText> _sourcesAdded;
private readonly ArrayBuilder<GeneratedSourceText> _sourcesAdded;

private readonly StringComparer _hintNameComparer = StringComparer.OrdinalIgnoreCase;
Copy link
Member

@jaredpar jaredpar Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making static readonly #Resolved

{
if (_hintNameComparer.Equals(_sourcesAdded[i].HintName, hintName))
{
_sourcesAdded.Remove(_sourcesAdded[i]);
Copy link
Member

@jaredpar jaredpar Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_sourcesAdded.Remove(_sourcesAdded[i]);
_sourcesAdded.RemoveAt(i);
``` #Resolved

// https://github.com/dotnet/roslyn/issues/42627: This needs to be consistently ordered
ArrayBuilder<GeneratedSourceText> builder = ArrayBuilder<GeneratedSourceText>.GetInstance();
foreach (var (hintName, sourceText) in _sourcesAdded)
if (!Path.GetExtension(hintName).Equals(".cs", StringComparison.OrdinalIgnoreCase))
Copy link
Member

@jaredpar jaredpar Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider using hintName.EndsWith(...) here? Using GetExtension means that operations like Contains end up being allocating. #Resolved

@chsienki chsienki mentioned this pull request Jun 5, 2020
@agocke
Copy link
Member

agocke commented Jun 5, 2020

I don't understand why the tests keep passing Encoding.UTF8 to SourceText.From.

@chsienki
Copy link
Member Author

chsienki commented Jun 5, 2020

Becuase the PDB code checks that a SourceText with a file path must have an encoding, or they'll fail at emit. We've talked about making it a default, but not done yet.


In reply to: 639759958 [](ancestors = 639759958)

@agocke
Copy link
Member

agocke commented Jun 5, 2020

Got it, so the purpose is to write out the source text as UTF-8 when embedding into the PDB? Is there a place where we validate that every incoming source text has an encoding? I couldn't easily find it in this PR.

@chsienki
Copy link
Member Author

chsienki commented Jun 5, 2020

The actual check happens here in Compilation http://sourceroslyn.io/#Microsoft.CodeAnalysis/Compilation/Compilation.cs,2076

We have another issue about making this better, but it's not actually directly related to the embedded requirement. We're thinking we'll just make SourceText.From use UTF8 by default if it's not specified.


In reply to: 639827040 [](ancestors = 639827040)

@agocke
Copy link
Member

agocke commented Jun 5, 2020

We're thinking we'll just make SourceText.From use UTF8 by default if it's not specified

That's fine. But what we should avoid is only giving a diagnostic when the sources are embedded as we don't want generators to function when embedding is turned off, then start to break when it's suddenly turned on.

Copy link
Member

@agocke agocke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chsienki
Copy link
Member Author

chsienki commented Jun 5, 2020

We're thinking we'll just make SourceText.From use UTF8 by default if it's not specified

That's fine. But what we should avoid is only giving a diagnostic when the sources are embedded as we don't want generators to function when embedding is turned off, then start to break when it's suddenly turned on.

Nah, it was doing it before anyway, so no change there. Thanks for the review!

@chsienki chsienki merged commit e6afc95 into dotnet:master Jun 5, 2020
@ghost ghost modified the milestones: 16.7.P3, Next Jun 5, 2020
@RikkiGibson RikkiGibson modified the milestones: Next, 16.7.P3 Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants