Skip to content

Significantly improve performance of dependent-project-finding#44952

Merged
CyrusNajmabadi merged 45 commits intodotnet:masterfrom
CyrusNajmabadi:depProjFinder
Jun 12, 2020
Merged

Significantly improve performance of dependent-project-finding#44952
CyrusNajmabadi merged 45 commits intodotnet:masterfrom
CyrusNajmabadi:depProjFinder

Conversation

@CyrusNajmabadi
Copy link
Contributor

@CyrusNajmabadi CyrusNajmabadi commented Jun 8, 2020

This saves a full 45+ seconds on find-references of a non-private source-symbol in Roslyn.sln after first load.

Specifically, when someone does a FAR call on a non-private source symbol we first start by saying "which projects could see non-private symbols from this project?". This is the set of projects in the solution that:

  1. are a C#/VB project.
  2. have a direct reference to the source-project.
  3. have IVT access, in case this is an an internal symbol.

In practice computing if another project has IVT access is extremely expensive.

First, It requires producing a compilation for the current project to determine all the IVT attributes in it. This involves parsing all files in it, loading metadata for it, producing compilations for all projects it itself depends on, finding all the IVT attributes, and binding them. This part is fine though as we already have teh starting compilation (since we needed it to get the FAR symbol in the first place). Second, for all possibly downstream dependent projects, it involves getting the compilations for thsoe as well to get teh current/other IAssemblySymbol and see if they have IVT. Getting this assembly symobl is again not cheap as it needs to at least parse everything in that project (including the dependencies of those projects).

--

The PR here keeps the majority of hte original algorithm, but tweaks '3' to be much faster. Specifically, instead of going back to the compiler for a strict/exact answer to "Does project X have IVT to project Y?", we change it to "Could project X have IVT to project Y" and we determine this just by getting the names of the assemblies project X has IVT to and doing a simple name check of that assembly name against that of the project we're currently looking at.

This dramatically reduces teh cost to determine potentially dependent projects. In Roslyn.sln this saves a full 45 seconds (indeed, that part of hte process goes from 46.67s on one run to just 0.56s.

The downside here is the potential for inaccuracy. For example, if someone were to write things akin to:

then we wouldn't understand that. The question is if we would require that level of correctness, or if the savings here are beneficial enough to warrant taking something like this.

See below for some options we could to address this if we feel it is significant.

@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review June 8, 2020 21:07
@CyrusNajmabadi CyrusNajmabadi requested review from a team as code owners June 8, 2020 21:07
@CyrusNajmabadi
Copy link
Contributor Author

Potential things we could do about the chance for missing a true IVT because it doesn't match.

  1. We can look for thigns we know would break this. For example, adding a using Foo = System.Runtime.CompilerServices.InternalsVisibleToAttribute. Or [InternalsVisibleTo(SomeClass.SomeConstant)].
  2. If that happens, we mark the project as not being suitable for the fast-check, and we fall back to teh slow check.

@CyrusNajmabadi CyrusNajmabadi changed the title Improve performance of dependent-project-finding Significantly improve performance of dependent-project-finding Jun 8, 2020
@jasonmalinowski
Copy link
Member

Or [InternalsVisibleTo(SomeClass.SomeConstant)].

The one I think that might be problematic is if they're doing string concatenation. Since the IVT requires a fully qualified strong name, I think I've seen people do stuff like:

const string PublicKeySuffix = ", PublicKey=...."
[assembly: InternalsVisibleTo("Assembly1" + PublicKeySuffix)]
[assembly: InternalsVisibleTo("Assembly2" + PublicKeySuffix)]

@CyrusNajmabadi
Copy link
Contributor Author

The one I think that might be problematic is if they're doing string concatenation.

Yeah, i think you're right. Will see waht it woudl take to support that. Writing tests now.

@sharwell
Copy link
Contributor

sharwell commented Jun 8, 2020

have IVT access, in case this is an an internal symbol.

So this is wrong anyway. Example:

using System;

namespace ConsoleApp1
{
    class Program
    {
        /// <seealso cref="EnvironmentHelpers"/>
        static void Main(string[] args)
        {
        }
    }
}

@CyrusNajmabadi
Copy link
Contributor Author

The one I think that might be problematic

No longer a problem. We let the compiler make the determination here.


namespace Roslyn.Utilities
{
internal static class AsyncLazy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❔ Why?

Copy link
Contributor Author

@CyrusNajmabadi CyrusNajmabadi Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes construction much nicer. you can just do AsyncLazy.Create(GetColorSchemeRegistryItemsAsync) isntead of new AsyncLazy<ImmutableDictionary<SchemeName, ImmutableArray<RegistryItem>>>(GetColorSchemeRegistryItemsAsync) (an actual example in our code).

symbol = symbol.OriginalDefinition;
var containingAssembly = symbol.ContainingAssembly;
if (containingAssembly == null)
var assemblyAndSourceProject = GetAssemblyAndSourceProject(solution, symbol, cancellationToken);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always pass this pair around. And tehy represent the same concept "the origination informatino for the symbol". so i just made it a tuple to make that clear and to keep them always paired together consistently.

.WithChangedOption(StorageOptions.Database, StorageDatabase.SQLite)));

Console.WriteLine("Opening roslyn");
var start = DateTime.Now;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var start = DateTime.Now;
var start = DateTime.UtcNow;

Lest your tests all fail on a daylight savings time transition, or on a train moving between time zones, or... ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this applies to the other files too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super supre super don't care :) It's just info for me to run locally to collect wallclock time.

var dependentProjectsMap = s_dependentProjectsCache.GetValue(solution, _ => new DependentProjectMap());

var asyncLazy = dependentProjectsMap.GetOrAdd(
new DefinitionProject(symbolOrigination.sourceProject?.Id, symbolOrigination.assembly.Name.ToLower()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "name" here just the name or the full identity of the assembly w/ version and other stuff? Should this just be using IAssemblySymbol.Identity rather than the extra string allocations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if two unrelated projects point to different versions of an assembly on disk that have different IVTs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the name is the simple name of the assembly, without all the identity goop for it.
  2. if two unrelated projects point to different versions of the assembly, we'll likely say that both projects have a dependency on the assembly, nd we will consider them candidates for searching.
  3. This would only really be relevant if somehow you had those multiple projects referencing multiple different assemblies and those assemblies have IVT to those projects specifically somehow.
  4. even in that rare case, all that would happen is that we would just search that project for hits.

That's really all that's going on here. This code is not about correctness, it's about performance. i.e. it's trying to limit the number of projects we look at. We could technically look at all projects, however that would be much slower since we'd have to make all the compilations ofr all of them.

So:

previously: we would strictly figure out the set of projects P_1 that we definitely needed to search.
now: we figure out the set of projects P_2 which is equal to or a superset of P_1. We save all the time computing P_1, but we will pay extra time searching the extra P_2 - P_1. In practice, i expect that extra set of projects to be empty, and even if there are a couple of extra projects in it, it should still be much faster than checking all the extra projects like we do today.

Copy link
Member

@jasonmalinowski jasonmalinowski Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is this scenario:

  1. Project A depends on a Metadata.dll that doesn't have any IVT attributes in it.
  2. Project B also depends on a different Metadata.dll, but it does use internals via IVT from it's Metadata.dll.
  3. Some search happens on a public type in Metadata.dll, and the analysis is done with the IAssemblySymbol that is seen through A; the cache is populated that Metadata.dll is consumed from A and B, but because we're using the IAssemblySymbol of the Metadata.dll as seen through A, neither are marked with "has internal access".
  4. A search happens in B's code on an internal type obtained from Metadata.dll through the IVT B sees. Because it's internal, we fetch the entry from step 3, and filter it to only things with internal access, which is nothing. B's code is never searched, so the identifier you invoked find reference on might not be found!

I see a few approaches for this:

  1. Delete the cache. Problem solved; I think you said privately you are already doing that, so maybe that's the easiest answer.
  2. [If we have to restore the cache...] Don't care. I'm hesitant to say this only because I don't know how widespread this is, I do know how difficult it will be to figure out a customer report if it does happen. I'm also wondering if Roslyn could hit this itself: our CodeStyle layer builds against NuGet packages that are older: would we end up building the IVT list based upon the IVTs in the NuGet package we build against, a newer IVT added in the source might potentially get skipped...? Consider my scenario above, where Project A is our Code Style layer and B is some code that has IVT into our Workspaces layer. Minimally, if we choose to not care, a comment at least recognizing the cache issue might be wise so that way if somebody else is debugging and thinks the cache is wrong, they know they're not crazy.
  3. [If we can't delete the cache and find we must care...] Fix it. We already have to walk through all projects and all IAssemblySymbols to actually see which projects are referencing the assembly name in question; at that point we've collected all the IAssemblySymbols that might have an IVT we need to consider -- if we union all of the IVTs together of all the those assembly symbols, then the cache is safe. Assuming you don't process the same IAssemblySymbol more than once, this shouldn't really be any slower than what we have now. I'm guessing multitargeting might mean each IAssemblySymbol might be different, but it's easy to ask if two come from the same PE file path on disk, which would be sufficient for our needs here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the cache. Problem solved; I think you said privately you are already doing that, so maybe that's the easiest answer.

I'm going to take this approach. In my branch i see no impact in removing it. Do you want that in this PR, or in the followup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just follow up since you already have that one readyish.

/// </summary>
private static readonly ConditionalWeakTable<Solution, ConcurrentDictionary<DefinitionProject, ImmutableArray<DependentProject>>>.CreateValueCallback s_createDependentProjectsMapCallback =
_ => new ConcurrentDictionary<DefinitionProject, ImmutableArray<DependentProject>>(concurrencyLevel: 2, capacity: 20);
private static readonly ConditionalWeakTable<Solution, DependentProjectMap> s_dependentProjectsCache =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be really curious what the hit rate is of this cache in the end...

Copy link
Contributor Author

@CyrusNajmabadi CyrusNajmabadi Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% when you perform multiple operations in a row. and it's a massive speedup. i will have numbers tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, i'm thinking of other caches. I think this cache is not necessary. But i'm going to remove in a future change (i'm making several changes here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my thinking: given it's keyed on Solution it's only going to matter for multiple searches when no changes happened in the middle, and if the key is a project/assembly pair it has to be multiple searches originating from the same place. Was just curious, but if we think it's worth it I'm not against keeping it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i no longer think ti's important based on perf numbers i'm seeing. But i'm going to hold off on removing to an independent PR where i can validate making just that change :)

var resolvedSymbol = sourceAssemblySymbolKey.Resolve(compilation, cancellationToken: cancellationToken).Symbol;
if (resolvedSymbol is IAssemblySymbol sourceAssemblyInTargetCompilation)
{
return targetAssembly.IsSameAssemblyOrHasFriendAccessTo(sourceAssemblyInTargetCompilation);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're no longer checking this, do we still do this somewhere in FAR? I'm imagining a case where:

  1. Project B depends on A, which grants IVT to an assembly with the same name but the compiler says no.
  2. I find refs on some symbol in A.

We can't call the compiler method to check until we have both compilations of course, but by dropping this are we going to start doing further checks (binding/tree walking) that would have been avoided in this other case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. we will def do extra checks in that case. i'm weighing off the chance of that happening (and the consequences) against all this work that we're doing now. i.e. we do 45+ seconds worth of computation to try to just figure out the precise set, insetad of jsut having a slightly larger set and potentially have a few more hits we might have to filter out.

How likely do we think it is that someone has IVT to a same-named project that hte compiler says 'no' to? wouldn't that be the rarer and less expected case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'm all for making this the cheap and approximate check. I just wasn't sure if FAR still does call IsSameAssemblyOrHasFriendAccessTo after we have the Compilation but before we actually walk trees and do other expensive stuff. I'd hate to send FAR down a wild goose chase if it was missing the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have no sense in the "real world" what sorts of strangeness is out there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'm all for making this the cheap and approximate check. I just wasn't sure if FAR still does call IsSameAssemblyOrHasFriendAccessTo after we have the Compilation but before we actually walk trees and do other expensive stuff. I'd hate to send FAR down a wild goose chase if it was missing the check.

No, it doesn't do that check later. We do always check that the reference is a true reference.

I also have no sense in the "real world" what sorts of strangeness is out there.

I think having the uncommon cases pay this price instead of having everyone pay this price is desirable.

Again, for this to trigger, you would need ot have such a project arrangement where you had a project that said it had IVT to another project with the same name, but didn't intend for it to see internals. And in that case, all that would happen is that we will end up checking that project, where the majority of the time will be getting the compilation for it. And that's the cost we're already paying today, just that we pay it no matter what.

i.e. today we go to all these potential matches, and we then make a compilation (the most expensive part of find refs) just to ask "should we really check you?". This PR says: yes, we should check you, and pushes the getting of the compilation until teh point that we really actually would need to bind and determine if a symbol matches.

So, in the case where the downstream project does reference the name that we're looking for, we pay the same price. But in the common case that the downstream project doesn't, we are much much cheaper.

i.e. if i'm searching for a public symbol from MS.CodeAnalysis, then wasting all this computation to determine if we have IVT is irrelevant and wasteful. And, if i'm searching for an internal symbol, then precomputing the compilation for all downstream projects prior to even determining if there's a potential match in them is also wasteful.

Currently, getting the compilation is the most expensive part of FAR. Anything that pushes that off until absolutely necessary is seriously beneficial.

End Function

<WpfTheory, CombinatorialData, Trait(Traits.Feature, Traits.Features.FindReferences)>
Public Async Function TestInternalMethodSeenInDownstreamProjectWithIVT_NonLiteralString(kind As TestKind, host As TestHost) As Task
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we removed the original approach that lead to these tests, do we still need them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically no. But i think ti doesn't hurt. i.e. if someone came in here later and thought they coudl optimize with syntactic checks, this would catch potential issues.

@CyrusNajmabadi CyrusNajmabadi merged commit 1967d75 into dotnet:master Jun 12, 2020
@ghost ghost added this to the Next milestone Jun 12, 2020
@CyrusNajmabadi CyrusNajmabadi deleted the depProjFinder branch June 13, 2020 02:12
@dibarbet dibarbet modified the milestones: Next, 16.7.P4 Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants