Remove the use of Parallel.ForEach in ClsComplianceChecker #23519

Merged
sharwell merged 4 commits into dotnet:dev15.6.x from sharwell:optimize-clscompliant
Feb 2, 2018

Conversation


@sharwell sharwell commented Dec 1, 2017

Customer scenario

Running analyzers during a build is slower than it should be, with the analyzer driver contributing substantial overhead even when the analyzers themselves are lightweight.

Bugs this fixes

Fixes #23459

Workarounds, if any

None needed

Risk

Low.

Performance impact

15-20% reduction in allocations for running IDE analyzers. The benefits extend to other analyzers that call GetDiagnostics, with a 30-70% reduction in execution time for analyzers that depend on compiler diagnostics.

Is this a regression from a previous update?

No.

Root cause analysis

AnalyzerRunner is a new tool for helping us test analyzer performance in isolation.

How was the bug found?

AnalyzerRunner.

Test documentation updated?

No.

@sharwell sharwell added this to the 15.6 milestone Dec 1, 2017
@sharwell sharwell self-assigned this Dec 1, 2017
@sharwell sharwell requested review from a team December 1, 2017 17:05
Contributor

Should we consider using a hybrid approach by changing the original check to `if (_compilation.Options.ConcurrentBuild && symbol.IsGlobalNamespace)`?

Contributor

I also noted that we are kicking off a parallel foreach for every single named type: http://source.roslyn.io/#Microsoft.CodeAnalysis.CSharp/Compiler/ClsComplianceChecker.cs,207. I think that is compounding the allocations.

We ran into similar issues when trying to make the analyzer driver concurrently process compilation events. Attempting to process all events concurrently caused a huge allocation and performance hit, and we saw optimal performance when using a worker queue limited by a certain maximum worker count: http://source.roslyn.io/#Microsoft.CodeAnalysis/DiagnosticAnalyzer/AnalyzerDriver.cs,799. Probably something to consider here?
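The bounded worker-queue pattern described above can be sketched roughly as follows. This is an illustrative toy with hypothetical names (`ProcessAllAsync`), not the actual AnalyzerDriver code: a fixed number of worker tasks drain a shared queue, instead of scheduling one unit of parallel work per item.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class BoundedWorkerQueue
{
    // Drain a shared queue with a fixed number of worker tasks instead of
    // creating one parallel work item per element.
    static async Task ProcessAllAsync<T>(ConcurrentQueue<T> queue, Action<T> process, int maxWorkers)
    {
        var workers = Enumerable.Range(0, maxWorkers)
            .Select(_ => Task.Run(() =>
            {
                // Each worker pulls items until the queue is empty.
                while (queue.TryDequeue(out var item))
                {
                    process(item);
                }
            }))
            .ToArray();
        await Task.WhenAll(workers);
    }

    static async Task Main()
    {
        var queue = new ConcurrentQueue<int>(Enumerable.Range(1, 1000));
        long sum = 0;
        await ProcessAllAsync(queue, i => Interlocked.Add(ref sum, i),
                              maxWorkers: Environment.ProcessorCount);
        Console.WriteLine(sum); // 500500
    }
}
```

The key property is that the number of allocated tasks is bounded by the worker count rather than by the number of items.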

Contributor Author

Should we consider using a hybrid approach

I did investigate an approach where the VisitNamespace method was not changed but VisitNamedType was. The result showed no improvement over baseline. I stopped looking for hybrid approaches because I couldn't find a scenario that was worse than the one I have here.

Contributor Author

➡️ Updated the Visual Basic version of the same item.

Contributor

So now we are completely removing the parallelism in this CLS compliance visitor? Seems fine to me as long as there is no build time regression, but probably someone from @dotnet/roslyn-compiler should take a look.

Member

Perhaps we should prioritize this: #3679

Contributor Author

@tmat I left a comment on that issue, suggesting that other scenarios would benefit from such an approach.

@heejaechang

heejaechang commented Dec 4, 2017

I am not sure whether the issue is allocation or tasks being starved, but I believe the caller of this method is probably already running in parallel (when the concurrent build option is on), so there is probably no reason for its nested work to be parallel again. Parallel.ForEach is especially problematic since it is a blocking API: if another parallel operation is already running, it may not find a thread to run some of its work and can block for quite a bit of time waiting for a task thread to become available.

My experience making some of our features concurrent is that if you apply multiple levels of concurrency using the Parallel extensions, the outcome is slower rather than faster. Most of the time, applying concurrency only at the outermost level is better.
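The "parallelize only at the outermost level" guidance can be illustrated with a minimal sketch (hypothetical data and names, not Roslyn code): the outer loop is parallel, and the per-group inner work stays sequential, since the outer loop already saturates the available cores.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class OuterOnlyParallelism
{
    static long ProcessGroup(int[] group)
    {
        // Inner work is sequential on purpose; nesting another
        // Parallel.ForEach here would mostly add scheduling overhead
        // and contention rather than speedup.
        long sum = 0;
        foreach (var x in group) sum += x;
        return sum;
    }

    static void Main()
    {
        // 8 groups of 100 consecutive integers covering 0..799.
        int[][] groups = Enumerable.Range(0, 8)
            .Select(g => Enumerable.Range(g * 100, 100).ToArray())
            .ToArray();

        long total = 0;
        // Parallelism applied once, at the outermost level only.
        Parallel.ForEach(groups, group =>
            Interlocked.Add(ref total, ProcessGroup(group)));

        Console.WriteLine(total); // 319600
    }
}
```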

@sharwell sharwell force-pushed the optimize-clscompliant branch from 43e626f to 6537a7b Compare December 5, 2017 12:56
@sharwell sharwell requested a review from a team as a code owner December 5, 2017 12:56

sharwell commented Dec 5, 2017

@Pilchie for ask mode


sharwell commented Dec 5, 2017

@dotnet/roslyn-compiler I need reviews from you as well

/// </remarks>
[Conditional("EMIT_CODE_ANALYSIS_ATTRIBUTES")]
[AttributeUsage(AttributeTargets.Constructor | AttributeTargets.Method | AttributeTargets.Property | AttributeTargets.Field, AllowMultiple = true, Inherited = false)]
internal sealed class PerformanceSensitiveAttribute : Attribute
Member

Cool!


sharwell commented Dec 6, 2017

@dotnet/roslyn-compiler for reviews

/// </remarks>
[Conditional("EMIT_CODE_ANALYSIS_ATTRIBUTES")]
[AttributeUsage(AttributeTargets.Constructor | AttributeTargets.Method | AttributeTargets.Property | AttributeTargets.Field, AllowMultiple = true, Inherited = false)]
internal sealed class PerformanceSensitiveAttribute : Attribute
Member

PerformanceSensitiveAttribute

It looks like this file already exists in the repo at that same location. Is it just a case of concurrent PRs?

Contributor Author

➡️ It's the same commit. If we want to wait for a new set of builds I can tell the PR to update itself and this addition will disappear from the diff.

Member

Ok, that's fine as long as you merge and don't squash ;-)


jcouv commented Dec 8, 2017

There is a possibility for a performance regression for compilation-only scenarios with no analyzers involved, for very large projects on concurrent systems. However, during testing I was unable to find a solution for which the parallelization benefits outweighed the allocation overhead of the concurrent approach.

Thanks for the analysis.
Could you share more details about how Parallel.ForEach creates this overhead? I understand that it would cause more allocations (Tasks), but it is surprising that there would be that many allocations and that, on net, the runtime would be reduced by running serially.

@jcouv jcouv left a comment

LGTM (with question to understand why Parallel.ForEach has such negative impact here)

CheckMemberDistinctness(symbol);
}

if (_compilation.Options.ConcurrentBuild)
@AlekseyTs AlekseyTs Dec 8, 2017

I am not comfortable with removing this parallelization completely. Even if we didn't add it based on a specific scenario, a similar approach has definitely been proven to improve throughput in other similar cases (when we need to visit all namespaces and types recursively and do some analysis on them). Could we use some heuristic to disable parallelization only when it is likely to cause a problem? For example, only when we have a cancellable _cancellationToken and probably a non-null _filterTree. #Closed

@AlekseyTs AlekseyTs left a comment

I think we should find a way to keep parallelism for compilation only scenarios. See #23519 (comment) for some approaches to consider.


AlekseyTs commented Dec 8, 2017

Done with review pass (iteration 2). #Closed


sharwell commented Dec 8, 2017

Could you share more details about how the Parallel.ForEach creates those overheads.

For a reference run which is 1/4 of the size of the run described in #23582, here is some of the overhead caused by Parallel.ForEach:

  • ParallelForReplicaTask: 911MB
  • IndexRange[]: 258MB
  • System.Threading.Tasks.Shared<long>: 257MB
  • ContingentProperties: 151MB
  • Action<object>: 128MB
  • Action<Symbol>: 122MB
  • <>c__DisplayClass17_0<object>: 104MB
  • ParallelForReplicatingTask: 88MB
  • Action<int>: 76MB
  • <>c__DisplayClass31_0<Symbol, Object>: 64MB
  • Action: 63MB
  • CancellationCallbackInfo: 60MB
  • <>c__DisplayClass176_0: 54MB
  • ParallelOptions: 43MB
  • RangeManager: 40MB
  • <>c__DisplayClass6_0<Symbol>: 32MB
  • ParallelLoopStateFlags32: 23MB
  • SetOnInvokeMres: 2MB


sharwell commented Dec 11, 2017

Over the weekend I ran elapsed time evaluations to compare four approaches:

  1. Baseline
  2. _filterTree: Skip parallelization when _filterTree is not null
  3. Sequential: No parallelization
  4. _filterTree+NSOnly: Only parallelize across namespaces, and skip parallelization when _filterTree is not null

Since these are time measurements (more subject to noise than allocation measurements), I used /iter:10 instead of the normal /iter:4 to help stabilize the results. Here are the results, with all times in milliseconds:

| Analyzer | Baseline | _filterTree | Sequential | _filterTree+NSOnly |
| --- | --- | --- | --- | --- |
| CSharpAddAccessibilityModifiersDiagnosticAnalyzer | 1635 | 1930 | 1735 | 1888 |
| CSharpAddBracesDiagnosticAnalyzer | 9272 | 8368 | 8176 | 8165 |
| CSharpAsAndNullCheckDiagnosticAnalyzer | 25340 | 22453 | 19569 | 20745 |
| CSharpInlineDeclarationDiagnosticAnalyzer | 225978 | 171636 | 159502 | 168476 |
| CSharpIsAndCastCheckDiagnosticAnalyzer | 685 | 705 | 306 | 492 |
| CSharpNamingStyleDiagnosticAnalyzer | 120882 | 110507 | 90558 | 114473 |
| CSharpOrderModifiersDiagnosticAnalyzer | 2650 | 2879 | 3057 | 2994 |
| CSharpPreferFrameworkTypeDiagnosticAnalyzer | 36416 | 35503 | 29867 | 35437 |
| CSharpQualifyMemberAccessDiagnosticAnalyzer | 38703 | 33185 | 28413 | 33537 |
| CSharpRemoveUnnecessaryCastDiagnosticAnalyzer | 27258 | 25254 | 23712 | 24008 |
| CSharpRemoveUnnecessaryImportsDiagnosticAnalyzer | 194789 | 45457 | 35945 | 44266 |
| CSharpRemoveUnreachableCodeDiagnosticAnalyzer | 178197 | 38627 | 29539 | 39477 |
| CSharpSimplifyTypeNamesDiagnosticAnalyzer | 2036460 | 1871650 | 1721344 | 1812623 |
| CSharpUnboundIdentifiersDiagnosticAnalyzer | 2800 | 2489 | 2554 | 2927 |
| CSharpUseCoalesceExpressionDiagnosticAnalyzer | 3020 | 2369 | 1919 | 2146 |
| CSharpUseCoalesceExpressionForNullableDiagnosticAnalyzer | 3244 | 2953 | 2525 | 3232 |
| CSharpUseCollectionInitializerDiagnosticAnalyzer | 45378 | 40457 | 38683 | 42377 |
| CSharpUseDeconstructionDiagnosticAnalyzer | 35481 | 33847 | 30305 | 34423 |
| CSharpUseDefaultLiteralDiagnosticAnalyzer | 4423 | 3960 | 4042 | 3993 |
| CSharpUseExplicitTypeDiagnosticAnalyzer | 97618 | 91163 | 87976 | 88354 |
| CSharpUseImplicitTypeDiagnosticAnalyzer | 20463 | 18624 | 16858 | 17985 |
| CSharpUseInferredMemberNameDiagnosticAnalyzer | 4727 | 3695 | 3982 | 3885 |
| CSharpUseIsNullCheckDiagnosticAnalyzer | 66877 | 64729 | 58624 | 67451 |
| CSharpUseLocalFunctionDiagnosticAnalyzer | 10653 | 9208 | 7224 | 9177 |
| CSharpUseNullPropagationDiagnosticAnalyzer | 5658 | 6074 | 5488 | 5726 |
| CSharpUseObjectInitializerDiagnosticAnalyzer | 10848 | 10756 | 9884 | 9812 |
| CSharpUseThrowExpressionDiagnosticAnalyzer | 23309 | 21823 | 20907 | 20100 |
| CSharpValidateFormatStringDiagnosticAnalyzer | 20877 | 20127 | 19458 | 19776 |
| InvokeDelegateWithConditionalAccessAnalyzer | 11093 | 9241 | 8540 | 9133 |
| UseExpressionBodyDiagnosticAnalyzer | 50178 | 51688 | 40819 | 48971 |
| Total | 3314912 | 2761357 | 2511511 | 2696049 |

The wall-clock elapsed time numbers lead to a similar, though less pronounced, conclusion:

| Baseline | _filterTree | Sequential | _filterTree+NSOnly |
| --- | --- | --- | --- |
| 233517 | 223185 | 216334 | 218797 |

@heejaechang

@sharwell can you be a little more specific about "Sequential: No parallelization"?

Is it no parallelization either between analyzers or within an analyzer — for example, running one analyzer at a time and running all of that analyzer's actions sequentially?

Or are you saying the analyzers run concurrently, but each analyzer runs sequentially?

Or that the analyzers run sequentially, but each analyzer's actions run concurrently?

Or are you talking specifically about this ClsComplianceChecker being sequential or parallel?

@sharwell

@heejaechang Sequential is the current state of the pull request.


mavasani commented Dec 11, 2017

or are you talking specifically about this ClsComplianceChecker being sequential or parallel?

I also feel that we need the before-and-after results for a batch compilation with the CLS compliance checker but without any analyzers. I presume this is the case the compiler team cares about most.

@sharwell sharwell force-pushed the optimize-clscompliant branch from 6537a7b to b99d0b3 Compare January 31, 2018 16:45
@sharwell sharwell requested a review from a team as a code owner January 31, 2018 16:45
@sharwell sharwell changed the base branch from master to dev15.6.x January 31, 2018 16:50
@sharwell

@AlekseyTs @VSadov Please review the updated change. I made the following changes:

  • Re-enable parallelization across types
  • Disable parallelization when information is requested only for a single syntax tree
  • Use the lightweight fork-join approach from MethodCompiler instead of using Parallel.ForEach

The resulting numbers are similar to the _filterTree+NSOnly numbers, but with lower overall allocations.
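As a rough illustration of the fork-join shape referenced above (the real implementation lives in MethodCompiler and the updated ClsComplianceChecker; `VisitAll` and its parameters here are hypothetical): tasks are pushed onto a collection as they are forked and joined once with Task.WaitAll at the end, avoiding Parallel.ForEach's replica-task and range-partitioning machinery.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ForkJoinSketch
{
    // Fork one task per top-level item, remember it, and join at the end.
    // Each item costs a single Task, with no ParallelForReplicaTask,
    // IndexRange, or loop-state allocations.
    static void VisitAll(int[] items, Action<int> visit, bool concurrent)
    {
        if (!concurrent)
        {
            // Sequential path, e.g. when ConcurrentBuild is off.
            foreach (var item in items) visit(item);
            return;
        }

        var pending = new ConcurrentStack<Task>();
        foreach (var item in items)
        {
            var captured = item;
            pending.Push(Task.Run(() => visit(captured)));
        }
        Task.WaitAll(pending.ToArray());
    }

    static void Main()
    {
        long total = 0;
        VisitAll(Enumerable.Range(1, 100).ToArray(),
                 i => Interlocked.Add(ref total, i),
                 concurrent: true);
        Console.WriteLine(total); // 5050
    }
}
```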

@VSadov

VSadov commented Jan 31, 2018

It is likely that the last change improves the batch compiler perf as well. Parallel.ForEach was a bit too heavy a tool for the job there.

I am not sure by how much – CLS checks do not take a lot of E2E time, as long as they run concurrently, and from the look of the change that behavior is preserved while implementation is now lighter. It might be interesting to run compiler benchmark – just to see if we got any wins on E2E time. Probably not much if at all, but there should not be any regressions either.

The new change does not make me worried.

@VSadov VSadov left a comment

LGTM

Next
End If

For Each m In symbol.Modules
Contributor Author

❗️ This needs to be updated to match the update in the C# compiler.

{
var queue = new ConcurrentQueue<Diagnostic>();
var checker = new ClsComplianceChecker(compilation, filterTree, filterSpanWithinTree, queue, cancellationToken);
if (compilation.Options.ConcurrentBuild)
Contributor

Why not check the option inside the constructor of ClsComplianceChecker? It seems okay to create the stack in the ctor as a readonly field.

Contributor Author

I was following the pattern of MethodCompiler. I see that the VB implementation of the same feature initializes the collection in the constructor. I'll move it there for the C# one.


_declaredOrInheritedCompliance = new ConcurrentDictionary<Symbol, Compliance>();

if (compilation.Options.ConcurrentBuild)
Contributor

Consider also checking if _filterTree is null.

Contributor Author

➡️ I thought about it, but figured it risked regression if the conditions ever changed. The one thing we can count on is that it's only needed during concurrent builds.

Contributor Author

📝 If this is important to you, let me know and I'll extract the condition to a property.

Contributor

let me know and I'll extract the condition to a property

This feels like the right thing to do. Why allocate when we don't need to?



@AlekseyTs AlekseyTs dismissed their stale review February 1, 2018 18:41

Obsolete

@AlekseyTs AlekseyTs left a comment

LGTM (iteration 3)

@jinujoseph

Approved to merge. Merge pending validation + RPS verification.


sharwell commented Feb 2, 2018

Validation builds passed too 👍

@sharwell sharwell merged commit 644cb4f into dotnet:dev15.6.x Feb 2, 2018
@sharwell sharwell deleted the optimize-clscompliant branch February 2, 2018 18:04