Skip to content

Stack frame size optimisations#44854

Merged
RikkiGibson merged 10 commits intodotnet:masterfrom
MykolaBalakin:balakin/stack-usage-optimization
Jun 9, 2020
Merged

Stack frame size optimisations#44854
RikkiGibson merged 10 commits intodotnet:masterfrom
MykolaBalakin:balakin/stack-usage-optimization

Conversation

@MykolaBalakin
Copy link
Contributor

The changes contains a few optimisations based on EndToEndTests.DeeplyNestedGeneric test.
Below are the results of the test run:

 OS Framework  Configuration  Before After
macOS netcoreapp3.1 Debug 180 200
macOS netcoreapp3.1 Release 440 530
macOS net472 Debug 870 960
macOS net472 Release 1230 1390
Windows x64 netcoreapp3.1 Debug 500 560
Windows x64 netcoreapp3.1 Release 1200 1440
Windows x32 net472 Debug 480 520
Windows x32 net472 Release 1480 1560

Moving the code to a local function reduces stack frame size of BindQualifiedName by 80 bytes
This change reduces the BindNameSpaceOrTypeSymbol method's stack frame size by 32 bytes
Reduces the stack frame size by 64 bytes
…om the caller

Reduces BindQualifiedName method's stack frame size by 32 bytes
@MykolaBalakin MykolaBalakin requested a review from a team as a code owner June 4, 2020 16:05
Copy link
Contributor Author

@MykolaBalakin MykolaBalakin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I increase the limits defined or add macOS Release configuration limit in the EndToEndTests.DeeplyNestedGeneric test?

@jaredpar
Copy link
Member

jaredpar commented Jun 4, 2020

Should I increase the limits defined or add macOS Release configuration limit in the EndToEndTests.DeeplyNestedGeneric test?

Absolutely. Want to lock in this win.

return TypeWithAnnotations.Create(new PointerTypeSymbol(elementType));
}

NamespaceOrTypeOrAliasSymbolWithAnnotations createErrorType()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. This is a simple and effective technique I'm guessing we could replicate elsewhere.

Curious what tools, if any, did you use to measure the stack frame sizes here? Or did you just experiment and run the tests?

Copy link
Contributor Author

@MykolaBalakin MykolaBalakin Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just lldb, sos and clru. Have not found nothing better yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other case statements of the switch-case above may lead to recursive code execution, so similar changes there may get other cases worse (even while improving the current one).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just lldb, sos and clru. Has not found nothing better yet.

That's basically the same approach we'd been taking (windbg + sos).

@jaredpar jaredpar added Area-Compilers Community The pull request was submitted by a contributor who is not a Microsoft employee. labels Jun 4, 2020
Copy link
Member

@gafter gafter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Looks good, but please do update the expected depth for the test.

constraintsList,
openBrace,
members,
membersList,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. @jaredpar it's a super pity this sort of change needs to be manual. i suppose it's unsafe for the compielr to figure this out due to potential side-effects?

  2. shoudl we get rid of those implicit conversions between the builders and the built items? so that we have to think about this and can potentially get improvements in other areas of the parser.

var semicolon = TryEatToken(SyntaxKind.SemicolonToken);
var modifiersList = (SyntaxList<SyntaxToken>)modifiers.ToList();
var membersList = (SyntaxList<MemberDeclarationSyntax>)members;
var constraintsList = (SyntaxList<TypeParameterConstraintClauseSyntax>)constraints;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we explicitly doc this? this would be something that could easily be done later by someone well meaning. i.e. we might get a stack-improvement elsewhere in the future. then someone goes "i'll just inline these" and now we undo it, but nothing catches it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that is a point for improvement for JIT-compiler. JIT-compiler may reuse stack frame space for different variables by analysing variable life scope.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this enabling us to save stack space here? This is essentially converting a sequence of implicit conversions that would cause some push call sequences into ldloc calls. The final stack size should be identical particularly because these aren't the final arguments to the method they are passed as parameters too.

Is this a case where the JIT is over allocating stack space for the method and this just helps their analysis? The method is fairly large and does have a lot of locals. Possibly it's over the range where they do their most in depth optimizations.

That or I'm missing something really basic here :)

Copy link
Contributor Author

@MykolaBalakin MykolaBalakin Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each case-branch contains a set of cast operator invocations and JIT places the result of the cast into different addresses in the stack frame. Aggregating all the casts before the switch eliminates the "duplicates".

; case SyntaxKind.ClassKeyword:
; ...
mov     rdi, qword ptr [rbp - 0xa8]
call    0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov     qword ptr [rbp - 0xc0], rax

; case SyntaxKind.StructKeyword:
; ...
mov     rdi, qword ptr [rbp - 0xa8]
call    0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov     qword ptr [rbp - 0xb8], rax

; case SyntaxKind.InterfaceKeyword:
; ...
mov     rdi, qword ptr [rbp - 0xa8]
call    0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov     qword ptr [rbp - 0xb0], rax

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very interesting!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting indeed.

This doesn't seem to happen with all calls in a switch / case but I was able to get it to repro in some smaller samples. That makes it hard to make a simple judgement on at code review time but we also now know to look for this when we're investigating stack frame bugs / making changes in this area. So definitely can improve our process going forward.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-approval

(ExecutionArchitecture.x86, ExecutionConfiguration.Release) => 1290,
(ExecutionArchitecture.x64, ExecutionConfiguration.Debug) => 170,
(ExecutionArchitecture.x64, ExecutionConfiguration.Release) => 730,
(ExecutionArchitecture.x86, ExecutionConfiguration.Debug) => 520,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are the maximum working values tested on VS 2019 VM image in Azure. I'm wondering maybe it's worth to reserve some margin so that different runtime or some random stuff does not make the test fail.

Also I have some troubles with measuring Linux limits. It works with the limit over 3000 and takes about 40 minutes to pass.

(ExecutionArchitecture.x86, ExecutionConfiguration.Release) => 1640,
(ExecutionArchitecture.x64, ExecutionConfiguration.Debug) => 290,
(ExecutionArchitecture.x64, ExecutionConfiguration.Release) => 810,
(ExecutionArchitecture.x86, ExecutionConfiguration.Debug) => 460,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reference environment to run this test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running xunit.console.exe or xunit.console.x86.exe on my home PC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that is how I determined new baselines. Perhaps we could do it in a more "canonical" environment, e.g. by lowering all the baselines, uncommenting the bits that find the new baseline, then running it on a CI machine.

Copy link
Contributor Author

@MykolaBalakin MykolaBalakin Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numbers set were from the test run on Azure VM (VS 2019 Latest image) using ./eng/cibuild.ps1. I thought that would be a "canonical" enough method to measure the baselines. 😄
Also, what's interesting, running the commented loop leads to different number than running the test itself while trying to adjust the number with the loop commented. Maybe will check why it is so later.

Yeah, and the question about Linux environment is still actual. Do we want to adjust the number for it? The real issue there is the test duration which reaches 40 minutes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, there is no need to test the actual limit on Linux. Keeping at around the desktop Release/x86 level or a bit higher is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the numbers mentioned in the PR description were measured by running dotnet test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed that the thresholds in the test differ based on simply whether you use the loop or not. It is a bit of a pain.

The most likely reason your thresholds differed from mine is that some features have gone into master since this PR was opened, most notably records.

Copy link
Contributor Author

@MykolaBalakin MykolaBalakin Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting enough, they differ in different directions. For macOS the working threshold was lower than the one measured using the loop and for Windows it's vice versa.

@RikkiGibson RikkiGibson merged commit 71b5c58 into dotnet:master Jun 9, 2020
@ghost ghost added this to the Next milestone Jun 9, 2020
@jaredpar
Copy link
Member

jaredpar commented Jun 9, 2020

Thanks for this contribution! This is really exciting for us to see.

RikkiGibson added a commit to RikkiGibson/roslyn that referenced this pull request Jun 9, 2020
RikkiGibson added a commit that referenced this pull request Jun 10, 2020
@dibarbet dibarbet modified the milestones: Next, 16.7.P4 Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-Compilers auto-merge Community The pull request was submitted by a contributor who is not a Microsoft employee.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants