Stack frame size optimisations#44854
Conversation
Moving the code to a local function reduces stack frame size of BindQualifiedName by 80 bytes
This change reduces the BindNameSpaceOrTypeSymbol method's stack frame size by 32 bytes
Reduces the stack frame size by 64 bytes
Reduces the stack frame size by 64 bytes
…om the caller Reduces BindQualifiedName method's stack frame size by 32 bytes
MykolaBalakin
left a comment
There was a problem hiding this comment.
Should I increase the limits defined or add macOS Release configuration limit in the EndToEndTests.DeeplyNestedGeneric test?
Absolutely. Want to lock in this win. |
| return TypeWithAnnotations.Create(new PointerTypeSymbol(elementType)); | ||
| } | ||
|
|
||
| NamespaceOrTypeOrAliasSymbolWithAnnotations createErrorType() |
There was a problem hiding this comment.
Nice. This is a simple and effective technique I'm guessing we could replicate elsewhere.
Curious what tools, if any, did you use to measure the stack frame sizes here? Or did you just experiment and run the tests?
There was a problem hiding this comment.
Just lldb, sos and clru. Have not found nothing better yet.
There was a problem hiding this comment.
All the other case statements of the switch-case above may lead to recursive code execution, so similar changes there may get other cases worse (even while improving the current one).
There was a problem hiding this comment.
Just lldb, sos and clru. Has not found nothing better yet.
That's basically the same approach we'd been taking (windbg + sos).
gafter
left a comment
There was a problem hiding this comment.
Nice work. Looks good, but please do update the expected depth for the test.
| constraintsList, | ||
| openBrace, | ||
| members, | ||
| membersList, |
There was a problem hiding this comment.
-
@jaredpar it's a super pity this sort of change needs to be manual. i suppose it's unsafe for the compielr to figure this out due to potential side-effects?
-
shoudl we get rid of those implicit conversions between the builders and the built items? so that we have to think about this and can potentially get improvements in other areas of the parser.
| var semicolon = TryEatToken(SyntaxKind.SemicolonToken); | ||
| var modifiersList = (SyntaxList<SyntaxToken>)modifiers.ToList(); | ||
| var membersList = (SyntaxList<MemberDeclarationSyntax>)members; | ||
| var constraintsList = (SyntaxList<TypeParameterConstraintClauseSyntax>)constraints; |
There was a problem hiding this comment.
can we explicitly doc this? this would be something that could easily be done later by someone well meaning. i.e. we might get a stack-improvement elsewhere in the future. then someone goes "i'll just inline these" and now we undo it, but nothing catches it.
There was a problem hiding this comment.
I'd say that is a point for improvement for JIT-compiler. JIT-compiler may reuse stack frame space for different variables by analysing variable life scope.
There was a problem hiding this comment.
How is this enabling us to save stack space here? This is essentially converting a sequence of implicit conversions that would cause some push call sequences into ldloc calls. The final stack size should be identical particularly because these aren't the final arguments to the method they are passed as parameters too.
Is this a case where the JIT is over allocating stack space for the method and this just helps their analysis? The method is fairly large and does have a lot of locals. Possibly it's over the range where they do their most in depth optimizations.
That or I'm missing something really basic here :)
There was a problem hiding this comment.
Each case-branch contains a set of cast operator invocations and JIT places the result of the cast into different addresses in the stack frame. Aggregating all the casts before the switch eliminates the "duplicates".
; case SyntaxKind.ClassKeyword:
; ...
mov rdi, qword ptr [rbp - 0xa8]
call 0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov qword ptr [rbp - 0xc0], rax
; case SyntaxKind.StructKeyword:
; ...
mov rdi, qword ptr [rbp - 0xa8]
call 0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov qword ptr [rbp - 0xb8], rax
; case SyntaxKind.InterfaceKeyword:
; ...
mov rdi, qword ptr [rbp - 0xa8]
call 0x1127621a0 (SyntaxListBuilder.ToListNode())
; ...
mov qword ptr [rbp - 0xb0], raxThere was a problem hiding this comment.
Very interesting indeed.
This doesn't seem to happen with all calls in a switch / case but I was able to get it to repro in some smaller samples. That makes it hard to make a simple judgement on at code review time but we also now know to look for this when we're investigating stack frame bugs / making changes in this area. So definitely can improve our process going forward.
| (ExecutionArchitecture.x86, ExecutionConfiguration.Release) => 1290, | ||
| (ExecutionArchitecture.x64, ExecutionConfiguration.Debug) => 170, | ||
| (ExecutionArchitecture.x64, ExecutionConfiguration.Release) => 730, | ||
| (ExecutionArchitecture.x86, ExecutionConfiguration.Debug) => 520, |
There was a problem hiding this comment.
Those are the maximum working values tested on VS 2019 VM image in Azure. I'm wondering maybe it's worth to reserve some margin so that different runtime or some random stuff does not make the test fail.
Also I have some troubles with measuring Linux limits. It works with the limit over 3000 and takes about 40 minutes to pass.
| (ExecutionArchitecture.x86, ExecutionConfiguration.Release) => 1640, | ||
| (ExecutionArchitecture.x64, ExecutionConfiguration.Debug) => 290, | ||
| (ExecutionArchitecture.x64, ExecutionConfiguration.Release) => 810, | ||
| (ExecutionArchitecture.x86, ExecutionConfiguration.Debug) => 460, |
There was a problem hiding this comment.
What is the reference environment to run this test?
There was a problem hiding this comment.
Running xunit.console.exe or xunit.console.x86.exe on my home PC.
There was a problem hiding this comment.
Well, that is how I determined new baselines. Perhaps we could do it in a more "canonical" environment, e.g. by lowering all the baselines, uncommenting the bits that find the new baseline, then running it on a CI machine.
There was a problem hiding this comment.
The numbers set were from the test run on Azure VM (VS 2019 Latest image) using ./eng/cibuild.ps1. I thought that would be a "canonical" enough method to measure the baselines. 😄
Also, what's interesting, running the commented loop leads to different number than running the test itself while trying to adjust the number with the loop commented. Maybe will check why it is so later.
Yeah, and the question about Linux environment is still actual. Do we want to adjust the number for it? The real issue there is the test duration which reaches 40 minutes.
There was a problem hiding this comment.
IMO, there is no need to test the actual limit on Linux. Keeping at around the desktop Release/x86 level or a bit higher is fine.
There was a problem hiding this comment.
BTW, the numbers mentioned in the PR description were measured by running dotnet test.
There was a problem hiding this comment.
I also noticed that the thresholds in the test differ based on simply whether you use the loop or not. It is a bit of a pain.
The most likely reason your thresholds differed from mine is that some features have gone into master since this PR was opened, most notably records.
There was a problem hiding this comment.
Interesting enough, they differ in different directions. For macOS the working threshold was lower than the one measured using the loop and for Windows it's vice versa.
|
Thanks for this contribution! This is really exciting for us to see. |
Apply #44854 to preview3 branch
The changes contains a few optimisations based on
EndToEndTests.DeeplyNestedGenerictest.Below are the results of the test run: