Remove relocations for vtable chunks#17147
Conversation
|
@dotnet/arm32-contrib PTAL |
a3caca9 to
58af16f
Compare
|
cc @vitek-karas This private memory optimization is specific to the FragileNonVersionable NGen. If it is not used (the default), this change is a regression with no benefits. We have been working towards including Linux ARM among officially supported .NET Core platforms for .NET Core 2.1, and just recently started looking at performance. I think we need to step back and agree on a plan how to evolve the native binary file format(s). Here are some points that come to mind:
Thoughts? |
|
What do you think about adding this optimization under This way, FragileNonVersionable or R2R specific optimizations, which affect each other, could be defined under different
Does this mean that precodes won't be saved to FNV images and will be created at run time? |
Yes, it is what can be done for now (together with other FragileNonVersionable optimizations that are under ARM && UNIX ifdef today). I am worried that these ifdefs become unmanageable over time.
The prototype that we have done saves the precodes as read-only (https://github.com/vitek-karas/coreclr/tree/NGenWX). This reminds me that we need an issue for this. I have just created one: https://github.com/dotnet/coreclr/issues/17230 |
58af16f to
a39f5c5
Compare
We'll consider applying this approach on ARM after May, as we have to measure performance and memory consumption of such changes. Could you, please, take a look at this pull request? |
BruceForstall
left a comment
There was a problem hiding this comment.
Do you need to make arm LEGACY_BACKEND work with this? It's likely we will delete this code path soon.
|
|
||
| *GetAddrOfSlot() = GetTemporaryEntryPoint(); | ||
| TADDR slot = GetAddrOfSlot(); | ||
| if (IsVtableSlot()) |
There was a problem hiding this comment.
Any reason why you have not converted all slots to this scheme?
There was a problem hiding this comment.
Non-virtual slots and non-vtable slots are plain pointers in non-NGEN mode and relative pointers in NGEN mode, so, for IsVtableSlot() slot is always a relative pointer (for ARM with FEATURE_FNV_MEM_OPTIMIZATIONS), otherwise it's not.
| MethodTable::VtableIndirectionSlotIterator it = pMT->IterateVtableIndirectionSlots(); | ||
| while (it.Next()) | ||
| { | ||
| image->PlaceInternedStructureForAddress(it.GetIndirectionSlot(), CORCOMPILE_SECTION_READONLY_SHARED_HOT, CORCOMPILE_SECTION_READONLY_HOT); |
There was a problem hiding this comment.
Why can't you use PlaceInternedStructureForAddress anymore? PlaceInternedStructureForAddress was used here as binary size optimization.
There was a problem hiding this comment.
I've missed the check for MethodTable::VTableIndir2_t::isRelative there: it should be PlaceStructureForAddress if this condition is met, otherwise, as before. I'll fix this.
When vtable chunks consist of relative pointers, they can't be reused, as these chunks can be the same for different method tables, but base addresses must be different.
There was a problem hiding this comment.
The base address is always the address of the slot itself. I do not see what prevents reusing them.
There was a problem hiding this comment.
What I meant was that without this optimization if there are two method tables MT1 with chunk1 and MT2 with chunk2, these chunks differ, as they consist of different absolute addresses. If addresses are relative, memory blocks of these chunks might become equal, when all methods in both chunks have the same relative offsets. Then, base addresses of these chunks have to be different in order to store different addresses in them.
There was a problem hiding this comment.
If addresses are relative, memory blocks of these chunks might become equal
PlaceInternedStructureForAddress does not look at bytes inside the blocks. It just looks at the addresses. I do not think that the addresses can ever become accidentally equal.
There was a problem hiding this comment.
I agree, this place could be left unchanged, as m_reusedStructures in PlaceInternedStructureForAddress will not contain vtable chunks, and this change is unneeded. Thank you.
src/jit/morph.cpp
Outdated
| &isRelative); | ||
| &isRelative, &isRelativeAfterIndirection); | ||
|
|
||
| if (isRelativeAfterIndirection) |
There was a problem hiding this comment.
Should we have just one flag for it that turns it on/off for the whole thing?
There was a problem hiding this comment.
Two separate flags could be useful for enabling more or less aggressive memory optimizations for FNV. Now there are next possible combinations:
isRelative==false and isRelativeAfterIndirection==false
isRelative==true and isRelativeAfterIndirection==false
isRelative==true and isRelativeAfterIndirection==true
There was a problem hiding this comment.
I understand that it is more flexibility, but I doubt that we will ever take advantage of it and it increases the number of options one has to worry about when working on the code.
There was a problem hiding this comment.
We target different devices and this option might be useful, as memory reduction of this optimization comes with performance reduction on benchmarks.
Maybe this and similar places could be refactored, so that it will be easier to maintain them. What do you think about this?
There was a problem hiding this comment.
Historically, fine grained controls like these did not survive for long. I think we just need two options: fully position independent data structures on/off. Similar to how position-independent-code works in C/C++. You either turn it on or off. There are no fine-grained controls.
a39f5c5 to
bb8ae14
Compare
src/jit/codegencommon.cpp
Outdated
|
|
||
| case IAT_RELPVALUE: | ||
| { | ||
| #if defined(_TARGET_ARM_) && defined(FEATURE_FNV_MEM_OPTIMIZATIONS) |
There was a problem hiding this comment.
The FEATURE_FNV_MEM_OPTIMIZATION ifdefs should be just on the VM side. They are not necessary on the JIT side since the JIT gets instructions on what to do from the VM side, and the cost of the extra checks is negligible.
Yes, without legacy backend part of this change we won't be able to run legacy jit with |
clrdefinitions.cmake
Outdated
| if (CLR_CMAKE_PLATFORM_UNIX OR CLR_CMAKE_TARGET_ARCH_ARM64) | ||
| add_definitions(-DFEATURE_STUBS_AS_IL) | ||
| endif () | ||
| if (FEATURE_FNV_MEM_OPTIMIZATIONS) |
There was a problem hiding this comment.
I think the name of the ifdef should be something like FEATURE_NGEN_RELOCS_OPTIMIZATIONS to better describe what it does.
There was a problem hiding this comment.
And switch the existing places that use #if defined(PLATFORM_UNIX) && defined(_TARGET_ARM_) to use it as well.
There was a problem hiding this comment.
Could we add build with FEATURE_NGEN_RELOCS_OPTIMIZATIONS for Linux ARM to default set of tests on pull requests?
1cb5a66 to
93459f6
Compare
|
@dotnet-bot help |
|
Welcome to the dotnet/coreclr Repository The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
|
Welcome to the dotnet/coreclr Repository The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
|
Welcome to the dotnet/coreclr Perf help The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
|
Could you, please, take a look? |
src/inc/corinfo.h
Outdated
| IAT_PVALUE, // The value needs to be accessed via an indirection | ||
| IAT_PPVALUE // The value needs to be accessed via a double indirection | ||
| IAT_PVALUE, // The value needs to be accessed via an indirection | ||
| IAT_RELPVALUE, // The value needs to be accessed via a relative indirection |
There was a problem hiding this comment.
This is changing JIT/EE interface definition. The JIT/EE interface definition will need new GUID.
There was a problem hiding this comment.
It would be better to add IAT_RELPVALUE at the end of the enum rather than in the middle.
It less of a breaking change that way.
|
The non-JIT part of the changes looks good to me otherwise. @BruceForstall Could you please take a look at the JIT part of the changes? |
8241858 to
6d886f7
Compare
|
@jkotas @BruceForstall @briansull |
src/jit/codegencommon.cpp
Outdated
|
|
||
| regNumber vptrReg1 = REG_R11; | ||
| regMaskTP vptrReg1Mask = genRegMask(vptrReg1); | ||
| inst_IV(INS_push, (int)vptrReg1Mask); |
There was a problem hiding this comment.
There's a fundamental problem here, or maybe two: (1) you are temporarily trashing R11, which is the frame pointer. This will break stack walking, and local variable access (e.g., debugging), and if an exception occurs after FP has been trashed, FP wouldn't get restored, and the stack would also be off, and (2) we never change the stack pointer outside of the prolog or epilog. I think that's required, also for stack walking.
Presumably if you ran GCStress in a fully-interruptible method with this type of call, you would hit a problem (since GC would be attempted after the push). Actually, I'm guessing GC would fail dramatically because we don't track stack pointer moves outside the prolog/epilog, so the GC info would be incorrect.
The correct way to do this would be to make information known to the register allocator that we need a temporary register here, and then grab that temporary. This might be a case where we've never done that, so I don't know if there's an established mechanism for communicating this.
I notice also that StubLinkerCPU::ThumbEmitCallManagedMethod has apparently the same call sequence, which uses R11 in the same way. I'm not sure what the requirements are on stubs, and whether this is ok in that context or not.
There was a problem hiding this comment.
@BruceForstall, what if we use some other register for this instead of R11? For example, R1, to fix problem (1), which you have mentioned.
There was a problem hiding this comment.
@BruceForstall For problem (2), we can reserve additional space on stack in prolog, and store helper register there instead of push. This will allow to not change stack pointer. What do you think about this?
There was a problem hiding this comment.
Those two things would probably make this work. However, it wouldn't be the "right" way to do it. The right way is to get the register allocator to allocate a temporary register that would be available at this point. It done, then it would automatically save/restore any conflicting register (if needed), and allocate stack space (if needed).
There was a problem hiding this comment.
@BruceForstall I've pushed new commits related to these. Could you, please, take a look?
0510018 to
67a651a
Compare
src/jit/codegencommon.cpp
Outdated
| break; | ||
|
|
||
| case IAT_RELPVALUE: | ||
| #ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS |
There was a problem hiding this comment.
We do not use FEATURE ifdefs for cases like these when the VM asked the JIT to do something. We just leave the handling in unconditionally, even though it creates a bit of unreachable code in the JIT. It simplifies the matrix of different build combinations we need to worry about. It is nice to have one JIT library that can handle all of them (per host/target combination).
…TURE_NGEN_RELOCS_OPTIMIZATIONS is enabled. Introduce FEATURE_NGEN_RELOCS_OPTIMIZATIONS, under which NGEN specific relocations optimizations are enabled
- str/ldr of R4 in space reserved in epilog for non-tail calls - usage of R4 with hybrid-tail calls (same as for EmitShuffleThunk)
…r register right before its restore from stack
032544b to
0f5c702
Compare
|
@dotnet-bot test this please |
|
@dotnet-bot test Ubuntu arm Cross Checked normal Build and Test |
|
@jkotas do you have any more concerns with this change, or is it ready to merge from your perspective? |
src/jit/codegencommon.cpp
Outdated
| unwindStarted = true; | ||
| } | ||
|
|
||
| #ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS |
There was a problem hiding this comment.
We try to avoid FEATURE ifdefs like these in the JIT. Can this be driven by what the VM tells JIT to do?
There was a problem hiding this comment.
@jkotas, yes, this would work without #ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS. I'll remove it
src/vm/methodtable.cpp
Outdated
| // On ARM, make sure that the address to the virtual thunk that we write into the | ||
| // vtable "chunk" has the Thumb bit set. | ||
| image->FixupFieldToNode(slotBase, slotOffset, importThunk ARM_ARG(THUMB_CODE)); | ||
| image->FixupFieldToNode(slotBase, slotOffset, importThunk ARM_ARG_OR_ZERO(THUMB_CODE), relocType); |
There was a problem hiding this comment.
It would be better to write this as ARM_ARG + NOT_ARM_ARG instead of introducing a new macro for this.
src/jit/codegencommon.cpp
Outdated
|
|
||
| CORINFO_METHOD_HANDLE methHnd = (CORINFO_METHOD_HANDLE)jmpNode->gtVal.gtVal1; | ||
| CORINFO_CONST_LOOKUP addrInfo; | ||
| compiler->info.compCompHnd->getFunctionEntryPoint(methHnd, &addrInfo); |
There was a problem hiding this comment.
We will call getFunctionEntryPoint again later when jmpEpilog is set. Should we remember the result from here?
src/jit/codegencommon.cpp
Outdated
| unwindStarted = true; | ||
| } | ||
|
|
||
| #if !FEATURE_FASTTAILCALL |
There was a problem hiding this comment.
If/once FEATURE_FASTTAILCALL is enabled for ARM, this will need to change. Would it make sense to change this to something like this so that it is ready?
#if FEATURE_FASTTAILCALL
if (jmpEpilog && jmpNode->gtOper == GT_JMP)
#else
if (jmpEpilog)
#endif
Or maybe not have this under ifdef at all and just do if (jmpEpilog && jmpNode->gtOper == GT_JMP) always to keep the code simple.
There was a problem hiding this comment.
I've updated latest commit with your last suggestion, thank you!
b54399e to
ca2576d
Compare
ca2576d to
a3992d9
Compare
|
Is anybody fixing desktop errors introduced by this change and porting |
|
I will take care of it |
|
@jkotas Thank you! |
* Separate sections READONLY_VCHUNKS and READONLY_DICTIONARY * Remove relocations for second-level indirection of Vtable in case FEATURE_NGEN_RELOCS_OPTIMIZATIONS is enabled. Introduce FEATURE_NGEN_RELOCS_OPTIMIZATIONS, under which NGEN specific relocations optimizations are enabled * Replace push/pop of R11 in stubs with - str/ldr of R4 in space reserved in epilog for non-tail calls - usage of R4 with hybrid-tail calls (same as for EmitShuffleThunk) * Replace push/pop of R11 for function epilog with usage of LR as helper register right before its restore from stack Commit migrated from dotnet/coreclr@61146b5
This pull request replaces absolute pointers with relative in vtable chunks for Linux ARM.
On application, which is referenced in #10380 (comment), the following memory consumption changes occur for mappings of images (based on 08c87c5):
Rss: 8872 -> 8756 (1.3% improvement)
Private_Dirty: 2048 -> 1796 (12.3% improvement)
Private_Clean: 2028 -> 2036
Shared_Clean: 4796 -> 4924
No difference in startup time for GUI applications in case system dlls are crossgened. In case no dlls are crossgened: 0.7% increase of startup time. On micro benchmarks with lots of virtual calls: up to 29% increase of execution time.
cc @Dmitri-Botcharnikov @alpencolt @kvochko