Eliminate duplicate zero initializations more aggressively.#31960
Eliminate duplicate zero initializations more aggressively.#31960erozenfeld merged 1 commit intodotnet:masterfrom
Conversation
|
Framework diffs: |
|
Sample diff from Microsoft.CodeAnalysis.dll: ; Assembly listing for method ReversedEnumeratorImpl:.ctor(byref):this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) ref -> rdi this class-hnd
; V01 arg1 [V01,T01] ( 3, 3 ) byref -> rdx
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
-; V03 tmp1 [V03 ] ( 3, 6 ) struct (56) [rsp+0x20] do-not-enreg[XSB] must-init addr-exposed "NewObj constructor temp"
+; V03 tmp1 [V03 ] ( 2, 4 ) struct (56) [rsp+0x20] do-not-enreg[XSB] must-init addr-exposed "NewObj constructor temp"
;
; Lcl frame size = 88
G_M63972_IG01:
push rdi
push rsi
sub rsp, 88
- vzeroupper
mov rsi, rcx
lea rdi, [rsp+20H]
mov ecx, 14
xor rax, rax
rep stosd
mov rcx, rsi
mov rdi, rcx
- ;; bbWeight=1 PerfScore 30.00
+ ;; bbWeight=1 PerfScore 29.00
G_M63972_IG02:
- xor ecx, ecx
- vxorps xmm0, xmm0
- vmovdqu xmmword ptr [rsp+20H], xmm0
- vmovdqu xmmword ptr [rsp+30H], xmm0
- vmovdqu xmmword ptr [rsp+40H], xmm0
- mov qword ptr [rsp+50H], rcx
lea rcx, bword ptr [rsp+20H]
call Enumerator:.ctor(byref):this
add rdi, 8
lea rsi, bword ptr [rsp+20H]
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
movsq
call CORINFO_HELP_ASSIGN_BYREF
call CORINFO_HELP_ASSIGN_BYREF
movsq
- ;; bbWeight=1 PerfScore 13.83
+ ;; bbWeight=1 PerfScore 9.25
G_M63972_IG03:
add rsp, 88
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 2.25
-; Total bytes of code 113, prolog size 29, PerfScore 57.78, (MethodHash=e622061c) for method ReversedEnumeratorImpl:.ctor(byref):this
+; Total bytes of code 81, prolog size 26, PerfScore 48.60, (MethodHash=e622061c) for method ReversedEnumeratorImpl:.ctor(byref):this
|
|
@dotnet/jit-contrib @AndyAyersMS PTAL |
|
Benchmark diffs: |
src/coreclr/src/jit/flowgraph.cpp
Outdated
There was a problem hiding this comment.
Can you add jitdump here for the case where the local is used by we decide we don't need to zero init?
src/coreclr/src/jit/flowgraph.cpp
Outdated
There was a problem hiding this comment.
We set BBF_BACKWARD_JUMP very early, just based on lexical IL. So it may encompass non-loop blocks (eg loop with a branch that leads to a return).
We may also optimize away backwards branches (probably rare, but not impossible).
I wonder how often we end up with extra zeroing from BBF_BACKWARD_JUMP being approximate? Any Idea?
There was a problem hiding this comment.
No, I don't know. I guess I can try to allow this optimization for BBJ_RETURN basic blocks in root methods since we can execute those blocks only once even in a loop. That won't account for all cases you are asking about but I don't see an easy way to capture those.
There was a problem hiding this comment.
To estimate how often this happens, you could perhaps set a new BBF in blocks where we add zero inits. Then later when we run some more robust form of loop detection, see how many non-loop blocks have this BBF.
0a80c31 to
b9a2f2b
Compare
src/coreclr/src/jit/flowgraph.cpp
Outdated
There was a problem hiding this comment.
Nit: please use lvaGetDesc(varNum) instead of &lvaTable[tmpNum] here and below.
We have 3 places in the code where we may need to insert explicit zero
initialization. In some cases the initialization may become redundant with prolog
initialization so we have logic to avoid explicit initialization in some cases.
This change makes the logic less conservative.
1. If the variable is not being initialized in a loop, we can avoid explicit zero initialization if
- the variable is a gc pointer, or
- the variable is a struct with gc pointer fields and either all fields are gc pointer fields
or the struct is big enough to guarantee block initialization, or
- compInitMem is set and the variable has a long lifetime or has gc fields.
Before this change we always inserted explicit zero initalization when compInitMem
wasn't set and the variable was a gc pointer or a struct with gc fields. This relaxes that
as described above.
2. When we were allocating structs via newobj, we were always inserting explicit zero initializations
when importing an inlinee. This change applies the same optimization if the basic block can't be
in a loop after inlining.
3. When we were initializing inlinee locals, we were only optimizing explicit zero initializations
for structs; however, the same logic applies to non-structs.
4. If the initialization basic block is in a loop but is BBJ_RETURN, we may avoid zero initialization
since the block will be executed at most once per method invocation.
b9a2f2b to
66c810e
Compare
|
It turns out that the initial PR was too aggressive. My assumption that we always fully zero initialize structs with GC fields in the prolog was incorrect. The prolog zero initialization logic lives in runtime/src/coreclr/src/jit/codegencommon.cpp Lines 6403 to 6418 in 8079a27 So I modified my changes to skip zero initialization in structs with GC fields when While studying I also made the change to not block the optimization for This leaves us with about half of the original wins, which is still worthwhile. |
|
New framework diff results: |
|
Benchmark diffs: |
|
@AndyAyersMS @sandreenko PTAL |
|
|
||
| // TODO-Review: The code below is currently unreachable. We are guaranteed to execute one of the | ||
| // 'continue' statements above. | ||
| #if 0 |
There was a problem hiding this comment.
I don't want to hold up getting this merged - looks great BTW - but why didn't you just delete this?
There was a problem hiding this comment.
I want to go back and see if any of this needs to be resurrected. It looks like part of the unreachable code was supposed to be responsible for correctness. I'll clean this up in a subsequent PR.
There was a problem hiding this comment.
Perhaps open an issue for follow up on this, so we don't forget?
There was a problem hiding this comment.
I added a comment to the issue that tracks improving the heuristic: #8890 (comment)
AndyAyersMS
left a comment
There was a problem hiding this comment.
Thanks, looks good.
Seems like we should probably block init more aggressively in the prolog (as well as init more efficiently).
|
Yes, we have #8890 open |
We have 3 places in the code where we may need to insert explicit zero
initialization. In some cases the initialization may become redundant with prolog
initialization so we have logic to avoid explicit initialization in some cases.
This change makes the logic less conservative.
If the variable is not being initialized in a loop, we can avoid explicit zero initialization if
Before this change we inserted explicit zero initalization in one more case: when compInitMem
wasn't set and the variable was a gc pointer or a struct with gc fields. That is not necessary,
so this change fixes that in fgVarNeedsExplicitZeroInit.
When we were allocating structs via newobj, we were always inserting explicit zero initializations
when importing an inlinee. This change applies the same optimization if the basic block can't be
in a loop after inlining.
When we were initializing inlinee locals, we were only optimizing explicit zero initializations
for structs; however, the same logic applies to non-structs.