Newly implemented partial loop-unrolling support for RyuJIT #19594
ArtBlnd wants to merge 50 commits into dotnet:master from
Conversation
In debug, all my tests are passing locally, and I

That sounds odd; checked builds are sort of like debug builds but with compiler optimizations enabled, so they're reasonably fast. I usually use checked builds rather than debug builds, precisely because they're faster.

Hmm, I don't see how that could happen. Obviously, something is different between the builds if tests pass with one build but not with the other.

Ready for review :>
This replaces the old loop-unrolling optimization, which only supported full unrolling of loops over Vector objects. The goal is to support more common cases, based on GenTree cost. The following methods will be fixed.
PTAL @mikedn @AndyAyersMS
@ArtBlnd I'll start taking a look -- but I have a couple of suggestions. Since you have refactored the existing full-unroll logic to also support partial unrolling, can you add a new jit config option that allows partial unrolling to be enabled/disabled, and then show that when partial unrolling is disabled the jit produces the same code as it does now (via jit-diff)? This gives us confidence that nothing important was lost or changed by the refactoring. I would recommend setting the new config option to disable partial unrolling by default for now, and once we are happy with the refactoring, we can start to explore what happens when we try partial unrolling. It would also be helpful if you could summarize the strategy you're proposing for partial unrolling (even if parts of it are not yet implemented) and any results you care to highlight, either good or bad. I realize there is some discussion of the strategy in comments, but I'd like to see something that abstracts away from the code.
First, thank you for reviewing :> I don't see any case where the old full unrolling helps, even in the test cases.
At this time, I don't see much of a problem with it.
Usually we run diffs over all the test cases, since they're -- for the most part -- not categorized in useful ways. But you can start with just the framework assemblies too. I'd recommend learning how to use
I've run jit-diff over all the test cases. Most of the places where unrolling kicks in are regressions, and most of the improvements triggered by unrolling are (not really sure) for SIMD optimizations. I think it's generating different instructions with better results (for the optimizations), caused by the unrolling. [ADDED] There actually is something wrong with unrolling: it sometimes does not perform full unrolling if the target IR has a SIMD operator.
Did the OSX10.12 x64 Checked test fail because of my PR, or really just from a lack of disk space?...
Probably just some issue with the CI system. Let's retry. @dotnet-bot retest OSX10.12 x64 Checked Innerloop Build and Test

https://1drv.ms/u/s!Avjm8UvmA5M9gdMAs4KlE0rH5c3M2A Here is the full jit-diff for this PR. Can you check it? @AndyAyersMS [ADDED] I've checked that the improvements really are triggered by the optimizer (useless code is removed by unrolling), not by full unrolling failing to trigger or something like that. But I still don't know why that
The current PR does NOT deliver a performance improvement yet. DO NOT MERGE.

TODO: Apply branch-prediction theory to bounds checks.
The biggest binary size change was on the Bytemark test.
After PR
Before PR
Something is wrong with it :< I expected more than 5~10% improvement.
What CPU are you running the tests on?
Those results are based on the CI `Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness` job.
I think I have to remove the dependency in the common ALU loops for super-scalar execution. It's really hard to remove (or very high-cost to remove the dependency), and it's only available for unsafe memory access.
Some CPUs that have a loop stream detector may lose perf gains because the unrolled loop body becomes too large for the L.S.D. to work.
Hope this helps.
Oh, that's really bad news. I thought unrolling more than 16 iterations was effective (because of the Intel optimization manual), but that makes sense. I'll have a try. Thanks :>
From my experience running VTune on Nehalem (quite an old CPU), in many cases I saw lots of time spent after the condition check at the loop end, and then at the beginning of the loop. I don't have a good explanation for this behavior though - maybe it's related to the branch predictor, some uops-cache stuff, or the decoding machinery doing its work but delaying instruction execution - so treat this just as additional info.
I hope one day loop head alignment will be implemented :-) - see #11607
The L.S.D. threshold used to be based only on fetch size; now it can also be based on iteration count, which can maximize L.S.D. performance.
Sorry that I forget it every time...
Also a fixup to not clone if the iteration variant is zero.
Also removed unused variables and cleaned up comments.
Fixed to unroll partially first, and relaxed the cache-line-based unroll condition.
Also fixed typos in comments.
For future super-scalar support implementation.
Force-pushed from 99aa94e to 27f9dbf
Rebased the current branch to compare the original jit with the unrolled jit.
I've tested the verbose issues and I don't think there is anything wrong with it; jit-dump emits fine.
@AndyAyersMS Can I have feedback on these comments? Thanks.
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
@ArtBlnd sorry again for the lack of activity here. I am going to close this PR; it is just too big a step to take. A while back in this PR's history I outlined a bunch of steps I'd like to see followed if you or anyone else wants to take up this work item again after the repo consolidation. So if/when we revisit this, please keep those in mind. I'm happy to collaborate with you on a plan that gets us moving towards this functionality in a number of smaller, well-thought-out steps.


This will replace PR #18016