This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Newly implemented partial loop-unrolling support for RyuJIT #19594

Closed
ArtBlnd wants to merge 50 commits into dotnet:master from ArtBlnd:partial-unrolling-support

Conversation

@ArtBlnd

@ArtBlnd ArtBlnd commented Aug 22, 2018

This will replace PR #18016.

@ArtBlnd
Author

ArtBlnd commented Aug 22, 2018

@AndyAyersMS

@ArtBlnd ArtBlnd changed the title Newly implemented partial loop-unrolling support for RyuJIT [WIP] Newly implemented partial loop-unrolling support for RyuJIT Aug 22, 2018
@ArtBlnd
Author

ArtBlnd commented Aug 23, 2018

In Debug, all my local tests are passing, but the Checked build just isn't.
I want to debug crossgenning the Checked build's native images from a Debug build. Is that possible?

@ArtBlnd
Author

ArtBlnd commented Aug 23, 2018

@mikedn

@mikedn

mikedn commented Aug 23, 2018

In Debug, all my local tests are passing, but the Checked build just isn't.

That sounds odd; checked builds are sort of like debug builds but with compiler optimizations enabled, so they're reasonably fast. I usually use checked builds rather than debug builds, precisely because they're faster.

I want to debug crossgenning the Checked build's native images from a Debug build. Is that possible?

Hmm, I don't see how you could do that. Obviously, there's something different between the builds if tests pass with one but not with the other.

@ArtBlnd ArtBlnd changed the title [WIP] Newly implemented partial loop-unrolling support for RyuJIT Newly implemented partial loop-unrolling support for RyuJIT Aug 28, 2018
@ArtBlnd
Author

ArtBlnd commented Aug 29, 2018

Ready for review :>

@ArtBlnd
Author

ArtBlnd commented Aug 29, 2018

This implementation will replace the old loop-unrolling optimization, which only supports fully unrolling loops over Vector objects. The goal is to support more common cases, driven by GenTree cost.

The following will be addressed:

  1. Unrolling no longer limited to Vector objects
  2. Partial unrolling, with an inner threshold and an iteration threshold
  3. Outer-loop unrolling, also supported via partial unrolling
  4. Merging the partial-unrolling and full-unrolling implementations
  5. Unrolling support for complicated operator types that contain the iteration local variable
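As a sketch of what partial unrolling means here (written in C++ as a stand-in for the JIT-level IR transformation, not code from this PR): the main loop advances several elements per iteration, and a remainder loop handles trip counts that aren't a multiple of the unroll factor.

```cpp
#include <cassert>
#include <cstddef>

// Original (rolled) loop: one element per iteration.
long sum_rolled(const int* a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// The same loop partially unrolled by a factor of 4: a main loop that
// advances four elements per iteration, followed by a scalar remainder
// loop for the leftover 0-3 elements.
long sum_unrolled4(const int* a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; ++i)  // remainder loop
        s += a[i];
    return s;
}
```

Both versions compute the same result for any trip count; the unrolled form trades code size for fewer branch and induction-variable updates per element.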

@ArtBlnd
Author

ArtBlnd commented Aug 29, 2018

PTAL @mikedn @AndyAyersMS

@AndyAyersMS
Member

@ArtBlnd I'll start taking a look -- but I have a couple of suggestions.

Since you have refactored the existing full-unroll logic to also support partial unrolling, can you add a new jit config option that allows partial unrolling to be enabled/disabled, and then show (via jit-diff) that with partial unrolling disabled the jit produces the same code as it does now? This gives us confidence that nothing important was lost or changed by the refactoring.

I would recommend for now setting the new config option to disable partial unrolling by default, and once we are happy with the refactoring, we can start to explore what happens when we try partial unrolling.
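A minimal sketch of the kind of config gate being suggested, assuming an environment-variable switch; the name COMPlus_JitPartialUnroll is hypothetical, not the actual knob in this PR.

```cpp
#include <cstdlib>

// Hypothetical gate: partial unrolling stays off unless the switch is set,
// so a jit-diff run with it off should reproduce current codegen exactly.
// COMPlus_JitPartialUnroll is an assumed name for illustration only.
static bool partialUnrollEnabled() {
    const char* v = std::getenv("COMPlus_JitPartialUnroll");
    return v != nullptr && v[0] == '1';  // default: disabled
}
```

Defaulting the gate to off is what makes the refactoring verifiable: the "off" jit-diff isolates refactoring changes from the new optimization's changes.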

It would also be helpful if you could summarize the strategy you're proposing for partial unrolling (even if parts of it are not yet implemented) and any results you care to highlight, either good or bad. I realize there is some discussion of the strategy in comments but I'd like to see something that abstracts away from the code.

@ArtBlnd
Author

ArtBlnd commented Aug 30, 2018

First, thank you for reviewing :>

I don't see any case where the old full unrolling does well, even in the test cases.
Can you recommend cases that would be good for checking whether something was lost or changed?

@ArtBlnd
Author

ArtBlnd commented Aug 30, 2018

It would also be helpful if you could summarize the strategy you're proposing for partial unrolling (even if parts of it are not yet implemented) and any results you care to highlight, either good or bad. I realize there is some discussion of the strategy in comments but I'd like to see something that abstracts away from the code.

At this time I don't see much of a problem with it.
But first of all, the partial-unrolling threshold code is based on gtSetStmtInfo, which computes GenTree costs from the IR, and platforms do not all share the same efficient inner-loop unrolling threshold. That can cause performance degradation on some platforms. (It will still be better than not unrolling at all, but at a cost in code size.)
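A rough sketch of a cost-driven heuristic like the one described: weigh the per-iteration body cost (as an IR cost estimate of the kind gtSetStmtInfo produces) against a per-platform budget, and pick the largest power-of-two factor that fits. The budget value and function shape are assumptions for illustration, not the PR's actual code.

```cpp
#include <algorithm>

// Pick an unroll factor so that (body cost x factor) stays within a
// per-platform budget. Returns 1 when unrolling doesn't pay off.
int chooseUnrollFactor(int bodyCost, int platformBudget) {
    int factor = 1;
    while (factor < 8 && bodyCost * (factor * 2) <= platformBudget)
        factor *= 2;  // try 2, 4, 8 while total cost fits the budget
    return factor;
}
```

The per-platform budget is exactly where the concern above bites: an IR cost model is platform-neutral, so the same factor can be a win on one microarchitecture and a code-size regression on another.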

@AndyAyersMS
Member

Usually we run diffs over all the test cases, since they're -- for the most part -- not categorized in useful ways. But you can start with just framework assemblies too.

I'd recommend learning how to use jit-diff from the jitutils repo. In particular running jit-diff diff --pmi to see what impact your changes are having on a wide variety of methods.

@ArtBlnd
Author

ArtBlnd commented Aug 31, 2018

I've run jit-diff over all the test cases.
As I expected, the diff is a regression. Here are the results:

Total bytes of diff: 20356 (0.01% of base)
    diff is a regression.

Top file regressions by size (bytes):
        2356 : JIT\Directed\UnrollLoop\loop2_cs_do\loop2_cs_do.dasm (25.99% of base)
        2020 : JIT\Directed\UnrollLoop\loop2_cs_ro\loop2_cs_ro.dasm (22.88% of base)
        1716 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm (3.64% of base)
         918 : JIT\jit64\opt\lur\lur_02\lur_02.dasm (33.16% of base)
         901 : JIT\Performance\CodeQuality\Bytemark\Bytemark\Bytemark.dasm (1.05% of base)

Top file improvements by size (bytes):
       -2181 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm (-19.26% of base)
        -593 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm (-4.87% of base)
        -453 : JIT\SIMD\VectorAbs_ro\VectorAbs_ro.dasm (-7.55% of base)
        -453 : JIT\SIMD\VectorAdd_ro\VectorAdd_ro.dasm (-7.46% of base)
        -453 : JIT\SIMD\VectorHWAccel2_ro\VectorHWAccel2_ro.dasm (-9.92% of base)

196 total files with size differences (65 improved, 131 regressed), 8377 unchanged.

Top method regressions by size (bytes):
         716 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - JIT.HardwareIntrinsics.X86.SimpleBinaryOpTest__Permute2x128Double2:.ctor():this
         709 : JIT\Directed\intrinsic\pow\pow2_cs_do\pow2_cs_do.dasm - pow2:Main():int
         709 : JIT\Directed\intrinsic\pow\pow2_cs_ro\pow2_cs_ro.dasm - pow2:Main():int
         500 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - JIT.HardwareIntrinsics.X86.SimpleBinaryOpTest__Permute2x128Int642:.ctor():this
         500 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - JIT.HardwareIntrinsics.X86.SimpleBinaryOpTest__Permute2x128UInt642:.ctor():this

Top method improvements by size (bytes):
       -1274 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm - VectorMulTest`1[Int32][System.Int32]:VectorMul(int,int,int,int,int):int
        -554 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm - VectorMulTest`1[Double][System.Double]:VectorMul(double,double,double,double,double):int
        -511 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:VectorTReturnTest():int
        -283 : JIT\SIMD\CreateGeneric_ro\CreateGeneric_ro.dasm - VectorMathTests.Program:Main(ref):int
        -271 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm - VectorMulTest`1[Int64][System.Int64]:VectorMul(long,long,long,long,long):int

801 total methods with size differences (291 improved, 510 regressed), 349685 unchanged.

Most of what the unrolling process produces is a regression, and most of the improvements seem to be triggered by unrolling enabling SIMD optimizations (not really sure).

I think unrolling lets the optimizer generate different instructions with better results.

[ADDED] There actually is something wrong with the unrolling: it sometimes does not perform full unrolling when the target IR has a SIMD operator.

@ArtBlnd
Author

ArtBlnd commented Aug 31, 2018

The OSX10.12 x64 Checked test failed due to lack of disk space. Is that because of my PR, or really just lack of disk space?...

@AndyAyersMS
Member

Probably just some issue with the CI system. Let's retry....

@dotnet-bot retest OSX10.12 x64 Checked Innerloop Build and Test

@ArtBlnd
Author

ArtBlnd commented Sep 4, 2018

Analyzing diffs...
Found 237 files with textual diffs.
PMI Diffs for assemblies in C:\coreclr\bin\tests\Windows_NT.x64.Release for  default jit
Summary:
(Lower is better)
Total bytes of diff: 34020 (0.02% of base)
    diff is a regression.
Top file regressions by size (bytes):
        2356 : JIT\Directed\UnrollLoop\loop2_cs_do\loop2_cs_do.dasm (25.99% of base)
        2020 : JIT\Directed\UnrollLoop\loop2_cs_ro\loop2_cs_ro.dasm (22.88% of base)
        1716 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm (3.64% of base)
         918 : JIT\jit64\opt\lur\lur_02\lur_02.dasm (33.16% of base)
         901 : JIT\Performance\CodeQuality\Bytemark\Bytemark\Bytemark.dasm (1.08% of base)
Top file improvements by size (bytes):
        -294 : JIT\jit64\opt\lim\lim_002\lim_002.dasm (-13.25% of base)
        -244 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm (-2.16% of base)
         -93 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm (-0.76% of base)
         -70 : JIT\Directed\UnrollLoop\loop4_cs_do\loop4_cs_do.dasm (-1.71% of base)
         -70 : JIT\Directed\UnrollLoop\loop4_cs_ro\loop4_cs_ro.dasm (-1.71% of base)
193 total files with size differences (38 improved, 155 regressed), 8381 unchanged.
Top method regressions by size (bytes):
         716 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - SimpleBinaryOpTest__Permute2x128Double2:.ctor():this
         709 : JIT\Directed\intrinsic\pow\pow2_cs_do\pow2_cs_do.dasm - pow2:Main():int
         709 : JIT\Directed\intrinsic\pow\pow2_cs_ro\pow2_cs_ro.dasm - pow2:Main():int
         500 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - SimpleBinaryOpTest__Permute2x128Int642:.ctor():this
         500 : JIT\HardwareIntrinsics\X86\Avx\Permute2x128.Avx_ro\Permute2x128.Avx_ro.dasm - SimpleBinaryOpTest__Permute2x128UInt642:.ctor():this
Top method improvements by size (bytes):
        -108 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm - VectorMulTest`1:VectorMul(double,double,double,double,double):int
         -93 : JIT\SIMD\VectorReturn_ro\VectorReturn_ro.dasm - VectorTest:VectorTReturnTest():int
         -75 : JIT\SIMD\VectorMul_ro\VectorMul_ro.dasm - VectorMulTest`1:VectorMul(int,int,int,int,int):int
         -70 : JIT\Directed\UnrollLoop\loop4_cs_do\loop4_cs_do.dasm - SmallLoop1:Main():int
         -70 : JIT\Directed\UnrollLoop\loop4_cs_ro\loop4_cs_ro.dasm - SmallLoop1:Main():int
757 total methods with size differences (202 improved, 555 regressed), 302016 unchanged.
44 files had text diffs but not size diffs.
JIT\Performance\CodeQuality\SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm had 72 diffs
JIT\SIMD\VectorGet_ro\VectorGet_ro.dasm had 60 diffs
JIT\SIMD\VectorDiv_ro\VectorDiv_ro.dasm had 58 diffs
JIT\SIMD\VectorMax_ro\VectorMax_ro.dasm had 56 diffs
JIT\SIMD\VectorMin_ro\VectorMin_ro.dasm had 56 diffs
Completed analysis in 416.03s

https://1drv.ms/u/s!Avjm8UvmA5M9gdMAs4KlE0rH5c3M2A

Here is the full test jit-diff for this PR. Can you check it? @AndyAyersMS

[ADDED] I've checked that the improvements really are triggered by the optimizer (useless code removed thanks to unrolling); full unrolling isn't being triggered.

[image]

Something like this, but I still don't know why that cmp instruction isn't removed.

@ArtBlnd
Author

ArtBlnd commented Sep 4, 2018

This PR is currently NOT a performance improvement. DO NOT MERGE.

@ArtBlnd ArtBlnd changed the title Newly implemented partial loop-unrolling support for RyuJIT [WIP] Newly implemented partial loop-unrolling support for RyuJIT Sep 4, 2018
@ArtBlnd
Author

ArtBlnd commented Sep 4, 2018

TODO: Apply branch-prediction theory to the bounds checks.

@ArtBlnd
Author

ArtBlnd commented Sep 5, 2018

The biggest binary-size change was in the Bytemark test.

After PR

 Bytemark.exe                         | Metric   | Unit | Iterations |  Average | STDEV.S |      Min |      Max
:------------------------------------ |:-------- |:----:|:----------:| --------:| -------:| --------:| --------:
 ByteMark.BenchAssignJagged           | Duration | msec |     2      | 1339.450 |  12.401 | 1330.681 | 1348.219
 ByteMark.BenchAssignRectangular      | Duration | msec |     2      | 1705.360 |   5.544 | 1701.440 | 1709.281
 ByteMark.BenchBitOps                 | Duration | msec |     2      | 1352.882 |   3.743 | 1350.235 | 1355.529
 ByteMark.BenchEmFloat                | Duration | msec |     2      | 4228.507 | 102.771 | 4155.837 | 4301.177
 ByteMark.BenchEmFloatClass           | Duration | msec |     2      |  857.242 |  42.785 |  826.989 |  887.496
 ByteMark.BenchFourier                | Duration | msec |     2      |  817.118 |  38.716 |  789.742 |  844.495
 ByteMark.BenchIDEAEncryption         | Duration | msec |     2      | 1256.892 |  43.960 | 1225.807 | 1287.976
 ByteMark.BenchLUDecomp               | Duration | msec |     2      | 1602.163 |   8.455 | 1596.185 | 1608.142
 ByteMark.BenchNeural                 | Duration | msec |     2      | 1142.809 |  13.389 | 1133.341 | 1152.276
 ByteMark.BenchNeuralJagged           | Duration | msec |     2      | 1175.733 |  12.474 | 1166.913 | 1184.553
 ByteMark.BenchNumericSortJagged      | Duration | msec |     2      | 1672.816 |  26.731 | 1653.914 | 1691.717
 ByteMark.BenchNumericSortRectangular | Duration | msec |     2      | 1737.014 |  36.482 | 1711.217 | 1762.810
 ByteMark.BenchStringSort             | Duration | msec |     2      | 1981.515 |  57.016 | 1941.198 | 2021.831

Before PR

 Bytemark.exe                         | Metric   | Unit | Iterations |  Average | STDEV.S |      Min |      Max
:------------------------------------ |:-------- |:----:|:----------:| --------:| -------:| --------:| --------:
 ByteMark.BenchAssignJagged           | Duration | msec |     2      | 1297.203 |  52.673 | 1259.958 | 1334.449
 ByteMark.BenchAssignRectangular      | Duration | msec |     2      | 1690.054 |  72.995 | 1638.439 | 1741.669
 ByteMark.BenchBitOps                 | Duration | msec |     2      | 1351.465 |   2.517 | 1349.685 | 1353.244
 ByteMark.BenchEmFloat                | Duration | msec |     2      | 4659.788 | 133.808 | 4565.171 | 4754.404
 ByteMark.BenchEmFloatClass           | Duration | msec |     2      |  973.132 |   0.541 |  972.749 |  973.514
 ByteMark.BenchFourier                | Duration | msec |     2      |  782.487 |  44.610 |  750.943 |  814.031
 ByteMark.BenchIDEAEncryption         | Duration | msec |     2      | 1378.041 |  46.912 | 1344.869 | 1411.212
 ByteMark.BenchLUDecomp               | Duration | msec |     2      | 1685.830 |  32.736 | 1662.682 | 1708.977
 ByteMark.BenchNeural                 | Duration | msec |     2      | 1188.487 |  19.616 | 1174.616 | 1202.358
 ByteMark.BenchNeuralJagged           | Duration | msec |     2      | 1176.183 |   8.082 | 1170.468 | 1181.898
 ByteMark.BenchNumericSortJagged      | Duration | msec |     2      | 1662.520 |  28.560 | 1642.325 | 1682.715
 ByteMark.BenchNumericSortRectangular | Duration | msec |     2      | 1723.915 |  73.229 | 1672.134 | 1775.696
 ByteMark.BenchStringSort             | Duration | msec |     2      | 1987.932 | 101.412 | 1916.223 | 2059.641

Something is wrong with it :< I expected improvements of more than 5~10%.

@voinokin

voinokin commented Sep 6, 2018

What CPU are you running the tests on?

@ArtBlnd
Author

ArtBlnd commented Sep 6, 2018

What CPU are you running the tests on?

Those results are from the CI job `Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness`.

@ArtBlnd
Author

ArtBlnd commented Sep 6, 2018

I think I have to remove the dependency on the common ALU in these loops so they can run superscalar, and also avoid incrementing the iteration variable throughout the body.

That dependency is really hard (or very costly) to remove, and removing it is only possible with unsafe memory access.
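The dependency concern can be sketched like this (a C++ illustration, not PR code): a single accumulator forms one serial ALU chain through every iteration, while independent partial sums let a superscalar core overlap the additions. The unrolled body also indexes off i rather than incrementing it per element, so only one induction-variable update remains per group.

```cpp
#include <cstddef>

// Two independent accumulator chains: s0 and s1 carry no dependency on
// each other, so their adds can issue in parallel on a superscalar core.
long sum_two_accumulators(const int* a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {   // one induction update per group
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];                // remainder element
    return s0 + s1;
}
```

This only helps for reductions where the operation is reassociable; for integer addition it is exact, while for floating point it changes the summation order.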

@voinokin

voinokin commented Sep 6, 2018

Some CPUs that have a loop stream detector may lose perf gains when the loop body becomes too large for the L.S.D. to work.

The Loop Stream Detector was introduced in the Intel Core microarchitectures. With Sandy Bridge it was 28 μops; it grows to 56 μops (with HT off) on Haswell and Broadwell, and grows again (considerably) with Skylake, to 64 μops per thread (HT on or off). The LSD resides in the BPU (branch prediction unit), and it basically tells the BPU to stop predicting branches and just stream from the LSD.

Hope this helps.
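The constraint described here reduces to a simple budget check: an unrolled body keeps streaming from the LSD only while its total μop count fits the capacity. The per-body μop count is an assumed input below (this code estimates nothing itself); the capacities are the figures quoted above.

```cpp
// True when the unrolled loop body still fits the loop stream detector,
// i.e. total uops (body uops x unroll factor) <= LSD capacity.
// bodyUops is an assumed estimate supplied by the caller.
bool fitsInLSD(int bodyUops, int unrollFactor, int lsdCapacity) {
    return bodyUops * unrollFactor <= lsdCapacity;
}
```

An unroll heuristic could use this as an extra cap: pick the largest factor whose total μop count still fits the smallest LSD capacity the JIT wants to target.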

@ArtBlnd
Author

ArtBlnd commented Sep 6, 2018

Some CPUs that have a loop stream detector may lose perf gains when the loop body becomes too large for the L.S.D. to work.

Oh, that's really bad news. I thought unrolling more than 16 iterations was effective (because of the Intel optimization manual), but that makes sense. I'll give it a try. Thanks :>

@voinokin

voinokin commented Sep 6, 2018

From my experience running VTune on Nehalem (quite an old CPU), in many cases I saw lots of time spent after the condition check at the loop end, and then at the beginning of the loop. I don't have a good explanation for this behavior, though; maybe it's related to the branch predictor, some μop-cache stuff, or the decoding machinery doing its work while delaying instruction execution. So treat this just as additional info.

@ArtBlnd
Author

ArtBlnd commented Sep 6, 2018

[image]

I've found some L.S.D. optimization guidance in the Intel optimization manual.
For the current code, the L.S.D. will not be triggered because the loop will not run more than 16 iterations.

I think it's better to optimize for the L.S.D. when the loop has more than 64 iterations.
Anyway, thanks for the advice :>

@voinokin

voinokin commented Sep 6, 2018

I hope one day loop head alignment will be implemented :-) - see #11607

The L.S.D. threshold used to be based on fetch size; now it can also be based on iteration count, which can maximize L.S.D. performance.
Sorry that I forget it every time...
Also a fixup to not clone if the iteration variant is zero.
Also removed unused variables and cleaned up comments.
Fixed to unroll partially first, and relaxed the cache-line-based unroll condition.
And fixed typos in comments.
For future superscalar support implementation.
@ArtBlnd ArtBlnd force-pushed the partial-unrolling-support branch from 99aa94e to 27f9dbf Compare January 10, 2019 06:27
@ArtBlnd
Author

ArtBlnd commented Jan 10, 2019

Rebased the current branch to compare the original JIT and the unrolled JIT.

@ArtBlnd
Author

ArtBlnd commented Jan 11, 2019

I've tested the verbose issues and I don't think anything is wrong there; jit-dump emits fine.

@ArtBlnd
Author

ArtBlnd commented Jan 25, 2019

@AndyAyersMS Can I have feedback on these comments? Thanks.
Actually, I have been waiting a long time for this PR to be merged. I am also ready to do some work on the IV-widening issues, and I just can't, because of this PR: I will have to rebase the IV-widening patch if this does not get merged. This feels wrong. I know this is kind of a huge patch, with large changes to the loop-unrolling idioms in RyuJIT, but no one is CLEARLY reviewing or helping, even though it has large performance increases.

@maryamariyan

Thank you for your contribution. As announced in dotnet/coreclr#27549, this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after that date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create a patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@AndyAyersMS
Member

@ArtBlnd sorry again for the lack of activity here. I am going to close this PR. It is just too big a step to take.

A while back in this PR's history I outlined a bunch of steps I'd like to see followed if you or anyone else wants to take up this work item again after the repo consolidation.

So if/when we revisit this, please keep those in mind. I'm happy to collaborate with you on a plan that gets us moving toward this functionality in a number of smaller, well-thought-out steps.

@AndyAyersMS AndyAyersMS closed this Nov 7, 2019

7 participants