Split chunkcopy_safe to allow the first part to be inlined more often. #1776
Conversation
x86_64, GCC13, develop e4fb380: Split chunkcopy_safe: This shows a decrease in instructions and branches hit during the benchmark, as well as less CPU time spent.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
##           develop    #1776      +/-   ##
===========================================
- Coverage    83.33%   83.02%    -0.31%
===========================================
  Files          132      135        +3
  Lines        10018    10326      +308
  Branches      2687     2796      +109
===========================================
+ Hits          8348     8573      +225
- Misses        1009     1054       +45
- Partials       661      699       +38
[Benchmark charts: develop vs split_chunkcopy on x86-64, rpi3, rpi5a, and rpi5b]
Awesome. Somewhat strange that it didn't inline the whole function body, though. I wrote that copy ladder with the express intent of having it do 32-byte-wide copies with a single move instruction with AVX2. We are probably losing that, given that we don't compile inflate.c with -mavx2, whereas with a header we were compiling it for each implementation via inflate_fast. Then again, 128-bit-wide operations tend to run at a higher clock frequency on Intel.
I went back and verified, and it actually inlined successfully. We could increase that value, but I don't really think that is a good idea. Now, as a result of this, I also went back and experimented with adding two versions of the function: one full version for inflate and the 2-part one for inffast. This shows a small reduction in instructions and branches (so in theory it is better), but a slight increase in CPU time taken; I think that might be caused by a combination of less cache reuse and the increased library size.
New benchmarks: [benchmark charts]
This PR for comparison: [benchmark charts]
For completeness, I did increase max-inline-insns-single from 300 to 500 in order to force the inlining into inffast, and this was the result. Worse than develop currently is in this test.
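For reference, that inlining budget is a GCC tunable set via --param; the exact build wiring below is illustrative, not the project's actual build invocation:

```shell
# Hypothetical GCC invocation: raise the per-function inlining budget
# mentioned above (the file name and flag set are illustrative).
gcc -O2 --param max-inline-insns-single=500 -c inffast.c
```

Raising the budget globally affects every function in the translation unit, which is one reason forcing it is a blunt instrument compared to splitting the function itself.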
The smaller code size is probably going to be the overall benefit. You could try still declaring the function in a header so that the copy loop is called as a separate function but is still compiled with -mavx2, maybe? Though that may try inlining that call and you end up with the same result.
I have a different change around some function inlining in the works that may change the calculus on some of this. I'll try to get something staged later tonight. It ends up inlining more, but it's about a 3.5% improvement in my png decoding benchmarks overall. I have one more change I want to add on top that might improve things further.
Already tried that, it inlines it and it ends up exactly the same as the original unfortunately.
I am excited to test this 👍
Compilers try to accommodate the specified inlining, but fail to do so for inflate and inflate_fast because there are just too many function calls we want to inline, so the compiler inlines some until it hits a threshold, then stops. So not all calls to, for example, chunkcopy_safe get inlined.
This PR divides chunkcopy_safe into two pieces: the first and simplest part, which we try to inline, and the second part, which is longer (in instruction count), so we split it out into a separate function that is not inlined.
In my tests I see 0.8% to 2.7% faster inflate performance from this change.