Split chunkcopy_safe to allow the first part to be inlined more often. #1776

Merged
Dead2 merged 1 commit into develop from split_chunkcopy_safe on Sep 13, 2024

@Dead2
Member

@Dead2 Dead2 commented Sep 11, 2024

Compilers try to honor the requested inlining, but fail to do so for inflate and inflate_fast because there are simply too many function calls we want inlined. The compiler inlines calls until it hits a threshold and then stops, so not all calls to, for example, chunkcopy_safe get inlined.

This PR divides chunkcopy_safe into two pieces. The first and simplest part is the one we try to inline. The second part is longer (in instruction count), so it is split out into a separate function that is not inlined.

In my tests I see 0.8% to 2.7% faster inflate performance from this change.
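
As a rough illustration of the split described above, here is a minimal C sketch. It is not the zlib-ng code: the names `sketch_copy_safe`, `sketch_copy_overlap` and `SKETCH_NOINLINE` are hypothetical, and the `limit` parameter is assumed to point one past the last writable byte. The first part clamps the length and handles the non-overlapping case with `memcpy`; the longer overlap handling sits in a separate function marked noinline, so its size does not count against the caller's inlining budget.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NOINLINE /* assumption: other compilers decide on their own */
#endif

/* Second part (hypothetical): overlap-tolerant, byte-wise forward copy.
 * Kept out of line so that only a call instruction lands in the caller. */
static SKETCH_NOINLINE uint8_t *sketch_copy_overlap(uint8_t *out, const uint8_t *from, uint64_t len) {
    while (len--)
        *out++ = *from++;
    return out;
}

/* First part (hypothetical): short enough that an inliner is likely to accept it.
 * `limit` is assumed to point one past the last byte that may be written. */
static inline uint8_t *sketch_copy_safe(uint8_t *out, const uint8_t *from, uint64_t len, const uint8_t *limit) {
    assert(out <= limit);
    uint64_t room = (uint64_t)(limit - out);
    if (len > room)
        len = room;

    /* Common case: the ranges do not overlap, so a plain memcpy is valid. */
    if (from + len <= out || out + len <= from) {
        memcpy(out, from, (size_t)len);
        return out + len;
    }

    /* Source equals destination: nothing to move. */
    if (out == from)
        return out + len;

    /* Everything else goes to the out-of-line part. */
    return sketch_copy_overlap(out, from, len);
}
```

Calling `sketch_copy_safe(buf + 2, buf, 6, buf + 16)` on a buffer that starts with "ab" takes the out-of-line path and repeats the two-byte pattern, which is the overlapping-match behaviour an inflate-style copy relies on.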

@Dead2
Member Author

Dead2 commented Sep 11, 2024

x86_64, GCC13
Using -falign-functions=64 to avoid some cache-line alignment effects

Develop e4fb380:

   text    data     bss     dec     hex filename
 135622    1312       8  136942   216ee libz-ng.so.2
          5,897.17 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.04% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               144      page-faults:u                    #   24.418 /sec                        ( +-  0.31% )
    20,485,703,570      cycles:u                         #    3.474 GHz                         ( +-  0.04% )
    35,731,150,672      instructions:u                   #    1.74  insn per cycle              ( +-  0.00% )
     3,758,779,847      branches:u                       #  637.387 M/sec                       ( +-  0.00% )
       182,334,345      branch-misses:u                  #    4.85% of all branches             ( +-  0.16% )

           5.89758 +- 0.00223 seconds time elapsed  ( +-  0.04% )

Split chunkcopy_safe:

   text    data     bss     dec     hex filename
 135966    1312       8  137286   21846 libz-ng.so.2
          5,733.90 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.939 /sec                        ( +-  0.30% )
    19,908,943,906      cycles:u                         #    3.472 GHz                         ( +-  0.01% )
    35,633,452,503      instructions:u                   #    1.79  insn per cycle              ( +-  0.00% )
     3,749,361,374      branches:u                       #  653.894 M/sec                       ( +-  0.00% )
       181,427,372      branch-misses:u                  #    4.84% of all branches             ( +-  0.04% )

          5.734291 +- 0.000584 seconds time elapsed  ( +-  0.01% )

This shows a decrease in instructions and branches executed during the benchmark, as well as less CPU time spent.
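
To put the two perf runs above side by side, the relative change of each counter can be computed with a few lines of C. This is only a sketch: the helper names (`sketch_pct_change`, `sketch_counters`, `sketch_print_changes`) are made up for illustration, and the numbers are copied from the output quoted above. By this arithmetic cycles and task-clock drop by roughly 2.8%, while instructions and branches drop by only a fraction of a percent and the text segment grows by 344 bytes.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after` in percent; negative means a decrease. */
static double sketch_pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* One counter measured on both builds. */
struct sketch_counter {
    const char *name;
    double develop; /* develop e4fb380 */
    double split;   /* split chunkcopy_safe */
};

/* Values copied from the two runs quoted above. */
static const struct sketch_counter sketch_counters[] = {
    { "text (bytes)",    135622.0,      135966.0 },
    { "task-clock (ms)", 5897.17,       5733.90 },
    { "cycles",          20485703570.0, 19908943906.0 },
    { "instructions",    35731150672.0, 35633452503.0 },
    { "branches",        3758779847.0,  3749361374.0 },
    { "branch-misses",   182334345.0,   181427372.0 },
};

/* Prints one line per counter with its signed percentage change. */
static void sketch_print_changes(void) {
    size_t n = sizeof(sketch_counters) / sizeof(sketch_counters[0]);
    for (size_t i = 0; i < n; i++) {
        const struct sketch_counter *c = &sketch_counters[i];
        printf("%-16s %+.2f%%\n", c->name, sketch_pct_change(c->develop, c->split));
    }
}
```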

@codecov

codecov bot commented Sep 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.02%. Comparing base (e4fb380) to head (256b0c7).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1776      +/-   ##
===========================================
- Coverage    83.33%   83.02%   -0.31%     
===========================================
  Files          132      135       +3     
  Lines        10018    10326     +308     
  Branches      2687     2796     +109     
===========================================
+ Hits          8348     8573     +225     
- Misses        1009     1054      +45     
- Partials       661      699      +38     


@Dead2
Member Author

Dead2 commented Sep 11, 2024

x86-64 develop

 Tool: minideflate Levels: 1-9
 Runs: 40         Trim worst: 25

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.184/1.215/1.226/0.012        0.448/0.458/0.461/0.004       94,127,485
 2     35.519%      2.248/2.264/2.276/0.009        0.445/0.456/0.462/0.005       75,286,310
 3     33.844%      2.829/2.850/2.864/0.012        0.425/0.433/0.438/0.003       71,735,206
 4     33.146%      3.142/3.176/3.203/0.019        0.421/0.425/0.428/0.002       70,255,211
 5     32.642%      3.475/3.532/3.570/0.028        0.410/0.417/0.422/0.003       69,187,407
 6     32.483%      4.041/4.087/4.125/0.027        0.406/0.416/0.422/0.005       68,850,764
 7     32.255%      5.977/6.004/6.038/0.018        0.409/0.415/0.421/0.003       68,366,747
 8     32.167%      8.667/8.738/8.774/0.033        0.408/0.414/0.420/0.004       68,180,750
 9     31.887%   12.043/12.084/12.119/0.025        0.406/0.416/0.422/0.005       67,586,430

 avg1  34.261%                        4.883                          0.428
 tot                                659.244                         57.760      653,576,310

   text    data     bss     dec     hex filename
 135622    1312       8  136942   216ee libz-ng.so.2

x86-64 split_chunkcopy

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.203/1.221/1.229/0.008        0.415/0.432/0.438/0.006       94,127,485
 2     35.519%      2.253/2.272/2.283/0.009        0.427/0.435/0.439/0.004       75,286,310
 3     33.844%      2.830/2.847/2.857/0.010        0.404/0.410/0.415/0.003       71,735,206
 4     33.146%      3.147/3.185/3.203/0.016        0.391/0.399/0.405/0.004       70,255,211
 5     32.642%      3.508/3.551/3.583/0.024        0.383/0.394/0.402/0.006       69,187,407
 6     32.483%      4.025/4.105/4.132/0.025        0.384/0.394/0.398/0.004       68,850,764
 7     32.255%      5.981/6.024/6.058/0.023        0.387/0.394/0.397/0.003       68,366,747
 8     32.167%      8.673/8.730/8.764/0.028        0.387/0.392/0.396/0.003       68,180,750
 9     31.887%   12.039/12.099/12.165/0.036        0.377/0.389/0.395/0.005       67,586,430

 avg1  34.261%                        4.892                          0.404
 tot                                660.487                         54.599      653,576,310

   text    data     bss     dec     hex filename
 135966    1312       8  137286   21846 libz-ng.so.2

rpi3 develop

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.011/0.016/0.004        0.016/0.026/0.032/0.004       15,737,543
 1     54.185%      0.464/0.504/0.521/0.015        0.120/0.135/0.141/0.005        8,526,745
 2     43.871%      0.845/0.864/0.874/0.008        0.119/0.141/0.149/0.008        6,903,702
 3     42.390%      1.175/1.190/1.199/0.007        0.119/0.137/0.147/0.007        6,670,664
 4     41.644%      1.285/1.320/1.332/0.010        0.116/0.133/0.140/0.006        6,553,205
 5     41.215%      1.405/1.424/1.433/0.007        0.116/0.133/0.140/0.006        6,485,659
 6     41.032%      1.647/1.659/1.667/0.006        0.115/0.135/0.142/0.006        6,456,912
 7     40.778%      2.082/2.099/2.107/0.007        0.112/0.130/0.137/0.005        6,416,941
 8     40.704%      2.555/2.577/2.587/0.009        0.112/0.129/0.136/0.006        6,405,249
 9     40.409%      3.118/3.135/3.143/0.007        0.113/0.129/0.138/0.007        6,358,951

 avg1  48.624%                        1.478                          0.123
 avg2  54.026%                        1.643                          0.137
 tot                                591.379                         49.145       76,515,571

   text    data     bss     dec     hex filename
 115488    1504       8  117000   1c908 libz-ng.so.2

rpi3 split_chunkcopy

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.004/0.013/0.016/0.003        0.012/0.025/0.032/0.004       15,737,543
 1     54.185%      0.470/0.505/0.522/0.013        0.099/0.136/0.143/0.007        8,526,745
 2     43.871%      0.849/0.865/0.873/0.006        0.115/0.138/0.149/0.008        6,903,702
 3     42.390%      1.167/1.186/1.196/0.008        0.097/0.130/0.145/0.012        6,670,664
 4     41.644%      1.300/1.321/1.331/0.007        0.112/0.130/0.138/0.006        6,553,205
 5     41.215%      1.405/1.426/1.436/0.008        0.118/0.133/0.141/0.006        6,485,659
 6     41.032%      1.636/1.655/1.663/0.006        0.111/0.130/0.137/0.007        6,456,912
 7     40.778%      2.069/2.095/2.107/0.009        0.116/0.130/0.138/0.006        6,416,941
 8     40.704%      2.558/2.572/2.581/0.006        0.116/0.130/0.136/0.005        6,405,249
 9     40.409%      3.107/3.130/3.139/0.009        0.115/0.132/0.138/0.005        6,358,951

 avg1  48.624%                        1.477                          0.121
 avg2  54.026%                        1.641                          0.135
 tot                                590.748                         48.542       76,515,571

   text    data     bss     dec     hex filename
 115936    1504       8  117448   1cac8 libz-ng.so.2

rpi5a develop

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.001/0.004/0.002        0.000/0.002/0.004/0.002       15,737,543
 1     54.185%      0.166/0.177/0.181/0.004        0.055/0.067/0.072/0.005        8,526,745
 2     43.871%      0.298/0.304/0.308/0.003        0.045/0.063/0.070/0.006        6,903,702
 3     42.390%      0.361/0.372/0.375/0.003        0.048/0.062/0.068/0.005        6,670,664
 4     41.644%      0.410/0.419/0.423/0.003        0.050/0.060/0.063/0.004        6,553,205
 5     41.215%      0.448/0.462/0.466/0.004        0.053/0.059/0.062/0.003        6,485,659
 6     41.032%      0.524/0.539/0.544/0.005        0.048/0.059/0.065/0.004        6,456,912
 7     40.778%      0.705/0.717/0.721/0.004        0.053/0.061/0.065/0.003        6,416,941
 8     40.704%      0.905/0.912/0.916/0.003        0.050/0.060/0.065/0.004        6,405,249
 9     40.409%      1.089/1.099/1.103/0.003        0.048/0.058/0.064/0.004        6,358,951

 avg1  48.624%                        0.500                          0.055
 avg2  54.026%                        0.556                          0.061
 tot                                200.061                         22.092       76,515,571
 
   text    data     bss     dec     hex filename
 115488    1504       8  117000   1c908 libz-ng.so.2

rpi5a split_chunkcopy

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.001/0.004/0.002        0.000/0.001/0.004/0.002       15,737,543
 1     54.185%      0.168/0.177/0.181/0.004        0.054/0.065/0.070/0.004        8,526,745
 2     43.871%      0.299/0.306/0.310/0.003        0.051/0.063/0.069/0.004        6,903,702
 3     42.390%      0.358/0.370/0.374/0.004        0.052/0.062/0.067/0.004        6,670,664
 4     41.644%      0.406/0.419/0.424/0.004        0.049/0.060/0.065/0.004        6,553,205
 5     41.215%      0.453/0.462/0.468/0.004        0.050/0.061/0.065/0.004        6,485,659
 6     41.032%      0.529/0.540/0.545/0.004        0.044/0.059/0.064/0.005        6,456,912
 7     40.778%      0.704/0.715/0.719/0.004        0.048/0.058/0.064/0.004        6,416,941
 8     40.704%      0.908/0.914/0.918/0.003        0.048/0.059/0.064/0.004        6,405,249
 9     40.409%      1.086/1.097/1.101/0.004        0.044/0.056/0.060/0.004        6,358,951

 avg1  48.624%                        0.500                          0.054
 avg2  54.026%                        0.556                          0.060
 tot                                200.041                         21.710       76,515,571

   text    data     bss     dec     hex filename
 115936    1504       8  117448   1cac8 libz-ng.so.2

rpi5b develop

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.001/0.004/0.001        0.000/0.000/0.004/0.001       15,737,543
 1     54.185%      0.165/0.176/0.181/0.004        0.054/0.067/0.069/0.003        8,526,745
 2     43.871%      0.295/0.305/0.310/0.004        0.056/0.065/0.072/0.004        6,903,702
 3     42.390%      0.358/0.372/0.377/0.004        0.051/0.063/0.066/0.004        6,670,664
 4     41.644%      0.409/0.419/0.423/0.004        0.052/0.062/0.064/0.003        6,553,205
 5     41.215%      0.452/0.461/0.465/0.003        0.048/0.060/0.064/0.004        6,485,659
 6     41.032%      0.531/0.541/0.544/0.003        0.054/0.061/0.063/0.003        6,456,912
 7     40.778%      0.707/0.716/0.720/0.003        0.054/0.060/0.063/0.003        6,416,941
 8     40.704%      0.901/0.913/0.917/0.003        0.047/0.061/0.067/0.005        6,405,249
 9     40.409%      1.090/1.099/1.104/0.004        0.049/0.059/0.065/0.004        6,358,951

 avg1  48.624%                        0.500                          0.056
 avg2  54.026%                        0.556                          0.062
 tot                                200.128                         22.272       76,515,571

   text    data     bss     dec     hex filename
 115488    1504       8  117000   1c908 libz-ng.so.2

rpi5b split_chunkcopy

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.000/0.000/0.004/0.001        0.000/0.001/0.004/0.001       15,737,543
 1     54.185%      0.168/0.176/0.180/0.003        0.046/0.064/0.069/0.005        8,526,745
 2     43.871%      0.299/0.306/0.310/0.003        0.056/0.066/0.072/0.004        6,903,702
 3     42.390%      0.362/0.371/0.374/0.003        0.053/0.063/0.066/0.003        6,670,664
 4     41.644%      0.405/0.420/0.423/0.004        0.056/0.062/0.067/0.003        6,553,205
 5     41.215%      0.448/0.462/0.465/0.004        0.052/0.060/0.063/0.003        6,485,659
 6     41.032%      0.532/0.542/0.546/0.003        0.050/0.061/0.066/0.004        6,456,912
 7     40.778%      0.704/0.715/0.719/0.004        0.046/0.059/0.063/0.005        6,416,941
 8     40.704%      0.903/0.912/0.916/0.004        0.041/0.059/0.063/0.005        6,405,249
 9     40.409%      1.086/1.097/1.101/0.003        0.049/0.058/0.062/0.003        6,358,951

 avg1  48.624%                        0.500                          0.055
 avg2  54.026%                        0.556                          0.061
 tot                                200.003                         22.089       76,515,571

   text    data     bss     dec     hex filename
 115936    1504       8  117448   1cac8 libz-ng.so.2

@KungFuJesus
Collaborator

KungFuJesus commented Sep 11, 2024

Awesome. Somewhat strange that it didn't inline the whole function body, though. I wrote that copy ladder with the express intent of it doing 32-byte-wide copies with a single move instruction under AVX2. We are probably losing that, given that we don't compile inflate.c with -mavx2, whereas inflate_fast pulls the function in through a header and so gets compiled once for each implementation. Then again, 128-bit-wide operations tend to run at a higher clock frequency on Intel.
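
For readers unfamiliar with the term, a "copy ladder" can be sketched as below. This is an illustrative reconstruction, not the ladder in zlib-ng: `sketch_copy_ladder` is a hypothetical name and the sketch assumes the two buffers do not overlap. Every `memcpy` has a compile-time constant size, which compilers typically lower to plain load/store pairs; with AVX2 enabled the 32-byte step can become a single 256-bit move, and without it the same step usually becomes two 128-bit moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies `len` bytes between NON-overlapping buffers using only
 * constant-size memcpy calls, widest step first. */
static inline uint8_t *sketch_copy_ladder(uint8_t *out, const uint8_t *from, size_t len) {
    assert(out != NULL && from != NULL);

    /* Widest rung: repeat while at least 32 bytes remain. */
    while (len >= 32) {
        memcpy(out, from, 32);
        out += 32; from += 32; len -= 32;
    }
    /* Fewer than 32 bytes are left, so each narrower rung runs at most once. */
    if (len & 16) { memcpy(out, from, 16); out += 16; from += 16; }
    if (len & 8)  { memcpy(out, from, 8);  out += 8;  from += 8; }
    if (len & 4)  { memcpy(out, from, 4);  out += 4;  from += 4; }
    if (len & 2)  { memcpy(out, from, 2);  out += 2;  from += 2; }
    if (len & 1)  { *out++ = *from; }
    return out;
}
```

Whether the 32-byte rung really becomes one instruction depends on the target flags the translation unit is built with, which is the concern raised in the comment above.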

@Dead2
Member Author

Dead2 commented Sep 12, 2024

> Awesome. Somewhat strange that it didn't inline the whole function body, though.

I went back and verified, and it actually inlined successfully in inflate, but not in inffast.

Output with -Winline

In file included from /inffast_tpl.h:11,
                 from /arch/generic/chunkset_c.c:42:
/inflate_p.h: In function ‘inflate_fast_c’:
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:241:35: note: called from here
  241 |                             out = chunkcopy_safe(out, from, op, safe);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:253:31: note: called from here
  253 |                         out = chunkcopy_safe(out, out - dist, len, safe);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:251:31: note: called from here
  251 |                         out = chunkcopy_safe(out, from, op, safe);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:253:31: note: called from here
  253 |                         out = chunkcopy_safe(out, out - dist, len, safe);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:255:31: note: called from here
  255 |                         out = chunkcopy_safe(out, from, len, safe);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/inflate_p.h:172:24: warning: inlining failed in call to ‘chunkcopy_safe’: --param max-inline-insns-single limit reached [-Winline]
  172 | static inline uint8_t* chunkcopy_safe(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
      |                        ^~~~~~~~~~~~~~
/inffast_tpl.h:260:31: note: called from here
  260 |                         out = chunkcopy_safe(out, out - dist, len, safe);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
max-inline-insns-single
Several parameters control the tree inliner used in gcc.
This number sets the maximum number of instructions (counted in gcc's internal representation) in a
single function that the tree inliner will consider for inlining. This only affects functions declared inline
and methods implemented in a class declaration (C++)

We could increase that value, but I don't think that is a good idea.
First, it would be an ugly band-aid. Second, it would be hard to do this for all compilers while keeping track of how each compiler version changes that value. Third, it could negatively affect other parts of the code as well.


As a result of this, I also went back and experimented with adding two versions of the function: the full one for inflate and the 2-part one for inffast. This shows a small reduction in instructions and branches (so in theory it is better) but a slight increase in CPU time taken. I think that might be caused by a combination of less cache reuse and the increased library size.

New benchmarks:

   text    data     bss     dec     hex filename
 136286    1312       8  137606   21986 libz-ng.so.2

[hansr@hk zlib-ng]$ perf stat -r 10 -- build/minigzip -c -d -k ../cov-analysis-linux64-2023.6.2.tar.gz > /dev/null

 Performance counter stats for 'build/minigzip -c -d -k ../cov-analysis-linux64-2023.6.2.tar.gz' (10 runs):

          5,738.33 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.920 /sec                        ( +-  0.36% )
    19,918,944,011      cycles:u                         #    3.471 GHz                         ( +-  0.01% )
    35,635,353,156      instructions:u                   #    1.79  insn per cycle              ( +-  0.00% )
     3,749,340,406      branches:u                       #  653.385 M/sec                       ( +-  0.00% )
       181,085,523      branch-misses:u                  #    4.83% of all branches             ( +-  0.02% )

          5.738717 +- 0.000656 seconds time elapsed  ( +-  0.01% )

# Second run for verification:
          5,738.58 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               144      page-faults:u                    #   25.093 /sec                        ( +-  0.27% )
    19,924,395,312      cycles:u                         #    3.472 GHz                         ( +-  0.02% )
    35,635,353,151      instructions:u                   #    1.79  insn per cycle              ( +-  0.00% )
     3,749,340,402      branches:u                       #  653.357 M/sec                       ( +-  0.00% )
       181,036,154      branch-misses:u                  #    4.83% of all branches             ( +-  0.02% )

          5.738972 +- 0.000647 seconds time elapsed  ( +-  0.01% )

This PR for comparison:

   text    data     bss     dec     hex filename
 135966    1312       8  137286   21846 libz-ng.so.2
          5,733.90 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.939 /sec                        ( +-  0.30% )
    19,908,943,906      cycles:u                         #    3.472 GHz                         ( +-  0.01% )
    35,633,452,503      instructions:u                   #    1.79  insn per cycle              ( +-  0.00% )
     3,749,361,374      branches:u                       #  653.894 M/sec                       ( +-  0.00% )
       181,427,372      branch-misses:u                  #    4.84% of all branches             ( +-  0.04% )

          5.734291 +- 0.000584 seconds time elapsed  ( +-  0.01% )

@Dead2
Member Author

Dead2 commented Sep 12, 2024

For completeness, I did increase max-inline-insns-single from 300 to 500 in order to force the inlining into inffast, and this was the result.

   text    data     bss     dec     hex filename
 142878    1312       8  144198   23346 libz-ng.so.2

 Performance counter stats for 'build/minigzip -c -d -k ../cov-analysis-linux64-2023.6.2.tar.gz' (10 runs):

          5,941.61 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.068 /sec                        ( +-  0.44% )
    20,654,370,492      cycles:u                         #    3.476 GHz                         ( +-  0.02% )
    35,812,270,619      instructions:u                   #    1.73  insn per cycle              ( +-  0.00% )
     3,748,346,787      branches:u                       #  630.863 M/sec                       ( +-  0.00% )
       181,576,942      branch-misses:u                  #    4.84% of all branches             ( +-  0.01% )

          5.942097 +- 0.000782 seconds time elapsed  ( +-  0.01% )

This is worse than develop currently is in this test.
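
The four builds measured in this thread can be lined up with a small C table. This is a sketch only: the struct and function names are hypothetical, the labels are shorthand, and the figures are the text sizes and task-clock values copied from the outputs above (first run for the two-version experiment).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One measured build of libz-ng.so.2. */
struct sketch_build {
    const char *name;
    long text_bytes;      /* "text" column of the size output */
    double task_clock_ms; /* task-clock:u from perf stat */
};

/* Figures copied from the results quoted in this conversation. */
static const struct sketch_build sketch_builds[] = {
    { "develop e4fb380",                135622, 5897.17 },
    { "split chunkcopy_safe (this PR)", 135966, 5733.90 },
    { "two versions of the function",   136286, 5738.33 },
    { "max-inline-insns-single=500",    142878, 5941.61 },
};

#define SKETCH_NBUILDS (sizeof(sketch_builds) / sizeof(sketch_builds[0]))

/* Returns the index of the build with the lowest task-clock. */
static size_t sketch_fastest(void) {
    size_t best = 0;
    for (size_t i = 1; i < SKETCH_NBUILDS; i++)
        if (sketch_builds[i].task_clock_ms < sketch_builds[best].task_clock_ms)
            best = i;
    return best;
}

/* Prints each build relative to the first entry (develop). */
static void sketch_print_builds(void) {
    const struct sketch_build *base = &sketch_builds[0];
    assert(base->task_clock_ms > 0.0);
    for (size_t i = 0; i < SKETCH_NBUILDS; i++) {
        const struct sketch_build *b = &sketch_builds[i];
        printf("%-32s text %+6ld bytes, task-clock %+.2f%%\n", b->name,
               b->text_bytes - base->text_bytes,
               (b->task_clock_ms - base->task_clock_ms) / base->task_clock_ms * 100.0);
    }
}
```

On these numbers the split in this PR has the lowest task-clock, and the forced-inlining build is both the largest and the slowest of the four.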

@KungFuJesus
Collaborator

The smaller code size is probably going to be the overall benefit. You could still try declaring the function in a header so that the copy loop is called as a separate function but still compiled with -mavx2, maybe? Though the compiler may try inlining that call, and you end up with the same result.

@KungFuJesus
Collaborator

I have a different change around some function inlining in the works that may change the calculus on some of this. I'll try to get something staged later tonight. It ends up inlining more, but it's about a 3.5% improvement in my png decoding benchmarks overall. I have one more change I want to add on top that might improve things further.

@Dead2
Member Author

Dead2 commented Sep 12, 2024

> The smaller code size is probably going to be the overall benefit. You could still try declaring the function in a header so that the copy loop is called as a separate function but still compiled with -mavx2, maybe? Though the compiler may try inlining that call, and you end up with the same result.

Already tried that. It inlines the call, and the result ends up exactly the same as the original, unfortunately.

> I have a different change around some function inlining in the works that may change the calculus on some of this. I'll try to get something staged later tonight. It ends up inlining more, but it's about a 3.5% improvement in my png decoding benchmarks overall. I have one more change I want to add on top that might improve things further.

I am excited to test this 👍

@Dead2 Dead2 merged commit 6b8efe7 into develop Sep 13, 2024
@Dead2 Dead2 deleted the split_chunkcopy_safe branch September 13, 2024 10:48
This was referenced Sep 14, 2024