Skip to content

use memcpy/memcmp for unaligned reads#934

Closed
matijaskala wants to merge 1 commit intozlib-ng:developfrom
matijaskala:develop
Closed

use memcpy/memcmp for unaligned reads#934
matijaskala wants to merge 1 commit intozlib-ng:developfrom
matijaskala:develop

Conversation

@matijaskala
Copy link
Copy Markdown
Contributor

that way any future optimizers won't make false
assumptions that those reads are actually aligned

compiler optimizes out memcpy/memcmp calls so
generated assembly code remains identical

Comment thread deflate.h
*(uint32_t *)(&s->pending_buf[s->pending]) = (lld >> 32) & 0xffffffff;
s->pending += 4;
memcpy(&s->pending_buf[s->pending], lld, 8);
s->pending += 8;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't make sense... First two cases would need to be combined for little-endian systems.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been resolved? Seems like if the compiler doesn't support 64 bit unaligned it can split it up into 2 x 32-bit reads or perhaps just do memcpy.

@matijaskala matijaskala requested a review from mtl1979 April 24, 2021 01:18
Comment thread compare258.c Outdated
uint32_t sv = *(uint32_t *)src0;
uint32_t mv = *(uint32_t *)src1;
uint32_t sv;
uint32_t mv;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As both variables have same type and they are not assigned a value that depend on each other, they can be declared on same line. Only when the variables are members of a struct, they should be declared on separate lines.

Comment thread compare258.c Outdated
uint64_t sv = *(uint64_t *)src0;
uint64_t mv = *(uint64_t *)src1;
uint64_t sv;
uint64_t mv;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As both variables have same type and they are not assigned a value that depend on each other, they can be declared on same line.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was resolved?

Comment thread deflate.h
put_byte(s, ((dw >> 16) & 0xff));
put_byte(s, ((dw >> 8) & 0xff));
put_byte(s, (dw & 0xff));
# error No endian defined
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be No endianess defined.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the text from functable.c

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed all files... There is three long-time contributors and only two need to review a pull request before it gets merged.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 26, 2021

Codecov Report

Merging #934 (2f1f56d) into develop (92107a9) will decrease coverage by 1.48%.
The diff coverage is 68.96%.

❗ Current head 2f1f56d differs from pull request most recent head af7c8ad. Consider uploading reports for the commit af7c8ad to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #934      +/-   ##
===========================================
- Coverage    77.95%   76.47%   -1.49%     
===========================================
  Files           85       78       -7     
  Lines         8700     8250     -450     
  Branches      1396     1354      -42     
===========================================
- Hits          6782     6309     -473     
- Misses        1369     1409      +40     
+ Partials       549      532      -17     
Flag Coverage Δ
macos_clang 68.26% <8.00%> (-0.80%) ⬇️
macos_gcc 67.18% <0.00%> (-0.62%) ⬇️
ubuntu_clang 69.30% <40.00%> (+1.16%) ⬆️
ubuntu_clang_debug 68.71% <40.00%> (+1.26%) ⬆️
ubuntu_clang_inflate_allow_invalid_dist 69.05% <40.00%> (+1.13%) ⬆️
ubuntu_clang_inflate_strict 69.30% <40.00%> (+1.16%) ⬆️
ubuntu_clang_mmap 69.30% <40.00%> (+1.15%) ⬆️
ubuntu_clang_msan 69.30% <40.00%> (+1.16%) ⬆️
ubuntu_clang_pigz 35.33% <52.00%> (+3.42%) ⬆️
ubuntu_clang_pigz_no_optim 37.79% <56.00%> (-0.25%) ⬇️
ubuntu_clang_pigz_no_threads 34.98% <52.00%> (+3.42%) ⬆️
ubuntu_clang_reduced_mem ?
ubuntu_gcc 68.58% <44.44%> (+1.41%) ⬆️
ubuntu_gcc_aarch64 68.78% <10.00%> (-1.63%) ⬇️
ubuntu_gcc_aarch64_compat_no_opt 67.01% <0.00%> (-1.78%) ⬇️
ubuntu_gcc_aarch64_no_acle 67.61% <0.00%> (-1.86%) ⬇️
ubuntu_gcc_aarch64_no_neon 67.96% <0.00%> (-1.73%) ⬇️
ubuntu_gcc_armhf 68.76% <18.18%> (-1.63%) ⬇️
ubuntu_gcc_armhf_compat_no_opt 66.98% <0.00%> (-1.79%) ⬇️
ubuntu_gcc_armhf_no_acle 68.93% <18.18%> (-1.70%) ⬇️
ubuntu_gcc_armhf_no_neon 69.16% <18.18%> (-1.66%) ⬇️
ubuntu_gcc_armsf 68.76% <16.66%> (-1.64%) ⬇️
ubuntu_gcc_armsf_compat_no_opt 66.98% <0.00%> (-1.79%) ⬇️
ubuntu_gcc_compat_no_opt 68.42% <0.00%> (-1.73%) ⬇️
ubuntu_gcc_mingw_i686 0.00% <0.00%> (ø)
ubuntu_gcc_mingw_x86_64 0.00% <0.00%> (ø)
ubuntu_gcc_no_avx2 68.64% <0.00%> (-1.75%) ⬇️
ubuntu_gcc_no_ctz ?
ubuntu_gcc_no_ctzll ?
ubuntu_gcc_no_pclmulqdq 66.59% <0.00%> (+1.25%) ⬆️
ubuntu_gcc_no_sse2 67.82% <0.00%> (-0.11%) ⬇️
ubuntu_gcc_no_sse4 66.79% <0.00%> (-0.51%) ⬇️
ubuntu_gcc_o3 68.17% <0.00%> (+0.36%) ⬆️
ubuntu_gcc_osb 70.88% <0.00%> (+1.47%) ⬆️
ubuntu_gcc_pigz 35.17% <58.82%> (+3.20%) ⬆️
ubuntu_gcc_pigz_aarch64 ?
ubuntu_gcc_ppc 68.86% <100.00%> (+0.52%) ⬆️
ubuntu_gcc_ppc64 69.73% <100.00%> (-1.83%) ⬇️
ubuntu_gcc_ppc64le 68.43% <10.00%> (-1.52%) ⬇️
ubuntu_gcc_ppc_no_power8 ?
ubuntu_gcc_s390x 67.75% <100.00%> (-2.44%) ⬇️
ubuntu_gcc_sparc64 70.02% <100.00%> (-2.00%) ⬇️
win64_gcc 70.09% <13.04%> (∅)
win64_gcc_compat_no_opt 73.50% <25.00%> (∅)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
chunkset.c 100.00% <ø> (+40.90%) ⬆️
compare258.c 54.63% <30.76%> (-34.03%) ⬇️
deflate.h 39.13% <100.00%> (+5.79%) ⬆️
insert_string_tpl.h 100.00% <100.00%> (ø)
match_tpl.h 100.00% <100.00%> (+24.61%) ⬆️
zutil_p.h 40.00% <0.00%> (-31.43%) ⬇️
inftrees.c 82.73% <0.00%> (-15.11%) ⬇️
arch/s390/dfltcc_detail.h 0.00% <0.00%> (-12.50%) ⬇️
deflate_stored.c 86.20% <0.00%> (-9.20%) ⬇️
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92107a9...af7c8ad. Read the comment docs.

@Dead2 Dead2 added the Next Devel Targeting next devel release (does not mean accepted) label Apr 27, 2021
@nmoinvaz
Copy link
Copy Markdown
Member

nmoinvaz commented Jun 12, 2021

I think this PR is good. I'm not sure what is left to do on it. Benchmarks? It allows the compiler to determine if unaligned read/writes are allowed by using memcpy/memcmp. Also gets rid of my convolution I had in chunkset.c.

@Dead2 Dead2 requested a review from mtl1979 June 13, 2021 08:07
@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jun 13, 2021

It looked to me like @mtl1979 had requested changes, but perhaps he can clarify.

Comment thread insert_string_tpl.h Outdated
@@ -33,7 +33,7 @@ Z_INTERNAL Pos QUICK_INSERT_STRING(deflate_state *const s, const uint32_t str) {
uint32_t val, hm, h = 0;

#ifdef UNALIGNED_OK
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking if UNALIGNED_OK is defined, we would need to check if system is little-endian or big-endian.

Comment thread insert_string_tpl.h Outdated
@@ -68,7 +68,7 @@ Z_INTERNAL void INSERT_STRING(deflate_state *const s, const uint32_t str, uint32
uint32_t val, hm, h = 0;

#ifdef UNALIGNED_OK
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking if UNALIGNED_OK is defined, we would need to check if system is little-endian or big-endian.

Copy link
Copy Markdown
Collaborator

@mtl1979 mtl1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems work-in-progress. Needs some logic changes.

@nmoinvaz
Copy link
Copy Markdown
Member

This PR seems ok to me. Just should probably be benchmarked so to make sure no degradation.

@nmoinvaz
Copy link
Copy Markdown
Member

nmoinvaz commented Jun 21, 2021

Using -DWITH_OPTIM=ON

ZLIB-NG 834e7d8

 Tool: minigzip-ng-develop.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.015/1.022/1.030/0.005        0.660/0.670/0.674/0.003      104,067,822
 2     35.391%      1.678/1.700/1.710/0.009        0.673/0.686/0.692/0.005       82,195,416
 3     33.698%      2.320/2.338/2.348/0.009        0.655/0.661/0.666/0.003       78,262,436
 4     32.989%      2.618/2.639/2.649/0.007        0.637/0.648/0.654/0.005       76,616,183
 5     32.483%      2.781/2.803/2.815/0.009        0.638/0.644/0.649/0.004       75,441,984
 6     32.318%      3.392/3.414/3.425/0.009        0.632/0.640/0.645/0.004       75,058,801
 7     32.061%      4.654/4.675/4.688/0.010        0.633/0.644/0.650/0.005       74,461,300
 8     31.978%      7.149/7.192/7.212/0.014        0.635/0.643/0.650/0.005       74,269,032
 9     31.961%      9.729/9.761/9.784/0.016        0.638/0.645/0.650/0.003       74,230,342

 avg1  34.188%                        3.949                          0.654
 tot                               1066.332                        176.447      714,603,316

ZLIB-NG PR 934 2f1f56d

 Tool: minigzip-ng-pr934.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.018/1.031/1.039/0.006        0.665/0.688/0.739/0.026      104,067,822
 2     35.391%      1.685/1.706/1.721/0.010        0.676/0.688/0.694/0.005       82,195,416
 3     33.698%      2.329/2.375/2.436/0.032        0.652/0.666/0.676/0.007       78,262,436
 4     32.989%      2.628/2.661/2.679/0.014        0.639/0.652/0.658/0.006       76,616,183
 5     32.483%      2.781/2.840/2.915/0.042        0.635/0.647/0.653/0.005       75,441,984
 6     32.318%      3.410/3.450/3.532/0.039        0.632/0.643/0.650/0.005       75,058,801
 7     32.061%      4.684/4.743/4.833/0.055        0.630/0.647/0.654/0.006       74,461,300
 8     31.978%      7.186/7.265/7.379/0.068        0.637/0.646/0.652/0.004       74,269,032
 9     31.961%      9.765/9.845/9.977/0.079        0.634/0.648/0.655/0.005       74,230,342

 avg1  34.188%                        3.991                          0.659
 tot                               1077.460                        177.822      714,603,316

@nmoinvaz
Copy link
Copy Markdown
Member

Second run using -DWITH_OPTIM=ON:

ZLIB-NG 834e7d8

 Tool: minigzip-ng-develop.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.016/1.025/1.032/0.005        0.662/0.669/0.674/0.004      104,067,822
 2     35.391%      1.679/1.696/1.704/0.007        0.675/0.683/0.688/0.003       82,195,416
 3     33.698%      2.313/2.338/2.348/0.009        0.647/0.660/0.666/0.004       78,262,436
 4     32.989%      2.618/2.638/2.647/0.008        0.641/0.648/0.654/0.004       76,616,183
 5     32.483%      2.789/2.801/2.811/0.006        0.637/0.643/0.648/0.004       75,441,984
 6     32.318%      3.388/3.407/3.420/0.009        0.633/0.641/0.647/0.004       75,058,801
 7     32.061%      4.643/4.668/4.682/0.011        0.636/0.645/0.650/0.004       74,461,300
 8     31.978%      7.152/7.180/7.195/0.011        0.630/0.643/0.649/0.005       74,269,032
 9     31.961%      9.719/9.752/9.764/0.011        0.637/0.646/0.652/0.004       74,230,342

 avg1  34.188%                        3.945                          0.653
 tot                               1065.164                        176.269      714,603,316

ZLIB-NG 2f1f56d

 Tool: minigzip-ng-pr934.exe Levels: 1-9
 Runs: 70         Trim worst: 40

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.809%      1.013/1.022/1.027/0.004        0.658/0.670/0.677/0.005      104,067,822
 2     35.391%      1.682/1.699/1.708/0.006        0.672/0.683/0.687/0.004       82,195,416
 3     33.698%      2.328/2.344/2.353/0.006        0.650/0.661/0.667/0.004       78,262,436
 4     32.989%      2.627/2.645/2.655/0.009        0.639/0.649/0.653/0.003       76,616,183
 5     32.483%      2.792/2.808/2.815/0.006        0.635/0.643/0.647/0.004       75,441,984
 6     32.318%      3.396/3.415/3.424/0.008        0.630/0.641/0.646/0.004       75,058,801
 7     32.061%      4.667/4.694/4.704/0.008        0.633/0.642/0.648/0.004       74,461,300
 8     31.978%      7.185/7.207/7.217/0.008        0.634/0.643/0.649/0.004       74,269,032
 9     31.961%      9.752/9.777/9.792/0.011        0.637/0.643/0.649/0.003       74,230,342

 avg1  34.188%                        3.957                          0.653
 tot                               1068.295                        176.240      714,603,316

@Dead2 Dead2 added the Rebase needed Please do a 'git rebase develop yourbranch' label Jun 25, 2021
@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jun 25, 2021

This needs a little rebase.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jun 25, 2021

Baseline 21050e0 x86-64

   text    data     bss     dec     hex filename
 115241    1416      32  116689   1c7d1 libz-ng.so.2

Activated multiple generated file mode. Source: ../silesia-small.tar
Level 1: /tmp/deflatebench-1.tmp  465.2 MiB   487,825,920 B
Level 2: /tmp/deflatebench-2.tmp  270.1 MiB   283,253,760 B
Level 3: /tmp/deflatebench-3.tmp  225.1 MiB   236,044,800 B
Level 4: /tmp/deflatebench-4.tmp  180.1 MiB   188,835,840 B
Level 5: /tmp/deflatebench-5.tmp  165.1 MiB   173,099,520 B
Level 6: /tmp/deflatebench-6.tmp  135.1 MiB   141,626,880 B
Level 7: /tmp/deflatebench-7.tmp  105.1 MiB   110,154,240 B
Level 8: /tmp/deflatebench-8.tmp   90.0 MiB    94,417,920 B
Level 9: /tmp/deflatebench-9.tmp   75.0 MiB    78,681,600 B

 Tool: minigzip   Levels: 1-9
 Runs: 40         Trim worst: 20

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.165%      2.892/2.928/2.946/0.017        1.661/1.669/1.678/0.005      264,231,515
 2     43.875%      2.974/3.002/3.015/0.010        0.984/1.002/1.017/0.010      124,277,644
 3     42.382%      3.003/3.018/3.032/0.009        0.786/0.812/0.825/0.011      100,039,412
 4     41.669%      2.718/2.737/2.743/0.007        0.622/0.634/0.645/0.006       78,685,354
 5     41.215%      2.868/2.911/2.926/0.016        0.568/0.578/0.585/0.006       71,342,944
 6     41.036%      2.763/2.776/2.785/0.007        0.441/0.469/0.478/0.009       58,118,026
 7     40.781%      2.717/2.735/2.742/0.007        0.358/0.367/0.371/0.004       44,921,649
 8     40.702%      2.996/3.009/3.017/0.007        0.297/0.313/0.321/0.007       38,429,965
 9     40.698%      2.783/2.795/2.801/0.005        0.247/0.259/0.266/0.006       32,021,745

 avg1  42.947%                        2.879                          0.678
 tot                                518.191                        122.066      812,068,254

PR #934 2f1f56d x86-64

   text    data     bss     dec     hex filename
 116330    1424      32  117786   1cc1a libz-ng.so.2

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.165%      2.831/2.882/2.900/0.019        1.655/1.676/1.685/0.008      264,231,515
 2     43.875%      3.382/3.393/3.407/0.007        0.990/1.006/1.017/0.010      124,277,644
 3     42.382%      3.550/3.572/3.587/0.011        0.775/0.810/0.825/0.013      100,039,412
 4     41.669%      3.482/3.505/3.516/0.010        0.621/0.637/0.648/0.009       78,685,354
 5     41.215%      3.640/3.676/3.687/0.012        0.550/0.579/0.590/0.010       71,342,944
 6     41.036%      3.728/3.744/3.755/0.008        0.457/0.470/0.477/0.005       58,118,026
 7     40.781%      4.407/4.415/4.424/0.005        0.351/0.366/0.375/0.006       44,921,649
 8     40.702%      5.151/5.169/5.177/0.007        0.301/0.312/0.317/0.005       38,429,965
 9     40.698%      4.891/4.900/4.906/0.004        0.250/0.261/0.266/0.004       32,021,745

 avg1  42.947%                        3.918                          0.680
 tot                                705.152                        122.341      812,068,254

This PR unfortunately makes compression 36% slower than the baseline, and it adds quite a bit to the compiled code size (text).

This was tested with GCC 4.8.5 and Glibc 2.17

@mscastanho
Copy link
Copy Markdown
Contributor

This was tested with GCC 4.8.5 and Glibc 2.17

@Dead2 Did you have a chance to test this with a newer GCC+glibc? glibc 2.17 caught my attention, as it is ~8y old. The implementation of core string functions (like memcmp and memcpy) are frequently revisited on glibc, so I wouldn't be surprised if faster implementations are available on newer glibc versions.

Of course this wouldn't change the situation for people using zlib-ng with an older toolchain. But I just wanted to add more info to the discussion anyways =)

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Jun 28, 2021

We don't officially support anything that is 8 years old... Even Visual C++ 2013 is too old for us.. GCC versions from 4.8 and 4.9 had optimizer bugs that I have mentioned a long time ago, not to mention they don't support instruction sets introduced after their release.

On my development machine I have gcc 6.5 and clang 6.0 as oldest version and Visual C++ 2015 for Windows builds. For more recent compiler tests I use gcc 11, clang 12 and Visual C++ 2022.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jun 28, 2021

@mtl1979 I have officially stated multiple times that RHEL 7 and derived distros are to be supported as a lower-end. The above test was done on such a machine. The compiler not supporting newer features is no problem if the features are optional, but this PR would be much harder to make optional. MSVC is a (very) special case due to them not wanting to really support C until recently.

@mscastanho I suspect that is the problem (that is why I also posted those details), but I have not had the time to run any more tests yet.
But I suspect this could also cause problems for those using uclibc/musl or other clibs, and that makes it all that much harder.

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Jun 28, 2021

@Dead2 I don't care what is the oldest operating system we (as contributors or developers of zlib-ng) use or support...

Officially we support almost all 32-bit and 64-bit operating systems that can run either gcc, clang or Visual C++ compiler... Other compilers might work, but due to bugs or missing features all compilers that are more recent than 6 years or so might not be usable, for example early builds of Visual C++ 2017 and 2019 which were unusable due to critical bugs that were eventually fixed in later builds.

RHEL 7 has gcc 5-8 as optional packages.

I still run BeOS 5.0.3 and Windows Vista on my older computers, both are quite old operating systems.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jun 29, 2021

Baseline 21050e0 aarch64

   text    data     bss     dec     hex filename
 105234    1576      16  106826   1a14a libz-ng.so.2

Level 1: /tmp/deflatebench-1.tmp   75.0 MiB    78,681,600 B
Level 2: /tmp/deflatebench-2.tmp   45.0 MiB    47,208,960 B
Level 3: /tmp/deflatebench-3.tmp   45.0 MiB    47,208,960 B
Level 4: /tmp/deflatebench-4.tmp   30.0 MiB    31,472,640 B
Level 5: /tmp/deflatebench-5.tmp   30.0 MiB    31,472,640 B
Level 6: /tmp/deflatebench-6.tmp   30.0 MiB    31,472,640 B
Level 7: /tmp/deflatebench-7.tmp   15.0 MiB    15,736,320 B
Level 8: /tmp/deflatebench-8.tmp   15.0 MiB    15,736,320 B
Level 9: /tmp/deflatebench-9.tmp   15.0 MiB    15,736,320 B

 Tool: minigzip   Levels: 1-9
 Runs: 40         Trim worst: 20

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.166%      2.729/2.769/2.790/0.019        0.974/0.998/1.014/0.014       42,618,858
 2     43.879%      2.821/2.856/2.872/0.012        0.583/0.601/0.613/0.009       20,714,938
 3     42.383%      3.722/3.745/3.766/0.016        0.568/0.588/0.600/0.010       20,008,755
 4     41.660%      2.782/2.802/2.818/0.011        0.350/0.376/0.390/0.012       13,111,391
 5     41.216%      3.001/3.035/3.052/0.014        0.339/0.371/0.388/0.013       12,971,746
 6     41.039%      3.511/3.541/3.555/0.013        0.345/0.369/0.379/0.010       12,916,025
 7     40.778%      2.260/2.273/2.286/0.009        0.170/0.181/0.187/0.004        6,416,924
 8     40.704%      2.785/2.805/2.817/0.010        0.149/0.177/0.191/0.011        6,405,244
 9     40.696%      2.997/3.021/3.034/0.009        0.165/0.180/0.190/0.008        6,404,084

 avg1  42.947%                        2.983                          0.427
 tot                                536.961                         76.853      141,567,965

PR #934 aarch64

   text    data     bss     dec     hex filename
 105234    1576      16  106826   1a14a libz-ng.so.2

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.166%      2.708/2.764/2.791/0.024        0.938/0.997/1.014/0.017       42,618,858
 2     43.879%      2.820/2.850/2.870/0.014        0.582/0.602/0.616/0.010       20,714,938
 3     42.383%      3.717/3.750/3.774/0.019        0.559/0.585/0.593/0.008       20,008,755
 4     41.660%      2.775/2.800/2.819/0.015        0.357/0.378/0.387/0.008       13,111,391
 5     41.216%      3.008/3.038/3.065/0.017        0.350/0.372/0.381/0.008       12,971,746
 6     41.039%      3.510/3.533/3.554/0.013        0.347/0.370/0.383/0.012       12,916,025
 7     40.778%      2.252/2.279/2.292/0.011        0.165/0.180/0.188/0.007        6,416,924
 8     40.704%      2.772/2.798/2.814/0.013        0.157/0.176/0.184/0.007        6,405,244
 9     40.696%      3.000/3.025/3.035/0.009        0.160/0.174/0.186/0.006        6,404,084

 avg1  42.947%                        2.982                          0.426
 tot                                536.761                         76.709      141,567,965

This is aarch64 and GCC 8.3.0. So no slowdown there.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Oct 12, 2021

Proposal for discussion:
What do you think about making our own zmemcpy/zmemcmp that on older/unoptimized libc will default to the current way of doing things, but on modern/optimized libc/gcc it is just an alias for the built-in.

Implementing that should be pretty easy, performance would be best of both worlds, cleaner code than today, easily maintainable and also easy to get rid of when/if we decide to in a few years.
The biggest challenge is probably making the detection/selection logic that decides when to use what.

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Oct 12, 2021

@Dead2 Because memcpy() is implemented inside the compiler, making own version that would actually be better would require pretty much rewriting the code generation part of compiler... This is because different architectures and even processors have different internal bus widths and compiler can choose the best implementation depending on both bus width and data length/alignment.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Oct 12, 2021

@mtl1979 I am not talking about implementing full memcpy/memcmp, but rather macros that resolve into the same cast+assignments and such that we use today. My proposed names might be misleading though.

Something like this:

// Pseudocode
if OLD:
    define zmemcmp_2(str1, str2) *(uint16_t *)str1 != *(uint16_t)str2
else:
    define zmemcmp_2(str1, str2) memcmp(str1, str2, 2)
endif

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Oct 12, 2021

@Dead2 That would just overlap what we already use UNALIGNED_OK for... Implementing memcmp() is a lot easier than implementing memcpy().

that way any future optimizers won't make false
assumptions that those reads are actually aligned

compiler optimizes out memcpy/memcmp calls so
generated assembly code remains identical

keep old behaviour with gcc <8
@nmoinvaz
Copy link
Copy Markdown
Member

@Dead2 I like your idea about macros.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jan 8, 2022

This is obsoleted by #1086 and #1102

@Dead2 Dead2 closed this Jan 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Next Devel Targeting next devel release (does not mean accepted) Rebase needed Please do a 'git rebase develop yourbranch'

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants