use memcpy/memcmp for unaligned reads#934
use memcpy/memcmp for unaligned reads#934matijaskala wants to merge 1 commit intozlib-ng:developfrom matijaskala:develop
Conversation
| *(uint32_t *)(&s->pending_buf[s->pending]) = (lld >> 32) & 0xffffffff; | ||
| s->pending += 4; | ||
| memcpy(&s->pending_buf[s->pending], lld, 8); | ||
| s->pending += 8; |
There was a problem hiding this comment.
This change doesn't make sense... First two cases would need to be combined for little-endian systems.
There was a problem hiding this comment.
Has this been resolved? Seems like if the compiler doesn't support 64 bit unaligned it can split it up into 2 x 32-bit reads or perhaps just do memcpy.
| uint32_t sv = *(uint32_t *)src0; | ||
| uint32_t mv = *(uint32_t *)src1; | ||
| uint32_t sv; | ||
| uint32_t mv; |
There was a problem hiding this comment.
As both variables have same type and they are not assigned a value that depend on each other, they can be declared on same line. Only when the variables are members of a struct, they should be declared on separate lines.
| uint64_t sv = *(uint64_t *)src0; | ||
| uint64_t mv = *(uint64_t *)src1; | ||
| uint64_t sv; | ||
| uint64_t mv; |
There was a problem hiding this comment.
As both variables have same type and they are not assigned a value that depend on each other, they can be declared on same line.
| put_byte(s, ((dw >> 16) & 0xff)); | ||
| put_byte(s, ((dw >> 8) & 0xff)); | ||
| put_byte(s, (dw & 0xff)); | ||
| # error No endian defined |
There was a problem hiding this comment.
Should be No endianess defined.
There was a problem hiding this comment.
I copied the text from functable.c
There was a problem hiding this comment.
I haven't reviewed all files... There is three long-time contributors and only two need to review a pull request before it gets merged.
Codecov Report
@@ Coverage Diff @@
## develop #934 +/- ##
===========================================
- Coverage 77.95% 76.47% -1.49%
===========================================
Files 85 78 -7
Lines 8700 8250 -450
Branches 1396 1354 -42
===========================================
- Hits 6782 6309 -473
- Misses 1369 1409 +40
+ Partials 549 532 -17
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
I think this PR is good. I'm not sure what is left to do on it. Benchmarks? It allows the compiler to determine if unaligned read/writes are allowed by using memcpy/memcmp. Also gets rid of my convolution I had in |
|
It looked to me like @mtl1979 had requested changes, but perhaps he can clarify. |
| @@ -33,7 +33,7 @@ Z_INTERNAL Pos QUICK_INSERT_STRING(deflate_state *const s, const uint32_t str) { | |||
| uint32_t val, hm, h = 0; | |||
|
|
|||
| #ifdef UNALIGNED_OK | |||
There was a problem hiding this comment.
Instead of checking if UNALIGNED_OK is defined, we would need to check if system is little-endian or big-endian.
| @@ -68,7 +68,7 @@ Z_INTERNAL void INSERT_STRING(deflate_state *const s, const uint32_t str, uint32 | |||
| uint32_t val, hm, h = 0; | |||
|
|
|||
| #ifdef UNALIGNED_OK | |||
There was a problem hiding this comment.
Instead of checking if UNALIGNED_OK is defined, we would need to check if system is little-endian or big-endian.
mtl1979
left a comment
There was a problem hiding this comment.
Seems work-in-progress. Needs some logic changes.
|
This PR seems ok to me. Just should probably be benchmarked so to make sure no degradation. |
|
Using ZLIB-NG 834e7d8ZLIB-NG PR 934 2f1f56d |
|
Second run using -DWITH_OPTIM=ON: ZLIB-NG 834e7d8ZLIB-NG 2f1f56d |
|
This needs a little rebase. |
Baseline 21050e0 x86-64PR #934 2f1f56d x86-64This PR unfortunately makes compression 36% slower than the baseline, and it adds quite a bit to the compiled code size (text). This was tested with GCC 4.8.5 and Glibc 2.17 |
@Dead2 Did you have a chance to test this with a newer GCC+glibc? glibc 2.17 caught my attention, as it is ~8y old. The implementation of core string functions (like memcmp and memcpy) are frequently revisited on glibc, so I wouldn't be surprised if faster implementations are available on newer glibc versions. Of course this wouldn't change the situation for people using zlib-ng with an older toolchain. But I just wanted to add more info to the discussion anyways =) |
|
We don't officially support anything that is 8 years old... Even Visual C++ 2013 is too old for us.. GCC versions from 4.8 and 4.9 had optimizer bugs that I have mentioned a long time ago, not to mention they don't support instruction sets introduced after their release. On my development machine I have gcc 6.5 and clang 6.0 as oldest version and Visual C++ 2015 for Windows builds. For more recent compiler tests I use gcc 11, clang 12 and Visual C++ 2022. |
|
@mtl1979 I have officially stated multiple times that RHEL 7 and derived distros are to be supported as a lower-end. The above test was done on such a machine. The compiler not supporting newer features is no problem if the features are optional, but this PR would be much harder to make optional. MSVC is a (very) special case due to them not wanting to really support C until recently. @mscastanho I suspect that is the problem (that is why I also posted those details), but I have not had the time to run any more tests yet. |
|
@Dead2 I don't care what is the oldest operating system we (as contributors or developers of zlib-ng) use or support... Officially we support almost all 32-bit and 64-bit operating systems that can run either gcc, clang or Visual C++ compiler... Other compilers might work, but due to bugs or missing features all compilers that are more recent than 6 years or so might not be usable, for example early builds of Visual C++ 2017 and 2019 which were unusable due to critical bugs that were eventually fixed in later builds. RHEL 7 has gcc 5-8 as optional packages. I still run BeOS 5.0.3 and Windows Vista on my older computers, both are quite old operating systems. |
Baseline 21050e0 aarch64PR #934 aarch64This is aarch64 and GCC 8.3.0. So no slowdown there. |
|
Proposal for discussion: Implementing that should be pretty easy, performance would be best of both worlds, cleaner code than today, easily maintainable and also easy to get rid of when/if we decide to in a few years. |
|
@Dead2 Because memcpy() is implemented inside the compiler, making own version that would actually be better would require pretty much rewriting the code generation part of compiler... This is because different architectures and even processors have different internal bus widths and compiler can choose the best implementation depending on both bus width and data length/alignment. |
|
@mtl1979 I am not talking about implementing full memcpy/memcmp, but rather macros that resolve into the same cast+assignments and such that we use today. My proposed names might be misleading though. Something like this: |
|
@Dead2 That would just overlap what we already use |
that way any future optimizers won't make false assumptions that those reads are actually aligned compiler optimizes out memcpy/memcmp calls so generated assembly code remains identical keep old behaviour with gcc <8
|
@Dead2 I like your idea about macros. |
that way any future optimizers won't make false
assumptions that those reads are actually aligned
compiler optimizes out memcpy/memcmp calls so
generated assembly code remains identical