Port Intel optimizations (adler32, chunkcopy) to cloudflare#23
vkrasnov merged 2 commits into cloudflare:gcc.amd64 from janaknat:intel-optimizations
Based on the adler32-simd patch from Noel Gordon for the chromium fork of zlib.
17bbb3d73c84 ("zlib adler_simd.c")
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
Based on 2 patches from zlib chromium fork:
* Adenilson Cavalcanti (adenilson.cavalcanti@arm.com) 3060dcb - "zlib: inflate using wider loads and stores"
* Noel Gordon (noel@chromium.org) 64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"

The improvement in inflate performance is around 15-35%, based on the workload, when checked with a modified zpipe.c and the Silesia corpus.

Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
|
Performance numbers using the Silesia Corpus and a modified zpipe.c
|
|
@vkrasnov Any feedback on this series? |
|
This is perfect! |
|
@vkrasnov Thanks for merging the PR! |
|
So I built and compared #82035d0687c8d34981c93589f35982149fb34591 with #836eb111a5c5df7db3f2469a867a6f7c1b2e7bdb by running minigzip on a 305 MB source tarball (QtBase 5.12.6, uncompressed; no options, reading from stdin and sending the output to /dev/null).
To my surprise, compression time increased about 5x in the current release. Please tell me that makes sense because the compression settings were changed?!
|
|
@RJVB Can you provide a link to the source tarball? How did you measure the time taken? Also, the changes made in this PR are for de-compression(inflate). No changes were made to the compression code path. |
|
This is very weird, but he is right. In fact, inflate is slower too, because somehow the -O3 flag goes missing after running configure. |
|
Fixed now in #24 |
|
Thanks @RJVB ! |
|
> Thanks @RJVB !
Well, thank *you* for finding the cause so quickly!
I'm still at almost 2x slower compression (and that is in fact compared against a 32-bit build of the older code!) but decompression is indeed significantly faster. I don't know if you can compare the number of instructions between 64-bit and 32-bit builds, but according to `perf stat` the older version requires more than 2x as many instructions to decompress the same file (and almost 2x as many cycles).
|
|
@RJVB Do you see the same compression slow-down without the changes in this PR? |
|
@janaknat I do not see any benefits for compressing Silesia files to/from gzip format. I compared the latest version of this repository (zlibCFX) to a fork from June (zlibCF) on my MacBook with in-built SSD. Same performance regardless of whether I use make or cmake to compile the application. I suspect this is because Gzip uses crc32 not Adler-32. On the other hand, decompression is faster, but only appreciably if compiled with
|
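The suspicion above about checksums can be checked directly. Per RFC 1950, a zlib stream ends with the Adler-32 of the uncompressed data, while per RFC 1952 a gzip stream ends with a CRC-32 and the input size. The following is a minimal sketch using Python's standard `zlib` and `gzip` modules (not the fork under discussion); `adler32_reference` and `payload` are names made up for illustration.

```python
import gzip
import struct
import zlib


def adler32_reference(data: bytes) -> int:
    """Plain-Python Adler-32 as defined in RFC 1950 (illustration only, slow)."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % 65521  # 65521 is the largest prime below 2**16
        b = (b + a) % 65521
    return (b << 16) | a


# Hypothetical test input; any bytes object works.
payload = b"Wikipedia" * 1000

# zlib container: the last 4 bytes are the big-endian Adler-32 of the input.
zlib_stream = zlib.compress(payload)
zlib_trailer = struct.unpack(">I", zlib_stream[-4:])[0]

# gzip container: the last 8 bytes are little-endian CRC-32, then input size.
gzip_stream = gzip.compress(payload)
gzip_crc, gzip_isize = struct.unpack("<II", gzip_stream[-8:])

print(zlib_trailer == zlib.adler32(payload) == adler32_reference(payload))
print(gzip_crc == zlib.crc32(payload))
print(gzip_isize == len(payload))
```

Each comparison should print `True`. An Adler-32 speedup would therefore only show up on zlib-wrapped streams, which is consistent with the gzip observation, though it does not by itself account for the timings reported in this thread.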
|
> @RJVB Do you see the same compression slow-down without the changes in this PR?

I'd have to check. I haven't yet, because I never thought it was due to this change, especially after learning that the optimisations are only for decompression.
|
|
@neurolabusc These are de-compression optimization changes, so it is not surprising that there are no changes in compression performance. I can take a look at the work that needs to be done for cmake. |
This series ports the x86 inflate performance improvements from the chromium fork of zlib.
Based on 3 patches:
17bbb3d73c84 - "zlib adler_simd.c"
3060dcb - "zlib: inflate using wider loads and stores"
64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"
With these changes, the inflate performance improvement is 15-35% when tested with a modified zpipe.c and the Silesia corpus.
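
One way to reproduce this kind of comparison, as a rough sketch only, is to time repeated inflation of the same compressed buffer and compare a baseline build against a patched one. The snippet below uses Python's bundled `zlib` module, so it measures whichever zlib Python is linked against, not necessarily this fork. `time_inflate` and `speedup_percent` are hypothetical helpers, and reading the 15-35% figure as a throughput gain is an assumption.

```python
import time
import zlib


def time_inflate(compressed: bytes, repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) to inflate `compressed`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.decompress(compressed)
        best = min(best, time.perf_counter() - start)
    return best


def speedup_percent(baseline_s: float, patched_s: float) -> float:
    """Throughput gain of the patched build over the baseline, in percent.

    Assumption: "improvement" means higher throughput for the same input.
    """
    return (baseline_s / patched_s - 1.0) * 100.0


if __name__ == "__main__":
    # Synthetic stand-in for a corpus file; real runs would read Silesia files.
    raw = bytes(range(256)) * 40_000
    blob = zlib.compress(raw)
    elapsed = time_inflate(blob)
    print(f"inflated {len(raw) / 1e6:.1f} MB in {elapsed * 1e3:.2f} ms")
```

Taking the best of several runs reduces noise from caching and scheduling. To compare two builds, the same script would be run once against each and the two times passed to `speedup_percent`.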