Port ARM inflate performance improvement patches (chunk SIMD, read64le) #22
Merged
vkrasnov merged 6 commits into cloudflare:gcc.amd64 from janaknat:chunk-simd-neon on Sep 23, 2020
the zlib header. The allowed values of the four-bit field are 0..7, but when windowBits is zero, values greater than 7 are permitted and acted upon, resulting in large, mostly unused memory allocations. This fix rejects such invalid zlib headers.
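The check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `zlib_header_ok` is a hypothetical helper, and the layout it assumes is the two-byte zlib header (compression method in the low four bits of the first byte, the four-bit window-size field in the high four bits, and a check that the two bytes read as a big-endian number are a multiple of 31).

```c
#include <stdint.h>

/* Hypothetical helper, not part of zlib: returns 1 if the two-byte zlib
   header (cmf, flg) passes the basic checks, 0 otherwise. */
int zlib_header_ok(uint8_t cmf, uint8_t flg)
{
    unsigned cm = cmf & 0x0fu;      /* compression method, 8 means deflate */
    unsigned cinfo = cmf >> 4;      /* four-bit window-size field */

    if (cm != 8)
        return 0;
    /* Reject window-size values above 7 unconditionally, so an invalid
       header can never trigger an oversized window allocation. */
    if (cinfo > 7)
        return 0;
    /* The header, read as a 16-bit big-endian value, must be a multiple of 31. */
    if ((((unsigned)cmf << 8) | flg) % 31u != 0)
        return 0;
    return 1;
}
```

For example, the common header bytes 0x78 0x9c pass, while 0x88 0x1c (window-size field 8, otherwise consistent) is rejected.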
The undocumented (except in these commit comments) function inflateValidate(strm, check) can be called after an inflateInit(), inflateInit2(), or inflateReset2() with check equal to zero to turn off the check value (CRC-32 or Adler-32) computation and comparison. Calling with check not equal to zero turns checking back on. This should only be called immediately after the init or reset function. inflateReset() does not change the state, so a previous inflateValidate() setting will remain in effect. This also turns off validation of the gzip header CRC when present. This should only be used when a zlib or gzip stream has already been checked, and repeated decompressions of the same stream no longer need to be validated.
expected type of state, deflate or inflate, and that at least the first several bytes of the internal state have not been clobbered.
This combines two patches that improve the readability and maintainability of the code by turning magic numbers into #defines. Based on Chris Blume's (cblume@chromium) patches for zlib chromium:
8888511 - "Zlib: Use defines for inffast"
b9c1566 - "Share inffast names in zlib"
These patches are needed when introducing the chunk SIMD NEON enhancements.
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
Based on 2 patches from the zlib chromium fork:
* Adenilson Cavalcanti (adenilson.cavalcanti@arm.com) 3060dcb - "zlib: inflate using wider loads and stores"
* Noel Gordon (noel@chromium.org) 64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"
The two patches combined provide around a 5-25% increase in inflate performance, depending on the workload, when measured with a modified zpipe.c and the Silesia corpus.
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
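The idea behind chunk copy can be sketched in plain C. This is a simplified illustration, not the code in chunkcopy.h: `chunkcopy_sketch` and `CHUNK_SIZE` are hypothetical names, the 16-byte width is an assumption chosen to match a 128-bit NEON or SSE2 register, and fixed-size `memcpy` calls stand in for the SIMD load and store. The sketch assumes the source and destination do not overlap, and that both buffers have at least `CHUNK_SIZE - 1` bytes of slack past `len`.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 16  /* assumed chunk width: one 128-bit vector register */

/* Hypothetical sketch: copy len bytes from `from` to `out` in whole chunks.
   The last chunk may read and write up to CHUNK_SIZE - 1 bytes past len,
   so the caller must guarantee that much slack in both buffers.
   Returns a pointer just past the len bytes requested. */
unsigned char *chunkcopy_sketch(unsigned char *out,
                                const unsigned char *from,
                                size_t len)
{
    unsigned char chunk[CHUNK_SIZE];
    size_t copied = 0;

    while (copied < len) {
        /* A fixed-size memcpy pair typically compiles to one wide load
           and one wide store, with no per-byte loop. */
        memcpy(chunk, from + copied, CHUNK_SIZE);
        memcpy(out + copied, chunk, CHUNK_SIZE);
        copied += CHUNK_SIZE;
    }
    return out + len;
}
```

Over-copying trades a few wasted bytes for the removal of the byte-granular tail loop, which is why the caller has to reserve slack at the end of the output window.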
Update the chunk-copy code with a wide input data reader, which consumes input in 64-bit (8 byte) chunks. Update inflate_fast_chunk_() to use the wide reader.
Based on Noel Gordon's (noel@chromium.org) patch for the zlib chromium fork:
8a8edc1 - "Increase inflate speed: read decoder input into a uint64_t"
This patch provides a 7-10% inflate performance improvement when tested with a modified zpipe.c and the Silesia corpus.
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
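A wide reader of this kind can be sketched as below. The names `read64le_sketch`, `bitbuf_sketch` and `refill_sketch` are hypothetical, and the refill policy (top up by 48 bits whenever fewer than 15 are buffered) is an assumption made for illustration rather than a statement about the patch. The sketch requires at least 8 readable bytes at the input pointer.

```c
#include <stdint.h>

/* Hypothetical sketch: assemble 8 input bytes into a 64-bit value in
   little-endian order. Written with shifts so it is correct on any host;
   compilers usually reduce it to a single 64-bit load on little-endian CPUs. */
uint64_t read64le_sketch(const unsigned char *in)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)in[i] << (8 * i);
    return v;
}

/* Hypothetical bit buffer: `hold` keeps unread bits, `bits` counts them. */
typedef struct {
    uint64_t hold;
    unsigned bits;
} bitbuf_sketch;

/* Refill in one step instead of byte by byte. Only 6 of the 8 bytes read
   are counted as consumed; the surplus high bits are the true next input
   bits, so OR-ing them in again on the next refill is harmless.
   Returns the advanced input pointer. */
const unsigned char *refill_sketch(bitbuf_sketch *b, const unsigned char *in)
{
    if (b->bits < 15) {
        b->hold |= read64le_sketch(in) << b->bits;
        in += 6;
        b->bits += 48;
    }
    return in;
}
```

A decoder built on this would take `n` bits with `b->hold & ((1u << n) - 1)`, then drop them with `b->hold >>= n; b->bits -= n;`.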
Author
Some inflate performance number breakdown (average of 5 runs):
Nice, what CPU was used for the benchmarks? It looks like Intel should be faster as well?
Author
@vkrasnov This was tested with a Graviton2 CPU. This patch series ports only the ARM enhancements.
Author
Any feedback on this patch series?
I am sorry, I was a bit busy, will review tomorrow.
This is great. I actually managed to apply some of the optimizations to Intel too for a 16% inflate speedup. I will apply those too.
Author
@vkrasnov Thanks. I had the Intel optimizations lined up next. Was waiting for this PR to land. If you've got that covered, that's great.
I am happy to accept another PR; I only have a PoC.
Author
@vkrasnov I've created a PR with the Intel optimizations.
This series ports ARM inflate performance improvement patches. With this, the performance improvement during inflate, tested using a modified zpipe.c and the Silesia corpus, is around 17-34%.
Patches 1-3 include a bug-fix and some code improvements taken from madler/zlib.
Patch 4 is a code readability port from zlib chromium.
Patch 5 introduces 3 new files: inffast_chunk.c, inffast_chunk.h and chunkcopy.h.
These incorporate the changes from 2 patches in zlib chromium.
Patch 6 is a port of a performance improvement patch from zlib chromium.