Skip to content

Port ARM inflate performance improvement patches (chunk SIMD, read64le)#22

Merged
vkrasnov merged 6 commits intocloudflare:gcc.amd64from
janaknat:chunk-simd-neon
Sep 23, 2020
Merged

Port ARM inflate performance improvement patches (chunk SIMD, read64le)#22
vkrasnov merged 6 commits intocloudflare:gcc.amd64from
janaknat:chunk-simd-neon

Conversation

@janaknat
Copy link
Copy Markdown

This series ports ARM inflate performance improvement patches. With this, the performance improvement during inflate, tested using a modified zpipe.c and the Silesia corpus, is around 17-34%.

Patches 1-3 include a bug-fix and some code improvements taken from madler/zlib.

Patch 4 is a code readability port from zlib chromium.

* Chris Blume (cblume@chromium):
  8888511 - "Zlib: Use defines for inffast"
  b9c1566 - "Share inffast names in zlib"

Patch 5 introduces 3 new files: inffast_chunk.c, inffast_chunk.h and chunkcopy.h.
These incorporate the changes from 2 patches in zlib chromium.

* Adenilson Cavalcanti (adenilson.cavalcanti@arm.com)
  3060dcb - "zlib: inflate using wider loads and stores"

* Noel Gordon (noel@chromium.org)
  64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"

Patch 6 is a port of a performance improvement patch from zlib chromium.

* Noel Gordon (noel@chromium.org)
  8a8edc1 - "Increase inflate speed: read decoder input into a uint64_t"

madler and others added 6 commits August 25, 2020 19:26
the zlib header.  The allowed values of the four-bit field are
0..7, but when windowBits is zero, values greater than 7 are
permitted and acted upon, resulting in large, mostly unused memory
allocations.  This fix rejects such invalid zlib headers.
The undocumented (except in these commit comments) function
inflateValidate(strm, check) can be called after an inflateInit(),
inflateInit2(), or inflateReset2() with check equal to zero to
turn off the check value (CRC-32 or Adler-32) computation and
comparison. Calling with check not equal to zero turns checking
back on. This should only be called immediately after the init or
reset function. inflateReset() does not change the state, so a
previous inflateValidate() setting will remain in effect.

This also turns off validation of the gzip header CRC when
present.

This should only be used when a zlib or gzip stream has already
been checked, and repeated decompressions of the same stream no
longer need to be validated.
expected type of state, deflate or inflate, and that at least the
first several bytes of the internal state have not been clobbered.
This combines two patches which help in improving the readability and
maintainability of the code by making magic numbers into #defines.

Based on Chris Blume's (cblume@chromium) patches for zlib chromium:
8888511 - "Zlib: Use defines for inffast"
b9c1566 - "Share inffast names in zlib"

These patches are needed when introducing chunk SIMD NEON enchancements.

Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
Based on 2 patches from zlib chromium fork:

* Adenilson Cavalcanti (adenilson.cavalcanti@arm.com)
  3060dcb - "zlib: inflate using wider loads and stores"

* Noel Gordon (noel@chromium.org)
  64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy

The two patches combined provide around 5-25% increase in inflate
performance, based on the workload, when checked with a modified
zpipe.c and the Silesia corpus.

Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
Update the chunk-copy code with a wide input data reader, which consumes
input in 64-bit (8 byte) chunks. Update inflate_fast_chunk_() to use the
wide reader.

Based on Noel Gordon's (noel@chromium.org) patch for the zlib chromium fork
8a8edc1 - "Increase inflate speed: read decoder input into a uint64_t"

This patch provides 7-10% inflate performance improvement when tested with a
modified zpipe.c and the Silesia corpus.

Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
@janaknat
Copy link
Copy Markdown
Author

Some inflate performance number breakdown (average of 5 runs):

Dataset File Throughput before (kb/ms) Throughput after (kb/ms) % change
Silesia xml 70.86 94.81 +34%
x-ray 138.68 155.01 +12%
webster 83.47 108.71 +30%
sao 154.70 180.67 +17%
samba 94.49 122.20 +30%
reymont 82.65 108.34 +31%
osdb 108.96 128.85 +18%
ooffice 107.34 126.89 +17%
nci 61.99 82.12 +34%
mr 94.72 122.25 +29%
mozilla 107.81 131.39 +22%
dickens 88.84 118.96 +34%

@vkrasnov
Copy link
Copy Markdown

vkrasnov commented Sep 4, 2020

Nice, what CPU was used for the benchmarks? It looks like Intel should be faster as well?

@janaknat
Copy link
Copy Markdown
Author

janaknat commented Sep 8, 2020

@vkrasnov This was tested with a Graviton2 CPU. With this patch series only ARM enhancements are ported.

@janaknat
Copy link
Copy Markdown
Author

Any feedback on this patch series?

@vkrasnov
Copy link
Copy Markdown

I am sorry, I was a bit busy, will review tomorrow.

@vkrasnov vkrasnov self-requested a review September 23, 2020 17:06
@vkrasnov vkrasnov merged commit e76d32d into cloudflare:gcc.amd64 Sep 23, 2020
@vkrasnov
Copy link
Copy Markdown

This is great. I actually managed to apply some of the optimizations to Intel too for a 16% inflate speedup. I will apply those too.

@janaknat
Copy link
Copy Markdown
Author

@vkrasnov Thanks. I had the Intel optimizations lined up next. Was waiting for this PR to land. If you've got that covered, that's great.

@vkrasnov
Copy link
Copy Markdown

I am happy to accept another PR, I just have a PoC

@janaknat
Copy link
Copy Markdown
Author

@vkrasnov I've created a PR with the Intel optimizations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants