Refactor trees.c, deflate.c and deflate.h for gcc x86-64. by vkrasnov · Pull Request #2 · cloudflare/zlib

vkrasnov · 2015-01-19T15:43:24Z

Remove options we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, and integrate the hash and longest_match functions into deflate.c.
Use intrinsics where the compiler struggles.
Improve output buffer performance by using a 64-bit buffer instead of a 16-bit one.
All in all, a ~10% performance gain for level 4 and a ~5% performance gain for level 5.
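
For readers unfamiliar with the bit-buffer idea, here is a minimal sketch of a 64-bit accumulator, not the code from this PR. The names `bit_writer`, `put_bits`, `store_u64_le` and `flush_bits` are hypothetical, and the sketch assumes each call appends between 1 and 32 bits. The point it illustrates is that a wider accumulator reaches the output array once per 64 bits rather than once per 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; the names are illustrative, not zlib's. */
typedef struct {
    uint8_t *out;    /* destination bytes */
    size_t   pos;    /* bytes written so far */
    uint64_t bits;   /* pending bits, least significant bit first */
    unsigned count;  /* number of valid bits in `bits`, always 0..63 */
} bit_writer;

/* Write the accumulator as 8 little-endian bytes. */
static void store_u64_le(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Append the low `len` bits of `value`.
 * Assumes 1 <= len <= 32 and value < 2^len. */
static void put_bits(bit_writer *w, uint64_t value, unsigned len)
{
    w->bits |= value << w->count;        /* count < 64, so the shift is defined */
    if (w->count + len < 64) {
        w->count += len;                 /* still room: no memory traffic */
        return;
    }
    store_u64_le(w->out + w->pos, w->bits);
    w->pos += 8;
    unsigned used = 64 - w->count;       /* bits of `value` already stored */
    w->bits = value >> used;             /* carry over the remaining bits */
    w->count = len - used;
}

/* Write out any pending bits, padding the last byte with zeros. */
static void flush_bits(bit_writer *w)
{
    while (w->count > 0) {
        w->out[w->pos++] = (uint8_t)w->bits;
        w->bits >>= 8;
        w->count = w->count > 8 ? w->count - 8 : 0;
    }
}
```

The accumulator lives in a struct field here for clarity; in a hot loop a compiler can usually keep it in a register.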


yangshuxin · 2015-01-19T17:32:41Z

Remove options we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c.

I don't think it makes much sense to remove those contribs. It makes for a huge diff and makes it a lot harder to keep in sync with the mainstream.

Use intrinsics, where compiler struggles.
Which intrinsics, and why does the compiler struggle?

Improve output buffer performance by using 64 bit buffer instead 16 bit.
Correct me if I'm wrong, but we care about memory usage as well; as far as I know, we leave as much as possible for the buffer cache. How much additional memory usage do we see in practice? What input sizes did you use for testing?

vkrasnov · 2015-01-19T19:01:40Z

I don't think it makes much sense to remove those contribs. It makes for a huge diff and makes it a lot harder to keep in sync with the mainstream.

Mainstream development has stalled; I think we will gain a lot by branching.

Which intrinsics, and why does the compiler struggle?

Saturated sub. After the loop in fill_window was fixed (I think it was you?), gcc 4.9.1 emits a saturated sub, but with some absolutely redundant code and slower performance. I didn't check what gcc 4.8 did, but intrinsics are good for any processor.
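
As an illustration of the saturated subtraction being discussed, here is a minimal SSE2 sketch, not the code from this PR. The function name `slide_hash_sse2` and its parameters are hypothetical, and it assumes an x86 target with SSE2 and a table length that is a multiple of 8. It shows the operation the fill_window loop performs on 16-bit table entries: subtract the window size and clamp at zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: subtract `wsize` from each 16-bit entry, clamping at 0.
 * Assumes `n` is a multiple of 8 (eight 16-bit lanes per 128-bit vector). */
static void slide_hash_sse2(uint16_t *table, size_t n, uint16_t wsize)
{
    const __m128i delta = _mm_set1_epi16((short)wsize);
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(table + i));
        v = _mm_subs_epu16(v, delta);   /* unsigned saturating subtract */
        _mm_storeu_si128((__m128i *)(table + i), v);
    }
}
```

Unaligned loads and stores are used so the sketch makes no assumption about how the table was allocated.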

Correct me if I'm wrong, but we care about memory usage as well; as far as I know, we leave as much as possible for the buffer cache. How much additional memory usage do we see in practice? What input sizes did you use for testing?

This is not a memory buffer, but rather a "register" used to flush the actual bits. So zero additional memory usage.
However, I do think we should sacrifice memory to gain performance. We can easily use 10x the memory without problems.

yangshuxin · 2015-01-19T19:12:23Z

On 01/19/2015 11:01 AM, vkrasnov wrote:

I don't think it makes much sense to remove those contribs. It
makes for a huge diff and makes it a lot harder to keep in sync with
the mainstream.
Mainstream development has stalled; I think we will gain a lot
by branching.
Which intrinsics, and why does the compiler struggle?
Saturated sub. After the loop in fill_window was fixed (I think it
was you?), gcc 4.9.1 emits a saturated sub, but with some absolutely
redundant code and slower performance. I didn't check what gcc 4.8
did, but intrinsics are good for any processor.

How much difference did you get from this change? I recall I wrote
some asm to replace this code and saw no difference.
The code generated by gcc is indeed stupid, but it does not seem to
hurt performance. I guess the redundant code is not on the
critical path. I think it's not hard to improve; future gcc
releases could get rid of the defect.

Replacing the code with intrinsics makes it difficult to port to other
architectures. What if those architectures do not support
vectorized saturated sub? They could resort to if-conversion to
vectorize this loop, which yields about the same performance.
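
For comparison, the portable form of the same loop can be sketched as below. The function name `slide_hash_portable` is hypothetical; the conditional expression mirrors the clamp-at-zero subtraction discussed in this thread. A compiler may if-convert the ternary into a branchless select and vectorize the loop, though whether it does depends on the compiler and target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable version: subtract `wsize` from each entry,
 * clamping at 0, with no intrinsics and no restriction on `n`. */
static void slide_hash_portable(uint16_t *table, size_t n, uint16_t wsize)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t m = table[i];
        /* Written as a select rather than an if, which is the shape
         * if-conversion produces. */
        table[i] = (uint16_t)(m >= wsize ? m - wsize : 0);
    }
}
```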

vkrasnov · 2015-01-19T19:28:40Z

How much difference did you get from this change? I recall I wrote
some asm to replace this code and saw no difference.
The code generated by gcc is indeed stupid, but it does not seem to
hurt performance. I guess the redundant code is not on the
critical path. I think it's not hard to improve; future gcc
releases could get rid of the defect.

Not much, really. It is mostly noticeable on level 4, which spends little time on actual matching. For very short files this code never runs, and for medium to large files I saw about a 2-4% overall perf improvement on Haswell. For level 6 and up you can't tell the difference.

Replacing the code with intrinsics makes it difficult to port to other
architectures. What if those architectures do not support
vectorized saturated sub? They could resort to if-conversion to
vectorize this loop, which yields about the same performance.

I can only think of an ARMv8 port, and that does have a NEON instruction for this.

jgrahamc · 2015-02-12T12:21:54Z

LGTM

vkrasnov · 2015-03-02T15:31:49Z

So I can merge this one?

Refactor trees.c, deflate.c and deflate.h for gcc x86-64.


Additional 4%-6% speedup for all levels

a17deee

vkrasnov added a commit that referenced this pull request Mar 3, 2015

Merge pull request #2 from cloudflare/refactor.gcc.amd64

83be2a1

Refactor trees.c, deflate.c and deflate.h for gcc x86-64.

vkrasnov merged commit 83be2a1 into gcc.amd64 Mar 3, 2015

vkrasnov deleted the refactor.gcc.amd64 branch July 8, 2015 15:54

geoff-nixon pushed a commit to geoff-nixon/zlib that referenced this pull request May 6, 2016

Fix for intels zlib fork, fixes their issues cloudflare#2 and cloudfl…

6901ac1

…are#3

fhanau pushed a commit to fhanau/zlib that referenced this pull request Feb 27, 2023

Clearer message (cloudflare#2)

2e297c7

* Update README.md * Update title * Update spec link

fhanau pushed a commit to fhanau/zlib that referenced this pull request Feb 27, 2023

Fix document status (cloudflare#2)

927ffe2

Previously the status was set to "draft", which Bikeshed doesn't understand. Change it to w3c/CG-DRAFT.

vkrasnov merged 2 commits into gcc.amd64 from refactor.gcc.amd64