Skip to content

Refactor trees.c, deflate.c and deflate.h for gcc x86-64.#2

Merged
vkrasnov merged 2 commits intogcc.amd64from
refactor.gcc.amd64
Mar 3, 2015
Merged

Refactor trees.c, deflate.c and deflate.h for gcc x86-64.#2
vkrasnov merged 2 commits intogcc.amd64from
refactor.gcc.amd64

Conversation

@vkrasnov
Copy link
Copy Markdown

Remove options we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c.
Use intrinsics, where compiler struggles.
Improve output buffer performance by using 64 bit buffer instead 16 bit.
All in all ~10% performance gain for lvl 4 and ~5% performance gain for lvl 5.

…ons we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c.

Improve output buffer performance by using 64 bit buffer instead 16 bit.
All in all ~10% performance gain for lvl 4 and ~5% performance gain for lvl 5
@yangshuxin
Copy link
Copy Markdown

Remove options we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c.

I don't think it makes big sense to remote those contrib. As it make huge diff and render it lots harder to keep in sync with the mainstream.

Use intrinsics, where compiler struggles.
What intrisincis, and why compilers struggles?

Improve output buffer performance by using 64 bit buffer instead 16 bit.
Correct me if I'm wrong, we care memory usage as well, as far as I know, i leave as much as possible for buffer cache. How much addition memory usage we see in practice? What are the size of inputs you used for the testing?

@vkrasnov
Copy link
Copy Markdown
Author

I don't think it makes big sense to remote those contrib. As it make huge diff and render it lots harder to keep in sync with the mainstream.

The mainstream development came to a stall, I think we will gain a lot by branching.

What intrisincis, and why compilers struggles?

Saturated sub. After (I think it was you?) fixed the loop in fill_window gcc 4.9.1 emits saturated sub, but with some absolutely redundant code, with slower performance. I didn't check what gcc 4.8 did, but intrinsics are good for any processor.

Correct me if I'm wrong, we care memory usage as well, as far as I know, i leave as much as possible for buffer cache. How much addition memory usage we see in practice? What are the size of inputs you used for the testing?

This is not a memory buffer, but rather a "register" used to flush the actual bits. So zero memory usage.
However I do think we should sacrifice memory to gain performance. We can easily use 10x times the memory without problems.

@yangshuxin
Copy link
Copy Markdown

On 01/19/2015 11:01 AM, vkrasnov wrote:

I don't think it makes big sense to remote those contrib. As it
make huge diff and render it lots harder to keep in sync with the
mainstream.

The mainstream development came to a stall, I think we will gain a lot
by branching.

What intrisincis, and why compilers struggles?

Saturated sub. After (I think it was you?) fixed the loop in
fill_window gcc 4.9.1 emits saturated sub, but with some absolutely
redundant code, with slower performance. I didn't check what gcc 4.8
did, but intrinsics are good for any processor.

How much difference did you obtained from this change? I recall I wrote
some asm to replace them and see no difference.
The code generated by gcc is indeed stupid, but it does not seems to
hurt the performance. I guess those redundant code
are not in critical path. I think it's not hard to improve. In the
following releases, gcc could get rid of the defects.

Replacing the code with intrinsic make it difficult port to other
architectures. What if those architectures do not support
vectorized saturated sub? They could resort to if-convertion to
vectorize this loop which yield about the same performance.

Correct me if I'm wrong, we care memory usage as well, as far as I
know, i leave as much as possible for buffer cache. How much
addition memory usage we see in practice? What are the size of
inputs you used for the testing?

This is not a memory buffer, but rather a "register" used to flush the
actual bits. So zero memory usage.
However I do think we should sacrifice memory to gain performance. We
can easily use 10x times the memory without problems.


Reply to this email directly or view it on GitHub
#2 (comment).

@vkrasnov
Copy link
Copy Markdown
Author

How much difference did you obtained from this change? I recall I wrote
some asm to replace them and see no difference.
The code generated by gcc is indeed stupid, but it does not seems to
hurt the performance. I guess those redundant code
are not in critical path. I think it's not hard to improve. In the
following releases, gcc could get rid of the defects.

Not much really. It is mostly noticeable on level 4, that spends little time on actual matches. For very short files, this never performed, and for medium large files I saw about 2-4% overall perf improvement on Haswell. For level 6 and up you can't tell the difference.

Replacing the code with intrinsic make it difficult port to other
architectures. What if those architectures do not support
vectorized saturated sub? They could resort to if-convertion to
vectorize this loop which yield about the same performance.

I can only think of an ARMv8 port, that does have NEON instruction for this.

@jgrahamc
Copy link
Copy Markdown

LGTM

@vkrasnov
Copy link
Copy Markdown
Author

vkrasnov commented Mar 2, 2015

So I can merge this one?

vkrasnov added a commit that referenced this pull request Mar 3, 2015
Refactor trees.c, deflate.c and deflate.h for gcc x86-64.
@vkrasnov vkrasnov merged commit 83be2a1 into gcc.amd64 Mar 3, 2015
@vkrasnov vkrasnov deleted the refactor.gcc.amd64 branch July 8, 2015 15:54
geoff-nixon pushed a commit to geoff-nixon/zlib that referenced this pull request May 6, 2016
fhanau pushed a commit to fhanau/zlib that referenced this pull request Feb 27, 2023
* Update README.md

* Update title

* Update spec link
fhanau pushed a commit to fhanau/zlib that referenced this pull request Feb 27, 2023
Previously the status was set to "draft", which Bikeshed doesn't
understand. Change it to w3c/CG-DRAFT.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants