Refactor trees.c, deflate.c and deflate.h for gcc x86-64. #2
…ons we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c. Improve output buffer performance by using a 64-bit buffer instead of a 16-bit one. All in all, ~10% performance gain for lvl 4 and ~5% performance gain for lvl 5.
I don't think it makes much sense to remove those contribs. It makes for a huge diff and makes it a lot harder to keep in sync with the mainstream.
Mainstream development has stalled; I think we will gain a lot by branching.
Saturated sub: after the loop in fill_window was fixed (I think it was you?), gcc 4.9.1 does emit a saturated subtract, but surrounds it with some completely redundant code, and the result is slower. I didn't check what gcc 4.8 did, but intrinsics are good for any processor.
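To make the saturated-subtract point concrete, here is a minimal sketch of rebasing 16-bit hash entries with the SSE2 intrinsic `_mm_subs_epu16`. The function name `slide_table` and its signature are hypothetical and not taken from this patch; the scalar loop handles the tail and serves as a fallback where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not the patch's code): after the window slides by
 * `wsize` bytes, rebase `n` 16-bit table entries. Entries smaller than
 * `wsize` point at data that is gone, so they saturate to 0. */
static void slide_table(uint16_t *table, size_t n, uint16_t wsize)
{
    size_t i = 0;
    assert(table != NULL);
#if defined(__SSE2__)
    /* Eight entries per iteration; unsigned saturation clamps at 0. */
    const __m128i delta = _mm_set1_epi16((short)wsize);
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(table + i));
        v = _mm_subs_epu16(v, delta);
        _mm_storeu_si128((__m128i *)(table + i), v);
    }
#endif
    /* Scalar tail, and the full loop when SSE2 is not available. */
    for (; i < n; i++)
        table[i] = (uint16_t)(table[i] >= wsize ? table[i] - wsize : 0);
}
```

The scalar branch is the compare-and-subtract that the compiler is expected to vectorize on its own; writing the intrinsic explicitly removes the dependence on which code a given gcc version happens to emit.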
This is not a memory buffer, but rather a "register" used to flush the actual bits. So zero memory usage.
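As an illustration of the "register" idea, here is a rough sketch of a bit writer that accumulates codes in a 64-bit variable and only touches memory when 32 bits are ready. The names `bit_writer`, `put_bits` and `flush_bits` are made up for this example and the flush threshold is an assumption; the actual patch may organize this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; names are illustrative, not zlib's. */
typedef struct {
    uint8_t *out;    /* destination for finished bytes */
    size_t   pos;    /* bytes written so far */
    uint64_t bits;   /* accumulator, filled from the low end */
    unsigned count;  /* number of valid bits in `bits` (< 32 between calls) */
} bit_writer;

/* Append the low `len` bits of `value`, least significant bit first. */
static void put_bits(bit_writer *w, uint32_t value, unsigned len)
{
    assert(len >= 1 && len <= 16);
    value &= (1u << len) - 1;
    w->bits |= (uint64_t)value << w->count;  /* count < 32, so no overflow */
    w->count += len;
    if (w->count >= 32) {
        /* Emit four whole bytes at once, low byte first. */
        for (int i = 0; i < 4; i++)
            w->out[w->pos++] = (uint8_t)(w->bits >> (8 * i));
        w->bits >>= 32;
        w->count -= 32;
    }
}

/* Write out any remaining bits, zero-padding the last byte. */
static void flush_bits(bit_writer *w)
{
    while (w->count > 0) {
        w->out[w->pos++] = (uint8_t)w->bits;
        w->bits >>= 8;
        w->count = w->count > 8 ? w->count - 8 : 0;
    }
}
```

With a 16-bit accumulator nearly every code triggers a store to the pending buffer; with 64 bits, several codes are merged in the register before any byte is written.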
On 01/19/2015 11:01 AM, vkrasnov wrote:
How much difference did you obtain from this change? I recall I wrote … Replacing the code with intrinsics makes it difficult to port to other …
Not much, really. It is mostly noticeable on level 4, which spends little time on actual matches. For very short files this is never performed, and for medium to large files I saw about 2-4% overall perf improvement on Haswell. For level 6 and up you can't tell the difference.
The only other port I can think of is ARMv8, which does have a NEON instruction for this.
LGTM
So I can merge this one?
Refactor trees.c, deflate.c and deflate.h for gcc x86-64.
Remove options we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require, integrate the hash func and longest_match funcs into deflate.c.
Use intrinsics where the compiler struggles.
Improve output buffer performance by using a 64-bit buffer instead of a 16-bit one.
All in all ~10% performance gain for lvl 4 and ~5% performance gain for lvl 5.