Port Intel optimizations (adler32, chunkcopy) to cloudflare by janaknat · Pull Request #23 · cloudflare/zlib

janaknat · 2020-09-24T18:48:07Z

This series ports the x86 inflate performance improvements from the chromium fork of zlib.

Based on 3 patches:

Noel Gordon (noel@chromium.org)
17bbb3d73c84 - "zlib adler_simd.c"
Adenilson Cavalcanti (adenilson.cavalcanti@arm.com)
3060dcb - "zlib: inflate using wider loads and stores"
Noel Gordon (noel@chromium.org)
64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy

With these changes, the inflate performance improvement is 15-35% when tested with a modified zpipe.c and the Silesia corpus.

Based on the adler32-simd patch from Noel Gordon for the chromium fork of zlib. 17bbb3d73c84 ("zlib adler_simd.c") Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>

Based on 2 patches from zlib chromium fork: * Adenilson Cavalcanti (adenilson.cavalcanti@arm.com) 3060dcb - "zlib: inflate using wider loads and stores" * Noel Gordon (noel@chromium.org) 64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy The improvement in inflate performance is around 15-35%, based on the workload, when checked with a modified zpipe.c and the Silesia corpus. Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>

janaknat · 2020-09-28T15:41:05Z

Performance numbers using the Silesia Corpus and a modified zpipe.c

Dataset	File	Throughput before (kb/ms)	Throughput after (kb/ms)	% change
Silesia	xml	65.21	84.29	+29%
	x-ray	138.18	159.91	+15%
	webster	75.03	95.93	+26%
	sao	155.98	189.81	+21%
	samba	88.70	111.74	+26%
	reymont	73.50	98.45	+34%
	osdb	99.36	120.32	+21%
	ooffice	98.12	118.96	+20%
	nci	55.70	69.41	+25%
	mr	92.76	115.98	+25%
	mozilla	101.37	121.30	+19%
	dickens	81.98	109.82	+34%

janaknat · 2020-09-28T17:25:09Z

@vkrasnov Any feedback on this series?

vkrasnov · 2020-09-28T18:20:46Z

This is perfect!

janaknat · 2020-09-28T19:36:25Z

@vkrasnov Thanks for merging the PR!

RJVB · 2020-09-29T15:05:18Z

So I built and compared #82035d0687c8d34981c93589f35982149fb34591 with #836eb111a5c5df7db3f2469a867a6f7c1b2e7bdb by running minigzip on a 305Mb big source tarball (QtBase 5.12.6, uncompressed; no options, reading from stdin and sending the output to /dev/null). To my surprise, compression time increased about 5x in the current release. Please tell me that makes sense because the compression settings were changed?!

janaknat · 2020-09-29T15:27:34Z

@RJVB Can you provide a link to the source tarball? How did you measure the time taken?

Also, the changes made in this PR are for de-compression(inflate). No changes were made to the compression code path.

vkrasnov · 2020-09-29T15:30:21Z

This is very weird, but he is right, in fact inflate is slower too, because somehow the -O3 flags goes missing after running configure.

vkrasnov · 2020-09-29T15:36:23Z

Fixed now in #24

vkrasnov · 2020-09-29T15:37:22Z

Thanks @RJVB !

janaknat · 2020-09-29T15:39:53Z

@vkrasnov @RJVB Apologies for the typo.

RJVB · 2020-09-29T19:21:43Z

Thanks @RJVB !

Well, thank *you* for finding the cause so quickly! I'm still at almost 2x slower compression (and that is in fact compared against a 32-bit build of the older code!) but decompression is indeed significantly faster. I don't know if you can compare the number of instructions between 64 bit and 32 bit builds but according to `perf stat` the older version requires more than 2x as many instructions to decompress the same file (and almost 2x as many cycles).

janaknat · 2020-10-02T18:38:42Z

@RJVB Do you see the same compression slow-down without the changes in this PR?

neurolabusc · 2020-10-02T20:38:46Z

@janaknat I do not see any benefits for compressing Silesia files to/from gzip format. I compared the latest version of this repository (zlibCFX) to a fork from June (zlibCF) on my MacBook with in-built SSD. Same performance regardless of whether I use make or cmake to compile the application. I suspect this is because Gzip uses crc32 not Adler-32.

On the other hand, decompression is faster, but only appreciably if compiled with ./configure --static; make which reports the DINFLATE_CHUNK_SIMD_SSE2 option. @janaknat, since you are familiar with these new features, I would be grateful if you would consider updating cmake to support them.

CompressMethod	Level	Min	Mean	Max	mb/s	%
zlibCF	1	2350	2405	2483	90	35.80
zlibCF	2	2472	2482	2498	86	35.00
zlibCF	3	2682	2736	2802	79	34.37
zlibCF	4	2907	3072	3248	73	33.46
zlibCF	5	3585	4071	4497	59	32.61
zlibCF	6	4460	4860	5542	48	32.35
zlibCF	7	5008	5459	5806	42	32.25
zlibCF	8	7400	7551	7856	29	32.17
zlibCF	9	10099	10678	11507	21	32.15
zlibCFX	1	2524	2584	2678	84	35.80
zlibCFX	2	2542	2624	2761	83	35.00
zlibCFX	3	2772	2899	3000	76	34.37
zlibCFX	4	2891	3141	3443	73	33.46
zlibCFX	5	3688	3758	3928	57	32.61
zlibCFX	6	4475	4678	5103	47	32.35
zlibCFX	7	5193	5403	5980	41	32.25
zlibCFX	8	7498	7671	7996	28	32.17
zlibCFX	9	10093	10274	10558	21	32.15

DecompressMethod	Min	Mean	Max	mb/s
zlibCF	42683	43312	43883	134.07
zlibCFX(cmake)	41457	42038	42833	138.03
zlibCFX(make)	36631	36814	37127	156.22

RJVB · 2020-10-02T20:48:42Z

@RJVB Do you see the same compression slow-down without the changes in this PR?

I'd have to check - I didn't yet because I never thought it was due to this change esp. after learning that the optimisations are only for decompression.

janaknat · 2020-10-05T17:35:26Z

@neurolabusc These are de-compression optimization changes, so not surprising there are any changes in compression performance.

I can take a look at the work that needs to be done for cmake.

…flare#23) It's unnecessary to await a promise returned from an async function (except if, as not needed here, the exception is to be caught in this function instead of by a caller). It's slightly verbose and advised against: https://github.com/eslint/eslint/blob/master/docs/rules/no-return-await.md https://jakearchibald.com/2017/await-vs-return-vs-return-await/ Since this is the only use of await, this function is not actually async. Finally, it should be named with the same casing convention as the previous example.

janaknat added 2 commits September 23, 2020 19:06

Add SIMD SSSE3 implementation of the adler32 checksum

d15eeef

Based on the adler32-simd patch from Noel Gordon for the chromium fork of zlib. 17bbb3d73c84 ("zlib adler_simd.c") Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>

vkrasnov merged commit 82035d0 into cloudflare:gcc.amd64 Sep 28, 2020

neurolabusc mentioned this pull request Nov 5, 2020

Feature request: Update zlib static libraries for faster decompression synopse/mORMot#365

Closed

Conversation

janaknat commented Sep 24, 2020

Uh oh!

janaknat commented Sep 28, 2020

Uh oh!

janaknat commented Sep 28, 2020

Uh oh!

vkrasnov commented Sep 28, 2020

Uh oh!

janaknat commented Sep 28, 2020

Uh oh!

RJVB commented Sep 29, 2020 via email

Uh oh!

janaknat commented Sep 29, 2020

Uh oh!

vkrasnov commented Sep 29, 2020

Uh oh!

vkrasnov commented Sep 29, 2020

Uh oh!

vkrasnov commented Sep 29, 2020

Uh oh!

janaknat commented Sep 29, 2020

Uh oh!

RJVB commented Sep 29, 2020 via email

Uh oh!

janaknat commented Oct 2, 2020

Uh oh!

neurolabusc commented Oct 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RJVB commented Oct 2, 2020 via email

Uh oh!

janaknat commented Oct 5, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

neurolabusc commented Oct 2, 2020 •

edited

Loading