on x86-64 architecture. It speeds up performance by about 7% on my Linux box with a Core i7. The original loop is legal to vectorize; gcc 4.7.* and 4.8.* somehow fail to catch this case. There is still room to squeeze performance out of the vectorized code. However, since these loops now account for about 1.5% of execution time, it is not worthwhile to squeeze out more via hand-written assembly. The original loops are guarded with "#ifdef NOT_TWEAK_COMPILER". By default, the modified version is picked up unless the code is compiled explicitly with -DNOT_TWEAK_COMPILER.
…the speedup arises from:
o. Remove the conditional branch in the loop
o. Remove some indirect memory accesses:
The memory accesses to "s->prev_length" and "s->strstart" cannot be
promoted to registers because the compiler is not able to disambiguate
them from the store operation in INSERT_STRING().
o. Convert a non-countable loop into a countable loop.
I'm not sure how much this change contributes; in general, a countable
loop is much easier to optimize than a non-countable one.
See the header of "contrib/amd64/longest-match.inc" for details on the
conditional directives for target architectures.
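The branch-removal and countable-loop points above can be illustrated with a generic sketch. This is hypothetical code, not the actual longest_match() from this patch: a scan loop whose trip count depends on the data defeats GCC's vectorizer, while a fixed-trip-count rewrite does not.

```c
#include <assert.h>
#include <stddef.h>

/* Non-countable: the trip count depends on the data, so the
   vectorizer in gcc 4.7/4.8 gives up on this loop. */
static size_t match_len_noncountable(const unsigned char *a,
                                     const unsigned char *b, size_t max) {
    size_t n = 0;
    while (n < max && a[n] == b[n])
        n++;
    return n;
}

/* Countable rewrite: scan all `max` bytes unconditionally (no early
   exit) and keep the first mismatch position as a reduction; the
   comparison can then be if-converted and vectorized. */
static size_t match_len_countable(const unsigned char *a,
                                  const unsigned char *b, size_t max) {
    size_t i, first = max;
    for (i = 0; i < max; i++)
        if (a[i] != b[i] && i < first)
            first = i;
    return first;
}
```

Both return the length of the common prefix; only the second shape is friendly to auto-vectorization.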
…hange sees about 5% speedup for compression at levels 3, 5, and 6. Level 1 sees a significantly higher speedup and level 9 a significantly lower one. Decompression shows about a 20% speedup. The file "contrib/amd64/crc32-pclmul_asm.S" is taken from Linux kernel 3.14 with minor changes. In the near future, this file will be replaced with a C implementation.
(about 0% to 3% depending on the compression level and input). I guess
the speedup likely arises from the following:
1) "s->pending" is now loaded once and stored once. In the original
implementation, it had to be loaded and stored twice because the
compiler isn't able to disambiguate "s->pending" from
"s->pending_buf[]".
2) better code generation:
2.1) no instructions are needed to extract two bytes from a short.
2.2) fewer registers are needed.
2.3) stores to adjacent bytes are merged into a single store, albeit
at the cost of a potentially unaligned access.
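Point 2.3 is the classic byte-store merge. A minimal sketch of the two code shapes (illustrative names, not zlib's actual put_short macro):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise: needs shift/mask instructions and two separate stores. */
static void put_short_bytes(uint8_t *buf, uint16_t w) {
    buf[0] = (uint8_t)(w & 0xff);
    buf[1] = (uint8_t)(w >> 8);
}

/* Merged: the compiler emits a single (possibly unaligned) 16-bit
   store. Assumes a little-endian target, as on x86-64. */
static void put_short_merged(uint8_t *buf, uint16_t w) {
    memcpy(buf, &w, sizeof w);
}
```

On x86-64 the unaligned-store penalty is small, so one wide store wins over two narrow ones.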
It changes the hash and longest-match functions to always match at least 4 bytes, instead of the 3 the spec allows. This allows faster and more effective compression because there are fewer hash collisions, and false matches are discarded faster. In theory it could lead to worse compression in some cases, but for the most common files there is an improvement at all levels, in both speed and compression ratio.
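A sketch of the 4-byte hashing idea (illustrative, not this patch's exact code): hashing four input bytes at once means colliding entries must already share a longer prefix, so false matches are rejected earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HASH_BITS 15  /* zlib's default hash table is 2^15 entries */

/* Multiplicative hash over 4 input bytes (hypothetical helper).
   The unaligned 4-byte load is done via memcpy for portability. */
static uint32_t hash4(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);                       /* load 4 bytes */
    return (v * 2654435761u) >> (32 - HASH_BITS);  /* Knuth multiplier */
}
```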
…ons we do not require (FASTEST, 64K, NOT_TWEAK_COMPILER). Remove contribs we don't require; integrate the hash function and longest_match functions into deflate.c. Improve output buffer performance by using a 64-bit buffer instead of a 16-bit one. All in all, a ~10% performance gain for level 4 and ~5% for level 5.
Refactor trees.c, deflate.c and deflate.h for gcc x86-64.
1) MSVC doesn't have __builtin_expect. Just ignore likely/unlikely and hope the compiler optimizes for the best. 2) MSVC also doesn't have __builtin_ctzl. Implement it using _BitScanForward().
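A sketch of the portability shims this describes (macro and function names are illustrative); the GCC branch is included so the sketch compiles on both toolchains:

```c
#include <assert.h>

#ifdef _MSC_VER
#  include <intrin.h>
/* 1) No __builtin_expect: define the hint macros away. */
#  define likely(x)   (x)
#  define unlikely(x) (x)
/* 2) No __builtin_ctzl: count trailing zeros via _BitScanForward.
   x must be nonzero, same precondition as __builtin_ctzl. */
static unsigned long ctzl(unsigned long x) {
    unsigned long index;
    _BitScanForward(&index, x);
    return index;
}
#else
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
static unsigned long ctzl(unsigned long x) {
    return (unsigned long)__builtin_ctzl(x);
}
#endif
```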
- MSVC (2010) only supports C89 for C files, so all variable declarations need to be moved to the top of blocks.
- MSVC also doesn't support GCC's typeof. There's no real need for it in deflate.c anyway, so I removed it.
Changes to allow compilation using MSVC 2010
* Add aarch64 support, with over 2x faster compression than generic zlib
* Support for OS X
* Support for clang
* Better configuration (no longer need to pass CFLAGS)
Support for aarch64 with crc extension
* faster crc32 of last <8 bytes on aarch64
On FreeBSD amd64 is named... 'amd64', not 'x86_64'. This is what 'uname -m' prints.
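A configure-style sketch of the normalization this implies (hypothetical helper, not the project's actual configure code):

```shell
# Map the kernel's machine name to a canonical architecture name.
# FreeBSD's `uname -m` reports amd64; Linux reports x86_64 for the same CPU.
arch_name() {
  case "$1" in
    amd64|x86_64) echo x86_64 ;;
    *)            echo "$1"   ;;
  esac
}

arch_name "$(uname -m)"
```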
* .type and .size are ELF/COFF-specific, so drop them
* the .globl + .hidden equivalent for macOS is .private_extern
* symbol names are not automatically prefixed in assembly on macOS, so we need to add the _ prefix ourselves
* PS-237 Don't disable target optimizations when cross-compiling. AFAICT there's no reason to check the host platform here; the test only runs the compiler, it doesn't execute the output.
* Add misc intermediate files to .gitignore.
configure mostly generates this file by copying from zconf.h.in.
* Replace GPL CRC with BSD CRC (zlib-ng/zlib-ng#42), for validation see https://github.com/neurolabusc/simd_crc
* Westmere detection
* Update configure for Westmere (InsightSoftwareConsortium/ITK#416)
* Use cpu_has_pclmul() to autodetect CPU hardware (InsightSoftwareConsortium/ITK#416)
* Remove GPL code
* Improve support for compiling on Windows (https://github.com/ningfei/zlib)
* Import ucm.cmake from https://github.com/ningfei/zlib
* crc32_simd as separate file (#18)
* Atomic and SKIP_CPUID_CHECK (#18)
* Remove unused code
* Only allow the compiler to use CLMUL instructions in the crc_simd unit (intel/zlib#25)
* Atomic does not compile on Ubuntu 14.04
* Allow "cmake -DSKIP_CPUID_CHECK=ON .." to skip the SIMD CRC support check at execution time
* Restore Windows compilation using "nmake -f win32\Makefile.msc"
* zconf.h required for Windows nmake
* Support CRC intrinsics for ARM CPUs
* Requested changes (#19)
Add SIMD NEON implementation of the adler32 checksum.
Inflate speed increased by ~10% with the Silesia corpus on ARMv8. Tested with
a modified zpipe.c running inflate and deflate on a 100M stream.
Based on the adler32-simd patch from Noel Gordon for the chromium fork of zlib.
17bbb3d73c84 ("zlib adler_simd.c")
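For reference, this is the checksum being vectorized; a minimal scalar Adler-32 in C (the NEON version computes the same two running sums lane-wise):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16 */

/* Scalar Adler-32: sum `a` of all bytes plus 1, and sum `b` of the
   running `a` values, packed as (b << 16) | a. Production code defers
   the expensive % by batching, which is what makes SIMD worthwhile. */
static uint32_t adler32_scalar(const unsigned char *buf, size_t len) {
    uint32_t a = 1, b = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```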
…e) (#22)
* When windowBits is zero, the size of the sliding window comes from the zlib header. The allowed values of the four-bit field are 0..7, but when windowBits is zero, values greater than 7 are permitted and acted upon, resulting in large, mostly unused memory allocations. This fix rejects such invalid zlib headers.
* Add an option to not compute or check check values. The undocumented (except in these commit comments) function inflateValidate(strm, check) can be called after inflateInit(), inflateInit2(), or inflateReset2() with check equal to zero to turn off the check-value (CRC-32 or Adler-32) computation and comparison. Calling it with check not equal to zero turns checking back on. It should only be called immediately after the init or reset function. inflateReset() does not change the state, so a previous inflateValidate() setting remains in effect. This also turns off validation of the gzip header CRC when present. It should only be used when a zlib or gzip stream has already been checked, and repeated decompressions of the same stream no longer need to be validated.
* Verify that the state has been initialized, that it is the expected type of state (deflate or inflate), and that at least the first several bytes of the internal state have not been clobbered.
* Use macros to represent magic numbers. This combines two patches which improve the readability and maintainability of the code by turning magic numbers into #defines. Based on Chris Blume's (cblume@chromium) patches for zlib chromium: 8888511 - "Zlib: Use defines for inffast", b9c1566 - "Share inffast names in zlib". These patches are needed when introducing the chunked SIMD NEON enhancements.
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
* Port inflate chunk SIMD NEON patches for cloudflare. Based on 2 patches from the zlib chromium fork:
  * Adenilson Cavalcanti (adenilson.cavalcanti@arm.com) 3060dcb - "zlib: inflate using wider loads and stores"
  * Noel Gordon (noel@chromium.org) 64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"
  The two patches combined provide around a 5-25% increase in inflate performance, depending on the workload, when checked with a modified zpipe.c and the Silesia corpus.
  Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
* Increase inflate speed: read decode input into a uint64_t. Update the chunk-copy code with a wide input data reader, which consumes input in 64-bit (8-byte) chunks, and update inflate_fast_chunk_() to use the wide reader. Based on Noel Gordon's (noel@chromium.org) patch for the zlib chromium fork: 8a8edc1 - "Increase inflate speed: read decoder input into a uint64_t". This patch provides a 7-10% inflate performance improvement when tested with a modified zpipe.c and the Silesia corpus.
  Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
Co-authored-by: Mark Adler <madler@alumni.caltech.edu>
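The wide-reader idea can be sketched as follows (names are illustrative, not the fork's actual inflate state): load 8 input bytes per refill instead of one byte per bit request.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const unsigned char *next;  /* input cursor */
    uint64_t hold;              /* bit accumulator */
    unsigned bits;              /* valid bits in hold */
} bitreader;

/* Top up the accumulator to at least 56 valid bits with one 8-byte
   load (memcpy keeps it alignment-safe). Assumes a little-endian
   target and at least 8 readable bytes; real code bounds-checks. */
static void refill(bitreader *br) {
    uint64_t chunk;
    memcpy(&chunk, br->next, sizeof chunk);
    br->hold |= chunk << br->bits;
    br->next += (63 - br->bits) >> 3;  /* bytes actually consumed */
    br->bits |= 56;                    /* now 56..63 bits are valid */
}
```

One refill then feeds several Huffman decode steps before the next load, which is where the speedup comes from.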
* Add SIMD SSSE3 implementation of the adler32 checksum
Based on the adler32-simd patch from Noel Gordon for the chromium fork of zlib.
17bbb3d73c84 ("zlib adler_simd.c")
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
* Port inflate chunk SIMD SSE2 improvements for cloudflare
Based on 2 patches from zlib chromium fork:
* Adenilson Cavalcanti (adenilson.cavalcanti@arm.com)
3060dcb - "zlib: inflate using wider loads and stores"
* Noel Gordon (noel@chromium.org)
64ffef0 - "Improve zlib inflate speed by using SSE2 chunk copy"
The improvement in inflate performance is around 15-35%, based
on the workload, when checked with a modified zpipe.c and the
Silesia corpus.
Signed-off-by: Janakarajan Natarajan <janakan@amazon.com>
* Fix architecture macro on MSVC. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Update '.gitignore'. Also remove 'zconf.h'; it will be generated during the build.
* Add 'build' subdir to .gitignore
* Fix architecture macro on MSVC.
* Update CMakeLists.txt: clean up, add an option to set the runtime (useful for MSVC), add 'SSE4.2' and 'AVX' options. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Define '_LARGEFILE64_SOURCE' by default in CMakeLists.txt. Also remove the fseeko check. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Set 'CMAKE_MACOSX_RPATH' to TRUE. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Update SSE4.2 and PCLMUL settings in CMakeLists.txt. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Add 'HAVE_HIDDEN' to CMakeLists.txt. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Fix building a DLL with MSVC. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Fix zlib.pc file generation. Signed-off-by: Ningfei Li <ningfei.li@gmail.com>
* Refine CMake settings.
* Change CMake functions to lowercase.
* Fix zlib.pc
* Fix crc32_pclmul_le_16 type.
* Set the POSITION_INDEPENDENT_CODE flag to ON by default.
* Replace GPL CRC with BSD CRC (zlib-ng/zlib-ng#42), for validation see https://github.com/neurolabusc/simd_crc
* Westmere detection
* Update configure for Westmere (InsightSoftwareConsortium/ITK#416)
* Use cpu_has_pclmul() to autodetect CPU hardware (InsightSoftwareConsortium/ITK#416)
* Remove GPL code
* Improve support for compiling on Windows (https://github.com/ningfei/zlib)
* Import ucm.cmake from https://github.com/ningfei/zlib
* crc32_simd as separate file (#18)
* Atomic and SKIP_CPUID_CHECK (#18)
* Suppress some MSVC warnings.
* Remove unused code
* Fix ucm_set_runtime when only C is enabled for the project.
* Remove configured header file.
* Unify zconf.h template.
* Only allow the compiler to use CLMUL instructions in the crc_simd unit (intel/zlib#25)
* Atomic does not compile on Ubuntu 14.04
* Allow "cmake -DSKIP_CPUID_CHECK=ON .." to skip the SIMD CRC support check at execution time
* Restore Windows compilation using "nmake -f win32\Makefile.msc"
* Do not set SOVERSION on Cygwin.
* Fix file permissions.
* Refine the PCLMUL CMake option.
* zconf.h required for Windows nmake
* Do not set the visibility flag on Cygwin.
* Fix compiling the DLL resource.
* Fix compiling with MinGW. SSE4.2 and PCLMUL are also supported.
* Fix zconf.h for Windows.
* Support CRC intrinsics for ARM CPUs
* Add gzip -k option to minigzip (to aid benchmarks)
* Support Intel and ARMv8 optimizations in CMake.
* Clean up.
* Fix compiling on Apple M1. check_c_compiler_flag detects SSE2, SSE3, SSE4.2 and PCLMUL but compiling fails. Workaround fix; needs further investigation.
* Restore MSVC warnings.
Co-authored-by: neurolabusc <rorden@sc.edu>
Co-authored-by: Chris Rorden <rorden@mailbox.sc.edu>
This ports a21a4e8 - "Handling undefined behavior in inffast_chunk" from the Chromium fork of zlib.
Port of the patch for CVE-2018-25032 from madler@5c44459 This bug was reported by Danilo Ramos of Eideticom, Inc. It has lain in wait 13 years before being found! The bug was introduced in zlib 1.2.2.2, with the addition of the Z_FIXED option. That option forces the use of fixed Huffman codes. For rare inputs with a large number of distant matches, the pending buffer into which the compressed data is written can overwrite the distance symbol table which it overlays. That results in corrupted output due to invalid distances, and can result in out-of-bound accesses, crashing the application. The fix here combines the distance buffer and literal/length buffers into a single symbol buffer. Now three bytes of pending buffer space are opened up for each literal or length/distance pair consumed, instead of the previous two bytes. This assures that the pending buffer cannot overwrite the symbol table, since the maximum fixed code compressed length/distance is 31 bits, and since there are four bytes of pending space for every three bytes of symbol space. Co-authored-by: Mark Adler <madler@alumni.caltech.edu>
A windowBits value of 0, 16, or 32 gets the window bits from the zlib header. However there is no zlib header for 16, or for 32 when the input is gzip. This commit sets the window bits for inflate to 15 if a gzip stream is detected and windowBits was 16 or 32. Co-authored-by: Mark Adler <madler@alumni.caltech.edu>
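The described behavior reduces to a small rule; a sketch (hypothetical helper, not the actual inflate code):

```c
#include <assert.h>

/* Per the commit message: windowBits of 0, 16, or 32 means "take the
   window size from the zlib header", but gzip streams carry no window
   size, so once a gzip stream is detected with windowBits 16 or 32
   the window is pinned to the 15-bit maximum. */
static int effective_window_bits(int windowBits, int is_gzip) {
    if (is_gzip && (windowBits == 16 || windowBits == 32))
        return 15;
    return windowBits;
}
```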
* Fix a bug when getting a gzip header extra field with inflate(). If the extra field was larger than the space the user provided with inflateGetHeader(), and if multiple calls of inflate() delivered the extra header data, then there could be a buffer overflow of the provided space. This commit assures that provided space is not exceeded. * Fix extra field processing bug that dereferences NULL state->head. The recent commit to fix a gzip header extra field processing bug introduced the new bug fixed here. Co-authored-by: Mark Adler <fork@madler.net>
Port the change for Chromium issue 1302606 which avoids writing padding bytes for debug purposes except for debug builds. In particular, this ensures that the JDK `InflaterInputStream::read(byte[] b, int off, int len)` contract can be satisfied by this implementation. See: - https://bugs.chromium.org/p/chromium/issues/detail?id=1302606 - https://chromium.googlesource.com/chromium/src/+/36be3c7a465f00f4edefccc10bba00bbebe1843d%5E%21/ This is complementary to the work Volker Simonis is doing upstream on making `InflaterInputStream` more tolerant of this kind of behavior, providing compatibility for earlier JDK releases. - https://bugs.openjdk.org/browse/JDK-8281962 - https://bugs.openjdk.org/browse/JDK-8283758 Signed-off-by: Danny Thomas <dannyt@netflix.com>
Cherry-picked from: madler@9aaec95
This is the [CVE-2016-9841](https://github.com/advisories/GHSA-5f8r-846v-423w) fix.