Conditionally shortcut via the chorba polynomial based on compile flags #2020
Dead2 merged 1 commit into zlib-ng:develop from
Conversation
Walkthrough: Adds a preprocessor guard around the large 512+64+16*8 processing loop in arch/x86/crc32_fold_pclmulqdq_tpl.h.
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)
114-115: Guarding Chorba path on COPY + AVX512VL looks correct and targeted

The `#if !defined(COPY) || defined(__AVX512VL__)` around the Chorba loop cleanly limits the optimized "copying" variant to AVX512VL builds, while keeping the non-COPY CRC32 path always enabled. That matches the stated goal: avoid the register-pressure regression for generic COPY builds, but still allow the optimization when compiling with AVX-512VL, and leave the non-copying CRC32 unaffected.

If you want to make this intent more obvious to future readers, consider adding a short comment above the `#if` along the lines of "COPY variant only enabled here when compiled with AVX-512VL to avoid stack-spill regressions on generic builds." This is optional; the logic itself looks sound.

Also applies to: 326-326
116-123: Aligned loads in Chorba loop are consistent with existing alignment invariant

All loads from `src` in the Chorba loop have been switched to `_mm_load_si128` (aligned). Given the earlier `algn_diff` handling and existing use of `_mm_load_si128` in the subsequent 64-byte and tail loops, `src` is already guaranteed to be 16-byte aligned at this point and advanced only by multiples of 16. So this change is consistent with the rest of the function and should remove redundant unaligned load semantics without introducing new UB.

If you'd like to harden this for future refactors, you could optionally add a debug-only invariant just before the Chorba loop, e.g.:

```diff
@@
-    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+#if !defined(NDEBUG)
+    assert((((uintptr_t)src) & 0xF) == 0);
+#endif
```

That would document the 16-byte alignment requirement which these `_mm_load_si128` calls rely on, aligning with how other SSE2/Chorba paths in this codebase depend on caller-enforced alignment. Based on learnings.

Also applies to: 142-145, 164-167, 187-190, 210-213, 233-236, 256-259, 279-282, 301-304
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (10 hunks)
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/lasxintrin_ext.h:38-65
Timestamp: 2025-07-04T16:59:44.725Z
Learning: In zlib-ng, the policy for supporting old compilers is to maintain compatibility for up to two years after GitHub removes support for them in CI, or when there is no long-time contributor left testing for the architecture if the architecture was not supported by GitHub Actions at all.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/lasxintrin_ext.h:38-65
Timestamp: 2025-07-04T16:59:44.725Z
Learning: For new architectures like LoongArch64, inline assembly fallbacks are necessary because compilers don't yet have intrinsic functions for all common operations. This requires maintaining complex inline assembly implementations until the compiler ecosystem matures.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:18:25.966Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-02-21T01:41:10.063Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (1)
crc32_fold_load(273-278)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: EL10 GCC S390X DFLTCC ASAN
- GitHub Check: Ubuntu GCC No CTZ
- GitHub Check: macOS GCC (Intel)
[Benchmark screenshots: Before (generic build arguments), After (generic build arguments), Before (-march=x86-64-v4), After (-march=x86-64-v4)]

We'd expect the last comparison to be about the same.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #2020     +/-   ##
===========================================
- Coverage    82.78%        0   -82.79%
===========================================
  Files          163        0      -163
  Lines        12863        0    -12863
  Branches      3171        0     -3171
===========================================
- Hits         10648        0    -10648
+ Misses        1156        0     -1156
+ Partials     1059        0     -1059

☔ View full report in Codecov by Sentry.
@KungFuJesus Please rebase. I tried doing it through GitHub, but it just made a mess, as I thought it would.
Force-pushed 5be358a to e6fe20a (Compare)
|
Weird, it was a pretty clean rebase. Just pushed.
As it turns out, the copying CRC32 variant is slower when compiled
with generic flags. The reason for this is mainly extra stack spills and
the lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as avx512, we no
longer have to eat all these costly stack spills and we can overlap with
a 3 operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86-64-v4 they get all the
upsides to this.

This code notably is not actually used if you happen to have something
that supports 512 bit wide clmul, so this does help a somewhat narrow
range of targets (most of the earlier avx512 implementations pre-Ice
Lake).

We also must guard with AVX512VL, as just specifying AVX512F makes GCC
generate vpternlogd instructions of 512 bit widths only, so a bunch of
packing and unpacking of 512 bit to 256 bit registers and vice versa has
to occur, absolutely killing runtime. It's only with AVX512VL that there
is a 128 bit wide vpternlogd.
This should be merged after #2019