Conditionally shortcut via the chorba polynomial based on compile flags#2020

Merged
Dead2 merged 1 commit into zlib-ng:develop from KungFuJesus:conditional_chorba
Nov 22, 2025

Conversation

@KungFuJesus
Collaborator

@KungFuJesus KungFuJesus commented Nov 21, 2025

As it turns out, the copying CRC32 variant is slower when compiled
with generic flags. The reason is mainly extra stack spills and the
lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as AVX512, we no
longer have to eat all those costly stack spills and we can overlap with
a 3-operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86-64-v4, it gets all the
upsides of this.

Notably, this code is not actually used if you happen to have hardware
that supports 512-bit-wide CLMUL, so this helps a somewhat narrow
range of targets (mostly the earlier AVX512 implementations, pre-Ice
Lake).

We also must guard on AVX512VL, as specifying only AVX512F makes GCC
generate vpternlog instructions at 512-bit widths only, so a bunch of
packing and unpacking between 512-bit and 256-bit registers has to
occur, absolutely killing runtime. Only with AVX512VL is there a
128-bit-wide vpternlog.

This should be merged after #2019

@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

Walkthrough

Adds a preprocessor guard around the large 512+64+16*8 processing loop in arch/x86/crc32_fold_pclmulqdq_tpl.h, so that the block is compiled only when COPY is not defined or __AVX512VL__ is available. No public API changes.

Changes

Cohort / File(s) Summary
CRC32 folding guard
arch/x86/crc32_fold_pclmulqdq_tpl.h
Wraps the large 512+64+16*8 processing loop with `#if !defined(COPY) || defined(__AVX512VL__)`, so it is skipped for the copying variant on builds without AVX512VL.


Suggested labels

Architecture, optimization

Suggested reviewers

  • Dead2
  • nmoinvaz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title directly relates to the main change: conditionally guarding a CRC32 processing loop based on compile flags (COPY not defined or AVX512VL available).
Description check ✅ Passed The description is directly related to the changeset, providing detailed rationale for why the conditional guard is needed based on register availability and compiler optimization behavior.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99d2915 and e6fe20a.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (73)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2022 v140 Win32
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
🔇 Additional comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

114-326: Conditional compilation guard correctly implements the optimization strategy.

The guard appropriately restricts the chorba polynomial optimization for the COPY variant to builds with AVX512VL support, addressing the stack spill and register pressure issues on generic builds while enabling the optimization when sufficient registers are available. The fallback to the 64-byte loop at line 328 ensures correct behavior in all configurations, and the benchmark results confirm the expected performance improvements.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

114-115: Guarding Chorba path on COPY + AVX512VL looks correct and targeted

The #if !defined(COPY) || defined(__AVX512VL__) around the Chorba loop cleanly limits the optimized “copying” variant to AVX512VL builds, while keeping the non‑COPY CRC32 path always enabled. That matches the stated goal: avoid the register‑pressure regression for generic COPY builds, but still allow the optimization when compiling with AVX‑512VL, and leave the non‑copying CRC32 unaffected.

If you want to make this intent more obvious to future readers, consider adding a short comment above the #if along the lines of “COPY variant only enabled here when compiled with AVX‑512VL to avoid stack‑spill regressions on generic builds.” This is optional; the logic itself looks sound.

Also applies to: 326-326


116-123: Aligned loads in Chorba loop are consistent with existing alignment invariant

All loads from src in the Chorba loop have been switched to _mm_load_si128 (aligned). Given the earlier algn_diff handling and existing use of _mm_load_si128 in the subsequent 64‑byte and tail loops, src is already guaranteed to be 16‑byte aligned at this point and advanced only by multiples of 16. So this change is consistent with the rest of the function and should remove redundant unaligned load semantics without introducing new UB.

If you’d like to harden this for future refactors, you could optionally add a debug‑only invariant just before the Chorba loop, e.g.:

@@
-    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+#if !defined(NDEBUG)
+    assert((((uintptr_t)src) & 0xF) == 0);
+#endif

That would document the 16‑byte alignment requirement which these _mm_load_si128 calls rely on, aligning with how other SSE2/Chorba paths in this codebase depend on caller‑enforced alignment. Based on learnings.

Also applies to: 142-145, 164-167, 187-190, 210-213, 233-236, 256-259, 279-282, 301-304

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 469cf6d and 99d2915.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (1)
  • crc32_fold_load (273-278)
⏰ Context from checks skipped due to timeout of 90000ms. (30)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC No CTZ
  • GitHub Check: macOS GCC (Intel)

@KungFuJesus
Collaborator Author

Before (generic build arguments)

2025-11-21T10:13:34-05:00
Running ./benchmark_zlib
Run on (28 X 3243.97 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x14)
  L1 Instruction 32 KiB (x14)
  L2 Unified 1024 KiB (x14)
  L3 Unified 19712 KiB (x1)
Load Average: 0.31, 0.40, 0.36
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/pclmulqdq/32768          1254 ns         1254 ns       564282
crc32_fc/pclmulqdq/32768       1592 ns         1592 ns       441056

After (generic build arguments)

2025-11-21T10:14:34-05:00
Running ./benchmark_zlib
Run on (28 X 4099.51 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x14)
  L1 Instruction 32 KiB (x14)
  L2 Unified 1024 KiB (x14)
  L3 Unified 19712 KiB (x1)
Load Average: 0.43, 0.43, 0.37
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/pclmulqdq/32768          1267 ns         1267 ns       551274
crc32_fc/pclmulqdq/32768       1407 ns         1407 ns       489659

Before (-march=x86-64-v4)

2025-11-21T10:16:11-05:00
Running ./benchmark_zlib
Run on (28 X 4100.07 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x14)
  L1 Instruction 32 KiB (x14)
  L2 Unified 1024 KiB (x14)
  L3 Unified 19712 KiB (x1)
Load Average: 0.20, 0.36, 0.35
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/pclmulqdq/32768          1196 ns         1195 ns       585771
crc32_fc/pclmulqdq/32768       1276 ns         1276 ns       549482

After (-march=x86-64-v4)

2025-11-21T10:16:54-05:00
Running ./benchmark_zlib
Run on (28 X 4100.01 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x14)
  L1 Instruction 32 KiB (x14)
  L2 Unified 1024 KiB (x14)
  L3 Unified 19712 KiB (x1)
Load Average: 0.22, 0.35, 0.35
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/pclmulqdq/32768          1195 ns         1195 ns       585751
crc32_fc/pclmulqdq/32768       1255 ns         1254 ns       558030

We'd expect the last comparison to be about the same.

@codecov

codecov bot commented Nov 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (f6e28fb) to head (e6fe20a).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #2020       +/-   ##
===========================================
- Coverage    82.78%       0   -82.79%     
===========================================
  Files          163       0      -163     
  Lines        12863       0    -12863     
  Branches      3171       0     -3171     
===========================================
- Hits         10648       0    -10648     
+ Misses        1156       0     -1156     
+ Partials      1059       0     -1059     

☔ View full report in Codecov by Sentry.

@Dead2
Member

Dead2 commented Nov 22, 2025

@KungFuJesus Please rebase. I tried doing it through GitHub, but it just made a mess, as I thought it would...

@KungFuJesus
Collaborator Author

Weird, it was a pretty clean rebase. Just pushed.

@Dead2 Dead2 merged commit fe17958 into zlib-ng:develop Nov 22, 2025
154 checks passed
@Dead2 Dead2 mentioned this pull request Nov 25, 2025