Conditionally shortcut via the chorba polynomial based on compile flags #2020
Dead2 merged 1 commit into zlib-ng:develop from
Conversation
Walkthrough: Adds a preprocessor guard around the large 512+64+16*8 processing loop in arch/x86/crc32_fold_pclmulqdq_tpl.h.
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)
114-115: Guarding Chorba path on COPY + AVX512VL looks correct and targeted

The `#if !defined(COPY) || defined(__AVX512VL__)` around the Chorba loop cleanly limits the optimized "copying" variant to AVX512VL builds, while keeping the non-COPY CRC32 path always enabled. That matches the stated goal: avoid the register-pressure regression for generic COPY builds, but still allow the optimization when compiling with AVX-512VL, and leave the non-copying CRC32 unaffected.

If you want to make this intent more obvious to future readers, consider adding a short comment above the `#if` along the lines of "COPY variant only enabled here when compiled with AVX-512VL to avoid stack-spill regressions on generic builds." This is optional; the logic itself looks sound.

Also applies to: 326-326
116-123: Aligned loads in Chorba loop are consistent with existing alignment invariant

All loads from `src` in the Chorba loop have been switched to `_mm_load_si128` (aligned). Given the earlier `algn_diff` handling and existing use of `_mm_load_si128` in the subsequent 64-byte and tail loops, `src` is already guaranteed to be 16-byte aligned at this point and advanced only by multiples of 16. So this change is consistent with the rest of the function and should remove redundant unaligned load semantics without introducing new UB.

If you'd like to harden this for future refactors, you could optionally add a debug-only invariant just before the Chorba loop, e.g.:

```diff
@@
-    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+    /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
+#if !defined(NDEBUG)
+    assert((((uintptr_t)src) & 0xF) == 0);
+#endif
```

That would document the 16-byte alignment requirement which these `_mm_load_si128` calls rely on, aligning with how other SSE2/Chorba paths in this codebase depend on caller-enforced alignment. Based on learnings.

Also applies to: 142-145, 164-167, 187-190, 210-213, 233-236, 256-259, 279-282, 301-304
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (10 hunks)
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/lasxintrin_ext.h:38-65
Timestamp: 2025-07-04T16:59:44.725Z
Learning: In zlib-ng, the policy for supporting old compilers is to maintain compatibility for up to two years after GitHub removes support for them in CI, or when there is no long-time contributor left testing for the architecture if the architecture was not supported by GitHub Actions at all.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/lasxintrin_ext.h:38-65
Timestamp: 2025-07-04T16:59:44.725Z
Learning: For new architectures like LoongArch64, inline assembly fallbacks are necessary because compilers don't yet have intrinsic functions for all common operations. This requires maintaining complex inline assembly implementations until the compiler ecosystem matures.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:18:25.966Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-02-21T01:41:10.063Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (1)
crc32_fold_load(273-278)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: EL10 GCC S390X DFLTCC ASAN
- GitHub Check: Ubuntu GCC No CTZ
- GitHub Check: macOS GCC (Intel)
[Benchmark screenshots: Before (generic build arguments), After (generic build arguments), Before (-march=x86-64-v4), After (-march=x86-64-v4)]

We'd expect the last comparison to be about the same.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop    #2020     +/-   ##
===========================================
- Coverage    82.78%        0   -82.79%
===========================================
  Files          163        0      -163
  Lines        12863        0    -12863
  Branches      3171        0     -3171
===========================================
- Hits         10648        0    -10648
+ Misses        1156        0     -1156
+ Partials     1059        0     -1059

☔ View full report in Codecov by Sentry.
@KungFuJesus Please rebase. I tried doing it through GitHub, but it just made a mess, as I thought it would.
Force-pushed 5be358a to e6fe20a (Compare)
|
Weird, it was a pretty clean rebase. Just pushed.
As it turns out, the copying CRC32 variant is slower when compiled
with generic flags. The reason for this is mainly extra stack spills and
the lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as avx512, we no
longer have to eat all these costly stack spills and we can overlap with
a 3 operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86-64-v4 they get all the
upsides to this.

This code notably is not actually used if you happen to have something
that supports 512 bit wide clmul, so this does help a somewhat narrow
range of targets (most of the earlier avx512 implementations pre-Ice
Lake).

We also must guard with AVX512VL, as just specifying AVX512F makes GCC
generate vpternlogd instructions of 512 bit widths only, so a bunch of
packing and unpacking of 512 bit to 256 bit registers and vice versa has
to occur, absolutely killing runtime. It's only with AVX512VL that there
is a 128 bit wide vpternlogd.
This should be merged after #2019