Use aligned loads in the chorba portions of the clmul crc routines by KungFuJesus · Pull Request #2019 · zlib-ng/zlib-ng

KungFuJesus · 2025-11-21T14:48:14Z

We go to the trouble of aligning the loads, so we may as well tell the compiler that the alignment is guaranteed. We can't guarantee an aligned store, but with an aligned load the compiler can at least fold the load into a subsequent XOR or carry-less multiplication when not copying.
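To illustrate the point, here is a minimal sketch, not the code from this PR. With legacy (non-VEX) SSE encoding, instructions such as `pxor` and `pclmulqdq` require a 16-byte-aligned memory operand, so the compiler can only fold a load into them when it knows the address is aligned. The function names `xor_chunks_unaligned`, `xor_chunks_aligned`, and `variants_agree` are hypothetical and exist only for this example. Both variants compute the same value; only the alignment promise made to the compiler differs.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR `chunks` 16-byte blocks together using unaligned loads.
 * The compiler gets no alignment guarantee from this intrinsic. */
static __m128i xor_chunks_unaligned(const uint8_t *src, size_t chunks) {
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < chunks; i++)
        acc = _mm_xor_si128(acc, _mm_loadu_si128((const __m128i *)src + i));
    return acc;
}

/* Same computation with aligned loads. src MUST be 16-byte aligned;
 * in exchange the compiler may use the load as a direct memory operand. */
static __m128i xor_chunks_aligned(const uint8_t *src, size_t chunks) {
    assert(((uintptr_t)src & 15) == 0);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < chunks; i++)
        acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)src + i));
    return acc;
}

/* Hypothetical check: both variants agree on an aligned 64-byte buffer. */
static int variants_agree(void) {
    __m128i storage[4]; /* __m128i storage is 16-byte aligned */
    uint8_t *buf = (uint8_t *)storage;
    for (size_t i = 0; i < sizeof(storage); i++)
        buf[i] = (uint8_t)(i * 37 + 11);
    __m128i a = xor_chunks_unaligned(buf, 4);
    __m128i b = xor_chunks_aligned(buf, 4);
    return memcmp(&a, &b, sizeof(a)) == 0;
}
```

Whether the load is actually folded depends on the compiler, the optimization level, and the target flags, so the generated assembly is the only reliable confirmation.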

coderabbitai · 2025-11-21T14:50:50Z

Walkthrough

Modified CRC32 folding implementation in x86 SIMD code to replace unaligned loads (_mm_loadu_si128) with aligned loads (_mm_load_si128) in the folding loop and interleaved sections, assuming 16-byte aligned source data. Stores remain unaligned where needed.
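The load/store asymmetry described above can be sketched as follows. This is an illustrative example under stated assumptions rather than the code in `crc32_fold_pclmulqdq_tpl.h`: `copy_chunks` and `copy_to_misaligned_dst_ok` are hypothetical names, and the CRC folding itself is omitted so that only the aligned load and unaligned store remain.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `chunks` 16-byte blocks.
 * src must be 16-byte aligned; dst may have any alignment. */
static void copy_chunks(uint8_t *dst, const uint8_t *src, size_t chunks) {
    assert(((uintptr_t)src & 15) == 0);
    for (size_t i = 0; i < chunks; i++) {
        /* Aligned load: source alignment is known. */
        __m128i v = _mm_load_si128((const __m128i *)src + i);
        /* Unaligned store: destination alignment is not guaranteed. */
        _mm_storeu_si128((__m128i *)(dst + 16 * i), v);
    }
}

/* Hypothetical check: copy into a destination that is forced to be
 * misaligned (address is 1 modulo 16) and compare the bytes. */
static int copy_to_misaligned_dst_ok(void) {
    __m128i src_storage[4]; /* 16-byte aligned source */
    uint8_t dst_storage[64 + 17];
    uint8_t *src = (uint8_t *)src_storage;
    for (size_t i = 0; i < 64; i++)
        src[i] = (uint8_t)(i ^ 0x5a);
    size_t off = ((16 - ((uintptr_t)dst_storage & 15)) & 15) + 1;
    uint8_t *dst = dst_storage + off;
    copy_chunks(dst, src, 4);
    return memcmp(dst, src, 64) == 0;
}
```

Using `_mm_load_si128` on a misaligned source would fault, which is why the aligned variant is only appropriate once the caller has established the alignment.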

Changes

Cohort / File(s)	Summary
CRC32 SIMD load optimization `arch/x86/crc32_fold_pclmulqdq_tpl.h`	Replaced multiple `_mm_loadu_si128()` calls with `_mm_load_si128()` for sequential 16-byte chunk loads in the main folding loop and interleaved Chorba section, while preserving `_mm_storeu_si128()` for stores

Possibly related PRs

Fix an unfortunate bug with Visual Studio 2015 #1862: Modifies x86 SIMD load/broadcast intrinsics to change memory access patterns for 128/256-bit data, addressing similar alignment concerns.

Suggested labels

optimization, Architecture

Suggested reviewers

nmoinvaz
Dead2

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: converting to aligned loads in specific portions of the CRC routines.
Description check	✅ Passed	The description is directly related to the changeset, explaining the motivation and benefit of using aligned loads.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 469cf6d and 6b5aac9.

📒 Files selected for processing (1)

arch/x86/crc32_fold_pclmulqdq_tpl.h (9 hunks)

🧰 Additional context used

🧠 Learnings (14)

📓 Common learnings

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.

📚 Learning: 2024-10-29T02:22:55.489Z

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.

Applied to files:

arch/x86/crc32_fold_pclmulqdq_tpl.h

📚 Learning: 2024-10-29T02:22:52.846Z

Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.

Applied to files:

arch/x86/crc32_fold_pclmulqdq_tpl.h

📚 Learning: 2025-06-10T07:38:03.297Z

Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.

Applied to files:

arch/x86/crc32_fold_pclmulqdq_tpl.h

📚 Learning: 2025-01-23T22:01:53.422Z

Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

arch/x86/crc32_fold_pclmulqdq_tpl.h

🧬 Code graph analysis (1)

arch/x86/crc32_fold_pclmulqdq_tpl.h (3)

arch/x86/chorba_sse41.c (1)

uint32_t (307-335)

arch/x86/crc32_pclmulqdq_tpl.h (4)

partial_fold (228-271)

fold_2 (61-88)

crc32_fold_load (273-278)

fold_3 (90-123)

arch/x86/chorba_sse2.c (2)

uint32_t (23-847)

uint32_t (849-875)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (178)

GitHub Check: macOS GCC UBSAN (ARM64)
GitHub Check: macOS Clang Native Instructions (ARM64)
GitHub Check: Windows ClangCl Win32
GitHub Check: Windows MSVC 2022 v142 Win64
GitHub Check: Windows MSVC 2022 v142 Win32
GitHub Check: Windows MSVC 2022 v140 Win64
GitHub Check: Windows MSVC 2022 v143 Win32
GitHub Check: Windows MSVC 2022 v141 Win64
GitHub Check: Windows MSVC 2022 v140 Win32
GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
GitHub Check: Windows MSVC 2022 v143 Win64
GitHub Check: Ubuntu MinGW i686
GitHub Check: EL10 GCC S390X DFLTCC ASAN
GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
GitHub Check: Ubuntu GCC AARCH64 ASAN
GitHub Check: Ubuntu GCC -O3 OSB
GitHub Check: Windows MSVC 2022 v141 Win32

🔇 Additional comments (1)

arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

115-122: Aligned loads in Chorba folding loop look correct given src alignment invariants

Switching these Chorba loads from _mm_loadu_si128 to _mm_load_si128 is sound: by the time this loop is reached, src has been aligned to 16 bytes via the algn_diff prologue, and all subsequent adjustments (including the VPCLMUL path and the Chorba loop itself) move src only in 16‑byte increments, so each (__m128i *)src + k remains 16‑byte aligned. This matches the Chorba callers’ contract that they work on 16‑byte–aligned buffers, and it should help the compiler reason about alignment and optimize away some load/XOR/mul sequences in the non‑COPY case. The unaligned stores in the COPY path remain appropriate since dst alignment is not guaranteed. Based on learnings
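The alignment invariant described in this comment can be sketched as below. This is a simplified illustration, not the actual prologue from the file: `algn_diff` is the name used in the comment, while `bytes_to_align16` and `steps_stay_aligned` are hypothetical helpers introduced here. The idea is that once a prologue of fewer than 16 bytes has been consumed, every further 16-byte step lands on an aligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to consume so that src becomes
 * 16-byte aligned (0 if it already is). */
static size_t bytes_to_align16(const void *src) {
    return (size_t)((16 - ((uintptr_t)src & 15)) & 15);
}

/* Returns 1 if, after the alignment prologue, every 16-byte step through
 * the buffer starts at a 16-byte-aligned address. */
static int steps_stay_aligned(const uint8_t *src, size_t len) {
    size_t algn_diff = bytes_to_align16(src);
    if (algn_diff > len)
        return 1; /* buffer too short: no aligned loads would be issued */
    src += algn_diff; /* prologue handles these bytes separately */
    len -= algn_diff;
    for (; len >= 16; src += 16, len -= 16) {
        if (((uintptr_t)src & 15) != 0)
            return 0;
    }
    return 1;
}
```

Because the pointer only ever advances in multiples of 16 after the prologue, the invariant holds for any starting offset, which is what makes the switch to `_mm_load_si128` safe in the loops the comment lists.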

Also applies to: 141-144, 163-166, 186-189, 209-212, 232-235, 255-258, 278-281, 300-303

KungFuJesus · 2025-11-21T14:58:00Z

I did confirm that this is in fact happening on the non-copying variant in several places:

   0.85 │       pclmulqdq $0x10, 0xaf6e(%rip), %xmm15  # 0x2b540
   0.12 │       movdqa    %xmm14,%xmm9
   0.45 │       pclmulqdq $0x1, 0xaf5f(%rip), %xmm3  # 0x2b540
   0.36 │       movdqa    (%rsp),%xmm4
           │       pclmulqdq $0x10, 0xaf4f(%rip), %xmm8  # 0x2b540
   0.76 │       xorps     %xmm15,%xmm3
   0.45 │       pclmulqdq $0x1, 0xaf41(%rip), %xmm2  # 0x2b540
   0.03 │       movdqa    0x90(%rsi),%xmm15
   0.82 │       xorps     %xmm8,%xmm2
   0.24 │       pxor      %xmm6, %xmm3
   0.03 │       movdqa    0x80(%rsi),%xmm8
   0.64 │       pclmulqdq $0x10, 0xaf1c(%rip), %xmm14  # 0x2b540
   0.58 │       pclmulqdq $0x1, 0xaf11(%rip), %xmm9  # 0x2b540
   0.03 │       pxor      %xmm0, %xmm15
   0.67 │       pxor      %xmm11, %xmm4
   0.15 │       subq      $0x280,%r10
   0.09 │       xorps     %xmm14,%xmm9
   0.03 │       pxor      %xmm3, %xmm15
   0.55 │       movdqa    %xmm10,%xmm14
   0.36 │       movdqa    (%rsp),%xmm3
   0.03 │       pxor      %xmm7, %xmm8
   0.12 │       pxor      0xa0(%rsi), %xmm3
   1.03 │       pclmulqdq $0x10, 0xaed5(%rip), %xmm10  # 0x2b540
           │       addq      $0x280,%rsi
   0.55 │       pclmulqdq $0x1, 0xaec3(%rip), %xmm14  # 0x2b540
   0.33 │       pxor      %xmm2, %xmm8

codecov · 2025-11-21T14:58:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.24%. Comparing base (469cf6d) to head (6b5aac9).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2019      +/-   ##
===========================================
- Coverage    82.23%   81.24%   -1.00%     
===========================================
  Files          163      163              
  Lines        12863    12863              
  Branches      3171     3171              
===========================================
- Hits         10578    10450     -128     
- Misses        1243     1372     +129     
+ Partials      1042     1041       -1

KungFuJesus force-pushed the aligned_loads_crc branch from 324ff19 to 6b5aac9 Compare November 21, 2025 14:49

KungFuJesus mentioned this pull request Nov 21, 2025

Conditionally shortcut via the chorba polynomial based on compile flags #2020

Merged

Dead2 approved these changes Nov 21, 2025

View reviewed changes

nmoinvaz approved these changes Nov 21, 2025

View reviewed changes

Dead2 merged commit f6e28fb into zlib-ng:develop Nov 22, 2025
155 of 156 checks passed

Dead2 mentioned this pull request Nov 25, 2025

2.3.1 Release #2021

Merged

coderabbitai bot mentioned this pull request Dec 25, 2025

Refactor crc32_fold functions into single crc32_copy #2048

Merged

coderabbitai bot mentioned this pull request Jan 2, 2026

Minor improvements to crc32_(v)pclmulqdq. #2060

Merged
