Use aligned loads in the chorba portions of the clmul crc routines#2019

Merged
Dead2 merged 1 commit into zlib-ng:develop from KungFuJesus:aligned_loads_crc
Nov 22, 2025

Conversation

@KungFuJesus
Collaborator

We already go to the trouble of aligning the source pointer before these loads, so we may as well tell the compiler that the alignment is guaranteed. We can't guarantee an aligned store, but with an aligned load the compiler can at least elide a separate load by folding it into a subsequent XOR or multiplication when not copying.
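For readers less familiar with the two intrinsics, here is a minimal sketch of the distinction, not the zlib-ng code itself. The helper names `xor_chunks_aligned` and `xor_chunks_unaligned` are made up for illustration. The relevant fact is that with legacy SSE encodings, a memory operand to `pxor` or `pclmulqdq` must be 16-byte aligned, so the compiler can only fold a load into the following instruction when it has been told the address is aligned.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from zlib-ng: XOR together `blocks` 16-byte chunks.
 * `src` must be 16-byte aligned; _mm_load_si128 on a misaligned address is
 * undefined and typically faults. */
static __m128i xor_chunks_aligned(const uint8_t *src, size_t blocks) {
    assert(((uintptr_t)src & 15) == 0);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < blocks; i++) {
        /* Alignment is promised, so the compiler is free to emit
         * `pxor (mem), %xmm` instead of a separate load plus pxor. */
        acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)src + i));
    }
    return acc;
}

/* Same computation with no alignment promise. */
static __m128i xor_chunks_unaligned(const uint8_t *src, size_t blocks) {
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < blocks; i++) {
        acc = _mm_xor_si128(acc, _mm_loadu_si128((const __m128i *)src + i));
    }
    return acc;
}
```

Both functions return the same value for an aligned buffer; whether the load is actually folded depends on the compiler, the optimization level, and whether VEX encodings are enabled.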

@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

Walkthrough

Modified CRC32 folding implementation in x86 SIMD code to replace unaligned loads (_mm_loadu_si128) with aligned loads (_mm_load_si128) in the folding loop and interleaved sections, assuming 16-byte aligned source data. Stores remain unaligned where needed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CRC32 SIMD load optimization: arch/x86/crc32_fold_pclmulqdq_tpl.h | Replaced multiple _mm_loadu_si128() calls with _mm_load_si128() for sequential 16-byte chunk loads in the main folding loop and interleaved Chorba section, while preserving _mm_storeu_si128() for stores |

Suggested labels

optimization, Architecture

Suggested reviewers

  • nmoinvaz
  • Dead2

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: converting to aligned loads in specific portions of the CRC routines. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the motivation and benefit of using aligned loads. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 469cf6d and 6b5aac9.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (9 hunks)
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.

Applied to files:

  • arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.

Applied to files:

  • arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.

Applied to files:

  • arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • arch/x86/crc32_fold_pclmulqdq_tpl.h
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (3)
arch/x86/chorba_sse41.c (1)
  • uint32_t (307-335)
arch/x86/crc32_pclmulqdq_tpl.h (4)
  • partial_fold (228-271)
  • fold_2 (61-88)
  • crc32_fold_load (273-278)
  • fold_3 (90-123)
arch/x86/chorba_sse2.c (2)
  • uint32_t (23-847)
  • uint32_t (849-875)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (178)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows MSVC 2022 v142 Win32
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Windows MSVC 2022 v141 Win64
  • GitHub Check: Windows MSVC 2022 v140 Win32
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC -O3 OSB
🔇 Additional comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

115-122: Aligned loads in Chorba folding loop look correct given src alignment invariants

Switching these Chorba loads from _mm_loadu_si128 to _mm_load_si128 is sound: by the time this loop is reached, src has been aligned to 16 bytes via the algn_diff prologue, and all subsequent adjustments (including the VPCLMUL path and the Chorba loop itself) move src only in 16‑byte increments, so each (__m128i *)src + k remains 16‑byte aligned. This matches the Chorba callers’ contract that they work on 16‑byte–aligned buffers, and it should help the compiler reason about alignment and optimize away some load/XOR/mul sequences in the non‑COPY case. The unaligned stores in the COPY path remain appropriate since dst alignment is not guaranteed. Based on learnings

Also applies to: 141-144, 163-166, 186-189, 209-212, 232-235, 255-258, 278-281, 300-303
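The alignment argument above can be checked with a simplified model. This is a sketch under assumptions, not code from the PR: `align_diff16` and `count_aligned_chunks` are hypothetical names, and the real prologue processes the leading bytes where this model only skips them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to skip so that src + skip is 16-byte aligned (0 if already aligned). */
static size_t align_diff16(const uint8_t *src) {
    return (size_t)((16 - ((uintptr_t)src & 15)) & 15);
}

/* Simplified model of the traversal: skip align_diff16() bytes, then step
 * through whole 16-byte chunks. Returns the chunk count and asserts that
 * every chunk address is 16-byte aligned. */
static size_t count_aligned_chunks(const uint8_t *src, size_t len) {
    size_t skip = align_diff16(src);
    if (skip > len)
        return 0; /* input ends before the first aligned address */
    src += skip;
    len -= skip;
    size_t chunks = 0;
    while (len >= 16) {
        assert(((uintptr_t)src & 15) == 0); /* invariant the aligned loads rely on */
        src += 16;
        len -= 16;
        chunks++;
    }
    return chunks;
}
```

Because the pointer only ever advances by multiples of 16 after the prologue, the assertion holds for any starting offset, which is the property that makes `_mm_load_si128` safe in the loop.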

@KungFuJesus
Collaborator Author

I did confirm that this is in fact happening on the non-copying variant in several places:

   0.85 │       pclmulqdq $0x10, 0xaf6e(%rip), %xmm15  # 0x2b540
   0.12 │       movdqa    %xmm14,%xmm9
   0.45 │       pclmulqdq $0x1, 0xaf5f(%rip), %xmm3  # 0x2b540
   0.36 │       movdqa    (%rsp),%xmm4
        │       pclmulqdq $0x10, 0xaf4f(%rip), %xmm8  # 0x2b540
   0.76 │       xorps     %xmm15,%xmm3
   0.45 │       pclmulqdq $0x1, 0xaf41(%rip), %xmm2  # 0x2b540
   0.03 │       movdqa    0x90(%rsi),%xmm15
   0.82 │       xorps     %xmm8,%xmm2
   0.24 │       pxor      %xmm6, %xmm3
   0.03 │       movdqa    0x80(%rsi),%xmm8
   0.64 │       pclmulqdq $0x10, 0xaf1c(%rip), %xmm14  # 0x2b540
   0.58 │       pclmulqdq $0x1, 0xaf11(%rip), %xmm9  # 0x2b540
   0.03 │       pxor      %xmm0, %xmm15
   0.67 │       pxor      %xmm11, %xmm4
   0.15 │       subq      $0x280,%r10
   0.09 │       xorps     %xmm14,%xmm9
   0.03 │       pxor      %xmm3, %xmm15
   0.55 │       movdqa    %xmm10,%xmm14
   0.36 │       movdqa    (%rsp),%xmm3
   0.03 │       pxor      %xmm7, %xmm8
   0.12 │       pxor      0xa0(%rsi), %xmm3
   1.03 │       pclmulqdq $0x10, 0xaed5(%rip), %xmm10  # 0x2b540
        │       addq      $0x280,%rsi
   0.55 │       pclmulqdq $0x1, 0xaec3(%rip), %xmm14  # 0x2b540
   0.33 │       pxor      %xmm2, %xmm8
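In the listing, `pxor 0xa0(%rsi), %xmm3` is the folded form: the 16 bytes at `0xa0(%rsi)` are XORed straight from memory with no separate `movdqa`. A rough sketch of why this only shows up when not copying follows. `fold_step` is a hypothetical stand-in, and the NULL `dst` check is only for illustration; the real template presumably selects the copying variant at compile time.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one folding step. A NULL `dst` models the
 * non-copying variant. */
static __m128i fold_step(__m128i acc, const uint8_t *src, uint8_t *dst) {
    assert(((uintptr_t)src & 15) == 0);
    __m128i chunk = _mm_load_si128((const __m128i *)src); /* aligned load */
    if (dst != NULL) {
        /* Copying: the data must sit in a register to be stored, and dst
         * alignment is unknown, so the store stays unaligned. */
        _mm_storeu_si128((__m128i *)dst, chunk);
    }
    /* Not copying: chunk is used only here, so the load can become the
     * memory operand of pxor. */
    return _mm_xor_si128(acc, chunk);
}
```

In the copying case the value is needed twice (store and XOR), so a register load is required regardless of alignment knowledge.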

@codecov

codecov bot commented Nov 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.24%. Comparing base (469cf6d) to head (6b5aac9).
⚠️ Report is 1 commit behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2019      +/-   ##
===========================================
- Coverage    82.23%   81.24%   -1.00%     
===========================================
  Files          163      163              
  Lines        12863    12863              
  Branches      3171     3171              
===========================================
- Hits         10578    10450     -128     
- Misses        1243     1372     +129     
+ Partials      1042     1041       -1     

@Dead2 Dead2 merged commit f6e28fb into zlib-ng:develop Nov 22, 2025
155 of 156 checks passed
@Dead2 Dead2 mentioned this pull request Nov 25, 2025