Conversation
3302f0d to
b3d891f
Compare
WalkthroughReplaces internal crc32_c with a new alignment-aware crc32_chorba dispatcher, removes crc32_c from builds and callers, remaps native_crc32 to crc32_chorba or crc32_braid via new WITHOUT_CHORBA / WITHOUT_CHORBA_SSE guards, adds an MSVC SSE guard, and updates SSE gating, templates, functable, tests, and benchmarks. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant native_crc32
participant crc32_chorba
participant crc32_braid_internal
participant crc32_braid
Note right of native_crc32: compile-time mapping
Caller->>native_crc32: call(crc, buf, len)
alt WITHOUT_CHORBA not defined
native_crc32->>crc32_chorba: dispatch(crc, buf, len)
crc32_chorba->>crc32_braid_internal: handle unaligned prefix
crc32_braid_internal-->>crc32_chorba: prefix_crc
alt aligned_len > large_threshold
crc32_chorba->>crc32_braid: large nondestructive Chorba path
else aligned_len in medium_range
crc32_chorba->>crc32_braid: medium/small Chorba path
else
crc32_chorba->>crc32_braid_internal: braid fallback
end
crc32_chorba-->>native_crc32: final_crc
else
native_crc32->>crc32_braid: direct braid path
crc32_braid-->>native_crc32: final_crc
end
native_crc32-->>Caller: result
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (7)
arch/x86/chorba_sse41.c (1)
320-324: Document the two-level gating hierarchy for clarity.The conditional structure enables:
- File-level:
!WITHOUT_CHORBA_SSE- enables SSE41 Chorba variant- Large-path:
!WITHOUT_CHORBA- enables fallback to generic large ChorbaThis allows
WITHOUT_CHORBAto disable just the generic large path while keeping the SSE41 medium/small implementations. Consider adding a brief comment explaining this hierarchy, as the nested conditionals with theelsecontinuation on line 323 may be non-obvious to future maintainers.Consider adding a comment like:
// When WITHOUT_CHORBA is defined, skip the generic large path fallback // but retain SSE41-optimized medium/small paths #if !defined(WITHOUT_CHORBA) if(aligned_len > CHORBA_LARGE_THRESHOLD) { c = crc32_chorba_118960_nondestructive(c, (z_word_t*) aligned_buf, aligned_len); } else #endifarch/x86/crc32_fold_pclmulqdq_tpl.h (1)
108-136: Double-check XOR_INITIAL128 usage once per block.Within the Chorba loop XOR_INITIAL128 is applied to chorba8 only (line 134). That’s fine if and only if the later xmm_t* loads in this loop should not receive XOR_INITIAL128 (unlike the standard 64B loop below). Please confirm the intended single application; otherwise, initial CRC may be mixed incorrectly.
CMakeLists.txt (1)
220-223: Macro naming: unify on WITHOUT_CHORBA_SSE.Headers also define NO_CHORBA_SSE for v142 range. Consider removing NO_CHORBA_SSE and using a single WITHOUT_CHORBA_SSE everywhere to avoid drift. I.e., replace the header-only guard with this definition or map NO_CHORBA_SSE to WITHOUT_CHORBA_SSE centrally.
arch/x86/x86_functions.h (2)
27-31: Keep helper internal; avoid exposing non-API SSE2 helper.chorba_small_nondestructive_sse2 is only used inside arch/x86/chorba_sse2.c. Drop this declaration from the public x86 header or mark it Z_INTERNAL to avoid accidental external use.
- uint32_t chorba_small_nondestructive_sse2(uint32_t c, const uint64_t *aligned_buf, size_t aligned_len); + Z_INTERNAL uint32_t chorba_small_nondestructive_sse2(uint32_t c, const uint64_t *aligned_buf, size_t aligned_len);Or remove it entirely if no other TU needs it.
13-15: Consolidate NO_CHORBA_SSE vs WITHOUT_CHORBA_SSE.The header defines NO_CHORBA_SSE for a specific MSVC range, while other guards use WITHOUT_CHORBA_SSE. Use a single macro (prefer WITHOUT_CHORBA_SSE as set by CMake) and, if needed, shim NO_CHORBA_SSE to it to prevent configuration mismatches.
Also applies to: 39-41, 107-111, 123-126
test/test_crc32.cc (1)
294-296: Gating aligns with build flags; add a large-buffer check for chorba_c.The new chorba_c tests are correctly guarded. Consider also running crc32_large_buf for chorba_c (as you do for others) to catch large‑threshold regressions.
Also applies to: 317-322
arch/generic/crc32_chorba_c.c (1)
1449-1478: Clarify 8‑byte vs 16‑byte alignment choice.algn_diff targets 8B alignment (using a 16‑modulus trick). SSE code aligns to 16B. If 8B is intentional for generic C (to match uint64_t loads), please add a brief comment; otherwise, consider standardizing to the common “align to 16” pattern for consistency. Function’s xor in/out and path selection look correct.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
CMakeLists.txt(1 hunks)Makefile.in(0 hunks)arch/generic/Makefile.in(0 hunks)arch/generic/crc32_c.c(0 hunks)arch/generic/crc32_chorba_c.c(1 hunks)arch/generic/generic_functions.h(2 hunks)arch/riscv/crc32_zbc.c(2 hunks)arch/s390/crc32-vx.c(2 hunks)arch/x86/chorba_sse2.c(2 hunks)arch/x86/chorba_sse41.c(3 hunks)arch/x86/crc32_fold_pclmulqdq_tpl.h(1 hunks)arch/x86/crc32_pclmulqdq_tpl.h(0 hunks)arch/x86/x86_functions.h(4 hunks)crc32.h(0 hunks)functable.c(4 hunks)test/benchmarks/benchmark_crc32.cc(1 hunks)test/test_crc32.cc(2 hunks)
💤 Files with no reviewable changes (5)
- Makefile.in
- arch/x86/crc32_pclmulqdq_tpl.h
- arch/generic/crc32_c.c
- arch/generic/Makefile.in
- crc32.h
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
functable.carch/generic/generic_functions.harch/s390/crc32-vx.carch/x86/x86_functions.hCMakeLists.txtarch/x86/chorba_sse2.carch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/riscv/crc32_zbc.carch/generic/crc32_chorba_c.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
functable.carch/generic/generic_functions.harch/s390/crc32-vx.carch/x86/x86_functions.hCMakeLists.txtarch/x86/chorba_sse2.carch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/riscv/crc32_zbc.carch/generic/crc32_chorba_c.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
functable.carch/generic/generic_functions.harch/s390/crc32-vx.carch/x86/x86_functions.hCMakeLists.txtarch/x86/chorba_sse2.carch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/riscv/crc32_zbc.carch/generic/crc32_chorba_c.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
functable.carch/generic/generic_functions.harch/s390/crc32-vx.carch/x86/x86_functions.hCMakeLists.txtarch/x86/chorba_sse2.carch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/riscv/crc32_zbc.carch/generic/crc32_chorba_c.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Applied to files:
functable.carch/x86/x86_functions.harch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/chorba_sse41.c
📚 Learning: 2025-02-21T01:44:03.996Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Applied to files:
functable.carch/x86/x86_functions.harch/x86/chorba_sse2.carch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/generic/crc32_chorba_c.carch/x86/chorba_sse41.c
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
arch/x86/x86_functions.harch/x86/crc32_fold_pclmulqdq_tpl.htest/test_crc32.ccarch/x86/chorba_sse41.c
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
arch/x86/x86_functions.harch/x86/chorba_sse2.carch/x86/chorba_sse41.c
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:84-86
Timestamp: 2024-10-08T19:37:14.998Z
Learning: When `INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR` is not defined, `state->sane` is completely removed from the codebase and does not need to be initialized.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-09-25T16:25:56.686Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:81-83
Timestamp: 2024-09-25T16:25:56.686Z
Learning: When `INFLATE_STRICT` is not defined, the variable `state->dmax` is completely removed from the code.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2025-02-23T16:50:50.925Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:0-0
Timestamp: 2025-02-23T16:50:50.925Z
Learning: MSVC does not define `__GNUC__`, so adding `!defined(_MSC_VER)` to GCC detection macros is redundant when `__GNUC__` is already being checked.
Applied to files:
CMakeLists.txt
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
CMakeLists.txtarch/riscv/crc32_zbc.carch/x86/chorba_sse41.c
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: chunkset_tpl.h:164-165
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
Applied to files:
CMakeLists.txt
📚 Learning: 2025-02-21T01:37:54.508Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Applied to files:
arch/x86/chorba_sse2.carch/x86/chorba_sse41.c
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Applied to files:
arch/x86/chorba_sse2.carch/generic/crc32_chorba_c.carch/x86/chorba_sse41.c
📚 Learning: 2025-06-09T16:46:28.468Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Applied to files:
arch/x86/chorba_sse2.carch/riscv/crc32_zbc.carch/x86/chorba_sse41.c
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
arch/x86/chorba_sse2.ctest/test_crc32.ccarch/x86/chorba_sse41.c
📚 Learning: 2025-04-15T09:30:10.081Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1904
File: arch/riscv/Makefile.in:12-14
Timestamp: 2025-04-15T09:30:10.081Z
Learning: Feature detection modules like riscv_features.c should not be compiled with feature-specific flags (like RVVFLAG) because they need to be compilable on all systems regardless of feature support. These modules perform runtime detection and must initialize feature availability flags to zero on unsupported systems.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2024-10-07T21:18:37.806Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Applied to files:
arch/x86/chorba_sse41.c
📚 Learning: 2024-10-07T22:00:02.180Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Applied to files:
arch/x86/chorba_sse41.c
🧬 Code graph analysis (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (2)
fold_12(168-208)fold_4(125-166)
arch/generic/crc32_chorba_c.c (4)
arch/riscv/crc32_zbc.c (1)
uint32_t(83-98)arch/x86/chorba_sse2.c (2)
uint32_t(23-847)uint32_t(849-879)arch/x86/chorba_sse41.c (2)
uint32_t(55-305)uint32_t(307-340)arch/generic/crc32_braid_c.c (2)
uint32_t(63-213)uint32_t(215-222)
🔇 Additional comments (12)
arch/riscv/crc32_zbc.c (3)
9-9: LGTM! Include provides necessary declarations.The addition of
arch_functions.his necessary to provide the declaration ofcrc32_braidused in the fallback paths below.
91-91: Consistent with the change at line 86.This change correctly replaces
crc32_cwithcrc32_braidfor handling unaligned trailing bytes, maintaining consistency with the short-length fallback path.
86-86: No changes needed—replacement is correct.The verification confirms that
crc32_braidis the established baseline CRC32 implementation used consistently throughout the codebase as a fallback for short buffers and unaligned data. It's declared inarch/generic/generic_functions.h, implemented with proper pre/post-conditioning inarch/generic/crc32_braid_c.c, and serves as the defaultnative_crc32in the functable. This replacement pattern is identical across multiple architecture-specific implementations (S390, x86, generic), confirming that the change at line 86 aligns with the codebase's established refactoring approach.arch/generic/generic_functions.h (2)
28-34: LGTM - Clean declaration of crc32_chorba with proper gating.The new
crc32_chorbafunction is properly declared under the#ifndef WITHOUT_CHORBAguard, consistent with the PR's goal to make this the new generic CRC implementation that can be selectively disabled.
53-57: LGTM - Correct conditional mapping for native_crc32.The conditional mapping properly selects between
crc32_chorba(when Chorba is enabled) andcrc32_braid(when Chorba is disabled), replacing the previouscrc32_capproach. This aligns with the PR's architectural changes.arch/x86/chorba_sse41.c (3)
1-1: LGTM - Correct feature gating for SSE41 Chorba variant.The change from
WITHOUT_CHORBAtoWITHOUT_CHORBA_SSEproperly separates the SSE-specific Chorba implementations from the generic C fallback, as described in the PR objectives.
10-10: LGTM - Cleaner include structure.Consolidating to
arch_functions.hremoves the need for forward declarations and improves maintainability.
12-15: No actionable issues found; macro implementation is correct.The READ_NEXT macro correctly loads two contiguous
__m128ivalues:afrom byte offsetoffandbfrom offsetoff + 16. The macro uses_mm_load_si128(aligned load), which is appropriate since alignment is enforced by the calling function's loop structure (i += 32ensures proper alignment). This implementation is identical to the working code inchorba_sse2.cand was part of the original SSE4.1 implementation.test/benchmarks/benchmark_crc32.cc (1)
65-73: LGTM - Correct benchmark registration structure.The separation of generic Chorba (
chorba_cunder#ifndef WITHOUT_CHORBA) from SSE-specific variants (chorba_sse2andchorba_sse41underWITHOUT_CHORBA_SSEguards) correctly reflects the new two-tier gating architecture introduced in this PR.arch/s390/crc32-vx.c (1)
205-206: Switch to braid looks correct; pre/post-conditioning preserved.Using crc32_braid for short, prealign, and tail maintains the xor-in/out contract around crc32_le_vgfm_16. No behavioral regression anticipated.
If you have a reproducer for MSVC/Clang s390 builds, please run crc32 known-vectors to double‑check chaining across the three segments (short, prealign+VX, tail).Also applies to: 210-212, 218-221
arch/x86/chorba_sse2.c (1)
854-872: Good: consistent use of crc32_braid_internal with explicit xor in/out.Prealign with braid_internal, then aligned Chorba path(s), and small fallback are all using the same pre/post-conditioning. Alignment math is the common 16-byte form. LGTM.
If you tweak thresholds later, keep SSE2’s 16B load requirement in mind; the wrapper is the right place to enforce it. Based on learnings.functable.c (1)
83-86: Runtime selection order is sane; braid as baseline, Chorba when enabled.Defaults to braid, promotes to generic Chorba when allowed, and then to arch paths (SSE2/SSE41/PCLMUL etc.). Good separation and precedence. Ensure WITHOUT_CHORBA_SSE is consistently defined by CMake across all TUs.
Also applies to: 98-103, 113-117, 118-136, 146-151
8ada950 to
71f36da
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
arch/x86/x86_functions.h (1)
13-15: Critical: Macro name mismatch will break MSVC 2019 workaround.The macro is defined as
NO_CHORBA_SSEon line 14, but the entire codebase usesWITHOUT_CHORBA_SSE(lines 27, 39, 107, 123 in this file, and throughout the PR). This means the MSVC 2019 code-generation bug workaround will never activate, potentially causing segfaults on affected platforms.Apply this diff:
-#define NO_CHORBA_SSE +#define WITHOUT_CHORBA_SSE
🧹 Nitpick comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
108-324: Broaden Chorba loop eligibility to cover 640 B spansEach iteration consumes 128 B of preload plus 512 B of folded payload, so the loop only needs 640 B available. The current guard (
512 + 64 + 16*8) demands 704 B, which skips the optimized path for inputs between 640 B and 703 B and falls back to the slower 64 B loop instead. Dropping the extra 64 B keeps the safety margin but lets those lengths benefit from the fast path.- while (len >= 512 + 64 + 16*8) { + while (len >= 16*8 + 32*16) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
CMakeLists.txt(1 hunks)Makefile.in(0 hunks)arch/generic/Makefile.in(0 hunks)arch/generic/crc32_c.c(0 hunks)arch/generic/crc32_chorba_c.c(3 hunks)arch/generic/generic_functions.h(2 hunks)arch/riscv/crc32_zbc.c(2 hunks)arch/s390/crc32-vx.c(2 hunks)arch/x86/chorba_sse2.c(2 hunks)arch/x86/chorba_sse41.c(3 hunks)arch/x86/crc32_fold_pclmulqdq_tpl.h(1 hunks)arch/x86/crc32_pclmulqdq_tpl.h(0 hunks)arch/x86/x86_functions.h(4 hunks)crc32.h(0 hunks)functable.c(4 hunks)test/benchmarks/benchmark_crc32.cc(1 hunks)test/test_crc32.cc(2 hunks)
💤 Files with no reviewable changes (5)
- arch/x86/crc32_pclmulqdq_tpl.h
- crc32.h
- arch/generic/Makefile.in
- arch/generic/crc32_c.c
- Makefile.in
🚧 Files skipped from review as they are similar to previous changes (2)
- arch/generic/crc32_chorba_c.c
- CMakeLists.txt
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/lasxintrin_ext.h:38-65
Timestamp: 2025-07-04T16:59:44.725Z
Learning: In zlib-ng, the policy for supporting old compilers is to maintain compatibility for up to two years after GitHub removes support for them in CI, or when there is no long-time contributor left testing for the architecture if the architecture was not supported by GitHub Actions at all.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
arch/s390/crc32-vx.carch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.cfunctable.carch/x86/chorba_sse2.carch/generic/generic_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
arch/s390/crc32-vx.carch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.cfunctable.carch/x86/chorba_sse2.carch/generic/generic_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
arch/s390/crc32-vx.carch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.cfunctable.carch/x86/chorba_sse2.carch/generic/generic_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
arch/s390/crc32-vx.carch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.cfunctable.carch/x86/chorba_sse2.carch/generic/generic_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-06-09T16:46:28.468Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2025-04-15T09:30:10.081Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1904
File: arch/riscv/Makefile.in:12-14
Timestamp: 2025-04-15T09:30:10.081Z
Learning: Feature detection modules like riscv_features.c should not be compiled with feature-specific flags (like RVVFLAG) because they need to be compilable on all systems regardless of feature support. These modules perform runtime detection and must initialize feature availability flags to zero on unsupported systems.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/chorba_sse41.c
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.harch/x86/chorba_sse41.cfunctable.c
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.carch/x86/chorba_sse2.ctest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2025-02-21T01:44:03.996Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/test_crc32.ccarch/x86/chorba_sse41.cfunctable.carch/x86/chorba_sse2.c
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.harch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-10-07T21:23:13.401Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:18:25.966Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:84-86
Timestamp: 2024-10-08T19:37:14.998Z
Learning: When `INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR` is not defined, `state->sane` is completely removed from the codebase and does not need to be initialized.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-09-25T16:25:56.686Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:81-83
Timestamp: 2024-09-25T16:25:56.686Z
Learning: When `INFLATE_STRICT` is not defined, the variable `state->dmax` is completely removed from the code.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
test/test_crc32.ccarch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-10-07T21:18:37.806Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Applied to files:
arch/x86/chorba_sse41.c
📚 Learning: 2025-02-21T01:37:54.508Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Applied to files:
arch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-10-07T22:00:02.180Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Applied to files:
arch/x86/chorba_sse41.c
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (2)
fold_12(168-208)fold_4(125-166)
🔇 Additional comments (12)
arch/riscv/crc32_zbc.c (3)
9-9: LGTM: Include added for crc32_braid declaration.The addition of
arch_functions.hcorrectly provides the declaration forcrc32_braid, which replacescrc32_cin this file.
85-87: LGTM: Short buffer fallback correctly uses crc32_braid.The replacement of
crc32_cwithcrc32_braidis appropriate for short buffers (< 16 bytes). The conditioning logic is correct sincecrc32_braidhandles XOR pre/post-conditioning internally.
89-94: LGTM: Unaligned portion handling correctly uses crc32_braid.The replacement is correct. The
crc32_braidcall returns a properly conditioned CRC value, which is then pre-conditioned at line 95 (crc ^= 0xFFFFFFFF) before being passed to the clmul implementation. The conditioning flow is consistent and correct.arch/x86/x86_functions.h (2)
27-30: LGTM! Guard and declaration are correct.The
WITHOUT_CHORBA_SSEguard properly disables Chorba SSE2 variants when needed, and exposingchorba_small_nondestructive_sse2is necessary since the SSE4.1 implementation needs to call it as a fallback.
107-110: native_crc32 mapping is consistent and correct.The conditional definition at lines 107-110 properly overrides the generic base (crc32_chorba or crc32_braid) with the SSE2-optimized variant (crc32_chorba_sse2) when appropriate. The nested conditions—
DISABLE_RUNTIME_CPU_DETECTION,HAVE_BUILTIN_CTZ,X86_SSE2, and!WITHOUT_CHORBA_SSE—follow the same pattern used consistently across all x86 architecture variants (SSE4.1, PCLMUL, VPCLMULQDQ). The#undefbefore#definecorrectly prevents conflicts with prior definitions. No functable or runtime dispatch routing is used for CRC32 in this codebase; routing is entirely compile-time macro-based. No issues found.test/benchmarks/benchmark_crc32.cc (1)
65-73: LGTM! Benchmark registrations updated consistently.The benchmarks correctly reflect the reorganization:
- Generic Chorba (
chorba_c) is now registered unconditionally whenWITHOUT_CHORBAis not defined- SSE2 and SSE4.1 variants are properly gated with
WITHOUT_CHORBA_SSEThe guards align with the new macro scheme introduced in this PR.
arch/x86/chorba_sse2.c (3)
1-1: LGTM! File-level guard correctly gates SSE2 Chorba variants.The guard
#if defined(X86_SSE2) && !defined(WITHOUT_CHORBA_SSE)ensures the entire SSE2 Chorba implementation is disabled when the macro is set (e.g., on MSVC 2019 non-AMD64 builds with the code-generation bug).
9-9: LGTM! Header reorganization aligns with PR objectives.Switching from
arch/generic/generic_functions.htoarch_functions.his consistent with the removal ofcrc32_cand the new functable routing approach.
862-871: LGTM! Conditional compilation logic is correct.The nested conditional structure properly implements the PR's objectives:
- When
WITHOUT_CHORBAis defined: skips large Chorba path, keeps SSE2-optimized small path- When
WITHOUT_CHORBAis not defined: uses large path for very large buffers, small SSE2 path for medium buffers, braid fallback for small buffersThis maintains SSE2 Chorba algorithm implementations while allowing the generic C fallback to be disabled independently.
arch/x86/chorba_sse41.c (3)
1-1: LGTM! File-level guard correctly gates SSE4.1 Chorba variants.Consistent with the SSE2 changes, this guard ensures the SSE4.1 Chorba implementation is disabled when
WITHOUT_CHORBA_SSEis set.
10-10: LGTM! Header reorganization is consistent.Matching the SSE2 file, this switches to
arch_functions.hfor the new functable routing.
320-329: LGTM! Multi-tier fallback logic is correct.The SSE4.1 implementation properly handles multiple buffer size thresholds:
- Very large (> CHORBA_LARGE_THRESHOLD): uses generic 118960 variant (when available)
- Medium (CHORBA_MEDIUM range): uses SSE4.1-optimized 32768 variant
- Small-to-medium (> CHORBA_SMALL_THRESHOLD_64BIT): falls back to SSE2 small variant
- Small: uses braid internal
The cross-call to
chorba_small_nondestructive_sse2is valid since SSE4.1 is a superset of SSE2, and the declaration is now properly exposed inx86_functions.h(line 29).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #2004 +/- ##
===========================================
- Coverage 83.31% 83.26% -0.06%
===========================================
Files 161 160 -1
Lines 12960 12963 +3
Branches 3149 3149
===========================================
- Hits 10798 10794 -4
- Misses 1134 1142 +8
+ Partials 1028 1027 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Develop x86-64: PR x86-64: Develop rpi5: PR rpi5: Default compile (with chorba enabled) results in library size reductions: Performance for default builds is unchanged, as expected since this does not change anything for those builds. |
mtl1979
left a comment
There was a problem hiding this comment.
Just some nitpick comments about logically backwards preprocessor tests.
71f36da to
34355e3
Compare
Now WITHOUT_CHORBA will only disable the crc32_chorba C fallback. SSE2, SSE41 and pclmul variants will still be able to use their Chorba-algorithm based code, but their fallback to the generic crc32_chorba C code in SSE2 and SSE41 will be disabled, reducing their performance on really big input buffers (not used during deflate/inflate, only when calling crc32 directly). Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable routing to select between crc32_braid and crc32_chorba. Disable sse2 and sse4.1 variants of Chorba-crc32 on MSVC older than 2022 due to code generation bug in 2019 causing segfaults. Compile either crc32_chorba_small_nondestructive or crc32_chorba_small_nondestructive_32bit, not both. Don't compile crc32_chorba_32768_nondestructive on 32bit arch.
34355e3 to
f2d7337
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
arch/x86/x86_functions.h (1)
13-30: Fix MSVC Chorba-SSE guard mismatchWe still
#define NO_CHORBA_SSEhere, but every new guard now keys offWITHOUT_CHORBA_SSE. On MSVC 2019 the safeguard never engages, so the Chorba SSE2/SSE4.1 code stays enabled and the known code-gen segfault this PR wants to avoid will still fire. Please emit the same symbol the rest of the file (and functable.c) is checking.-#define NO_CHORBA_SSE +#define WITHOUT_CHORBA_SSE
🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)
114-114: Clarify loop entry condition for better readability.The condition
len >= 512 + 64 + 16*8evaluates correctly to 704 bytes but obscures the structure. Consider rewriting aslen >= 16*8 + 8*64 + 64(preload + 8 processing groups + margin) or simplylen >= 704with a comment explaining the breakdown.Apply this diff:
- while (len >= 512 + 64 + 16*8) { + // Require 128B preload + 8×64B processing + 64B margin = 704B + while (len >= 16*8 + 8*64 + 64) {
115-323: Consider using aligned loads for potential performance gain.Throughout the Chorba loop,
_mm_loadu_si128(unaligned) is used, butsrcis guaranteed to be 16-byte aligned after the alignment correction (lines 54–91) and remains aligned through subsequent advances (128B and 512B are multiples of 16). The standard loop at line 328 uses_mm_load_si128(aligned), suggesting this is safe and intended.While modern CPUs have minimal penalty for unaligned loads on aligned data, using aligned loads would be more consistent and could improve performance on older architectures.
Example change for the preload section:
- __m128i chorba8 = _mm_loadu_si128((__m128i *)src); - __m128i chorba7 = _mm_loadu_si128((__m128i *)src + 1); + __m128i chorba8 = _mm_load_si128((__m128i *)src); + __m128i chorba7 = _mm_load_si128((__m128i *)src + 1);Apply similar changes to all other
_mm_loadu_si128calls within the Chorba loop (lines 141–144, 163–166, 186–189, 209–212, 232–235, 255–258, 278–281, 300–303).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
CMakeLists.txt(1 hunks)Makefile.in(0 hunks)arch/generic/Makefile.in(0 hunks)arch/generic/crc32_c.c(0 hunks)arch/generic/crc32_chorba_c.c(3 hunks)arch/generic/generic_functions.h(2 hunks)arch/riscv/crc32_zbc.c(2 hunks)arch/s390/crc32-vx.c(2 hunks)arch/x86/chorba_sse2.c(2 hunks)arch/x86/chorba_sse41.c(3 hunks)arch/x86/crc32_fold_pclmulqdq_tpl.h(1 hunks)arch/x86/crc32_pclmulqdq_tpl.h(0 hunks)arch/x86/x86_functions.h(4 hunks)crc32.h(0 hunks)functable.c(4 hunks)test/benchmarks/benchmark_crc32.cc(1 hunks)test/test_crc32.cc(2 hunks)
💤 Files with no reviewable changes (5)
- arch/generic/Makefile.in
- Makefile.in
- arch/generic/crc32_c.c
- arch/x86/crc32_pclmulqdq_tpl.h
- crc32.h
🚧 Files skipped from review as they are similar to previous changes (6)
- test/test_crc32.cc
- CMakeLists.txt
- arch/s390/crc32-vx.c
- arch/x86/chorba_sse2.c
- arch/x86/chorba_sse41.c
- arch/riscv/crc32_zbc.c
🧰 Additional context used
🧠 Learnings (19)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/generic/generic_functions.harch/generic/crc32_chorba_c.carch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/generic/generic_functions.harch/generic/crc32_chorba_c.carch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/generic/generic_functions.harch/generic/crc32_chorba_c.carch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/generic/generic_functions.harch/generic/crc32_chorba_c.carch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-02-21T01:44:03.996Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Applied to files:
functable.carch/x86/crc32_fold_pclmulqdq_tpl.harch/generic/crc32_chorba_c.carch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2025-06-09T16:46:28.468Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-10-07T21:23:13.401Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-29T02:18:25.966Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
arch/x86/x86_functions.htest/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:84-86
Timestamp: 2024-10-08T19:37:14.998Z
Learning: When `INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR` is not defined, `state->sane` is completely removed from the codebase and does not need to be initialized.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-09-25T16:25:56.686Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:81-83
Timestamp: 2024-09-25T16:25:56.686Z
Learning: When `INFLATE_STRICT` is not defined, the variable `state->dmax` is completely removed from the code.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2025-02-23T16:50:50.925Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:0-0
Timestamp: 2025-02-23T16:50:50.925Z
Learning: MSVC does not define `__GNUC__`, so adding `!defined(_MSC_VER)` to GCC detection macros is redundant when `__GNUC__` is already being checked.
Applied to files:
test/benchmarks/benchmark_crc32.cc
📚 Learning: 2024-10-04T03:17:24.773Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: chunkset_tpl.h:164-165
Timestamp: 2024-10-04T03:17:24.773Z
Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
Applied to files:
test/benchmarks/benchmark_crc32.cc
🧬 Code graph analysis (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (2)
fold_12(168-208)fold_4(125-166)
arch/generic/crc32_chorba_c.c (4)
arch/riscv/crc32_zbc.c (1)
uint32_t(83-98)arch/x86/chorba_sse2.c (2)
uint32_t(23-847)uint32_t(849-879)arch/x86/chorba_sse41.c (2)
uint32_t(55-305)uint32_t(307-340)arch/generic/crc32_braid_c.c (2)
uint32_t(63-213)uint32_t(215-222)
Split out separately from #2003. Easier to review, and with disabled Chorba-sse variants on old MSVC versions.
WITHOUT_CHORBA will now only disable the crc32_chorba C fallback.
SSE2, SSE41 and pclmul variants will still be able to use their Chorba-algorithm based code, but their fallback to the generic crc32_chorba C code in SSE2 and SSE41 will be disabled, reducing their performance on really big input buffers (not used during deflate/inflate, only when calling crc32 directly).
Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable routing to select between crc32_braid and crc32_chorba.
Disable sse2 and sse4.1 variants of Chorba-crc32 on MSVC older than 2022 due to code generation bug in 2019 causing segfaults.
Compile either crc32_chorba_small_nondestructive or crc32_chorba_small_nondestructive_32bit, not both. Shaves ~1.1KB off the library size.
Don't compile crc32_chorba_32768_nondestructive on 32bit arch.