Reorganize Chorba activation and refactor Chorba dispatch#2003
Reorganize Chorba activation and refactor Chorba dispatch#2003
Conversation
Develop chorba:PR chorba first commitPR chorba w/second commitNot much happening with first commit only, should be pretty much identical. |
Develop no-chorba:PR no-chorba first commitPR no-chorba w/second commitFirst commit allows SSE2 and SSE4.1 optimized functions to be used, providing a huge speedup even without the chorba_c fallback included in the code. With this PR, you can also save 15KB on the library size if you disable chorba_c when compiling for modern systems that don't need that fallback. Without chorba_c, chorba_sse2 and chorba_sse41 still perform perfectly up to 256KB/512KB (32/64bit) buffers, above that they only perform at about 73% of the speed compared to with chorba_c. But they still perform at nearly 3x the speed compared to crc32_braid that was used for these cpus before version 2.3.0. |
WalkthroughRemoved the generic Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant native as native_crc32 (macro)
participant disp as crc32_chorba / crc32_chorba_sse*
participant chorba as Chorba_variants
participant braid as crc32_braid
Caller->>native: call(native_crc32, crc, buf, len)
alt ISA-specific SSE enabled
native->>disp: route to crc32_chorba_sse2 / crc32_chorba_sse41
else generic
native->>disp: route to crc32_chorba
end
disp->>disp: init c = (~crc)&0xffffffff, compute algn_diff
alt len < algn_diff + CHORBA_SMALL_THRESHOLD
disp->>braid: crc32_braid(c, buf, len)
else
disp->>braid: process head (unaligned) via crc32_braid
disp->>chorba: select large/medium/small nondestructive variant for aligned body
chorba-->>disp: processed aligned chunk(s)
alt tail remains
disp->>braid: process tail via crc32_braid
end
end
disp-->>Caller: return (c ^ 0xffffffff)
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e2b7823 to
d2da176
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
CMakeLists.txt(0 hunks)Makefile.in(0 hunks)arch/generic/Makefile.in(0 hunks)arch/generic/crc32_c.c(0 hunks)arch/generic/crc32_chorba_c.c(1 hunks)arch/generic/generic_functions.h(2 hunks)arch/riscv/crc32_zbc.c(2 hunks)arch/s390/crc32-vx.c(2 hunks)arch/x86/chorba_sse2.c(2 hunks)arch/x86/chorba_sse41.c(3 hunks)arch/x86/crc32_fold_pclmulqdq_tpl.h(1 hunks)arch/x86/crc32_pclmulqdq_tpl.h(0 hunks)arch/x86/x86_functions.h(2 hunks)crc32.h(1 hunks)functable.c(3 hunks)test/benchmarks/benchmark_crc32.cc(1 hunks)test/test_crc32.cc(2 hunks)
💤 Files with no reviewable changes (5)
- arch/generic/Makefile.in
- CMakeLists.txt
- Makefile.in
- arch/generic/crc32_c.c
- arch/x86/crc32_pclmulqdq_tpl.h
🧰 Additional context used
🧠 Learnings (21)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/s390/crc32-vx.carch/x86/x86_functions.harch/riscv/crc32_zbc.cfunctable.carch/x86/chorba_sse2.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cctest/test_crc32.ccarch/generic/crc32_chorba_c.carch/generic/generic_functions.hcrc32.h
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/s390/crc32-vx.carch/x86/x86_functions.harch/riscv/crc32_zbc.cfunctable.carch/x86/chorba_sse2.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cctest/test_crc32.ccarch/generic/crc32_chorba_c.carch/generic/generic_functions.hcrc32.h
📚 Learning: 2025-02-21T01:44:03.996Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.harch/x86/chorba_sse2.carch/x86/chorba_sse41.ctest/test_crc32.cccrc32.h
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/s390/crc32-vx.carch/x86/x86_functions.harch/riscv/crc32_zbc.cfunctable.carch/x86/chorba_sse2.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cctest/test_crc32.ccarch/generic/crc32_chorba_c.carch/generic/generic_functions.hcrc32.h
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/s390/crc32-vx.carch/x86/x86_functions.harch/riscv/crc32_zbc.cfunctable.carch/x86/chorba_sse2.carch/x86/chorba_sse41.ctest/benchmarks/benchmark_crc32.cctest/test_crc32.ccarch/generic/crc32_chorba_c.carch/generic/generic_functions.hcrc32.h
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.harch/x86/chorba_sse41.ctest/test_crc32.cc
📚 Learning: 2024-10-04T03:17:24.773Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: chunkset_tpl.h:164-165
Timestamp: 2024-10-04T03:17:24.773Z
Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/chorba_sse41.c
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Applied to files:
arch/x86/crc32_fold_pclmulqdq_tpl.harch/x86/x86_functions.hfunctable.carch/x86/chorba_sse41.c
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
arch/x86/x86_functions.hfunctable.carch/x86/chorba_sse2.carch/x86/chorba_sse41.c
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:84-86
Timestamp: 2024-10-08T19:37:14.998Z
Learning: When `INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR` is not defined, `state->sane` is completely removed from the codebase and does not need to be initialized.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2025-06-09T16:46:28.468Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2025-04-15T09:30:10.081Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1904
File: arch/riscv/Makefile.in:12-14
Timestamp: 2025-04-15T09:30:10.081Z
Learning: Feature detection modules like riscv_features.c should not be compiled with feature-specific flags (like RVVFLAG) because they need to be compilable on all systems regardless of feature support. These modules perform runtime detection and must initialize feature availability flags to zero on unsupported systems.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/chorba_sse41.c
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Applied to files:
arch/x86/chorba_sse2.c
📚 Learning: 2025-02-21T01:37:54.508Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Applied to files:
arch/x86/chorba_sse2.c
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
test/test_crc32.cc
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Applied to files:
arch/generic/crc32_chorba_c.c
📚 Learning: 2024-10-07T22:00:02.180Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Applied to files:
crc32.h
🧬 Code graph analysis (3)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (2)
fold_12(168-208)fold_4(125-166)
test/benchmarks/benchmark_crc32.cc (1)
test/benchmarks/benchmark_main.cc (1)
cpu_features(16-16)
arch/generic/crc32_chorba_c.c (3)
arch/x86/chorba_sse2.c (2)
uint32_t(27-851)uint32_t(853-880)arch/x86/chorba_sse41.c (2)
uint32_t(59-309)uint32_t(311-340)arch/generic/crc32_braid_c.c (2)
uint32_t(63-213)uint32_t(215-222)
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #2003 +/- ##
===========================================
- Coverage 83.31% 83.27% -0.05%
===========================================
Files 161 160 -1
Lines 12960 12953 -7
Branches 3149 3145 -4
===========================================
- Hits 10798 10786 -12
+ Misses 1134 1101 -33
- Partials 1028 1066 +38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
76ec271 to
6c541fb
Compare
|
32-bit MSVC does seem to fail with a segfault, not sure what is causing that. |
|
Only fails in Release build... Output is: |
|
these look like boundaries where we move to a larger polynomial and have a
barely large-enough input buffer that we're probably overrunning, upping
the thresholds would be a simple workaround (to >78 and >148 respectively)
…On Wed, Nov 12, 2025, 17:02 Mika Lindqvist ***@***.***> wrote:
*mtl1979* left a comment (zlib-ng/zlib-ng#2003)
<#2003 (comment)>
Only fails in Release build... Output is:
[ FAILED ] 18 tests, listed below:
[ FAILED ] crc32/crc32_variant.chorba_sse2/76, where GetParam() = 16-byte object <00-00 00-00 B0-35 0B-00 64-00 00-00 B9-1D C7-E7>
[ FAILED ] crc32/crc32_variant.chorba_sse2/77, where GetParam() = 16-byte object <00-00 00-00 18-36 0B-00 64-00 00-00 77-27 A5-EA>
[ FAILED ] crc32/crc32_variant.chorba_sse2/78, where GetParam() = 16-byte object <00-00 00-00 80-36 0B-00 64-00 00-00 48-20 47-CD>
[ FAILED ] crc32/crc32_variant.chorba_sse2/143, where GetParam() = 16-byte object <D3-4F B3-74 B0-35 0B-00 64-00 00-00 60-80 F9-CC>
[ FAILED ] crc32/crc32_variant.chorba_sse2/144, where GetParam() = 16-byte object <70-D7 1F-35 18-36 0B-00 64-00 00-00 12-53 B9-D8>
[ FAILED ] crc32/crc32_variant.chorba_sse2/145, where GetParam() = 16-byte object <77-EF 5A-C4 80-36 0B-00 64-00 00-00 12-99 1C-BB>
[ FAILED ] crc32/crc32_variant.chorba_sse2/146, where GetParam() = 16-byte object <77-EF 5A-C4 D8-3C 0B-00 58-02 00-00 5B-FA 8A-88>
[ FAILED ] crc32/crc32_variant.chorba_sse2/147, where GetParam() = 16-byte object <00-00 00-00 A0-E5 0C-00 00-80 00-00 B2-26 77-21>
[ FAILED ] crc32/crc32_variant.chorba_sse2/148, where GetParam() = 16-byte object <00-00 00-00 A0-E5 0C-00 00-40 00-00 F0-22 17-E8>
[ FAILED ] crc32/crc32_variant.chorba_sse41/76, where GetParam() = 16-byte object <00-00 00-00 B0-35 0B-00 64-00 00-00 B9-1D C7-E7>
[ FAILED ] crc32/crc32_variant.chorba_sse41/77, where GetParam() = 16-byte object <00-00 00-00 18-36 0B-00 64-00 00-00 77-27 A5-EA>
[ FAILED ] crc32/crc32_variant.chorba_sse41/78, where GetParam() = 16-byte object <00-00 00-00 80-36 0B-00 64-00 00-00 48-20 47-CD>
[ FAILED ] crc32/crc32_variant.chorba_sse41/143, where GetParam() = 16-byte object <D3-4F B3-74 B0-35 0B-00 64-00 00-00 60-80 F9-CC>
[ FAILED ] crc32/crc32_variant.chorba_sse41/144, where GetParam() = 16-byte object <70-D7 1F-35 18-36 0B-00 64-00 00-00 12-53 B9-D8>
[ FAILED ] crc32/crc32_variant.chorba_sse41/145, where GetParam() = 16-byte object <77-EF 5A-C4 80-36 0B-00 64-00 00-00 12-99 1C-BB>
[ FAILED ] crc32/crc32_variant.chorba_sse41/146, where GetParam() = 16-byte object <77-EF 5A-C4 D8-3C 0B-00 58-02 00-00 5B-FA 8A-88>
[ FAILED ] crc32/crc32_variant.chorba_sse41/147, where GetParam() = 16-byte object <00-00 00-00 A0-E5 0C-00 00-80 00-00 B2-26 77-21>
[ FAILED ] crc32/crc32_variant.chorba_sse41/148, where GetParam() = 16-byte object <00-00 00-00 A0-E5 0C-00 00-40 00-00 F0-22 17-E8>
—
Reply to this email directly, view it on GitHub
<#2003 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALJUP5SBMUMLTK55Q53JA334NK2XAVCNFSM6AAAAACL2ELE22VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTKMRSGYYTONBTHE>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
If I comment out call to |
I notice that I have used the wrong operator in the threshold check, |
Now WITHOUT_CHORBA will only disable the crc32_chorba C fallback. SSE2, SSE41 and pclmul variants will still be able to use their Chorba-algorithm based code, but their fallback to the generic crc32_chorba C code in SSE2 and SSE41 will be disabled, reducing their performance on really big input buffers. Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable routing to select between crc32_braid and crc32_chorba.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
CMakeLists.txt(0 hunks)Makefile.in(0 hunks)arch/generic/Makefile.in(0 hunks)arch/generic/crc32_c.c(0 hunks)arch/generic/crc32_chorba_c.c(1 hunks)arch/generic/generic_functions.h(2 hunks)arch/riscv/crc32_zbc.c(2 hunks)arch/s390/crc32-vx.c(2 hunks)arch/x86/chorba_sse2.c(2 hunks)arch/x86/chorba_sse41.c(3 hunks)arch/x86/crc32_fold_pclmulqdq_tpl.h(1 hunks)arch/x86/crc32_pclmulqdq_tpl.h(0 hunks)arch/x86/x86_functions.h(2 hunks)crc32.h(1 hunks)functable.c(3 hunks)test/benchmarks/benchmark_crc32.cc(1 hunks)test/test_crc32.cc(2 hunks)
💤 Files with no reviewable changes (5)
- Makefile.in
- arch/generic/Makefile.in
- CMakeLists.txt
- arch/x86/crc32_pclmulqdq_tpl.h
- arch/generic/crc32_c.c
🚧 Files skipped from review as they are similar to previous changes (1)
- arch/generic/crc32_chorba_c.c
🧰 Additional context used
🧠 Learnings (22)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
arch/s390/crc32-vx.carch/generic/generic_functions.htest/test_crc32.ccarch/x86/x86_functions.hfunctable.ccrc32.htest/benchmarks/benchmark_crc32.ccarch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.harch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
arch/s390/crc32-vx.carch/generic/generic_functions.htest/test_crc32.ccarch/x86/x86_functions.hfunctable.ccrc32.htest/benchmarks/benchmark_crc32.ccarch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.harch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
arch/s390/crc32-vx.carch/generic/generic_functions.htest/test_crc32.ccarch/x86/x86_functions.hfunctable.ccrc32.htest/benchmarks/benchmark_crc32.ccarch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.harch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
arch/s390/crc32-vx.carch/generic/generic_functions.htest/test_crc32.ccarch/x86/x86_functions.hfunctable.ccrc32.htest/benchmarks/benchmark_crc32.ccarch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.harch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
test/test_crc32.ccarch/x86/chorba_sse2.c
📚 Learning: 2025-02-21T01:44:03.996Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Applied to files:
test/test_crc32.ccarch/x86/x86_functions.hcrc32.harch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
arch/x86/x86_functions.harch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
arch/x86/x86_functions.hfunctable.carch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-10-29T02:22:52.846Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Applied to files:
arch/x86/x86_functions.hfunctable.carch/x86/chorba_sse41.carch/x86/crc32_fold_pclmulqdq_tpl.h
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1797
File: inflate.c:84-86
Timestamp: 2024-10-08T19:37:14.998Z
Learning: When `INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR` is not defined, `state->sane` is completely removed from the codebase and does not need to be initialized.
Applied to files:
arch/x86/x86_functions.h
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Applied to files:
arch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-10-07T21:23:13.401Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Applied to files:
arch/x86/chorba_sse41.carch/x86/chorba_sse2.c
📚 Learning: 2024-12-27T10:06:16.184Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:215-217
Timestamp: 2024-12-27T10:06:16.184Z
Learning: There's a business requirement to use stack allocation instead of heap for the Chorba buffer.
Applied to files:
arch/x86/chorba_sse41.c
📚 Learning: 2024-10-29T02:22:55.489Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Applied to files:
arch/x86/chorba_sse41.carch/riscv/crc32_zbc.c
📚 Learning: 2024-10-07T21:18:37.806Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Applied to files:
arch/x86/chorba_sse41.c
📚 Learning: 2025-06-09T16:46:28.468Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.468Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Applied to files:
arch/riscv/crc32_zbc.carch/x86/chorba_sse2.c
📚 Learning: 2025-04-15T09:30:10.081Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1904
File: arch/riscv/Makefile.in:12-14
Timestamp: 2025-04-15T09:30:10.081Z
Learning: Feature detection modules like riscv_features.c should not be compiled with feature-specific flags (like RVVFLAG) because they need to be compilable on all systems regardless of feature support. These modules perform runtime detection and must initialize feature availability flags to zero on unsupported systems.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2025-06-10T07:38:03.297Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.297Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Applied to files:
arch/riscv/crc32_zbc.c
📚 Learning: 2024-10-08T19:37:14.998Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1802
File: arch/x86/chunkset_avx2.c:168-168
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the function `GET_HALFCHUNK_MAG` in `arch/x86/chunkset_avx2.c`, `dist` is always less than 16, so `16 - dist` in `__msan_unpoison` is always positive.
Applied to files:
arch/x86/chorba_sse2.c
📚 Learning: 2025-02-21T01:37:54.508Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Applied to files:
arch/x86/chorba_sse2.c
🧬 Code graph analysis (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)
arch/x86/crc32_pclmulqdq_tpl.h (2)
fold_12(168-208)fold_4(125-166)
🔇 Additional comments (5)
arch/x86/x86_functions.h (2)
27-27: LGTM: Chorba SSE declarations now unconditional.Removing the
WITHOUT_CHORBAguard allows the SSE2 and SSE4.1 Chorba implementations to remain available even when the generic C fallback is disabled, consistent with the PR's objective.Also applies to: 37-37
104-105: native_crc32 mapping is consistent between static defines and runtime dispatch.Verification confirms alignment: arch/x86/x86_functions.h lines 105 and 120 define
native_crc32tocrc32_chorba_sse2andcrc32_chorba_sse41respectively, matching the functable.c runtime assignments at lines 125 and 147. The precedence order (SSE2 base, SSE4.1 override) is consistent across both static preprocessor mappings and dynamic function table dispatch.arch/x86/chorba_sse2.c (1)
854-856: LGTM: Unified initialization and alignment calculation.The refactored approach computing
candalgn_diffupfront simplifies the control flow and aligns with the PR's goal of unifying dispatch logic.test/benchmarks/benchmark_crc32.cc (1)
65-73: LGTM: Benchmark registrations updated consistently.The refactored benchmark registrations properly separate
chorba_c,chorba_sse2, andchorba_sse41variants with appropriate feature guards and runtime CPU checks, aligning with the broader reorganization.test/test_crc32.cc (1)
294-296: LGTM: Test registrations updated consistently.The refactored test guards now rely directly on feature macros (
X86_SSE2,X86_SSE41) rather than Chorba-specific guards, consistent with the unconditional SSE declarations inx86_functions.h.Note: PR objectives document 18 failing
crc32_varianttests on SSE2/SSE4.1 in release builds, concentrated at threshold boundaries. These failures may be related to the threshold condition flagged inchorba_sse2.c.Also applies to: 317-322
- Fixed alignment diff calculation in crc32_chorba. - Fixed length check to happen early, avoiding extra branches for too short lengths, this also allows removing one function call to crc32_braid_internal to handle those. Gbench shows ~0.15-0.25ns saved per call for lengths shorter than CHORBA_SMALL_THRESHOLD. - Avoid calculating aligned len if buffer is already aligned
|
|
||
| Z_INTERNAL uint32_t crc32_braid_internal(uint32_t c, const uint8_t *buf, size_t len); | ||
| #if OPTIMAL_CMP == 64 | ||
| # define CHORBA_SMALL_THRESHOLD 72 |
First commit:
Disabling Chorba using CMake parameter
WITH_CRC32_CHORBA=OFFwill now only disable the crc32_chorba C fallback.SSE2, SSE41 and pclmul variants will still be able to use their own Chorba-algorithm based code, but their fallback to
the generic crc32_chorba C code in SSE2 and SSE41 will be disabled, reducing their performance on really big
input buffers. But this allows the SSE2 and SSE41 functions to still be used in the first place, and they provide a huge win over crc32_braid for buffers over 72 bytes.
Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable routing to select
between crc32_braid and crc32_chorba.
Second commit:
this also allows removing one function call to crc32_braid_internal to handle those.
Gbench shows ~0.15-0.25ns saved per call for lengths shorter than CHORBA_SMALL_THRESHOLD.