Add benchmark for crc32_fold_copy functions#2008
Conversation
Edit: Removed incorrect benchmark numbers, see next comment below for updated numbers. The regression is still there though. There seems to be a performance regression with pclmul. @KungFuJesus, ideas?
Walkthrough

Adds a new CRC32 fold-copy benchmark source and integrates it into the benchmarks build; the benchmark fixture runs fold-copy workloads across multiple CRC32 implementations and architecture-specific variants, varying input size and misalignment while finalizing the fold state.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Bench as BenchmarkRunner
    participant Fixture as crc32_fc (SetUp/TearDown)
    participant Driver as bench_loop
    participant Impl as crc32_fold_impl (braid/chorba/arm/loongarch/...)
    participant Fold as crc32_fold (fold_copy/fold_final)
    rect rgb(240,248,255)
        Bench->>Fixture: create & SetUp (alloc + seed input)
        Bench->>Driver: invoke BENCHMARK loop (sizes, misalign)
        Driver->>Impl: call fold_copy(src, dst, &state)
        Impl->>Fold: perform fold_copy operations
        Driver->>Fold: fold_final(&state)
        Fold-->>Bench: result (prevent optimize)
        Bench->>Fixture: TearDown (free)
    end
```
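The flow in the diagram can be sketched in plain C++ as follows. This is a stand-in for the actual Google Benchmark fixture: the fold functions here are simple byte-sum stubs, not the zlib-ng `crc32_fold_*` implementations, and `run_bench` plays the role of the benchmark loop.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

struct crc32_fold_stub_state { uint32_t value; };

static void fold_reset_stub(crc32_fold_stub_state *st) { st->value = 0; }

// "Fold-copy" stand-in: accumulate a checksum over src while copying it to dst.
static void fold_copy_stub(crc32_fold_stub_state *st, uint8_t *dst,
                           const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++) { st->value += src[i]; dst[i] = src[i]; }
}

static uint32_t fold_final_stub(crc32_fold_stub_state *st) { return st->value; }

// One benchmark pass: rotate misalignment 0..14 across iterations, as the
// fixture does, and return the finalized checksum so it can't be optimized out.
static uint32_t run_bench(size_t len, int iters) {
    const size_t BUFSIZE = 32768 + 16;          // extra tail for misalignment
    std::vector<uint8_t> src(BUFSIZE), dst(BUFSIZE);
    for (size_t i = 0; i < BUFSIZE; i++) src[i] = (uint8_t)(i & 0xFF);

    crc32_fold_stub_state st;
    fold_reset_stub(&st);
    int misalign = 0;
    for (int it = 0; it < iters; it++) {
        assert(misalign + len <= BUFSIZE);      // rotation stays in bounds
        fold_copy_stub(&st, dst.data() + misalign, src.data() + misalign, len);
        if (++misalign == 15) misalign = 0;
    }
    return fold_final_stub(&st);
}
```

The real fixture passes function pointers (`crc32_fold_reset_func` and friends) into one `Bench` routine, so a single driver exercises every architecture-specific variant.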
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/crc32_test_strings_p.h (1)
4-22: Make the test header more self-contained to reduce include-order brittleness

This header relies on `ALIGNED_`, `uint8_t`, and `size_t` being defined by earlier includes in the including `.cc` files. That's fragile for a shared test header and easy to break with future refactors. Consider explicitly including the necessary standard headers (and, if appropriate, `zbuild.h` for `ALIGNED_`) here, or at least documenting that this header must be included after those definitions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- arch/generic/generic_functions.h (1 hunks)
- test/CMakeLists.txt (1 hunks)
- test/benchmarks/CMakeLists.txt (1 hunks)
- test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
- test/crc32_test_strings_p.h (1 hunks)
- test/test_crc32.cc (2 hunks)
- test/test_crc32_fold_copy.cc (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
- arch/generic/generic_functions.h
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-06-18T19:28:32.987Z
Learnt from: phprus
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/slide_hash_lsx.c:64-64
Timestamp: 2025-06-18T19:28:32.987Z
Learning: In zlib-ng's slide_hash functions, when calling slide_hash_chain(), the fourth parameter should be the window size (wsize) representing the number of entries in the prev table, not s->w_size directly. The local variable wsize is typically cast to the appropriate type (uint16_t) from s->w_size for this purpose.
Applied to files:
arch/generic/generic_functions.h
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
- arch/generic/generic_functions.h
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
- arch/generic/generic_functions.h
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (4)
arch/generic/generic_functions.h (2)
- crc32.h (1): `crc32_fold_s` (20-23)
- functable.c (1): `crc32_fold_stub` (402-405)

test/benchmarks/CMakeLists.txt (2)
- test/benchmarks/benchmark_crc32.cc (1): `crc32` (21-48)
- test/benchmarks/benchmark_adler32_copy.cc (1): `adler32_copy` (24-63)

test/test_crc32_fold_copy.cc (1)
- arch/arm/crc32_armv8.c (1): `crc32_fold_copy_armv8` (76-79)

test/benchmarks/benchmark_crc32_fold_copy.cc (1)
- arch/arm/crc32_armv8.c (1): `crc32_fold_copy_armv8` (76-79)
🔇 Additional comments (5)

test/benchmarks/CMakeLists.txt (1)

47-47: Benchmark source wired in cleanly. The fold-copy benchmark now builds with the rest of the suite without touching other targets.

test/CMakeLists.txt (1)

186-187: New fold-copy test registered appropriately. Including `test_crc32_fold_copy.cc` under the ZLIBNG test gate keeps build knobs consistent.

test/test_crc32.cc (1)

12-19: Shared CRC32 vectors plugged in smoothly. Reusing `crc32_tests` keeps the parameterized suite intact while centralizing the data source. Also applies to: 77-77

arch/generic/generic_functions.h (1)

15-17: Function pointer typedefs look correct. The new fold-reset/copy/final aliases mirror the existing prototypes and expose them cleanly to consumers.

test/test_crc32_fold_copy.cc (1)

21-83: Great coverage of fold-copy paths. The fixture drives reset/copy/final across all feature variants and double-checks the copied payload, which is exactly what we need here.
Force-pushed c403a94 to 3ea4cc7 (Compare)
Since input data is nearly never aligned, I changed the benchmark to add misalignment to both the input and output buffers. Measured performance went down across the board, but this is more representative of the real-world usage of these functions.
Actionable comments posted: 0
♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
24-28: Initialize the entire buffer to avoid undefined behavior

The loop still only initializes `BUFSIZE/sizeof(uint32_t)` bytes (8196 bytes), leaving the remaining ~24KB uninitialized. Since benchmarks read up to 32768 bytes, this causes reads from uninitialized memory.

Apply this diff:

```diff
 void SetUp(const ::benchmark::State& state) {
-    for (uint32_t i = 0; i < BUFSIZE/sizeof(uint32_t); i++) {
+    for (uint32_t i = 0; i < BUFSIZE; i++) {
         testdata[i] = rand();
     }
 }
```
53-60: Add a `return` statement after `SkipWithError` to prevent executing unsupported code

When `support_flag` is false, the benchmark calls `state.SkipWithError` but continues to execute `Bench(...)`, which will invoke architecture-specific instructions the CPU doesn't support. This can crash with SIGILL (illegal instruction).

Apply this diff:

```diff
 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);
```
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
30-50: Consider resetting CRC state between benchmark iterations

The CRC state (`crc_st`) is initialized once (line 34) and then accumulates data across all benchmark iterations. This means the state processes potentially megabytes of data before finalization, which may not reflect typical usage patterns and could skew measurements if CRC behavior varies with accumulated state.

Consider resetting the state each iteration:

```diff
 void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset,
            crc32_fold_copy_func fold_copy, crc32_fold_final_func fold_final) {
     int misalign = 0;

-    // Prepare an initial crc state
-    fold_reset(&crc_st);
     crc = 0;

     // Benchmark the CRC32 fold copy operation
     for (auto _ : state) {
+        fold_reset(&crc_st);
         fold_copy(&crc_st, dstbuf + misalign, testdata + misalign, (size_t)state.range(0));
+        crc = fold_final(&crc_st);
         misalign++;
         if (misalign == 15)
             misalign = 0;
     }

-    // Finalize the CRC32 calculation
-    crc = fold_final(&crc_st);
-
     // Prevent the result from being optimized away
     benchmark::DoNotOptimize(crc);
 }
```

Alternatively, protect `dstbuf` from optimization to ensure the copy operation is measured:

```diff
     // Prevent the result from being optimized away
     benchmark::DoNotOptimize(crc);
+    benchmark::DoNotOptimize(dstbuf);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- test/benchmarks/CMakeLists.txt (1 hunks)
- test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
arch/arm/crc32_armv8.c (1)
crc32_fold_copy_armv8(76-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: macOS GCC UBSAN (ARM64) (×10)
Force-pushed c16284b to a72a6f4 (Compare)
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #2008     +/-   ##
===========================================
+ Coverage    80.05%   81.26%   +1.20%
===========================================
  Files          162      163       +1
  Lines        13558    12862     -696
  Branches      3474     3171     -303
===========================================
- Hits         10854    10452     -402
+ Misses        1640     1352     -288
+ Partials      1064     1058       -6
```

☔ View full report in Codecov by Sentry.
Force-pushed a72a6f4 to 98fceeb (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
63-70: Add `return` after `SkipWithError` to prevent executing unsupported instructions.

When `support_flag` is false, the macro calls `state.SkipWithError(...)` but continues to execute `Bench(...)`, which may invoke CPU instructions unsupported by the current processor (e.g., PCLMULQDQ, VPCLMULQDQ, ARM CRC32). This can trigger illegal instruction faults and crash the benchmark.

Apply this diff:

```diff
 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);
```
🧹 Nitpick comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
26-27: Consider using `zng_alloc`/`zng_free` for consistency.

The existing `benchmark_crc32.cc` uses `zng_alloc`/`zng_free` for buffer allocation. For consistency across the benchmark suite, consider switching from `malloc`/`free` to `zng_alloc`/`zng_free`.

43-47: Consider documenting the misalignment pattern. The rotating misalignment (0-14 bytes) intentionally tests various alignment scenarios. A brief comment explaining this design choice would improve code clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- test/benchmarks/CMakeLists.txt (1 hunks)
- test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1)
crc32_fold_copy_armv8(76-79)test/benchmarks/benchmark_crc32.cc (1)
crc32(21-48)
🔇 Additional comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
30-32: Buffer initialization is now correct. The loop correctly initializes the entire buffer (`BUFSIZE` bytes) by iterating `BUFSIZE/sizeof(uint32_t)` times over `uint32_t` elements. This resolves the previously flagged concern about reading uninitialized memory in larger benchmark runs.

72-94: LGTM: Benchmark registrations are properly structured. The generic benchmark is always available, and platform-specific variants are correctly gated by both compile-time macros (`ARM_CRC32`, `X86_PCLMULQDQ_CRC`, etc.) and runtime feature detection flags. The compound condition for `vpclmulqdq` (line 88) properly checks all required CPU features.
Force-pushed 98fceeb to 069fbac (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
63-69: Short-circuit after `SkipWithError` to avoid executing unsupported instructions

When `support_flag` is false, the macro calls `state.SkipWithError` but still falls through to `Bench`, which may execute PCLMUL/VPCLMUL/ARM CRC/LoongArch CRC instructions on CPUs that don't support them, risking illegal-instruction faults.

Add an early `return` after `SkipWithError`:

```diff
 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
```
🧹 Nitpick comments (1)
test/test_crc32_fold_copy.cc (1)
49-79: Clarify and validate `minlen`/`onlyzero` constraints for optimized variants

The `(minlen, onlyzero)` parameters for `native`, `pclmulqdq`, and `vpclmulqdq` are conservative (e.g., `minlen=16`, `onlyzero=1`), but the inline comment still asks whether 16 bytes is actually the minimum for the PCLMUL-based implementations:

```cpp
// Is 16 bytes len the minimum for pclmul functions?
TEST_CRC32_FOLD(pclmulqdq, 16, 1, ...)
```

It would be good to:
- Confirm the real minimum-length and initial-CRC constraints from the corresponding implementations, and
- Update `minlen`/`onlyzero` (and this comment) to reflect the actual guarantees, so you don't unnecessarily skip valid test cases or accidentally allow invalid ones if the assumptions are off.
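The `(minlen, onlyzero)` gating described above can be sketched as a small predicate. The parameter names come from the test macro; the function itself is a hypothetical illustration, not zlib-ng code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical skip predicate matching the (minlen, onlyzero) gating:
// a variant is skipped when the input is shorter than its minimum length,
// or when it only supports a zero initial CRC and the case seeds a nonzero one.
static bool should_skip_case(size_t len, uint32_t init_crc,
                             size_t minlen, bool onlyzero) {
    return len < minlen || (onlyzero && init_crc != 0);
}
```

For example, with `minlen=16` and `onlyzero=true`, a 20-byte input with a zero initial CRC runs, while the same input with a nonzero seed is skipped.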
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- arch/generic/generic_functions.h (1 hunks)
- test/CMakeLists.txt (1 hunks)
- test/benchmarks/CMakeLists.txt (1 hunks)
- test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
- test/crc32_test_strings_p.h (1 hunks)
- test/test_crc32.cc (2 hunks)
- test/test_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- test/benchmarks/CMakeLists.txt
- test/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.
Applied to files:
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
- test/test_crc32.cc
- arch/generic/generic_functions.h
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
- test/test_crc32.cc
- arch/generic/generic_functions.h
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
- test/test_crc32.cc
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-06-18T19:28:32.987Z
Learnt from: phprus
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/slide_hash_lsx.c:64-64
Timestamp: 2025-06-18T19:28:32.987Z
Learning: In zlib-ng's slide_hash functions, when calling slide_hash_chain(), the fourth parameter should be the window size (wsize) representing the number of entries in the prev table, not s->w_size directly. The local variable wsize is typically cast to the appropriate type (uint16_t) from s->w_size for this purpose.
Applied to files:
arch/generic/generic_functions.h
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
- arch/generic/generic_functions.h
- test/test_crc32_fold_copy.cc
- test/crc32_test_strings_p.h
- test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (3)
arch/generic/generic_functions.h (2)
- crc32.h (1): `crc32_fold_s` (20-23)
- functable.c (1): `crc32_fold_stub` (402-405)

test/test_crc32_fold_copy.cc (1)
- arch/arm/crc32_armv8.c (1): `crc32_fold_copy_armv8` (76-79)

test/benchmarks/benchmark_crc32_fold_copy.cc (3)
- arch/arm/crc32_armv8.c (1): `crc32_fold_copy_armv8` (76-79)
- test/benchmarks/benchmark_crc32.cc (1): `crc32` (21-48)
- test/benchmarks/benchmark_adler32_copy.cc (1): `adler32_copy` (24-63)
🔇 Additional comments (8)
arch/generic/generic_functions.h (1)
15-17: New crc32_fold function-pointer typedefs look consistent and correct. Signatures match the existing `crc32_fold_reset_c`, `crc32_fold_copy_c`, and `crc32_fold_final_c` prototypes below and provide a clean way to wire these into tests/benchmarks. No issues from a type or ABI perspective.

test/benchmarks/benchmark_crc32_fold_copy.cc (2)

25-32: Fixture buffer allocation and initialization are now sound. Allocating `BUFSIZE` for both `testdata` and `dstbuf` and filling `BUFSIZE/sizeof(uint32_t)` elements ensures the entire source region used by the benchmark is initialized, including when misaligned up to 15 bytes.

35-55: Misalignment and CRC accumulation logic in `Bench` are reasonable. The rotating `misalign` in [0, 14] combined with `BUFSIZE = 32768 + 16` keeps all accesses in-bounds for the largest `state.range(0)` (32 KiB), while accumulating CRC over many calls for realistic throughput measurement. The pattern mirrors existing CRC/adler benchmarks in this repo.

test/test_crc32.cc (2)

8-16: Shared CRC32 test-data header is integrated cleanly. Including `crc32_test_strings_p.h` (with `crc32_test`/`crc32_tests`) inside the existing `extern "C"` block is appropriate and keeps the C-style test data usable from C++. The additional `zutil_p.h` include matches the header's use of `ALIGNED_` and friends.

72-72: Switch to `crc32_tests` data source for parameterized tests looks correct. `INSTANTIATE_TEST_SUITE_P(crc32, crc32_variant, testing::ValuesIn(crc32_tests));` correctly wires the new shared test cases into the existing `crc32_variant` parameterized tests without changing the test logic.

test/test_crc32_fold_copy.cc (1)

17-44: Fold-copy fixture correctly validates both CRC and copy semantics. `crc32_fold_test` sets up a local `crc32_fold` state, enforces `params.len <= BUFSIZE`, applies the `(minlen, onlyzero)` gating, then runs reset → set initial crc → copy → final and checks both the resulting CRC and `memcmp` of src vs dst. This covers the essential correctness for all fold-copy variants without risking out-of-bounds accesses.

test/crc32_test_strings_p.h (2)

4-16: 32 KiB pattern buffer and setup function are straightforward and safe. `crc32_fullwin_buf` plus `crc32_setup_buf()` and the `buf32k` initializer give each TU a 32 KiB, 16-byte-aligned buffer filled with a simple 0-255 pattern, which is appropriate for the large-buffer CRC tests.

17-181: Shared `crc32_tests` table is well-structured and compatible with existing tests. The `crc32_test` struct and `crc32_tests[]` cover a broad mix of inputs (NULL buffers, short strings, random-ish 20/30/100-byte strings, and full-window 16-32 KiB cases) with precomputed expected CRCs. Field types and values are compatible with the existing `crc32_variant` test harness and the new fold-copy tests.
Force-pushed 069fbac to f1861dc (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
69-76: Ensure benchmarks return immediately afterSkipWithErrorfor unsupported CPUs.When
support_flagis false, the macro callsstate.SkipWithError(...)but still goes on to callBench(...). This can execute instructions the CPU doesn't support (e.g., PCLMULQDQ/AVX512), potentially causing an illegal-instruction fault.Short-circuit after skipping:
#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \ BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \ if (!support_flag) { \ state.SkipWithError("CPU does not support " #name); \ + return; \ } \ Bench(state, resfunc, copyfunc, finfunc); \ } \
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
31-39: Consider using `zng_alloc` for consistent alignment across benchmarks.
The SetUp method uses `malloc` for buffer allocation, whereas other benchmarks in the codebase (e.g., `benchmark_crc32.cc`, `benchmark_adler32_copy.cc`) use `zng_alloc`. Using `malloc` doesn't guarantee 16-byte alignment, which can cause SIMD implementations to cross cacheline boundaries unpredictably and introduce variance in benchmark results.
Apply this diff to align with other benchmarks:

```diff
-    void SetUp(const ::benchmark::State& state) {
-        testdata = (uint32_t *)malloc(BUFSIZE);
-        dstbuf = (uint8_t *)malloc(BUFSIZE);
-        assert((testdata != NULL) && (dstbuf != NULL));
+    void SetUp(const ::benchmark::State& state) {
+        testdata = (uint32_t *)zng_alloc(BUFSIZE);
+        dstbuf = (uint8_t *)zng_alloc(BUFSIZE);
+        assert((testdata != NULL) && (dstbuf != NULL));
```

And update TearDown:

```diff
     void TearDown(const ::benchmark::State& state) {
-        free(testdata);
-        free(dstbuf);
+        zng_free(testdata);
+        zng_free(dstbuf);
     }
```
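For illustration, here is one portable way an aligned-allocation helper can be built (a sketch of the general technique, not zlib-ng's actual `zng_alloc` implementation): over-allocate with `malloc`, round the pointer up to a 16-byte boundary, and stash the original pointer just below the aligned block so it can be freed later.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Over-allocate, align up to 16 bytes, and remember the malloc'd pointer
// immediately before the aligned block.
static void *alloc_aligned16(size_t size) {
    void *raw = malloc(size + 16 + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t base = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (base + 15) & ~(uintptr_t)15;
    ((void **)aligned)[-1] = raw;   // stored for free_aligned16()
    return (void *)aligned;
}

static void free_aligned16(void *ptr) {
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}

// Allocates, verifies 16-byte alignment, touches the memory, and frees.
static bool aligned16_roundtrip(size_t size) {
    void *p = alloc_aligned16(size);
    if (p == NULL)
        return false;
    bool ok = ((uintptr_t)p & 15) == 0;
    memset(p, 0, size);             // the full requested size must be usable
    free_aligned16(p);
    return ok;
}
```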
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
test/benchmarks/CMakeLists.txt (1 hunks)
test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Applied to files:
test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (3)
arch/arm/crc32_armv8.c (1): crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2): state (21-48), state (35-43)
test/benchmarks/benchmark_adler32_copy.cc (1): adler32_copy (24-63)
f1861dc to 9216a65 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
85-92: Critical: Add return after `SkipWithError` to prevent executing unsupported instructions.
This critical issue from the previous review remains unresolved. When `support_flag` is false, the macro calls `state.SkipWithError(...)` but continues to execute `Bench(...)`, which can invoke CPU instructions the hardware doesn't support (e.g., PCLMULQDQ, VPCLMULQDQ, ARM CRC32). This can cause illegal-instruction faults and crash the benchmark.
Add a `return` statement to short-circuit after skipping:

```diff
 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);
```
57-77: CRC state is still accumulated across iterations, skewing measurements.
The previous review raised this concern, and it remains unresolved. The `fold_reset` call at line 61 occurs once before the benchmark loop, causing each iteration to accumulate CRC data from previous iterations. This creates dependencies that don't represent typical usage and can skew timing measurements.
While rotating misalignment is beneficial for testing real-world scenarios, the CRC state should be independent per iteration.
Move the `fold_reset` call inside the benchmark loop:

```diff
 void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset,
            crc32_fold_copy_func fold_copy, crc32_fold_final_func fold_final) {
     int misalign = 0;

-    // Prepare an initial crc state
-    fold_reset(&crc_st);
     crc = 0;

     // Benchmark the CRC32 fold copy operation
     for (auto _ : state) {
+        fold_reset(&crc_st);
         fold_copy(&crc_st, dstbuf + misalign, (const unsigned char*)testdata + misalign, (size_t)state.range(0));
         misalign++;
         if (misalign > 14)
             misalign = 0;
     }

     // Finalize the CRC32 calculation
     crc = fold_final(&crc_st);
```
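The effect of where the reset lives can be demonstrated with a toy fold state. The names mirror the benchmark's reset/copy/final API, but the accumulator arithmetic is a placeholder, not a real CRC:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-ins for the crc32_fold API; the accumulator is a plain sum.
struct toy_fold { uint32_t acc; };

static void toy_reset(toy_fold *st) { st->acc = 0; }

static void toy_fold_copy(toy_fold *st, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++)
        st->acc += src[i];          // "fold" the input into the state
}

static uint32_t toy_final(const toy_fold *st) { return st->acc; }

// Runs the loop both ways: resetting per iteration yields a result that is
// independent of the iteration count; resetting once lets state accumulate.
static uint32_t fold_message(int iters, bool reset_each_iteration) {
    const uint8_t msg[4] = {1, 2, 3, 4};
    toy_fold st;
    toy_reset(&st);                 // the single up-front reset
    for (int i = 0; i < iters; i++) {
        if (reset_each_iteration)
            toy_reset(&st);         // per-iteration reset, as suggested
        toy_fold_copy(&st, msg, sizeof(msg));
    }
    return toy_final(&st);
}
```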
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
48-49: Consider using `zng_alloc` for consistency with other benchmarks.
Other benchmark files in this directory (`benchmark_crc32.cc`, `benchmark_adler32_copy.cc`) use `zng_alloc` instead of `malloc` for buffer allocation. Using `zng_alloc` provides consistent memory allocation behavior across all benchmarks.
Apply this diff:

```diff
-    testdata = (uint32_t *)malloc(BUFSIZE);
-    dstbuf = (uint8_t *)malloc(BUFSIZE);
+    testdata = (uint32_t *)zng_alloc(BUFSIZE);
+    dstbuf = (uint8_t *)zng_alloc(BUFSIZE);
```

And update TearDown:

```diff
-    free(testdata);
-    free(dstbuf);
+    zng_free(testdata);
+    zng_free(dstbuf);
```
📜 Review details
📒 Files selected for processing (2)
test/benchmarks/CMakeLists.txt (1 hunks)
test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-11-15T12:43:50.332Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2001
File: test/CMakeLists.txt:10-12
Timestamp: 2025-11-15T12:43:50.332Z
Learning: In zlib-ng, the TEST_STOCK_ZLIB CMake variable is intentionally not defined as a CMake option. It's meant to be set manually via command-line (e.g., -DTEST_STOCK_ZLIB=ON) when testing against stock zlib. The testing workflow for stock zlib requires undocumented manual steps and is considered experimental/incomplete.
Applied to files:
test/benchmarks/CMakeLists.txt
🧬 Code graph analysis (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (3)
arch/arm/crc32_armv8.c (1)
crc32_fold_copy_armv8(76-79)test/benchmarks/benchmark_adler32_copy.cc (1)
adler32_copy(24-63)test/benchmarks/benchmark_crc32.cc (2)
state(21-48)state(35-43)
test/benchmarks/CMakeLists.txt (2)
test/benchmarks/benchmark_adler32_copy.cc (1)
adler32_copy(24-63)test/benchmarks/benchmark_crc32.cc (2)
state(21-48)state(35-43)
🔇 Additional comments (1)
test/benchmarks/CMakeLists.txt (1)
47-47: LGTM! The new benchmark source file is correctly added to the executable's source list.
9216a65 to 20815a2 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
61-81: Reset CRC state per iteration for independent measurements.
The CRC state is reset once before the benchmark loop (line 65), then reused across all iterations. Each iteration accumulates CRC data from previous iterations, creating dependencies that can skew timing measurements and don't represent typical usage patterns.
Move the `fold_reset` call inside the benchmark loop:

```diff
 void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset,
            crc32_fold_copy_func fold_copy, crc32_fold_final_func fold_final) {
     int misalign = 0;

-    // Prepare an initial crc state
-    fold_reset(&crc_st);
     crc = 0;

     // Benchmark the CRC32 fold copy operation
     for (auto _ : state) {
+        fold_reset(&crc_st);
         fold_copy(&crc_st, dstbuf + misalign, (const unsigned char*)testdata + misalign, (size_t)state.range(0));
         misalign++;
         if (misalign > 14)
             misalign = 0;
     }

     // Finalize the CRC32 calculation
     crc = fold_final(&crc_st);
```
89-96: Ensure benchmarks return immediately after `SkipWithError` for unsupported CPUs.
When `support_flag` is false, the macro calls `state.SkipWithError(...)` but continues to call `Bench(...)`. This can execute instructions the CPU doesn't support (e.g., PCLMULQDQ/ARM CRC), potentially causing illegal-instruction faults.
Short-circuit after skipping:

```diff
 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
```
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
52-53: Consider using `zng_alloc` for consistency with other benchmarks.
The codebase's other benchmarks use `zng_alloc`/`zng_free` instead of `malloc`/`free`.
Apply this diff:

```diff
-    testdata = (uint32_t *)malloc(BUFSIZE);
-    dstbuf = (uint8_t *)malloc(BUFSIZE);
+    testdata = (uint32_t *)zng_alloc(BUFSIZE);
+    dstbuf = (uint8_t *)zng_alloc(BUFSIZE);
```

And update TearDown:

```diff
-    free(testdata);
-    free(dstbuf);
+    zng_free(testdata);
+    zng_free(dstbuf);
```
📜 Review details
📒 Files selected for processing (2)
test/benchmarks/CMakeLists.txt (1 hunks)
test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1): crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2): state (21-48), state (35-43)
Uses local functions for benchmarking some of the run-time selected variants.
20815a2 to dbe3d8a (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
89-96: Return immediately after `SkipWithError` to avoid executing unsupported implementations
In `BENCHMARK_CRC32_FOLD`, when `support_flag` is false you call `state.SkipWithError(...)` but still fall through to `Bench(...)`. On a CPU that lacks the required instructions (e.g., PCLMULQDQ/VPCLMULQDQ/ARM CRC), this can still execute those code paths and potentially crash with an illegal instruction.
Add an early `return` after `SkipWithError`:

```diff
-#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
-    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
-        if (!support_flag) { \
-            state.SkipWithError("CPU does not support " #name); \
-        } \
-        Bench(state, resfunc, copyfunc, finfunc); \
-    } \
+#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
+    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
+        if (!support_flag) { \
+            state.SkipWithError("CPU does not support " #name); \
+            return; \
+        } \
+        Bench(state, resfunc, copyfunc, finfunc); \
+    } \
```

Please confirm this matches how your version of Google Benchmark handles `SkipWithError` (i.e., it does not itself abort execution of the fixture body) and that no other macros rely on the previous fall-through behavior.
🧹 Nitpick comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
50-58: Full buffer initialization removes potential UB for larger sizes
`SetUp` allocates `BUFSIZE` bytes for both buffers and fills `BUFSIZE/sizeof(uint32_t)` entries, so all bytes that can be touched by lengths up to `32<<10` plus misalignment are initialized. This fixes the earlier risk of reading uninitialized memory in the benchmark.
If you want consistency with `test/benchmarks/benchmark_crc32.cc`, you could switch to `zng_alloc`/`zng_free` instead of `malloc`/`free`, but it's not required for correctness.
60-81: Bench loop behavior and misalignment pattern look intentional
The streaming-style usage (`fold_reset` once, then repeated `fold_copy` calls and a single `fold_final` at the end) matches the pattern in `test/benchmarks/benchmark_crc32.cc`, where the CRC state is carried across iterations. The misalignment cycle `misalign = 0..14` is safely bounded given `BUFSIZE = 32768 + 16 + 16`, so even the largest `Arg(32<<10)` plus misalignment stays in-bounds for both src and dst.
If you ever want to benchmark independent messages instead of a single long stream, you could add a separate mode that calls `fold_reset`/`fold_final` per iteration (possibly with `state.PauseTiming()` around the reset/final if you want to exclude their cost).
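The in-bounds claim reduces to simple arithmetic: the largest benchmarked length plus the worst-case misalignment offset must fit in the buffer. A sketch using the numbers quoted above:

```cpp
#include <cassert>
#include <cstddef>

// True when every access at buf + misalign .. buf + misalign + len - 1
// stays inside a buffer of bufsize bytes.
static bool access_in_bounds(size_t bufsize, size_t max_len, size_t max_misalign) {
    return max_misalign + max_len <= bufsize;
}
```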
📜 Review details
📒 Files selected for processing (2)
test/benchmarks/CMakeLists.txt (1 hunks)
test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1): crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2): state (21-48), state (35-43)
🔇 Additional comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
17-41: Fold-copy wrapper helpers look correct and consistent with existing patterns
The braid/chorba fold-copy shims update `crc->value` then `memcpy` the source to destination, matching the `crc32_fold_copy_armv8` style from `arch/arm/crc32_armv8.c` and respecting the `WITHOUT_*`/x86 feature guards. No correctness or UB concerns here.
98-129: Benchmark registrations and feature gating look coherent
The generic `braid_c` plus `native` (when `DISABLE_RUNTIME_CPU_DETECTION` is set) and the per-architecture registrations (ARMv8, Chorba variants, PCLMULQDQ/VPCLMULQDQ, LoongArch) are guarded by the appropriate compile-time macros and `test_cpu_features` runtime checks. The Arg list matches the sizes discussed in the PR (16, 48, 192, 512, 4K, 16K, 32K) and aligns with the chosen `BUFSIZE`.
Since some variants are run-time selected, we need a few local implementations.