
Add benchmark for crc32_fold_copy functions #2008

Merged
Dead2 merged 1 commit into develop from bench_crc32_fold_copy on Nov 16, 2025

Conversation

Member

@Dead2 Dead2 commented Nov 14, 2025

Since some variants are run-time selected, we need a few local implementations.
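To make the rationale concrete, here is a minimal sketch of why run-time selection forces the benchmark to carry local copies: a dispatcher hides the individual variants behind one function pointer, so benchmarking each variant separately requires calling them directly. The names (`crc32_generic`, `crc32_fast`, `cpu_has_pclmul`, `select_crc32`) are hypothetical stand-ins; in zlib-ng the real variants sit behind the functable and CPU feature flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint32_t (*crc32_func)(uint32_t crc, const uint8_t *buf, size_t len);

// Plain bit-by-bit reflected CRC-32 (polynomial 0xEDB88320), always available.
static uint32_t crc32_generic(uint32_t crc, const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

// In the real code this would be a SIMD variant; here it just forwards.
static uint32_t crc32_fast(uint32_t crc, const uint8_t *buf, size_t len) {
    return crc32_generic(crc, buf, len);
}

static bool cpu_has_pclmul = false; // filled in by runtime CPU detection

// A selector like this hides the variants from callers, which is why the
// benchmark needs its own local wrappers to reach each one directly.
static crc32_func select_crc32(void) {
    return cpu_has_pclmul ? crc32_fast : crc32_generic;
}
```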

Member Author

Dead2 commented Nov 14, 2025

Edit: Removed incorrect benchmark numbers, see next comment below instead for updated numbers. Regression is still there though.

There seems to be a performance regression with pclmul. @KungFuJesus, any ideas?

Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

Adds a new CRC32 fold-copy benchmark source and integrates it into the benchmarks build; the benchmark fixture runs fold-copy workloads across multiple CRC32 implementations and architecture-specific variants, varying input size and misalignment while finalizing the fold state.

Changes

Cohort / File(s) Summary
Build configuration
test/benchmarks/CMakeLists.txt
Adds benchmark_crc32_fold_copy.cc to the benchmark_zlib executable source list.
CRC32 fold-copy benchmark
test/benchmarks/benchmark_crc32_fold_copy.cc
New benchmark file introducing crc32_fc fixture, BENCHMARK_CRC32_FOLD macro, per-implementation wrapper functions (braid, chorba, SSE/SSSE3/SSE4, PCLMUL/VPCLMUL, ARMv8, LoongArch) guarded by compile-time and runtime CPU feature checks, and registered benchmark variants with multiple size and misalignment arguments.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Bench as BenchmarkRunner
    participant Fixture as crc32_fc (SetUp/TearDown)
    participant Driver as bench_loop
    participant Impl as crc32_fold_impl (braid/chorba/arm/loongarch/...)
    participant Fold as crc32_fold (fold_copy/fold_final)

    rect rgb(240,248,255)
        Bench->>Fixture: create & SetUp (alloc + seed input)
        Bench->>Driver: invoke BENCHMARK loop (sizes, misalign)
        Driver->>Impl: call fold_copy(src, dst, &state)
        Impl->>Fold: perform fold_copy operations
        Driver->>Fold: fold_final(&state)
        Fold-->>Bench: result (prevent optimize)
        Bench->>Fixture: TearDown (free)
    end

Possibly related PRs

Suggested reviewers

  • KungFuJesus
  • nmoinvaz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add benchmark for crc32_fold_copy functions' directly and accurately describes the main change: adding a new benchmark for crc32_fold_copy functions.
Description check ✅ Passed The description relates to the changeset by explaining the rationale (need for local implementations due to run-time variant selection), though it is brief and lacks implementation details.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
test/crc32_test_strings_p.h (1)

4-22: Make the test header more self-contained to reduce include-order brittleness

This header relies on ALIGNED_, uint8_t, and size_t being defined by earlier includes in the including .cc files. That’s fragile for a shared test header and easy to break with future refactors.

Consider explicitly including the necessary standard headers (and, if appropriate, zbuild.h for ALIGNED_) here, or at least documenting that this header must be included after those definitions.
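A self-contained version of the header could look like the sketch below. The include list and the `crc32_test_vector` shape are assumptions for illustration; the real table contents live in the PR, and whether `zbuild.h` is the right home for `ALIGNED_` should be verified against the tree.

```cpp
// Sketch of a self-contained test/crc32_test_strings_p.h.
#ifndef CRC32_TEST_STRINGS_P_H_
#define CRC32_TEST_STRINGS_P_H_

// Pull in the types the tables rely on, instead of depending on
// include order in the including .cc files.
#include <cstddef>   // size_t
#include <cstdint>   // uint8_t, uint32_t
// #include "zbuild.h"  // would supply ALIGNED_ if the buffers need it

// Hypothetical shape of one parameterized test vector:
// input bytes, length, expected CRC-32.
struct crc32_test_vector {
    const uint8_t *buf;
    size_t len;
    uint32_t expect;
};

#endif // CRC32_TEST_STRINGS_P_H_
```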

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8003f57 and c403a94.

📒 Files selected for processing (7)
  • arch/generic/generic_functions.h (1 hunks)
  • test/CMakeLists.txt (1 hunks)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
  • test/crc32_test_strings_p.h (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • test/test_crc32_fold_copy.cc (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • arch/generic/generic_functions.h
  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-06-18T19:28:32.987Z
Learnt from: phprus
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/slide_hash_lsx.c:64-64
Timestamp: 2025-06-18T19:28:32.987Z
Learning: In zlib-ng's slide_hash functions, when calling slide_hash_chain(), the fourth parameter should be the window size (wsize) representing the number of entries in the prev table, not s->w_size directly. The local variable wsize is typically cast to the appropriate type (uint16_t) from s->w_size for this purpose.

Applied to files:

  • arch/generic/generic_functions.h
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Applied to files:

  • arch/generic/generic_functions.h
  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • arch/generic/generic_functions.h
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.

Applied to files:

  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Applied to files:

  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (4)
arch/generic/generic_functions.h (2)
crc32.h (1)
  • crc32_fold_s (20-23)
functable.c (1)
  • crc32_fold_stub (402-405)
test/benchmarks/CMakeLists.txt (2)
test/benchmarks/benchmark_crc32.cc (1)
  • crc32 (21-48)
test/benchmarks/benchmark_adler32_copy.cc (1)
  • adler32_copy (24-63)
test/test_crc32_fold_copy.cc (1)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
🔇 Additional comments (5)
test/benchmarks/CMakeLists.txt (1)

47-47: Benchmark source wired in cleanly.

The fold-copy benchmark now builds with the rest of the suite without touching other targets.

test/CMakeLists.txt (1)

186-187: New fold-copy test registered appropriately.

Including test_crc32_fold_copy.cc under the ZLIBNG test gate keeps build knobs consistent.

test/test_crc32.cc (1)

12-19: Shared CRC32 vectors plugged in smoothly.

Reusing crc32_tests keeps the parameterized suite intact while centralizing the data source.

Also applies to: 77-77

arch/generic/generic_functions.h (1)

15-17: Function pointer typedefs look correct.

The new fold-reset/copy/final aliases mirror the existing prototypes and expose them cleanly to consumers.

test/test_crc32_fold_copy.cc (1)

21-83: Great coverage of fold-copy paths.

The fixture drives reset/copy/final across all feature variants and double-checks the copied payload, which is exactly what we need here.

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from c403a94 to 3ea4cc7 on November 14, 2025 15:00
Member Author

Dead2 commented Nov 14, 2025

Since input data is nearly never aligned, I changed the benchmark to add misalignment to both the input and output buffers. Measured performance went down across the board, but this is more representative of the real-world usage of these functions.

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
crc32_fc/generic/16             23.5 ns         23.5 ns     89272216
crc32_fc/generic/48             26.8 ns         26.8 ns     78404637
crc32_fc/generic/192            32.9 ns         32.9 ns     63991392
crc32_fc/generic/512            54.2 ns         54.2 ns     38770049
crc32_fc/generic/4096            140 ns          140 ns     14969916
crc32_fc/generic/16384           415 ns          415 ns      5077014
crc32_fc/generic/32768          1116 ns         1115 ns      1882548
crc32_fc/pclmulqdq/16           15.9 ns         15.9 ns    132193992
crc32_fc/pclmulqdq/48           17.5 ns         17.4 ns    120368352
crc32_fc/pclmulqdq/192          22.7 ns         22.7 ns     92563589
crc32_fc/pclmulqdq/512          34.8 ns         34.7 ns     60475368
crc32_fc/pclmulqdq/4096          191 ns          191 ns     10980335
crc32_fc/pclmulqdq/16384         726 ns          725 ns      2899615
crc32_fc/pclmulqdq/32768        1582 ns         1581 ns      1327845
crc32_fc/vpclmulqdq/16          15.6 ns         15.6 ns    134519234
crc32_fc/vpclmulqdq/48          17.7 ns         17.7 ns    118561629
crc32_fc/vpclmulqdq/192         23.5 ns         23.4 ns     89594099
crc32_fc/vpclmulqdq/512         43.0 ns         43.0 ns     48840976
crc32_fc/vpclmulqdq/4096         102 ns          102 ns     20660312
crc32_fc/vpclmulqdq/16384        320 ns          320 ns      6565988
crc32_fc/vpclmulqdq/32768        699 ns          698 ns      3007749
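The rotating-misalignment scheme described above can be sketched as follows: each iteration shifts both source and destination by 0..14 bytes before the fold-copy call, so aligned fast paths are not the only thing measured. The fold function is stubbed with `memcpy` here and the helper names are illustrative, not the benchmark's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static void fold_copy_stub(uint8_t *dst, const uint8_t *src, size_t len) {
    std::memcpy(dst, src, len); // stands in for fold_copy(&crc_st, dst, src, len)
}

// Runs `iters` iterations and returns the final misalignment offset,
// which cycles 0,1,...,14,0,1,... like the benchmark loop.
static int run_iterations(uint8_t *dst, const uint8_t *src, size_t len, int iters) {
    int misalign = 0;
    for (int i = 0; i < iters; i++) {
        fold_copy_stub(dst + misalign, src + misalign, len);
        misalign++;
        if (misalign == 15)
            misalign = 0;
    }
    return misalign;
}
```

Cycling both pointers through offsets 0..14 covers every sub-16-byte alignment without ever exceeding the buffers' 15-byte slack.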

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

24-28: Initialize the entire buffer to avoid undefined behavior

The loop still only initializes BUFSIZE/sizeof(uint32_t) bytes (8196 bytes), leaving the remaining ~24KB uninitialized. Since benchmarks read up to 32768 bytes, this causes reads from uninitialized memory.

Apply this diff:

     void SetUp(const ::benchmark::State& state) {
-        for (uint32_t i = 0; i < BUFSIZE/sizeof(uint32_t); i++) {
+        for (uint32_t i = 0; i < BUFSIZE; i++) {
             testdata[i] = rand();
         }
     }

53-60: Add return statement after SkipWithError to prevent executing unsupported code

When support_flag is false, the benchmark calls state.SkipWithError but continues to execute Bench(...), which will invoke architecture-specific instructions the CPU doesn't support. This can crash with SIGILL (illegal instruction).

Apply this diff:

 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

30-50: Consider resetting CRC state between benchmark iterations

The CRC state (crc_st) is initialized once (line 34) and then accumulates data across all benchmark iterations. This means the state processes potentially megabytes of data before finalization, which may not reflect typical usage patterns and could skew measurements if CRC behavior varies with accumulated state.

Consider resetting the state each iteration:

     void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset, crc32_fold_copy_func fold_copy,
                crc32_fold_final_func fold_final) {
         int misalign = 0;
-        // Prepare an initial crc state
-        fold_reset(&crc_st);
         crc = 0;
 
         // Benchmark the CRC32 fold copy operation
         for (auto _ : state) {
+            fold_reset(&crc_st);
             fold_copy(&crc_st, dstbuf + misalign, testdata + misalign, (size_t)state.range(0));
+            crc = fold_final(&crc_st);
             misalign++;
             if (misalign == 15)
                 misalign = 0;
         }
 
-        // Finalize the CRC32 calculation
-        crc = fold_final(&crc_st);
-
         // Prevent the result from being optimized away
         benchmark::DoNotOptimize(crc);
     }

Alternatively, protect dstbuf from optimization to ensure the copy operation is measured:

         // Prevent the result from being optimized away
         benchmark::DoNotOptimize(crc);
+        benchmark::DoNotOptimize(dstbuf);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c403a94 and 3ea4cc7.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: macOS GCC UBSAN (ARM64)

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch 2 times, most recently from c16284b to a72a6f4 on November 14, 2025 16:01
@codecov

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 0% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.26%. Comparing base (c917ed6) to head (dbe3d8a).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
test/benchmarks/benchmark_crc32_fold_copy.cc 0.00% 45 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2008      +/-   ##
===========================================
+ Coverage    80.05%   81.26%   +1.20%     
===========================================
  Files          162      163       +1     
  Lines        13558    12862     -696     
  Branches      3474     3171     -303     
===========================================
- Hits         10854    10452     -402     
+ Misses        1640     1352     -288     
+ Partials      1064     1058       -6     


@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from a72a6f4 to 98fceeb on November 14, 2025 18:03
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

63-70: Add return after SkipWithError to prevent executing unsupported instructions.

When support_flag is false, the macro calls state.SkipWithError(...) but continues to execute Bench(...), which may invoke CPU instructions unsupported by the current processor (e.g., PCLMULQDQ, VPCLMULQDQ, ARM CRC32). This can trigger illegal instruction faults and crash the benchmark.

Apply this diff:

 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);
🧹 Nitpick comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

26-27: Consider using zng_alloc/zng_free for consistency.

The existing benchmark_crc32.cc uses zng_alloc/zng_free for buffer allocation. For consistency across the benchmark suite, consider switching from malloc/free to zng_alloc/zng_free.


43-47: Consider documenting the misalignment pattern.

The rotating misalignment (0-14 bytes) intentionally tests various alignment scenarios. A brief comment explaining this design choice would improve code clarity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a72a6f4 and 98fceeb.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (1)
  • crc32 (21-48)
🔇 Additional comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

30-32: Buffer initialization is now correct.

The loop correctly initializes the entire buffer (BUFSIZE bytes) by iterating BUFSIZE/sizeof(uint32_t) times over uint32_t elements. This resolves the previously flagged concern about reading uninitialized memory in larger benchmark runs.


72-94: LGTM: Benchmark registrations are properly structured.

The generic benchmark is always available, and platform-specific variants are correctly gated by both compile-time macros (ARM_CRC32, X86_PCLMULQDQ_CRC, etc.) and runtime feature detection flags. The compound condition for vpclmulqdq (line 88) properly checks all required CPU features.

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from 98fceeb to 069fbac on November 14, 2025 23:39
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

63-69: Short-circuit after SkipWithError to avoid executing unsupported instructions

When support_flag is false, the macro calls state.SkipWithError but still falls through to Bench, which may execute PCLMUL/VPCLMUL/ARM CRC/LoongArch CRC instructions on CPUs that don’t support them, risking illegal-instruction faults.

Add an early return after SkipWithError:

-#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
-    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
-        if (!support_flag) { \
-            state.SkipWithError("CPU does not support " #name); \
-        } \
-        Bench(state, resfunc, copyfunc, finfunc); \
-    } \
+#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
+    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
+        if (!support_flag) { \
+            state.SkipWithError("CPU does not support " #name); \
+            return; \
+        } \
+        Bench(state, resfunc, copyfunc, finfunc); \
+    } \
🧹 Nitpick comments (1)
test/test_crc32_fold_copy.cc (1)

49-79: Clarify and validate minlen/onlyzero constraints for optimized variants

The (minlen, onlyzero) parameters for native, pclmulqdq, and vpclmulqdq are conservative (e.g., minlen=16, onlyzero=1), but the inline comment still asks whether 16 bytes is actually the minimum for the PCLMUL-based implementations:

// Is 16 bytes len the minimum for pclmul functions?
TEST_CRC32_FOLD(pclmulqdq, 16, 1, ...)

It would be good to:

  • Confirm the real minimum-length and initial-CRC constraints from the corresponding implementations, and
  • Update minlen / onlyzero (and this comment) to reflect the actual guarantees, so you don’t unnecessarily skip valid test cases or accidentally allow invalid ones if the assumptions are off.
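Under the assumed meaning of these parameters, the gating could be sketched as below: skip inputs shorter than the variant's minimum length, and skip non-zero initial CRCs when the variant only supports starting from zero. The name `vector_applies` is hypothetical; the actual semantics should be confirmed against the TEST_CRC32_FOLD macro in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when a (len, init_crc) test vector is valid for a variant
// constrained by (minlen, onlyzero) — an assumed reading of those knobs.
static bool vector_applies(size_t len, uint32_t init_crc, size_t minlen, bool onlyzero) {
    if (len < minlen)
        return false;
    if (onlyzero && init_crc != 0)
        return false;
    return true;
}
```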
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98fceeb and 069fbac.

📒 Files selected for processing (7)
  • arch/generic/generic_functions.h (1 hunks)
  • test/CMakeLists.txt (1 hunks)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
  • test/crc32_test_strings_p.h (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • test/test_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/benchmarks/CMakeLists.txt
  • test/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
📚 Learning: 2024-12-22T20:40:03.280Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1833
File: test/test_crc32.cc:0-0
Timestamp: 2024-12-22T20:40:03.280Z
Learning: In the crc32_align test class (test/test_crc32.cc), the alignment offset parameter is always guaranteed to be non-negative because the test parameters are controlled and it never makes sense to have a negative offset.

Applied to files:

  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
📚 Learning: 2025-02-23T16:49:52.043Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Applied to files:

  • test/test_crc32.cc
  • arch/generic/generic_functions.h
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • test/test_crc32.cc
  • arch/generic/generic_functions.h
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Applied to files:

  • test/test_crc32.cc
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-06-18T19:28:32.987Z
Learnt from: phprus
Repo: zlib-ng/zlib-ng PR: 1925
File: arch/loongarch/slide_hash_lsx.c:64-64
Timestamp: 2025-06-18T19:28:32.987Z
Learning: In zlib-ng's slide_hash functions, when calling slide_hash_chain(), the fourth parameter should be the window size (wsize) representing the number of entries in the prev table, not s->w_size directly. The local variable wsize is typically cast to the appropriate type (uint16_t) from s->w_size for this purpose.

Applied to files:

  • arch/generic/generic_functions.h
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • arch/generic/generic_functions.h
  • test/test_crc32_fold_copy.cc
  • test/crc32_test_strings_p.h
  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (3)
arch/generic/generic_functions.h (2)
crc32.h (1)
  • crc32_fold_s (20-23)
functable.c (1)
  • crc32_fold_stub (402-405)
test/test_crc32_fold_copy.cc (1)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32_fold_copy.cc (3)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (1)
  • crc32 (21-48)
test/benchmarks/benchmark_adler32_copy.cc (1)
  • adler32_copy (24-63)
🔇 Additional comments (8)
arch/generic/generic_functions.h (1)

15-17: New crc32_fold function-pointer typedefs look consistent and correct

Signatures match the existing crc32_fold_reset_c, crc32_fold_copy_c, and crc32_fold_final_c prototypes below and provide a clean way to wire these into tests/benchmarks. No issues from a type or ABI perspective.

test/benchmarks/benchmark_crc32_fold_copy.cc (2)

25-32: Fixture buffer allocation and initialization are now sound

Allocating BUFSIZE for both testdata and dstbuf and filling BUFSIZE/sizeof(uint32_t) elements ensures the entire source region used by the benchmark is initialized, including when the start is misaligned by up to 14 bytes.


35-55: Misalignment and CRC accumulation logic in Bench are reasonable

The rotating misalign in [0, 14] combined with BUFSIZE = 32768 + 16 keeps all accesses in-bounds for the largest state.range(0) (32 KiB), while accumulating CRC over many calls for realistic throughput measurement. The pattern mirrors existing CRC/adler benchmarks in this repo.

test/test_crc32.cc (2)

8-16: Shared CRC32 test-data header is integrated cleanly

Including crc32_test_strings_p.h (with crc32_test/crc32_tests) inside the existing extern "C" block is appropriate and keeps the C-style test data usable from C++. The additional zutil_p.h include matches the header’s use of ALIGNED_ and friends.


72-72: Switch to crc32_tests data source for parameterized tests looks correct

INSTANTIATE_TEST_SUITE_P(crc32, crc32_variant, testing::ValuesIn(crc32_tests)); correctly wires the new shared test cases into the existing crc32_variant parameterized tests without changing the test logic.

test/test_crc32_fold_copy.cc (1)

17-44: Fold-copy fixture correctly validates both CRC and copy semantics

crc32_fold_test sets up a local crc32_fold state, enforces params.len <= BUFSIZE, applies the (minlen, onlyzero) gating, then runs reset → set initial crc → copy → final and checks both the resulting CRC and memcmp of src vs dst. This covers the essential correctness for all fold-copy variants without risking out-of-bounds accesses.

test/crc32_test_strings_p.h (2)

4-16: 32 KiB pattern buffer and setup function are straightforward and safe

crc32_fullwin_buf plus crc32_setup_buf() and the buf32k initializer give each TU a 32 KiB, 16-byte-aligned buffer filled with a simple 0–255 pattern, which is appropriate for the large-buffer CRC tests.


17-181: Shared crc32_tests table is well-structured and compatible with existing tests

The crc32_test struct and crc32_tests[] cover a broad mix of inputs (NULL buffers, short strings, random-ish 20/30/100-byte strings, and full-window 16–32 KiB cases) with precomputed expected CRCs. Field types and values are compatible with the existing crc32_variant test harness and the new fold-copy tests.

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from 069fbac to f1861dc Compare November 15, 2025 15:14

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

69-76: Ensure benchmarks return immediately after SkipWithError for unsupported CPUs.

When support_flag is false, the macro calls state.SkipWithError(...) but still goes on to call Bench(...). This can execute instructions the CPU doesn't support (e.g., PCLMULQDQ/AVX512), potentially causing an illegal-instruction fault.

Short-circuit after skipping:

 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

31-39: Consider using zng_alloc for consistent alignment across benchmarks.

The SetUp method uses malloc for buffer allocation, whereas other benchmarks in the codebase (e.g., benchmark_crc32.cc, benchmark_adler32_copy.cc) use zng_alloc. Using malloc doesn't guarantee 16-byte alignment, which can cause SIMD implementations to cross cacheline boundaries unpredictably and introduce variance in benchmark results.

Apply this diff to align with other benchmarks:

-    void SetUp(const ::benchmark::State& state) {
-        testdata = (uint32_t *)malloc(BUFSIZE);
-        dstbuf = (uint8_t *)malloc(BUFSIZE);
-        assert((testdata != NULL) && (dstbuf != NULL));
+    void SetUp(const ::benchmark::State& state) {
+        testdata = (uint32_t *)zng_alloc(BUFSIZE);
+        dstbuf = (uint8_t *)zng_alloc(BUFSIZE);
+        assert((testdata != NULL) && (dstbuf != NULL));

And update TearDown:

     void TearDown(const ::benchmark::State& state) {
-        free(testdata);
-        free(dstbuf);
+        zng_free(testdata);
+        zng_free(dstbuf);
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 069fbac and f1861dc.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (3)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2)
  • state (21-48)
  • state (35-43)
test/benchmarks/benchmark_adler32_copy.cc (1)
  • adler32_copy (24-63)

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from f1861dc to 9216a65 Compare November 16, 2025 11:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

85-92: Critical: Add return after SkipWithError to prevent executing unsupported instructions.

This critical issue from the previous review remains unresolved. When support_flag is false, the macro calls state.SkipWithError(...) but continues to execute Bench(...), which can invoke CPU instructions the hardware doesn't support (e.g., PCLMULQDQ, VPCLMULQDQ, ARM CRC32). This can cause illegal-instruction faults and crash the benchmark.

Add a return statement to short-circuit after skipping:

 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
     BENCHMARK_REGISTER_F(crc32_fc, name)->Arg(16)->Arg(48)->Arg(192)->Arg(512)->Arg(4<<10)->Arg(16<<10)->Arg(32<<10);

57-77: CRC state is still accumulated across iterations, skewing measurements.

The previous review raised this concern, and it remains unresolved. The fold_reset call at line 61 occurs once before the benchmark loop, causing each iteration to accumulate CRC data from previous iterations. This creates dependencies that don't represent typical usage and can skew timing measurements.

While rotating misalignment is beneficial for testing real-world scenarios, the CRC state should be independent per iteration.

Move the fold_reset call inside the benchmark loop:

     void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset, crc32_fold_copy_func fold_copy,
                crc32_fold_final_func fold_final) {
         int misalign = 0;
-        // Prepare an initial crc state
-        fold_reset(&crc_st);
         crc = 0;
 
         // Benchmark the CRC32 fold copy operation
         for (auto _ : state) {
+            fold_reset(&crc_st);
             fold_copy(&crc_st, dstbuf + misalign, (const unsigned char*)testdata + misalign, (size_t)state.range(0));
             misalign++;
             if (misalign > 14)
                 misalign = 0;
         }
 
         // Finalize the CRC32 calculation
         crc = fold_final(&crc_st);
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

48-49: Consider using zng_alloc for consistency with other benchmarks.

Other benchmark files in this directory (benchmark_crc32.cc, benchmark_adler32_copy.cc) use zng_alloc instead of malloc for buffer allocation. Using zng_alloc provides consistent memory allocation behavior across all benchmarks.

Apply this diff:

-        testdata = (uint32_t *)malloc(BUFSIZE);
-        dstbuf = (uint8_t *)malloc(BUFSIZE);
+        testdata = (uint32_t *)zng_alloc(BUFSIZE);
+        dstbuf = (uint8_t *)zng_alloc(BUFSIZE);

And update TearDown:

-        free(testdata);
-        free(dstbuf);
+        zng_free(testdata);
+        zng_free(dstbuf);
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f1861dc and 9216a65.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🧰 Additional context used
🧠 Learnings
📚 Learning: 2025-11-15T12:43:50.332Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2001
File: test/CMakeLists.txt:10-12
Timestamp: 2025-11-15T12:43:50.332Z
Learning: In zlib-ng, the TEST_STOCK_ZLIB CMake variable is intentionally not defined as a CMake option. It's meant to be set manually via command-line (e.g., -DTEST_STOCK_ZLIB=ON) when testing against stock zlib. The testing workflow for stock zlib requires undocumented manual steps and is considered experimental/incomplete.

Applied to files:

  • test/benchmarks/CMakeLists.txt
🧬 Code graph analysis (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (3)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_adler32_copy.cc (1)
  • adler32_copy (24-63)
test/benchmarks/benchmark_crc32.cc (2)
  • state (21-48)
  • state (35-43)
test/benchmarks/CMakeLists.txt (2)
test/benchmarks/benchmark_adler32_copy.cc (1)
  • adler32_copy (24-63)
test/benchmarks/benchmark_crc32.cc (2)
  • state (21-48)
  • state (35-43)
🔇 Additional comments (1)
test/benchmarks/CMakeLists.txt (1)

47-47: LGTM!

The new benchmark source file is correctly added to the executable's source list.

@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from 9216a65 to 20815a2 Compare November 16, 2025 11:53

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

61-81: Reset CRC state per iteration for independent measurements.

The CRC state is reset once before the benchmark loop (line 65), then reused across all iterations. Each iteration accumulates CRC data from previous iterations, creating dependencies that can skew timing measurements and don't represent typical usage patterns.

Move the fold_reset call inside the benchmark loop:

     void Bench(benchmark::State& state, crc32_fold_reset_func fold_reset, crc32_fold_copy_func fold_copy,
                crc32_fold_final_func fold_final) {
         int misalign = 0;
-        // Prepare an initial crc state
-        fold_reset(&crc_st);
         crc = 0;
 
         // Benchmark the CRC32 fold copy operation
         for (auto _ : state) {
+            fold_reset(&crc_st);
             fold_copy(&crc_st, dstbuf + misalign, (const unsigned char*)testdata + misalign, (size_t)state.range(0));
             misalign++;
             if (misalign > 14)
                 misalign = 0;
         }
 
         // Finalize the CRC32 calculation
         crc = fold_final(&crc_st);

89-96: Ensure benchmarks return immediately after SkipWithError for unsupported CPUs.

When support_flag is false, the macro calls state.SkipWithError(...) but continues to call Bench(...). This can execute instructions the CPU doesn't support (e.g., PCLMULQDQ/ARM CRC), potentially causing illegal-instruction faults.

Short-circuit after skipping:

 #define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
     BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
         if (!support_flag) { \
             state.SkipWithError("CPU does not support " #name); \
+            return; \
         } \
         Bench(state, resfunc, copyfunc, finfunc); \
     } \
🧹 Nitpick comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

52-53: Consider using zng_alloc for consistency with other benchmarks.

The codebase's other benchmarks use zng_alloc/zng_free instead of malloc/free.

Apply this diff for consistency:

-        testdata = (uint32_t *)malloc(BUFSIZE);
-        dstbuf = (uint8_t *)malloc(BUFSIZE);
+        testdata = (uint32_t *)zng_alloc(BUFSIZE);
+        dstbuf = (uint8_t *)zng_alloc(BUFSIZE);

And update TearDown:

-        free(testdata);
-        free(dstbuf);
+        zng_free(testdata);
+        zng_free(dstbuf);
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9216a65 and 20815a2.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2)
  • state (21-48)
  • state (35-43)

Uses local functions for benchmarking some of the run-time selected variants.
@Dead2 Dead2 force-pushed the bench_crc32_fold_copy branch from 20815a2 to dbe3d8a Compare November 16, 2025 15:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (1)

89-96: Return immediately after SkipWithError to avoid executing unsupported implementations

In BENCHMARK_CRC32_FOLD, when support_flag is false you call state.SkipWithError(...) but still fall through to Bench(...). On a CPU that lacks the required instructions (e.g., PCLMULQDQ/VPCLMULQDQ/ARM CRC), this can still execute those code paths and potentially crash with an illegal instruction.

Add an early return after SkipWithError:

-#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
-    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
-        if (!support_flag) { \
-            state.SkipWithError("CPU does not support " #name); \
-        } \
-        Bench(state, resfunc, copyfunc, finfunc); \
-    } \
+#define BENCHMARK_CRC32_FOLD(name, resfunc, copyfunc, finfunc, support_flag) \
+    BENCHMARK_DEFINE_F(crc32_fc, name)(benchmark::State& state) { \
+        if (!support_flag) { \
+            state.SkipWithError("CPU does not support " #name); \
+            return; \
+        } \
+        Bench(state, resfunc, copyfunc, finfunc); \
+    } \

Please confirm this matches how your version of Google Benchmark handles SkipWithError (i.e., it does not itself abort execution of the fixture body) and that no other macros rely on the previous fall‑through behavior.

🧹 Nitpick comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

50-58: Full buffer initialization removes potential UB for larger sizes

SetUp allocates BUFSIZE bytes for both buffers and fills BUFSIZE/sizeof(uint32_t) entries, so all bytes that can be touched by lengths up to 32<<10 plus misalignment are initialized. This fixes the earlier risk of reading uninitialized memory in the benchmark.

If you want consistency with test/benchmarks/benchmark_crc32.cc, you could switch to zng_alloc/zng_free instead of malloc/free, but it's not required for correctness.


60-81: Bench loop behavior and misalignment pattern look intentional

The streaming-style usage (fold_reset once, then repeated fold_copy calls and a single fold_final at the end) matches the pattern in test/benchmarks/benchmark_crc32.cc, where the CRC state is carried across iterations. The misalignment cycle misalign = 0..14 is safely bounded given BUFSIZE = 32768 + 16 + 16, so even the largest Arg(32<<10) plus misalignment stays in-bounds for both src and dst.

If you ever want to benchmark independent messages instead of a single long stream, you could add a separate mode that calls fold_reset/fold_final per iteration (possibly with state.PauseTiming() around the reset/final if you want to exclude their cost).

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 20815a2 and dbe3d8a.

📒 Files selected for processing (2)
  • test/benchmarks/CMakeLists.txt (1 hunks)
  • test/benchmarks/benchmark_crc32_fold_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/CMakeLists.txt
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
📚 Learning: 2024-10-08T21:51:45.330Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-23T16:51:54.545Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-01-23T22:01:53.422Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:42:40.488Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • test/benchmarks/benchmark_crc32_fold_copy.cc
🧬 Code graph analysis (1)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)
arch/arm/crc32_armv8.c (1)
  • crc32_fold_copy_armv8 (76-79)
test/benchmarks/benchmark_crc32.cc (2)
  • state (21-48)
  • state (35-43)
🔇 Additional comments (2)
test/benchmarks/benchmark_crc32_fold_copy.cc (2)

17-41: Fold‑copy wrapper helpers look correct and consistent with existing patterns

The braid/chorba fold‑copy shims update crc->value then memcpy the source to the destination, matching the crc32_fold_copy_armv8 style from arch/arm/crc32_armv8.c and respecting the WITHOUT_*/x86 feature guards. No correctness or UB concerns here.


98-129: Benchmark registrations and feature gating look coherent

The generic braid_c plus native (when DISABLE_RUNTIME_CPU_DETECTION is set) and the per‑architecture registrations (ARMv8, Chorba variants, PCLMULQDQ/VPCLMULQDQ, LoongArch) are guarded by the appropriate compile‑time macros and test_cpu_features runtime checks. The Arg list matches the sizes discussed in the PR (16, 48, 192, 512, 4K, 16K, 32K) and aligns with the chosen BUFSIZE.

@Dead2 Dead2 merged commit bb3be30 into develop Nov 16, 2025
301 of 304 checks passed
@Dead2 Dead2 mentioned this pull request Nov 16, 2025
@Dead2 Dead2 deleted the bench_crc32_fold_copy branch November 29, 2025 22:47