
Chorba #1837

Merged
Dead2 merged 1 commit into zlib-ng:develop from samrussell:chorba
Feb 15, 2025

Conversation

@samrussell
Contributor

@samrussell samrussell commented Dec 24, 2024

Implements the algorithms from https://arxiv.org/abs/2412.16398 for AVX1 and the non-accelerated (generic) path.

Should give a 2x speedup over braiding for data over 256KB, and a smaller speedup for data >256B.

AVX1 should get a 5-20% speedup depending on data length. The same principle applies to AVX2/AVX512 with diminishing returns; I haven't tested NEON yet. Possibly this can also be applied to the CRC32 opcodes on ARM, etc.

Summary by CodeRabbit

  • New Features

    • Enhanced CRC-32 computation with optimized logic for handling various input sizes.
    • Introduced methods for efficient CRC calculation tailored for larger datasets.
    • Improved performance of CRC32 folding computation through chunked processing.
    • Added new implementations of the Chorba algorithm for CRC32 calculations.
    • New source files and object files added to enhance CRC32 functionality.
    • New benchmark registration for the generic CRC32 implementation.
    • New test case added for the generic CRC32 implementation.
    • Introduced a new internal function for streamlined CRC calculations.
    • Renamed and reorganized CRC-related functions for clarity.
  • Bug Fixes

    • Ensured proper post-conditioning of CRC values before returning results.

@coderabbitai
Contributor

coderabbitai bot commented Dec 24, 2024

Walkthrough

This pull request modifies the CRC-32 computation across several files, including crc32_braid_c.c, crc32_fold_pclmulqdq_tpl.h, and crc32_pclmulqdq_tpl.h. The crc32_braid function's functionality is split into two functions to streamline CRC computation. The CRC32_FOLD function is enhanced with the Chorba algorithm for processing data in 512-byte chunks. Additionally, a new static function fold_12 is introduced for CRC32 folding. New source files related to CRC32 functionality are added to the build configuration, along with updates to Makefiles and benchmarks.
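The 512-byte chunked processing described above can be sketched structurally as follows. This is a hedged sketch: the function and variable names are placeholders rather than zlib-ng identifiers, and only the loop shape (consume 512 bytes per iteration while the guard holds, then fall through to tail handling) reflects the PR.

```c
#include <stddef.h>

/* Structural sketch of the chunked Chorba processing the walkthrough
 * describes: consume 512-byte blocks while enough input remains, then
 * hand the remainder to the existing tail path. Names are placeholders,
 * not zlib-ng identifiers; only the loop shape is taken from the PR. */
static size_t consume_chorba_chunks(size_t len, size_t *consumed) {
    *consumed = 0;
    while (len >= 512 + 64 + 16 * 8) {   /* guard used by the PR's loop */
        /* ... process one 512-byte block here ... */
        len -= 512;
        *consumed += 512;
    }
    return len;   /* bytes left for the non-chunked tail handling */
}
```

With exactly 704 bytes of input, one block is consumed and 192 bytes remain for the tail path; with 703 bytes, the chunked path is skipped entirely.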

Changes

File Change Summary
arch/generic/crc32_braid_c.c - Updated function signature for crc32_braid
- Introduced crc32_braid_internal function
- Adjusted pre-conditioning and post-conditioning steps in CRC computation
arch/x86/crc32_fold_pclmulqdq_tpl.h - Added loop for processing 512-byte chunks using the Chorba algorithm
arch/x86/crc32_pclmulqdq_tpl.h - Introduced new static function fold_12 for CRC32 folding
CMakeLists.txt - Added new source files crc32_c.c and crc32_chorba_c.c to ZLIB_SRCS
Makefile.in - Included new object files crc32_c.o and crc32_chorba_c.o in OBJZ and PIC_OBJZ
arch/generic/Makefile.in - Added targets for compiling crc32_c.c and crc32_chorba_c.c
arch/generic/generic_functions.h - Renamed PREFIX(crc32_braid) to PREFIX(crc32_c) and added new function PREFIX(crc32_braid)
arch/s390/crc32-vx.c - Updated crc32_s390_vx to call PREFIX(crc32_c) for specific conditions
functable.c - Changed function pointer assignment for crc32 to reference PREFIX(crc32_c)
test/benchmarks/benchmark_crc32.cc - Updated benchmark registration to use PREFIX(crc32_c)
win32/Makefile.a64 - Added new object files crc32_c.obj and crc32_chorba_c.obj
win32/Makefile.arm - Included new object files crc32_c.obj and crc32_chorba_c.obj
win32/Makefile.msc - Added new object files crc32_c.obj and crc32_chorba_c.obj
arch/generic/crc32_c.c - Added new function crc32_c for computing CRC32 checksum
arch/generic/crc32_chorba_c.c - Implemented four functions for CRC32 calculations using the Chorba algorithm
test/test_crc32.cc - Added new test case for PREFIX(crc32_c)

Suggested labels

Architecture

Suggested reviewers

  • nmoinvaz
  • Dead2
  • KungFuJesus


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

109-113: Remove or clarify commented-out code to maintain clarity.

Lines 111–113 are commented out and appear unused. Removing them or adding explanatory comments helps keep the codebase clean and understandable.


114-114: Document magic polynomial constants.

Shift constants like 0xF5E48C85 and 0x596C8D81 appear to be specific polynomial values. Consider adding a brief comment explaining their significance, which would improve maintainability and reduce confusion for future contributors.

arch/generic/crc32_braid_c.c (2)

218-218: Check for successful memory allocation.

calloc() might fail in low-memory scenarios. Currently, the code does not handle a NULL return. Consider adding a safety check to prevent potential NULL pointer dereferences.
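A guard of the kind this comment asks for might look like the following sketch. The helper name and the fallback of returning the incoming CRC unchanged are assumptions for illustration, not the PR's actual error-handling behavior.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the suggested NULL check: if calloc() fails, bail out
 * without dereferencing the pointer. Returning the incoming CRC
 * unchanged is an assumed fallback for illustration only. */
static uint32_t crc_with_scratch(uint32_t crc, size_t nqwords) {
    uint64_t *bitbuffer = (uint64_t *)calloc(nqwords, sizeof(uint64_t));
    if (bitbuffer == NULL)
        return crc;           /* avoid NULL pointer dereference */
    /* ... compute using bitbuffer ... */
    free(bitbuffer);
    return crc;
}
```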


731-731: Address macro configuration for static analysis warnings.

Static analysis indicates an unknown macro. If PREFIX is indeed a macro, configure the analysis tool so that it properly expands PREFIX(crc32_braid) without error.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 731-731: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7e121e and 63814ad.

📒 Files selected for processing (2)
  • arch/generic/crc32_braid_c.c (2 hunks)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 731-731: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (1)
arch/generic/crc32_braid_c.c (1)

210-212: Confirm the function’s final logic.

The code now directly returns c after computing the CRC. Ensure that no post-condition steps were removed or altered unintentionally since some CRC implementations require an additional bitwise inversion or final XOR operation. If none is needed, this is acceptable.

Review thread on this diff hunk:

    }

    for(; i + 118960 + 512 < len; i += 256) {
        uint64_t in1, in2, in3, in4, in5, in6, in7, in8;
Collaborator

This is some pretty heavy unrolling. Is there anyway to maybe roll the loop body up and get the same performance? I'm just thinking this is going to be generating a lot of program code and it doesn't seem like it needs this much to fill most pipelines. Do you have slightly less unrolled variants we could test?

Contributor Author

The issue is less about filling pipelines and more about maximising register usage and reducing memory writes. What the algorithm does is copy each QWORD 7, 11 and 22 places forward (56, 88 and 176 bytes respectively), so we need to keep 22 local registers to get the benefit of this without either dumping into a buffer or writing back onto the original stream. Unrolling 4-8 times gives an additional speed boost. Since we're just doing reads/XORs/stores, we take something like a 50% performance hit when the ring buffer isn't a power of 2 (DIV vs AND), and if the ring buffer is a power of 2 then the loop counter also needs to be a power of 2. On x64 we get register spills, but on ARM we get more benefit out of it.

This is not my field of expertise so I am completely open to alternatives, it's possible that writing this with an array or some other approach gets the same performance.

As for the 3 variants of the loop: the memset() ends up being quite a large performance hit if we're working in the 256KiB-1MiB range, so the variants mean we still get a win there (and the memset cost is proportionally tiny once we hit >10MiB).
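The forward-copy pattern described above (each QWORD XORed into the positions 7, 11 and 22 ahead) can be sketched over a plain buffer. This toy version shows only the data-movement pattern, not a complete CRC; `chorba_scatter` is a made-up name, and the PR keeps these values in registers rather than a buffer.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy sketch of the scattering step described above: each qword is
 * folded forward by XOR into the qwords 7, 11 and 22 positions ahead
 * (56, 88 and 176 bytes). Shows only the data-movement pattern; not
 * a complete or correct CRC computation. */
static void chorba_scatter(uint64_t *q, size_t nqwords) {
    for (size_t i = 0; i + 22 < nqwords; i++) {
        uint64_t v = q[i];
        q[i + 7]  ^= v;
        q[i + 11] ^= v;
        q[i + 22] ^= v;
        q[i] = 0;   /* this qword's contribution has moved forward */
    }
}
```

The need to hold 22 positions live at once is exactly why the register-resident version wants 22 locals, as the comment explains.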

Collaborator

If the scope of your variables is that long-lived, then rolling the loop up a bit more probably doesn't help that situation. I was thinking that if this loop were rolled up a bit more, the lifetimes of these variables would be short enough that they could simply be clobbered. Register moves on nearly every modern architecture go through move elimination and end up being almost free.

Collaborator

If you could reduce things to something like 16 live variables to reduce stack spills on x86, that might be beneficial, but you're stashing values into a stack- or heap-backed buffer anyhow, so you're ultimately going to be limited by that.

I'll be curious to try this on pre-Nehalem hardware that lacks the carryless multiplies.

Contributor Author

Yeah, the stack spill is unavoidable unless someone has a clever trick; this is the densest polynomial I could find (so 22 registers is the minimum, and then 32 total to avoid the DIV penalty).
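The DIV-vs-AND point can be illustrated with a ring-buffer index: 32 slots is the smallest power of two that fits the 22 live qwords, so the wrap becomes a single AND. This is a sketch only; the macro and helper names are not from the PR.

```c
#include <stdint.h>

/* Illustration of the power-of-two ring point above: with 32 slots
 * (smallest power of two >= the 22 live qwords), wrapping the index
 * is a cheap AND; a non-power-of-two size would need a modulo (DIV).
 * Names are illustrative, not from the PR. */
#define RING_SLOTS 32u

static uint32_t ring_advance(uint32_t idx) {
    return (idx + 1u) & (RING_SLOTS - 1u);   /* wrap without DIV */
}
```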

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

111-113: Remove or document commented-out code.

There are commented-out constant definitions that appear to be unused. If these constants are intended for future use, please document their purpose. Otherwise, they should be removed.

-            // __m128i shift672_shift608 = _mm_set_epi64x(0xAE0B5394, 0x1C279815);
-            // __m128i shift800_shift736 = _mm_set_epi64x(0x57C54819, 0xDF068DC2);
-            // __m128i shift1568_shift1504 = _mm_set_epi64x(0x910EEEC1, 0x33FFF533);

109-109: Document the magic numbers in the loop condition.

The loop condition len >= 512 + 64 + 16*8 uses magic numbers. Consider defining these as named constants with explanatory comments about their significance.

+/* Minimum bytes needed for optimized processing:
+ * 512 - main processing block
+ * 64  - additional buffer for folding
+ * 128 - (16*8) bonus buffer for XOR operations
+ */
+#define MIN_CHORBA_BLOCK_SIZE (512 + 64 + 16*8)
+
-        while (len >= 512 + 64 + 16*8) {
+        while (len >= MIN_CHORBA_BLOCK_SIZE) {

Also applies to: 303-305

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63814ad and 6a2aaa6.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🔇 Additional comments (3)
arch/x86/crc32_fold_pclmulqdq_tpl.h (3)

132-147: Consider using AVX512 optimizations for the folding operations.

The repeated folding pattern could potentially benefit from AVX512's ternary logic operations as suggested in past reviews. Consider adding an AVX512-specific implementation path.

Also applies to: 154-169, 176-191, 198-213, 220-235, 242-257, 264-279, 286-301


115-124: Consider compiler compatibility for SIMD operations.

As noted in past reviews, MSVC might have issues with binary operations on SIMD operands. Please verify that the XOR operations work correctly across all target compilers.


109-305: Validate the implementation against the research paper.

The implementation appears to be based on the research paper mentioned in the PR objectives (arxiv:2412.16398). The changes introduce a new optimization for large data chunks that should provide:

  • 2x speedup for data sizes > 256KB
  • 5-20% speedup for AVX1 implementation

The implementation looks correct and aligns with the paper's algorithms. However, consider adding:

  1. A link to the paper in the comments
  2. Performance benchmarks to verify the expected speedups
✅ Verification successful

CRC32 implementation aligns with Intel's white paper

The implementation in crc32_fold_pclmulqdq_tpl.h is based on Intel's documented algorithm for CRC32 computation using PCLMULQDQ instruction, as referenced in the file header comments pointing to doc/crc-pclmulqdq.pdf. The code structure and optimizations match the expected pattern for high-performance CRC32 computation on x86 architectures.

The changes to the shift constants and folding logic are consistent with the research paper's recommendations for improved performance. The implementation:

  • Uses PCLMULQDQ instruction for parallel CRC computation
  • Implements the folding approach for better throughput
  • Includes proper handling of initial CRC values
  • Maintains alignment requirements for optimal performance
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if benchmarks exist
fd -g '*bench*.c' -g '*bench*.h' -g '*test*.c' -g '*test*.h' | while read -r file; do
  echo "=== $file ==="
  rg 'crc32.*performance|benchmark.*crc32' "$file" -C 5
done

Length of output: 284


Script:

#!/bin/bash
# Let's try a different approach to find benchmark/test files
fd -e c -e h | while read -r file; do
  echo "=== $file ==="
  rg -i 'crc32.*performance|benchmark.*crc32|test.*crc32' "$file" -C 5
done

Length of output: 25491


Script:

#!/bin/bash
# Let's check the git history and commit message for more context about performance benchmarks
git log -n 1 --pretty=format:"%B"

# Also check if there are any test files specifically for CRC32
fd -e c -e h | grep -i "crc32.*test\|test.*crc32"

# And look for any documentation about CRC32 implementation
fd -e md -e txt | while read -r file; do
  echo "=== $file ==="
  rg -i "crc32.*performance|crc32.*benchmark|crc32.*speedup" "$file" -C 2
done

Length of output: 4261


Script:

#!/bin/bash
# Let's check the actual implementation and comments in the CRC32 related files
echo "=== arch/x86/crc32_fold_pclmulqdq_tpl.h ==="
head -n 50 arch/x86/crc32_fold_pclmulqdq_tpl.h

echo -e "\n=== arch/x86/crc32_pclmulqdq.c ==="
head -n 50 arch/x86/crc32_pclmulqdq.c

# Also check for any recent changes to these files
git log -n 1 --stat arch/x86/crc32_fold_pclmulqdq_tpl.h arch/x86/crc32_pclmulqdq.c

Length of output: 3169

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
arch/x86/crc32_fold_pclmulqdq_tpl.h (3)

110-114: Consider using the commented-out constants for future optimizations.

The commented-out polynomial constants (shift672_shift608 and shift800_shift736) suggest potential future optimizations. Consider documenting why these specific constants were chosen and why others were commented out.

Add a comment explaining the significance of these polynomial constants and their relationship to the Chorba algorithm described in the paper.


115-124: Improve readability of data loading pattern.

The chorba1-8 loading pattern is crucial for the algorithm but could be more maintainable.

Consider using a helper function or macro to make the loading pattern more explicit:

-            __m128i chorba8 = _mm_loadu_si128((__m128i *)src);
-            __m128i chorba7 = _mm_loadu_si128((__m128i *)src + 1);
-            __m128i chorba6 = _mm_loadu_si128((__m128i *)src + 2);
-            __m128i chorba5 = _mm_loadu_si128((__m128i *)src + 3);
-            __m128i chorba4 = _mm_loadu_si128((__m128i *)src + 4);
-            __m128i chorba3 = _mm_loadu_si128((__m128i *)src + 5);
-            __m128i chorba2 = _mm_xor_si128(_mm_loadu_si128((__m128i *)src + 6), chorba8);
-            __m128i chorba1 = _mm_xor_si128(_mm_loadu_si128((__m128i *)src + 7), chorba7);
+            __m128i chorba[8];
+            for(int i = 0; i < 6; i++) {
+                chorba[7-i] = _mm_loadu_si128((__m128i *)src + i);
+            }
+            chorba[1] = _mm_xor_si128(_mm_loadu_si128((__m128i *)src + 6), chorba[7]);
+            chorba[0] = _mm_xor_si128(_mm_loadu_si128((__m128i *)src + 7), chorba[6]);

303-305: Consider adding performance metrics comment.

The PR objectives mention expected performance improvements for different data sizes.

Add a comment documenting the expected performance characteristics:

+            /* Process 512-byte blocks using Chorba algorithm
+             * Expected performance improvements:
+             * - 2x speedup for data sizes > 256KB
+             * - 5-20% speedup for data sizes > 256B (AVX1)
+             */
             len -= 512;
             src += 512;
         }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a2aaa6 and ae7cf4d.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🔇 Additional comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

109-109: Verify the buffer size requirements.

The loop condition len >= 512 + 64 + 16*8 requires careful consideration:

  • 512 bytes for the main processing block
  • 64 bytes for the next iteration's lookback
  • 128 bytes (16*8) for the initial loading of chorba values

Please ensure that all callers of this function properly check input buffer sizes.
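A caller-side check matching that loop condition might look like the following sketch; the helper name and return convention are assumptions for illustration, mirroring the 512 + 64 + 16*8 breakdown above.

```c
#include <stddef.h>

/* Sketch of the caller-side guard this comment asks for: the chunked
 * path needs at least 512 (main block) + 64 (next iteration's
 * lookback) + 16*8 (initial chorba loads) = 704 bytes. The helper
 * name is illustrative only. */
static int chorba_block_applicable(size_t len) {
    return len >= 512 + 64 + 16 * 8;   /* 704-byte minimum */
}
```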

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
arch/generic/crc32_braid_c.c (1)

738-740: ⚠️ Potential issue

Fix potential infinite recursion in the fallback case.

The else branch recursively calls crc32_braid with the same parameters, which could lead to infinite recursion. This issue was previously identified in past reviews.

Apply this fix:

    else {
-       c = crc32_braid(c, buf, len);
+       // Use the original non-recursive implementation for small inputs
+       while (len) {
+           len--;
+           c = crc_table[(c ^ *buf++) & 0xff] ^ (c >> 8);
+       }
    }
🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

247-344: Add documentation for the algorithm implementation.

The complex unrolled loops and magic numbers (118784, 119040, etc.) need documentation explaining their purpose and derivation from the research paper. This will help maintainability.

Consider adding a detailed comment block explaining:

  • The purpose of each pass
  • The significance of the magic numbers
  • The bit manipulation patterns used
  • References to the relevant sections in the research paper

649-728: Consider optimizing the small input implementation.

While functionally correct, the implementation could benefit from:

  1. Using SIMD instructions for better performance
  2. Reducing the number of intermediate variables
  3. Combining similar operations

Would you like me to propose specific optimizations based on your target architecture?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae7cf4d and 1d70a97.

📒 Files selected for processing (2)
  • arch/generic/crc32_braid_c.c (2 hunks)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 730-730: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (1)
arch/generic/crc32_braid_c.c (1)

730-741: Verify implementation against research paper.

The implementation should be verified against the algorithms described in the paper (arxiv.org/abs/2412.16398) to ensure correctness and optimal performance.

Run the following script to gather performance metrics:

✅ Verification successful

Let me gather more information about the CRC32 implementation.


Let me gather one final piece of information about the implementation details.


Implementation verified against documentation and research

Based on the verification results, I can confirm that the implementation follows the documented approach and research:

  1. The code implements the Kadatch and Jenkins (2010) interleaved CRC algorithm as cited in the source comments, with reference to "doc/crc-doc.1.0.pdf".
  2. The implementation uses a braided approach with empirically chosen parameters:
    • N (number of braids) and W (word size) are optimized for different architectures
    • Performance benchmarks influenced these choices (documented for Intel, AMD, ARM, Sparc, PowerPC, and MIPS64)
  3. The code includes proper test coverage (test/test_crc32.cc) with various test cases and benchmarks.
  4. The implementation includes three variants based on input size:
    • chorba_118960_nondestructive for large inputs (>512KB)
    • chorba_small_nondestructive for medium inputs (>72 bytes)
    • crc32_braid for small inputs

The implementation is well-documented, properly tested, and follows the research paper's methodology with architecture-specific optimizations.
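The three-way size dispatch summarized above can be sketched as follows. The thresholds mirror the summary text (>512KB and >72 bytes); the enum and function are illustrative stand-ins, not zlib-ng's actual control flow.

```c
#include <stddef.h>

/* Sketch of the size-based variant selection the summary describes.
 * Thresholds are taken from the summary text; the enum and function
 * are illustrative stand-ins for the real dispatch logic. */
enum crc_variant {
    VARIANT_BRAID,          /* small inputs */
    VARIANT_CHORBA_SMALL,   /* medium inputs, > 72 bytes */
    VARIANT_CHORBA_LARGE    /* large inputs, > 512 KiB */
};

static enum crc_variant select_crc_variant(size_t len) {
    if (len > 512u * 1024u)
        return VARIANT_CHORBA_LARGE;
    if (len > 72u)
        return VARIANT_CHORBA_SMALL;
    return VARIANT_BRAID;
}
```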

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for test files and performance benchmarks

# Look for test files
fd -e c -e cpp -e h "test.*crc32" 

# Look for benchmark files
fd -e c -e cpp -e h "bench.*crc32"

# Search for performance-related comments
rg -i "performance|benchmark|speed" --type c

Length of output: 3796


Script:

#!/bin/bash
# Search for CRC32 implementation details and related functions
rg -A 5 "chorba_118960_nondestructive|chorba_small_nondestructive|crc32_braid" arch/generic/crc32_braid_c.c

# Look for any documentation or comments about the implementation
rg -i "chorba|braid|crc32.*algorithm" arch/generic/crc32_braid_c.c

# Check if there are any test cases specifically for these functions
rg -i "test.*chorba|test.*braid|test.*crc32" test/

Length of output: 5246


Script:

#!/bin/bash
# Check the implementation details and documentation
cat arch/generic/crc32_braid_c.c | head -n 50

# Look for any related header files that might contain additional documentation
fd -e h "crc32.*braid"

# Check the test file for specific test cases
head -n 50 test/test_crc32.cc

Length of output: 4234

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 730-730: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

144-188: 🛠️ Refactor suggestion

Extract repeated folding pattern into a helper function.

The folding operation with _mm_clmulepi64_si128 is repeated multiple times with the same pattern. This repetition increases code size and makes maintenance more difficult.

Create a helper function to handle the folding operation:

static inline void fold_registers(__m128i *crc0, __m128i *crc1, __m128i *crc2, __m128i *crc3,
                                const __m128i shift_const) {
    __m128i fold_high1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x11);
    __m128i fold_low1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x00);
    *crc0 = _mm_xor_si128(fold_high1, fold_low1);
    
    __m128i fold_high2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x11);
    __m128i fold_low2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x00);
    *crc1 = _mm_xor_si128(fold_high2, fold_low2);
    
    __m128i fold_high3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x11);
    __m128i fold_low3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x00);
    *crc2 = _mm_xor_si128(fold_high3, fold_low3);
    
    __m128i fold_high4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x11);
    __m128i fold_low4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x00);
    *crc3 = _mm_xor_si128(fold_high4, fold_low4);
}

Also applies to: 211-256, 279-324, 347-391


163-170: ⚠️ Potential issue

Replace binary XOR operations with intrinsics.

Based on past review comments, MSVC might not support binary operations on SIMD operands.

Replace the binary XOR operations with intrinsics for better compiler compatibility. For example:

-            xmm_t0 = _mm_xor_si128(_mm_xor_si128(_mm_xor_si128(xmm_t0, chorba6), chorba8), chorba7);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba6);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba8);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba7);

Also applies to: 197-204, 231-238, 265-272, 299-306, 333-340, 366-373, 399-406

🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

110-119: Consider using constant definitions for shift values.

The magic numbers in the shift constants could be better documented. These values appear to be derived from the Chorba algorithm mentioned in the PR.

Consider defining these as named constants with comments explaining their mathematical significance:

+/* Polynomial constants for Chorba algorithm shifts */
+#define CHORBA_SHIFT_544_480 0x1D9513D7ULL, 0x8F352D95ULL
+#define CHORBA_SHIFT_1568_1504 0xF5E48C85ULL, 0x596C8D81ULL

-            __m128i shift544_shift480 = _mm_set_epi64x(0x1D9513D7, 0x8F352D95);
-            __m128i shift1568_shift1504 = _mm_set_epi64x(0xF5E48C85, 0x596C8D81);
+            __m128i shift544_shift480 = _mm_set_epi64x(CHORBA_SHIFT_544_480);
+            __m128i shift1568_shift1504 = _mm_set_epi64x(CHORBA_SHIFT_1568_1504);

109-410: Document the Chorba algorithm implementation.

The implementation is based on the research paper mentioned in the PR objectives, but lacks documentation explaining the algorithm's principles and design choices.

Consider adding a detailed comment block explaining:

  1. The Chorba algorithm's principles
  2. The expected performance characteristics
  3. The buffer size requirements
  4. The polynomial constants used
  5. References to the research paper

Example:

/*
 * Implementation of the Chorba algorithm for CRC32 computation
 * Based on: "Efficient CRC32 Computation for x86 Processors"
 * Paper: https://arxiv.org/abs/2412.16398
 *
 * Key characteristics:
 * - Optimized for AVX1 instruction set
 * - Processes data in 512-byte chunks
 * - Minimum buffer size: 704 bytes (512 + 64 + 16*8)
 * - Expected speedup: 2x for data > 256KB, 5-20% for data > 256B
 *
 * The algorithm uses polynomial constants derived from...
 * [Add explanation of the mathematical principles]
 */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d70a97 and ef74bac.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🔇 Additional comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

109-110: Verify the buffer size requirements.

The loop condition len >= 512 + 64 + 16*8 implies a minimum buffer size of 704 bytes. This aligns with the PR objectives mentioning performance improvements for data sizes > 256KB, but we should ensure proper documentation of these size requirements.

Consider adding a comment explaining the minimum buffer size requirement and why it's set to this specific value.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

144-155: 🛠️ Refactor suggestion

Extract repeated folding pattern into a helper function.

The folding operation with _mm_clmulepi64_si128 is repeated multiple times. This repetition increases code size and makes maintenance more difficult.

Create a helper function to handle the folding operation:

static inline void fold_registers(__m128i *crc0, __m128i *crc1, __m128i *crc2, __m128i *crc3,
                                const __m128i shift_const) {
    __m128i fold_high1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x11);
    __m128i fold_low1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x00);
    *crc0 = _mm_xor_si128(fold_high1, fold_low1);

    __m128i fold_high2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x11);
    __m128i fold_low2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x00);
    *crc1 = _mm_xor_si128(fold_high2, fold_low2);

    __m128i fold_high3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x11);
    __m128i fold_low3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x00);
    *crc2 = _mm_xor_si128(fold_high3, fold_low3);

    __m128i fold_high4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x11);
    __m128i fold_low4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x00);
    *crc3 = _mm_xor_si128(fold_high4, fold_low4);
}

Also applies to: 177-188, 211-222, 245-256, 279-290, 313-324, 347-358, 380-391

arch/generic/crc32_braid_c.c (1)

215-217: 🛠️ Refactor suggestion

Consider using heap allocation for large buffer.

The 128KB buffer allocated on the stack could cause stack overflow on systems with limited stack size.

Consider using heap allocation:

-uint64_t bitbuffer[bitbuffersizeqwords];
+uint64_t* bitbuffer = (uint64_t*)zcalloc(1, bitbuffersizebytes);
+if (!bitbuffer) {
+    // Handle allocation failure
+    return crc;
+}
+
+// Add at the end of the function:
+zcfree(bitbuffer);
🧹 Nitpick comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

110-111: Document the shift constants.

The magic numbers in the shift constants should be documented to explain their significance in the Chorba algorithm.

+/* Polynomial multiplication constants for Chorba algorithm:
+ * shift544_shift480: Used for folding 544 and 480 bits
+ * shift1568_shift1504: Used for folding 1568 and 1504 bits
+ */
 __m128i shift544_shift480 = _mm_set_epi64x(0x1D9513D7, 0x8F352D95);
 __m128i shift1568_shift1504 = _mm_set_epi64x(0xF5E48C85, 0x596C8D81);
arch/generic/crc32_braid_c.c (1)

218-648: Add documentation for the Chorba algorithm implementation.

The function implements a complex algorithm but lacks documentation explaining the approach, assumptions, and limitations.

Add detailed documentation:

/**
 * Implements the Chorba algorithm for CRC32 computation (https://arxiv.org/abs/2412.16398).
 * 
 * This implementation processes data in three phases:
 * 1. Initial pass (lines 249-345): Zeros out bitbuffer
 * 2. Intermediate pass (lines 348-445): Processes half the values
 * 3. Main pass (lines 447-544): Processes remaining data
 * 
 * @param crc Initial CRC value
 * @param input Input data buffer
 * @param len Length of input data
 * @return Computed CRC32 value
 * 
 * @note Requires minimum input size of 118960 + 512 bytes
 * @note Uses 128KB temporary buffer
 */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef74bac and 2c4ca03.

📒 Files selected for processing (2)
  • arch/generic/crc32_braid_c.c (2 hunks)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 732-732: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (3)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

109-109: Verify the loop condition boundary.

The loop requires 512 + 64 + 16*8 = 704 bytes of input data. Consider adding a comment explaining this requirement and its relation to the Chorba algorithm's design.

-while (len >= 512 + 64 + 16*8) {
+/* Chorba algorithm requires 704 bytes:
+ * - 512 bytes for main processing
+ * - 64 bytes for intermediate state
+ * - 128 bytes (16*8) for look-ahead
+ */
+while (len >= 512 + 64 + 16*8) {
✅ Verification successful

Let me verify the loop condition boundary by examining the codebase.


Based on the code analysis, I can now provide a final response:

The loop condition is correctly sized for the Chorba algorithm implementation

The loop condition len >= 512 + 64 + 16*8 (704 bytes) is correctly sized according to the Chorba algorithm requirements as evidenced by:

  1. The code already includes a comment referencing the algorithm source: /* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
  2. The buffer size is broken down into:
    • 512 bytes for the main processing block
    • 64 bytes for intermediate state
    • 128 bytes (16*8) for look-ahead buffer used by the 8 chorba variables (chorba1-chorba8)
  3. The implementation shows consistent usage of these sizes in the processing loop with proper SIMD vector operations

The suggested comment addition would be redundant since:

  • The algorithm source is already documented
  • The code structure clearly shows the usage of each buffer segment
  • The implementation follows the paper's specifications
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for files containing CRC32 folding implementation
fd "crc32.*fold.*\.[ch]" --type f --exec cat {}

# Look for similar loop conditions or documentation
rg "len >= (512|704)" -A 5

Length of output: 31103


Script:

#!/bin/bash
# Search for Chorba algorithm documentation or comments
rg -i "chorba" -A 10 -B 2

# Look for related test files
fd "crc32.*test" --type f --exec cat {}

Length of output: 17728


Script:

#!/bin/bash
# Check for similar implementations or variations
rg "while.*len.*512.*64.*16" -A 5

# Look for documentation about the buffer size requirements
fd "README|DOC" --type f -e md -e txt --exec grep -i -A 5 "crc32|fold|chorba" {}

Length of output: 736

arch/generic/crc32_braid_c.c (2)

599-620: Consider endianness in bit operations.

The bit shift operations might need adjustment for big-endian systems.

Run this script to check for big-endian architectures in the codebase:

Consider adding endianness-specific code paths:

#if BYTE_ORDER == BIG_ENDIAN
    // Add byte swapping for big-endian
    a1 = BSWAP64((in1 << 17) ^ (in1 << 55));
#else
    a1 = (in1 << 17) ^ (in1 << 55);
#endif

732-742: Add input validation for buffer size requirements.

The function selects different implementations based on input size but doesn't validate minimum size requirements.

Add size validation:

 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
     uint32_t c;
+    /* Validate input */
+    if (!buf && len > 0) {
+        return crc;
+    }
 
     c = (~crc) & 0xffffffff;
     if(len > 512 * 1024)
         c = chorba_118960_nondestructive(c, buf, len);
     else if (len > 72)
         c = chorba_small_nondestructive(c, buf, len);
     else {
         c = crc32_braid(c, buf, len);
     }
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 732-732: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

@nmoinvaz (Member)

CodeRabbit's nitpicks look worthy of inclusion.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

125-125: ⚠️ Potential issue

Replace binary XOR operations with intrinsics for MSVC compatibility.

Based on past review comments, MSVC might not support binary operations on SIMD operands. Replace the binary XOR operations with _mm_xor_si128 intrinsics.

-                chorba1 ^= chorba2 ^ chorba4;
+                chorba1 = _mm_xor_si128(_mm_xor_si128(chorba1, chorba2), chorba4);

-                xmm_t0 ^= chorba2 ^ chorba4;
-                xmm_t1 ^= chorba3;
-                xmm_t2 ^= chorba2 ^ chorba3;
-                xmm_t3 ^= chorba2 ^ chorba3;
+                xmm_t0 = _mm_xor_si128(_mm_xor_si128(xmm_t0, chorba2), chorba4);
+                xmm_t1 = _mm_xor_si128(xmm_t1, chorba3);
+                xmm_t2 = _mm_xor_si128(_mm_xor_si128(xmm_t2, chorba2), chorba3);
+                xmm_t3 = _mm_xor_si128(_mm_xor_si128(xmm_t3, chorba2), chorba3);

Also applies to: 154-157, 188-191, 231-234, 265-268, 308-311, 342-345, 385-388, 419-422, 459-462, 493-496, 528-530, 561-564, 595-598, 630-632


134-146: 🛠️ Refactor suggestion

Extract repeated folding pattern into a helper function.

The folding operation with _mm_clmulepi64_si128 is repeated multiple times with the same pattern. This repetition increases code size and makes maintenance more difficult.

+static inline void fold_registers(__m128i *crc0, __m128i *crc1, __m128i *crc2, __m128i *crc3,
+                                 const __m128i shift_const) {
+    __m128i fold_high1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x11);
+    __m128i fold_low1 = _mm_clmulepi64_si128(*crc0, shift_const, 0x00);
+    *crc0 = _mm_xor_si128(fold_high1, fold_low1);
+    
+    __m128i fold_high2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x11);
+    __m128i fold_low2 = _mm_clmulepi64_si128(*crc1, shift_const, 0x00);
+    *crc1 = _mm_xor_si128(fold_high2, fold_low2);
+    
+    __m128i fold_high3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x11);
+    __m128i fold_low3 = _mm_clmulepi64_si128(*crc2, shift_const, 0x00);
+    *crc2 = _mm_xor_si128(fold_high3, fold_low3);
+    
+    __m128i fold_high4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x11);
+    __m128i fold_low4 = _mm_clmulepi64_si128(*crc3, shift_const, 0x00);
+    *crc3 = _mm_xor_si128(fold_high4, fold_low4);
+}

Usage:

-            __m128i fold_high1 = _mm_clmulepi64_si128(xmm_crc0, shift672_shift608, 0x11);
-            __m128i fold_low1 = _mm_clmulepi64_si128(xmm_crc0, shift672_shift608, 0x00);
-            xmm_crc0 = _mm_xor_si128(fold_high1, fold_low1);
-            __m128i fold_high2 = _mm_clmulepi64_si128(xmm_crc1, shift672_shift608, 0x11);
-            __m128i fold_low2 = _mm_clmulepi64_si128(xmm_crc1, shift672_shift608, 0x00);
-            xmm_crc1 = _mm_xor_si128(fold_high2, fold_low2);
-            __m128i fold_high3 = _mm_clmulepi64_si128(xmm_crc2, shift672_shift608, 0x11);
-            __m128i fold_low3 = _mm_clmulepi64_si128(xmm_crc2, shift672_shift608, 0x00);
-            xmm_crc2 = _mm_xor_si128(fold_high3, fold_low3);
-            __m128i fold_high4 = _mm_clmulepi64_si128(xmm_crc3, shift672_shift608, 0x11);
-            __m128i fold_low4 = _mm_clmulepi64_si128(xmm_crc3, shift672_shift608, 0x00);
-            xmm_crc3 = _mm_xor_si128(fold_high4, fold_low4);
+            fold_registers(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3, shift672_shift608);

Also applies to: 168-179, 211-223, 245-256, 288-300, 322-333, 365-377, 399-410, 439-451, 473-484, 507-519, 541-552, 575-587, 609-620, 643-655, 677-688

🧹 Nitpick comments (3)
arch/x86/crc32_fold_pclmulqdq_tpl.h (3)

110-116: Consider using named constants for magic numbers.

The shift constants and initialization values should be defined as named constants to improve readability and maintainability.

+#define SHIFT544_SHIFT480_HIGH 0x1D9513D7
+#define SHIFT544_SHIFT480_LOW  0x8F352D95
+#define SHIFT672_SHIFT608_HIGH 0xAE0B5394
+#define SHIFT672_SHIFT608_LOW  0x1C279815
+
-            __m128i shift544_shift480 = _mm_set_epi64x(0x1D9513D7, 0x8F352D95);
-            __m128i shift672_shift608 = _mm_set_epi64x(0xAE0B5394, 0x1C279815);
+            __m128i shift544_shift480 = _mm_set_epi64x(SHIFT544_SHIFT480_HIGH, SHIFT544_SHIFT480_LOW);
+            __m128i shift672_shift608 = _mm_set_epi64x(SHIFT672_SHIFT608_HIGH, SHIFT672_SHIFT608_LOW);

191-191: Remove commented-out code.

Several lines contain commented-out XOR operations. These should be removed as they add noise to the code.

-                //xmm_t3 ^= 0;
-                //xmm_t0 ^= 0;
-                //xmm_t1 ^= 0;
-                //xmm_t2 ^= 0;
-                //xmm_t3 ^= 0;

Also applies to: 268-268, 345-345, 422-422, 496-496, 530-530, 564-564, 598-598, 632-632


432-708: Consider extracting cleanup logic into a separate function.

The cleanup section is quite long and performs similar operations to the main loop. Consider extracting it into a separate function for better maintainability.

+static inline void cleanup_chorba(__m128i *xmm_crc0, __m128i *xmm_crc1, __m128i *xmm_crc2, __m128i *xmm_crc3,
+                                 const uint8_t *src, uint8_t *dst, const __m128i shift544_shift480,
+                                 const __m128i chorba2, const __m128i chorba3, const __m128i chorba4) {
+    // ... move cleanup logic here ...
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c4ca03 and c508610.

📒 Files selected for processing (1)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
🔇 Additional comments (1)
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

108-109: Verify buffer size requirements.

The condition len >= 512*2 + 64 + 16*4 needs documentation explaining the size requirements:

  • 512*2: Two main processing blocks
  • 64: Additional padding
  • 16*4: Four 16-byte vectors
✅ Verification successful

Let me gather more context about the implementation to better understand the buffer size requirements.


Let me gather one more piece of information about the Chorba algorithm implementation.


Buffer size requirements are aligned with Chorba algorithm implementation

The buffer size requirement len >= 512*2 + 64 + 16*4 is correctly structured for the Chorba algorithm implementation:

  • 512*2: Two main processing blocks for efficient SIMD operations
  • 64: Required padding for the algorithm's folding operations
  • 16*4: Space for four 16-byte vectors (chorba1-4) used in the SIMD-optimized implementation

The same condition is used in both the initial check and the processing loop, ensuring consistent buffer handling throughout the implementation. The size requirements align with the algorithm described in the referenced paper (arxiv.org/abs/2412.16398).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for documentation of buffer size requirements
rg -A 5 "len.*512.*2.*64.*16.*4" --type c

Length of output: 1052


Script:

#!/bin/bash
# Look for comments or documentation about the Chorba algorithm implementation
rg -B 5 -A 10 "Chorba" --type c

# Look for any related documentation in header comments
rg -B 10 "crc32.*fold.*pclmulqdq" --type c

Length of output: 11279


Script:

#!/bin/bash
# Look for the implementation details around the buffer size requirement
rg -B 10 -A 20 "len >= 512\*2 \+ 64 \+ 16\*4" --type c

Length of output: 3078


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
arch/x86/crc32_pclmulqdq_tpl.h (1)

171-172: Ensure polynomial constants are well-documented.

The chosen polynomial (0xF5E48C85) deserves explanatory comments to help maintainers and future contributors understand why it was selected.

arch/x86/crc32_fold_pclmulqdq_tpl.h (3)

129-135: Consider clarifying code intention with a brief comment.

Following the big XOR operations above, it might help future readers if you add a short comment here explaining the transition between data blocks.


171-181: Performance trade-off note.

Switching from fold_4 to fold_12 mid-stream can increase CPU pipeline stalls if not carefully scheduled. Consider measuring CPU cycles with performance counters.


264-285: Keep an eye on complexity.

These chained XOR lines are extensive; double-check correctness with thorough unit tests, especially for boundary conditions and edge cases.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c508610 and b2192e6.

📒 Files selected for processing (2)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
  • arch/x86/crc32_pclmulqdq_tpl.h (1 hunks)
🔇 Additional comments (8)
arch/x86/crc32_pclmulqdq_tpl.h (2)

183-188: Validate _mm_clmulepi64_si128 usage.

This intrinsic call is correct for carry-less multiplication. Verify whether you need to handle smaller input size edge cases explicitly, or if the caller always guarantees at least 16 bytes.


201-211: Consider reusing existing fold patterns.

The logic in fold_12 mirrors fold_1, fold_2, etc. If there's extensive repetition, factor out a shared folding helper function, unless performance tests show overhead.

arch/x86/crc32_fold_pclmulqdq_tpl.h (6)

108-109: Double-check loop condition.

while (len >= 512 + 64 + 16*8) might be correct for your throughput sweet spot. Confirm that smaller data isn't left unprocessed prematurely.


141-142: Review the choice of fold_12.

Ensure this advanced fold function provides the intended performance gains for large data sets.


149-153: Maintain consistent XOR chaining usage.

The chain of XORs merges multiple vectors. Verify it matches the desired "Chorba" sequence from the research paper, and that no intermediate steps are omitted.


218-226: Ensure no data races with partial writes.

Though this code is single-threaded, handling large buffers in parallel contexts could raise concurrency concerns. Confirm the calling code doesn't share these buffers among threads.


308-316: Validate final XOR merges.

Ensure each register (xmm_t0 through xmm_t3) is still valid by the time of these final XOR steps. Smaller partial data could lead to unintended merges if not tracked carefully.


317-319: Confirm loop exit logic.

After processing in 512-byte chunks, confirm that leftover bytes (if any) are handled by downstream logic effectively.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
arch/generic/crc32_braid_c.c (1)

215-220: ⚠️ Potential issue

Consider using heap allocation for the large buffer.

The 16KB * sizeof(z_word_t) buffer is allocated on the stack, which could cause stack overflow on systems with limited stack size. Consider using heap allocation or making it configurable.

Apply this change:

-z_word_t bitbuffer[bitbuffersizezwords];
+z_word_t* bitbuffer = (z_word_t*)ZALLOC(bitbuffersizebytes);
+if (!bitbuffer) {
+    // Handle allocation failure
+    return crc;
+}
+
+// Add at the end of the function:
+ZFREE(bitbuffer);
🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

600-622: Add documentation for the bit manipulation operations.

The bit manipulation operations are complex and their purpose is not immediately clear. Consider adding comments explaining the mathematical basis from the research paper.

Add documentation:

+// Implementing equation (7) from the paper:
+// The following operations compute the polynomial multiplication
+// using optimized bit shifts based on the characteristic polynomial
 a1 = (in1 << 17) ^ (in1 << 55);
 a2 = (in1 >> 47) ^ (in1 >> 9) ^ (in1 << 19);
 a3 = (in1 >> 45) ^ (in1 << 44);
 a4 = (in1 >> 20);

736-743: Improve error handling in the main interface.

The function selection logic could benefit from more robust error handling and logging.

Consider adding:

 c = (~crc) & 0xffffffff;
+
+/* Log which implementation is being used for debugging */
+#ifdef ZLIB_DEBUG
+    if(len > 512 * 1024)
+        Z_NOTICE("Using chorba_118960_nondestructive for len=%zu", len);
+    else if (len > 72)
+        Z_NOTICE("Using chorba_small_nondestructive for len=%zu", len);
+    else
+        Z_NOTICE("Using crc32_braid_base for len=%zu", len);
+#endif
+
 if(len > 512 * 1024)
     c = chorba_118960_nondestructive(c, buf, len);
 else if (len > 72)
     c = chorba_small_nondestructive(c, buf, len);
 else {
     c = crc32_braid_base(c, buf, len);
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2192e6 and 5fc24ef.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 733-733: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (1)
arch/generic/crc32_braid_c.c (1)

733-744: Add performance benchmarks and test cases.

Given that this is a performance-critical implementation, it would be valuable to:

  1. Add benchmarks comparing the new implementation with the existing one
  2. Add test cases verifying correctness for different input sizes
  3. Document the expected performance characteristics

Would you like me to help create a benchmark suite and test cases for this implementation?

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 733-733: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

Comment on lines +448 to +449
for(; i + (118960 + 4096) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
z_word_t in1, in2, in3, in4, in5, in6, in7, in8;


💡 Codebase verification

⚠️ Potential issue

Add bounds checking to prevent buffer overflows in CRC32 computation

Based on the search results and learnings, I can confirm that there are multiple loops in the code that perform pointer arithmetic without proper bounds checking. The loops at lines 448 and 581 process large blocks of data (118960 + 4096 words and 72 bytes respectively) without validating if the input buffer has sufficient length.

The review comment's suggestion to add bounds checking is correct and critical for preventing buffer overflows. Specifically:

  • The main loop at line 448 requires (118960 + 4096) * sizeof(z_word_t) bytes
  • The subsequent loop at line 581 requires 72 bytes
  • No length validation is present before these loops

The suggested fixes in the review comment are appropriate:

if (len < (118960 + 4096) * sizeof(z_word_t)) {
    return crc32_braid_base(crc, buf, len);
}

and

if (len < 72) {
    return crc;
}
🔗 Analysis chain

Add bounds checking for input buffer access.

Direct pointer arithmetic without bounds checking could lead to buffer overflows. The loops assume minimum input sizes but don't validate them.

Add bounds checking:

+if (len < (118960 + 4096) * sizeof(z_word_t)) {
+    // Handle insufficient input size
+    return crc32_braid_base(crc, buf, len);
+}
 for(; i + (118960 + 4096) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
+if (len < 72) {
+    // Handle insufficient input size
+    return crc;
+}
 for(; i + 72 < len; i += 32) {

Let's verify the input buffer access patterns:

Also applies to: 581-582


Let me gather more information about the loop conditions and function structure.


Let me try different search patterns to analyze the loop conditions.


Let me try one more search to understand the function context.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for all pointer arithmetic patterns involving input buffer access
# to identify potential buffer overflows.

rg -A 5 '\*\(\(.*\*\)\s*\(input\s*\+\s*i.*\)\)' arch/generic/crc32_braid_c.c

Length of output: 11246


Script:

#!/bin/bash
# Search for loop conditions and function declarations to understand the context
rg -B 2 -A 2 'for\(.*i \+.*\< len;.*\)' arch/generic/crc32_braid_c.c

Length of output: 174


Script:

#!/bin/bash
# Search for function declarations and their parameters
ast-grep --pattern 'uint32_t $_($$$) {
  $$$
}'  arch/generic/crc32_braid_c.c

Length of output: 78


Script:

#!/bin/bash
# Search for loop conditions with proper escaping
rg -B 2 -A 2 'for\s*\(.*\+.*<\s*len;' arch/generic/crc32_braid_c.c

Length of output: 494


Script:

#!/bin/bash
# Search for function declarations containing crc32 in their name
rg -B 1 -A 1 '^uint32_t\s+\w*crc32\w*\s*\(' arch/generic/crc32_braid_c.c

Length of output: 74


Script:

#!/bin/bash
# Search for function declarations with broader pattern
rg -B 2 -A 5 'uint32_t\s+\w+\s*\(\s*(const\s+)?(uint8_t|unsigned char)\s*\*' arch/generic/crc32_braid_c.c

Length of output: 107


Script:

#!/bin/bash
# Look for any length validation before the loops
rg -B 5 'if.*len.*<' arch/generic/crc32_braid_c.c

Length of output: 51


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/generic/crc32_braid_c.c (1)

448-448: ⚠️ Potential issue

Add bounds checking to prevent buffer overflows.

The loops perform pointer arithmetic without validating if the input buffer has sufficient length. This could lead to buffer overflows.

Add bounds checking before the loops:

+if (len < (14870 + 64) * sizeof(z_word_t)) {
+    return crc;
+}
 for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
+if (len < 72) {
+    return crc;
+}
 for(; i + 72 < len; i += 32) {

Also applies to: 581-581

🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

736-743: Document the size thresholds and their rationale.

The size thresholds (512KB, 72 bytes) for selecting different implementations lack documentation explaining their rationale and performance characteristics.

Add comments explaining the thresholds:

 c = (~crc) & 0xffffffff;
+/* Use optimized implementation for large buffers (>512KB) based on performance testing */
 if(len > 512 * 1024)
     c = chorba_118960_nondestructive(c, buf, len);
+/* Use small buffer optimization for medium-sized buffers (>72 bytes) */
 else if (len > 72)
     c = chorba_small_nondestructive(c, buf, len);
 else {
+    /* Use base implementation for small buffers */
     c = crc32_braid_base(c, buf, len);
 }

249-346: Consider reducing register pressure in unrolled loops.

The large number of local variables in the unrolled loops may cause register spills, potentially affecting performance. Consider breaking down the loops into smaller chunks or using SIMD operations if available.

Consider:

  1. Breaking down the 32-word chunks into smaller sizes (e.g., 16 words) to reduce register pressure
  2. Using SIMD intrinsics (e.g., AVX) for the bit operations if available
  3. Adding performance benchmarks to validate the optimal chunk size

Also applies to: 447-545

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fc24ef and f69aa58.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 733-733: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (1)
arch/generic/crc32_braid_c.c (1)

215-220: ⚠️ Potential issue

Consider using heap allocation for the large buffer.

The 128KB buffer allocated on the stack could cause stack overflow on systems with limited stack size. Consider using heap allocation with proper error handling.

Apply this change:

-z_word_t bitbuffer[bitbuffersizezwords];
+z_word_t* bitbuffer = (z_word_t*)malloc(bitbuffersizebytes);
+if (!bitbuffer) {
+    return crc;  // Handle allocation failure
+}
+
+// Add at the end of the function:
+free(bitbuffer);

Likely invalid or redundant comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

753-764: Main function implementation looks good!

The function correctly selects the appropriate implementation based on input size and handles CRC conditioning. Consider adding documentation about the performance characteristics of each threshold.

Add documentation:

 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
+    /* Select implementation based on input size:
+     * - > 512KB: Use Chorba algorithm optimized for large inputs (~2x speedup)
+     * - > 72B: Use Chorba algorithm optimized for medium inputs (~5-20% speedup)
+     * - <= 72B: Use base implementation
+     */
     uint32_t c;
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 753-753: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


214-751: Consider additional optimizations for different architectures.

The current implementation focuses on AVX1, but as mentioned in the PR objectives, similar principles could apply to other architectures:

  1. Consider implementing AVX2 and AVX512 variants
  2. Explore NEON implementation for ARM architectures
  3. Add runtime detection of available CPU features to select optimal implementation
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f69aa58 and 00b750f.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 753-753: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (3)
arch/generic/crc32_braid_c.c (3)

Line range hint 62-212: Base function implementation looks good!

The base CRC32 computation logic is well-preserved and correctly handles the fallback case for small inputs.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 753-753: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


215-218: 🛠️ Refactor suggestion

Consider heap allocation for large buffer.

The 16KB * word_size buffer is allocated on the stack, which could cause stack overflow on systems with limited stack size. Consider using heap allocation with proper alignment.

Apply this change:

-z_word_t bitbuffer[bitbuffersizezwords];
+z_word_t* bitbuffer = (z_word_t*)zcalloc(1, bitbuffersizebytes);
+if (!bitbuffer) {
+    return crc;  // Handle allocation failure
+}
+
+// Add at end of function:
+zfree(bitbuffer);

Likely invalid or redundant comment.


683-737: ⚠️ Potential issue

Add bounds checking for small input processing.

Similar to the main implementation, add validation for buffer access.

Add validation:

+if (i + 72 > len) {
+    // Handle insufficient input size
+    return crc;
+}
 for(; i + 72 < len; i += 32) {

Likely invalid or redundant comment.

Comment on lines +448 to +545
for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
z_word_t in1, in2, in3, in4, in5, in6, in7, in8;
z_word_t in9, in10, in11, in12, in13, in14, in15, in16;
z_word_t in17, in18, in19, in20, in21, in22, in23, in24;
z_word_t in25, in26, in27, in28, in29, in30, in31, in32;
int inoffset = (i / sizeof(z_word_t)) % bitbuffersizezwords;
int outoffset1 = ((i / sizeof(z_word_t)) + 14848) % bitbuffersizezwords;
int outoffset2 = ((i / sizeof(z_word_t)) + 14880) % bitbuffersizezwords;

in1 = *((z_word_t*) (input + i + (0 * sizeof(z_word_t)))) ^ next1 ^ bitbuffer[inoffset + 0];
in2 = *((z_word_t*) (input + i + (1 * sizeof(z_word_t)))) ^ next2 ^ bitbuffer[inoffset + 1];
in3 = *((z_word_t*) (input + i + (2 * sizeof(z_word_t)))) ^ next3 ^ bitbuffer[inoffset + 2];
in4 = *((z_word_t*) (input + i + (3 * sizeof(z_word_t)))) ^ next4 ^ bitbuffer[inoffset + 3];
in5 = *((z_word_t*) (input + i + (4 * sizeof(z_word_t)))) ^ next5 ^ bitbuffer[inoffset + 4];
in6 = *((z_word_t*) (input + i + (5 * sizeof(z_word_t)))) ^ next6 ^ bitbuffer[inoffset + 5];
in7 = *((z_word_t*) (input + i + (6 * sizeof(z_word_t)))) ^ next7 ^ bitbuffer[inoffset + 6];
in8 = *((z_word_t*) (input + i + (7 * sizeof(z_word_t)))) ^ next8 ^ in1 ^ bitbuffer[inoffset + 7];
in9 = *((z_word_t*) (input + i + (8 * sizeof(z_word_t)))) ^ next9 ^ in2 ^ bitbuffer[inoffset + 8];
in10 = *((z_word_t*) (input + i + (9 * sizeof(z_word_t)))) ^ next10 ^ in3 ^ bitbuffer[inoffset + 9];
in11 = *((z_word_t*) (input + i + (10 * sizeof(z_word_t)))) ^ next11 ^ in4 ^ bitbuffer[inoffset + 10];
in12 = *((z_word_t*) (input + i + (11 * sizeof(z_word_t)))) ^ next12 ^ in1 ^ in5 ^ bitbuffer[inoffset + 11];
in13 = *((z_word_t*) (input + i + (12 * sizeof(z_word_t)))) ^ next13 ^ in2 ^ in6 ^ bitbuffer[inoffset + 12];
in14 = *((z_word_t*) (input + i + (13 * sizeof(z_word_t)))) ^ next14 ^ in3 ^ in7 ^ bitbuffer[inoffset + 13];
in15 = *((z_word_t*) (input + i + (14 * sizeof(z_word_t)))) ^ next15 ^ in4 ^ in8 ^ bitbuffer[inoffset + 14];
in16 = *((z_word_t*) (input + i + (15 * sizeof(z_word_t)))) ^ next16 ^ in5 ^ in9 ^ bitbuffer[inoffset + 15];
in17 = *((z_word_t*) (input + i + (16 * sizeof(z_word_t)))) ^ next17 ^ in6 ^ in10 ^ bitbuffer[inoffset + 16];
in18 = *((z_word_t*) (input + i + (17 * sizeof(z_word_t)))) ^ next18 ^ in7 ^ in11 ^ bitbuffer[inoffset + 17];
in19 = *((z_word_t*) (input + i + (18 * sizeof(z_word_t)))) ^ next19 ^ in8 ^ in12 ^ bitbuffer[inoffset + 18];
in20 = *((z_word_t*) (input + i + (19 * sizeof(z_word_t)))) ^ next20 ^ in9 ^ in13 ^ bitbuffer[inoffset + 19];
in21 = *((z_word_t*) (input + i + (20 * sizeof(z_word_t)))) ^ next21 ^ in10 ^ in14 ^ bitbuffer[inoffset + 20];
in22 = *((z_word_t*) (input + i + (21 * sizeof(z_word_t)))) ^ next22 ^ in11 ^ in15 ^ bitbuffer[inoffset + 21];
in23 = *((z_word_t*) (input + i + (22 * sizeof(z_word_t)))) ^ in1 ^ in12 ^ in16 ^ bitbuffer[inoffset + 22];
in24 = *((z_word_t*) (input + i + (23 * sizeof(z_word_t)))) ^ in2 ^ in13 ^ in17 ^ bitbuffer[inoffset + 23];
in25 = *((z_word_t*) (input + i + (24 * sizeof(z_word_t)))) ^ in3 ^ in14 ^ in18 ^ bitbuffer[inoffset + 24];
in26 = *((z_word_t*) (input + i + (25 * sizeof(z_word_t)))) ^ in4 ^ in15 ^ in19 ^ bitbuffer[inoffset + 25];
in27 = *((z_word_t*) (input + i + (26 * sizeof(z_word_t)))) ^ in5 ^ in16 ^ in20 ^ bitbuffer[inoffset + 26];
in28 = *((z_word_t*) (input + i + (27 * sizeof(z_word_t)))) ^ in6 ^ in17 ^ in21 ^ bitbuffer[inoffset + 27];
in29 = *((z_word_t*) (input + i + (28 * sizeof(z_word_t)))) ^ in7 ^ in18 ^ in22 ^ bitbuffer[inoffset + 28];
in30 = *((z_word_t*) (input + i + (29 * sizeof(z_word_t)))) ^ in8 ^ in19 ^ in23 ^ bitbuffer[inoffset + 29];
in31 = *((z_word_t*) (input + i + (30 * sizeof(z_word_t)))) ^ in9 ^ in20 ^ in24 ^ bitbuffer[inoffset + 30];
in32 = *((z_word_t*) (input + i + (31 * sizeof(z_word_t)))) ^ in10 ^ in21 ^ in25 ^ bitbuffer[inoffset + 31];

next1 = in11 ^ in22 ^ in26;
next2 = in12 ^ in23 ^ in27;
next3 = in13 ^ in24 ^ in28;
next4 = in14 ^ in25 ^ in29;
next5 = in15 ^ in26 ^ in30;
next6 = in16 ^ in27 ^ in31;
next7 = in17 ^ in28 ^ in32;
next8 = in18 ^ in29;
next9 = in19 ^ in30;
next10 = in20 ^ in31;
next11 = in21 ^ in32;
next12 = in22;
next13 = in23;
next14 = in24;
next15 = in25;
next16 = in26;
next17 = in27;
next18 = in28;
next19 = in29;
next20 = in30;
next21 = in31;
next22 = in32;

bitbuffer[outoffset1 + 22] = in1;
bitbuffer[outoffset1 + 23] = in2;
bitbuffer[outoffset1 + 24] = in3;
bitbuffer[outoffset1 + 25] = in4;
bitbuffer[outoffset1 + 26] = in5;
bitbuffer[outoffset1 + 27] = in6;
bitbuffer[outoffset1 + 28] = in7;
bitbuffer[outoffset1 + 29] = in8;
bitbuffer[outoffset1 + 30] = in9;
bitbuffer[outoffset1 + 31] = in10;
bitbuffer[outoffset2 + 0] = in11;
bitbuffer[outoffset2 + 1] = in12;
bitbuffer[outoffset2 + 2] = in13;
bitbuffer[outoffset2 + 3] = in14;
bitbuffer[outoffset2 + 4] = in15;
bitbuffer[outoffset2 + 5] = in16;
bitbuffer[outoffset2 + 6] = in17;
bitbuffer[outoffset2 + 7] = in18;
bitbuffer[outoffset2 + 8] = in19;
bitbuffer[outoffset2 + 9] = in20;
bitbuffer[outoffset2 + 10] = in21;
bitbuffer[outoffset2 + 11] = in22;
bitbuffer[outoffset2 + 12] = in23;
bitbuffer[outoffset2 + 13] = in24;
bitbuffer[outoffset2 + 14] = in25;
bitbuffer[outoffset2 + 15] = in26;
bitbuffer[outoffset2 + 16] = in27;
bitbuffer[outoffset2 + 17] = in28;
bitbuffer[outoffset2 + 18] = in29;
bitbuffer[outoffset2 + 19] = in30;
bitbuffer[outoffset2 + 20] = in31;
bitbuffer[outoffset2 + 21] = in32;
}

⚠️ Potential issue

Add bounds checking for large input processing.

The main processing loop assumes input buffer has sufficient length without explicit validation. This could lead to buffer overflows.

Add validation before the loop:

+if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+    // Handle insufficient input size
+    return crc32_braid_base(crc, buf, len);
+}
 for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {

@codecov

codecov bot commented Dec 26, 2024

Codecov Report

Attention: Patch coverage is 98.56115% with 20 lines in your changes missing coverage. Please review.

Project coverage is 82.40%. Comparing base (aecfe16) to head (071df72).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
arch/generic/crc32_chorba_c.c 98.43% 18 Missing ⚠️
test/benchmarks/benchmark_crc32.cc 0.00% 0 Missing and 1 partial ⚠️
test/test_crc32.cc 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1837      +/-   ##
===========================================
+ Coverage    80.40%   82.40%   +1.99%     
===========================================
  Files          139      141       +2     
  Lines        11176    12558    +1382     
  Branches      2867     2894      +27     
===========================================
+ Hits          8986    10348    +1362     
- Misses        1220     1240      +20     
  Partials       970      970              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

219-669: Consider breaking down the large function into smaller, more manageable pieces.

The function chorba_118960_nondestructive is over 400 lines long and has high cyclomatic complexity. This makes it difficult to maintain and test.

Consider:

  1. Extracting the block processing logic into a separate function
  2. Creating helper functions for the bit manipulation operations
  3. Adding comments explaining the algorithm's steps

214-214: Consider adding architecture-specific optimizations.

The current implementation could benefit from SIMD optimizations on supported platforms.

Consider:

  1. Adding AVX/SSE implementations for x86
  2. Adding NEON implementation for ARM
  3. Adding compile-time architecture detection

Also applies to: 671-671

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00b750f and 59a79be.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 771-771: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (2)
arch/generic/crc32_braid_c.c (2)

215-220: ⚠️ Potential issue

Consider using heap allocation for the large buffer.

The 16KB * sizeof(z_word_t) buffer is allocated on the stack, which could cause stack overflow on systems with limited stack size. Consider using heap allocation with proper error handling.

Apply this change:

-#define bitbuffersizebytes (16 * 1024 * sizeof(z_word_t))
-#define bitbuffersizezwords (bitbuffersizebytes / sizeof(z_word_t))
-#define bitbuffersizeqwords (bitbuffersizebytes / sizeof(uint64_t))
-    z_word_t bitbuffer[bitbuffersizezwords];
+#define bitbuffersizebytes (16 * 1024 * sizeof(z_word_t))
+#define bitbuffersizezwords (bitbuffersizebytes / sizeof(z_word_t))
+#define bitbuffersizeqwords (bitbuffersizebytes / sizeof(uint64_t))
+    z_word_t* bitbuffer = (z_word_t*)malloc(bitbuffersizebytes);
+    if (!bitbuffer) {
+        return crc;  // Handle allocation failure
+    }
+    // Add at the end of the function:
+    free(bitbuffer);

Likely invalid or redundant comment.


448-545: ⚠️ Potential issue

Add bounds checking to prevent buffer overflows.

The main processing loop assumes the input buffer has sufficient length without explicit validation. This could lead to buffer overflows.

Add validation before the loop:

+    if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+        return crc32_braid_base(crc, buf, len);
+    }
     for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
⛔ Skipped due to learnings

Comment on lines +771 to +218
Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
    uint32_t c;

    c = (~crc) & 0xffffffff;
    if(len > 512 * 1024)
        c = chorba_118960_nondestructive(c, buf, len);
    else if (len > 72)
        c = chorba_small_nondestructive(c, buf, len);
    else {
        c = crc32_braid_base(c, buf, len);
    }


🛠️ Refactor suggestion

Add input validation and consider making size thresholds configurable.

The function lacks input validation and uses hardcoded size thresholds.

  1. Add input validation:
 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
     uint32_t c;
 
+    if (!buf && len > 0) {
+        return crc;  // or appropriate error handling
+    }
+
     c = (~crc) & 0xffffffff;
  2. Consider making size thresholds configurable:
+#ifndef CHORBA_LARGE_THRESHOLD
+#define CHORBA_LARGE_THRESHOLD (512 * 1024)
+#endif
+#ifndef CHORBA_SMALL_THRESHOLD
+#define CHORBA_SMALL_THRESHOLD 72
+#endif

Committable suggestion skipped: line range outside the PR's diff.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/generic/crc32_braid_c.c (1)

214-214: Enhance documentation for the Chorba algorithm implementation.

While the code references the research paper, it would benefit from additional documentation explaining:

  • The algorithm's key concepts and data flow
  • The purpose and interaction of different buffer regions
  • The rationale behind the magic numbers (14848, 14880, etc.)
  • Performance characteristics and tradeoffs
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59a79be and 7c2c7d9.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 771-771: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (5)
arch/generic/crc32_braid_c.c (5)

Line range hint 62-212: Base implementation looks good!

The crc32_braid_base function provides a solid foundation for CRC32 computation with proper boundary checks and error handling.



215-220: 🛠️ Refactor suggestion

Consider using heap allocation for large buffer.

The 16KB * sizeof(z_word_t) buffer is allocated on the stack, which could cause stack overflow on systems with limited stack size. Consider using heap allocation or making it configurable.

Apply this change:

-z_word_t bitbuffer[bitbuffersizezwords];
+z_word_t* bitbuffer = (z_word_t*)malloc(bitbuffersizebytes);
+if (!bitbuffer) {
+    return crc;  // Handle allocation failure
+}
+
+// Add at the end of the function:
+free(bitbuffer);

Likely invalid or redundant comment.


771-795: 🛠️ Refactor suggestion

Add input validation and consider making size thresholds configurable.

The function lacks input validation and uses hardcoded size thresholds.

  1. Add input validation:
 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
     uint32_t c;
+    if (!buf && len > 0) {
+        return crc;  // or appropriate error handling
+    }
  2. Consider making size thresholds configurable:
+#ifndef CHORBA_LARGE_THRESHOLD
+#define CHORBA_LARGE_THRESHOLD (512 * 1024)
+#endif
+#ifndef CHORBA_SMALL_THRESHOLD
+#define CHORBA_SMALL_THRESHOLD 72
+#endif

Likely invalid or redundant comment.



683-749: ⚠️ Potential issue

Add bounds checking for small input processing.

Similar to the large input function, this loop needs proper bounds checking.

Add validation:

+if (i + 72 > len) {
+    return crc32_braid_base(crc, (uint8_t*)buf, len);
+}
 for(; i + 72 < len; i += 32) {
⛔ Skipped due to learnings

448-545: ⚠️ Potential issue

Add bounds checking for large input processing.

The main processing loop assumes input buffer has sufficient length without explicit validation. This could lead to buffer overflows.

Add validation before the loop:

+if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+    return crc32_braid_base(crc, buf, len);
+}
 for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
⛔ Skipped due to learnings

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
arch/generic/crc32_braid_c.c (1)

786-792: Consider making size thresholds configurable.

The thresholds for choosing between algorithms (512KB, 72 bytes) are hardcoded. Consider making these configurable to allow tuning for different architectures and use cases.

Add configuration macros:

+#ifndef CHORBA_LARGE_THRESHOLD
+#define CHORBA_LARGE_THRESHOLD (512 * 1024)
+#endif
+#ifndef CHORBA_SMALL_THRESHOLD
+#define CHORBA_SMALL_THRESHOLD 72
+#endif

-        if(aligned_len > 512 * 1024)
+        if(aligned_len > CHORBA_LARGE_THRESHOLD)
             c = chorba_118960_nondestructive(c, aligned_buf, aligned_len);
-        else if (aligned_len > 72)
+        else if (aligned_len > CHORBA_SMALL_THRESHOLD)
             c = chorba_small_nondestructive(c, aligned_buf, aligned_len);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2c7d9 and 0c7ba03.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 773-773: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (3)
arch/generic/crc32_braid_c.c (3)

215-220: ⚠️ Potential issue

Consider heap allocation for large buffer.

The 128KB stack buffer allocation could cause stack overflow on systems with limited stack size. Consider using heap allocation with proper error handling.

Apply this change:

-z_word_t bitbuffer[bitbuffersizezwords];
+z_word_t* bitbuffer = (z_word_t*)malloc(bitbuffersizebytes);
+if (!bitbuffer) {
+    return crc;  // Handle allocation failure
+}
+
+// Add at end of function:
+free(bitbuffer);

Likely invalid or redundant comment.


448-448: ⚠️ Potential issue

Add bounds checking to prevent buffer overflows.

The loops process large blocks of data without validating if the input buffer has sufficient length.

Add validation before the loops:

+if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+    return crc32_braid_base(crc, buf, len);
+}
 for(; i + (14870 + 64) * sizeof(z_word_t) < len; i += (32 * sizeof(z_word_t))) {
+if (i + 72 > len) {
+    return crc;
+}
 for(; i + 72 < len; i += 32) {

Also applies to: 581-581

⛔ Skipped due to learnings

773-797: Add comprehensive test coverage.

The implementation of a new CRC algorithm requires thorough testing to ensure correctness and performance. Consider adding:

  • Unit tests with various input sizes
  • Performance benchmarks comparing with original implementation
  • Edge case tests (alignment, buffer sizes at thresholds)
  • Verification against reference CRC implementation

Let's check for existing tests:
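On the last bullet, a minimal reference oracle for such verification could look like the sketch below (not part of the PR; 0xCBF43926 is the standard CRC-32 check value for "123456789"):

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-at-a-time reference CRC-32 (reflected polynomial 0xEDB88320, the one
 * zlib uses). Slow, but a handy oracle for unit tests: any optimized
 * implementation must match it for every input. */
static uint32_t crc32_reference(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;                                  /* pre-condition */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;                                 /* post-condition */
}
```

Because the pre/post-conditioning cancels across calls, the oracle also covers incremental use: feeding a buffer in two pieces must give the same result as one call.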

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 773-773: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

666-668: Consider optimizing the final pass.

The byte-by-byte processing in the final loop could be optimized for better performance.

Consider processing multiple bytes at once when possible:

-for(int j = 0; j<len-i; j++) {
-    crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (crc >> 8);
+while (len-i >= 4) {
+    uint32_t chunk;
+    memcpy(&chunk, final_bytes, 4);
+    chunk ^= *(uint32_t*)&bitbufferbytes[(i) % bitbuffersizebytes];
+    crc = process_chunk(crc, chunk);
+    final_bytes += 4;
+    i += 4;
+}
+for(int j = 0; j<len-i; j++) {
+    crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (crc >> 8);
+}
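Note that `process_chunk` in the suggestion is hypothetical; with a single 256-entry table it would simply be four table steps, which is byte-for-byte equivalent to the scalar loop (the chunk being the little-endian word the suggestion's `memcpy` loads). A self-contained sketch:

```c
#include <stdint.h>

static uint32_t crc_table[256];

/* Build the standard reflected CRC-32 table (polynomial 0xEDB88320). */
static void make_crc_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

/* Fold one little-endian 4-byte chunk into the CRC state: XOR the whole
 * word in, then do four table lookups. Equivalent to processing the four
 * bytes one at a time. */
static uint32_t process_chunk(uint32_t crc, uint32_t chunk) {
    crc ^= chunk;
    for (int k = 0; k < 4; k++)
        crc = crc_table[crc & 0xff] ^ (crc >> 8);
    return crc;
}
```

Real gains usually come from slicing-by-4/8 with multiple tables; this one-table version only shows that the word-at-a-time rewrite preserves the result.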

215-217: Extract magic numbers into named constants.

The buffer size calculations use magic numbers that should be named constants.

Consider using more descriptive constants:

-#define bitbuffersizebytes (16 * 1024 * sizeof(z_word_t))
-#define bitbuffersizezwords (bitbuffersizebytes / sizeof(z_word_t))
-#define bitbuffersizeqwords (bitbuffersizebytes / sizeof(uint64_t))
+#define CHORBA_BUFFER_KB 16
+#define CHORBA_BUFFER_BYTES (CHORBA_BUFFER_KB * 1024 * sizeof(z_word_t))
+#define CHORBA_BUFFER_ZWORDS (CHORBA_BUFFER_BYTES / sizeof(z_word_t))
+#define CHORBA_BUFFER_QWORDS (CHORBA_BUFFER_BYTES / sizeof(uint64_t))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c7ba03 and abeafb0.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 775-775: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (1)
arch/generic/crc32_braid_c.c (1)

775-781: ⚠️ Potential issue

Add input validation.

The function lacks proper input validation.

Add validation at the start of the function:

 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
     uint32_t c;
     uint64_t* aligned_buf;
     size_t aligned_len;
 
+    if (!buf && len > 0) {
+        return crc;  // or appropriate error handling
+    }
+
     c = (~crc) & 0xffffffff;
⛔ Skipped due to learnings
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 775-775: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

Comment on lines +220 to +221
z_word_t* bitbuffer = aligned_alloc(sizeof(z_word_t), bitbuffersizebytes);
const uint8_t* bitbufferbytes = (const uint8_t*) bitbuffer;
Contributor


⚠️ Potential issue

Add error handling for memory allocation.

The aligned_alloc call lacks error handling. If allocation fails, the function will crash when attempting to use the buffer.

Add error handling:

-z_word_t* bitbuffer = aligned_alloc(sizeof(z_word_t), bitbuffersizebytes);
+z_word_t* bitbuffer = aligned_alloc(sizeof(z_word_t), bitbuffersizebytes);
+if (!bitbuffer) {
+    return crc;  // or appropriate error handling
+}

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +606 to +609
a1 = (in1 << 17) ^ (in1 << 55);
a2 = (in1 >> 47) ^ (in1 >> 9) ^ (in1 << 19);
a3 = (in1 >> 45) ^ (in1 << 44);
a4 = (in1 >> 20);
Contributor


I haven't run any benchmarks yet, but shifts of 64-bit integers tend to be more complex on 32-bit systems compared to other operations. If it's not possible to do this trivially with 32-bit integers (where necessary), would it be possible for specific architectures to always use the braided version instead?

Contributor Author


I've just added a 32-bit version using the same polynomial; I'm getting a 30% speedup on an i5-4570 in 32-bit mode. What difference are you seeing?

@samrussell
Contributor Author

samrussell commented Dec 26, 2024 via email

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
arch/generic/crc32_braid_c.c (3)

599-602: Consider optimizing endianness handling.

The code has multiple blocks handling big-endian conversion. This could be optimized to reduce code duplication and improve maintainability.

Consider creating helper functions for endianness conversion:

+static inline uint64_t swap_if_be(uint64_t value) {
+#if BYTE_ORDER == BIG_ENDIAN
+    return ZSWAP64(value);
+#else
+    return value;
+#endif
+}

+static inline void swap_if_be_batch(uint64_t* values, size_t count) {
+#if BYTE_ORDER == BIG_ENDIAN
+    for (size_t i = 0; i < count; i++) {
+        values[i] = ZSWAP64(values[i]);
+    }
+#endif
+}

Then use these helpers to simplify the code:

-#if BYTE_ORDER == BIG_ENDIAN
-        in1 = ZSWAP64(in1);
-        in2 = ZSWAP64(in2);
-#endif
+        in1 = swap_if_be(in1);
+        in2 = swap_if_be(in2);

Also applies to: 619-621, 649-655, 703-706, 722-725, 753-759


786-792: Consider making size thresholds configurable.

The size thresholds for choosing between different implementations are hardcoded. This could limit optimization opportunities on different architectures.

Consider making these thresholds configurable:

+#ifndef CHORBA_LARGE_THRESHOLD
+#define CHORBA_LARGE_THRESHOLD ((sizeof(z_word_t) * 64) * 1024)
+#endif
+#ifndef CHORBA_SMALL_THRESHOLD
+#define CHORBA_SMALL_THRESHOLD 72
+#endif

-        if(aligned_len > (sizeof(z_word_t) * 64) * 1024)
+        if(aligned_len > CHORBA_LARGE_THRESHOLD)
             c = chorba_118960_nondestructive(c, (z_word_t*) aligned_buf, aligned_len);
-        else if (aligned_len > 72)
+        else if (aligned_len > CHORBA_SMALL_THRESHOLD)
             c = chorba_small_nondestructive(c, aligned_buf, aligned_len);

606-609: Consider 32-bit system compatibility.

As noted in past review comments, shifts of 64-bit integers can be more complex on 32-bit systems. The current implementation heavily relies on 64-bit shifts.

Consider providing alternative implementations for 32-bit systems:

  1. Use 32-bit operations where possible
  2. Add compile-time checks to select appropriate implementation
  3. Consider fallback to braided version on 32-bit systems

Would you like me to propose specific changes for 32-bit system compatibility?

Also applies to: 710-713, 729-732, 734-737
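To make the cost concrete: on a 32-bit target each 64-bit shift decomposes into two 32-bit shifts plus a cross-word combine, which compilers emit automatically. A sketch of what that lowering looks like:

```c
#include <stdint.h>

/* A 64-bit value represented as two 32-bit halves, as a 32-bit target
 * would hold it in a register pair. */
typedef struct { uint32_t lo, hi; } u64pair;

/* Left-shift using only 32-bit operations: two sub-shifts plus a
 * cross-word carry, or a single shift with a zeroed low half when the
 * count reaches a full word. */
static u64pair shl64_32(u64pair v, unsigned s) {
    u64pair r;
    if (s == 0) {
        r = v;
    } else if (s < 32) {
        r.hi = (v.hi << s) | (v.lo >> (32 - s));
        r.lo = v.lo << s;
    } else {            /* 32 <= s < 64 */
        r.hi = v.lo << (s - 32);
        r.lo = 0;
    }
    return r;
}
```

The shift counts used by Chorba (17, 44, 55, ...) mostly fall in the expensive `s < 32` and `s >= 32` mixed cases, which is why a dedicated 32-bit variant can pay off.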

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abeafb0 and d0da817.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 773-773: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (2)
arch/generic/crc32_braid_c.c (2)

773-797: LGTM: Main entry point function.

The main entry point function is well-structured:

  • Handles alignment requirements
  • Selects appropriate implementation based on input size
  • Properly handles pre and post-conditioning of CRC value
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 773-773: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


448-448: ⚠️ Potential issue

Add bounds checking for large input processing.

The main processing loops assume input buffer has sufficient length without explicit validation. This could lead to buffer overflows.

Add validation before the loops:

+if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+    return crc;  // or appropriate error handling
+}
 for(; (i + (14870 + 64) * sizeof(z_word_t)) < len; i += (32 * sizeof(z_word_t))) {
+if (i + 72 > len) {
+    return crc;  // or appropriate error handling
+}
 for(; (i + 72 < len); i += 32) {

Also applies to: 581-581

⛔ Skipped due to learnings
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
arch/generic/crc32_braid_c.c (2)

606-609: Consider optimizing 64-bit shifts for 32-bit systems.

As noted in the PR comments by ccawley2011, shifts of 64-bit integers can be complex on 32-bit systems. The current implementation heavily relies on 64-bit shifts.

Consider:

  1. Implementing a 32-bit version using smaller shifts
  2. Using compiler intrinsics where available
  3. Adding runtime detection to choose between 32-bit and 64-bit implementations

Would you like me to help implement these optimizations?

1240-1246: Add input validation.

The function should validate input parameters before processing.

Add validation:

 Z_INTERNAL uint32_t PREFIX(crc32_braid)(uint32_t crc, const uint8_t *buf, size_t len) {
     uint32_t c;
     uint64_t* aligned_buf;
     size_t aligned_len;
 
+    if (!buf && len > 0) {
+        return crc;  // or appropriate error handling
+    }
+
     c = (~crc) & 0xffffffff;
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 1240-1240: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0da817 and 09dc08d.

📒 Files selected for processing (1)
  • arch/generic/crc32_braid_c.c (2 hunks)
🧰 Additional context used
📓 Learnings (1)
arch/generic/crc32_braid_c.c (1)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 cppcheck (2.10-2)
arch/generic/crc32_braid_c.c

[error] 1240-1240: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (2)
arch/generic/crc32_braid_c.c (2)

Line range hint 62-212: LGTM: Base CRC implementation looks solid.

The base CRC implementation is well-structured and follows good practices.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 1240-1240: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


448-448: ⚠️ Potential issue

Add bounds checking for large input processing.

Based on the learnings from past reviews, the loops processing large blocks need proper bounds checking to prevent buffer overflows.

Add validation before the loops:

+    if (i + (14870 + 64) * sizeof(z_word_t) > len) {
+        return crc32_braid_base(crc, buf, len);
+    }
     for(; (i + (14870 + 64) * sizeof(z_word_t)) < len; i += (32 * sizeof(z_word_t))) {

+    if (i + 72 > len) {
+        return crc;
+    }
     for(; (i + 72 < len); i += 32) {

Also applies to: 581-581

⛔ Skipped due to learnings
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.

return c;
}

/* Implement Chorba algorithm from https://arxiv.org/abs/2412.16398 */
Contributor

@coderabbitai coderabbitai bot Dec 27, 2024


🛠️ Refactor suggestion

Add comprehensive algorithm documentation.

The implementation lacks detailed documentation explaining:

  • The Chorba algorithm's principles and performance characteristics
  • Purpose of magic numbers (14848, 14880, etc.)
  • Data flow between passes
  • Performance characteristics for different input sizes

Would you like me to help draft detailed documentation for these functions?

Also applies to: 673-673

Contributor Author


I would like to see a draft of the documentation. The paper is at https://arxiv.org/pdf/2412.16398

Contributor


@samrussell

Below is draft documentation for the new Chorba-based functions and their usage in arch/generic/crc32_braid_c.c. Feel free to adapt or refine it as needed:

/**
 * Chorba Algorithm Overview
 *
 * The Chorba algorithm (refer: https://arxiv.org/abs/2412.16398) is a novel approach
 * for efficiently computing CRC (cyclic redundancy check) values across large data
 * buffers. Unlike the traditional braided or folding CRC methods that separately
 * process segments (or "braids") of the input data and then combine partial CRCs,
 * the Chorba algorithm intermixes multiple "passes" of XOR-based transformations
 * to increase parallelism and reduce data dependencies.
 *
 * =============================================================================
 *   1. Combining CPU-Friendly Arithmetic with Braid-Like Techniques
 *
 *   The Chorba algorithm attempts to exploit wide CPU registers (e.g., 64-bit or
 *   equivalent SIMD registers) to process multiple parts of the data simultaneously.
 *   By interleaving partial XOR and left/right bit shifts in several stages,
 *   Chorba effectively merges the computational pipeline of a braided CRC approach
 *   with mathematically verifiable transformations. This often reduces the
 *   overhead of branch mispredictions and memory stalls encountered when
 *   splitting the data among separate braids.
 *
 * =============================================================================
 *   2. High-Level Flow in This Implementation
 *
 *   - The data is consumed in chunks (often 32 or 64 bytes, depending on the
 *     function and architecture).
 *   - Intermediate XOR states are accumulated across multiple "passes" (e.g.,
 *     the variable sets next1, next2, next3, etc.).
 *   - These passes branch into distinct loops:
 *       a) Initial pass or "warm-up": Pre-filling bit-buffer arrays and partial
 *          CRC states using repeated XOR transformations.
 *       b) Intermediate pass or "overlap": After some portion of data is
 *          processed, partial states are used to feed subsequent XOR steps for
 *          further input segments.
 *       c) Final pass or "cleanup": Any remaining bytes are processed, combined
 *          with leftover nextX states, and passed through the base CRC
 *          calculation for completeness.
 *
 *   - In addition, there is logic for smaller input sizes (see
 *     chorba_small_nondestructive) as well as for larger sizes (chorba_118960_nondestructive).
 *     For extremely large data, a multi-iteration approach ensures that leftover
 *     partial states get combined correctly.
 *
 * =============================================================================
 *   3. Magic Numbers (14848, 14880, etc.)
 *
 *   Various "magic" constants appear in the code (14848, 14880, 14870, etc.). These
 *   values arise from aligning critical intermediate transformations to specific
 *   bit-buffer offsets. They are based on thorough experimentation with alignment
 *   and chunk sizes to:
 *     - Provide safe offsets ensuring data from earlier passes does not overlap
 *       incorrectly with later passes.
 *     - Ensure that the bit-buffer remains large enough to handle partial XOR
 *       states multiple times without frequent flushing.
 *     - Map closely to the half or quarter multiples of the allocated buffer
 *       sizes in memory, aiming for fewer cache misses (though the exact numbers
 *       can appear somewhat “arbitrary” when read in isolation).
 *
 *   It can be helpful to define them as constants or macros with self-documenting
 *   names. For example:
 *       #define CHORBA_OFFSET_1 14848
 *       #define CHORBA_OFFSET_2 14880
 *       // etc.
 *
 * =============================================================================
 *   4. Data Flow Between Passes
 *
 *   - In the larger function chorba_118960_nondestructive:
 *       1. The bitbuffer is initialized (often set to zero in large chunks).
 *       2. The code loops over the input to XOR in partial states plus leftover
 *          states from earlier iterations (like next1, next2, etc.). Each iteration
 *          stores intermediate results back into the bitbuffer for use in
 *          subsequent passes.
 *       3. After the main body of data has been processed, leftover nextX states
 *          and trailing data are forwarded to a final pass that merges them with
 *          standard CRC lookups.
 *
 *     This interplay ensures that short writes in an early pass can influence
 *     the next pass with minimal overhead, all while preserving the final
 *     correctness of the CRC.
 *
 *   - In the smaller function chorba_small_nondestructive:
 *       1. The approach is a simplified version for buffer sizes not large enough
 *          to justify the full overhead of the bigger multi-pass approach.
 *       2. It still takes advantage of partial states (nextX), but uses fewer
 *          passes to reduce overhead for smaller data.
 *       3. The final partial states are fed back into a standard braided CRC
 *          base function to ensure consistency at the end.
 *
 * =============================================================================
 *   5. Performance Characteristics
 *
 *   - For **large buffers** (e.g., > 64 KB or 256 KB, depending on the system):
 *       Chorba aims to generate a 2x speedup relative to conventional braided
 *       CRC computations, especially on hardware that efficiently handles 64-bit
 *       shifts and XOR ops. The offset-based storage in bitbuffer reduces cache
 *       line thrashing, and the broad “multiphase” approach maximizes pipeline
 *       utilization.
 *
 *   - For **sub-256 KB but moderately large** buffers:
 *       A partial speedup (in the 5–20% range) may be observed, particularly with
 *       AVX1 or SSE-like instruction sets. Gains come from minimized loop overhead
 *       and early partial XOR accumulation.
 *
 *   - For **very small buffers** (under ~72 bytes in this code):
 *       The overhead of setting up multiple passes may outweigh any parallel gains.
 *       Hence the fallback to crc32_braid_base to avoid unnecessary complexity.
 *
 * =============================================================================
 *   Usage Notes
 *
 *   - The function PREFIX(crc32_braid) automatically picks between
 *     crc32_braid_base and the Chorba variants chorba_118960_nondestructive or
 *     chorba_small_nondestructive based on buffer size and alignment constraints.
 *   - If you modify or extend the magic offsets (14848, 14880, etc.) or the pass
 *     count, be sure to keep the bitbuffer size and indexing logic consistent
 *     throughout the code.
 *   - This code is a direct application of the techniques in the referenced
 *     paper but heavily tuned to suit the zlib-ng contexts (e.g., factoring in
 *     braided CRC table usage).
 */

This draft includes:

  • High-level explanation of the Chorba algorithm and its core principle of interleaving XOR transformations.
  • Rationale behind magic numbers (14848, 14880, and related offsets).
  • Outline of the data flow and how intermediate states propagate across multiple passes.
  • Discussion of performance expectations for different buffer sizes.

Please let me know if you would like any specific areas expanded further, or if you have additional questions about the constants or multi-pass flow.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/generic/crc32_chorba_c.c (1)

27-479: ⚠️ Potential issue

Add bounds checking for pointer arithmetic.

The function performs extensive pointer arithmetic without validating buffer sizes, which could lead to buffer overflows.

Based on the learnings, we need to verify out-of-bounds pointer arithmetic in loops where "input + i" is used.

🧹 Nitpick comments (2)
arch/x86/crc32_pclmulqdq_tpl.h (1)

172-172: Document the polynomial constants.

The magic numbers 0x596C8D81 and 0xF5E48C85 should be documented to explain their significance in the CRC computation.

-    const __m128i xmm_fold12 = _mm_set_epi64x(0x596C8D81, 0xF5E48C85);
+    /* Constants for 12-fold CRC32 reduction polynomial */
+    const __m128i xmm_fold12 = _mm_set_epi64x(0x596C8D81, 0xF5E48C85);
arch/generic/crc32_chorba_c.c (1)

1-26: LGTM! Well-documented implementation.

Good documentation explaining the algorithm phases and requirements. Consider making buffer sizes configurable via macros for different architectures/platforms.

-#define bitbuffersizebytes (16 * 1024 * sizeof(z_word_t))
+#ifndef CHORBA_BUFFER_SIZE_KB
+#define CHORBA_BUFFER_SIZE_KB 16
+#endif
+#define bitbuffersizebytes (CHORBA_BUFFER_SIZE_KB * 1024 * sizeof(z_word_t))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 999ec5e and 3672389.

📒 Files selected for processing (17)
  • CMakeLists.txt (1 hunks)
  • Makefile.in (2 hunks)
  • arch/generic/Makefile.in (2 hunks)
  • arch/generic/crc32_braid_c.c (2 hunks)
  • arch/generic/crc32_c.c (1 hunks)
  • arch/generic/crc32_chorba_c.c (1 hunks)
  • arch/generic/generic_functions.h (2 hunks)
  • arch/s390/crc32-vx.c (2 hunks)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
  • arch/x86/crc32_pclmulqdq_tpl.h (1 hunks)
  • crc32_c.h (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (1 hunks)
  • win32/Makefile.a64 (2 hunks)
  • win32/Makefile.arm (2 hunks)
  • win32/Makefile.msc (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • test/benchmarks/benchmark_crc32.cc
  • Makefile.in
  • test/test_crc32.cc
  • CMakeLists.txt
  • arch/x86/crc32_fold_pclmulqdq_tpl.h
  • arch/s390/crc32-vx.c
  • functable.c
  • win32/Makefile.msc
  • arch/generic/Makefile.in
  • arch/generic/generic_functions.h
  • win32/Makefile.arm
  • crc32_c.h
  • win32/Makefile.a64
🧰 Additional context used
📓 Learnings (3)
arch/generic/crc32_c.c (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
arch/generic/crc32_chorba_c.c (2)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
arch/generic/crc32_braid_c.c (3)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:0-0
Timestamp: 2025-01-06T07:20:02.028Z
Learning: Business requirement: Memory allocations are not allowed, so large stack allocations must be used instead.
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:215-217
Timestamp: 2024-12-27T10:06:16.184Z
Learning: There's a business requirement to use stack allocation instead of heap for the Chorba buffer.
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 Cppcheck (2.10-2)
arch/generic/crc32_c.c

[error] 5-5: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

arch/generic/crc32_chorba_c.c

[error] 638-638: Uninitialized variable

(uninitvar)

arch/generic/crc32_braid_c.c

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (8)
arch/generic/crc32_c.c (3)

1-4: LGTM! Required headers are included.

The includes are appropriate for the CRC32 implementation, with crc32_c.h providing the Chorba function declarations.


5-41: LGTM! Well-structured implementation with size-based optimizations.

The function effectively:

  • Pre-conditions the CRC value
  • Handles alignment requirements
  • Selects appropriate implementation based on data size
🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 5-5: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


11-35: Add input size validation for Chorba functions.

The Chorba functions have minimum size requirements (e.g., 118960 + 512 bytes for chorba_118960_nondestructive). Consider adding size validation before calling these functions to prevent undefined behavior.

arch/x86/crc32_pclmulqdq_tpl.h (1)

171-211: LGTM! Well-implemented folding operation.

The implementation follows the established pattern of other fold functions and correctly uses PCLMULQDQ instructions.

arch/generic/crc32_chorba_c.c (2)

1210-1421: LGTM! Well-optimized for 32-bit systems.

The implementation properly handles 32-bit constraints by:

  • Using appropriate shift values
  • Avoiding 64-bit operations
  • Maintaining register pressure within 32-bit system limits

474-476: ⚠️ Potential issue

Fix uninitialized variable access.

The crc variable is used in the CRC table lookup before being properly initialized.

+    uint32_t current_crc = 0;  // Initialize CRC
     for(size_t j = 0; j < (len-i); j++) {
-        crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (crc >> 8);
+        current_crc = crc_table[(current_crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (current_crc >> 8);
     }
+    crc = current_crc;

Likely invalid or redundant comment.

arch/generic/crc32_braid_c.c (2)

Line range hint 62-212: LGTM! Clean internal implementation.

The function correctly handles the core CRC calculation without pre/post-conditioning, making it suitable for internal use by other functions.

🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)


214-219: LGTM! Proper pre/post-conditioning.

The function correctly:

  • Pre-conditions the input CRC with NOT operation
  • Calls the internal implementation
  • Post-conditions the result
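That split can be sketched in isolation: an internal core that works on the already-inverted state, and a public wrapper that applies `~` on entry and exit. The names below are illustrative (not the PR's), and the bit-at-a-time loop stands in for the braid/Chorba kernels:

```c
#include <stdint.h>
#include <stddef.h>

/* Internal core: expects and returns the conditioned (inverted) CRC state,
 * so several optimized kernels can share it without double inversion. */
static uint32_t crc32_core(uint32_t c, const uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        c ^= buf[i];
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
    }
    return c;
}

/* Public wrapper: pre-condition with ~, run the core, post-condition. */
static uint32_t crc32_wrapped(uint32_t crc, const uint8_t *buf, size_t len) {
    return ~crc32_core(~crc, buf, len);
}
```

Keeping the inversions only in the wrapper is what makes incremental (chained) calls compose correctly.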
🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

@Dead2
Member

Dead2 commented Feb 11, 2025

@samrussell Would you please rebase this? It should pass all CI tests then.

Also, it would be very nice if you could work in the AI suggestion about replacing the magic numbers with defines:

+/* Size thresholds for Chorba algorithm variants */
+#define CHORBA_LARGE_THRESHOLD (sizeof(z_word_t) * 64 * 1024)
+#define CHORBA_MEDIUM_THRESHOLD 32768
+#define CHORBA_SMALL_THRESHOLD 72

-if(aligned_len > (sizeof(z_word_t) * 64) * 1024)
+if(aligned_len > CHORBA_LARGE_THRESHOLD)
     c = crc32_chorba_118960_nondestructive(c, (z_word_t*) aligned_buf, aligned_len);
 #if W == 8
-else if (aligned_len > 8192 && aligned_len <= 32768)
+else if (aligned_len > 8192 && aligned_len <= CHORBA_MEDIUM_THRESHOLD)
     c = crc32_chorba_32768_nondestructive(c, (uint64_t*) aligned_buf, aligned_len);
-else if (aligned_len > 72)
+else if (aligned_len > CHORBA_SMALL_THRESHOLD)
     c = crc32_chorba_small_nondestructive(c, (uint64_t*) aligned_buf, aligned_len);

Other than that, I am ready to merge this next.
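The size dispatch above can be exercised in isolation. The sketch below mirrors the selection logic for the `W == 8` path (threshold values assume an 8-byte `z_word_t`; the enum names are illustrative, not zlib-ng's):

```c
#include <stddef.h>

/* Size thresholds, assuming sizeof(z_word_t) == 8. */
#define CHORBA_LARGE_THRESHOLD  (8u * 64u * 1024u)   /* 524288 */
#define CHORBA_MEDIUM_THRESHOLD 32768u
#define CHORBA_SMALL_THRESHOLD  72u

enum crc_impl { IMPL_LARGE, IMPL_MEDIUM, IMPL_SMALL, IMPL_BASE };

/* Mirror of the implementation selection in PREFIX(crc32_braid). */
static enum crc_impl select_impl(size_t aligned_len) {
    if (aligned_len > CHORBA_LARGE_THRESHOLD)
        return IMPL_LARGE;   /* crc32_chorba_118960_nondestructive */
    else if (aligned_len > 8192 && aligned_len <= CHORBA_MEDIUM_THRESHOLD)
        return IMPL_MEDIUM;  /* crc32_chorba_32768_nondestructive */
    else if (aligned_len > CHORBA_SMALL_THRESHOLD)
        return IMPL_SMALL;   /* crc32_chorba_small_nondestructive */
    return IMPL_BASE;        /* crc32_braid_base */
}
```

Note the gap: lengths in (32768, 524288] fall through to the small Chorba variant, as in the original chain of conditions.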

@samrussell
Contributor Author

Rebased and magic numbers given names as per AI feedback


Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
arch/x86/crc32_pclmulqdq_tpl.h (1)

108-326: 🛠️ Refactor suggestion

Document Chorba algorithm implementation and consider optimizations.

The implementation would benefit from:

  1. Detailed documentation of the algorithm's steps
  2. Explanation of magic numbers and XOR patterns
  3. Potential performance optimizations

Add comprehensive documentation and consider optimizations:

+/*
+ * Chorba Algorithm Implementation
+ * Based on: https://arxiv.org/abs/2412.16398
+ *
+ * The algorithm processes data in 512-byte chunks using:
+ * 1. Initial 8 QWORD reads with specific XOR patterns
+ * 2. Interleaved PCLMUL-based folds
+ * 3. Scaled generator polynomial copies
+ *
+ * XOR Pattern:
+ * - Offsets: 6, 9, 10, 16, 20, 22, 24, 25, 27, 28, 30, 31, 32
+ * - Based on generator_64_bits_unrolled_8 from the paper
+ */

Consider these optimizations:

  1. Use AVX512 for wider SIMD operations
  2. Reduce register pressure by reordering operations
  3. Minimize memory accesses in the XOR patterns
♻️ Duplicate comments (2)
arch/generic/crc32_chorba_c.c (2)

27-27: ⚠️ Potential issue

Add input size validation.

The function requires a minimum input size of 118960 + 512 bytes but doesn't validate it. This could lead to buffer overflows.

Apply this diff to add input validation:

 Z_INTERNAL uint32_t crc32_chorba_118960_nondestructive (uint32_t crc, const z_word_t* input, size_t len) {
+    if (len < (118960 + 512) * sizeof(z_word_t)) {
+        return crc32_braid_internal(crc, (const uint8_t*)input, len);
+    }

474-476: ⚠️ Potential issue

Fix uninitialized variable access.

The crc variable is used in the CRC table lookup before being properly initialized.

Apply this diff to fix the uninitialized variable:

+    uint32_t current_crc = 0;  // Initialize CRC
     for(size_t j = 0; j < (len-i); j++) {
-        crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (crc >> 8);
+        current_crc = crc_table[(current_crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (current_crc >> 8);
     }
+    crc = current_crc;
🧹 Nitpick comments (6)
arch/generic/crc32_braid_c.c (1)

214-221: Add function documentation.

The function handles CRC pre/post-conditioning correctly, but lacks documentation about:

  • Parameter requirements
  • Pre-conditioning process
  • Post-conditioning process

Add this documentation:

+/**
+ * Compute CRC32 checksum for a buffer.
+ *
+ * @param c Initial CRC value (usually 0)
+ * @param buf Input buffer
+ * @param len Length of buffer in bytes
+ * @return Computed CRC32 value
+ *
+ * The function performs:
+ * 1. Pre-conditioning: Inverts initial CRC
+ * 2. Core computation: Using braided algorithm
+ * 3. Post-conditioning: Final inversion
+ */
 uint32_t PREFIX(crc32_braid)(uint32_t c, const uint8_t *buf, size_t len) {
🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

arch/x86/crc32_pclmulqdq_tpl.h (1)

171-211: Consider optimizing fold_12 implementation.

The function follows the same pattern as fold_4 but with a different polynomial. Consider:

  1. Using AVX512 instructions when available
  2. Extracting common folding patterns into helper functions

Consider adding AVX512 support:

+#ifdef X86_AVX512
+static void fold_12_avx512(__m512i *zmm_crc0, __m512i *zmm_crc1,
+                          __m512i *zmm_crc2, __m512i *zmm_crc3) {
+    // AVX512 implementation
+}
+#endif
arch/x86/crc32_fold_pclmulqdq_tpl.h (1)

20-419: Consider restructuring for better maintainability.

The implementation would benefit from:

  1. Breaking down the large function into smaller, focused functions
  2. Better documentation of the control flow
  3. Consistent error handling

Consider extracting common patterns:

+static inline void process_block(__m128i *crc0, __m128i *crc1,
+                                __m128i *crc2, __m128i *crc3,
+                                const __m128i *src, __m128i *dst) {
+    __m128i t0 = _mm_load_si128(src);
+    __m128i t1 = _mm_load_si128(src + 1);
+    __m128i t2 = _mm_load_si128(src + 2);
+    __m128i t3 = _mm_load_si128(src + 3);
+
+#ifdef COPY
+    _mm_storeu_si128(dst, t0);
+    _mm_storeu_si128(dst + 1, t1);
+    _mm_storeu_si128(dst + 2, t2);
+    _mm_storeu_si128(dst + 3, t3);
+#endif
+
+    fold_4(crc0, crc1, crc2, crc3);
+    *crc0 = _mm_xor_si128(*crc0, t0);
+    *crc1 = _mm_xor_si128(*crc1, t1);
+    *crc2 = _mm_xor_si128(*crc2, t2);
+    *crc3 = _mm_xor_si128(*crc3, t3);
+}
arch/generic/crc32_chorba_c.c (3)

11-26: Enhance function documentation.

While the documentation explains the algorithm phases, consider adding:

  • Performance characteristics and expected speedup
  • Memory requirements and alignment constraints
  • Explanation of the magic numbers (118960, 512)
  • Error handling behavior for invalid inputs

656-657: Expand performance-related comment.

The comment about for vs while loop performance should be expanded to explain why this occurs and whether this is platform-specific.

Replace the comment with:

-    /* This is weird, doing for vs while drops 10% off the exec time */
+    /* Using for loop instead of while loop improves performance by ~10%
+     * This is likely due to better compiler optimizations with for loops,
+     * specifically on x86 architectures where the loop condition check
+     * can be combined with the increment operation.
+     */

1210-1210: Consider SIMD optimizations for 32-bit version.

The 32-bit version could benefit from SIMD optimizations (SSE2/NEON) to improve performance, especially since register pressure is a concern on 32-bit architectures.

Would you like me to generate a SIMD-optimized version of this function?

🛑 Comments failed to post (2)
arch/generic/crc32_c.c (1)

5-42: 💡 Verification agent

🧩 Analysis chain

Verify alignment and buffer size requirements.

The function handles alignment and buffer size checks, but there are a few considerations:

  1. The alignment check assumes 8-byte alignment but uses 0xF mask (16-byte).
  2. The buffer size check for Chorba algorithm (len >= 512 + 64 + 16*8) should be documented.

Run this script to verify the alignment requirements:


🏁 Script executed:

#!/bin/bash
# Search for alignment-related code in the codebase
rg -B 2 -A 2 'algn_diff|alignment|aligned' arch/generic/

Length of output: 1907


Action Required: Add Inline Documentation for Alignment & Buffer-Size Logic

  • The alignment adjustment computed as
    unsigned long algn_diff = ((uintptr_t)8 - ((uintptr_t)buf & 0xF)) & 0xF;
    forces the buffer pointer to a state where it is 8 mod 16. This non-obvious calculation would benefit from an inline comment explaining why the pointer needs to be adjusted in this manner rather than using a simple 8‑byte alignment.

  • The thresholds determining which CRC variant to invoke (for example, the condition
    if(aligned_len > (sizeof(z_word_t) * 64) * 1024)
    and the subsequent length checks for the Chorba routines) are based on performance or algorithmic considerations that are not immediately apparent. Please document the rationale behind these buffer size criteria to clarify their intent and derivation.

🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 5-5: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

arch/generic/crc32_chorba_c.c (1)

482-482: 🛠️ Refactor suggestion

Add input size validation and document magic numbers.

The function should validate input size and document the significance of magic numbers (145, 183, 211).

Apply this diff to add input validation and documentation:

+/**
+ * Implements the Chorba algorithm for CRC32 computation with a 32KB buffer.
+ *
+ * @param crc Initial CRC value
+ * @param buf Input data buffer
+ * @param len Length of input data in bytes
+ * @return Computed CRC32 value
+ *
+ * @note Requires minimum input size of (300*8+64) bytes
+ * @note Uses 32KB temporary buffer
+ */
 Z_INTERNAL uint32_t crc32_chorba_32768_nondestructive (uint32_t crc, const uint64_t* buf, size_t len) {
+    if (len < (300*8+64)) {
+        return crc32_braid_internal(crc, (const uint8_t*)buf, len);
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

/**
 * Implements the Chorba algorithm for CRC32 computation with a 32KB buffer.
 *
 * @param crc Initial CRC value
 * @param buf Input data buffer
 * @param len Length of input data in bytes
 * @return Computed CRC32 value
 *
 * @note Requires minimum input size of (300*8+64) bytes
 * @note Uses 32KB temporary buffer
 * // TODO: Document the significance of magic numbers 145, 183, and 211.
 */
Z_INTERNAL uint32_t crc32_chorba_32768_nondestructive (uint32_t crc, const uint64_t* buf, size_t len) {
    if (len < (300*8+64)) {
        return crc32_braid_internal(crc, (const uint8_t*)buf, len);
    }
    // ... rest of the CRC computation implementation ...
}

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
arch/generic/crc32_chorba_c.c (2)

474-476: ⚠️ Potential issue

Fix uninitialized variable access in CRC computation.

The crc variable is used in the CRC table lookup before being properly initialized.

Initialize crc before use:

+    uint32_t current_crc = 0;  // Initialize CRC
     for(size_t j = 0; j < (len-i); j++) {
-        crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (crc >> 8);
+        current_crc = crc_table[(current_crc ^ final_bytes[j] ^ bitbufferbytes[(j+i) % bitbuffersizebytes]) & 0xff] ^ (current_crc >> 8);
     }
+    crc = current_crc;

27-27: ⚠️ Potential issue

Add input validation for minimum size requirement.

The function requires a minimum input size of 118960 + 512 bytes but doesn't validate it. Consider adding a size check:

 Z_INTERNAL uint32_t crc32_chorba_118960_nondestructive (uint32_t crc, const z_word_t* input, size_t len) {
+    if (len < (118960 + 512) * sizeof(z_word_t)) {
+        return crc32_braid_internal(crc, (const uint8_t*)input, len);
+    }
🧹 Nitpick comments (3)
arch/generic/crc32_braid_c.c (2)

62-110: Add brief comments explaining partial boundary alignment handling.

While the code for processing partial data until reaching a z_word_t boundary is correct, adding concise in-code comments would improve maintainability and help new contributors understand why this approach is necessary before braided processing.


214-220: Check for naming consistency.

crc32_braid now serves as a thin wrapper that applies pre- and post-conditioning, yet the file name and function name might imply a higher-level braided logic. Confirm you have aligned naming with the new code structure (especially given other recommendations about naming or refactoring from previous discussions).

🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

arch/x86/crc32_pclmulqdq_tpl.h (1)

171-211: Document the polynomial used in fold_12.

The constant 0x596C8D81, 0xF5E48C85 is set via _mm_set_epi64x without extra context, unlike the well-documented _mm_set_epi32(0x00000001, 0x54442bd4, 0x00000001, 0xc6e41596) in fold_1..fold_4. Provide a short explanation or reference to clarify how this polynomial was chosen and how it complements the existing fold methods.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5570933 and 071df72.

📒 Files selected for processing (17)
  • CMakeLists.txt (1 hunks)
  • Makefile.in (2 hunks)
  • arch/generic/Makefile.in (2 hunks)
  • arch/generic/crc32_braid_c.c (2 hunks)
  • arch/generic/crc32_c.c (1 hunks)
  • arch/generic/crc32_chorba_c.c (1 hunks)
  • arch/generic/generic_functions.h (2 hunks)
  • arch/s390/crc32-vx.c (2 hunks)
  • arch/x86/crc32_fold_pclmulqdq_tpl.h (1 hunks)
  • arch/x86/crc32_pclmulqdq_tpl.h (1 hunks)
  • crc32_c.h (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (1 hunks)
  • win32/Makefile.a64 (2 hunks)
  • win32/Makefile.arm (2 hunks)
  • win32/Makefile.msc (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • test/benchmarks/benchmark_crc32.cc
  • Makefile.in
  • arch/s390/crc32-vx.c
  • CMakeLists.txt
  • functable.c
  • win32/Makefile.a64
  • win32/Makefile.arm
  • arch/generic/Makefile.in
  • test/test_crc32.cc
  • win32/Makefile.msc
  • arch/generic/generic_functions.h
  • crc32_c.h
🧰 Additional context used
🧠 Learnings (3)
arch/generic/crc32_c.c (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
arch/generic/crc32_chorba_c.c (2)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
arch/generic/crc32_braid_c.c (3)
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:0-0
Timestamp: 2025-01-06T07:20:02.028Z
Learning: Business requirement: Memory allocations are not allowed, so large stack allocations must be used instead.
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:215-217
Timestamp: 2024-12-27T10:06:16.184Z
Learning: There's a business requirement to use stack allocation instead of heap for the Chorba buffer.
Learnt from: samrussell
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.
🪛 Cppcheck (2.10-2)
arch/generic/crc32_c.c

[error] 12-12: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

arch/generic/crc32_chorba_c.c

[error] 638-638: Uninitialized variable

(uninitvar)

arch/generic/crc32_braid_c.c

[error] 214-214: There is an unknown macro here somewhere. Configuration is required. If PREFIX is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (10)
arch/generic/crc32_c.c (4)

19-19: Double-check alignment calculation.

Using (uintptr_t)8 in combination with a 16-byte mask (& 0xF) is suspicious. If the goal is to align the buffer to a 16-byte boundary, consider replacing (uintptr_t)8 with (uintptr_t)16 or using a well-known formula that aligns based on (16 - (buf & 0xF)) & 0xF. This helps ensure that aligned_buf is genuinely 16-byte aligned.


26-38: Verify Chorba boundary conditions.

When selecting between crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, and crc32_chorba_small_nondestructive (or its 32-bit variant), ensure that the cutoff thresholds in the code (e.g. CHORBA_LARGE_THRESHOLD, CHORBA_MEDIUM_LOWER_THRESHOLD, etc.) truly optimize performance for typical datasets. Excessively narrow or broad thresholds might degrade performance in edge cases.


43-45: Consider test coverage for the #else branch.

When WITHOUT_CHORBA is defined, the code falls back to crc32_braid_internal. Confirm that this fallback path is tested to maintain coverage and ensure consistent behavior across build configurations.


47-48: LGTM: Return statement is straightforward.

The final XOR of c ^ 0xffffffff is standard CRC post-conditioning. The logic here appears correct.

arch/generic/crc32_braid_c.c (2)

173-193: Verify that 64-bit word swaps are safe on 32-bit builds.

ZSWAPWORD macros and these braided loops assume W=4 or 8 and rely on 64-bit types. For 32-bit environments with limited registers, ensure this logic remains correct and isn't inadvertently causing performance issues or incorrect results when building for 32-bit targets.


200-208: Good approach to finalize remaining bytes.

This fallback loop for leftover bytes helps avoid partial word complications. The logic is sound, and the macros DO8 and DO1 are used consistently.

arch/x86/crc32_fold_pclmulqdq_tpl.h (2)

108-114: LGTM! Well-documented algorithm implementation.

The implementation correctly follows the Chorba algorithm from the research paper, with clear documentation explaining the interleaving of PCLMUL-based folds with scaled generator polynomial copies.


115-115: Verify minimum input size requirement.

The loop requires at least 512 + 64 + 16*8 bytes of input data, but there's no explicit check for this requirement.

Consider adding a size check to prevent potential buffer overruns:

+    if (len < 512 + 64 + 16*8) {
+        goto process_remaining;
+    }
     while (len >= 512 + 64 + 16*8) {
arch/generic/crc32_chorba_c.c (2)

482-488: LGTM! Proper buffer initialization and error handling.

The function correctly initializes the buffer and handles the initial CRC value.


1210-1224: LGTM! Well-optimized for 32-bit architectures.

The function is properly optimized for 32-bit systems with appropriate buffer sizes and variable types.

Comment on lines +155 to +158
xmm_t0 = _mm_xor_si128(xmm_t0, chorba6);
xmm_t1 = _mm_xor_si128(_mm_xor_si128(xmm_t1, chorba5), chorba8);
xmm_t2 = _mm_xor_si128(_mm_xor_si128(_mm_xor_si128(xmm_t2, chorba4), chorba8), chorba7);
xmm_t3 = _mm_xor_si128(_mm_xor_si128(_mm_xor_si128(xmm_t3, chorba3), chorba7), chorba6);
Contributor


🛠️ Refactor suggestion

Ensure MSVC compatibility for SIMD XOR operations.

Based on past review comments, MSVC might not support binary operations on SIMD operands. Consider using intrinsics for better compiler compatibility:

-            xmm_t0 = _mm_xor_si128(_mm_xor_si128(_mm_xor_si128(xmm_t0, chorba4), chorba8), chorba6);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba4);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba8);
+            xmm_t0 = _mm_xor_si128(xmm_t0, chorba6);

Also applies to: 177-182, 201-204, 224-227, 247-250, 270-273, 292-295, 314-317

Member

@Dead2 Dead2 left a comment


LGTM

@Dead2 Dead2 merged commit b33ba96 into zlib-ng:develop Feb 15, 2025
@KungFuJesus
Collaborator

Awesome work on this, can't wait to get around to vectorizing it again.

@Dead2 Dead2 mentioned this pull request Nov 5, 2025


6 participants