Continued cleanup of old UNALIGNED_OK checks #1830
Conversation
Walkthrough: This pull request modifies memory comparison functions in the zlib-ng library, consolidating and renaming functions related to aligned and unaligned comparisons.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
deflate_rle.c (1)
Line range hint 13-21: Fix syntax error in preprocessor directive

There's a syntax error in the `#elif` condition on line 16: a missing closing parenthesis. Apply this fix:

```diff
-# elif defined(HAVE_BUILTIN_CTZ
+# elif defined(HAVE_BUILTIN_CTZ)
```

🧰 Tools
🪛 cppcheck (2.10-2)
[error] 16-16: failed to evaluate #elif condition
(preprocessorErrorDirective)
🧹 Nitpick comments (5)
test/test_compare256_rle.cc (1)
14-16: Consider documenting the ZTEST_ALL workaround

While the workaround is necessary for testing, it would be helpful to document why this approach was chosen and what potential risks it might introduce.
compare256_rle.h (1)
Line range hint 11-45: Consider optimizing the byte comparison loop

The byte-by-byte comparison implementation is safe but could be more maintainable and potentially more efficient.

Consider refactoring to reduce code duplication:

```diff
-    do {
-        if (*src0 != *src1)
-            return len;
-        src1 += 1, len += 1;
-        if (*src0 != *src1)
-            return len;
-        src1 += 1, len += 1;
-        // ... repeated 8 times ...
-    } while (len < 256);
+    do {
+        for (int i = 0; i < 8; i++) {
+            if (*src0 != *src1)
+                return len;
+            src1 += 1;
+            len += 1;
+        }
+    } while (len < 256);
```

arch/generic/compare256_c.c (1)
Line range hint 11-60: Consider applying the same optimization as compare256_rle_c

The implementation could benefit from the same loop optimization suggested earlier to reduce code duplication.
chunkset_tpl.h (1)
Line range hint 263-285: Consider additional safety improvements for CHUNKCOPY_SAFE

The implementation looks good with proper bounds checking and special case handling. However, consider these improvements:
- The cast to `unsigned` in the CHUNKMEMSET call could potentially truncate len on 32-bit systems.
- The function could benefit from an Assert for null pointer validation.
Consider applying these changes:
```diff
 static inline uint8_t *CHUNKCOPY_SAFE(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe) {
+    Assert(out != NULL && from != NULL && safe != NULL, "null pointer check");
     if (out == from)
         return out + len;
     uint64_t safelen = (safe - out);
     len = MIN(len, safelen);
 #ifndef HAVE_MASKED_READWRITE
     uint64_t from_dist = (uint64_t)llabs(safe - from);
     if (UNLIKELY(from_dist < sizeof(chunk_t) || safelen < sizeof(chunk_t))) {
         while (len--) {
             *out++ = *from++;
         }
         return out;
     }
 #endif
-    return CHUNKMEMSET(out, from, (unsigned)len);
+    /* Handle potential truncation on 32-bit systems */
+    if (UNLIKELY(len > UINT_MAX)) {
+        while (len > UINT_MAX) {
+            out = CHUNKMEMSET(out, from, UINT_MAX);
+            from += UINT_MAX;
+            len -= UINT_MAX;
+        }
+    }
+    return CHUNKMEMSET(out, from, (unsigned)len);
 }
```

test/benchmarks/benchmark_compare256.cc (1)
15-18: Consider a more maintainable alternative to direct .c file inclusion

The current workaround of directly including a .c file and using a macro to access all functions suggests potential architectural issues. This approach could lead to maintenance challenges and potential symbol conflicts.
Consider these alternatives:
- Move the common functions to a header file
- Create a proper testing API in the compare256 module
- Use proper extern declarations for the test functions
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/cmake.yml (1 hunks)
- CMakeLists.txt (0 hunks)
- README.md (1 hunks)
- arch/generic/compare256_c.c (3 hunks)
- arch/generic/generic_functions.h (1 hunks)
- chunkset_tpl.h (1 hunks)
- cmake/detect-sanitizer.cmake (1 hunks)
- compare256_rle.h (3 hunks)
- configure (0 hunks)
- deflate_rle.c (1 hunks)
- insert_string_tpl.h (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (2 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (3)
- match_tpl.h
- configure
- CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- README.md
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🪛 cppcheck (2.10-2)
deflate_rle.c
[error] 16-16: failed to evaluate #elif condition
(preprocessorErrorDirective)
🔇 Additional comments (20)
cmake/detect-sanitizer.cmake (1)
114-114: LGTM! Unconditional alignment sanitizer check improves safety.
The unconditional inclusion of the 'alignment' check in the sanitizer list is a positive change that aligns with the PR's objective of cleaning up unaligned memory access checks. This will help catch potential memory alignment issues across all builds.
zbuild.h (3)
246-254: LGTM! Well-structured architecture-specific optimizations.
The implementation correctly defines optimal comparison widths for x86_64 (64-bit), i386 (32-bit), and ARM64 (64-bit with unaligned access support). The architecture detection and conditions are accurate.
255-262: LGTM! Proper handling of ARM and PowerPC architectures.
The code correctly:
- Checks for ARM FEATURE_UNALIGNED support
- Handles PowerPC64 with appropriate endianness check
- Sets appropriate OPTIMAL_CMP values (32/64 bits) based on architecture capabilities
264-269: LGTM! Safe fallback mechanism.
The implementation provides a robust fallback mechanism:
- Respects NO_UNALIGNED override by unsetting OPTIMAL_CMP
- Defaults to a safe byte-by-byte (8-bit) comparison width when no architecture-specific optimization is available
.github/workflows/cmake.yml (1)
84-88: LGTM! Simplified UBSAN workflow configuration.
The changes correctly:
- Update the job name to reflect the removal of unaligned checks
- Remove the redundant WITH_UNALIGNED flag from cmake arguments
This simplification aligns with the PR's cleanup objectives.
test/test_compare256_rle.cc (1)
58-65: LGTM: Proper replacement of UNALIGNED_OK checks
The change from UNALIGNED_OK to BYTE_ORDER == LITTLE_ENDIAN is appropriate since unaligned access is primarily a concern for little-endian architectures. The nested checks for HAVE_BUILTIN_CTZ/CTZLL are also properly structured.
test/benchmarks/benchmark_compare256_rle.cc (2)
14-16: LGTM: Consistent with test file changes
The ZTEST_ALL workaround is consistently applied across test files.
67-74: LGTM: Proper conditional compilation structure
The conditional compilation structure correctly mirrors the test file, maintaining consistency across the test suite.
test/test_compare256.cc (2)
72-80: LGTM: Proper architecture-specific conditional compilation
The changes correctly:
- Replace UNALIGNED_OK with BYTE_ORDER checks
- Maintain proper nesting of CPU feature checks
- Keep consistent structure with other test files
16-18: Review the direct inclusion of C implementation file
The comment acknowledges this as an "unorthodox workaround", but including a .c file directly is generally discouraged. Consider:
- Moving the test-specific functions to a header file
- Creating a test-specific header that exposes these functions
insert_string_tpl.h (1)
32-39: Verify performance impact of consistent memcpy usage
The changes improve safety and consistency by using memcpy for all cases, but we should verify there's no significant performance regression.
Consider benchmarking the new implementation against the old one, particularly on platforms where unaligned access was previously optimized.
✅ Verification successful
Let's examine the benchmark files to understand if they specifically test the hash calculation performance related to this change.
Let's check the previous implementation to compare with the new changes.
Let's examine PR #1828 since it's directly related to our changes.
Based on the PR comments and benchmark results, I can now generate the final response:
Performance impact of memcpy usage has been verified
The benchmark results from the PR show no significant performance regression from using memcpy consistently. In fact, the changes show:
- Comparable performance across all compression levels
- Slight improvements in some cases (level 9)
- No measurable decompression speed impact
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Look for existing benchmarks that measure hash calculation performance
rg -l "benchmark.*hash" test/benchmarks/
Length of output: 226
Script:
#!/bin/bash
# Check content of slidehash benchmark which is most relevant to our case
rg -A 10 "benchmark.*hash" test/benchmarks/benchmark_slidehash.cc
# Also check the implementation details around HASH_CALC_READ
rg -B 5 -A 10 "HASH_CALC_READ" --type cpp
Length of output: 3280
Script:
#!/bin/bash
# Check the git history for the previous implementation of HASH_CALC_READ
git log -p --no-merges -S "HASH_CALC_READ" insert_string_tpl.h
# Also check if there are any existing performance-related issues or discussions
gh issue list --search "performance hash"
gh pr list --search "performance hash"
Length of output: 408
Script:
#!/bin/bash
# Get the PR description and comments for PR #1828
gh pr view 1828 --json body,comments
# Also check for any benchmark results or performance discussions in the PR
gh pr view 1828 --json comments | jq -r '.comments[].body' | grep -i "performance\|benchmark"
Length of output: 35801
compare256_rle.h (3)
Line range hint 47-74: LGTM: Safe implementation of unaligned 16-bit comparison
The implementation correctly uses memcpy for safe unaligned memory access and properly handles byte order requirements.
Line range hint 76-103: LGTM: Efficient implementation using compiler intrinsics
The implementation makes good use of __builtin_ctz for efficient comparison while maintaining safety with proper memory access patterns.
Line range hint 104-132: LGTM: Verify 64-bit alignment requirements
The implementation is efficient and safe, using appropriate compiler intrinsics for 64-bit operations.
✅ Verification successful
Let me verify the usage of the 64-bit operations and their platform dependencies further.
LGTM: 64-bit operations are properly guarded and have fallbacks
The implementation is safe with proper platform-specific guards:
- The 64-bit operations are only used when `HAVE_BUILTIN_CTZLL` is available
- There are appropriate fallbacks to 32-bit (`HAVE_BUILTIN_CTZ`), 16-bit, and byte-by-byte comparison implementations
- The feature detection is handled during build configuration via CMake/configure
- The code is well-tested with both unit tests and benchmarks for all variants
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify that 64-bit operations are only used on supported platforms
rg -l "HAVE_BUILTIN_CTZLL" | xargs rg "compare256.*64"
Length of output: 1335
Script:
#!/bin/bash
# Check how HAVE_BUILTIN_CTZLL is defined and its platform dependencies
rg "HAVE_BUILTIN_CTZLL" -B 3 -A 3
# Check if there are any platform-specific checks or fallbacks
rg "compare256_rle" -B 2 -A 2
Length of output: 19454
arch/generic/generic_functions.h (2)
34-53: LGTM: Well-structured conditional compilation
The implementation properly handles different architectures and capabilities with a clear fallback hierarchy.
Line range hint 57-71: LGTM: Clear and maintainable implementation selection
The generic implementation selection is well-structured and maintainable.
arch/generic/compare256_c.c (2)
Line range hint 62-142: LGTM: Safe and efficient unaligned comparison implementations
The implementations properly handle unaligned access and follow best practices for memory comparison.
Line range hint 143-183: LGTM: Consistent implementation for 64-bit operations
The implementation maintains consistency with other bit-width implementations while properly handling 64-bit specific requirements.
chunkset_tpl.h (1)
230-232: LGTM! Verify OPTIMAL_CMP usage across the codebase.
The switch from UNALIGNED checks to OPTIMAL_CMP for alignment control is a good improvement. The alignment masks (7 for 8-byte alignment, 3 for 4-byte alignment) are correctly defined based on the optimal comparison size.
Let's verify the OPTIMAL_CMP usage:
✅ Verification successful
OPTIMAL_CMP usage is consistent and properly defined
The verification confirms that:
- OPTIMAL_CMP is properly defined in zbuild.h with architecture-specific values (64 for x86_64/ARM64, 32 for x86/ARM, fallback to 8)
- The alignment masks in chunkset_tpl.h (7 for OPTIMAL_CMP < 32, 3 for OPTIMAL_CMP == 32) correctly correspond to the defined OPTIMAL_CMP values
- No remaining UNALIGNED_OK checks were found, confirming successful migration
- The alignment handling is consistent throughout the codebase with proper boundary checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check OPTIMAL_CMP usage and definitions
# Verify that OPTIMAL_CMP is consistently used and properly defined
# Find OPTIMAL_CMP definition and usage
rg -n "OPTIMAL_CMP" --type c --type h
# Check if there are any remaining UNALIGNED checks that should be migrated
rg -n "UNALIGNED.*OK" --type c --type h
Length of output: 508
Script:
#!/bin/bash
# Let's examine the context around OPTIMAL_CMP definitions and usage
# Check zbuild.h definitions with context
rg -A 5 -B 5 "OPTIMAL_CMP" zbuild.h
# Check chunkset_tpl.h usage with context
rg -A 5 -B 5 "OPTIMAL_CMP" chunkset_tpl.h
# Look for any other alignment-related macros that might interact
rg -n "(ALIGN|align)" chunkset_tpl.h
Length of output: 1935
test/benchmarks/benchmark_compare256.cc (1)
73-81: Verify architectural assumptions about endianness
The unaligned benchmarks are now conditionally compiled only for little-endian systems. While this aligns with removing UNALIGNED_OK checks, we should verify:
- That unaligned access is indeed only supported/optimized for little-endian
- Whether we're missing test coverage for big-endian systems
✅ Verification successful
Based on the search results, I can now generate the final response:
Endianness check for unaligned access is correctly implemented
The endianness-specific code patterns reveal that:
- Unaligned access optimizations are consistently guarded by LITTLE_ENDIAN checks across the codebase (compare256_c.c, compare256_rle.h)
- Big-endian systems are properly handled with fallback implementations (compare256_c.c has `#if BYTE_ORDER != LITTLE_ENDIAN || defined(ZTEST_ALL)`)
- Test coverage for both endiannesses exists through the ZTEST_ALL macro
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any other endianness-specific code patterns
rg "BYTE_ORDER|LITTLE_ENDIAN|BIG_ENDIAN" --type c --type cpp
# Look for other unaligned access implementations
rg "unaligned|UNALIGNED" --type c --type cpp
Length of output: 10989
Force-pushed from 9ca6a80 to f0bdd49
Force-pushed from f0bdd49 to 97964e1
Actionable comments posted: 1
🧹 Nitpick comments (4)
compare256_rle.h (2)
Line range hint 11-45: LGTM! Consider adding documentation for the big-endian implementation.

The new byte-by-byte comparison implementation for big-endian systems is safe and correct. The conditional compilation based on byte order is an improvement over the previous UNALIGNED_OK checks.
Consider adding a brief comment explaining why a byte-by-byte comparison is used for big-endian systems versus the optimized versions for little-endian.
Line range hint 1-132: Good architectural improvements in alignment handling.

The changes successfully:
- Replace UNALIGNED_OK with more precise byte-order based conditions
- Maintain memory safety through consistent use of memcpy
- Provide appropriate implementations for different architectures
This aligns well with the PR's objective of cleaning up alignment checks while potentially improving maintainability.
Consider documenting these architectural decisions in a comment block at the top of the file to help future maintainers understand the design choices.
arch/generic/compare256_c.c (2)
Line range hint 11-60: Consider performance optimization for byte comparison

While the byte-by-byte comparison is safe, it could be optimized for better performance on big-endian systems.

Consider using word-sized comparisons when alignment permits:

```diff
 static inline uint32_t compare256_c_static(const uint8_t *src0, const uint8_t *src1) {
     uint32_t len = 0;

+    // Fast path: compare aligned words first
+    if (((uintptr_t)src0 & (sizeof(size_t)-1)) == 0 &&
+        ((uintptr_t)src1 & (sizeof(size_t)-1)) == 0) {
+        while (len + sizeof(size_t) <= 256) {
+            if (*(size_t*)src0 != *(size_t*)src1)
+                break;
+            src0 += sizeof(size_t);
+            src1 += sizeof(size_t);
+            len += sizeof(size_t);
+        }
+    }
+    // Fallback to byte comparison
     do {
         if (*src0 != *src1)
             return len;
```
Line range hint 11-144: Architecture and implementation hierarchy looks solid.

The implementation provides a good balance of safety and performance with:
- Byte-by-byte fallback for big-endian
- 16-bit operations using safe memcmp
- 32/64-bit operations using memcpy + builtin_ctz
Consider adding architecture-specific optimized implementations in separate files for frequently used platforms.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- arch/generic/compare256_c.c (3 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (3 hunks)
- deflate_rle.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (2 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (5)
- test/benchmarks/benchmark_compare256.cc
- test/benchmarks/benchmark_compare256_rle.cc
- deflate_rle.c
- test/test_compare256_rle.cc
- test/test_compare256.cc
🔇 Additional comments (6)
compare256_rle.h (2)
Line range hint 47-104: Verify BYTE_ORDER macro availability across all supported platforms.
The change from UNALIGNED_OK to BYTE_ORDER-based conditional compilation is good, but we should ensure BYTE_ORDER is consistently defined.
✅ Verification successful
The search results show that BYTE_ORDER is well-defined and consistently handled across the codebase through the zendian.h header file. This header implements a robust fallback mechanism:
- First checks compiler-provided `__BYTE_ORDER__`
- Then tries various system endian headers (`endian.h`, `machine/endian.h`, `sys/endian.h`)
- Finally provides default definitions if needed
Let me verify the actual implementation in zendian.h to ensure it's robust:
BYTE_ORDER macro is properly defined and safe to use across all supported platforms
The implementation in zendian.h is robust and comprehensive, covering all major platforms and architectures:
- Compiler-provided `__BYTE_ORDER__` detection
- Windows platforms via architecture detection
- Unix-like systems through standard endian headers
- Special handling for MinGW, Apple, BSD, and Solaris
- Proper fallback mechanisms and error detection
The change from UNALIGNED_OK to BYTE_ORDER-based conditional compilation is safe and well-supported.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if BYTE_ORDER is consistently defined across the codebase
# and if there are any platform-specific definitions we need to handle
# Search for BYTE_ORDER definitions or includes
rg -n "BYTE_ORDER|endian\.h"
# Check for platform-specific handling
rg -n "#if.*BYTE_ORDER"
Length of output: 5853
Script:
#!/bin/bash
# Check the complete implementation of zendian.h
cat zendian.h
Length of output: 1720
Line range hint 104-132: Verify fallback behavior when builtin functions are unavailable.
The optimized implementations correctly use memcpy for safe unaligned access. However, we should verify the fallback behavior when HAVE_BUILTIN_CTZ/CTZLL are not defined.
✅ Verification successful
Fallback behavior is properly implemented for platforms without CTZ/CTZLL builtins
The codebase has a well-designed fallback mechanism:
- In `deflate_rle.c`, there's a clear fallback chain:

```c
#if BYTE_ORDER == LITTLE_ENDIAN
#  if defined(HAVE_BUILTIN_CTZLL)
#    define compare256_rle compare256_rle_unaligned_64
#  elif defined(HAVE_BUILTIN_CTZ)
#    define compare256_rle compare256_rle_unaligned_32
#  else
#    define compare256_rle compare256_rle_unaligned_16
#  endif
#else
#  define compare256_rle compare256_rle_c
#endif
```

- The `compare256_rle_unaligned_16` implementation is always available and doesn't rely on builtin functions, serving as the base fallback.
- For big-endian systems, it falls back to the basic byte-by-byte comparison (`compare256_rle_c`).
- The compiler detection for CTZ/CTZLL is properly handled in both CMake and configure scripts.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how the code handles platforms without CTZ/CTZLL builtins
# Search for fallback implementations
rg -n "fallback.*ctz"
# Check if there are platform-specific alternatives
ast-grep --pattern 'define HAVE_BUILTIN_CTZ'
ast-grep --pattern 'define HAVE_BUILTIN_CTZLL'
Length of output: 119
Script:
#!/bin/bash
# Let's check for any non-builtin implementations of compare256_rle
rg -n "compare256_rle" -A 10
# Also search for any other CTZ implementations
rg -n "ctz" -B 2 -A 2
# Check for compiler detection macros
rg -n "HAVE_BUILTIN" -B 2 -A 2
Length of output: 41449
arch/generic/generic_functions.h (2)
57-58: LGTM! Generic implementation selection is consistent.
The selection of generic implementations aligns well with the architecture-specific function declarations above.
34-38: LGTM! Verify function availability across architectures.
The reorganization of functions under BYTE_ORDER != LITTLE_ENDIAN improves clarity and architecture-specific handling.
Let's verify the function declarations match their implementations:
✅ Verification successful
Let's gather more information to verify the implementations.
Let's check the actual implementations in compare256_c.c to verify the function definitions.
Function declarations are correctly implemented and conditionally defined
The verification confirms that:
- The functions are correctly implemented in compare256_c.c under the appropriate endianness conditions
- `longest_match_c` and `longest_match_slow_c` are defined via template inclusion under `BYTE_ORDER == BIG_ENDIAN`
- `compare256_unaligned_c` is not directly visible but is part of the unaligned comparison implementations for LITTLE_ENDIAN systems
- The architecture-specific organization in generic_functions.h aligns with the actual implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify function declarations match implementations
# Search for function implementations in source files
rg -l "^(uint32_t |void |static.*)(longest_match_c|longest_match_slow_c|compare256_unaligned_c)\b" arch/
Length of output: 139
Script:
#!/bin/bash
# Search for the function implementations with context
rg -A 5 "^(uint32_t |void |static.*)(longest_match_c|longest_match_slow_c|compare256_unaligned_c)\b" --type c
# Also check for any potential references to these functions
rg "(longest_match_c|longest_match_slow_c|compare256_unaligned_c)" --type c
Length of output: 1533
Script:
#!/bin/bash
# Get the content of compare256_c.c to verify implementations
cat arch/generic/compare256_c.c
Length of output: 4961
arch/generic/compare256_c.c (2)
Line range hint 62-142: LGTM! Safe unaligned access implementation.
The use of zng_memcmp_2 ensures safe 16-bit comparisons without alignment requirements.
143-144: Verify memory safety for 64-bit comparisons.
The removal of UNALIGNED64_OK check relies on memcpy for safe access, but we should verify this is sufficient across all platforms.
Let's check for potential alignment issues:
✅ Verification successful
The previous search revealed alignment-related code across the codebase, but we need to specifically examine the implementation of compare256 functions to verify the memory safety aspects after the UNALIGNED64_OK removal. Let's check the actual implementation and memcpy usage.
Memory safety is properly handled through memcpy
The implementation correctly uses memcpy for both 32-bit and 64-bit comparisons, which ensures safe unaligned memory access across all platforms. The removal of UNALIGNED64_OK check is safe because:
- All memory accesses are handled through memcpy, which safely handles unaligned access
- The implementation uses proper sized buffers (sizeof) for both source and destination
- This approach is platform-agnostic and safer than direct unaligned access
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for alignment-related issues or discussions
# Look for alignment-related comments or issues
rg -i "align|unaligned" --type md
rg -i "align|unaligned" --type cpp arch/ include/ -C2
Length of output: 9364
Script:
#!/bin/bash
# Let's check the implementation details of compare256 functions
ast-grep --pattern 'compare256_unaligned_64($$$)'
ast-grep --pattern 'compare256_unaligned_32($$$)'
ast-grep --pattern 'compare256_unaligned_16($$$)'
# Also check for memcpy usage in these functions
rg "memcpy" arch/generic/compare256_c.c -C3
Length of output: 530
Force-pushed from 97964e1 to 9a7d70b
Actionable comments posted: 0
🧹 Nitpick comments (2)
compare256_rle.h (1)
Line range hint 78-102: LGTM: Optimized 32-bit and 64-bit implementations

The 32-bit and 64-bit implementations:
- Efficiently use CPU features (CTZ/CTZLL)
- Handle alignment safely through memcpy
- Properly calculate match lengths
Consider adding a comment explaining the bit manipulation logic for better maintainability.
Also applies to: 104-132
arch/generic/generic_functions.h (1)
44-46: Fix indentation inconsistency

The indentation in the fallback case doesn't match the style used in other conditional blocks.

```diff
-  uint32_t longest_match_16(deflate_state *const s, Pos cur_match);
-  uint32_t longest_match_slow_16(deflate_state *const s, Pos cur_match);
-  uint32_t compare256_16(const uint8_t *src0, const uint8_t *src1);
+    uint32_t longest_match_16(deflate_state *const s, Pos cur_match);
+    uint32_t longest_match_slow_16(deflate_state *const s, Pos cur_match);
+    uint32_t compare256_16(const uint8_t *src0, const uint8_t *src1);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- arch/generic/compare256_c.c (5 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (4 hunks)
- deflate_rle.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (2 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (4)
- test/test_compare256_rle.cc
- test/benchmarks/benchmark_compare256.cc
- test/test_compare256.cc
- arch/generic/compare256_c.c
🧰 Additional context used
📓 Learnings (1)
arch/generic/generic_functions.h (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
🔇 Additional comments (7)
deflate_rle.c (1)
13-21: LGTM: Well-structured function selection hierarchy
The changes implement a clear and logical selection of comparison functions:
- 64-bit variant when CTZLL is available
- 32-bit variant when CTZ is available
- 16-bit variant as a fallback for little-endian
- Basic C implementation for big-endian
This aligns with the PR objectives by removing UNALIGNED_OK checks while maintaining optimal function selection based on available CPU features.
compare256_rle.h (2)
Line range hint 11-45: LGTM: Well-implemented basic comparison function
The C implementation is well-structured with:
- Proper bounds checking (len < 256)
- Efficient unrolled loop for byte comparisons
- Clear fallback for big-endian systems
Line range hint 47-75: LGTM: Safe 16-bit comparison implementation
The 16-bit comparison function:
- Uses safe memory access through memcpy
- Properly handles partial matches
- Maintains correct alignment requirements
arch/generic/generic_functions.h (2)
34-51: LGTM! Well-structured function declarations based on CPU capabilities.
The hierarchical organization (CTZLL > CTZ > fallback) provides optimal function selection for each architecture, with a clean separation between LITTLE_ENDIAN and BIG_ENDIAN paths.
56-69: Verify implementation and usage of generic functions.
The macro definitions look good, maintaining a clean mapping to the declared functions. Let's verify:
- Implementation of all referenced functions
- Usage of these macros in dependent code
test/benchmarks/benchmark_compare256_rle.cc (2)
67-74: LGTM! Clean conditional compilation structure
The changes improve the code by:
- Replacing UNALIGNED_OK checks with direct endianness checks
- Using consistent defined() checks for builtin functions
- Simplifying function names
Let's verify the consistency of these changes across the codebase:
✅ Verification successful
LGTM! Changes are consistent with the codebase
The verification confirms:
- No remaining UNALIGNED_OK checks found, indicating complete cleanup
- The compare256_rle functions (16/32/64) are used consistently across implementation, tests, and benchmarks
- BYTE_ORDER == LITTLE_ENDIAN checks are used uniformly throughout the codebase
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any remaining UNALIGNED_OK checks that might need cleanup
rg "UNALIGNED_OK"
# Verify the new function names are used consistently
rg "compare256_rle_(16|32|64)"
# Check for consistent endianness checks
rg "BYTE_ORDER.*LITTLE_ENDIAN"
Length of output: 3072
13-15: Consider documenting the ZTEST_ALL workaround more thoroughly
While workarounds in test code are sometimes necessary, it would be helpful to document:
- Why this workaround is needed
- What functions it provides access to
- Whether this could be solved through a better design
Let's check if this workaround is consistently used across other test files:
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #1830 +/- ##
===========================================
+ Coverage 82.20% 82.44% +0.24%
===========================================
Files 137 138 +1
Lines 11001 10994 -7
Branches 2695 2778 +83
===========================================
+ Hits 9043 9064 +21
+ Misses 1261 1235 -26
+ Partials 697 695 -2
☔ View full report in Codecov by Sentry.
Force-pushed from 9a7d70b to a1076de
Rebased, but there are still a few small things to fix; I have to get some Christmas shopping done first though.
ccawley2011 left a comment
How would these changes affect performance on platforms without unaligned memory access?
arch/generic/compare256_c.c (Outdated)
-/* 16-bit unaligned integer comparison */
-static inline uint32_t compare256_unaligned_16_static(const uint8_t *src0, const uint8_t *src1) {
+/* 16-bit integer comparison */
+static inline uint32_t compare256_16_static(const uint8_t *src0, const uint8_t *src1) {
It might be better to keep the byte version rather than the 16-bit version: the byte version would potentially have better performance than the 16-bit version on platforms without unaligned reads, and platforms without dedicated clz/ctz instructions would likely end up using the 32-bit or 64-bit versions anyway, because GCC typically provides generic fallbacks for the __builtin_clz/__builtin_ctz builtins on all architectures.
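A byte-by-byte variant of the kind discussed here is short; this sketch (hypothetical name, not the zlib-ng code) shows why it needs neither unaligned loads nor ctz/clz:

```c
#include <stdint.h>

/* Byte-wise match-length scan, bounded at 256: every access is a single
 * byte load, so strict-alignment targets can always execute it, and no
 * trailing-zero-count instruction is needed to locate the mismatch. */
static uint32_t compare256_byte_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    while (len < 256 && src0[len] == src1[len])
        len++;
    return len;
}
```

On targets where ctz/clz is itself emulated, the word-at-a-time variants lose part of their advantage, which is the trade-off being weighed in this thread.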
I think I would have to agree that keeping the C versions is probably a good idea for those platforms.
Added the 8-bit version back, but not currently in use by any platform.
The code you have now is a good compromise.
That is something I want some help testing. In my godbolt tests, it looks like the emulated
Force-pushed from a1076de to 1ec6c36
Oh, if we find any platforms that actually need the byte-by-byte functions, they should be specifically added with
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/generic/compare256_p.h (1)
11-35: Check for potential loop unrolling or consolidation.
The do-while loop repeats the same compare-block logic four times before checking its boundary condition. This approach might be manually unrolling the loop for performance, but it can reduce readability. Consider using a for loop and unroll it via compiler settings or macros, if needed. That can keep the logic simpler to maintain.
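The suggested restructuring could look like this sketch (hypothetical name; same four-compares-per-round shape, with the unrolling left to the compiler):

```c
#include <stdint.h>

/* Same logic as four copy-pasted compare blocks inside a do-while,
 * expressed as a short counted inner loop; the compiler is then free
 * to unroll it itself (e.g. with -funroll-loops or its own heuristics). */
static uint32_t compare256_loop_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    do {
        for (int i = 0; i < 4; i++) {   /* was four repeated blocks */
            if (src0[len] != src1[len])
                return len;
            len++;
        }
    } while (len < 256);
    return 256;
}
```

Manual unrolling trades readability for a guaranteed instruction schedule; the counted form keeps the boundary condition in one place, which is the maintainability point the nitpick makes.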
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (3 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (4)
- test/test_compare256_rle.cc
- deflate_rle.c
- arch/generic/generic_functions.h
- compare256_rle.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (14)
arch/generic/compare256_p.h (4)
1-4: License header looks fine and consistent with project standards.
No further action needed.
6-10: Includes are appropriate and minimal.
These header inclusions provide the necessary dependencies without apparent redundancies.
64-90: 64-bit mismatch detection logic is consistent with 32-bit version.
The approach extends naturally to 64-bit words. The cast of match_byte to uint32_t is correct, ensuring the correct return type. Good job retaining a consistent pattern.
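For readers following along, the XOR-plus-ctz mismatch technique the comment refers to looks roughly like this (illustrative sketch with a hypothetical name; assumes a little-endian target and GCC/Clang's __builtin_ctzll):

```c
#include <stdint.h>
#include <string.h>

/* Index of the first differing byte in an 8-byte window, or 8 if equal.
 * XOR leaves nonzero bits only where the inputs differ; on little-endian,
 * the lowest set bit belongs to the earliest differing byte, so
 * __builtin_ctzll(diff) / 8 is that byte's offset. The division result is
 * at most 7, which is why the narrowing cast to uint32_t is safe. */
static uint32_t first_mismatch_64(const uint8_t *src0, const uint8_t *src1) {
    uint64_t a, b, diff;
    memcpy(&a, src0, sizeof(a));
    memcpy(&b, src1, sizeof(b));
    diff = a ^ b;
    if (diff == 0)
        return 8;
    return (uint32_t)(__builtin_ctzll(diff) / 8);  /* little-endian only */
}
```

On big-endian targets the earliest differing byte corresponds to the highest set bit instead, which is why the real code switches between ctz and clz on BYTE_ORDER.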
36-62: Verify usage of built-in functions for mismatch detection.
The code uses __builtin_ctz for determining the first differing byte. This is a good optimization, but confirm that all target compilers and architectures support it correctly (especially cross-compiling scenarios).
✅ Verification successful
Built-in function usage is properly guarded and verified
The codebase has robust checks for __builtin_ctz support:
- Build-time detection in both CMake and configure scripts that test compiler support
- Proper macro guards (HAVE_BUILTIN_CTZ) around all usage
- Fallback implementations in fallback_builtins.h for platforms without native support
- Architecture-specific optimizations (SSE2, AVX2, NEON) with appropriate feature detection
- Correct handling of endianness with BYTE_ORDER checks
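As background on what such a fallback can look like, here is a classic De Bruijn-multiplication trailing-zero count (an illustrative sketch, not the actual contents of fallback_builtins.h):

```c
#include <stdint.h>

/* Count trailing zeros of a nonzero 32-bit value without a ctz instruction.
 * (x & -x) isolates the lowest set bit; multiplying by the De Bruijn
 * constant 0x077CB531 moves a unique 5-bit pattern into the top bits,
 * which indexes a lookup table of bit positions. Undefined for x == 0,
 * matching __builtin_ctz's contract. */
static int ctz32_fallback(uint32_t x) {
    static const int table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return table[((x & (0u - x)) * 0x077CB531u) >> 27];
}
```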
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Searching for usage of __builtin_ctz in the codebase to verify compatibility across files.
ast-grep --pattern $'__builtin_ctz($_)'
Length of output: 40
Script:
#!/bin/bash
# Let's check for HAVE_BUILTIN_CTZ macro definition and its usage
rg "HAVE_BUILTIN_CTZ" -B 2 -A 2
# Also search for any configure/build system checks for this builtin
fd -e configure -e m4 -e cmake -X rg "builtin_ctz|__builtin_ctz"
# Check if there are any other similar built-in function checks
rg "__builtin_(clz|ctz)" -B 2 -A 2
Length of output: 18163
arch/generic/compare256_c.c (3)
10-10: Include for compare256_p.h ensures the inline functions are accessible.
This is an appropriate place to include the new header.
13-18: Dynamic selection of comparison variant is clear.
The logic uses #if checks around HAVE_BUILTIN_CTZLL, HAVE_BUILTIN_CTZ, and a fallback to the 16-bit variant. This is concise and helps maintain a smaller code footprint.
22-44: Macro definitions for generating longest_match variants.
The approach is consistent and maintains similar selection logic for different built-ins. No immediate issues here. Just ensure match_tpl.h references the correct macros.
test/benchmarks/benchmark_compare256_rle.cc (1)
62-67: Benchmark registration is concise; watch for unknown macro warnings.
Cppcheck warns about potential unknown macros (e.g., BENCHMARK_DEFINE_F). If that macro is from a recognized library (Google Benchmark), this is typically harmless, but consider configuring the static analyzer to avoid false positives.
✅ Verification successful
Let me verify the macro usage by checking for the benchmark header inclusion and the actual macro being used in the file under review.
Macro warnings can be safely ignored - proper benchmark framework integration confirmed
The code properly includes the Google Benchmark framework (<benchmark/benchmark.h>), and the BENCHMARK_DEFINE_F macro is correctly used within a custom macro BENCHMARK_COMPARE256_RLE. The implementation follows the standard Google Benchmark fixture pattern with proper registration. Any Cppcheck warnings about unknown macros in this context can be safely ignored.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Attempt to locate definition of BENCHMARK_DEFINE_F in the codebase
rg -A 5 'BENCHMARK_DEFINE_F'
Length of output: 4987
Script:
#!/bin/bash
# Check the includes and macro definition in the file
cat test/benchmarks/benchmark_compare256_rle.cc
Length of output: 1925
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/test_compare256.cc (2)
15-15: Including "arch/generic/compare256_p.h" directly.
This can sometimes raise concerns about mixing test code with private/internal headers. However, the comment in past reviews clarifies this is intentional for direct testing.
67-72: Tests for 16, 32, and 64-bit variants.
All three coverage targets look correct. This ensures thorough testing of each specialized function.
test/benchmarks/benchmark_compare256.cc (2)
15-15: LGTM: Header inclusion aligns with architectural changes
The addition of compare256_p.h is consistent with the move towards aligned comparisons and the removal of unaligned variants.
68-74: Verify architecture-specific benchmark registration
The replacement of unaligned benchmarks with aligned versions is good, but we should verify the architecture-specific conditions:
- Base compare256_16 is unconditional
- compare256_32 requires HAVE_BUILTIN_CTZ
- compare256_64 requires HAVE_BUILTIN_CTZLL
✅ Verification successful
Architecture-specific benchmark registration is correctly implemented
The verification confirms that:
- compare256_16 is correctly unconditional in both implementation (arch/generic/compare256_p.h) and benchmarks
- compare256_32 is properly guarded by HAVE_BUILTIN_CTZ in both implementation and benchmarks
- compare256_64 is properly guarded by HAVE_BUILTIN_CTZLL in both implementation and benchmarks
The conditional compilation directives are consistently applied across the codebase, including tests, benchmarks, and architecture-specific implementations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the compare256 functions are properly defined in the header
# Check if the compare256 functions are properly defined
ast-grep --pattern 'uint32_t compare256_$_(const uint8_t* src0, const uint8_t* src1)' arch/generic/compare256_p.h
Length of output: 115
Script:
#!/bin/bash
# Let's try a different approach to verify the architecture-specific conditions
# 1. Search for compare256 function declarations
rg -n "compare256_(16|32|64)" --type cpp -A 2
# 2. Check the conditional compilation directives around these functions
rg -n "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL" --type cpp -B 2 -A 2
Length of output: 8878
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
functable.c (1)
64-66: Verify function implementations exist
The replacement of generic implementations with architecture-specific ones looks good, but we should verify the existence of these functions.
CMakeLists.txt (1)
1077-1079: LGTM: Header additions support architectural changes
The addition of these headers to ZLIB_PRIVATE_HDRS is appropriate:
- chunk_permute_table.h likely contains permutation tables for chunk operations
- compare256_p.h contains the new aligned comparison functions
- generic_functions.h provides architecture-independent implementations
Force-pushed from 1ec6c36 to 376c1ff
I'm somewhat speculating, but I wonder if it'll still be faster for the compiler to generate a piecewise load by finding partial alignment and shifting and or'ing it in. It'd be a long instruction sequence to be generic enough for any alignment, but it still might be better. Then again, on ppc, at least ppc64, a good portion of that time was external to that comparison loop. I guess SPARC will be the ultimate test.
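For reference, the fully generic form of such a piecewise load is just byte loads combined with shifts and ors (illustrative sketch; a compiler targeting a strict-alignment CPU may instead load the two aligned words straddling the address and merge them with shifts, amortizing the cost across a loop):

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value from four byte loads. Any target
 * can execute this regardless of the alignment of p, at the cost of four
 * loads plus shifts per word read. */
static uint32_t load32_le_piecewise(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```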
Actionable comments posted: 1
🧹 Nitpick comments (3)
compare256_rle.h (1)
Line range hint 12-37: Consider optimizing repeated memcpy operations
The 16-bit comparison function performs multiple memcpy operations in a tight loop. While safe, this could impact performance.
Consider caching the src1_cmp value and updating it only when needed:
 static inline uint32_t compare256_rle_16(const uint8_t *src0, const uint8_t *src1) {
     uint32_t len = 0;
     uint16_t src0_cmp, src1_cmp;

     memcpy(&src0_cmp, src0, sizeof(src0_cmp));
+    const uint8_t *next_src1 = src1;
     do {
-        memcpy(&src1_cmp, src1, sizeof(src1_cmp));
+        memcpy(&src1_cmp, next_src1, sizeof(src1_cmp));
         if (src0_cmp != src1_cmp)
-            return len + (*src0 == *src1);
-        src1 += 2, len += 2;
+            return len + (*src0 == *next_src1);
+        next_src1 += 2, len += 2;
47-55: Optimize branch prediction for diff check
The diff check is in a hot loop. Consider helping the compiler with branch prediction.
     diff = sv ^ mv;
-    if (diff) {
+    if (UNLIKELY(diff)) {
 #  if BYTE_ORDER == LITTLE_ENDIAN
         uint32_t match_byte = __builtin_ctz(diff) / 8;
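For context, UNLIKELY-style hints are commonly built on __builtin_expect; a typical definition looks like this (generic sketch of the common shape, not zlib-ng's exact macros):

```c
#include <stdint.h>

/* Branch-hint macros: tell the compiler which way a condition usually
 * goes so it can lay out the hot path fall-through. Hints only affect
 * code placement, never correctness. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

static int count_mismatches(const uint8_t *a, const uint8_t *b, int n) {
    int hits = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(a[i] != b[i]))  /* mismatches expected to be rare */
            hits++;
    }
    return hits;
}
```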
1-90: Document performance characteristics
The file would benefit from documentation about the performance characteristics of each comparison method.
Add a comment block explaining when to use each comparison width and their relative performance characteristics on different architectures.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (3 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (3)
- test/test_compare256_rle.cc
- deflate_rle.c
- functable.c
🧰 Additional context used
📓 Learnings (1)
arch/generic/generic_functions.h (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (10)
arch/generic/compare256_c.c (2)
13-19: LGTM! Well-structured function selection based on CPU capabilities.
The hierarchical fallback from 64-bit to 32-bit to 16-bit operations is well-designed, ensuring optimal performance across different architectures.
22-40: Verify performance impact of function selection.
Given the PR objectives mentioning potential performance concerns on platforms without unaligned access, we should verify the performance impact of these changes.
Run the following script to identify platforms that might be affected:
test/benchmarks/benchmark_compare256_rle.cc (1)
62-67: LGTM! Comprehensive benchmark coverage for all variants.
The benchmarks are properly structured to test all comparison variants based on available CPU capabilities:
- Base 16-bit version is always available
- 32-bit version when CTZ is available
- 64-bit version when CTZLL is available
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
arch/generic/generic_functions.h (1)
36-36: LGTM! Function declaration aligns with implementation.
The addition of compare256_c declaration is consistent with the PR's objective of cleaning up and consolidating comparison functions.
test/test_compare256.cc (2)
15-15: LGTM: Header inclusion aligns with new architecture
The inclusion of compare256_p.h is appropriate as it provides the new comparison function implementations.
67-72: Verify test coverage for all comparison functions
The conditional compilation ensures tests are only built when the required CPU features are available. However, we should verify that all code paths are tested.
✅ Verification successful
Let me verify if the RLE (Run Length Encoding) comparison functions are also tested by searching for their test coverage.
All comparison functions are properly tested
The verification shows comprehensive test coverage:
- Generic comparison functions (compare256_16/32/64) are tested in test/test_compare256.cc
- RLE comparison functions (compare256_rle_16/32/64) are tested in test/test_compare256_rle.cc
- Platform-specific implementations (SSE2, AVX2, NEON, POWER9, RVV) are tested with appropriate CPU feature checks
- All tests use proper conditional compilation guards matching their requirements
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check test coverage for comparison functions
# Look for untested comparison functions
# Find all compare256 function declarations
rg -p "static\s+inline\s+uint32_t\s+compare256_\w+" --type cpp
# Find corresponding test cases
rg -p "TEST_COMPARE256\(" --type cpp
Length of output: 1296
Script:
#!/bin/bash
# Search for RLE comparison function tests
rg -p "TEST.*compare256_rle" --type cpp
# Get more context around TEST_COMPARE256 macro to understand test coverage
rg -p "TEST_COMPARE256" -B 5 -A 10 test/test_compare256.cc
Length of output: 1608
compare256_rle.h (1)
Line range hint 69-94: Verify 64-bit comparison edge cases
The 64-bit comparison function handles endianness correctly but should be tested with various alignment scenarios.
test/benchmarks/benchmark_compare256.cc (2)
15-15: LGTM: Header inclusion aligns with compare256 reorganization
The addition of compare256_p.h from the generic architecture directory is appropriate for the benchmark reorganization.
68-73: LGTM: Well-structured benchmark registration with proper guards
The new benchmark registrations are properly guarded with appropriate compiler feature checks:
- compare256_16 is always available
- compare256_32 requires HAVE_BUILTIN_CTZ
- compare256_64 requires HAVE_BUILTIN_CTZLL
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
CMakeLists.txt (1)
1077-1079: LGTM: Properly added new generic architecture headers
The new headers are correctly added to the private headers list:
- chunk_permute_table.h
- compare256_p.h
- generic_functions.h
In the present develop branch, we compile generic functions that we never use: 8 variants of longest_match and 5 of compare256, not counting the SSE and similar optimized versions.
Force-pushed from 376c1ff to 4bdc524
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/generic/compare256_p.h (2)
38-61: Consider optimizing memory access pattern
The current implementation uses separate memcpy operations for sv and mv. While this ensures safe unaligned access, it might be worth profiling the performance impact of:
- Using direct dereferencing when the input is known to be aligned
- Combining the memory operations
Example optimization when alignment is guaranteed:
- memcpy(&sv, src0, sizeof(sv));
- memcpy(&mv, src1, sizeof(mv));
+ sv = *(const uint32_t *)src0;
+ mv = *(const uint32_t *)src1;
66-89: Verify 64-bit to 32-bit truncation safety
The conversion from uint64_t match_byte to uint32_t at line 82 is safe because:
- The maximum value of match_byte is 7 (as it's divided by 8)
- The maximum value of len is 256
However, it would be good to add a static assertion to document this assumption.
+/* Static assertion to verify that match_byte will never exceed uint32_t range */
+static_assert(256 < UINT32_MAX, "len cannot overflow uint32_t");
 static inline uint32_t compare256_64(const uint8_t *src0, const uint8_t *src1) {
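As a side note on the suggestion above: in C, static_assert is the C11 spelling, a macro from <assert.h> wrapping _Static_assert. A minimal self-contained sketch of such a compile-time guard (hypothetical helper name):

```c
#include <assert.h>   /* C11: provides the static_assert macro */
#include <stdint.h>

/* Compile-time guard of the kind suggested in the review; it is checked
 * by the compiler and produces no runtime code at all. */
static_assert(256 <= UINT32_MAX, "match length fits in uint32_t");

/* Runtime companion: clamp a 64-bit length into the asserted range. */
static uint32_t clamp_len(uint64_t len) {
    return (uint32_t)(len > 256 ? 256 : len);
}
```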
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (3 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (7)
- test/test_compare256_rle.cc
- functable.c
- deflate_rle.c
- test/test_compare256.cc
- arch/generic/compare256_c.c
- compare256_rle.h
- arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (4)
arch/generic/compare256_p.h (1)
12-34: Consider bounds checking for memory access
While the function implementation is efficient, it lacks parameter validation.
The unrolled loop implementation looks good
The manual loop unrolling with 4 iterations per cycle is a good optimization technique. The implementation correctly handles both the 2-byte comparisons and single-byte fallback on mismatch.
test/benchmarks/benchmark_compare256.cc (1)
68-73: LGTM! Benchmark coverage aligns with PR objectives
The changes correctly:
- Remove unaligned comparison benchmarks
- Add benchmarks for new aligned variants (16/32/64-bit)
- Use appropriate conditional compilation based on CPU features
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256_rle.cc (1)
62-67: LGTM! Consistent with non-RLE benchmark changes
The benchmark registration follows the same pattern as benchmark_compare256.cc, maintaining consistency across the codebase.
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
CMakeLists.txt (1)
1077-1079: New generic architecture headers align with cleanup objectives.
The addition of these header files in the generic architecture directory supports the PR's goal of cleaning up alignment-related code:
- chunk_permute_table.h: Likely contains permutation tables for chunk operations
- compare256_p.h: New implementation for 256-byte comparisons
- generic_functions.h: Common function declarations for generic architecture
Let's verify the integration of these headers:
✅ Verification successful
New generic headers properly integrated and support architecture-specific optimizations
The verification confirms that these new header files are well-integrated into the codebase and serve specific purposes:
- generic_functions.h: Provides baseline C implementations of architecture-specific functions
- compare256_p.h: Implements optimized 256-byte memory comparison with different integer widths (16/32/64-bit)
- chunk_permute_table.h: Contains shared permutation tables used by both SSSE3 and NEON implementations for chunkmemset operations
The headers are properly referenced and utilized:
- Used by both x86 (SSSE3) and ARM (NEON) architectures
- Integrated through the main arch_functions.h dispatcher
- Support the PR's cleanup objectives by providing architecture-agnostic implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the new headers are properly referenced in the codebase
echo "Checking usage of new headers..."
rg -l "chunk_permute_table\.h|compare256_p\.h|generic_functions\.h" --type c
# Verify if these headers replace any old alignment-related code
echo "Checking for old alignment-related code..."
rg -l "UNALIGNED_OK" --type c
Length of output: 493
Script:
#!/bin/bash
# Let's check the content and relationships of these headers
# Check content of generic_functions.h
echo "=== Content of generic_functions.h ==="
cat arch/generic/generic_functions.h
# Check content of compare256_p.h
echo -e "\n=== Content of compare256_p.h ==="
cat arch/generic/compare256_p.h
# Check content of chunk_permute_table.h
echo -e "\n=== Content of chunk_permute_table.h ==="
cat arch/generic/chunk_permute_table.h
# Check how these are included in architecture-specific implementations
echo -e "\n=== Usage in arch_functions.h ==="
cat arch_functions.h
# Check for any related changes in the implementation files
echo -e "\n=== Usage in SSSE3 implementation ==="
cat arch/x86/chunkset_ssse3.c
echo -e "\n=== Usage in NEON implementation ==="
cat arch/arm/chunkset_neon.c
Length of output: 14820
Force-pushed from b1437c4 to 71a9e5b
Rebased, reordered and redid the commits. Also added back the 8-bit compare256 variants. Those are now available in the benchmarks, and can be accessed by setting the define
Workflow https://github.com/zlib-ng/zlib-ng/actions/runs/12470358964/job/34805368017?pr=1830
Yes, it also fails before this PR; we haven't had that test working for months due to GitHub not wanting to support s390x themselves, so we have to depend on a 3rd-party runner port. Not sure what broke there yet.
Confirmed. Works now.
Force-pushed from 0f23df0 to cfa5e9d
Actionable comments posted: 0
🧹 Nitpick comments (3)
arch/generic/compare256_c.c (1)
24-31: Consider adding documentation for template parameters.
While the template usage is correct, adding comments to explain the parameters and differences between normal and slow variants would improve maintainability.
test/benchmarks/benchmark_compare256_rle.cc (1)
62-68: LGTM! Consider improving error messages.
The benchmark registration for different bit-width variants is well-structured. However, the error message could be more specific about why the CPU doesn't support a particular variant.
Update the error message to be more descriptive:
- state.SkipWithError("CPU does not support " #name);
+ state.SkipWithError("CPU does not support " #name "-bit comparisons (missing required CPU feature)");
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
zbuild.h (1)
271-271: Consider documenting the OPTIMAL_CMP values.
Adding a comment explaining the performance implications of different OPTIMAL_CMP values would help future maintainers.
+/* OPTIMAL_CMP values determine the comparison width:
+ * 64: Best for 64-bit architectures with unaligned access
+ * 32: Best for 32-bit architectures with unaligned access
+ * 16: Safe default for unknown architectures
+ * 8: Safe fallback for architectures without unaligned access
+ */
 #define OPTIMAL_CMP 16
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- CMakeLists.txt (1 hunks)
- arch/generic/Makefile.in (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (4 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (0 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (1)
- match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (8)
- arch/generic/Makefile.in
- CMakeLists.txt
- test/test_compare256_rle.cc
- functable.c
- test/test_compare256.cc
- arch/generic/generic_functions.h
- deflate_rle.c
- compare256_rle.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (8)
arch/generic/compare256_c.c (2)
20-22: Clean implementation using macro-based dispatch.
The simplified implementation improves maintainability while allowing platform-specific optimizations through the COMPARE256 macro.
7-18: LGTM! Verify platform-specific performance impact.
The macro-based selection of comparison functions with priority for byte version when OPTIMAL_CMP=8 aligns well with the previous discussions about performance on platforms without unaligned reads.
Run this script to analyze platform-specific configurations:
✅ Verification successful
Verified: Platform-specific optimizations are well-handled
The macro-based selection in compare256_c.c is well-designed with appropriate fallbacks:
- For platforms with unaligned access and SIMD:
- x86 has optimized AVX2/SSE2 implementations
- ARM has NEON implementation with CTZLL support
- For platforms without SIMD but with unaligned access:
- 64-bit comparison with CTZLL
- 32-bit comparison with CTZ
- 16-bit comparison fallback
- For platforms with alignment restrictions:
- 8-bit comparison (OPTIMAL_CMP=8) ensures safe byte-by-byte access
The implementation correctly prioritizes the byte version (OPTIMAL_CMP=8) for platforms with alignment restrictions while allowing faster implementations on platforms with better capabilities.
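The capability-based selection summarized above follows the usual function-pointer dispatch pattern; a minimal sketch with hypothetical names (zlib-ng's real functable probes CPU features at init and has many more entries):

```c
#include <stdint.h>

typedef uint32_t (*compare256_fn)(const uint8_t *, const uint8_t *);

/* Portable fallback implementation: byte-wise, alignment-safe. */
static uint32_t compare256_plain(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    while (len < 256 && src0[len] == src1[len])
        len++;
    return len;
}

/* Run-once selector: real code would probe CPUID/HWCAP and return the best
 * SIMD variant available; this sketch only has the portable fallback. */
static compare256_fn select_compare256(void) {
    return compare256_plain;
}
```

The point of routing calls through a selected pointer is that the alignment/SIMD decision is made once at startup instead of on every call.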
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check which platforms might benefit from OPTIMAL_CMP=8
# Search for platform-specific configurations and CPU feature checks
rg -l "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|UNALIGNED_OK" arch/
Length of output: 251
Script:
#!/bin/bash
# Check the content of platform-specific files to understand the configuration logic
# Check generic implementation header
echo "=== Generic implementation header (compare256_p.h) ==="
cat arch/generic/compare256_p.h
# Check ARM-specific configurations
echo -e "\n=== ARM configurations ==="
rg "UNALIGNED_OK|HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|OPTIMAL_CMP" arch/arm/arm_functions.h
# Check x86-specific configurations
echo -e "\n=== x86 configurations ==="
rg "UNALIGNED_OK|HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|OPTIMAL_CMP" arch/x86/x86_functions.h
Length of output: 3790
Script:
#!/bin/bash
# Check for unaligned memory access handling and platform-specific optimizations
# Check memory access functions
echo "=== Memory access functions ==="
rg "zng_memread|zng_memcmp" arch/generic/zmemory.h || rg "zng_memread|zng_memcmp" zmemory.h
# Check for NEON implementation details
echo -e "\n=== ARM NEON implementation ==="
head -n 30 arch/arm/compare256_neon.c
# Check for SSE2/AVX2 implementation details
echo -e "\n=== x86 SSE2/AVX2 implementations ==="
head -n 30 arch/x86/compare256_sse2.c
head -n 30 arch/x86/compare256_avx2.c
Length of output: 4056
test/benchmarks/benchmark_compare256.cc (1)
68-74: Maintain consistency with RLE benchmark error messages.
The benchmark structure is good, but error messages should be consistent with any improvements made to the RLE benchmarks.
Update the error message to be more descriptive, maintaining consistency with RLE benchmarks:
- state.SkipWithError("CPU does not support " #name);
+ state.SkipWithError("CPU does not support " #name "-bit comparisons (missing required CPU feature)");🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
arch/generic/compare256_p.h (4)
11-42: LGTM! Efficient byte-by-byte comparison implementation.
The unrolled loop design improves performance by reducing loop overhead and branch mispredictions.
45-67: LGTM! Well-implemented 16-bit comparison with proper partial match handling.
The implementation correctly handles both full 16-bit matches and partial byte matches.
97-123: LGTM! 64-bit comparison implementation with proper type handling.
The implementation correctly handles 64-bit comparisons with appropriate type casting for the return value.
69-95: Verify compiler support for CTZ/CLZ built-ins.
The implementation efficiently uses built-in functions for trailing/leading zero counts. Let's verify compiler support across different platforms.
✅ Verification successful
Compiler support for CTZ/CLZ built-ins is properly handled
The codebase has comprehensive support for compiler built-ins with proper fallbacks:
- CMake and configure scripts check for __builtin_ctz and __builtin_ctzll support
- Fallback implementations are provided in fallback_builtins.h for platforms without native support
- The code conditionally enables optimized implementations based on architecture and built-in support:
  - Generic 32-bit (compare256_32) when HAVE_BUILTIN_CTZ is available
  - Generic 64-bit (compare256_64) when HAVE_BUILTIN_CTZLL is available
  - SSE2/AVX2 optimizations for x86 with HAVE_BUILTIN_CTZ
  - NEON optimizations for ARM with HAVE_BUILTIN_CTZLL
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check compiler support for CTZ/CLZ built-ins
# Test: Search for compiler checks and definitions
rg -A 5 "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CLZ"
Length of output: 11476
zbuild.h (1)
246-271: Well-structured architecture-specific optimizations.
The OPTIMAL_CMP definitions are properly organized with:
- 64-bit optimization for x86_64 and ARM64 with unaligned access
- 32-bit optimization for i386 and ARM with unaligned access
- Safe fallback to 8-bit for architectures without unaligned access
- Default to 16-bit when architecture is unknown
Force-pushed from cfa5e9d to 81274e4
Rebased and added a couple of comments improving documentation.
Force-pushed from 81274e4 to ef3aea5
Actionable comments posted: 0
🧹 Nitpick comments (4)
zmemory.h (1)
84-84: Maintain consistency in zng_memcmp_4. Similar to zng_memcmp_2, confirm callers expect 0/1 rather than a typical three-way comparison result from standard memcmp.
test/benchmarks/benchmark_compare256_rle.cc (1)
67-68: Benchmark for 64-byte comparisons. Same as above: the code is conditionally compiled by HAVE_BUILTIN_CTZLL. Ensure environment checks are consistent, and confirm that relevant test coverage is in place for 64-byte comparisons.
test/benchmarks/benchmark_compare256.cc (1)
76-76: Optional: Confirm clarity of separation. Having multiple #endif blocks at close can be less readable if conditions expand in future. Consider grouping or adding clarifying comments if this file grows.
zbuild.h (1)
273-276: Minor: Align indentation in powerpc sections. The powerpc sections have inconsistent indentation compared to other architecture-specific sections.
-# define OPTIMAL_CMP 64
+# define OPTIMAL_CMP 64
 # elif defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
-# define OPTIMAL_CMP 32
+# define OPTIMAL_CMP 32
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- CMakeLists.txt (1 hunks)
- arch/generic/Makefile.in (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (4 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (1 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
- zmemory.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- CMakeLists.txt
- test/test_compare256_rle.cc
- arch/generic/Makefile.in
- test/test_compare256.cc
- compare256_rle.h
- match_tpl.h
- arch/generic/compare256_p.h
- functable.c
- arch/generic/compare256_c.c
- arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (11)
zmemory.h (2)
76-76: Clarify the boolean return logic in zng_memcmp_2.
This function returns 1 if there is a difference, 0 otherwise, which is an unconventional usage of “memcmp” semantics. Ensure all call sites correctly interpret this boolean-like return value.
92-92: Confirm usage alignment in zng_memcmp_8.
Returning 1 or 0 instead of a signed integer might be intentional, but verify that it aligns with downstream usage and avoids confusion.
test/benchmarks/benchmark_compare256_rle.cc (2)
62-63: Confirm coverage and naming consistency.
Adding benchmarks for compare256_rle_8 and compare256_rle_16 looks fine. Ensure that all relevant compare sizes are tested if needed, and confirm the naming scheme remains consistent across the codebase.
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
64-66: Conditionally compiled benchmark for 32-byte comparisons.
Having this block guarded by HAVE_BUILTIN_CTZ suggests 32-byte RLE comparisons might be unavailable on some platforms. Verify that fallback or alternative benchmarks exist if HAVE_BUILTIN_CTZ is not defined.
test/benchmarks/benchmark_compare256.cc (2)
15-15: Include path verification.
#include "arch/generic/compare256_p.h" is newly added. Confirm that the path aligns with the repository’s structure and that no duplication occurs with other headers.
68-74: Expansion of aligned comparison benchmarks.
Benchmarks for 8, 16, 32, and 64 bytes are conditionally compiled. This is consistent with the removal of unaligned comparisons. Just ensure runtimes on platforms where some macros (e.g. HAVE_BUILTIN_CTZ, HAVE_BUILTIN_CTZLL) aren’t defined still receive the intended fallback coverage.
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
deflate_rle.c (2)
11-11: New include for compare256_rle.h.
Ensure that only what is needed from this header is included and that the dependency is minimal. If usage expands, a forward declaration might be insufficient. Current approach looks fine.
13-20: Macro-based selection approach for compare256_rle.
Selecting a compare function based on OPTIMAL_CMP or builtins is a neat optimization. However, confirm that this logic does not break on unusual compiler or platform configurations that might define multiple or no macros. Additional fallback logic or explicit checks can help maintain portability.
If you need a script to verify coverage for each macro condition, let me know.
zbuild.h (3)
246-254: LGTM! Well-documented optimization strategy.
The comment block provides clear and comprehensive documentation about the OPTIMAL_CMP values and their intended use cases, with an important clarification about unaligned access safety.
255-278: LGTM! Comprehensive architecture-specific optimizations.
The implementation provides well-structured architecture detection and appropriate OPTIMAL_CMP values based on platform capabilities and word sizes.
279-281: Verify performance impact of default OPTIMAL_CMP change.
The default OPTIMAL_CMP value has been changed from 8 to 16. While this aligns with the documentation, we should verify that this change doesn't negatively impact performance on platforms without unaligned access support.
✅ Verification successful
Let me gather more information about the performance implications and platform-specific implementations.
Let me check one more thing to understand the platform-specific implementations and their relationship with OPTIMAL_CMP.
Default OPTIMAL_CMP value of 16 is safe and well-considered
The change from 8 to 16 as the default OPTIMAL_CMP value is actually safe and well-implemented. Here's why:
- The codebase has a sophisticated selection of OPTIMAL_CMP values based on platform capabilities:
- 64-bit comparison for platforms with CTZLL builtin
- 32-bit comparison for platforms with CTZ builtin
- 16-bit comparison as the new default fallback
- 8-bit comparison explicitly set for platforms that need it
The default value of 16 is only used when no platform-specific value is defined, and the implementation in compare256_p.h shows that the 16-bit comparison is implemented safely using zng_memcmp_2 which handles unaligned access properly. This is part of a larger cleanup effort (PR #1830) that improves the handling of unaligned access across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any existing performance tests or benchmarks related to compare functions
rg -l "compare.*benchmark|benchmark.*compare" --type c --type cpp
# Look for any platform-specific implementations that might be affected
rg -l "OPTIMAL_CMP.*8|compare.*8" --type c --type h
# Search for any existing issues or discussions about OPTIMAL_CMP
gh issue list --search "OPTIMAL_CMP"
gh pr list --search "OPTIMAL_CMP"
Length of output: 853
Script:
#!/bin/bash
# Check the content of benchmark files to understand performance implications
rg -A 5 "OPTIMAL_CMP" test/benchmarks/benchmark_compare256*.cc
# Look at platform-specific implementations to understand unaligned access handling
cat arch/generic/generic_functions.h
cat arch/generic/compare256_p.h
# Check the PR that's related to UNALIGNED_OK checks
gh pr view 1830 --json body
Length of output: 7550
Script:
#!/bin/bash
# Check the generic compare256 implementation and its usage of OPTIMAL_CMP
cat arch/generic/compare256_c.c
# Look for any platform-specific OPTIMAL_CMP definitions
rg "define.*OPTIMAL_CMP" --type c --type h
Length of output: 1379
After talking it over with @KungFuJesus, we agree that checking for OPTIMAL_CMP is probably not necessary or useful in zmemory.h. If it turns out that it is, it can be added back in a fixed form. I was considering changing them to all be => 16, but we'll try it like this first and see what tests and benchmarks come out as. @ccawley2011 Please have a look at the updated PR now that your PR #1548 has been merged first. It seems to work fine with |
- Remove obsolete checks - Fix checks that are inconsistent - Stop compiling compare256/longest_match variants that never gets called - Improve how the generic compare256 functions are handled. - Allow overriding OPTIMAL_CMP This simplifies the code and avoids having a lot of code in the compiled library than can never get executed.
ef3aea5 to
577c41e
Compare
Fixed indentation.
Actionable comments posted: 0
🧹 Nitpick comments (2)
test/benchmarks/benchmark_compare256_rle.cc (1)
62-68: Consider platform-specific benchmark configurations. Based on the PR discussion:
- Consider adding benchmark configurations for platforms without unaligned access to measure potential performance impact
- For platforms where 8-bit comparisons might be optimal, document how to run these benchmarks with OPTIMAL_CMP=8
- Consider collecting and documenting benchmark results across different architectures (x86_64, Aarch64, PPC64) to guide platform-specific optimizations
Would you like me to help create a benchmark configuration matrix or documentation for platform-specific testing?
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
arch/generic/compare256_p.h (1)
11-42: Consider loop unrolling optimization. The byte-by-byte comparison is implemented with repeated code blocks. While this works, consider using a more maintainable loop unrolling macro or compiler directives.
-    do {
-        if (*src0 != *src1)
-            return len;
-        src0 += 1, src1 += 1, len += 1;
-        if (*src0 != *src1)
-            return len;
-        src0 += 1, src1 += 1, len += 1;
-        // ... repeated 8 times
-    } while (len < 256);
+    #define COMPARE_BYTE_BLOCK(n) \
+        if (*src0 != *src1) return len; \
+        src0 += 1, src1 += 1, len += 1;
+
+    do {
+        COMPARE_BYTE_BLOCK(0)
+        COMPARE_BYTE_BLOCK(1)
+        COMPARE_BYTE_BLOCK(2)
+        COMPARE_BYTE_BLOCK(3)
+        COMPARE_BYTE_BLOCK(4)
+        COMPARE_BYTE_BLOCK(5)
+        COMPARE_BYTE_BLOCK(6)
+        COMPARE_BYTE_BLOCK(7)
+    } while (len < 256);
+
+    #undef COMPARE_BYTE_BLOCK
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- CMakeLists.txt (1 hunks)
- arch/generic/Makefile.in (1 hunks)
- arch/generic/compare256_c.c (1 hunks)
- arch/generic/compare256_p.h (1 hunks)
- arch/generic/generic_functions.h (1 hunks)
- compare256_rle.h (4 hunks)
- deflate_rle.c (1 hunks)
- functable.c (1 hunks)
- match_tpl.h (1 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (2 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
- zmemory.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- CMakeLists.txt
- zmemory.h
- test/test_compare256_rle.cc
- arch/generic/Makefile.in
- match_tpl.h
- functable.c
- test/test_compare256.cc
- deflate_rle.c
- arch/generic/compare256_c.c
- compare256_rle.h
- arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
test/benchmarks/benchmark_compare256.cc
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
🔇 Additional comments (6)
test/benchmarks/benchmark_compare256_rle.cc (2)
62-63: LGTM: New benchmarks for 8-bit and 16-bit comparisons
The addition of benchmarks for 8-bit and 16-bit comparisons aligns with the PR objectives of consolidating comparison functions. These variants are unconditionally available, which matches the architectural independence of byte-level comparisons.
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
64-68: Verify CPU feature detection for conditional benchmarks
The conditional compilation for 32-bit and 64-bit comparisons is correctly gated behind CPU feature checks (HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL).
Let's verify the CPU feature detection macros are consistently defined across the codebase:
✅ Verification successful
CPU feature detection macros are consistently defined and used across the codebase
The verification shows that the CPU feature detection macros HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL are consistently defined and used across the codebase:
- The macros are properly defined in fallback_builtins.h
- They are consistently used to guard CPU-specific code in:
  - Core comparison implementations (compare256_rle.h, deflate_rle.c)
  - Architecture-specific code (arch/x86/, arch/arm/, arch/generic/)
  - Function tables (functable.c)
  - Test and benchmark files
The conditional compilation in the benchmark file aligns with the pattern used throughout the codebase, ensuring that CPU-specific optimizations are only enabled when the corresponding CPU features are available.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify CPU feature detection consistency
# Check if HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL are consistently defined
# Search for CTZ-related macro definitions and usage
echo "=== Searching for CTZ macro definitions ==="
rg -n "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL" --type c --type cpp
# Search for other comparison function implementations that might need similar checks
echo -e "\n=== Searching for related comparison functions ==="
ast-grep --pattern 'compare256_rle_$_'
Length of output: 2620
test/benchmarks/benchmark_compare256.cc (1)
68-74: LGTM! Well-structured benchmark organization.
The new benchmarks are properly organized with appropriate CPU feature guards. The progression from 8-bit to 64-bit comparisons allows for comprehensive performance testing across different architectures.
🧰 Tools
🪛 cppcheck (2.10-2)
[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.
(unknownMacro)
arch/generic/compare256_p.h (1)
71-94: Verify endianness handling in builtin_ctz usage.
The endianness handling for finding the first differing byte is crucial for correctness. Please verify that the calculations __builtin_ctz(diff) / 8 and __builtin_clz(diff) / 8 correctly identify the byte position on all supported platforms.
✅ Verification successful
Let me search more broadly for endianness-related code and tests.
Let me check the implementation details and tests more thoroughly.
Let me check one more thing to verify the memory reading functions that are used alongside the endianness handling.
Based on the analysis of the codebase, I can now provide a final response:
The endianness handling in compare256_32 is correctly implemented
The implementation correctly handles both little-endian and big-endian architectures:
- For little-endian systems, __builtin_ctz(diff) / 8 is used to find the first differing byte
- For big-endian systems, __builtin_clz(diff) / 8 is used instead
- The endianness detection is robust across platforms through zendian.h, which handles various compiler-specific, OS-specific, and architecture-specific cases
- The same pattern is consistently used across the codebase in both 32-bit and 64-bit comparison functions
The implementation is consistent with memory byte ordering:
- In little-endian systems, least significant bits correspond to the first byte in memory, so counting trailing zeros is correct
- In big-endian systems, most significant bits correspond to the first byte in memory, so counting leading zeros is correct
- Division by 8 correctly converts from bit position to byte position in both cases
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any endianness-related test cases
rg -l "BYTE_ORDER|__BYTE_ORDER" test/
Length of output: 39
Script:
#!/bin/bash
# Search for endianness-related code and tests in the entire codebase
rg -l "BYTE_ORDER|__BYTE_ORDER|endian"
# Look for test files that might verify memory comparison functions
fd -e c -e h test
Length of output: 522
Script:
#!/bin/bash
# Check the endianness handling in compare functions across architectures
rg -A 5 -B 5 "builtin_[cl]tz.*diff.*8"
# Check test files for memory comparison tests
cat test/test_shared.h test/test_shared_ng.h
# Check endianness definitions and handling
cat zendian.h
Length of output: 5118
Script:
#!/bin/bash
# Check the implementation of zng_memread functions
rg -A 5 "zng_memread_[48]" zbuild.h
Length of output: 37
zbuild.h (2)
246-254: LGTM! Well-documented OPTIMAL_CMP values.
The comments clearly explain the rationale behind each OPTIMAL_CMP value and their relationship to unaligned access capabilities.
255-280: Verify architecture detection completeness.
The architecture detection appears comprehensive but should be verified against all supported platforms.
✅ Verification successful
Let me gather more information about the architecture-specific code.
Architecture detection is comprehensive and well-implemented
The architecture detection in zbuild.h is thorough and covers all major architectures and their variants:
- x86 family (x86_64, i386/i486/i586/i686)
- ARM family (AArch64/ARM64, ARM32) with unaligned access detection
- PowerPC family (PowerPC64, PowerPC32)
- Default fallback (OPTIMAL_CMP 16)
The implementation is verified by:
- Architecture-specific code in the arch/ directory matches the detected architectures
- Comprehensive architecture detection in cmake/detect-arch.c aligns with the OPTIMAL_CMP definitions
- Platform-specific features are properly handled (e.g., ARM unaligned access detection)
- Proper fallback value is defined when architecture is not detected
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for architecture-specific code that might need OPTIMAL_CMP
rg -l "__x86_64__|__i386__|__aarch64__|__arm__|__powerpc" arch/
Length of output: 187
Script:
#!/bin/bash
# Check content of architecture-specific files for any additional architecture macros
echo "=== Content of x86_functions.h ==="
cat arch/x86/x86_functions.h
echo -e "\n=== Content of arm_features.c ==="
cat arch/arm/arm_features.c
# Search for any other architecture-related macros in the codebase
echo -e "\n=== Other architecture macros in codebase ==="
rg -n "(defined\(__[a-zA-Z0-9_]+\)|defined\(_M_[a-zA-Z0-9_]+\))" --type c --type h
Length of output: 21395
Is there an easy way to get benchmarks for the LONGEST_MATCH functions?
I think the closest you can get currently is the gbench compress* benchmarks.
compare256_p.h, to make tests/benchmarks possible in a clean way

In present develop, we compile generic functions that we never use: 8 variants of longest_match and 5 of compare256, not counting the SSE and similarly optimized versions.
With this PR we only compile the 2 longest_match variants and the 2 compare256 variants that we actually take function pointers to.
64-bit platforms
This PR reduces the size of the compiled library by quite a bit by not compiling code that cannot be accessed.
x86_64: -9984 bytes, so approx 6.9% smaller.
Aarch64: -8600 bytes, so approx 7.3% smaller.
Performance difference on x86-64/Aarch64 is not measurable, because this PR does not change the code they execute; it only removes dead code.
32-bit platforms
Benchmarks on 32-bit platforms should differ, and we do need benchmarks to test those.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores