
Continued cleanup of old UNALIGNED_OK checks #1830

Merged
Dead2 merged 2 commits into develop from remove_unaligned_pt2 on Dec 26, 2024

Conversation


@Dead2 Dead2 commented Dec 20, 2024

  • Remove obsolete checks
  • Fix checks that are inconsistent
  • Stop compiling a bunch of compare256/longest_match variants that never get called
  • Move the generic compare256 functions into a separate file, compare256_p.h, to make clean tests/benchmarks possible
  • Rename functions to get rid of misleading _unaligned names
  • Allow overriding OPTIMAL_CMP define

In the current develop branch, we compile generic functions that are never used: 8 variants of longest_match and 5 of compare256, not counting the SSE and other optimized versions.
With this PR we only compile the 2 longest_match and 2 compare256 variants that we actually take function pointers to.
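Because every call goes through a function pointer chosen at init time, variants that never get assigned to the table are pure dead weight. A minimal sketch of the dispatch pattern (illustrative names, not zlib-ng's actual functable):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch of function-table dispatch: one pointer per primitive,
 * filled in once with the best variant for the running CPU. */
typedef uint32_t (*compare256_func)(const uint8_t *, const uint8_t *);

/* Generic byte-by-byte fallback: length of the common prefix, capped at 256. */
static uint32_t compare256_generic(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    while (len < 256 && src0[len] == src1[len])
        len++;
    return len;
}

static struct { compare256_func compare256; } functable;

static void functable_init(void) {
    /* zlib-ng would consult runtime CPU detection here and pick e.g. an
     * SSE2/AVX2/NEON variant; the generic version is the safe default. */
    functable.compare256 = compare256_generic;
}
```

Under this scheme, any variant the init function can never select need not be compiled at all, which is exactly the size saving measured below.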

64-bit platforms

This PR reduces the size of the compiled library considerably by not compiling code that can never be reached.
x86_64: -9984 bytes, so approx 6.9% smaller.
Aarch64: -8600 bytes, so approx 7.3% smaller.

Performance difference on x86-64/Aarch64 is not measurable, because this makes no changes to the code those platforms execute; it only removes dead code.

32-bit platforms

Results on 32-bit platforms are expected to differ, so we still need benchmarks there before drawing conclusions.

Summary by CodeRabbit

  • New Features

    • Introduced new memory comparison functions for 8, 16, 32, and 64-bit comparisons.
    • Added benchmarks for aligned comparisons.
    • Added new header files to support enhanced functionality.
  • Bug Fixes

    • Streamlined handling of memory comparisons and removed unnecessary checks.
  • Documentation

    • Updated comments and documentation to reflect new function logic and benchmarks.
  • Chores

    • Enhanced build configuration for improved architecture support and clarity.

@Dead2 Dead2 added the "cleanup (Improving maintainability or removing code)" label on Dec 20, 2024

coderabbitai bot commented Dec 20, 2024

Walkthrough

This pull request modifies memory comparison functions in the zlib library, focusing on consolidating and renaming functions related to aligned and unaligned comparisons. The compare256_c_static function has been removed, with its functionality integrated into compare256_c. Several unaligned comparison functions have been renamed, and their declarations updated in header files. Compilation conditions have been simplified by eliminating checks related to unaligned access, relying instead on architecture-specific features. Additionally, benchmark and test files have been updated to reflect these changes, streamlining the handling of aligned comparisons.

Changes

File Change Summary
arch/generic/compare256_c.c - Removed compare256_c_static, merged into compare256_c
- Added functions: compare256_8, compare256_16, compare256_32, compare256_64
- Updated function signatures for unaligned comparisons
- Simplified compilation conditions based on HAVE_BUILTIN_CTZLL and HAVE_BUILTIN_CTZ
arch/generic/generic_functions.h - Removed declarations for unaligned functions
- Added declarations for compare256_c and its aligned counterparts
compare256_rle.h - Removed compare256_rle_c
- Renamed unaligned RLE comparison functions to compare256_rle_16, compare256_rle_32, compare256_rle_64 with updated logic
deflate_rle.c - Updated macro definitions for compare256_rle based on built-in functions
- Simplified conditional compilation checks
match_tpl.h - Removed unaligned memory access handling
- Simplified logic for finding matches
test/benchmarks/benchmark_compare256.cc - Simplified conditional compilation for benchmarks
- Updated references to aligned comparison functions
test/benchmarks/benchmark_compare256_rle.cc - Modified benchmark registration conditions
- Updated function references to aligned comparisons
test/test_compare256.cc - Streamlined test compilation conditions
- Updated test function references
test/test_compare256_rle.cc - Updated test function references
CMakeLists.txt - Updated build configuration, added new private headers related to generic functions
arch/generic/compare256_p.h - Added inline functions for 8, 16, 32, and 64-bit comparisons
functable.c - Updated function pointers to point to architecture-specific implementations
zbuild.h - Updated OPTIMAL_CMP macro definitions and handling of sanitizers
arch/generic/Makefile.in - Added dependency on compare256_p.h for compare256_c.o and compare256_c.lo
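The renamed comparison helpers keep unaligned access safe by loading through memcpy and locating the first mismatching byte with a count-trailing-zeros intrinsic. A minimal sketch of the 32-bit little-endian variant (illustrative, not the exact zlib-ng code; it assumes a little-endian target, as the PR's compile guards require):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch: compare two 256-byte windows 32 bits at a time.
 * memcpy is safe at any alignment; compilers lower it to a single
 * load where the hardware allows unaligned access. */
static uint32_t compare256_32_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    do {
        uint32_t a, b;
        memcpy(&a, src0, sizeof(a));
        memcpy(&b, src1, sizeof(b));
        uint32_t diff = a ^ b;
        if (diff) {
            /* On little-endian, the lowest set bit of the XOR belongs
             * to the first differing byte; >>3 converts bits to bytes. */
            return len + ((uint32_t)__builtin_ctz(diff) >> 3);
        }
        src0 += 4, src1 += 4, len += 4;
    } while (len < 256);
    return 256;
}
```

The 16- and 64-bit variants follow the same shape with different load widths, which is why only the widths actually selected by the build need to be compiled.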

Possibly related PRs

Suggested labels

optimization, Build Env

Suggested reviewers

  • KungFuJesus
  • nmoinvaz



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
deflate_rle.c (1)

Line range hint 13-21: Fix syntax error in preprocessor directive

There's a syntax error in the #elif condition on line 16 - missing closing parenthesis.

Apply this fix:

-#  elif defined(HAVE_BUILTIN_CTZ
+#  elif defined(HAVE_BUILTIN_CTZ)
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 16-16: failed to evaluate #elif condition

(preprocessorErrorDirective)

🧹 Nitpick comments (5)
test/test_compare256_rle.cc (1)

14-16: Consider documenting the ZTEST_ALL workaround

While the workaround is necessary for testing, it would be helpful to document why this approach was chosen and what potential risks it might introduce.

compare256_rle.h (1)

Line range hint 11-45: Consider optimizing the byte comparison loop

The byte-by-byte comparison implementation is safe but could be more maintainable and potentially more efficient.

Consider refactoring to reduce code duplication:

-    do {
-        if (*src0 != *src1)
-            return len;
-        src1 += 1, len += 1;
-        if (*src0 != *src1)
-            return len;
-        src1 += 1, len += 1;
-        // ... repeated 8 times ...
-    } while (len < 256);
+    do {
+        for (int i = 0; i < 8; i++) {
+            if (*src0 != *src1)
+                return len;
+            src1 += 1;
+            len += 1;
+        }
+    } while (len < 256);
arch/generic/compare256_c.c (1)

Line range hint 11-60: Consider applying the same optimization as compare256_rle_c

The implementation could benefit from the same loop optimization suggested earlier to reduce code duplication.

Consider applying the same loop optimization pattern suggested for compare256_rle_c.

chunkset_tpl.h (1)

Line range hint 263-285: Consider additional safety improvements for CHUNKCOPY_SAFE.

The implementation looks good with proper bounds checking and special case handling. However, consider these improvements:

  1. The cast to unsigned in the CHUNKMEMSET call could potentially truncate len on 32-bit systems.
  2. The function could benefit from an Assert for null pointer validation.

Consider applying these changes:

 static inline uint8_t *CHUNKCOPY_SAFE(uint8_t *out, uint8_t *from, uint64_t len, uint8_t *safe)
 {
+    Assert(out != NULL && from != NULL && safe != NULL, "null pointer check");
     if (out == from)
         return out + len;
 
     uint64_t safelen = (safe - out);
     len = MIN(len, safelen);
 
 #ifndef HAVE_MASKED_READWRITE
     uint64_t from_dist = (uint64_t)llabs(safe - from);
     if (UNLIKELY(from_dist < sizeof(chunk_t) || safelen < sizeof(chunk_t))) {
         while (len--) {
             *out++ = *from++;
         }
 
         return out;
     }
 #endif
 
-    return CHUNKMEMSET(out, from, (unsigned)len);
+    /* Handle potential truncation on 32-bit systems */
+    if (UNLIKELY(len > UINT_MAX)) {
+        while (len > UINT_MAX) {
+            out = CHUNKMEMSET(out, from, UINT_MAX);
+            from += UINT_MAX;
+            len -= UINT_MAX;
+        }
+    }
+    return CHUNKMEMSET(out, from, (unsigned)len);
 }
test/benchmarks/benchmark_compare256.cc (1)

15-18: Consider a more maintainable alternative to direct .c file inclusion

The current workaround of directly including a .c file and using a macro to access all functions suggests potential architectural issues. This approach could lead to maintenance challenges and potential symbol conflicts.

Consider these alternatives:

  1. Move the common functions to a header file
  2. Create a proper testing API in the compare256 module
  3. Use proper extern declarations for the test functions
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fa76be and 9ca6a80.

📒 Files selected for processing (17)
  • .github/workflows/cmake.yml (1 hunks)
  • CMakeLists.txt (0 hunks)
  • README.md (1 hunks)
  • arch/generic/compare256_c.c (3 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • chunkset_tpl.h (1 hunks)
  • cmake/detect-sanitizer.cmake (1 hunks)
  • compare256_rle.h (3 hunks)
  • configure (0 hunks)
  • deflate_rle.c (1 hunks)
  • insert_string_tpl.h (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (2 hunks)
  • zbuild.h (1 hunks)
💤 Files with no reviewable changes (3)
  • match_tpl.h
  • configure
  • CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • README.md
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🪛 cppcheck (2.10-2)
deflate_rle.c

[error] 16-16: failed to evaluate #elif condition

(preprocessorErrorDirective)

🔇 Additional comments (20)
cmake/detect-sanitizer.cmake (1)

114-114: LGTM! Unconditional alignment sanitizer check improves safety.

The unconditional inclusion of the 'alignment' check in the sanitizer list is a positive change that aligns with the PR's objective of cleaning up unaligned memory access checks. This will help catch potential memory alignment issues across all builds.

zbuild.h (3)

246-254: LGTM! Well-structured architecture-specific optimizations.

The implementation correctly defines optimal comparison sizes for x86_64 (64 bytes), i386 (32 bytes), and ARM64 (64 bytes with unaligned access support). The architecture detection and conditions are accurate.


255-262: LGTM! Proper handling of ARM and PowerPC architectures.

The code correctly:

  • Checks for ARM FEATURE_UNALIGNED support
  • Handles PowerPC64 with appropriate endianness check
  • Sets appropriate OPTIMAL_CMP values (32/64 bytes) based on architecture capabilities

264-269: LGTM! Safe fallback mechanism.

The implementation provides a robust fallback mechanism:

  1. Respects NO_UNALIGNED override by unsetting OPTIMAL_CMP
  2. Defaults to a safe 8-byte comparison size when no architecture-specific optimization is available
.github/workflows/cmake.yml (1)

84-88: LGTM! Simplified UBSAN workflow configuration.

The changes correctly:

  • Update the job name to reflect the removal of unaligned checks
  • Remove the redundant WITH_UNALIGNED flag from cmake arguments
    This simplification aligns with the PR's cleanup objectives.
test/test_compare256_rle.cc (1)

58-65: LGTM: Proper replacement of UNALIGNED_OK checks

The change from UNALIGNED_OK to BYTE_ORDER == LITTLE_ENDIAN is appropriate since unaligned access is primarily a concern for little-endian architectures. The nested checks for HAVE_BUILTIN_CTZ/CTZLL are also properly structured.

test/benchmarks/benchmark_compare256_rle.cc (2)

14-16: LGTM: Consistent with test file changes

The ZTEST_ALL workaround is consistently applied across test files.


67-74: LGTM: Proper conditional compilation structure

The conditional compilation structure correctly mirrors the test file, maintaining consistency across the test suite.

test/test_compare256.cc (2)

72-80: LGTM: Proper architecture-specific conditional compilation

The changes correctly:

  1. Replace UNALIGNED_OK with BYTE_ORDER checks
  2. Maintain proper nesting of CPU feature checks
  3. Keep consistent structure with other test files

16-18: Review the direct inclusion of C implementation file

The comment acknowledges this as an "unorthodox workaround", but including a .c file directly is generally discouraged. Consider:

  1. Moving the test-specific functions to a header file
  2. Creating a test-specific header that exposes these functions
insert_string_tpl.h (1)

32-39: Verify performance impact of consistent memcpy usage

The changes improve safety and consistency by using memcpy for all cases, but we should verify there's no significant performance regression.

Consider benchmarking the new implementation against the old one, particularly on platforms where unaligned access was previously optimized.
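For reference, the memcpy-based read pattern under discussion is itself cheap: compilers lower it to a single load on targets that support unaligned access. A hedged sketch of such a reader (the function name is illustrative, not the template's actual macro):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch: read 32 bits of hash input through memcpy instead of
 * a direct unaligned pointer dereference. Well-defined at any
 * alignment, and optimized to one load where the hardware allows it. */
static inline uint32_t hash_calc_read(const uint8_t *strstart) {
    uint32_t val;
    memcpy(&val, strstart, sizeof(val));
    return val;
}
```

Unlike a cast such as `*(const uint32_t *)strstart`, this version avoids undefined behavior on strict-alignment targets, which is why sanitizer-clean builds favor it.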

✅ Verification successful

Let's examine the benchmark files to understand if they specifically test the hash calculation performance related to this change.


Let's check the previous implementation to compare with the new changes.


Let's examine PR #1828 since it's directly related to our changes.


Based on the PR comments and benchmark results, I can now generate the final response:

Performance impact of memcpy usage has been verified

The benchmark results from the PR show no significant performance regression from using memcpy consistently. In fact, the changes show:

  • Comparable performance across all compression levels
  • Slight improvements in some cases (level 9)
  • No measurable decompression speed impact
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Look for existing benchmarks that measure hash calculation performance
rg -l "benchmark.*hash" test/benchmarks/

Length of output: 226


Script:

#!/bin/bash
# Check content of slidehash benchmark which is most relevant to our case
rg -A 10 "benchmark.*hash" test/benchmarks/benchmark_slidehash.cc

# Also check the implementation details around HASH_CALC_READ
rg -B 5 -A 10 "HASH_CALC_READ" --type cpp

Length of output: 3280


Script:

#!/bin/bash
# Check the git history for the previous implementation of HASH_CALC_READ
git log -p --no-merges -S "HASH_CALC_READ" insert_string_tpl.h

# Also check if there are any existing performance-related issues or discussions
gh issue list --search "performance hash"
gh pr list --search "performance hash"

Length of output: 408


Script:

#!/bin/bash
# Get the PR description and comments for PR #1828
gh pr view 1828 --json body,comments

# Also check for any benchmark results or performance discussions in the PR
gh pr view 1828 --json comments | jq -r '.comments[].body' | grep -i "performance\|benchmark"

Length of output: 35801

compare256_rle.h (3)

Line range hint 47-74: LGTM: Safe implementation of unaligned 16-bit comparison

The implementation correctly uses memcpy for safe unaligned memory access and properly handles byte order requirements.


Line range hint 76-103: LGTM: Efficient implementation using compiler intrinsics

The implementation makes good use of __builtin_ctz for efficient comparison while maintaining safety with proper memory access patterns.


Line range hint 104-132: LGTM: Verify 64-bit alignment requirements

The implementation is efficient and safe, using appropriate compiler intrinsics for 64-bit operations.

✅ Verification successful

Let me verify the usage of the 64-bit operations and their platform dependencies further.


LGTM: 64-bit operations are properly guarded and have fallbacks

The implementation is safe with proper platform-specific guards:

  • The 64-bit operations are only used when HAVE_BUILTIN_CTZLL is available
  • There are appropriate fallbacks to 32-bit (HAVE_BUILTIN_CTZ), 16-bit, and byte-by-byte comparison implementations
  • The feature detection is handled during build configuration via CMake/configure
  • The code is well-tested with both unit tests and benchmarks for all variants
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify that 64-bit operations are only used on supported platforms
rg -l "HAVE_BUILTIN_CTZLL" | xargs rg "compare256.*64"

Length of output: 1335


Script:

#!/bin/bash
# Check how HAVE_BUILTIN_CTZLL is defined and its platform dependencies
rg "HAVE_BUILTIN_CTZLL" -B 3 -A 3

# Check if there are any platform-specific checks or fallbacks
rg "compare256_rle" -B 2 -A 2

Length of output: 19454

arch/generic/generic_functions.h (2)

34-53: LGTM: Well-structured conditional compilation

The implementation properly handles different architectures and capabilities with a clear fallback hierarchy.


Line range hint 57-71: LGTM: Clear and maintainable implementation selection

The generic implementation selection is well-structured and maintainable.

arch/generic/compare256_c.c (2)

Line range hint 62-142: LGTM: Safe and efficient unaligned comparison implementations

The implementations properly handle unaligned access and follow best practices for memory comparison.


Line range hint 143-183: LGTM: Consistent implementation for 64-bit operations

The implementation maintains consistency with other bit-width implementations while properly handling 64-bit specific requirements.

chunkset_tpl.h (1)

230-232: LGTM! Verify OPTIMAL_CMP usage across the codebase.

The switch from UNALIGNED checks to OPTIMAL_CMP for alignment control is a good improvement. The alignment masks (7 for 8-byte alignment, 3 for 4-byte alignment) are correctly defined based on the optimal comparison size.

Let's verify the OPTIMAL_CMP usage:

✅ Verification successful

OPTIMAL_CMP usage is consistent and properly defined

The verification confirms that:

  • OPTIMAL_CMP is properly defined in zbuild.h with architecture-specific values (64 for x86_64/ARM64, 32 for x86/ARM, fallback to 8)
  • The alignment masks in chunkset_tpl.h (7 for OPTIMAL_CMP < 32, 3 for OPTIMAL_CMP == 32) correctly correspond to the defined OPTIMAL_CMP values
  • No remaining UNALIGNED_OK checks were found, confirming successful migration
  • The alignment handling is consistent throughout the codebase with proper boundary checks
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check OPTIMAL_CMP usage and definitions
# Verify that OPTIMAL_CMP is consistently used and properly defined

# Find OPTIMAL_CMP definition and usage
rg -n "OPTIMAL_CMP" --type c --type h

# Check if there are any remaining UNALIGNED checks that should be migrated
rg -n "UNALIGNED.*OK" --type c --type h

Length of output: 508


Script:

#!/bin/bash
# Let's examine the context around OPTIMAL_CMP definitions and usage

# Check zbuild.h definitions with context
rg -A 5 -B 5 "OPTIMAL_CMP" zbuild.h

# Check chunkset_tpl.h usage with context
rg -A 5 -B 5 "OPTIMAL_CMP" chunkset_tpl.h

# Look for any other alignment-related macros that might interact
rg -n "(ALIGN|align)" chunkset_tpl.h

Length of output: 1935

test/benchmarks/benchmark_compare256.cc (1)

73-81: Verify architectural assumptions about endianness

The unaligned benchmarks are now conditionally compiled only for little-endian systems. While this aligns with removing UNALIGNED_OK checks, we should verify:

  1. That unaligned access is indeed only supported/optimized for little-endian
  2. Whether we're missing test coverage for big-endian systems
✅ Verification successful

Based on the search results, I can now generate the final response:

Endianness check for unaligned access is correctly implemented

The endianness-specific code patterns reveal that:

  • Unaligned access optimizations are consistently guarded by LITTLE_ENDIAN checks across the codebase (compare256_c.c, compare256_rle.h)
  • Big-endian systems are properly handled with fallback implementations (compare256_c.c has #if BYTE_ORDER != LITTLE_ENDIAN || defined(ZTEST_ALL))
  • Test coverage for both endianness exists through the ZTEST_ALL macro
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any other endianness-specific code patterns
rg "BYTE_ORDER|LITTLE_ENDIAN|BIG_ENDIAN" --type c --type cpp

# Look for other unaligned access implementations
rg "unaligned|UNALIGNED" --type c --type cpp

Length of output: 10989

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 9ca6a80 to f0bdd49 on December 20, 2024 at 22:55
@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from f0bdd49 to 97964e1 on December 20, 2024 at 23:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
compare256_rle.h (2)

Line range hint 11-45: LGTM! Consider adding documentation for the big-endian implementation.

The new byte-by-byte comparison implementation for big-endian systems is safe and correct. The conditional compilation based on byte order is an improvement over the previous UNALIGNED_OK checks.

Consider adding a brief comment explaining why a byte-by-byte comparison is used for big-endian systems versus the optimized versions for little-endian.


Line range hint 1-132: Good architectural improvements in alignment handling.

The changes successfully:

  1. Replace UNALIGNED_OK with more precise byte-order based conditions
  2. Maintain memory safety through consistent use of memcpy
  3. Provide appropriate implementations for different architectures

This aligns well with the PR's objective of cleaning up alignment checks while potentially improving maintainability.

Consider documenting these architectural decisions in a comment block at the top of the file to help future maintainers understand the design choices.

arch/generic/compare256_c.c (2)

Line range hint 11-60: Consider performance optimization for byte comparison.

While the byte-by-byte comparison is safe, it could be optimized for better performance on big-endian systems.

Consider using word-sized comparisons when alignment permits:

 static inline uint32_t compare256_c_static(const uint8_t *src0, const uint8_t *src1) {
     uint32_t len = 0;
+    // Fast path: compare aligned words first
+    if (((uintptr_t)src0 & (sizeof(size_t)-1)) == 0 &&
+        ((uintptr_t)src1 & (sizeof(size_t)-1)) == 0) {
+        while (len + sizeof(size_t) <= 256) {
+            if (*(size_t*)src0 != *(size_t*)src1)
+                break;
+            src0 += sizeof(size_t);
+            src1 += sizeof(size_t);
+            len += sizeof(size_t);
+        }
+    }
+    // Fallback to byte comparison
     do {
         if (*src0 != *src1)
             return len;

Line range hint 11-144: Architecture and implementation hierarchy looks solid.

The implementation provides a good balance of safety and performance with:

  1. Byte-by-byte fallback for big-endian
  2. 16-bit operations using safe memcmp
  3. 32/64-bit operations using memcpy + builtin_ctz

Consider adding architecture-specific optimized implementations in separate files for frequently used platforms.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ca6a80 and 97964e1.

📒 Files selected for processing (9)
  • arch/generic/compare256_c.c (3 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (3 hunks)
  • deflate_rle.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (2 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • test/benchmarks/benchmark_compare256.cc
  • test/benchmarks/benchmark_compare256_rle.cc
  • deflate_rle.c
  • test/test_compare256_rle.cc
  • test/test_compare256.cc
🔇 Additional comments (6)
compare256_rle.h (2)

Line range hint 47-104: Verify BYTE_ORDER macro availability across all supported platforms.

The change from UNALIGNED_OK to BYTE_ORDER-based conditional compilation is good, but we should ensure BYTE_ORDER is consistently defined.

✅ Verification successful

The search results show that BYTE_ORDER is well-defined and consistently handled across the codebase through the zendian.h header file. This header implements a robust fallback mechanism:

  1. First checks compiler-provided __BYTE_ORDER__
  2. Then tries various system endian headers (endian.h, machine/endian.h, sys/endian.h)
  3. Finally provides default definitions if needed

Let me verify the actual implementation in zendian.h to ensure it's robust:


BYTE_ORDER macro is properly defined and safe to use across all supported platforms

The implementation in zendian.h is robust and comprehensive, covering all major platforms and architectures:

  • Compiler-provided __BYTE_ORDER__ detection
  • Windows platforms via architecture detection
  • Unix-like systems through standard endian headers
  • Special handling for MinGW, Apple, BSD, and Solaris
  • Proper fallback mechanisms and error detection

The change from UNALIGNED_OK to BYTE_ORDER-based conditional compilation is safe and well-supported.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if BYTE_ORDER is consistently defined across the codebase
# and if there are any platform-specific definitions we need to handle

# Search for BYTE_ORDER definitions or includes
rg -n "BYTE_ORDER|endian\.h" 

# Check for platform-specific handling
rg -n "#if.*BYTE_ORDER" 

Length of output: 5853


Script:

#!/bin/bash
# Check the complete implementation of zendian.h
cat zendian.h

Length of output: 1720


Line range hint 104-132: Verify fallback behavior when builtin functions are unavailable.

The optimized implementations correctly use memcpy for safe unaligned access. However, we should verify the fallback behavior when HAVE_BUILTIN_CTZ/CTZLL are not defined.

✅ Verification successful

Fallback behavior is properly implemented for platforms without CTZ/CTZLL builtins

The codebase has a well-designed fallback mechanism:

  1. In deflate_rle.c, there's a clear fallback chain:
#if BYTE_ORDER == LITTLE_ENDIAN
#  if defined(HAVE_BUILTIN_CTZLL)
#    define compare256_rle compare256_rle_unaligned_64
#  elif defined(HAVE_BUILTIN_CTZ)
#    define compare256_rle compare256_rle_unaligned_32
#  else
#    define compare256_rle compare256_rle_unaligned_16
#  endif
#else
#  define compare256_rle compare256_rle_c
#endif
  1. The compare256_rle_unaligned_16 implementation is always available and doesn't rely on builtin functions, serving as the base fallback.
  2. For big-endian systems, it falls back to the basic byte-by-byte comparison (compare256_rle_c).
  3. The compiler detection for CTZ/CTZLL is properly handled in both CMake and configure scripts.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how the code handles platforms without CTZ/CTZLL builtins

# Search for fallback implementations
rg -n "fallback.*ctz" 

# Check if there are platform-specific alternatives
ast-grep --pattern 'define HAVE_BUILTIN_CTZ'
ast-grep --pattern 'define HAVE_BUILTIN_CTZLL'

Length of output: 119


Script:

#!/bin/bash
# Let's check for any non-builtin implementations of compare256_rle
rg -n "compare256_rle" -A 10

# Also search for any other CTZ implementations
rg -n "ctz" -B 2 -A 2

# Check for compiler detection macros
rg -n "HAVE_BUILTIN" -B 2 -A 2

Length of output: 41449

arch/generic/generic_functions.h (2)

57-58: LGTM! Generic implementation selection is consistent.

The selection of generic implementations aligns well with the architecture-specific function declarations above.


34-38: LGTM! Verify function availability across architectures.

The reorganization of functions under BYTE_ORDER != LITTLE_ENDIAN improves clarity and architecture-specific handling.

Let's verify the function declarations match their implementations:

✅ Verification successful

Let's gather more information to verify the implementations.


Let's check the actual implementations in compare256_c.c to verify the function definitions.


Function declarations are correctly implemented and conditionally defined

The verification confirms that:

  • The functions are correctly implemented in compare256_c.c under the appropriate endianness conditions
  • longest_match_c and longest_match_slow_c are defined via template inclusion under BYTE_ORDER == BIG_ENDIAN
  • compare256_unaligned_c is not directly visible but is part of the unaligned comparison implementations for LITTLE_ENDIAN systems
  • The architecture-specific organization in generic_functions.h aligns with the actual implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify function declarations match implementations
# Search for function implementations in source files
rg -l "^(uint32_t |void |static.*)(longest_match_c|longest_match_slow_c|compare256_unaligned_c)\b" arch/

Length of output: 139


Script:

#!/bin/bash
# Search for the function implementations with context
rg -A 5 "^(uint32_t |void |static.*)(longest_match_c|longest_match_slow_c|compare256_unaligned_c)\b" --type c

# Also check for any potential references to these functions
rg "(longest_match_c|longest_match_slow_c|compare256_unaligned_c)" --type c

Length of output: 1533


Script:

#!/bin/bash
# Get the content of compare256_c.c to verify implementations
cat arch/generic/compare256_c.c

Length of output: 4961

arch/generic/compare256_c.c (2)

Line range hint 62-142: LGTM! Safe unaligned access implementation.

The use of zng_memcmp_2 ensures safe 16-bit comparisons without alignment requirements.


143-144: Verify memory safety for 64-bit comparisons.

The removal of the UNALIGNED64_OK check relies on memcpy for safe access, but we should verify this is sufficient across all platforms.

Let's check for potential alignment issues:

✅ Verification successful

The previous search revealed alignment-related code across the codebase, but we need to specifically examine the implementation of compare256 functions to verify the memory safety aspects after the UNALIGNED64_OK removal. Let's check the actual implementation and memcpy usage.


Memory safety is properly handled through memcpy

The implementation correctly uses memcpy for both 32-bit and 64-bit comparisons, which ensures safe unaligned memory access across all platforms. The removal of UNALIGNED64_OK check is safe because:

  • All memory accesses are handled through memcpy, which safely handles unaligned access
  • The implementation uses proper sized buffers (sizeof) for both source and destination
  • This approach is platform-agnostic and safer than direct unaligned access
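The safety argument above rests on a single idiom: reading a wide word through memcpy instead of a cast. A sketch (helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of why the memcpy pattern is safe: a 64-bit load through memcpy is
 * well-defined at any alignment, unlike *(const uint64_t *)p, which is
 * undefined behavior when p is misaligned. On targets with fast unaligned
 * loads, compilers lower this memcpy to a single load instruction. */
static uint64_t load64_sketch(const uint8_t *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical check: read from an intentionally misaligned offset and
 * confirm the bytes round-trip, independent of host endianness. */
static int load64_demo(void) {
    uint8_t buf[16];
    for (int i = 0; i < 16; i++) buf[i] = (uint8_t)(i + 1);
    uint64_t v = load64_sketch(buf + 1);  /* misaligned by one byte */
    return memcmp(&v, buf + 1, sizeof(v)) == 0;
}
```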
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for alignment-related issues or discussions
# Look for alignment-related comments or issues
rg -i "align|unaligned" --type md
rg -i "align|unaligned" --type cpp arch/ include/ -C2

Length of output: 9364


Script:

#!/bin/bash
# Let's check the implementation details of compare256 functions
ast-grep --pattern 'compare256_unaligned_64($$$)'
ast-grep --pattern 'compare256_unaligned_32($$$)'
ast-grep --pattern 'compare256_unaligned_16($$$)'

# Also check for memcpy usage in these functions
rg "memcpy" arch/generic/compare256_c.c -C3

Length of output: 530

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 97964e1 to 9a7d70b Compare December 20, 2024 23:51
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
compare256_rle.h (1)

Line range hint 78-102: LGTM: Optimized 32-bit and 64-bit implementations

The 32-bit and 64-bit implementations:

  • Efficiently use CPU features (CTZ/CTZLL)
  • Handle alignment safely through memcpy
  • Properly calculate match lengths

Consider adding a comment explaining the bit manipulation logic for better maintainability.

Also applies to: 104-132
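The bit manipulation the comment would explain boils down to one trick, sketched here (the real code guards the GCC/Clang builtins with HAVE_BUILTIN_CTZ and BYTE_ORDER checks; function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mismatch-location trick: XOR the two 32-bit words; the lowest set bit of
 * the difference lies inside the first differing byte on little-endian, so
 * __builtin_ctz(diff) / 8 is that byte's index. On big-endian the first byte
 * holds the highest-order bits, so CLZ is used instead. */
static uint32_t first_diff_byte32(const uint8_t *a, const uint8_t *b) {
    uint32_t av, bv, diff;
    memcpy(&av, a, sizeof(av));
    memcpy(&bv, b, sizeof(bv));
    diff = av ^ bv;
    if (diff == 0)
        return 4;  /* all four bytes equal */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return (uint32_t)__builtin_clz(diff) / 8;
#else
    return (uint32_t)__builtin_ctz(diff) / 8;
#endif
}
```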

arch/generic/generic_functions.h (1)

44-46: Fix indentation inconsistency.

The indentation in the fallback case doesn't match the style used in other conditional blocks.

-    uint32_t longest_match_16(deflate_state *const s, Pos cur_match);
-    uint32_t longest_match_slow_16(deflate_state *const s, Pos cur_match);
-    uint32_t compare256_16(const uint8_t *src0, const uint8_t *src1);
+        uint32_t longest_match_16(deflate_state *const s, Pos cur_match);
+        uint32_t longest_match_slow_16(deflate_state *const s, Pos cur_match);
+        uint32_t compare256_16(const uint8_t *src0, const uint8_t *src1);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97964e1 and 9a7d70b.

📒 Files selected for processing (9)
  • arch/generic/compare256_c.c (5 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (4 hunks)
  • deflate_rle.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (2 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (2 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/test_compare256_rle.cc
  • test/benchmarks/benchmark_compare256.cc
  • test/test_compare256.cc
  • arch/generic/compare256_c.c
🧰 Additional context used
📓 Learnings (1)
arch/generic/generic_functions.h (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
🔇 Additional comments (7)
deflate_rle.c (1)

13-21: LGTM: Well-structured function selection hierarchy

The changes implement a clear and logical selection of comparison functions:

  1. 64-bit variant when CTZLL is available
  2. 32-bit variant when CTZ is available
  3. 16-bit variant as a fallback for little-endian
  4. Basic C implementation for big-endian

This aligns with the PR objectives by removing UNALIGNED_OK checks while maintaining optimal function selection based on available CPU features.

compare256_rle.h (2)

Line range hint 11-45: LGTM: Well-implemented basic comparison function

The C implementation is well-structured with:

  • Proper bounds checking (len < 256)
  • Efficient unrolled loop for byte comparisons
  • Clear fallback for big-endian systems

Line range hint 47-75: LGTM: Safe 16-bit comparison implementation

The 16-bit comparison function:

  • Uses safe memory access through memcpy
  • Properly handles partial matches
  • Maintains correct alignment requirements
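The partial-match handling mentioned above comes down to one expression: when a two-byte word differs, the pair may still share its first byte. A sketch of that step (name and shape are illustrative, not the library API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the partial-match rule the 16-bit comparators use: if the
 * two-byte words differ, the match still extends by one byte when the first
 * bytes of the pair are equal, hence `len + (*src0 == *src1)`. */
static uint32_t cmp16_step(const uint8_t *src0, const uint8_t *src1, uint32_t len) {
    uint16_t sv, mv;
    memcpy(&sv, src0, sizeof(sv));
    memcpy(&mv, src1, sizeof(mv));
    if (sv != mv)
        return len + (*src0 == *src1);  /* 0 or 1 extra matching byte */
    return len + 2;                     /* whole word matched */
}
```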
arch/generic/generic_functions.h (2)

34-51: LGTM! Well-structured function declarations based on CPU capabilities.

The hierarchical organization (CTZLL > CTZ > fallback) provides optimal function selection for each architecture, with a clean separation between LITTLE_ENDIAN and BIG_ENDIAN paths.


56-69: Verify implementation and usage of generic functions.

The macro definitions look good, maintaining a clean mapping to the declared functions. Let's verify:

  1. Implementation of all referenced functions
  2. Usage of these macros in dependent code
test/benchmarks/benchmark_compare256_rle.cc (2)

67-74: LGTM! Clean conditional compilation structure

The changes improve the code by:

  1. Replacing UNALIGNED_OK checks with direct endianness checks
  2. Using consistent defined() checks for builtin functions
  3. Simplifying function names

Let's verify the consistency of these changes across the codebase:

✅ Verification successful

LGTM! Changes are consistent with the codebase

The verification confirms:

  1. No remaining UNALIGNED_OK checks found, indicating complete cleanup
  2. The compare256_rle functions (16/32/64) are used consistently across implementation, tests, and benchmarks
  3. BYTE_ORDER == LITTLE_ENDIAN checks are used uniformly throughout the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any remaining UNALIGNED_OK checks that might need cleanup
rg "UNALIGNED_OK" 

# Verify the new function names are used consistently
rg "compare256_rle_(16|32|64)"

# Check for consistent endianness checks
rg "BYTE_ORDER.*LITTLE_ENDIAN"

Length of output: 3072


13-15: Consider documenting the ZTEST_ALL workaround more thoroughly

While workarounds in test code are sometimes necessary, it would be helpful to document:

  1. Why this workaround is needed
  2. What functions it provides access to
  3. Whether this could be solved through a better design

Let's check if this workaround is consistently used across other test files:

@codecov

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 86.53846% with 14 lines in your changes missing coverage. Please review.

Project coverage is 82.44%. Comparing base (d7e121e) to head (577c41e).
Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
test/benchmarks/benchmark_compare256.cc 0.00% 0 Missing and 4 partials ⚠️
test/benchmarks/benchmark_compare256_rle.cc 0.00% 0 Missing and 4 partials ⚠️
test/test_compare256.cc 25.00% 0 Missing and 3 partials ⚠️
test/test_compare256_rle.cc 25.00% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1830      +/-   ##
===========================================
+ Coverage    82.20%   82.44%   +0.24%     
===========================================
  Files          137      138       +1     
  Lines        11001    10994       -7     
  Branches      2695     2778      +83     
===========================================
+ Hits          9043     9064      +21     
+ Misses        1261     1235      -26     
+ Partials       697      695       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 9a7d70b to a1076de Compare December 21, 2024 12:30
@Dead2
Member Author

Dead2 commented Dec 21, 2024

Rebased, but there are still a few small things to fix; I have to get some Christmas shopping done first, though.

Contributor

@ccawley2011 ccawley2011 left a comment


How would these changes affect performance on platforms without unaligned memory access?

/* 16-bit unaligned integer comparison */
static inline uint32_t compare256_unaligned_16_static(const uint8_t *src0, const uint8_t *src1) {
/* 16-bit integer comparison */
static inline uint32_t compare256_16_static(const uint8_t *src0, const uint8_t *src1) {
Contributor


It might be better to keep the byte version rather than the 16-bit version. The byte version would potentially perform better on platforms without unaligned reads, and platforms without dedicated clz/ctz instructions would likely end up using the 32-bit or 64-bit versions anyway, because GCC typically provides generic fallbacks for the __builtin_clz/__builtin_ctz builtins on all architectures.

Member


I think I would have to agree that keeping the C versions is probably a good idea for those platforms.

Member Author


Added the 8-bit version back, but not currently in use by any platform.

Member


The code you have now is a good compromise.

@Dead2
Member Author

Dead2 commented Dec 22, 2024

How would these changes affect performance on platforms without unaligned memory access?

That is something I want some help testing.
Hopefully it is not a very big loss, but we can add back restrictions if it is worth it. Most of the cleanup is not contingent on that, but it did look like something that should be investigated.

In my godbolt tests, it looks like the emulated __builtin_clz should be faster than the old _c. It adds only a few more instructions, but loses a lot more in longest_match.
But as of yet, this is just a hypothesis that we hope is reflected in actual benchmarks as well.

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from a1076de to 1ec6c36 Compare December 22, 2024 00:13
@Dead2
Member Author

Dead2 commented Dec 22, 2024

Oh, if we find any platforms that actually need the byte-by-byte functions, they should be specifically added with #define OPTIMAL_CMP 8, so that we can optimize for it and also document it for posterity.
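For reference, the byte-by-byte shape such a platform would opt into looks roughly like this (a sketch; name, unroll factor, and driver are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a byte-by-byte compare256 variant, the kind a platform would
 * select with `#define OPTIMAL_CMP 8`: no multi-byte loads, no CTZ builtins. */
static uint32_t cmp256_8_sketch(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    do {
        if (*src0 != *src1) return len;
        src0++, src1++, len++;
        if (*src0 != *src1) return len;
        src0++, src1++, len++;
    } while (len < 256);
    return 256;
}

/* Hypothetical driver: identical 256-byte buffers with one byte flipped at
 * index `at` (no flip if at >= 256). */
static uint32_t cmp256_8_demo(uint32_t at) {
    uint8_t a[256], b[256];
    for (int i = 0; i < 256; i++) a[i] = b[i] = (uint8_t)(i * 7u);
    if (at < 256) b[at] ^= 0xFFu;
    return cmp256_8_sketch(a, b);
}
```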

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/generic/compare256_p.h (1)

11-35: Check for potential loop unrolling or consolidation.
The do-while loop repeats the same compare-block logic four times before checking its boundary condition. This approach might be manually unrolling the loop for performance, but it can reduce readability. Consider using a for loop and unroll it via compiler settings or macros, if needed. That can keep the logic simpler to maintain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1076de and 1ec6c36.

📒 Files selected for processing (12)
  • CMakeLists.txt (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (3 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/test_compare256_rle.cc
  • deflate_rle.c
  • arch/generic/generic_functions.h
  • compare256_rle.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (14)
arch/generic/compare256_p.h (4)

1-4: License header looks fine and consistent with project standards.
No further action needed.


6-10: Includes are appropriate and minimal.
These header inclusions provide the necessary dependencies without apparent redundancies.


64-90: 64-bit mismatch detection logic is consistent with 32-bit version.
The approach extends naturally to 64-bit words. The cast of match_byte to uint32_t is correct, ensuring the correct return type. Good job retaining a consistent pattern.


36-62: Verify usage of built-in functions for mismatch detection.
The code uses __builtin_ctz for determining the first differing byte. This is a good optimization, but confirm that all target compilers and architectures support it correctly (especially cross-compiling scenarios).

✅ Verification successful

Built-in function usage is properly guarded and verified

The codebase has robust checks for __builtin_ctz support:

  • Build-time detection in both CMake and configure scripts that test compiler support
  • Proper macro guards (HAVE_BUILTIN_CTZ) around all usage
  • Fallback implementations in fallback_builtins.h for platforms without native support
  • Architecture-specific optimizations (SSE2, AVX2, NEON) with appropriate feature detection
  • Correct handling of endianness with BYTE_ORDER checks
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Searching for usage of __builtin_ctz in the codebase to verify compatibility across files.
ast-grep --pattern $'__builtin_ctz($_)' 

Length of output: 40


Script:

#!/bin/bash
# Let's check for HAVE_BUILTIN_CTZ macro definition and its usage
rg "HAVE_BUILTIN_CTZ" -B 2 -A 2

# Also search for any configure/build system checks for this builtin
fd -e configure -e m4 -e cmake -X rg "builtin_ctz|__builtin_ctz"

# Check if there are any other similar built-in function checks
rg "__builtin_(clz|ctz)" -B 2 -A 2

Length of output: 18163

arch/generic/compare256_c.c (3)

10-10: Include for compare256_p.h ensures the inline functions are accessible.
This is an appropriate place to include the new header.


13-18: Dynamic selection of comparison variant is clear.
The logic uses #if checks around HAVE_BUILTIN_CTZLL, HAVE_BUILTIN_CTZ, and a fallback for 16-bit variant. This is concise and helps maintain a smaller code footprint.


22-44: Macro definitions for generating longest_match variants.
The approach is consistent and maintains similar selection logic for different built-ins. No immediate issues here. Just ensure match_tpl.h references the correct macros.

test/benchmarks/benchmark_compare256_rle.cc (1)

62-67: Benchmark registration is concise; watch for unknown macro warnings.
Cppcheck warns about potential unknown macros (e.g., BENCHMARK_DEFINE_F). If that macro is from a recognized library (Google Benchmark), this is typically harmless, but consider configuring the static analyzer to avoid false positives.

✅ Verification successful

Let me verify the macro usage by checking for the benchmark header inclusion and the actual macro being used in the file under review.


Macro warnings can be safely ignored - proper benchmark framework integration confirmed

The code properly includes the Google Benchmark framework (<benchmark/benchmark.h>), and the BENCHMARK_DEFINE_F macro is correctly used within a custom macro BENCHMARK_COMPARE256_RLE. The implementation follows the standard Google Benchmark fixture pattern with proper registration. Any Cppcheck warnings about unknown macros in this context can be safely ignored.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Attempt to locate definition of BENCHMARK_DEFINE_F in the codebase
rg -A 5 'BENCHMARK_DEFINE_F'

Length of output: 4987


Script:

#!/bin/bash
# Check the includes and macro definition in the file
cat test/benchmarks/benchmark_compare256_rle.cc

Length of output: 1925

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/test_compare256.cc (2)

15-15: Including "arch/generic/compare256_p.h" directly.
This can sometimes raise concerns about mixing test code with private/internal headers. However, the comment in past reviews clarifies this is intentional for direct testing.


67-72: Tests for 16, 32, and 64-bit variants.
All three coverage targets look correct. This ensures thorough testing of each specialized function.

test/benchmarks/benchmark_compare256.cc (2)

15-15: LGTM: Header inclusion aligns with architectural changes

The addition of compare256_p.h is consistent with the move towards aligned comparisons and the removal of unaligned variants.


68-74: Verify architecture-specific benchmark registration

The replacement of unaligned benchmarks with aligned versions is good, but we should verify the architecture-specific conditions:

  1. Base compare256_16 is unconditional
  2. compare256_32 requires HAVE_BUILTIN_CTZ
  3. compare256_64 requires HAVE_BUILTIN_CTZLL
✅ Verification successful

Architecture-specific benchmark registration is correctly implemented

The verification confirms that:

  1. compare256_16 is correctly unconditional in both implementation (arch/generic/compare256_p.h) and benchmarks
  2. compare256_32 is properly guarded by HAVE_BUILTIN_CTZ in both implementation and benchmarks
  3. compare256_64 is properly guarded by HAVE_BUILTIN_CTZLL in both implementation and benchmarks

The conditional compilation directives are consistently applied across the codebase, including tests, benchmarks, and architecture-specific implementations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the compare256 functions are properly defined in the header

# Check if the compare256 functions are properly defined
ast-grep --pattern 'uint32_t compare256_$_(const uint8_t* src0, const uint8_t* src1)' arch/generic/compare256_p.h

Length of output: 115


Script:

#!/bin/bash
# Let's try a different approach to verify the architecture-specific conditions

# 1. Search for compare256 function declarations
rg -n "compare256_(16|32|64)" --type cpp -A 2

# 2. Check the conditional compilation directives around these functions
rg -n "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL" --type cpp -B 2 -A 2

Length of output: 8878

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

functable.c (1)

64-66: Verify function implementations exist

The replacement of generic implementations with architecture-specific ones looks good, but we should verify the existence of these functions.

CMakeLists.txt (1)

1077-1079: LGTM: Header additions support architectural changes

The addition of these headers to ZLIB_PRIVATE_HDRS is appropriate:

  1. chunk_permute_table.h likely contains permutation tables for chunk operations
  2. compare256_p.h contains the new aligned comparison functions
  3. generic_functions.h provides architecture-independent implementations

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 1ec6c36 to 376c1ff Compare December 22, 2024 00:25
@KungFuJesus
Collaborator

Oh, if we find any platforms that actually needs the byte-by-byte functions, they should be specifically added with #define OPTIMAL_CMP 8, so that we can optimize for it and also document it for posterity.

I'm somewhat speculating, but I wonder if it'll still be faster for the compiler to generate a piecewise load by finding partial alignment and shifting and OR'ing it in. It'd be a long instruction sequence to be generic enough for any alignment, but it still might be better.

Then again on ppc, at least ppc64, a good portion of that time was external to that comparison loop. I guess SPARC will be the ultimate test.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
compare256_rle.h (1)

Line range hint 12-37: Consider optimizing repeated memcpy operations

The 16-bit comparison function performs multiple memcpy operations in a tight loop. While safe, this could impact performance.

Consider caching the src1_cmp value and updating it only when needed:

 static inline uint32_t compare256_rle_16(const uint8_t *src0, const uint8_t *src1) {
     uint32_t len = 0;
     uint16_t src0_cmp, src1_cmp;
 
     memcpy(&src0_cmp, src0, sizeof(src0_cmp));
+    const uint8_t *next_src1 = src1;
 
     do {
-        memcpy(&src1_cmp, src1, sizeof(src1_cmp));
+        memcpy(&src1_cmp, next_src1, sizeof(src1_cmp));
         if (src0_cmp != src1_cmp)
             return len + (*src0 == *src1);
-        src1 += 2, len += 2;
+        next_src1 += 2, len += 2;
arch/generic/compare256_p.h (2)

47-55: Optimize branch prediction for diff check

The diff check is in a hot loop. Consider helping the compiler with branch prediction.

         diff = sv ^ mv;
-        if (diff) {
+        if (UNLIKELY(diff)) {
#  if BYTE_ORDER == LITTLE_ENDIAN
             uint32_t match_byte = __builtin_ctz(diff) / 8;
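The UNLIKELY annotation in the diff above maps to __builtin_expect on GCC and Clang; zlib-ng carries a macro of this kind (assumed here — check zbuild.h for the exact definition). A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* On GCC/Clang, __builtin_expect steers code layout so the annotated branch
 * stays off the hot path; elsewhere the macro degrades to a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

/* Toy use mirroring the compare loop: diff != 0 is the rare exit path. */
static int took_cold_path(uint32_t diff) {
    if (UNLIKELY(diff))
        return 1;  /* cold: words differed */
    return 0;      /* hot: continue comparing */
}
```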

1-90: Document performance characteristics

The file would benefit from documentation about the performance characteristics of each comparison method.

Add a comment block explaining when to use each comparison width and their relative performance characteristics on different architectures.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ec6c36 and 376c1ff.

📒 Files selected for processing (12)
  • CMakeLists.txt (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (3 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/test_compare256_rle.cc
  • deflate_rle.c
  • functable.c
🧰 Additional context used
📓 Learnings (1)
arch/generic/generic_functions.h (1)
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (10)
arch/generic/compare256_c.c (2)

13-19: LGTM! Well-structured function selection based on CPU capabilities.

The hierarchical fallback from 64-bit to 32-bit to 16-bit operations is well-designed, ensuring optimal performance across different architectures.


22-40: Verify performance impact of function selection.

Given the PR objectives mentioning potential performance concerns on platforms without unaligned access, we should verify the performance impact of these changes.

Run the following script to identify platforms that might be affected:

test/benchmarks/benchmark_compare256_rle.cc (1)

62-67: LGTM! Comprehensive benchmark coverage for all variants.

The benchmarks are properly structured to test all comparison variants based on available CPU capabilities:

  • Base 16-bit version is always available
  • 32-bit version when CTZ is available
  • 64-bit version when CTZLL is available
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

arch/generic/generic_functions.h (1)

36-36: LGTM! Function declaration aligns with implementation.

The addition of compare256_c declaration is consistent with the PR's objective of cleaning up and consolidating comparison functions.

test/test_compare256.cc (2)

15-15: LGTM: Header inclusion aligns with new architecture

The inclusion of compare256_p.h is appropriate as it provides the new comparison function implementations.


67-72: Verify test coverage for all comparison functions

The conditional compilation ensures tests are only built when the required CPU features are available. However, we should verify that all code paths are tested.

✅ Verification successful

Let me verify if the RLE (Run Length Encoding) comparison functions are also tested by searching for their test coverage.


All comparison functions are properly tested

The verification shows comprehensive test coverage:

  • Generic comparison functions (compare256_16/32/64) are tested in test/test_compare256.cc
  • RLE comparison functions (compare256_rle_16/32/64) are tested in test/test_compare256_rle.cc
  • Platform-specific implementations (SSE2, AVX2, NEON, POWER9, RVV) are tested with appropriate CPU feature checks
  • All tests use proper conditional compilation guards matching their requirements
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for comparison functions
# Look for untested comparison functions

# Find all compare256 function declarations
rg -p "static\s+inline\s+uint32_t\s+compare256_\w+" --type cpp

# Find corresponding test cases
rg -p "TEST_COMPARE256\(" --type cpp

Length of output: 1296


Script:

#!/bin/bash
# Search for RLE comparison function tests
rg -p "TEST.*compare256_rle" --type cpp

# Get more context around TEST_COMPARE256 macro to understand test coverage
rg -p "TEST_COMPARE256" -B 5 -A 10 test/test_compare256.cc

Length of output: 1608

compare256_rle.h (1)

Line range hint 69-94: Verify 64-bit comparison edge cases

The 64-bit comparison function handles endianness correctly but should be tested with various alignment scenarios.

test/benchmarks/benchmark_compare256.cc (2)

15-15: LGTM: Header inclusion aligns with compare256 reorganization

The addition of compare256_p.h from the generic architecture directory is appropriate for the benchmark reorganization.


68-73: LGTM: Well-structured benchmark registration with proper guards

The new benchmark registrations are properly guarded with appropriate compiler feature checks:

  • compare256_16 is always available
  • compare256_32 requires HAVE_BUILTIN_CTZ
  • compare256_64 requires HAVE_BUILTIN_CTZLL
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

CMakeLists.txt (1)

1077-1079: LGTM: Properly added new generic architecture headers

The new headers are correctly added to the private headers list:

  • chunk_permute_table.h
  • compare256_p.h
  • generic_functions.h

@Dead2
Member Author

Dead2 commented Dec 22, 2024

In the present develop branch, we compile generic functions that we never use: 8 variants of longest_match and 5 of compare256, not counting the SSE and similar optimized versions.
With this PR we only compile the 2 longest_match and the 2 compare256 variants that we actually take pointers to.

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 376c1ff to 4bdc524 Compare December 22, 2024 00:45
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/generic/compare256_p.h (2)

38-61: Consider optimizing memory access pattern

The current implementation uses separate memcpy operations for sv and mv. While this ensures safe unaligned access, it might be worth profiling the performance impact of:

  1. Using direct dereferencing when the input is known to be aligned
  2. Combining the memory operations

Example optimization when alignment is guaranteed:

-        memcpy(&sv, src0, sizeof(sv));
-        memcpy(&mv, src1, sizeof(mv));
+        sv = *(const uint32_t *)src0;
+        mv = *(const uint32_t *)src1;

66-89: Verify 64-bit to 32-bit truncation safety

The conversion from uint64_t match_byte to uint32_t at line 82 is safe because:

  1. The maximum value of match_byte is 7 (as it's divided by 8)
  2. The maximum value of len is 256

However, it would be good to add a static assertion to document this assumption.

+/* Static assertion to verify that match_byte will never exceed uint32_t range */
+static_assert(256 < UINT32_MAX, "len cannot overflow uint32_t");
 static inline uint32_t compare256_64(const uint8_t *src0, const uint8_t *src1) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 376c1ff and 4bdc524.

📒 Files selected for processing (12)
  • CMakeLists.txt (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (3 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (7)
  • test/test_compare256_rle.cc
  • functable.c
  • deflate_rle.c
  • test/test_compare256.cc
  • arch/generic/compare256_c.c
  • compare256_rle.h
  • arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (4)
arch/generic/compare256_p.h (1)

12-34: Consider bounds checking for memory access

While the function implementation is efficient, it lacks parameter validation.

The unrolled loop implementation looks good

The manual loop unrolling with 4 iterations per cycle is a good optimization technique. The implementation correctly handles both the 2-byte comparisons and single-byte fallback on mismatch.

test/benchmarks/benchmark_compare256.cc (1)

68-73: LGTM! Benchmark coverage aligns with PR objectives

The changes correctly:

  1. Remove unaligned comparison benchmarks
  2. Add benchmarks for new aligned variants (16/32/64-bit)
  3. Use appropriate conditional compilation based on CPU features
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256_rle.cc (1)

62-67: LGTM! Consistent with non-RLE benchmark changes

The benchmark registration follows the same pattern as benchmark_compare256.cc, maintaining consistency across the codebase.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

CMakeLists.txt (1)

1077-1079: New generic architecture headers align with cleanup objectives.

The addition of these header files in the generic architecture directory supports the PR's goal of cleaning up alignment-related code:

  • chunk_permute_table.h: Likely contains permutation tables for chunk operations
  • compare256_p.h: New implementation for 256-byte comparisons
  • generic_functions.h: Common function declarations for generic architecture

Let's verify the integration of these headers:

✅ Verification successful

New generic headers properly integrated and support architecture-specific optimizations

The verification confirms that these new header files are well-integrated into the codebase and serve specific purposes:

  • generic_functions.h: Provides baseline C implementations of architecture-specific functions
  • compare256_p.h: Implements optimized 256-byte memory comparison with different integer widths (16/32/64-bit)
  • chunk_permute_table.h: Contains shared permutation tables used by both SSSE3 and NEON implementations for chunkmemset operations

The headers are properly referenced and utilized:

  • Used by both x86 (SSSE3) and ARM (NEON) architectures
  • Integrated through the main arch_functions.h dispatcher
  • Support the PR's cleanup objectives by providing architecture-agnostic implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the new headers are properly referenced in the codebase
echo "Checking usage of new headers..."
rg -l "chunk_permute_table\.h|compare256_p\.h|generic_functions\.h" --type c

# Verify if these headers replace any old alignment-related code
echo "Checking for old alignment-related code..."
rg -l "UNALIGNED_OK" --type c

Length of output: 493


Script:

#!/bin/bash
# Let's check the content and relationships of these headers

# Check content of generic_functions.h
echo "=== Content of generic_functions.h ==="
cat arch/generic/generic_functions.h

# Check content of compare256_p.h
echo -e "\n=== Content of compare256_p.h ==="
cat arch/generic/compare256_p.h

# Check content of chunk_permute_table.h
echo -e "\n=== Content of chunk_permute_table.h ==="
cat arch/generic/chunk_permute_table.h

# Check how these are included in architecture-specific implementations
echo -e "\n=== Usage in arch_functions.h ==="
cat arch_functions.h

# Check for any related changes in the implementation files
echo -e "\n=== Usage in SSSE3 implementation ==="
cat arch/x86/chunkset_ssse3.c

echo -e "\n=== Usage in NEON implementation ==="
cat arch/arm/chunkset_neon.c

Length of output: 14820

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch 4 times, most recently from b1437c4 to 71a9e5b Compare December 22, 2024 12:58
@Dead2
Member Author

Dead2 commented Dec 22, 2024

Rebased, reordered and redid the commits.

Also added back the 8-bit compare256 variants. Those are now available in the benchmarks and can be enabled by setting the define OPTIMAL_CMP=8 via CFLAGS.
No platform currently uses that by default, but specific platforms can easily be switched to it once a performance benefit has been shown.

@Dead2 Dead2 added the Benchmarks wanted Please benchmark with and without this PR label Dec 22, 2024
@phprus
Contributor

phprus commented Dec 23, 2024

Workflow CMake / EL9 Clang S390X DFLTCC MSAN is failing.

https://github.com/zlib-ng/zlib-ng/actions/runs/12470358964/job/34805368017?pr=1830

@Dead2
Member Author

Dead2 commented Dec 23, 2024

Workflow CMake / EL9 Clang S390X DFLTCC MSAN is failing.

Yes, it also fails before this PR; we haven't had that test working for months because GitHub does not support s390x runners themselves, so we have to depend on a third-party runner port. Not sure what broke there yet.
I have pinged @iii-i, but I would guess he is off on a Christmas vacation or something.

@KungFuJesus
Collaborator

WITH_NATIVE fails on PPC64

I think this is fixed now.

Confirmed. Works now.

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 0f23df0 to cfa5e9d Compare December 24, 2024 12:07
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
arch/generic/compare256_c.c (1)

24-31: Consider adding documentation for template parameters.

While the template usage is correct, adding comments to explain the parameters and differences between normal and slow variants would improve maintainability.

test/benchmarks/benchmark_compare256_rle.cc (1)

62-68: LGTM! Consider improving error messages.

The benchmark registration for different bit-width variants is well-structured. However, the error message could be more specific about why the CPU doesn't support a particular variant.

Update the error message to be more descriptive:

-            state.SkipWithError("CPU does not support " #name);
+            state.SkipWithError("CPU does not support " #name "-bit comparisons (missing required CPU feature)");
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

zbuild.h (1)

271-271: Consider documenting the OPTIMAL_CMP values.

Adding a comment explaining the performance implications of different OPTIMAL_CMP values would help future maintainers.

+/* OPTIMAL_CMP values determine the comparison width:
+ * 64: Best for 64-bit architectures with unaligned access
+ * 32: Best for 32-bit architectures with unaligned access
+ * 16: Safe default for unknown architectures
+ * 8:  Safe fallback for architectures without unaligned access
+ */
 #define OPTIMAL_CMP 16
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f23df0 and cfa5e9d.

📒 Files selected for processing (14)
  • CMakeLists.txt (1 hunks)
  • arch/generic/Makefile.in (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (4 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (0 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
  • zbuild.h (1 hunks)
💤 Files with no reviewable changes (1)
  • match_tpl.h
🚧 Files skipped from review as they are similar to previous changes (8)
  • arch/generic/Makefile.in
  • CMakeLists.txt
  • test/test_compare256_rle.cc
  • functable.c
  • test/test_compare256.cc
  • arch/generic/generic_functions.h
  • deflate_rle.c
  • compare256_rle.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (8)
arch/generic/compare256_c.c (2)

20-22: Clean implementation using macro-based dispatch.

The simplified implementation improves maintainability while allowing platform-specific optimizations through the COMPARE256 macro.


7-18: LGTM! Verify platform-specific performance impact.

The macro-based selection of comparison functions with priority for byte version when OPTIMAL_CMP=8 aligns well with the previous discussions about performance on platforms without unaligned reads.

Run this script to analyze platform-specific configurations:

✅ Verification successful

Verified: Platform-specific optimizations are well-handled

The macro-based selection in compare256_c.c is well-designed with appropriate fallbacks:

  • For platforms with unaligned access and SIMD:
    • x86 has optimized AVX2/SSE2 implementations
    • ARM has NEON implementation with CTZLL support
  • For platforms without SIMD but with unaligned access:
    • 64-bit comparison with CTZLL
    • 32-bit comparison with CTZ
    • 16-bit comparison fallback
  • For platforms with alignment restrictions:
    • 8-bit comparison (OPTIMAL_CMP=8) ensures safe byte-by-byte access

The implementation correctly prioritizes the byte version (OPTIMAL_CMP=8) for platforms with alignment restrictions while allowing faster implementations on platforms with better capabilities.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check which platforms might benefit from OPTIMAL_CMP=8

# Search for platform-specific configurations and CPU feature checks
rg -l "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|UNALIGNED_OK" arch/

Length of output: 251


Script:

#!/bin/bash
# Check the content of platform-specific files to understand the configuration logic

# Check generic implementation header
echo "=== Generic implementation header (compare256_p.h) ==="
cat arch/generic/compare256_p.h

# Check ARM-specific configurations
echo -e "\n=== ARM configurations ==="
rg "UNALIGNED_OK|HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|OPTIMAL_CMP" arch/arm/arm_functions.h

# Check x86-specific configurations
echo -e "\n=== x86 configurations ==="
rg "UNALIGNED_OK|HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL|OPTIMAL_CMP" arch/x86/x86_functions.h

Length of output: 3790


Script:

#!/bin/bash
# Check for unaligned memory access handling and platform-specific optimizations

# Check memory access functions
echo "=== Memory access functions ==="
rg "zng_memread|zng_memcmp" arch/generic/zmemory.h || rg "zng_memread|zng_memcmp" zmemory.h

# Check for NEON implementation details
echo -e "\n=== ARM NEON implementation ==="
head -n 30 arch/arm/compare256_neon.c

# Check for SSE2/AVX2 implementation details
echo -e "\n=== x86 SSE2/AVX2 implementations ==="
head -n 30 arch/x86/compare256_sse2.c
head -n 30 arch/x86/compare256_avx2.c

Length of output: 4056

test/benchmarks/benchmark_compare256.cc (1)

68-74: Maintain consistency with RLE benchmark error messages.

The benchmark structure is good, but error messages should be consistent with any improvements made to the RLE benchmarks.

Update the error message to be more descriptive, maintaining consistency with RLE benchmarks:

-            state.SkipWithError("CPU does not support " #name);
+            state.SkipWithError("CPU does not support " #name "-bit comparisons (missing required CPU feature)");
🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

arch/generic/compare256_p.h (4)

11-42: LGTM! Efficient byte-by-byte comparison implementation.

The unrolled loop design optimizes performance by reducing branch predictions.


45-67: LGTM! Well-implemented 16-bit comparison with proper partial match handling.

The implementation correctly handles both full 16-bit matches and partial byte matches.


97-123: LGTM! 64-bit comparison implementation with proper type handling.

The implementation correctly handles 64-bit comparisons with appropriate type casting for the return value.


69-95: Verify compiler support for CTZ/CLZ built-ins.

The implementation efficiently uses built-in functions for trailing/leading zero counts. Let's verify compiler support across different platforms.

✅ Verification successful

Compiler support for CTZ/CLZ built-ins is properly handled

The codebase has comprehensive support for compiler built-ins with proper fallbacks:

  • CMake and configure scripts check for __builtin_ctz and __builtin_ctzll support
  • Fallback implementations are provided in fallback_builtins.h for platforms without native support
  • The code conditionally enables optimized implementations based on architecture and built-in support:
    • Generic 32-bit (compare256_32) when HAVE_BUILTIN_CTZ is available
    • Generic 64-bit (compare256_64) when HAVE_BUILTIN_CTZLL is available
    • SSE2/AVX2 optimizations for x86 with HAVE_BUILTIN_CTZ
    • NEON optimizations for ARM with HAVE_BUILTIN_CTZLL
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check compiler support for CTZ/CLZ built-ins
# Test: Search for compiler checks and definitions
rg -A 5 "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CLZ"

Length of output: 11476

zbuild.h (1)

246-271: Well-structured architecture-specific optimizations.

The OPTIMAL_CMP definitions are properly organized with:

  • 64-bit optimization for x86_64 and ARM64 with unaligned access
  • 32-bit optimization for i386 and ARM with unaligned access
  • Safe fallback to 8-bit for architectures without unaligned access
  • Default to 16-bit when architecture is unknown

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from cfa5e9d to 81274e4 Compare December 24, 2024 12:29
@Dead2
Member Author

Dead2 commented Dec 24, 2024

Rebased and added a couple of comments to improve the documentation.

@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from 81274e4 to ef3aea5 Compare December 24, 2024 18:40
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
zmemory.h (1)

84-84: Maintain consistency in zng_memcmp_4.

Similar to zng_memcmp_2, confirm callers expect 0/1 rather than a typical three-way comparison result from standard memcmp.

test/benchmarks/benchmark_compare256_rle.cc (1)

67-68: Benchmark for 64-bit comparisons.

Same as above: the code is conditionally compiled under HAVE_BUILTIN_CTZLL. Ensure environment checks are consistent, and confirm that relevant test coverage is in place for the 64-bit comparisons.

test/benchmarks/benchmark_compare256.cc (1)

76-76: Optional: Confirm clarity of separation.

Having multiple #endif blocks at close can be less readable if conditions expand in future. Consider grouping or adding clarifying comments if this file grows.

zbuild.h (1)

273-276: Minor: Align indentation in powerpc sections.

The powerpc sections have inconsistent indentation compared to other architecture-specific sections.

-#      define OPTIMAL_CMP 64
+#    define OPTIMAL_CMP 64
 #  elif defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
-#      define OPTIMAL_CMP 32
+#    define OPTIMAL_CMP 32
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81274e4 and ef3aea5.

📒 Files selected for processing (15)
  • CMakeLists.txt (1 hunks)
  • arch/generic/Makefile.in (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (4 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
  • zbuild.h (1 hunks)
  • zmemory.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • CMakeLists.txt
  • test/test_compare256_rle.cc
  • arch/generic/Makefile.in
  • test/test_compare256.cc
  • compare256_rle.h
  • match_tpl.h
  • arch/generic/compare256_p.h
  • functable.c
  • arch/generic/compare256_c.c
  • arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (11)
zmemory.h (2)

76-76: Clarify the boolean return logic in zng_memcmp_2.

This function returns 1 if there is a difference, 0 otherwise, which is an unconventional usage of “memcmp” semantics. Ensure all call sites correctly interpret this boolean-like return value.


92-92: Confirm usage alignment in zng_memcmp_8.

Returning 1 or 0 instead of a signed integer might be intentional, but verify that it aligns with downstream usage and avoids confusion.

test/benchmarks/benchmark_compare256_rle.cc (2)

62-63: Confirm coverage and naming consistency.

Adding benchmarks for compare256_rle_8 and compare256_rle_16 looks fine. Ensure that all relevant compare sizes are tested if needed, and confirm the naming scheme remains consistent across the codebase.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)


64-66: Conditionally compiled benchmark for 32-bit comparisons.

Having this block guarded by HAVE_BUILTIN_CTZ suggests the 32-bit RLE comparisons might be unavailable on some platforms. Verify that fallback or alternative benchmarks exist if HAVE_BUILTIN_CTZ is not defined.

test/benchmarks/benchmark_compare256.cc (2)

15-15: Include path verification.

#include "arch/generic/compare256_p.h" is newly added. Confirm that the path aligns with the repository’s structure and that no duplication occurs with other headers.


68-74: Expansion of aligned comparison benchmarks.

Benchmarks for the 8-, 16-, 32-, and 64-bit variants are conditionally compiled. This is consistent with the removal of the unaligned comparisons. Just ensure runtimes on platforms where some macros (e.g. HAVE_BUILTIN_CTZ, HAVE_BUILTIN_CTZLL) aren't defined still receive the intended fallback coverage.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

deflate_rle.c (2)

11-11: New include for compare256_rle.h.

Ensure that only what is needed from this header is included and that the dependency is minimal. If usage expands, a forward declaration might be insufficient. Current approach looks fine.


13-20: Macro-based selection approach for compare256_rle.

Selecting a compare function based on OPTIMAL_CMP or builtins is a neat optimization. However, confirm that this logic does not break on unusual compiler or platform configurations that might define multiple or no macros. Additional fallback logic or explicit checks can help maintain portability.

If you need a script to verify coverage for each macro condition, let me know.

zbuild.h (3)

246-254: LGTM! Well-documented optimization strategy.

The comment block provides clear and comprehensive documentation about the OPTIMAL_CMP values and their intended use cases, with an important clarification about unaligned access safety.


255-278: LGTM! Comprehensive architecture-specific optimizations.

The implementation provides well-structured architecture detection and appropriate OPTIMAL_CMP values based on platform capabilities and word sizes.


279-281: Verify performance impact of default OPTIMAL_CMP change.

The default OPTIMAL_CMP value has been changed from 8 to 16. While this aligns with the documentation, we should verify that this change doesn't negatively impact performance on platforms without unaligned access support.

✅ Verification successful

Let me gather more information about the performance implications and platform-specific implementations.


Let me check one more thing to understand the platform-specific implementations and their relationship with OPTIMAL_CMP.


Default OPTIMAL_CMP value of 16 is safe and well-considered

The change from 8 to 16 as the default OPTIMAL_CMP value is actually safe and well-implemented. Here's why:

  • The codebase has a sophisticated selection of OPTIMAL_CMP values based on platform capabilities:
    • 64-bit comparison for platforms with CTZLL builtin
    • 32-bit comparison for platforms with CTZ builtin
    • 16-bit comparison as the new default fallback
    • 8-bit comparison explicitly set for platforms that need it

The default value of 16 is only used when no platform-specific value is defined, and the implementation in compare256_p.h shows that the 16-bit comparison is implemented safely using zng_memcmp_2 which handles unaligned access properly. This is part of a larger cleanup effort (PR #1830) that improves the handling of unaligned access across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any existing performance tests or benchmarks related to compare functions
rg -l "compare.*benchmark|benchmark.*compare" --type c --type cpp

# Look for any platform-specific implementations that might be affected
rg -l "OPTIMAL_CMP.*8|compare.*8" --type c --type h

# Search for any existing issues or discussions about OPTIMAL_CMP
gh issue list --search "OPTIMAL_CMP"
gh pr list --search "OPTIMAL_CMP"

Length of output: 853


Script:

#!/bin/bash
# Check the content of benchmark files to understand performance implications
rg -A 5 "OPTIMAL_CMP" test/benchmarks/benchmark_compare256*.cc

# Look at platform-specific implementations to understand unaligned access handling
cat arch/generic/generic_functions.h
cat arch/generic/compare256_p.h

# Check the PR that's related to UNALIGNED_OK checks
gh pr view 1830 --json body

Length of output: 7550


Script:

#!/bin/bash
# Check the generic compare256 implementation and its usage of OPTIMAL_CMP
cat arch/generic/compare256_c.c

# Look for any platform-specific OPTIMAL_CMP definitions
rg "define.*OPTIMAL_CMP" --type c --type h

Length of output: 1379

@Dead2
Member Author

Dead2 commented Dec 24, 2024

After talking it over with @KungFuJesus, we agree that checking for OPTIMAL_CMP is probably not necessary or useful in zmemory.h. If it turns out that it is, it can be added back in a fixed form. I was considering changing them all to >= 16, but we'll try it like this first and see how the tests and benchmarks turn out.

@ccawley2011 Please have a look at the updated PR now that your PR #1548 has been merged ahead of it. It seems to work fine with -mstrict-align on my 64-bit rpi5 at least, though that platform supports HAVE_MAY_ALIAS, so it isn't actually affected by the change.

- Remove obsolete checks
- Fix checks that are inconsistent
- Stop compiling compare256/longest_match variants that never get called
- Improve how the generic compare256 functions are handled.
- Allow overriding OPTIMAL_CMP

This simplifies the code and avoids having a lot of code in the compiled library that can never be executed.
@Dead2 Dead2 force-pushed the remove_unaligned_pt2 branch from ef3aea5 to 577c41e Compare December 26, 2024 18:54
@Dead2
Member Author

Dead2 commented Dec 26, 2024

Fixed indentation.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
test/benchmarks/benchmark_compare256_rle.cc (1)

62-68: Consider platform-specific benchmark configurations

Based on the PR discussion:

  1. Consider adding benchmark configurations for platforms without unaligned access to measure potential performance impact
  2. For platforms where 8-bit comparisons might be optimal, document how to run these benchmarks with OPTIMAL_CMP=8
  3. Consider collecting and documenting benchmark results across different architectures (x86_64, Aarch64, PPC64) to guide platform-specific optimizations

Would you like me to help create a benchmark configuration matrix or documentation for platform-specific testing?

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

arch/generic/compare256_p.h (1)

11-42: Consider loop unrolling optimization.

The byte-by-byte comparison is implemented with repeated code blocks. While this works, consider using a more maintainable loop unrolling macro or compiler directives.

-    do {
-        if (*src0 != *src1)
-            return len;
-        src0 += 1, src1 += 1, len += 1;
-        if (*src0 != *src1)
-            return len;
-        src0 += 1, src1 += 1, len += 1;
-        // ... repeated 8 times
-    } while (len < 256);
+    #define COMPARE_BYTE_BLOCK(n) \
+        if (*src0 != *src1) return len; \
+        src0 += 1, src1 += 1, len += 1;
+
+    do {
+        COMPARE_BYTE_BLOCK(0)
+        COMPARE_BYTE_BLOCK(1)
+        COMPARE_BYTE_BLOCK(2)
+        COMPARE_BYTE_BLOCK(3)
+        COMPARE_BYTE_BLOCK(4)
+        COMPARE_BYTE_BLOCK(5)
+        COMPARE_BYTE_BLOCK(6)
+        COMPARE_BYTE_BLOCK(7)
+    } while (len < 256);
+
+    #undef COMPARE_BYTE_BLOCK
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef3aea5 and 577c41e.

📒 Files selected for processing (15)
  • CMakeLists.txt (1 hunks)
  • arch/generic/Makefile.in (1 hunks)
  • arch/generic/compare256_c.c (1 hunks)
  • arch/generic/compare256_p.h (1 hunks)
  • arch/generic/generic_functions.h (1 hunks)
  • compare256_rle.h (4 hunks)
  • deflate_rle.c (1 hunks)
  • functable.c (1 hunks)
  • match_tpl.h (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (2 hunks)
  • test/test_compare256_rle.cc (1 hunks)
  • zbuild.h (1 hunks)
  • zmemory.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • CMakeLists.txt
  • zmemory.h
  • test/test_compare256_rle.cc
  • arch/generic/Makefile.in
  • match_tpl.h
  • functable.c
  • test/test_compare256.cc
  • deflate_rle.c
  • arch/generic/compare256_c.c
  • compare256_rle.h
  • arch/generic/generic_functions.h
🧰 Additional context used
🪛 cppcheck (2.10-2)
test/benchmarks/benchmark_compare256_rle.cc

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

test/benchmarks/benchmark_compare256.cc

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

🔇 Additional comments (6)
test/benchmarks/benchmark_compare256_rle.cc (2)

62-63: LGTM: New benchmarks for 8-bit and 16-bit comparisons

The addition of benchmarks for 8-bit and 16-bit comparisons aligns with the PR objectives of consolidating comparison functions. These variants are unconditionally available, which matches the architectural independence of byte-level comparisons.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 62-62: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)


64-68: Verify CPU feature detection for conditional benchmarks

The conditional compilation for 32-bit and 64-bit comparisons is correctly gated behind CPU feature checks (HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL).

Let's verify the CPU feature detection macros are consistently defined across the codebase:

✅ Verification successful

CPU feature detection macros are consistently defined and used across the codebase

The verification shows that the CPU feature detection macros HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL are consistently defined and used across the codebase:

  • The macros are properly defined in fallback_builtins.h
  • They are consistently used to guard CPU-specific code in:
    • Core comparison implementations (compare256_rle.h, deflate_rle.c)
    • Architecture-specific code (arch/x86/, arch/arm/, arch/generic/)
    • Function tables (functable.c)
    • Test and benchmark files

The conditional compilation in the benchmark file aligns with the pattern used throughout the codebase, ensuring that CPU-specific optimizations are only enabled when the corresponding CPU features are available.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify CPU feature detection consistency
# Check if HAVE_BUILTIN_CTZ and HAVE_BUILTIN_CTZLL are consistently defined

# Search for CTZ-related macro definitions and usage
echo "=== Searching for CTZ macro definitions ==="
rg -n "HAVE_BUILTIN_CTZ|HAVE_BUILTIN_CTZLL" --type c --type cpp

# Search for other comparison function implementations that might need similar checks
echo -e "\n=== Searching for related comparison functions ==="
ast-grep --pattern 'compare256_rle_$_'

Length of output: 2620

test/benchmarks/benchmark_compare256.cc (1)

68-74: LGTM! Well-structured benchmark organization.

The new benchmarks are properly organized with appropriate CPU feature guards. The progression from 8-bit to 64-bit comparisons allows for comprehensive performance testing across different architectures.

🧰 Tools
🪛 cppcheck (2.10-2)

[error] 68-68: There is an unknown macro here somewhere. Configuration is required. If BENCHMARK_DEFINE_F is a macro then please configure it.

(unknownMacro)

arch/generic/compare256_p.h (1)

71-94: Verify endianness handling in builtin_ctz usage.

The endianness handling for finding the first differing byte is crucial for correctness. Please verify that the calculations __builtin_ctz(diff) / 8 and __builtin_clz(diff) / 8 correctly identify the byte position on all supported platforms.

✅ Verification successful

Let me search more broadly for endianness-related code and tests.


Let me check the implementation details and tests more thoroughly.


Let me check one more thing to verify the memory reading functions that are used alongside the endianness handling.


Based on the analysis of the codebase, I can now provide a final response:

The endianness handling in compare256_32 is correctly implemented

The implementation correctly handles both little-endian and big-endian architectures:

  • For little-endian systems, __builtin_ctz(diff) / 8 is used to find the first differing byte
  • For big-endian systems, __builtin_clz(diff) / 8 is used instead
  • The endianness detection is robust across platforms through zendian.h, which handles various compiler-specific, OS-specific, and architecture-specific cases
  • The same pattern is consistently used across the codebase in both 32-bit and 64-bit comparison functions

The implementation is consistent with memory byte ordering:

  • In little-endian systems, least significant bits correspond to the first byte in memory, so counting trailing zeros is correct
  • In big-endian systems, most significant bits correspond to the first byte in memory, so counting leading zeros is correct
  • Division by 8 correctly converts from bit position to byte position in both cases
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any endianness-related test cases
rg -l "BYTE_ORDER|__BYTE_ORDER" test/

Length of output: 39


Script:

#!/bin/bash
# Search for endianness-related code and tests in the entire codebase
rg -l "BYTE_ORDER|__BYTE_ORDER|endian"

# Look for test files that might verify memory comparison functions
fd -e c -e h test

Length of output: 522


Script:

#!/bin/bash
# Check the endianness handling in compare functions across architectures
rg -A 5 -B 5 "builtin_[cl]tz.*diff.*8" 

# Check test files for memory comparison tests
cat test/test_shared.h test/test_shared_ng.h

# Check endianness definitions and handling
cat zendian.h

Length of output: 5118


Script:

#!/bin/bash
# Check the implementation of zng_memread functions
rg -A 5 "zng_memread_[48]" zbuild.h

Length of output: 37

zbuild.h (2)

246-254: LGTM! Well-documented OPTIMAL_CMP values.

The comments clearly explain the rationale behind each OPTIMAL_CMP value and their relationship to unaligned access capabilities.


255-280: Verify architecture detection completeness.

The architecture detection appears comprehensive but should be verified against all supported platforms.

✅ Verification successful

Let me gather more information about the architecture-specific code.


Architecture detection is comprehensive and well-implemented

The architecture detection in zbuild.h is thorough and covers all major architectures and their variants:

  1. x86 family (x86_64, i386/i486/i586/i686)
  2. ARM family (AArch64/ARM64, ARM32) with unaligned access detection
  3. PowerPC family (PowerPC64, PowerPC32)
  4. Default fallback (OPTIMAL_CMP 16)

The implementation is verified by:

  • Architecture-specific code in arch/ directory matches the detected architectures
  • Comprehensive architecture detection in cmake/detect-arch.c aligns with the OPTIMAL_CMP definitions
  • Platform-specific features are properly handled (e.g., ARM unaligned access detection)
  • Proper fallback value is defined when architecture is not detected
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for architecture-specific code that might need OPTIMAL_CMP
rg -l "__x86_64__|__i386__|__aarch64__|__arm__|__powerpc" arch/

Length of output: 187


Script:

#!/bin/bash
# Check content of architecture-specific files for any additional architecture macros
echo "=== Content of x86_functions.h ==="
cat arch/x86/x86_functions.h

echo -e "\n=== Content of arm_features.c ==="
cat arch/arm/arm_features.c

# Search for any other architecture-related macros in the codebase
echo -e "\n=== Other architecture macros in codebase ==="
rg -n "(defined\(__[a-zA-Z0-9_]+\)|defined\(_M_[a-zA-Z0-9_]+\))" --type c --type h

Length of output: 21395

@Dead2 Dead2 merged commit bf05e88 into develop Dec 26, 2024
@ccawley2011
Contributor

Is there an easy way to get benchmarks for the LONGEST_MATCH functions?

@Dead2
Member Author

Dead2 commented Dec 26, 2024

Is there an easy way to get benchmarks for the LONGEST_MATCH functions?

I think the closest you can get currently is the gbench compress* benchmarks.
test/benchmarks/benchmark_zlib --benchmark_filter=compress*


Labels

  • Benchmarks wanted: Please benchmark with and without this PR
  • cleanup: Improving maintainability or removing code.
