Skip to content

Add AVX512 version of compare256#1901

Merged
Dead2 merged 1 commit intodevelopfrom
compare256-avx512
Apr 14, 2025
Merged

Add AVX512 version of compare256#1901
Dead2 merged 1 commit intodevelopfrom
compare256-avx512

Conversation

@Dead2
Copy link
Copy Markdown
Member

@Dead2 Dead2 commented Apr 10, 2025

Implements AVX512 variant of compare256.

Unlike AVX2, AVX512 does not need to make the mask separately, and instead can generate the mask directly from the comparison, saving us a tiny bit of time.

Using 64-byte compares from the start turned out to be a little slower on normal files because most matches are short or even not matches at all. Therefore I implemented this using two rounds of 32-byte compares at the start, before going up to 64byte compares after the first 64bytes have matched.

Before:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.078/0.078/0.079/0.000        0.030/0.031/0.031/0.000        8,526,745
 2     43.871%      0.128/0.129/0.129/0.000        0.030/0.030/0.030/0.000        6,903,702
 3     42.390%      0.158/0.158/0.158/0.000        0.029/0.029/0.029/0.000        6,670,664
 4     41.644%      0.181/0.181/0.182/0.000        0.028/0.028/0.028/0.000        6,553,205
 5     41.215%      0.196/0.197/0.197/0.000        0.028/0.028/0.028/0.000        6,485,659
 6     41.032%      0.243/0.244/0.244/0.000        0.028/0.028/0.028/0.000        6,456,912
 7     40.778%      0.337/0.338/0.338/0.000        0.028/0.028/0.028/0.000        6,416,941
 8     40.704%      0.445/0.446/0.446/0.000        0.028/0.028/0.028/0.000        6,405,249
 9     40.409%      0.528/0.528/0.529/0.000        0.027/0.027/0.027/0.000        6,358,951

 avg1  42.914%                        0.255                          0.029
 tot                                 68.956                          7.705       60,778,028

   text    data     bss     dec     hex filename
 170024    1344       8  171376   29d70 libz-ng.so.2

After:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.079/0.079/0.079/0.000        0.031/0.031/0.031/0.000        8,526,745
 2     43.871%      0.127/0.128/0.128/0.000        0.030/0.030/0.030/0.000        6,903,702
 3     42.390%      0.156/0.156/0.157/0.000        0.029/0.029/0.029/0.000        6,670,664
 4     41.644%      0.178/0.179/0.180/0.000        0.028/0.028/0.028/0.000        6,553,205
 5     41.215%      0.193/0.194/0.194/0.000        0.028/0.028/0.028/0.000        6,485,659
 6     41.032%      0.240/0.240/0.240/0.000        0.028/0.028/0.028/0.000        6,456,912
 7     40.778%      0.335/0.335/0.335/0.000        0.028/0.028/0.028/0.000        6,416,941
 8     40.704%      0.442/0.443/0.443/0.000        0.028/0.028/0.028/0.000        6,405,249
 9     40.409%      0.534/0.535/0.535/0.000        0.027/0.027/0.027/0.000        6,358,951

 avg1  42.914%                        0.254                          0.029
 tot                                 68.650                          7.705       60,778,028

   text    data     bss     dec     hex filename
 172720    1344       8  174072   2a7f8 libz-ng.so.2

Levels 1-8 are a little faster, with level 6 being 1.24% faster.
Level 9 takes a small penalty due to trying really hard to find even small matches, increasing the likelihood of the matches not being matches at all.

Had to add a few more benchmark steps for this to be useful.

compare256/avx2/1           1.11 ns         1.11 ns   2514328460
compare256/avx2/8           1.18 ns         1.18 ns   2514327444
compare256/avx2/32          1.11 ns         1.11 ns   2514328166
compare256/avx2/96          1.95 ns         1.95 ns   1436756998
compare256/avx2/160         2.51 ns         2.51 ns   1117478410
compare256/avx2/256         3.62 ns         3.62 ns    773638652

compare256/avx512/1         1.11 ns         1.11 ns   2514328713
compare256/avx512/8         1.11 ns         1.11 ns   2514328853
compare256/avx512/32        1.11 ns         1.11 ns   2514328853
compare256/avx512/96        1.95 ns         1.95 ns   1436755135
compare256/avx512/160       2.04 ns         2.04 ns   1257161900
compare256/avx512/256       2.23 ns         2.23 ns   1257162425

Up to 96 byte matches they are perform practically the same.
On 160-byte matches, these results indicate it is ~19% faster.
On 256-byte matches, these results indicate it is ~39% faster.

Summary by CodeRabbit

  • New Features

    • Integrated AVX512-optimized routines to improve data comparison and matching performance on supported systems.
  • Tests

    • Enhanced benchmark configurations with precise input sizes for better performance evaluation.
    • Added dedicated test cases to validate the new optimized routines for the AVX512 architecture.

@Dead2 Dead2 added optimization Architecture Architecture specific labels Apr 10, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 10, 2025

Walkthrough

This pull request adds an AVX512-optimized implementation for the compare256 function along with associated matching functions. It updates the build configuration by modifying CMakeLists.txt, Makefile.in, and the configure script to include the new source file and targets. Additionally, new function declarations, function pointer assignments, tests, and benchmarks have been added to integrate and validate the AVX512 functionality.

Changes

File(s) Change Summary
CMakeLists.txt, arch/x86/Makefile.in, configure Updated build scripts to support AVX512 by adding the compare256_avx512.c source file, new build targets (compare256_avx512.o, compare256_avx512.lo), and appending compare256_avx512.lo to both static and shared object lists. Also, a new feature flag (AVX512_COMPARE256) is introduced.
arch/x86/compare256_avx512.c, arch/x86/x86_functions.h, functable.c Implemented AVX512 versions of compare256 and longest match functions. Added corresponding declarations, macros, and updated function pointer assignments to call the new AVX512 implementations.
test/benchmarks/benchmark_compare256.cc, test/test_compare256.cc Modified benchmarking to use specific argument values and added a new benchmark for compare256_avx512. Also, introduced a conditional test case to validate the AVX512 implementation when the CPU feature is available.

Sequence Diagram(s)

sequenceDiagram
    participant T as Test/Benchmark
    participant F as Functable
    participant C as compare256_avx512
    participant I as AVX512 Intrinsics

    T->>F: Call compare256 via function pointer
    F->>C: Forward call to compare256_avx512
    C->>I: Load and compare data using AVX512 intrinsics
    I-->>C: Return comparison result
    C-->>F: Pass result back
    F-->>T: Return final compare result
Loading

Tip

⚡💬 Agentic Chat (Pro Plan, General Availability)
  • We're introducing multi-step agentic chat in review comments and issue comments, within and outside of PR's. This feature enhances review and issue discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments and add commits to existing pull requests.
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
test/benchmarks/benchmark_compare256.cc (1)

62-62: Improved benchmark precision with specific test sizes

Changed from a range-based benchmark to specific argument sizes (1, 8, 32, 96, 160, 256), providing more targeted performance data at key boundary conditions.

Consider adding a comment explaining why these specific values were chosen, especially 96 and 160, to help future maintainers understand the rationale.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4a31e0 and d943ac8.

📒 Files selected for processing (8)
  • CMakeLists.txt (1 hunks)
  • arch/x86/Makefile.in (2 hunks)
  • arch/x86/compare256_avx512.c (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/test_compare256.cc (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
test/test_compare256.cc (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
functable.c (1)
arch/x86/x86_functions.h (3)
  • compare256_avx512 (64-64)
  • longest_match_avx512 (65-65)
  • longest_match_slow_avx512 (66-66)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
🔇 Additional comments (12)
CMakeLists.txt (1)

1044-1045: Looks good: Proper integration of AVX512 compare256 implementation.

The addition of compare256_avx512.c to the AVX512 source list and registering it as a feature with appropriate flags follows the established pattern in the codebase for other AVX512 optimized functions.

test/test_compare256.cc (1)

82-84: Correct test case added for AVX512 implementation.

The test follows the same pattern used for other architectures and properly checks for both AVX512 support and the required HAVE_BUILTIN_CTZLL intrinsic. The CPU feature check is also consistent with other AVX512 implementations.

arch/x86/Makefile.in (2)

39-39: Correctly added object file to the build target list.

The object files for the AVX512 compare256 implementation have been properly added to the all target.


88-92: Appropriate build rules for AVX512 comparison function.

The build rules correctly use the AVX512FLAG to enable AVX512 instructions and follow the same pattern used for other AVX512-optimized functions.

functable.c (1)

142-146: Correctly implemented AVX512 function registration.

The code properly registers the AVX512 implementations of compare256, longest_match, and longest_match_slow when both AVX512 and the required HAVE_BUILTIN_CTZLL intrinsic are available. This matches the PR description stating that the implementation "generates the mask directly from the comparison" for performance improvement.

Note that all three functions are updated together as they're likely interdependent, with compare256 being the core function used by the matching functions.

configure (1)

1697-1698: Build system updated to include new AVX512 compare256 implementation

The configure script now properly includes the new compare256_avx512.lo file in both static and shared object lists when AVX512 support is detected.

arch/x86/x86_functions.h (2)

63-67: New AVX512-optimized compare256 function declarations added

The declarations are properly guarded by the HAVE_BUILTIN_CTZLL check since the implementation relies on the __builtin_ctzll intrinsic.


177-184: Functional wiring for AVX512 compare256 features

Appropriate native function pointer assignments for AVX512 implementations, ensuring they're used when the CPU supports AVX512 features and the compiler has the necessary intrinsics.

test/benchmarks/benchmark_compare256.cc (1)

83-85: Added AVX512 compare256 benchmark

Properly conditionally compiled and checks for required CPU features before running.

arch/x86/compare256_avx512.c (3)

19-77: Efficient implementation of AVX512 comparison function

The implementation cleverly starts with 32-byte AVX2 comparisons before moving to 64-byte AVX512 comparisons. This hybrid approach minimizes overhead for common short matches while leveraging AVX512 for longer matches.

The mask generation directly from the comparison (using _mm256_cmpeq_epu8_mask and _mm512_cmpeq_epu8_mask) is a key optimization that differentiates this implementation from the AVX2 version.


79-81: Simple wrapper function with appropriate internal linkage

The Z_INTERNAL macro ensures proper visibility control for this function.


83-93: Template-based implementation of longest_match functions

Using match_tpl.h with appropriate macro definitions is a clean approach to generate the longest_match implementations based on the compare256 function.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 52 lines in your changes missing coverage. Please review.

Project coverage is 83.11%. Comparing base (eb76eca) to head (e098ae8).
Report is 5 commits behind head on develop.

Files with missing lines Patch % Lines
arch/x86/compare256_avx512.c 0.00% 47 Missing ⚠️
functable.c 0.00% 3 Missing ⚠️
test/benchmarks/benchmark_compare256.cc 0.00% 0 Missing and 1 partial ⚠️
test/test_compare256.cc 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #1901       +/-   ##
============================================
+ Coverage    40.68%   83.11%   +42.43%     
============================================
  Files           72      145       +73     
  Lines         7635    13000     +5365     
  Branches      1275     2875     +1600     
============================================
+ Hits          3106    10805     +7699     
+ Misses        4295     1247     -3048     
- Partials       234      948      +714     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nmoinvaz
Copy link
Copy Markdown
Member

Great idea to not use a loop like the other implementations, because the early checks always hurt performance.

@nmoinvaz
Copy link
Copy Markdown
Member

nmoinvaz commented Apr 11, 2025

Grok3 says qemu AVX512 should be supported with -cpu Skylake-Server-v1, but not sure what package on Ubuntu we would use.

Copy link
Copy Markdown
Collaborator

@mtl1979 mtl1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typo in configure script

@Dead2 Dead2 force-pushed the compare256-avx512 branch from d943ac8 to 1a45a14 Compare April 11, 2025 09:44
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/x86/compare256_avx512.c (1)

2-2: Consider updating the copyright year consistently.
Existing reviews also mention adjusting the year, so this may remain a to-do item.

🧹 Nitpick comments (2)
test/benchmarks/benchmark_compare256.cc (1)

62-62: Switching from Range to discrete Arg values is a valid approach.
This ensures more precise measurements at specific sizes. If you still want broader coverage of sizes without manually enumerating each possible workload size, consider using additional Arg values or a Range alongside these specific Args.

arch/x86/compare256_avx512.c (1)

49-75: 64-byte comparisons appear correct, but consider using prefetching if repeated queries are common.
The logic and usage of _mm512_cmpeq_epu8_mask followed by __builtin_ctzll(~mask_x) is correct for detecting the first mismatch byte. If performance is critical for large data sets repeatedly invoked, adding optional prefetch hints could be beneficial, though this is often workload-dependent.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d943ac8 and 1a45a14.

📒 Files selected for processing (8)
  • CMakeLists.txt (1 hunks)
  • arch/x86/Makefile.in (2 hunks)
  • arch/x86/compare256_avx512.c (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/test_compare256.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • arch/x86/Makefile.in
  • CMakeLists.txt
  • test/test_compare256.cc
  • functable.c
  • configure
  • arch/x86/x86_functions.h
🧰 Additional context used
🧬 Code Graph Analysis (2)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
🔇 Additional comments (4)
test/benchmarks/benchmark_compare256.cc (1)

83-85: AVX512 benchmark registration looks good.
The code correctly checks test_cpu_features.x86.has_avx512_common before enabling the benchmark. This will gracefully skip the AVX512 test on CPUs lacking AVX512 support.

arch/x86/compare256_avx512.c (3)

19-47: Two-stage 32-byte checks are well-designed.
Performing two 32-byte comparisons before moving on to 64-byte comparisons helps minimize overhead for brief matches. The fallback to __builtin_ctz on mismatch is also correct. Ensure that calling code provides at least 32 bytes in src0 and src1.


79-81: Public wrapper function is straightforward.
Exposing compare256_avx512_static via a single wrapper is consistent with other architecture-specific compare256 variants. Good job keeping its interface parallel to existing implementations.


83-91: Template expansions for LONGEST_MATCH are in line with the zlib design.
The reuse of match_tpl.h for both normal and slow matching is standard in this codebase, providing consistent expansions without duplication.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/x86/compare256_avx512.c (2)

28-30: Correct the comment to reflect 16-byte initial round.
The comment says “First do a couple 32byte rounds” but the code does a single 16-byte round followed by 64-byte rounds. Clarifying the comment helps avoid confusion for future maintainers.

-    // First do a couple 32byte rounds before increasing to 64bytes, this reduces
+    // First do one 16byte round before increasing to 64bytes, this reduces

41-75: Consider refactoring repeated comparison blocks.
Four nearly identical comparison blocks for offsets 16, 80, 144, and 192 could be encapsulated in a loop or helper function to reduce duplication. This might slightly lighten maintenance while still preserving performance if carefully inlined.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a45a14 and 295de4d.

📒 Files selected for processing (2)
  • arch/x86/compare256_avx512.c (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/benchmarks/benchmark_compare256.cc
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
🔇 Additional comments (2)
arch/x86/compare256_avx512.c (2)

1-11: No issues with the license block and initial includes.
Everything looks consistent with project conventions and licensing terms.


80-82:

Details

✅ Verification successful

Verify usage of the newly added function.
Ensure that compare256_avx512 is properly referenced by the function table or calling code so the AVX512 path is actually invoked at runtime.

Run the following script to confirm calls to compare256_avx512:


🏁 Script executed:

#!/bin/bash
# Searching for calls or references to compare256_avx512 in the codebase
rg -A 5 "compare256_avx512"

Length of output: 5649


AVX512 Function Usage Confirmed
Your verification shows that compare256_avx512 is correctly referenced across the codebase:

  • Function Table: In functable.c, the function pointer is set as ft.compare256 = &compare256_avx512;.
  • Header Declaration: The function is declared in arch/x86/x86_functions.h.
  • Build Integration: The build files (Makefile.in and CMakeLists.txt) reference the implementation correctly.
  • Testing & Benchmarks: Test files and benchmark cases (test/test_compare256.cc and test/benchmarks/benchmark_compare256.cc) invoke this function, ensuring the AVX512 path will be used at runtime.

No further changes are necessary as all integration points have been verified.

@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Apr 11, 2025

Rewrote function again to improve the speed of sub-16 byte matches by first using a 128-bit intrinsic, and after that use only 512-bit intrinsics.
This requires us to overlap on the last run, but this is cheaper than processing the tail using a 256-bit and then a 128-bit run.

Develop:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.078/0.078/0.079/0.000        0.030/0.031/0.031/0.000        8,526,745
 2     43.871%      0.128/0.129/0.129/0.000        0.030/0.030/0.030/0.000        6,903,702
 3     42.390%      0.158/0.158/0.158/0.000        0.029/0.029/0.029/0.000        6,670,664
 4     41.644%      0.181/0.181/0.182/0.000        0.028/0.028/0.028/0.000        6,553,205
 5     41.215%      0.196/0.197/0.197/0.000        0.028/0.028/0.028/0.000        6,485,659
 6     41.032%      0.243/0.244/0.244/0.000        0.028/0.028/0.028/0.000        6,456,912
 7     40.778%      0.337/0.338/0.338/0.000        0.028/0.028/0.028/0.000        6,416,941
 8     40.704%      0.445/0.446/0.446/0.000        0.028/0.028/0.028/0.000        6,405,249
 9     40.409%      0.528/0.528/0.529/0.000        0.027/0.027/0.027/0.000        6,358,951

 avg1  42.914%                        0.255                          0.029
 tot                                 68.956                          7.705       60,778,028

   text    data     bss     dec     hex filename
 170024    1344       8  171376   29d70 libz-ng.so.2

PR w/first version:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.079/0.079/0.079/0.000        0.031/0.031/0.031/0.000        8,526,745
 2     43.871%      0.127/0.128/0.128/0.000        0.030/0.030/0.030/0.000        6,903,702
 3     42.390%      0.156/0.156/0.157/0.000        0.029/0.029/0.029/0.000        6,670,664
 4     41.644%      0.178/0.179/0.180/0.000        0.028/0.028/0.028/0.000        6,553,205
 5     41.215%      0.193/0.194/0.194/0.000        0.028/0.028/0.028/0.000        6,485,659
 6     41.032%      0.240/0.240/0.240/0.000        0.028/0.028/0.028/0.000        6,456,912
 7     40.778%      0.335/0.335/0.335/0.000        0.028/0.028/0.028/0.000        6,416,941
 8     40.704%      0.442/0.443/0.443/0.000        0.028/0.028/0.028/0.000        6,405,249
 9     40.409%      0.534/0.535/0.535/0.000        0.027/0.027/0.027/0.000        6,358,951

 avg1  42.914%                        0.254                          0.029
 tot                                 68.650                          7.705       60,778,028

   text    data     bss     dec     hex filename
 172720    1344       8  174072   2a7f8 libz-ng.so.2

PR w/new version:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.078/0.078/0.078/0.000        0.031/0.031/0.031/0.000        8,526,745
 2     43.871%      0.127/0.127/0.128/0.000        0.030/0.030/0.030/0.000        6,903,702
 3     42.390%      0.155/0.156/0.156/0.000        0.029/0.029/0.029/0.000        6,670,664
 4     41.644%      0.179/0.179/0.180/0.000        0.028/0.028/0.028/0.000        6,553,205
 5     41.215%      0.193/0.194/0.194/0.000        0.028/0.028/0.028/0.000        6,485,659
 6     41.032%      0.239/0.240/0.240/0.000        0.028/0.028/0.028/0.000        6,456,912
 7     40.778%      0.334/0.335/0.335/0.000        0.028/0.028/0.028/0.000        6,416,941
 8     40.704%      0.442/0.442/0.442/0.000        0.028/0.028/0.028/0.000        6,405,249
 9     40.409%      0.527/0.527/0.527/0.000        0.027/0.027/0.027/0.000        6,358,951

 avg1  42.914%                        0.253                          0.029
 tot                                 68.334                          7.703       60,778,028

   text    data     bss     dec     hex filename
 172784    1344       8  174136   2a838 libz-ng.so.2

It is now faster on all levels except level 0 (probably too few digits in measurement).
Overall it is now 0.92% faster for all levels.
Level 2 is 1.55% faster.
Level 6 is 1.64% faster.
Even level 9 is now 0.19% faster, instead of being slower.

New comparison of AVX2 vs AVX512:

compare256/avx2/1           1.11 ns         1.11 ns   2514313868
compare256/avx2/10          1.11 ns         1.11 ns   2514328555
compare256/avx2/40          1.41 ns         1.41 ns   2011462645
compare256/avx2/80          1.95 ns         1.95 ns   1436753207
compare256/avx2/100         3.06 ns         3.06 ns    914300827
compare256/avx2/175         3.15 ns         3.15 ns    773638911
compare256/avx2/256         3.62 ns         3.62 ns    773638348

compare256/avx512/1        0.835 ns        0.835 ns   2514328706
compare256/avx512/10       0.835 ns        0.835 ns   3352437409
compare256/avx512/40        1.48 ns         1.48 ns   2011461596
compare256/avx512/80        1.39 ns         1.39 ns   2011225435
compare256/avx512/100       1.96 ns         1.96 ns   1430012195
compare256/avx512/175       2.66 ns         2.66 ns   1059428694
compare256/avx512/256       2.91 ns         2.91 ns    967652717

This is now faster on both small matches and big matches compared to the AVX2 implementation (will look into changing that too later).
1-16 byte matches is now ~25% faster than AVX2.
80-byte matches is ~29% faster
256-byte matches is ~20% faster

256-byte matches are a little slower than the 2.51ns we had with the first commit, but what we gain on the shorter matches more than makes up for it in the end.

@Dead2 Dead2 requested a review from KungFuJesus April 11, 2025 20:07
// 16 bytes
xmm_src0_0 = _mm_loadu_si128((__m128i*)src0);
xmm_src1_0 = _mm_loadu_si128((__m128i*)src1);
mask_0 = (uint32_t)_mm_cmpeq_epu8_mask(xmm_src0_0, xmm_src1_0); // zero-extended to use __builtin_ctz
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if AVX2 can also benefit from this.

Copy link
Copy Markdown
Member Author

@Dead2 Dead2 Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compare with mask integrated?
That is a new thing in EVEX encoding (introduced with AVX512) unfortunately, EVEX adds 8 new mask registers in addition to extending vector registers to 512bit and doubling the amount of vector registers.

Copy link
Copy Markdown
Member

@nmoinvaz nmoinvaz Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I think the first two bytes are always guaranteed to be a match. Nevermind I was thinking of 258 not 256.

@Dead2 Dead2 requested a review from Copilot April 12, 2025 09:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 8 changed files in this pull request and generated no comments.

Files not reviewed (3)
  • CMakeLists.txt: Language not supported
  • arch/x86/Makefile.in: Language not supported
  • configure: Language not supported

Copy link
Copy Markdown
Collaborator

@KungFuJesus KungFuJesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on Cascade Lake:

2025-04-12T11:43:40-04:00
Running ./benchmark_zlib
Run on (28 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x14)
  L1 Instruction 32 KiB (x14)
  L2 Unified 1024 KiB (x14)
  L3 Unified 19712 KiB (x1)
Load Average: 0.07, 0.14, 0.12
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
compare256/avx512/1         1.03 ns         1.03 ns    675617909
compare256/avx512/10        1.03 ns         1.03 ns    677453527
compare256/avx512/40        1.95 ns         1.95 ns    358004126
compare256/avx512/80        1.95 ns         1.95 ns    358826063
compare256/avx512/100       2.51 ns         2.51 ns    279082482
compare256/avx512/175       2.77 ns         2.77 ns    252652802
compare256/avx512/256       3.19 ns         3.19 ns    219524119

Experimented a bit with kortest to precede a k-mask register move but it really ended up being a wash or more expensive. Modifying the comparison to use not equal and save the inversion also seems to be slower, possibly due to breaking down into more uops. We're frontend bound on this but I don't think much can be done about it.

@Dead2 Dead2 force-pushed the compare256-avx512 branch from 295de4d to d21cdbc Compare April 14, 2025 10:20
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/x86/compare256_avx512.c (1)

28-29: Clarify the comment to match the implemented approach.

Currently, the comment states “a couple 32byte rounds” while the code starts with a 16-byte comparison. Consider aligning it with the actual implementation:

-    // First do a couple 32byte rounds before increasing to 64bytes, this reduces
+    // First do a 16-byte round before increasing to 64 bytes, this reduces
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 295de4d and d21cdbc.

📒 Files selected for processing (8)
  • CMakeLists.txt (1 hunks)
  • arch/x86/Makefile.in (2 hunks)
  • arch/x86/compare256_avx512.c (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/test_compare256.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • test/test_compare256.cc
  • CMakeLists.txt
  • arch/x86/x86_functions.h
  • configure
  • test/benchmarks/benchmark_compare256.cc
  • arch/x86/Makefile.in
  • functable.c
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
🔇 Additional comments (7)
arch/x86/compare256_avx512.c (7)

1-6: Header & License verified.

These lines correctly include the necessary license and attributions. No concerns here.


7-11: Includes are appropriate.

All required zlib project headers are present. No issues found.


12-18: Conditional compilation guards appear correct.

The code checks both AVX512 support and the availability of __builtin_ctzll, ensuring safe compilation paths.


19-27: Local variable declarations look good.

Allocating separate ZMM and XMM registers up front is a sensible approach for chunk-wise comparisons.


80-82: Wrapper function is clean.

Delegating to the static inline function helps maintain consistent signature usage. No issues here.


84-95: Macro definitions for match processing are consistent.

Defining LONGEST_MATCH/COMPARE256 before and after including match_tpl.h is a familiar zlib pattern.


31-78:

Details

✅ Verification successful

Efficient block-wise compare logic.

This chunk-wise approach (16 bytes first, then multiple 64-byte loads) with early mismatch detection is well-structured. The use of _mm*_cmpeq_epu8_mask intrinsics plus __builtin_ctz/__builtin_ctzll is correct for finding the first mismatch.

However, please confirm all invocations guarantee at least 256 bytes are available in each buffer (no partial/shorter data scenarios that could cause out-of-bounds loads).


🏁 Script executed:

#!/bin/bash
# Verify that compare256_avx512 is only invoked with buffers >= 256 bytes.
# Searching the repository for all usage sites and checking surrounding code/comments.
rg -A 5 compare256_avx512

Length of output: 5649


Efficient AVX512-accelerated compare with precondition documentation

The block-wise comparison logic using 16-byte and 64-byte loads is implemented optimally with early exit on mismatch using the appropriate AVX512 intrinsics and bit-scan intrinsics. Based on a search of usage sites (in files like functable.c, the benchmarks, and the test suite), all invocations of compare256_avx512 occur in contexts where it is guaranteed that both input buffers hold at least 256 bytes.

To further improve code clarity and robustness, please consider adding a comment in the function’s header noting that the caller must ensure a minimum buffer size of 256 bytes.

Improve the speed of sub-16 byte matches by first using a
128-bit intrinsic, after that use only 512-bit intrinsics.
This requires us to overlap on the last run, but this is cheaper than
processing the tail using a 256-bit and then a 128-bit run.

Change benchmark steps to avoid it hitting chunk boundaries
of one or the other function as much, this gives more fair benchmarks.
@Dead2 Dead2 force-pushed the compare256-avx512 branch from d21cdbc to e098ae8 Compare April 14, 2025 10:32
@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Apr 14, 2025

Rebased, squashed and updated comment.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/x86/compare256_avx512.c (1)

19-80: Thorough implementation with overlapping tail check.
Your approach efficiently handles short matches and leverages a final 64-byte overlap for large matches. The shift from 16 to 64-byte comparisons is logical and well documented. Additionally:

  1. The combination of _mm_cmpeq_epu8_mask and __builtin_ctz is consistent for pinpointing the first mismatch.
  2. Offsets are assigned correctly to ensure short-circuiting upon mismatch.

To complete coverage, consider adding explicit tests for:

  • Mismatch precisely at index 15.
  • Mismatch at index 16 or 192 (boundary offset checks).
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d21cdbc and e098ae8.

📒 Files selected for processing (8)
  • CMakeLists.txt (1 hunks)
  • arch/x86/Makefile.in (2 hunks)
  • arch/x86/compare256_avx512.c (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/test_compare256.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • test/test_compare256.cc
  • CMakeLists.txt
  • functable.c
  • arch/x86/Makefile.in
  • test/benchmarks/benchmark_compare256.cc
  • configure
  • arch/x86/x86_functions.h
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (64-64)
🔇 Additional comments (4)
arch/x86/compare256_avx512.c (4)

1-5: Use of current year in copyright header looks good.
No issues here, and it matches the clarifications from prior review comments.


12-18: Confirm minimum buffer constraints.
This code provides no safeguards against reading beyond buffer boundaries. Please verify that all callers guarantee at least 256 bytes in src0 and src1 to avoid potential out-of-bounds reads.


82-84: Public wrapper is straightforward.
The inline static function is properly wrapped here for external linkage. No issues to report.


86-95: Macros inclusion for match_tpl.
Using the same template multiple times with different definitions is a known pattern. Looks fine for maintaining two variants (fast vs. slow). No immediate concerns here.

@Dead2 Dead2 merged commit 00a3168 into develop Apr 14, 2025
278 of 290 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Apr 14, 2025
@Dead2 Dead2 deleted the compare256-avx512 branch August 23, 2025 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Architecture Architecture specific optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants