Conversation
WalkthroughThis pull request adds an AVX512-optimized implementation for the Changes
Sequence Diagram(s)sequenceDiagram
participant T as Test/Benchmark
participant F as Functable
participant C as compare256_avx512
participant I as AVX512 Intrinsics
T->>F: Call compare256 via function pointer
F->>C: Forward call to compare256_avx512
C->>I: Load and compare data using AVX512 intrinsics
I-->>C: Return comparison result
C-->>F: Pass result back
F-->>T: Return final compare result
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
test/benchmarks/benchmark_compare256.cc (1)
62-62: Improved benchmark precision with specific test sizesChanged from a range-based benchmark to specific argument sizes (1, 8, 32, 96, 160, 256), providing more targeted performance data at key boundary conditions.
Consider adding a comment explaining why these specific values were chosen, especially 96 and 160, to help future maintainers understand the rationale.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
CMakeLists.txt(1 hunks)arch/x86/Makefile.in(2 hunks)arch/x86/compare256_avx512.c(1 hunks)arch/x86/x86_functions.h(2 hunks)configure(1 hunks)functable.c(1 hunks)test/benchmarks/benchmark_compare256.cc(2 hunks)test/test_compare256.cc(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
test/test_compare256.cc (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
functable.c (1)
arch/x86/x86_functions.h (3)
compare256_avx512(64-64)longest_match_avx512(65-65)longest_match_slow_avx512(66-66)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
🔇 Additional comments (12)
CMakeLists.txt (1)
1044-1045: Looks good: Proper integration of AVX512 compare256 implementation.The addition of
compare256_avx512.cto the AVX512 source list and registering it as a feature with appropriate flags follows the established pattern in the codebase for other AVX512 optimized functions.test/test_compare256.cc (1)
82-84: Correct test case added for AVX512 implementation.The test follows the same pattern used for other architectures and properly checks for both AVX512 support and the required
HAVE_BUILTIN_CTZLLintrinsic. The CPU feature check is also consistent with other AVX512 implementations.arch/x86/Makefile.in (2)
39-39: Correctly added object file to the build target list.The object files for the AVX512 compare256 implementation have been properly added to the
alltarget.
88-92: Appropriate build rules for AVX512 comparison function.The build rules correctly use the
AVX512FLAGto enable AVX512 instructions and follow the same pattern used for other AVX512-optimized functions.functable.c (1)
142-146: Correctly implemented AVX512 function registration.The code properly registers the AVX512 implementations of
compare256,longest_match, andlongest_match_slowwhen both AVX512 and the requiredHAVE_BUILTIN_CTZLLintrinsic are available. This matches the PR description stating that the implementation "generates the mask directly from the comparison" for performance improvement.Note that all three functions are updated together as they're likely interdependent, with
compare256being the core function used by the matching functions.configure (1)
1697-1698: Build system updated to include new AVX512 compare256 implementationThe configure script now properly includes the new
compare256_avx512.lofile in both static and shared object lists when AVX512 support is detected.arch/x86/x86_functions.h (2)
63-67: New AVX512-optimized compare256 function declarations addedThe declarations are properly guarded by the HAVE_BUILTIN_CTZLL check since the implementation relies on the __builtin_ctzll intrinsic.
177-184: Functional wiring for AVX512 compare256 featuresAppropriate native function pointer assignments for AVX512 implementations, ensuring they're used when the CPU supports AVX512 features and the compiler has the necessary intrinsics.
test/benchmarks/benchmark_compare256.cc (1)
83-85: Added AVX512 compare256 benchmarkProperly conditionally compiled and checks for required CPU features before running.
arch/x86/compare256_avx512.c (3)
19-77: Efficient implementation of AVX512 comparison functionThe implementation cleverly starts with 32-byte AVX2 comparisons before moving to 64-byte AVX512 comparisons. This hybrid approach minimizes overhead for common short matches while leveraging AVX512 for longer matches.
The mask generation directly from the comparison (using
_mm256_cmpeq_epu8_maskand_mm512_cmpeq_epu8_mask) is a key optimization that differentiates this implementation from the AVX2 version.
79-81: Simple wrapper function with appropriate internal linkageThe Z_INTERNAL macro ensures proper visibility control for this function.
83-93: Template-based implementation of longest_match functionsUsing match_tpl.h with appropriate macro definitions is a clean approach to generate the longest_match implementations based on the compare256 function.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1901 +/- ##
============================================
+ Coverage 40.68% 83.11% +42.43%
============================================
Files 72 145 +73
Lines 7635 13000 +5365
Branches 1275 2875 +1600
============================================
+ Hits 3106 10805 +7699
+ Misses 4295 1247 -3048
- Partials 234 948 +714 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Great idea to not use a loop like the other implementations, because the early checks always hurt performance. |
|
Grok3 says qemu AVX512 should be supported with |
mtl1979
left a comment
There was a problem hiding this comment.
Fix typo in configure script
d943ac8 to
1a45a14
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
arch/x86/compare256_avx512.c (1)
2-2: Consider updating the copyright year consistently.
Existing reviews also mention adjusting the year, so this may remain a to-do item.
🧹 Nitpick comments (2)
test/benchmarks/benchmark_compare256.cc (1)
62-62: Switching from Range to discrete Arg values is a valid approach.
This ensures more precise measurements at specific sizes. If you still want broader coverage of sizes without manually enumerating each possible workload size, consider using additional Arg values or a Range alongside these specific Args.arch/x86/compare256_avx512.c (1)
49-75: 64-byte comparisons appear correct, but consider using prefetching if repeated queries are common.
The logic and usage of_mm512_cmpeq_epu8_maskfollowed by__builtin_ctzll(~mask_x)is correct for detecting the first mismatch byte. If performance is critical for large data sets repeatedly invoked, adding optional prefetch hints could be beneficial, though this is often workload-dependent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
CMakeLists.txt(1 hunks)arch/x86/Makefile.in(2 hunks)arch/x86/compare256_avx512.c(1 hunks)arch/x86/x86_functions.h(2 hunks)configure(1 hunks)functable.c(1 hunks)test/benchmarks/benchmark_compare256.cc(2 hunks)test/test_compare256.cc(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- arch/x86/Makefile.in
- CMakeLists.txt
- test/test_compare256.cc
- functable.c
- configure
- arch/x86/x86_functions.h
🧰 Additional context used
🧬 Code Graph Analysis (2)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
🔇 Additional comments (4)
test/benchmarks/benchmark_compare256.cc (1)
83-85: AVX512 benchmark registration looks good.
The code correctly checkstest_cpu_features.x86.has_avx512_commonbefore enabling the benchmark. This will gracefully skip the AVX512 test on CPUs lacking AVX512 support.arch/x86/compare256_avx512.c (3)
19-47: Two-stage 32-byte checks are well-designed.
Performing two 32-byte comparisons before moving on to 64-byte comparisons helps minimize overhead for brief matches. The fallback to__builtin_ctzon mismatch is also correct. Ensure that calling code provides at least 32 bytes insrc0andsrc1.
79-81: Public wrapper function is straightforward.
Exposingcompare256_avx512_staticvia a single wrapper is consistent with other architecture-specific compare256 variants. Good job keeping its interface parallel to existing implementations.
83-91: Template expansions for LONGEST_MATCH are in line with the zlib design.
The reuse ofmatch_tpl.hfor both normal and slow matching is standard in this codebase, providing consistent expansions without duplication.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/x86/compare256_avx512.c (2)
28-30: Correct the comment to reflect 16-byte initial round.
The comment says “First do a couple 32byte rounds” but the code does a single 16-byte round followed by 64-byte rounds. Clarifying the comment helps avoid confusion for future maintainers.- // First do a couple 32byte rounds before increasing to 64bytes, this reduces + // First do one 16byte round before increasing to 64bytes, this reduces
41-75: Consider refactoring repeated comparison blocks.
Four nearly identical comparison blocks for offsets 16, 80, 144, and 192 could be encapsulated in a loop or helper function to reduce duplication. This might slightly lighten maintenance while still preserving performance if carefully inlined.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
arch/x86/compare256_avx512.c(1 hunks)test/benchmarks/benchmark_compare256.cc(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/benchmarks/benchmark_compare256.cc
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
🔇 Additional comments (2)
arch/x86/compare256_avx512.c (2)
1-11: No issues with the license block and initial includes.
Everything looks consistent with project conventions and licensing terms.
80-82:Details
✅ Verification successful
Verify usage of the newly added function.
Ensure thatcompare256_avx512is properly referenced by the function table or calling code so the AVX512 path is actually invoked at runtime.Run the following script to confirm calls to
compare256_avx512:
🏁 Script executed:
#!/bin/bash # Searching for calls or references to compare256_avx512 in the codebase rg -A 5 "compare256_avx512"Length of output: 5649
AVX512 Function Usage Confirmed
Your verification shows thatcompare256_avx512is correctly referenced across the codebase:
- Function Table: In
functable.c, the function pointer is set asft.compare256 = &compare256_avx512;.- Header Declaration: The function is declared in
arch/x86/x86_functions.h.- Build Integration: The build files (
Makefile.inandCMakeLists.txt) reference the implementation correctly.- Testing & Benchmarks: Test files and benchmark cases (
test/test_compare256.ccandtest/benchmarks/benchmark_compare256.cc) invoke this function, ensuring the AVX512 path will be used at runtime.No further changes are necessary as all integration points have been verified.
|
Rewrote function again to improve the speed of sub-16 byte matches by first using a 128-bit intrinsic, and after that use only 512-bit intrinsics. Develop: PR w/first version: PR w/new version: It is now faster on all levels except level 0 (probably too few digits in measurement). New comparison of AVX2 vs AVX512: This is now faster on both small matches and big matches compared to the AVX2 implementation (will look into changing that too later). 256-byte matches are a little slower than the 2.51ns we had with the first commit, but what we gain on the shorter matches more than makes up for it in the end. |
| // 16 bytes | ||
| xmm_src0_0 = _mm_loadu_si128((__m128i*)src0); | ||
| xmm_src1_0 = _mm_loadu_si128((__m128i*)src1); | ||
| mask_0 = (uint32_t)_mm_cmpeq_epu8_mask(xmm_src0_0, xmm_src1_0); // zero-extended to use __builtin_ctz |
There was a problem hiding this comment.
I wonder if AVX2 can also benefit from this.
There was a problem hiding this comment.
The compare with mask integrated?
That is a new thing in EVEX encoding (introduced with AVX512) unfortunately, EVEX adds 8 new mask registers in addition to extending vector registers to 512bit and doubling the amount of vector registers.
There was a problem hiding this comment.
Btw, I think the first two bytes are always guaranteed to be a match. Nevermind I was thinking of 258 not 256.
KungFuJesus
left a comment
There was a problem hiding this comment.
Tested on Cascade Lake:
2025-04-12T11:43:40-04:00
Running ./benchmark_zlib
Run on (28 X 4800 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x14)
L1 Instruction 32 KiB (x14)
L2 Unified 1024 KiB (x14)
L3 Unified 19712 KiB (x1)
Load Average: 0.07, 0.14, 0.12
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
compare256/avx512/1 1.03 ns 1.03 ns 675617909
compare256/avx512/10 1.03 ns 1.03 ns 677453527
compare256/avx512/40 1.95 ns 1.95 ns 358004126
compare256/avx512/80 1.95 ns 1.95 ns 358826063
compare256/avx512/100 2.51 ns 2.51 ns 279082482
compare256/avx512/175 2.77 ns 2.77 ns 252652802
compare256/avx512/256 3.19 ns 3.19 ns 219524119
Experimented a bit with kortest to precede a k-mask register move but it really ended up being a wash or more expensive. Modifying the comparison to use not equal and save the inversion also seems to be slower, possibly due to breaking down into more uops. We're frontend bound on this but I don't think much can be done about it.
295de4d to
d21cdbc
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/x86/compare256_avx512.c (1)
28-29: Clarify the comment to match the implemented approach.Currently, the comment states “a couple 32byte rounds” while the code starts with a 16-byte comparison. Consider aligning it with the actual implementation:
- // First do a couple 32byte rounds before increasing to 64bytes, this reduces + // First do a 16-byte round before increasing to 64 bytes, this reduces
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
CMakeLists.txt(1 hunks)arch/x86/Makefile.in(2 hunks)arch/x86/compare256_avx512.c(1 hunks)arch/x86/x86_functions.h(2 hunks)configure(1 hunks)functable.c(1 hunks)test/benchmarks/benchmark_compare256.cc(2 hunks)test/test_compare256.cc(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- test/test_compare256.cc
- CMakeLists.txt
- arch/x86/x86_functions.h
- configure
- test/benchmarks/benchmark_compare256.cc
- arch/x86/Makefile.in
- functable.c
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
🔇 Additional comments (7)
arch/x86/compare256_avx512.c (7)
1-6: Header & License verified.These lines correctly include the necessary license and attributions. No concerns here.
7-11: Includes are appropriate.All required zlib project headers are present. No issues found.
12-18: Conditional compilation guards appear correct.The code checks both AVX512 support and the availability of
__builtin_ctzll, ensuring safe compilation paths.
19-27: Local variable declarations look good.Allocating separate ZMM and XMM registers up front is a sensible approach for chunk-wise comparisons.
80-82: Wrapper function is clean.Delegating to the static inline function helps maintain consistent signature usage. No issues here.
84-95: Macro definitions for match processing are consistent.Defining
LONGEST_MATCH/COMPARE256before and after includingmatch_tpl.his a familiar zlib pattern.
31-78:Details
✅ Verification successful
Efficient block-wise compare logic.
This chunk-wise approach (16 bytes first, then multiple 64-byte loads) with early mismatch detection is well-structured. The use of
_mm*_cmpeq_epu8_maskintrinsics plus__builtin_ctz/__builtin_ctzllis correct for finding the first mismatch.However, please confirm all invocations guarantee at least 256 bytes are available in each buffer (no partial/shorter data scenarios that could cause out-of-bounds loads).
🏁 Script executed:
#!/bin/bash # Verify that compare256_avx512 is only invoked with buffers >= 256 bytes. # Searching the repository for all usage sites and checking surrounding code/comments. rg -A 5 compare256_avx512Length of output: 5649
Efficient AVX512-accelerated compare with precondition documentation
The block-wise comparison logic using 16-byte and 64-byte loads is implemented optimally with early exit on mismatch using the appropriate AVX512 intrinsics and bit-scan intrinsics. Based on a search of usage sites (in files like
functable.c, the benchmarks, and the test suite), all invocations ofcompare256_avx512occur in contexts where it is guaranteed that both input buffers hold at least 256 bytes.To further improve code clarity and robustness, please consider adding a comment in the function’s header noting that the caller must ensure a minimum buffer size of 256 bytes.
Improve the speed of sub-16 byte matches by first using a 128-bit intrinsic, after that use only 512-bit intrinsics. This requires us to overlap on the last run, but this is cheaper than processing the tail using a 256-bit and then a 128-bit run. Change benchmark steps to avoid it hitting chunk boundaries of one or the other function as much, this gives more fair benchmarks.
d21cdbc to
e098ae8
Compare
|
Rebased, squashed and updated comment. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/x86/compare256_avx512.c (1)
19-80: Thorough implementation with overlapping tail check.
Your approach efficiently handles short matches and leverages a final 64-byte overlap for large matches. The shift from 16 to 64-byte comparisons is logical and well documented. Additionally:
- The combination of
_mm_cmpeq_epu8_maskand__builtin_ctzis consistent for pinpointing the first mismatch.- Offsets are assigned correctly to ensure short-circuiting upon mismatch.
To complete coverage, consider adding explicit tests for:
- Mismatch precisely at index 15.
- Mismatch at index 16 or 192 (boundary offset checks).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
CMakeLists.txt(1 hunks)arch/x86/Makefile.in(2 hunks)arch/x86/compare256_avx512.c(1 hunks)arch/x86/x86_functions.h(2 hunks)configure(1 hunks)functable.c(1 hunks)test/benchmarks/benchmark_compare256.cc(2 hunks)test/test_compare256.cc(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- test/test_compare256.cc
- CMakeLists.txt
- functable.c
- arch/x86/Makefile.in
- test/benchmarks/benchmark_compare256.cc
- configure
- arch/x86/x86_functions.h
🧰 Additional context used
🧬 Code Graph Analysis (1)
arch/x86/compare256_avx512.c (1)
arch/x86/x86_functions.h (1)
compare256_avx512(64-64)
🔇 Additional comments (4)
arch/x86/compare256_avx512.c (4)
1-5: Use of current year in copyright header looks good.
No issues here, and it matches the clarifications from prior review comments.
12-18: Confirm minimum buffer constraints.
This code provides no safeguards against reading beyond buffer boundaries. Please verify that all callers guarantee at least 256 bytes insrc0andsrc1to avoid potential out-of-bounds reads.
82-84: Public wrapper is straightforward.
The inline static function is properly wrapped here for external linkage. No issues to report.
86-95: Macros inclusion for match_tpl.
Using the same template multiple times with different definitions is a known pattern. Looks fine for maintaining two variants (fast vs. slow). No immediate concerns here.
Implements AVX512 variant of compare256.
Unlike AVX2, AVX512 does not need to make the mask separately, and instead can generate the mask directly from the comparison, saving us a tiny bit of time.
Using 64-byte compares from the start turned out to be a little slower on normal files because most matches are short or even not matches at all. Therefore I implemented this using two rounds of 32-byte compares at the start, before going up to 64byte compares after the first 64bytes have matched.
Before:
After:
Levels 1-8 are a little faster, with level 6 being 1.24% faster.
Level 9 takes a small penalty due to trying really hard to find even small matches, increasing the likelihood of the matches not being matches at all.
Had to add a few more benchmark steps for this to be useful.
Up to 96 byte matches they are perform practically the same.
On 160-byte matches, these results indicate it is ~19% faster.
On 256-byte matches, these results indicate it is ~39% faster.
Summary by CodeRabbit
New Features
Tests