Port SSE/AVX optimization to Loongarch64 LSX/LASX Vector Intrinsics#1925
Dead2 merged 13 commits into zlib-ng:develop from
Conversation
Walkthrough

This change introduces comprehensive support for the LoongArch architecture, including new SIMD-optimized implementations for CRC32, Adler32, chunk memory operations, compare256, and slide hash. It adds LoongArch-specific build options, intrinsic detection, runtime feature detection, function pointer assignments, toolchain files, and extends benchmarks, tests, and documentation to recognize and exercise these new features.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant BuildSystem as Build System (CMake/configure)
    participant CPU as LoongArch CPU
    participant App as Application/Library
    participant Functable as Function Table
    participant Features as loongarch_features.c
    participant SIMD as SIMD-optimized Functions
    BuildSystem->>CPU: Detect LoongArch architecture
    BuildSystem->>BuildSystem: Check for CRC/LSX/LASX intrinsics
    BuildSystem->>BuildSystem: Enable LoongArch-specific build flags and sources
    App->>Features: loongarch_check_features()
    Features->>CPU: Read CPU config register
    Features-->>App: Return has_crc, has_lsx, has_lasx
    App->>Functable: init_functable()
    Functable->>App: Query CPU features
    Functable->>SIMD: Assign function pointers for CRC32, Adler32, etc. based on features
    App->>SIMD: Call optimized functions at runtime
```
Actionable comments posted: 3
🧹 Nitpick comments (4)
arch/loongarch/loongarch_features.c (1)
22-27: LGTM! Consider using bit shift operations for clarity.

The CPU feature detection implementation is correct and follows the LoongArch documentation. The bit masks accurately correspond to the documented positions.

For improved readability and maintainability, consider using bit shift operations instead of hex literals:

```diff
 void Z_INTERNAL loongarch_check_features(struct loongarch_cpu_features *features) {
     unsigned int w1 = __cpucfg(0x1);
-    features->has_crc = w1 & 0x2000000;
-    features->has_lsx = w1 & 0x40;
-    features->has_lasx = w1 & 0x80;
+    features->has_crc = w1 & (1U << 25);
+    features->has_lsx = w1 & (1U << 6);
+    features->has_lasx = w1 & (1U << 7);
 }
```

This makes the bit positions explicit and matches the documentation comments.
arch/loongarch/slide_hash_lsx.c (1)
19-54: Review the alignment assumptions and loop structure.

The function assumes 16-byte alignment (`assert` statements on lines 61-62) but processes 32 bytes per iteration (lines 37-42). Consider these potential issues:

- Alignment assumption: The comment mentions "64 byte boundaries" but the asserts only guarantee 16-byte alignment
- Loop structure: Using `goto` for control flow (line 52) reduces readability

Consider refactoring to eliminate `goto`:

```diff
-    int on_chain = 0;
-
-next_chain:
-    table = (on_chain) ? table1 : table0;
-    entries = (on_chain) ? entries1 : entries0;
-
-    table += entries;
-    table -= 16;
-
-    /* ZALLOC allocates this pointer unless the user chose a custom allocator.
-     * Our alloc function is aligned to 64 byte boundaries */
-    do {
-        value0 = __lsx_vld(table, 0);
-        value1 = __lsx_vld(table, 16);
-        result0 = __lsx_vssub_hu(value0, wsize);
-        result1 = __lsx_vssub_hu(value1, wsize);
-        __lsx_vst(result0, table, 0);
-        __lsx_vst(result1, table, 16);
-
-        table -= 16;
-        entries -= 16;
-    } while (entries > 0);
-
-    ++on_chain;
-    if (on_chain > 1) {
-        return;
-    } else {
-        goto next_chain;
-    }
+    for (int chain = 0; chain < 2; chain++) {
+        table = (chain == 1) ? table1 : table0;
+        entries = (chain == 1) ? entries1 : entries0;
+
+        table += entries;
+        table -= 16;
+
+        do {
+            value0 = __lsx_vld(table, 0);
+            value1 = __lsx_vld(table, 16);
+            result0 = __lsx_vssub_hu(value0, wsize);
+            result1 = __lsx_vssub_hu(value1, wsize);
+            __lsx_vst(result0, table, 0);
+            __lsx_vst(result1, table, 16);
+
+            table -= 16;
+            entries -= 16;
+        } while (entries > 0);
+    }
```

cmake/toolchain-loongarch64-gcc-14.cmake (1)
22-25: Consider consistent error handling for C++ compiler detection.

The C compiler detection fails fatally if not found (lines 17-19), but C++ compiler detection is optional. This asymmetry might cause build issues if C++ compilation is required but the compiler isn't found.

Consider adding a warning or making the behavior consistent:

```diff
 find_program(CXX_COMPILER_FULL_PATH NAMES g++-14-${CMAKE_CXX_COMPILER_TARGET} ${CMAKE_CXX_COMPILER_TARGET}-g++-14)
 if(CXX_COMPILER_FULL_PATH)
     set(CMAKE_CXX_COMPILER ${CXX_COMPILER_FULL_PATH})
+else()
+    message(WARNING "C++ cross-compiler for ${CMAKE_CXX_COMPILER_TARGET} not found")
 endif()
```

arch/loongarch/adler32_lasx.c (1)
62-64: Consider reflowing the initializer for better readability.

The vector initialization uses a compound literal which is correct but could be more readable.

```diff
-    const __m256i dot2v = (__m256i)((v32i8){ 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
-                                             14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 });
+    const __m256i dot2v = (__m256i)((v32i8){
+        32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17,
+        16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
+    });
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (39)
- .github/workflows/cmake.yml (1 hunks)
- .github/workflows/configure.yml (1 hunks)
- CMakeLists.txt (5 hunks)
- arch/arm/chunkset_neon.c (1 hunks)
- arch/generic/chunk_128bit_perm_idx_lut.h (1 hunks)
- arch/generic/chunk_256bit_perm_idx_lut.h (1 hunks)
- arch/loongarch/Makefile.in (1 hunks)
- arch/loongarch/adler32_lasx.c (1 hunks)
- arch/loongarch/adler32_lsx.c (1 hunks)
- arch/loongarch/chunkset_lasx.c (1 hunks)
- arch/loongarch/chunkset_lsx.c (1 hunks)
- arch/loongarch/compare256_lasx.c (1 hunks)
- arch/loongarch/compare256_lsx.c (1 hunks)
- arch/loongarch/crc32_la.c (1 hunks)
- arch/loongarch/lasxintrin_ext.h (1 hunks)
- arch/loongarch/loongarch_features.c (1 hunks)
- arch/loongarch/loongarch_features.h (1 hunks)
- arch/loongarch/loongarch_functions.h (1 hunks)
- arch/loongarch/lsxintrin_ext.h (1 hunks)
- arch/loongarch/slide_hash_lasx.c (1 hunks)
- arch/loongarch/slide_hash_lsx.c (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/chunkset_ssse3.c (1 hunks)
- arch_functions.h (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- cmake/toolchain-loongarch64-gcc-14.cmake (1 hunks)
- configure (8 hunks)
- cpu_features.c (1 hunks)
- cpu_features.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32.cc (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/benchmarks/benchmark_slidehash.cc (1 hunks)
- test/test_adler32.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_crc32.cc (1 hunks)
🔇 Additional comments (69)
arch/generic/chunk_256bit_perm_idx_lut.h (3)
1-5: LGTM! Well-structured header refactoring.

The conversion from architecture-specific to generic header is well-executed. The new header guard `CHUNK_256BIT_PERM_IDX_LUT_H_` follows naming conventions and the copyright header is properly added.
11-46: Lookup tables look correct.

The permutation index lookup table and half remainder values are properly structured for 256-bit operations and maintain the same data as the original AVX2-specific implementation.
7-7: Verify the include path change works correctly.

The include path changed from relative "../generic/chunk_permute_table.h" to direct "chunk_permute_table.h". Ensure the build system's include paths are configured to locate this header correctly across all architectures that use this generic lookup table.

```shell
#!/bin/bash
# Description: Verify that chunk_permute_table.h can be found by the build system
# Expected: The header should exist and be accessible

# Find the chunk_permute_table.h file
echo "Looking for chunk_permute_table.h:"
fd "chunk_permute_table.h"

# Check if it exists in expected locations
echo -e "\nChecking common include paths:"
ls -la arch/generic/chunk_permute_table.h 2>/dev/null || echo "Not found in arch/generic/"
ls -la chunk_permute_table.h 2>/dev/null || echo "Not found in root"
```

arch/x86/chunkset_avx512.c (1)
9-9: LGTM! Proper migration to generic header.

The include change from "avx2_tables.h" to "arch/generic/chunk_256bit_perm_idx_lut.h" correctly uses the new consolidated generic header. This maintains the same lookup table data while enabling reuse across architectures.

arch/x86/chunkset_avx2.c (1)
8-8: LGTM! Consistent with AVX512 refactoring.

The include change mirrors the same refactoring applied to AVX512, properly consolidating the permutation lookup tables into the generic header.
arch_functions.h (1)
27-28: LGTM! Proper LoongArch architecture integration.

The addition of LoongArch support follows the established pattern exactly, using `elif defined(LOONGARCH_FEATURES)` and including the architecture-specific functions header. This maintains consistency with the existing architecture selection logic.

cpu_features.c (1)
22-23: LGTM! Consistent CPU feature detection pattern.

The LoongArch CPU feature detection follows the exact same pattern as other architectures, calling `loongarch_check_features(&features->loongarch)` within the conditional compilation chain.

```shell
#!/bin/bash
# Description: Verify that loongarch_check_features function is declared and defined
# Expected: Function should be declared in header and defined in source

echo "Searching for loongarch_check_features function:"
rg -A 3 "loongarch_check_features"

echo -e "\nLooking for LoongArch-specific files:"
fd "loongarch" --type f
```

arch/arm/chunkset_neon.c (1)
10-10: LGTM: Good refactoring to use shared lookup table.

This change eliminates code duplication by centralizing the permutation index lookup table across multiple architectures (SSSE3, NEON, LSX). The shared header approach improves maintainability and consistency.
arch/x86/chunkset_ssse3.c (1)
10-11: LGTM: Consistent refactoring to shared headers.

These includes align with the broader effort to unify permutation index lookup tables across architectures. The changes maintain existing functionality while improving code organization and reducing duplication.
test/benchmarks/benchmark_adler32_copy.cc (1)
130-133: LGTM: Proper benchmark coverage for LoongArch LASX.

The benchmark additions follow the established pattern used by other architectures:

- Proper conditional compilation with `LOONGARCH_LASX`
- Runtime feature detection using `test_cpu_features.loongarch.has_lasx`
- Both baseline and optimized variants for comprehensive performance comparison

This ensures the new LoongArch LASX implementation gets proper benchmark coverage.
arch/generic/chunk_128bit_perm_idx_lut.h (1)
1-27: LGTM: Well-structured shared lookup table header.

The header is properly implemented with:

- Correct header guards to prevent multiple inclusion
- Appropriate `static const` declaration for the lookup table
- Proper dependency on `chunk_permute_table.h` for the `lut_rem_pair` type
- Clear documentation of purpose and usage

This centralizes the permutation index lookup table that was previously duplicated across SSSE3, NEON, and now LSX implementations, improving maintainability.
test/test_crc32.cc (1)
324-326: LGTM: Proper test coverage for LoongArch CRC32.

The test addition correctly follows the established pattern:

- Uses the standard `TEST_CRC32` macro for consistency
- Proper conditional compilation with `LOONGARCH_CRC`
- Runtime feature detection using `test_cpu_features.loongarch.has_crc`
- Function naming convention matches other architectures

This ensures the new LoongArch CRC32 implementation gets the same comprehensive test coverage as other architecture-specific implementations.
cpu_features.h (1)
21-22: LGTM! LoongArch support follows established architecture pattern.

The addition of LoongArch CPU features detection is well-implemented and consistent with the existing architecture support pattern used for X86, ARM, PPC, S390, and RISCV.
Also applies to: 36-37
test/benchmarks/benchmark_crc32.cc (1)
100-102: LGTM! LoongArch CRC32 benchmark follows established pattern.

The benchmark registration is correctly implemented using the existing `BENCHMARK_CRC32` macro with appropriate conditional compilation and CPU feature detection.

test/test_adler32.cc (1)
395-400: LGTM! LoongArch Adler32 tests follow established SIMD testing pattern.

The test instantiations for both LSX and LASX variants are correctly implemented with proper conditional compilation and runtime CPU feature detection, consistent with other SIMD architecture tests.
test/test_compare256.cc (1)
94-99: LGTM! LoongArch compare256 tests follow established SIMD pattern.

The test cases for LSX and LASX variants correctly require both the architecture-specific macros and `HAVE_BUILTIN_CTZ`, consistent with the pattern used by X86 SSE2/AVX2 implementations. The runtime feature detection is properly implemented.

test/benchmarks/benchmark_compare256.cc (1)
95-100: LGTM! LoongArch compare256 benchmarks follow established SIMD pattern.

The benchmark registrations for LSX and LASX variants correctly mirror the test implementation pattern, requiring both architecture-specific macros and `HAVE_BUILTIN_CTZ`, consistent with X86 SSE2/AVX2 benchmarks.

test/benchmarks/benchmark_adler32.cc (1)
100-106: LGTM! Consistent with existing SIMD benchmark patterns.

The LoongArch LSX and LASX benchmark registrations correctly follow the established pattern used by other SIMD implementations. The conditional compilation guards and runtime feature checks are appropriate.
test/benchmarks/benchmark_slidehash.cc (1)
98-104: LGTM! Consistent implementation following established patterns.

The LoongArch slide_hash benchmark registrations correctly implement the same pattern used throughout the file for other SIMD extensions. The conditional compilation and runtime feature detection are properly implemented.
arch/loongarch/Makefile.in (2)
13-14: LGTM! Proper compiler flags for LoongArch SIMD extensions.

The LSX and LASX compiler flags are correctly defined and will enable the appropriate SIMD instruction sets during compilation.
44-91: Verify the NOLTOFLAG variable definition.

The Makefile structure is well-organized and includes appropriate build rules for all LoongArch-specific source files. However, the `NOLTOFLAG` variable is used in multiple targets but not defined in this Makefile.

Please verify that `NOLTOFLAG` is properly defined in the parent Makefile or build system:

```shell
#!/bin/bash
# Description: Check if NOLTOFLAG is defined in parent Makefiles
# Expected: Find NOLTOFLAG definition in build system

# Search for NOLTOFLAG definition in Makefile hierarchy
rg -A 2 "NOLTOFLAG.*=" .
```

.github/workflows/cmake.yml (1)
313-328: LGTM! Comprehensive CI configuration for LoongArch64.

The GitHub Actions workflow additions properly implement LoongArch64 cross-compilation testing with GCC 14. The dual configuration approach (with and without LASX) ensures both instruction set variants are tested. The package selection, toolchain configuration, and coverage reporting setup follow the established patterns used for other cross-compilation targets.
.github/workflows/configure.yml (1)
252-269: LGTM! LoongArch64 CI configuration follows established patterns.

The addition of two LoongArch64 build matrix entries properly follows the same pattern used for other cross-compiled architectures, with appropriate compiler selection (GCC 14), static linking, and cross-compilation packages.
arch/loongarch/loongarch_features.h (2)
11-15: Well-designed feature detection interface.

The `loongarch_cpu_features` struct appropriately captures the three key LoongArch SIMD capabilities (CRC32, LSX 128-bit, LASX 256-bit) following the same pattern as other architecture feature detection headers.
17-17: Proper internal function declaration.

The use of `Z_INTERNAL` for `loongarch_check_features` is correct for internal runtime detection functions.

arch/loongarch/slide_hash_lasx.c (2)
19-33: Efficient vectorized hash sliding implementation.

The `slide_hash_chain` function efficiently processes 16 hash values per iteration using LASX 256-bit intrinsics. The use of saturating subtract (`__lasx_xvssub_hu`) correctly prevents hash value underflow.
35-42: Proper bounds checking and implementation.

The assertion `Assert(s->w_size <= UINT16_MAX, ...)` is crucial for preventing overflow when casting to `uint16_t`. The function correctly handles both the `head` hash table (fixed `HASH_SIZE`) and the `prev` table (variable `wsize`).

functable.c (1)
272-309: Comprehensive LoongArch runtime dispatch integration.

The addition of LoongArch function pointer assignments follows the established pattern perfectly. The progressive feature checking (CRC → LSX → LASX) ensures optimal performance selection, and the conditional use of `HAVE_BUILTIN_CTZ` is appropriate for functions requiring bit manipulation support.

CMakeLists.txt (3)
132-136: Well-structured LoongArch build options.

The build options properly implement feature dependencies with LASX requiring LSX, following the hardware capabilities of LoongArch processors.

740-742: Consistent architecture directory setup.

The ARCHDIR assignment follows the same pattern as other supported architectures.

1616-1620: Complete feature summary integration.

The feature summary additions properly inform users about LoongArch-specific optimizations that are enabled or disabled during configuration.
cmake/toolchain-loongarch64-gcc-14.cmake (1)
9-9: Verify QEMU CPU model and sysroot path.

The emulator configuration uses `la464-loongarch-cpu` and assumes the sysroot is at `/usr/${CMAKE_C_COMPILER_TARGET}/`. Ensure these paths and CPU models are available in the target environment.

```shell
#!/bin/bash
# Description: Verify QEMU LoongArch64 support and sysroot availability

# Check if QEMU supports la464-loongarch-cpu
qemu-loongarch64 -cpu help 2>/dev/null | grep -q "la464-loongarch-cpu" || echo "QEMU CPU model la464-loongarch-cpu not found"

# Check typical sysroot locations
for path in /usr/loongarch64-linux-gnu /usr/loongarch64-unknown-linux-gnu; do
    if [ -d "$path" ]; then
        echo "Found potential sysroot: $path"
    fi
done
```

cmake/detect-intrinsics.cmake (3)
680-692: LGTM on CRC intrinsics detection.

The macro correctly tests LoongArch CRC intrinsics using `<larchintrin.h>` and `__crc_w_b_w`. The test program is minimal and appropriate.

694-709: LGTM on LSX intrinsics detection.

The macro properly sets the `-mlsx` flag and tests LSX intrinsics using `<lsxintrin.h>` and `__lsx_vabsd_b`. The structure follows established patterns.

728-743: LGTM on LASX intrinsics detection.

The macro correctly configures the `-mlasx` flag and tests LASX intrinsics using `<lasxintrin.h>` and `__lasx_xvabsd_b`. The implementation is consistent with other intrinsic detection macros.

arch/loongarch/lsxintrin_ext.h (3)
11-16: LGTM on sum of absolute differences implementation.

The `lsx_sad_bu` function correctly implements sum of absolute differences using LSX intrinsics. The sequence of horizontal add operations progressively reduces from byte to doubleword level.
18-20: LGTM on movemask implementation.

The `lsx_movemask_b` function properly extracts sign bits using `__lsx_vmskltz_b` and retrieves the result with `__lsx_vpickve2gr_w`. This mirrors x86 SSE `_mm_movemask_epi8` functionality.
22-31: LGTM on byte shuffle implementation.

The `lsx_shuffle_b` function correctly implements byte shuffling with MSB handling:

- Detects negative indices (MSB set) using `__lsx_vslti_b`
- Masks indices to 4 bits for the shuffle operation
- Clears destination bytes where source indices had the MSB set

This properly emulates x86 SSSE3 `_mm_shuffle_epi8` behavior.

arch/loongarch/compare256_lasx.c (3)
17-46: LGTM on LASX compare256 implementation.

The function correctly implements vectorized byte comparison using LASX intrinsics:

- Loads 32-byte chunks using `__lasx_xvld`
- Performs byte-wise equality check with `__lasx_xvseq_b`
- Extracts a bitmask and uses `__builtin_ctz` to find the first difference
- Properly handles loop unrolling and length tracking

The logic mirrors the AVX2 implementation and should provide good performance.
52-61: LGTM on template macro definitions.

The macro definitions correctly set up the template parameters for both regular and slow longest match variants:

- `LONGEST_MATCH` and `COMPARE256` macros are properly defined
- Both regular and slow variants are included
- Template inclusion pattern follows established conventions
15-15: Verify availability of lasx_movemask_b function.

The code references `lasx_movemask_b` (lines 25, 36) but this function isn't defined in this file. Ensure it's properly declared in `lasxintrin_ext.h`.

```shell
#!/bin/bash
# Description: Check if lasx_movemask_b is defined in lasxintrin_ext.h

# Search for lasx_movemask_b definition
rg -A 5 "lasx_movemask_b" arch/loongarch/
```

arch/loongarch/lasxintrin_ext.h (5)
12-17: LGTM! Correct SAD implementation.

The sum of absolute differences implementation follows the correct pattern of computing absolute differences followed by progressive horizontal additions to accumulate the result.

19-22: LGTM! Correct movemask implementation.

The function correctly extracts sign bits from all 32 bytes in the 256-bit vector by using mask generation and combining the results from two words.

24-27: LGTM! Correct casting implementation.

The function correctly zero-extends a 128-bit vector to 256-bit by placing the input in the lower half and zeroing the upper half.
29-39: LGTM! Correct vector insertion and extension.

Both functions are implemented correctly:

- `lasx_inserti128_si256` properly handles insertion into the lower (imm8==0) or upper (imm8!=0) half
- `lasx_zextsi128_si256` correctly zero-extends the input vector

The commented alternative implementation for `lasx_zextsi128_si256` could be more explicit about the zero-extension intent, but the current implementation is more efficient.
42-46: LGTM! Correct shuffle implementation with negative index handling.

The function correctly implements byte shuffling with proper handling of negative indices (setting corresponding output bytes to zero), matching the behavior of x86 shuffle intrinsics.
arch/loongarch/crc32_la.c (2)
15-37: LGTM! Correct CRC32 implementation using LoongArch hardware instructions.

The function correctly:
- Inverts the input CRC at the beginning and output at the end (standard CRC32 practice)
- Processes data in optimal chunk sizes using appropriate LoongArch CRC instructions
- Uses proper memory reading functions that should handle endianness correctly
- Follows the established pattern for hardware-accelerated CRC implementations
41-49: LGTM! Correct wrapper implementations.

Both wrapper functions properly integrate with the fold infrastructure:

- `crc32_fold_copy_loongarch64` correctly updates the CRC and copies data
- `crc32_fold_loongarch64` correctly updates the CRC with proper parameter handling
- The `Z_UNUSED` annotation for `init_crc` is appropriate as noted in the comment
arch/loongarch/compare256_lsx.c (2)
17-82: LGTM! Well-implemented SIMD comparison with proper alignment handling.

The function correctly implements 256-byte buffer comparison:

- Proper handling of unaligned access with an initial unaligned load followed by aligned processing
- Correct use of LSX intrinsics for 16-byte vector comparisons
- Appropriate use of `__builtin_ctz` to find the first differing bit
- Sound algorithm for handling alignment boundaries and remaining bytes
- Logic follows established patterns from other architecture implementations
84-98: LGTM! Correct template usage and wrapper implementation.

The code properly:

- Provides a clean public interface via the `compare256_lsx` wrapper
- Uses the template system correctly to instantiate both fast and slow longest match variants
- Follows established patterns from other architecture implementations
arch/loongarch/adler32_lsx.c (3)
18-30: LGTM! Correct horizontal sum implementations.

Both helper functions properly implement horizontal summation:

- `partial_hsum` correctly sums adjacent 32-bit word pairs
- `hsum` correctly sums all four 32-bit elements using standard interleaving and addition patterns
32-146: LGTM! Sophisticated SIMD Adler32 implementation.

This is a well-structured vectorized Adler32 implementation that:

- Correctly handles overflow prevention by processing chunks of maximum size NMAX
- Uses appropriate vector accumulators for both s1 and s2 components
- Implements efficient 32-byte and 16-byte processing loops with proper weight vectors
- Properly handles both copy and non-copy variants through the COPY template parameter
- Follows established patterns from other architectures' optimized Adler32 implementations
- Correctly applies modular reduction and combines results at the end

The complexity is justified for the performance benefits of SIMD optimization.
148-154: LGTM! Clean wrapper functions.

Both wrapper functions properly instantiate the template implementation with appropriate parameters for copy vs non-copy variants.
configure (4)
107-107: LGTM! Properly integrated LoongArch configuration variables.

The new configuration variables and options are well-integrated:

- `buildcrc32la` follows the established pattern for optional features
- The `lsxflag` and `lasxflag` compiler flags follow naming conventions
- The command-line option `--without-crc32-la` matches the established pattern
- Architecture detection for `loongarch64` is properly added to the existing case structure

Also applies to: 122-123, 230-230, 365-366
1717-1772: LGTM! Well-structured intrinsic detection functions.

All three intrinsic detection functions follow the established pattern:

- Proper test programs with appropriate headers and representative intrinsics
- Correct compiler flag usage (`-mlsx`, `-mlasx`)
- Consistent variable naming and logging
- Test intrinsics are appropriate for validating each instruction set
2293-2332: LGTM! Comprehensive LoongArch architecture support.

The LoongArch case properly implements:
- Correct architecture directory setup
- Conditional feature detection based on optimization settings
- Appropriate intrinsic checking for CRC32, LSX, and LASX
- Proper compiler flag and object file management
- Consistent naming and structure matching other architecture implementations
- Complete coverage of all LoongArch-specific optimizations introduced in this PR
2400-2401: LGTM! Proper Makefile integration.

The Makefile updates correctly:
- Log the new compiler flags following the established pattern
- Substitute the flags into the architecture-specific Makefile using the standard sed pattern
- Maintain consistency with other architecture flag handling
Also applies to: 2551-2552
arch/loongarch/chunkset_lsx.c (3)
1-14: LGTM: Well-structured includes and conditional compilation.

The file header, copyright, and includes are properly organized. The conditional compilation with `LOONGARCH_LSX` ensures the code is only compiled when LSX support is available.

23-33: LGTM: Correct LSX intrinsic usage for chunk replication.

The `chunkmemset_*` functions correctly use LSX intrinsics to replicate small data patterns across 128-bit vectors:

- `__lsx_vreplgr2vr_h` for 2-byte patterns
- `__lsx_vreplgr2vr_w` for 4-byte patterns
- `__lsx_vreplgr2vr_d` for 8-byte patterns

This follows the expected pattern for SIMD chunk operations.
43-59: Verify memory sanitizer integration and permutation logic.

The `GET_CHUNK_MAG` function has complex logic for byte permutation. While the `__msan_unpoison` usage is documented as intentional, verify that:

- The permutation table indexing is bounds-safe
- The `lut_rem.idx` value is properly validated
- The `lsx_shuffle_b` operation handles all edge cases correctly

```shell
#!/bin/bash
# Description: Verify the permutation table and lookup usage is consistent across architectures
# Expected: Similar patterns in other SIMD implementations

# Check if permutation table is properly defined
fd -e h -e c | xargs rg -l "permute_table" | head -5

# Check bounds of perm_idx_lut usage
rg -A 5 -B 5 "perm_idx_lut\[.*\]" --type c
```

arch/loongarch/adler32_lasx.c (3)
20-32: LGTM: Correct LASX horizontal sum implementation.

The `hsum256` and `partial_hsum256` functions correctly implement horizontal sum operations using LASX intrinsics:

- Proper use of `__lasx_xvadd_w` for 32-bit additions
- Correct bit shifting and permutation for reduction
- Appropriate vector element extraction

This follows the standard pattern for SIMD horizontal reductions.
87-87:

```shell
#!/bin/bash
# Display context around the usage of lasx_sad_bu in arch/loongarch/adler32_lasx.c
sed -n '1,200p' arch/loongarch/adler32_lasx.c
```

142-143:

```shell
#!/bin/bash
# Locate and inspect partial_hsum256 and hsum256 definitions
rg -n -C5 'partial_hsum256' --type c
rg -n -C5 'hsum256' --type c

# Show surrounding code in arch/loongarch/adler32_lasx.c for context
rg -n -C5 'adler0 = partial_hsum256' arch/loongarch/adler32_lasx.c
```

arch/loongarch/chunkset_lasx.c (3)
38-41: LGTM: Correct implementation of 16-byte chunk replication.

The `chunkmemset_16` function correctly implements 16-byte pattern replication using:

- Load of the 128-bit pattern with `__lsx_vld`
- Broadcast to both 128-bit lanes of the 256-bit vector using `lasx_inserti128_si256`

This is the standard approach for creating 256-bit vectors from 128-bit patterns.
62-83: Review complex cross-lane permutation logic.

The `GET_CHUNK_MAG` function has complex branching logic for handling different distance ranges. The `dist < 16` case uses a different approach than the `>= 16` case.

Key concerns:

- The `permute_xform` offset calculation with `lasx_inserti128_si256`
- Cross-lane permutation with `__lsx_vslti_b` and `__lsx_vbitsel_v`
- Memory access patterns for `buf + 16`

```shell
#!/bin/bash
# Description: Verify similar cross-lane permutation logic in other SIMD implementations
# Expected: Similar patterns in AVX2 implementations

# Check for similar cross-lane logic in other architectures
rg -A 10 -B 5 "dist.*16.*permute\|cross.*lane" --type c

# Check if lasx_inserti128_si256 is properly defined
rg "lasx_inserti128_si256" --type c --type h
```
102-113:

```shell
#!/bin/bash
# Check includes in the LoongArch chunkset implementation
rg '#include' -n arch/loongarch/chunkset_lasx.c
echo
echo "First 30 lines of arch/loongarch/chunkset_lasx.c for context:"
sed -n '1,30p' arch/loongarch/chunkset_lasx.c
```

arch/loongarch/loongarch_functions.h (3)

8-10: LGTM: Proper header guard implementation.

The header guard `LOONGARCH_FUNCTIONS_H_` follows the naming convention and prevents multiple inclusions correctly.

45-99: Verify macro redefinition safety.

The `DISABLE_RUNTIME_CPU_DETECTION` section redefines many `native_*` macros. While this approach is common, ensure that:

1. The original macros are properly undefined before redefinition
2. The redefinition order respects LASX > LSX > CRC precedence
3. No macro redefinition warnings occur during compilation

```shell
#!/bin/bash
# Description: Check for potential macro redefinition issues
# Expected: Proper undef/define pairs, no redefinition warnings

# Check if similar patterns exist in other architecture headers
rg -A 10 -B 5 "DISABLE_RUNTIME_CPU_DETECTION" --type h

# Look for macro redefinition patterns
rg -A 3 -B 1 "#undef.*native_.*#define.*native_" --type h
```
21-25:

```shell
#!/bin/bash
# Inspect LOONGARCH_LSX guard in loongarch_functions.h
rg -n 'LOONGARCH_LSX' -A3 -B3 arch/loongarch/loongarch_functions.h

# Inspect LOONGARCH_LASX guard in loongarch_functions.h
rg -n 'LOONGARCH_LASX' -A3 -B3 arch/loongarch/loongarch_functions.h
```
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff             @@
##           develop    #1925      +/-   ##
============================================
+ Coverage    39.39%   81.43%   +42.04%
============================================
  Files           74      162       +88
  Lines         7885    13916     +6031
  Branches      1303     3118     +1815
============================================
+ Hits          3106    11333     +8227
+ Misses        4536     1582     -2954
- Partials       243     1001      +758
```

☔ View full report in Codecov by Sentry.
This looks pretty complete. I think the README.md features section might need to be updated to add LSX/LASX.
The unit tests all pass, and benchmarks show a significant performance increase.
Thanks for the help with testing.
I'm not familiar with Loongson assembly syntax or intrinsics, so I can't verify whether there are any hidden bugs that CI isn't catching, but generally the patch is starting to look pretty clean. As with any new "port", there might be further improvements that can be made, but not everything needs to be in the initial version. The general rule is to use inline assembly only if there is no working intrinsic that does the same thing efficiently. Having at least one person familiar with Loongson testing the code is a good thing, but some could argue there should be more than one.
Unfortunately, compilers provide no intrinsics for this function.
Then it's OK to use inline assembly. Using inline assembly will sometimes disable certain compiler optimizations, but it's still generally safer not to let the compiler optimize code that it might not fully understand.
Rebased |
Actionable comments posted: 0
🧹 Nitpick comments (3)
arch/loongarch/lasxintrin_ext.h (3)
`21-36`: **Consider documenting the xvpermi.q immediate value.**

The implementation is correct, but the magic number `0x20` in the `xvpermi.q` instruction could benefit from a comment explaining that it selects the lower quadword for zero extension.

```diff
 static inline __m256i lasx_zextsi128_si256(__m128i in) {
     __m256i out = __lasx_xvldi(0);
     __asm__ volatile (
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[out], " LA_XREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[in], " LA_VREGS_PREFIX "\\j \n\t"
+        "    # 0x20: copy lower 128 bits to output\n\t"
         "    xvpermi.q $xr\\i, $xr\\j, 0x20 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
         : [out] "+f" (out)
         : [in] "f" (in)
     );
     return out;
 }
```
`38-65`: **Complex but correct implementation - consider adding comments.**

The inline assembly correctly handles register allocation and vector concatenation. Consider adding comments to explain the `xvpermi.q` operation and the purpose of the `.ifnc` block.

```diff
 static inline __m256i lasx_set_si128(__m128i inhi, __m128i inlo) {
     __m256i out;
     __asm__ volatile (
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[hi], " LA_VREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[lo], " LA_VREGS_PREFIX "\\j \n\t"
+        "    # 0x20: place lo in lower 128 bits, hi in upper 128 bits\n\t"
         "    xvpermi.q $xr\\i, $xr\\j, 0x20 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
+        "# If output register differs from hi, copy the result\n\t"
         ".ifnc %[out], %[hi] \n\t"
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[out], " LA_XREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[hi], " LA_VREGS_PREFIX "\\j \n\t"
         "    xvori.b $xr\\i, $xr\\j, 0 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
         ".endif \n\t"
         : [out] "=f" (out), [hi] "+f" (inhi)
         : [lo] "f" (inlo)
     );
     return out;
 }
```
`90-91`: **Add newline at end of file.**

```diff
 #endif  // include guard LASXINTRIN_EXT_H
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (41)
- .github/workflows/cmake.yml (1 hunks)
- .github/workflows/configure.yml (1 hunks)
- CMakeLists.txt (5 hunks)
- README.md (1 hunks)
- arch/arm/chunkset_neon.c (1 hunks)
- arch/generic/chunk_128bit_perm_idx_lut.h (1 hunks)
- arch/generic/chunk_256bit_perm_idx_lut.h (1 hunks)
- arch/loongarch/Makefile.in (1 hunks)
- arch/loongarch/adler32_lasx.c (1 hunks)
- arch/loongarch/adler32_lsx.c (1 hunks)
- arch/loongarch/chunkset_lasx.c (1 hunks)
- arch/loongarch/chunkset_lsx.c (1 hunks)
- arch/loongarch/compare256_lasx.c (1 hunks)
- arch/loongarch/compare256_lsx.c (1 hunks)
- arch/loongarch/crc32_la.c (1 hunks)
- arch/loongarch/lasxintrin_ext.h (1 hunks)
- arch/loongarch/loongarch_features.c (1 hunks)
- arch/loongarch/loongarch_features.h (1 hunks)
- arch/loongarch/loongarch_functions.h (1 hunks)
- arch/loongarch/lsxintrin_ext.h (1 hunks)
- arch/loongarch/slide_hash_lasx.c (1 hunks)
- arch/loongarch/slide_hash_lsx.c (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/chunkset_ssse3.c (1 hunks)
- arch_functions.h (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- cmake/toolchain-loongarch64-gcc-14.cmake (1 hunks)
- configure (8 hunks)
- cpu_features.c (1 hunks)
- cpu_features.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32.cc (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/benchmarks/benchmark_slidehash.cc (1 hunks)
- test/data/.gitignore (1 hunks)
- test/test_adler32.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_crc32.cc (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- arch/x86/chunkset_avx512.c
- arch_functions.h
🚧 Files skipped from review as they are similar to previous changes (37)
- test/data/.gitignore
- arch/x86/chunkset_avx2.c
- cpu_features.c
- arch/arm/chunkset_neon.c
- test/benchmarks/benchmark_slidehash.cc
- arch/x86/chunkset_ssse3.c
- test/benchmarks/benchmark_adler32_copy.cc
- test/benchmarks/benchmark_crc32.cc
- arch/loongarch/Makefile.in
- README.md
- test/benchmarks/benchmark_adler32.cc
- arch/loongarch/loongarch_features.c
- test/benchmarks/benchmark_compare256.cc
- arch/generic/chunk_128bit_perm_idx_lut.h
- arch/generic/chunk_256bit_perm_idx_lut.h
- .github/workflows/configure.yml
- arch/loongarch/slide_hash_lasx.c
- arch/loongarch/slide_hash_lsx.c
- cpu_features.h
- arch/loongarch/loongarch_features.h
- test/test_adler32.cc
- CMakeLists.txt
- arch/loongarch/compare256_lasx.c
- functable.c
- arch/loongarch/crc32_la.c
- test/test_crc32.cc
- arch/loongarch/compare256_lsx.c
- test/test_compare256.cc
- arch/loongarch/adler32_lasx.c
- arch/loongarch/chunkset_lsx.c
- .github/workflows/cmake.yml
- arch/loongarch/chunkset_lasx.c
- cmake/toolchain-loongarch64-gcc-14.cmake
- configure
- arch/loongarch/loongarch_functions.h
- arch/loongarch/lsxintrin_ext.h
- cmake/detect-intrinsics.cmake
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-18T02:53:23.621Z
Learning: In zlib-ng, when macro names and parameters are self-explanatory and scalar versions of the code provide context, additional documentation for the macros may not be necessary.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
arch/loongarch/adler32_lsx.c (6)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
arch/loongarch/lasxintrin_ext.h (12)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T03:09:54.717Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-12T13:02:26.066Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
🧬 Code Graph Analysis (1)
arch/loongarch/adler32_lsx.c (3)
test/benchmarks/benchmark_adler32_copy.cc (1)
adler(22-22)arch/loongarch/lsxintrin_ext.h (2)
lsx_sad_bu(11-16)lsx_sad_bu(11-11)arch/loongarch/loongarch_functions.h (2)
adler32_lsx(18-18)adler32_fold_copy_lsx(19-19)
⏰ Context from checks skipped due to timeout of 90000ms (114)
- GitHub Check: Windows GCC
- GitHub Check: Windows MSVC 2022 v142 Win32
- GitHub Check: Windows MSVC 2022 v141 Win32
- GitHub Check: Windows MSVC 2022 v140 Win32
- GitHub Check: Windows MSVC 2022 v143 Win64
- GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
- GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
- GitHub Check: Ubuntu GCC SSSE3 UBSAN
- GitHub Check: Ubuntu GCC No Chorba
- GitHub Check: Ubuntu GCC ASAN
- GitHub Check: macOS GCC (ARM64)
- GitHub Check: macOS Clang (Target 10.10)
- GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
🔇 Additional comments (10)
arch/loongarch/adler32_lsx.c (4)
`10-30`: **LGTM! Helper functions correctly implement horizontal sums.**

The `partial_hsum` and `hsum` functions properly use LSX intrinsics to compute horizontal sums of vector elements, which are essential for the Adler-32 algorithm.
`32-57`: **Function initialization and constants are correct.**

The NULL handling, adler splitting, and dot-product vectors are properly set up for the Adler-32 algorithm. The weighted sum vectors (32→17 and 16→1) correctly implement the position-based weighting required by the algorithm.
`58-138`: **Main processing loops correctly implement vectorized Adler-32.**

The implementation efficiently processes data in 32-byte and 16-byte chunks using LSX SIMD intrinsics. The use of NMAX to limit iterations before modulo reduction prevents overflow, and the weighted sum calculations correctly implement the Adler-32 algorithm.
`140-154`: **Clean interface design with proper remainder handling.**

The use of `goto` for remainder handling is appropriate here, as it efficiently reuses the existing small-buffer code path. The public functions provide a clean interface by wrapping the implementation with the appropriate COPY flag.

arch/loongarch/lasxintrin_ext.h (6)
`1-10`: **Header structure looks good.**

The copyright notice, include guards, and required headers are properly implemented.
`12-20`: **Compiler-specific register prefixes are correctly handled.**

The differentiation between Clang and GCC register prefixes follows LoongArch conventions, and the `LA_ALL_REGS` macro provides a clean way to iterate through all registers in inline assembly.
`67-69`: **Clean broadcast implementation.**

Efficiently reuses `lasx_set_si128` to duplicate the 128-bit vector into both halves.
`71-76`: **Correct SAD implementation with proper width progression.**

The function correctly implements sum of absolute differences using a logical sequence of horizontal additions that progressively widen from bytes to doublewords.
`78-81`: **Efficient movemask implementation.**

Correctly extracts sign bits from all 32 bytes and combines them into a 32-bit integer mask.
`83-88`: **Correct shuffle implementation matching x86 semantics.**

The function properly implements byte shuffling with masking, zeroing bytes when the corresponding index has its MSB set, which matches the behavior of x86's `_mm_shuffle_epi8`.
PR added CI using qemu for LoongArch (first commit).
Testing with qemu userland emulation is a good start, but the emulation isn't perfect, so some issues might still linger that can only be detected when running on real hardware. There has been quite a long discussion about serious bugs in qemu that haven't been fixed since qemu 4.x. The latest qemu version is currently 10.0.2, so that's quite a long time.
Needs a rebase after #1930
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
…ntation Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Co-authored-by: junchao-zhao <zhaojunchao@loongson.cn> Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
…lignment Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Ready for merging. |
This implementation is based on replacing x86 intrinsics with LoongArch intrinsics.
Test on real hardware:
Compression performance increase: 6.44%
Decompression performance increase: 31.99%
Summary by CodeRabbit
New Features
Documentation
Tests
Chores