Port SSE/AVX optimization to Loongarch64 LSX/LASX Vector Intrinsics#1925
Dead2 merged 13 commits into zlib-ng:develop from
Conversation
Walkthrough

This change introduces comprehensive support for the LoongArch architecture, including new SIMD-optimized implementations for CRC32, Adler32, chunk memory operations, compare256, and slide hash. It adds LoongArch-specific build options, intrinsic detection, runtime feature detection, function pointer assignments, toolchain files, and extends benchmarks, tests, and documentation to recognize and exercise these new features.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant BuildSystem as Build System (CMake/configure)
    participant CPU as LoongArch CPU
    participant App as Application/Library
    participant Functable as Function Table
    participant Features as loongarch_features.c
    participant SIMD as SIMD-optimized Functions
    BuildSystem->>CPU: Detect LoongArch architecture
    BuildSystem->>BuildSystem: Check for CRC/LSX/LASX intrinsics
    BuildSystem->>BuildSystem: Enable LoongArch-specific build flags and sources
    App->>Features: loongarch_check_features()
    Features->>CPU: Read CPU config register
    Features-->>App: Return has_crc, has_lsx, has_lasx
    App->>Functable: init_functable()
    Functable->>App: Query CPU features
    Functable->>SIMD: Assign function pointers for CRC32, Adler32, etc. based on features
    App->>SIMD: Call optimized functions at runtime
```
Actionable comments posted: 3
🧹 Nitpick comments (4)
arch/loongarch/loongarch_features.c (1)
22-27: LGTM! Consider using bit shift operations for clarity.

The CPU feature detection implementation is correct and follows the LoongArch documentation. The bit masks accurately correspond to the documented positions.

For improved readability and maintainability, consider using bit shift operations instead of hex literals:

```diff
 void Z_INTERNAL loongarch_check_features(struct loongarch_cpu_features *features) {
     unsigned int w1 = __cpucfg(0x1);
-    features->has_crc = w1 & 0x2000000;
-    features->has_lsx = w1 & 0x40;
-    features->has_lasx = w1 & 0x80;
+    features->has_crc = w1 & (1U << 25);
+    features->has_lsx = w1 & (1U << 6);
+    features->has_lasx = w1 & (1U << 7);
 }
```

This makes the bit positions explicit and matches the documentation comments.
arch/loongarch/slide_hash_lsx.c (1)
19-54: Review the alignment assumptions and loop structure.

The function assumes 16-byte alignment (`assert` statements on lines 61-62) but processes 32 bytes per iteration (lines 37-42). Consider these potential issues:

- Alignment assumption: The comment mentions "64 byte boundaries" but the asserts only guarantee 16-byte alignment
- Loop structure: Using `goto` for control flow (line 52) reduces readability

Consider refactoring to eliminate `goto`:

```diff
-    int on_chain = 0;
-
-next_chain:
-    table = (on_chain) ? table1 : table0;
-    entries = (on_chain) ? entries1 : entries0;
-
-    table += entries;
-    table -= 16;
-
-    /* ZALLOC allocates this pointer unless the user chose a custom allocator.
-     * Our alloc function is aligned to 64 byte boundaries */
-    do {
-        value0 = __lsx_vld(table, 0);
-        value1 = __lsx_vld(table, 16);
-        result0 = __lsx_vssub_hu(value0, wsize);
-        result1 = __lsx_vssub_hu(value1, wsize);
-        __lsx_vst(result0, table, 0);
-        __lsx_vst(result1, table, 16);
-
-        table -= 16;
-        entries -= 16;
-    } while (entries > 0);
-
-    ++on_chain;
-    if (on_chain > 1) {
-        return;
-    } else {
-        goto next_chain;
-    }
+    for (int chain = 0; chain < 2; chain++) {
+        table = (chain == 1) ? table1 : table0;
+        entries = (chain == 1) ? entries1 : entries0;
+
+        table += entries;
+        table -= 16;
+
+        do {
+            value0 = __lsx_vld(table, 0);
+            value1 = __lsx_vld(table, 16);
+            result0 = __lsx_vssub_hu(value0, wsize);
+            result1 = __lsx_vssub_hu(value1, wsize);
+            __lsx_vst(result0, table, 0);
+            __lsx_vst(result1, table, 16);
+
+            table -= 16;
+            entries -= 16;
+        } while (entries > 0);
+    }
```

cmake/toolchain-loongarch64-gcc-14.cmake (1)
22-25: Consider consistent error handling for C++ compiler detection.

The C compiler detection fails fatally if not found (lines 17-19), but C++ compiler detection is optional. This asymmetry might cause build issues if C++ compilation is required but the compiler isn't found.

Consider adding a warning or making the behavior consistent:

```diff
 find_program(CXX_COMPILER_FULL_PATH NAMES g++-14-${CMAKE_CXX_COMPILER_TARGET} ${CMAKE_CXX_COMPILER_TARGET}-g++-14)
 if(CXX_COMPILER_FULL_PATH)
     set(CMAKE_CXX_COMPILER ${CXX_COMPILER_FULL_PATH})
+else()
+    message(WARNING "C++ cross-compiler for ${CMAKE_CXX_COMPILER_TARGET} not found")
 endif()
```

arch/loongarch/adler32_lasx.c (1)
62-64: Consider reflowing the initializer for better readability.

The vector initialization uses a compound literal which is correct but could be more readable.

```diff
-    const __m256i dot2v = (__m256i)((v32i8){ 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
-                                             14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 });
+    const __m256i dot2v = (__m256i)((v32i8){
+        32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17,
+        16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
+    });
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (39)
- .github/workflows/cmake.yml (1 hunks)
- .github/workflows/configure.yml (1 hunks)
- CMakeLists.txt (5 hunks)
- arch/arm/chunkset_neon.c (1 hunks)
- arch/generic/chunk_128bit_perm_idx_lut.h (1 hunks)
- arch/generic/chunk_256bit_perm_idx_lut.h (1 hunks)
- arch/loongarch/Makefile.in (1 hunks)
- arch/loongarch/adler32_lasx.c (1 hunks)
- arch/loongarch/adler32_lsx.c (1 hunks)
- arch/loongarch/chunkset_lasx.c (1 hunks)
- arch/loongarch/chunkset_lsx.c (1 hunks)
- arch/loongarch/compare256_lasx.c (1 hunks)
- arch/loongarch/compare256_lsx.c (1 hunks)
- arch/loongarch/crc32_la.c (1 hunks)
- arch/loongarch/lasxintrin_ext.h (1 hunks)
- arch/loongarch/loongarch_features.c (1 hunks)
- arch/loongarch/loongarch_features.h (1 hunks)
- arch/loongarch/loongarch_functions.h (1 hunks)
- arch/loongarch/lsxintrin_ext.h (1 hunks)
- arch/loongarch/slide_hash_lasx.c (1 hunks)
- arch/loongarch/slide_hash_lsx.c (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/chunkset_ssse3.c (1 hunks)
- arch_functions.h (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- cmake/toolchain-loongarch64-gcc-14.cmake (1 hunks)
- configure (8 hunks)
- cpu_features.c (1 hunks)
- cpu_features.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32.cc (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/benchmarks/benchmark_slidehash.cc (1 hunks)
- test/test_adler32.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_crc32.cc (1 hunks)
🔇 Additional comments (69)
arch/generic/chunk_256bit_perm_idx_lut.h (3)
1-5: LGTM! Well-structured header refactoring.

The conversion from architecture-specific to generic header is well-executed. The new header guard `CHUNK_256BIT_PERM_IDX_LUT_H_` follows naming conventions and the copyright header is properly added.
11-46: Lookup tables look correct.

The permutation index lookup table and half remainder values are properly structured for 256-bit operations and maintain the same data as the original AVX2-specific implementation.
7-7: Verify the include path change works correctly.

The include path changed from relative "../generic/chunk_permute_table.h" to direct "chunk_permute_table.h". Ensure the build system's include paths are configured to locate this header correctly across all architectures that use this generic lookup table.

```shell
#!/bin/bash
# Description: Verify that chunk_permute_table.h can be found by the build system
# Expected: The header should exist and be accessible

# Find the chunk_permute_table.h file
echo "Looking for chunk_permute_table.h:"
fd "chunk_permute_table.h"

# Check if it exists in expected locations
echo -e "\nChecking common include paths:"
ls -la arch/generic/chunk_permute_table.h 2>/dev/null || echo "Not found in arch/generic/"
ls -la chunk_permute_table.h 2>/dev/null || echo "Not found in root"
```

arch/x86/chunkset_avx512.c (1)
9-9: LGTM! Proper migration to generic header.

The include change from "avx2_tables.h" to "arch/generic/chunk_256bit_perm_idx_lut.h" correctly uses the new consolidated generic header. This maintains the same lookup table data while enabling reuse across architectures.

arch/x86/chunkset_avx2.c (1)
8-8: LGTM! Consistent with AVX512 refactoring.

The include change mirrors the same refactoring applied to AVX512, properly consolidating the permutation lookup tables into the generic header.
arch_functions.h (1)
27-28: LGTM! Proper LoongArch architecture integration.

The addition of LoongArch support follows the established pattern exactly, using `elif defined(LOONGARCH_FEATURES)` and including the architecture-specific functions header. This maintains consistency with the existing architecture selection logic.

cpu_features.c (1)
22-23: LGTM! Consistent CPU feature detection pattern.

The LoongArch CPU feature detection follows the exact same pattern as other architectures, calling `loongarch_check_features(&features->loongarch)` within the conditional compilation chain.

```shell
#!/bin/bash
# Description: Verify that loongarch_check_features function is declared and defined
# Expected: Function should be declared in header and defined in source

echo "Searching for loongarch_check_features function:"
rg -A 3 "loongarch_check_features"

echo -e "\nLooking for LoongArch-specific files:"
fd "loongarch" --type f
```

arch/arm/chunkset_neon.c (1)
10-10: LGTM: Good refactoring to use shared lookup table.

This change eliminates code duplication by centralizing the permutation index lookup table across multiple architectures (SSSE3, NEON, LSX). The shared header approach improves maintainability and consistency.
arch/x86/chunkset_ssse3.c (1)
10-11: LGTM: Consistent refactoring to shared headers.

These includes align with the broader effort to unify permutation index lookup tables across architectures. The changes maintain existing functionality while improving code organization and reducing duplication.
test/benchmarks/benchmark_adler32_copy.cc (1)
130-133: LGTM: Proper benchmark coverage for LoongArch LASX.

The benchmark additions follow the established pattern used by other architectures:

- Proper conditional compilation with `LOONGARCH_LASX`
- Runtime feature detection using `test_cpu_features.loongarch.has_lasx`
- Both baseline and optimized variants for comprehensive performance comparison

This ensures the new LoongArch LASX implementation gets proper benchmark coverage.
arch/generic/chunk_128bit_perm_idx_lut.h (1)
1-27: LGTM: Well-structured shared lookup table header.

The header is properly implemented with:

- Correct header guards to prevent multiple inclusion
- Appropriate `static const` declaration for the lookup table
- Proper dependency on `chunk_permute_table.h` for the `lut_rem_pair` type
- Clear documentation of purpose and usage

This centralizes the permutation index lookup table that was previously duplicated across SSSE3, NEON, and now LSX implementations, improving maintainability.
test/test_crc32.cc (1)
324-326: LGTM: Proper test coverage for LoongArch CRC32.

The test addition correctly follows the established pattern:

- Uses the standard `TEST_CRC32` macro for consistency
- Proper conditional compilation with `LOONGARCH_CRC`
- Runtime feature detection using `test_cpu_features.loongarch.has_crc`
- Function naming convention matches other architectures

This ensures the new LoongArch CRC32 implementation gets the same comprehensive test coverage as other architecture-specific implementations.
cpu_features.h (1)
21-22: LGTM! LoongArch support follows established architecture pattern.

The addition of LoongArch CPU features detection is well-implemented and consistent with the existing architecture support pattern used for X86, ARM, PPC, S390, and RISCV.
Also applies to: 36-37
test/benchmarks/benchmark_crc32.cc (1)
100-102: LGTM! LoongArch CRC32 benchmark follows established pattern.

The benchmark registration is correctly implemented using the existing `BENCHMARK_CRC32` macro with appropriate conditional compilation and CPU feature detection.

test/test_adler32.cc (1)
395-400: LGTM! LoongArch Adler32 tests follow established SIMD testing pattern.

The test instantiations for both LSX and LASX variants are correctly implemented with proper conditional compilation and runtime CPU feature detection, consistent with other SIMD architecture tests.
test/test_compare256.cc (1)
94-99: LGTM! LoongArch compare256 tests follow established SIMD pattern.

The test cases for LSX and LASX variants correctly require both the architecture-specific macros and `HAVE_BUILTIN_CTZ`, consistent with the pattern used by X86 SSE2/AVX2 implementations. The runtime feature detection is properly implemented.

test/benchmarks/benchmark_compare256.cc (1)
95-100: LGTM! LoongArch compare256 benchmarks follow established SIMD pattern.

The benchmark registrations for LSX and LASX variants correctly mirror the test implementation pattern, requiring both architecture-specific macros and `HAVE_BUILTIN_CTZ`, consistent with X86 SSE2/AVX2 benchmarks.

test/benchmarks/benchmark_adler32.cc (1)
100-106: LGTM! Consistent with existing SIMD benchmark patterns.

The LoongArch LSX and LASX benchmark registrations correctly follow the established pattern used by other SIMD implementations. The conditional compilation guards and runtime feature checks are appropriate.
test/benchmarks/benchmark_slidehash.cc (1)
98-104: LGTM! Consistent implementation following established patterns.

The LoongArch slide_hash benchmark registrations correctly implement the same pattern used throughout the file for other SIMD extensions. The conditional compilation and runtime feature detection are properly implemented.
arch/loongarch/Makefile.in (2)
13-14: LGTM! Proper compiler flags for LoongArch SIMD extensions.

The LSX and LASX compiler flags are correctly defined and will enable the appropriate SIMD instruction sets during compilation.
44-91: Verify the NOLTOFLAG variable definition.

The Makefile structure is well-organized and includes appropriate build rules for all LoongArch-specific source files. However, the `NOLTOFLAG` variable is used in multiple targets but not defined in this Makefile.

Please verify that `NOLTOFLAG` is properly defined in the parent Makefile or build system:

```shell
#!/bin/bash
# Description: Check if NOLTOFLAG is defined in parent Makefiles
# Expected: Find NOLTOFLAG definition in build system

# Search for NOLTOFLAG definition in Makefile hierarchy
rg -A 2 "NOLTOFLAG.*=" .
```

.github/workflows/cmake.yml (1)
313-328: LGTM! Comprehensive CI configuration for LoongArch64.

The GitHub Actions workflow additions properly implement LoongArch64 cross-compilation testing with GCC 14. The dual configuration approach (with and without LASX) ensures both instruction set variants are tested. The package selection, toolchain configuration, and coverage reporting setup follow the established patterns used for other cross-compilation targets.
.github/workflows/configure.yml (1)
252-269: LGTM! LoongArch64 CI configuration follows established patterns.

The addition of two LoongArch64 build matrix entries properly follows the same pattern used for other cross-compiled architectures, with appropriate compiler selection (GCC 14), static linking, and cross-compilation packages.
arch/loongarch/loongarch_features.h (2)
11-15: Well-designed feature detection interface.

The `loongarch_cpu_features` struct appropriately captures the three key LoongArch SIMD capabilities (CRC32, LSX 128-bit, LASX 256-bit) following the same pattern as other architecture feature detection headers.
17-17: Proper internal function declaration.

The use of `Z_INTERNAL` for `loongarch_check_features` is correct for internal runtime detection functions.

arch/loongarch/slide_hash_lasx.c (2)
19-33: Efficient vectorized hash sliding implementation.

The `slide_hash_chain` function efficiently processes 16 hash values per iteration using LASX 256-bit intrinsics. The use of saturating subtract (`__lasx_xvssub_hu`) correctly prevents hash value underflow.
35-42: Proper bounds checking and implementation.

The assertion `Assert(s->w_size <= UINT16_MAX, ...)` is crucial for preventing overflow when casting to `uint16_t`. The function correctly handles both the `head` hash table (fixed `HASH_SIZE`) and the `prev` table (variable `wsize`).

functable.c (1)
272-309: Comprehensive LoongArch runtime dispatch integration.

The addition of LoongArch function pointer assignments follows the established pattern perfectly. The progressive feature checking (CRC → LSX → LASX) ensures optimal performance selection, and the conditional use of `HAVE_BUILTIN_CTZ` is appropriate for functions requiring bit manipulation support.

CMakeLists.txt (3)
132-136: Well-structured LoongArch build options.

The build options properly implement feature dependencies with LASX requiring LSX, following the hardware capabilities of LoongArch processors.

740-742: Consistent architecture directory setup.

The ARCHDIR assignment follows the same pattern as other supported architectures.

1616-1620: Complete feature summary integration.

The feature summary additions properly inform users about LoongArch-specific optimizations that are enabled or disabled during configuration.
cmake/toolchain-loongarch64-gcc-14.cmake (1)
9-9: Verify QEMU CPU model and sysroot path.

The emulator configuration uses `la464-loongarch-cpu` and assumes the sysroot is at `/usr/${CMAKE_C_COMPILER_TARGET}/`. Ensure these paths and CPU models are available in the target environment.

```shell
#!/bin/bash
# Description: Verify QEMU LoongArch64 support and sysroot availability

# Check if QEMU supports la464-loongarch-cpu
qemu-loongarch64 -cpu help 2>/dev/null | grep -q "la464-loongarch-cpu" || echo "QEMU CPU model la464-loongarch-cpu not found"

# Check typical sysroot locations
for path in /usr/loongarch64-linux-gnu /usr/loongarch64-unknown-linux-gnu; do
    if [ -d "$path" ]; then
        echo "Found potential sysroot: $path"
    fi
done
```

cmake/detect-intrinsics.cmake (3)
680-692: LGTM on CRC intrinsics detection.

The macro correctly tests LoongArch CRC intrinsics using `<larchintrin.h>` and `__crc_w_b_w`. The test program is minimal and appropriate.

694-709: LGTM on LSX intrinsics detection.

The macro properly sets the `-mlsx` flag and tests LSX intrinsics using `<lsxintrin.h>` and `__lsx_vabsd_b`. The structure follows established patterns.

728-743: LGTM on LASX intrinsics detection.

The macro correctly configures the `-mlasx` flag and tests LASX intrinsics using `<lasxintrin.h>` and `__lasx_xvabsd_b`. The implementation is consistent with other intrinsic detection macros.

arch/loongarch/lsxintrin_ext.h (3)
11-16: LGTM on sum of absolute differences implementation.

The `lsx_sad_bu` function correctly implements sum of absolute differences using LSX intrinsics. The sequence of horizontal add operations progressively reduces from byte to doubleword level.
18-20: LGTM on movemask implementation.

The `lsx_movemask_b` function properly extracts sign bits using `__lsx_vmskltz_b` and retrieves the result with `__lsx_vpickve2gr_w`. This mirrors x86 SSE `_mm_movemask_epi8` functionality.
22-31: LGTM on byte shuffle implementation.

The `lsx_shuffle_b` function correctly implements byte shuffling with MSB handling:

- Detects negative indices (MSB set) using `__lsx_vslti_b`
- Masks indices to 4 bits for the shuffle operation
- Clears destination bytes where source indices had the MSB set

This properly emulates x86 SSSE3 `_mm_shuffle_epi8` behavior.

arch/loongarch/compare256_lasx.c (3)
17-46: LGTM on LASX compare256 implementation.

The function correctly implements vectorized byte comparison using LASX intrinsics:

- Loads 32-byte chunks using `__lasx_xvld`
- Performs byte-wise equality check with `__lasx_xvseq_b`
- Extracts a bitmask and uses `__builtin_ctz` to find the first difference
- Properly handles loop unrolling and length tracking

The logic mirrors the AVX2 implementation and should provide good performance.
52-61: LGTM on template macro definitions.

The macro definitions correctly set up the template parameters for both regular and slow longest match variants:

- `LONGEST_MATCH` and `COMPARE256` macros are properly defined
- Both regular and slow variants are included
- Template inclusion pattern follows established conventions
15-15: Verify availability of lasx_movemask_b function.

The code references `lasx_movemask_b` (lines 25, 36) but this function isn't defined in this file. Ensure it's properly declared in `lasxintrin_ext.h`.

```shell
#!/bin/bash
# Description: Check if lasx_movemask_b is defined in lasxintrin_ext.h

# Search for lasx_movemask_b definition
rg -A 5 "lasx_movemask_b" arch/loongarch/
```

arch/loongarch/lasxintrin_ext.h (5)
12-17: LGTM! Correct SAD implementation.

The sum of absolute differences implementation follows the correct pattern of computing absolute differences followed by progressive horizontal additions to accumulate the result.

19-22: LGTM! Correct movemask implementation.

The function correctly extracts sign bits from all 32 bytes in the 256-bit vector by using mask generation and combining the results from two words.

24-27: LGTM! Correct casting implementation.

The function correctly zero-extends a 128-bit vector to 256-bit by placing the input in the lower half and zeroing the upper half.
29-39: LGTM! Correct vector insertion and extension.

Both functions are implemented correctly:

- `lasx_inserti128_si256` properly handles insertion into the lower (imm8==0) or upper (imm8!=0) half
- `lasx_zextsi128_si256` correctly zero-extends the input vector

The commented alternative implementation for `lasx_zextsi128_si256` could be more explicit about the zero-extension intent, but the current implementation is more efficient.
42-46: LGTM! Correct shuffle implementation with negative index handling.

The function correctly implements byte shuffling with proper handling of negative indices (setting corresponding output bytes to zero), matching the behavior of x86 shuffle intrinsics.
arch/loongarch/crc32_la.c (2)
15-37: LGTM! Correct CRC32 implementation using LoongArch hardware instructions.

The function correctly:
- Inverts the input CRC at the beginning and output at the end (standard CRC32 practice)
- Processes data in optimal chunk sizes using appropriate LoongArch CRC instructions
- Uses proper memory reading functions that should handle endianness correctly
- Follows the established pattern for hardware-accelerated CRC implementations
41-49: LGTM! Correct wrapper implementations.

Both wrapper functions properly integrate with the fold infrastructure:

- `crc32_fold_copy_loongarch64` correctly updates the CRC and copies data
- `crc32_fold_loongarch64` correctly updates the CRC with proper parameter handling
- The `Z_UNUSED` annotation for `init_crc` is appropriate as noted in the comment
arch/loongarch/compare256_lsx.c (2)
17-82: LGTM! Well-implemented SIMD comparison with proper alignment handling.

The function correctly implements 256-byte buffer comparison:

- Proper handling of unaligned access with an initial unaligned load followed by aligned processing
- Correct use of LSX intrinsics for 16-byte vector comparisons
- Appropriate use of `__builtin_ctz` to find the first differing bit
- Sound algorithm for handling alignment boundaries and remaining bytes
- Logic follows established patterns from other architecture implementations
84-98: LGTM! Correct template usage and wrapper implementation.

The code properly:

- Provides a clean public interface via the `compare256_lsx` wrapper
- Uses the template system correctly to instantiate both fast and slow longest match variants
- Follows established patterns from other architecture implementations
arch/loongarch/adler32_lsx.c (3)
18-30: LGTM! Correct horizontal sum implementations.

Both helper functions properly implement horizontal summation:

- `partial_hsum` correctly sums adjacent 32-bit word pairs
- `hsum` correctly sums all four 32-bit elements using standard interleaving and addition patterns
32-146: LGTM! Sophisticated SIMD Adler32 implementation.

This is a well-structured vectorized Adler32 implementation that:

- Correctly handles overflow prevention by processing chunks of maximum size NMAX
- Uses appropriate vector accumulators for both s1 and s2 components
- Implements efficient 32-byte and 16-byte processing loops with proper weight vectors
- Properly handles both copy and non-copy variants through the COPY template parameter
- Follows established patterns from other architectures' optimized Adler32 implementations
- Correctly applies modular reduction and combines results at the end

The complexity is justified for the performance benefits of SIMD optimization.
148-154: LGTM! Clean wrapper functions.

Both wrapper functions properly instantiate the template implementation with appropriate parameters for copy vs non-copy variants.
configure (4)
107-107: LGTM! Properly integrated LoongArch configuration variables.

The new configuration variables and options are well-integrated:

- `buildcrc32la` follows the established pattern for optional features
- The `lsxflag` and `lasxflag` compiler flags follow naming conventions
- The command-line option `--without-crc32-la` matches the established pattern
- Architecture detection for `loongarch64` is properly added to the existing case structure

Also applies to: 122-123, 230-230, 365-366
1717-1772: LGTM! Well-structured intrinsic detection functions.

All three intrinsic detection functions follow the established pattern:

- Proper test programs with appropriate headers and representative intrinsics
- Correct compiler flag usage (`-mlsx`, `-mlasx`)
- Consistent variable naming and logging
- Test intrinsics are appropriate for validating each instruction set
2293-2332: LGTM! Comprehensive LoongArch architecture support.

The LoongArch case properly implements:
- Correct architecture directory setup
- Conditional feature detection based on optimization settings
- Appropriate intrinsic checking for CRC32, LSX, and LASX
- Proper compiler flag and object file management
- Consistent naming and structure matching other architecture implementations
- Complete coverage of all LoongArch-specific optimizations introduced in this PR
2400-2401: LGTM! Proper Makefile integration.

The Makefile updates correctly:
- Log the new compiler flags following the established pattern
- Substitute the flags into the architecture-specific Makefile using the standard sed pattern
- Maintain consistency with other architecture flag handling
Also applies to: 2551-2552
arch/loongarch/chunkset_lsx.c (3)
1-14: LGTM: Well-structured includes and conditional compilation.

The file header, copyright, and includes are properly organized. The conditional compilation with `LOONGARCH_LSX` ensures the code is only compiled when LSX support is available.

23-33: LGTM: Correct LSX intrinsic usage for chunk replication.

The `chunkmemset_*` functions correctly use LSX intrinsics to replicate small data patterns across 128-bit vectors:

- `__lsx_vreplgr2vr_h` for 2-byte patterns
- `__lsx_vreplgr2vr_w` for 4-byte patterns
- `__lsx_vreplgr2vr_d` for 8-byte patterns

This follows the expected pattern for SIMD chunk operations.
43-59: Verify memory sanitizer integration and permutation logic.

The `GET_CHUNK_MAG` function has complex logic for byte permutation. While the `__msan_unpoison` usage is documented as intentional, verify that:

- The permutation table indexing is bounds-safe
- The `lut_rem.idx` value is properly validated
- The `lsx_shuffle_b` operation handles all edge cases correctly

```shell
#!/bin/bash
# Description: Verify the permutation table and lookup usage is consistent across architectures
# Expected: Similar patterns in other SIMD implementations

# Check if permutation table is properly defined
fd -e h -e c | xargs rg -l "permute_table" | head -5

# Check bounds of perm_idx_lut usage
rg -A 5 -B 5 "perm_idx_lut\[.*\]" --type c
```

arch/loongarch/adler32_lasx.c (3)
20-32: LGTM: Correct LASX horizontal sum implementation.

The `hsum256` and `partial_hsum256` functions correctly implement horizontal sum operations using LASX intrinsics:

- Proper use of `__lasx_xvadd_w` for 32-bit additions
- Correct bit shifting and permutation for reduction
- Appropriate vector element extraction

This follows the standard pattern for SIMD horizontal reductions.
87-87:

```shell
#!/bin/bash
# Display context around the usage of lasx_sad_bu in arch/loongarch/adler32_lasx.c
sed -n '1,200p' arch/loongarch/adler32_lasx.c
```

142-143:

```shell
#!/bin/bash
# Locate and inspect partial_hsum256 and hsum256 definitions
rg -n -C5 'partial_hsum256' --type c
rg -n -C5 'hsum256' --type c

# Show surrounding code in arch/loongarch/adler32_lasx.c for context
rg -n -C5 'adler0 = partial_hsum256' arch/loongarch/adler32_lasx.c
```

arch/loongarch/chunkset_lasx.c (3)
38-41: LGTM: Correct implementation of 16-byte chunk replication.

The `chunkmemset_16` function correctly implements 16-byte pattern replication using:

- Load of the 128-bit pattern with `__lsx_vld`
- Broadcast to both 128-bit lanes of the 256-bit vector using `lasx_inserti128_si256`

This is the standard approach for creating 256-bit vectors from 128-bit patterns.
62-83: Review complex cross-lane permutation logic.

The `GET_CHUNK_MAG` function has complex branching logic for handling different distance ranges. The `dist < 16` case uses a different approach than the `>= 16` case.

Key concerns:

- The `permute_xform` offset calculation with `lasx_inserti128_si256`
- Cross-lane permutation with `__lsx_vslti_b` and `__lsx_vbitsel_v`
- Memory access patterns for `buf + 16`

```shell
#!/bin/bash
# Description: Verify similar cross-lane permutation logic in other SIMD implementations
# Expected: Similar patterns in AVX2 implementations

# Check for similar cross-lane logic in other architectures
rg -A 10 -B 5 "dist.*16.*permute\|cross.*lane" --type c

# Check if lasx_inserti128_si256 is properly defined
rg "lasx_inserti128_si256" --type c --type h
```
102-113:

```shell
#!/bin/bash
# Check includes in the LoongArch chunkset implementation
rg '#include' -n arch/loongarch/chunkset_lasx.c
echo
echo "First 30 lines of arch/loongarch/chunkset_lasx.c for context:"
sed -n '1,30p' arch/loongarch/chunkset_lasx.c
```

arch/loongarch/loongarch_functions.h (3)

8-10: LGTM: Proper header guard implementation.

The header guard `LOONGARCH_FUNCTIONS_H_` follows the naming convention and prevents multiple inclusions correctly.

45-99: Verify macro redefinition safety.

The `DISABLE_RUNTIME_CPU_DETECTION` section redefines many `native_*` macros. While this approach is common, ensure that:

1. The original macros are properly undefined before redefinition
2. The redefinition order respects LASX > LSX > CRC precedence
3. No macro redefinition warnings occur during compilation

```shell
#!/bin/bash
# Description: Check for potential macro redefinition issues
# Expected: Proper undef/define pairs, no redefinition warnings

# Check if similar patterns exist in other architecture headers
rg -A 10 -B 5 "DISABLE_RUNTIME_CPU_DETECTION" --type h

# Look for macro redefinition patterns
rg -A 3 -B 1 "#undef.*native_.*#define.*native_" --type h
```
21-25:

```shell
#!/bin/bash
# Inspect LOONGARCH_LSX guard in loongarch_functions.h
rg -n 'LOONGARCH_LSX' -A3 -B3 arch/loongarch/loongarch_functions.h

# Inspect LOONGARCH_LASX guard in loongarch_functions.h
rg -n 'LOONGARCH_LASX' -A3 -B3 arch/loongarch/loongarch_functions.h
```
Codecov Report

Attention: Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff             @@
##           develop    #1925      +/-   ##
============================================
+ Coverage    39.39%   81.43%   +42.04%
============================================
  Files           74      162       +88
  Lines         7885    13916     +6031
  Branches      1303     3118     +1815
============================================
+ Hits          3106    11333     +8227
+ Misses        4536     1582     -2954
- Partials       243     1001      +758
```

☔ View full report in Codecov by Sentry.
This looks pretty complete. I think the README.md features section might need to be updated to add LSX/LASX.
The unit tests all pass, and benchmarks show a significant performance increase.
Thanks for the help with testing.
I'm not familiar with Loongson assembly syntax or intrinsics, so I can't verify whether there are any hidden bugs that CI isn't catching, but generally the patch is starting to look pretty clean. As with any new "port", there might be further improvements that can be made, but not everything needs to be in the initial version. The general rule is to use inline assembly only if there is no working intrinsic that does the same thing efficiently. Having at least one person familiar with Loongson testing the code is a good thing, but some could argue there should be more than one.
Unfortunately, compilers provide no intrinsics for this function.
Then it's OK to use inline assembly. Using inline assembly will sometimes disable certain compiler optimizations, but it's still generally safer not to let the compiler optimize code that it might not fully understand.
Rebased |
Actionable comments posted: 0
🧹 Nitpick comments (3)
arch/loongarch/lasxintrin_ext.h (3)
`21-36`: **Consider documenting the xvpermi.q immediate value.**

The implementation is correct, but the magic number `0x20` in the `xvpermi.q` instruction could benefit from a comment explaining that it selects the lower quadword for zero extension.

```diff
 static inline __m256i lasx_zextsi128_si256(__m128i in) {
     __m256i out = __lasx_xvldi(0);
     __asm__ volatile (
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[out], " LA_XREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[in], " LA_VREGS_PREFIX "\\j \n\t"
+        "    # 0x20: copy lower 128 bits to output\n\t"
         "    xvpermi.q $xr\\i, $xr\\j, 0x20 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
         : [out] "+f" (out)
         : [in] "f" (in)
     );
     return out;
 }
```
`38-65`: **Complex but correct implementation - consider adding comments.**

The inline assembly correctly handles register allocation and vector concatenation. Consider adding comments to explain the `xvpermi.q` operation and the purpose of the `.ifnc` block.

```diff
 static inline __m256i lasx_set_si128(__m128i inhi, __m128i inlo) {
     __m256i out;
     __asm__ volatile (
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[hi], " LA_VREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[lo], " LA_VREGS_PREFIX "\\j \n\t"
+        "    # 0x20: place lo in lower 128 bits, hi in upper 128 bits\n\t"
         "    xvpermi.q $xr\\i, $xr\\j, 0x20 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
+        "# If output register differs from hi, copy the result\n\t"
         ".ifnc %[out], %[hi] \n\t"
         ".irp i," LA_ALL_REGS "\n\t"
         " .ifc %[out], " LA_XREGS_PREFIX "\\i \n\t"
         "  .irp j," LA_ALL_REGS "\n\t"
         "   .ifc %[hi], " LA_VREGS_PREFIX "\\j \n\t"
         "    xvori.b $xr\\i, $xr\\j, 0 \n\t"
         "   .endif \n\t"
         "  .endr \n\t"
         " .endif \n\t"
         ".endr \n\t"
         ".endif \n\t"
         : [out] "=f" (out), [hi] "+f" (inhi)
         : [lo] "f" (inlo)
     );
     return out;
 }
```
`90-91`: **Add newline at end of file.**

```diff
 #endif  // include guard LASXINTRIN_EXT_H
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (41)
- .github/workflows/cmake.yml (1 hunks)
- .github/workflows/configure.yml (1 hunks)
- CMakeLists.txt (5 hunks)
- README.md (1 hunks)
- arch/arm/chunkset_neon.c (1 hunks)
- arch/generic/chunk_128bit_perm_idx_lut.h (1 hunks)
- arch/generic/chunk_256bit_perm_idx_lut.h (1 hunks)
- arch/loongarch/Makefile.in (1 hunks)
- arch/loongarch/adler32_lasx.c (1 hunks)
- arch/loongarch/adler32_lsx.c (1 hunks)
- arch/loongarch/chunkset_lasx.c (1 hunks)
- arch/loongarch/chunkset_lsx.c (1 hunks)
- arch/loongarch/compare256_lasx.c (1 hunks)
- arch/loongarch/compare256_lsx.c (1 hunks)
- arch/loongarch/crc32_la.c (1 hunks)
- arch/loongarch/lasxintrin_ext.h (1 hunks)
- arch/loongarch/loongarch_features.c (1 hunks)
- arch/loongarch/loongarch_features.h (1 hunks)
- arch/loongarch/loongarch_functions.h (1 hunks)
- arch/loongarch/lsxintrin_ext.h (1 hunks)
- arch/loongarch/slide_hash_lasx.c (1 hunks)
- arch/loongarch/slide_hash_lsx.c (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/chunkset_ssse3.c (1 hunks)
- arch_functions.h (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- cmake/toolchain-loongarch64-gcc-14.cmake (1 hunks)
- configure (8 hunks)
- cpu_features.c (1 hunks)
- cpu_features.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32.cc (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/benchmarks/benchmark_slidehash.cc (1 hunks)
- test/data/.gitignore (1 hunks)
- test/test_adler32.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_crc32.cc (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- arch/x86/chunkset_avx512.c
- arch_functions.h
🚧 Files skipped from review as they are similar to previous changes (37)
- test/data/.gitignore
- arch/x86/chunkset_avx2.c
- cpu_features.c
- arch/arm/chunkset_neon.c
- test/benchmarks/benchmark_slidehash.cc
- arch/x86/chunkset_ssse3.c
- test/benchmarks/benchmark_adler32_copy.cc
- test/benchmarks/benchmark_crc32.cc
- arch/loongarch/Makefile.in
- README.md
- test/benchmarks/benchmark_adler32.cc
- arch/loongarch/loongarch_features.c
- test/benchmarks/benchmark_compare256.cc
- arch/generic/chunk_128bit_perm_idx_lut.h
- arch/generic/chunk_256bit_perm_idx_lut.h
- .github/workflows/configure.yml
- arch/loongarch/slide_hash_lasx.c
- arch/loongarch/slide_hash_lsx.c
- cpu_features.h
- arch/loongarch/loongarch_features.h
- test/test_adler32.cc
- CMakeLists.txt
- arch/loongarch/compare256_lasx.c
- functable.c
- arch/loongarch/crc32_la.c
- test/test_crc32.cc
- arch/loongarch/compare256_lsx.c
- test/test_compare256.cc
- arch/loongarch/adler32_lasx.c
- arch/loongarch/chunkset_lsx.c
- .github/workflows/cmake.yml
- arch/loongarch/chunkset_lasx.c
- cmake/toolchain-loongarch64-gcc-14.cmake
- configure
- arch/loongarch/loongarch_functions.h
- arch/loongarch/lsxintrin_ext.h
- cmake/detect-intrinsics.cmake
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-18T02:53:23.621Z
Learning: In zlib-ng, when macro names and parameters are self-explanatory and scalar versions of the code provide context, additional documentation for the macros may not be necessary.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
arch/loongarch/adler32_lsx.c (6)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
arch/loongarch/lasxintrin_ext.h (12)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/x86_intrins.h:114-117
Timestamp: 2025-02-23T16:51:54.545Z
Learning: In x86/x86_intrins.h, the Clang macros for _mm_cvtsi64x_si128 and _mm_cvtsi128_si64x don't need additional MSVC guards since MSVC's implementation is already protected by `defined(_MSC_VER) && !defined(__clang__)`, making them mutually exclusive.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:41:10.063Z
Learning: For SSE2 optimizations, `_mm_cvtsi128_si64` should be used instead of `_mm_extract_epi64` (SSE4.1) for extracting 64-bit values from 128-bit vectors, as it generates more efficient movq instructions.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.330Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: inffast_tpl.h:257-262
Timestamp: 2024-10-29T02:22:52.846Z
Learning: In `inffast_tpl.h`, when AVX512 is enabled, the branch involving `chunkcopy_safe` is intentionally eliminated to optimize performance.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T03:09:54.717Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-12T13:02:26.066Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
🧬 Code Graph Analysis (1)
arch/loongarch/adler32_lsx.c (3)
test/benchmarks/benchmark_adler32_copy.cc (1)
adler(22-22)arch/loongarch/lsxintrin_ext.h (2)
lsx_sad_bu(11-16)lsx_sad_bu(11-11)arch/loongarch/loongarch_functions.h (2)
adler32_lsx(18-18)adler32_fold_copy_lsx(19-19)
⏰ Context from checks skipped due to timeout of 90000ms (114)
- GitHub Check: Windows GCC
- GitHub Check: Windows MSVC 2022 v142 Win32
- GitHub Check: Windows MSVC 2022 v141 Win32
- GitHub Check: Windows MSVC 2022 v140 Win32
- GitHub Check: Windows MSVC 2022 v143 Win64
- GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
- GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
- GitHub Check: Ubuntu GCC SSSE3 UBSAN
- GitHub Check: Ubuntu GCC No Chorba
- GitHub Check: Ubuntu GCC ASAN
- GitHub Check: macOS GCC (ARM64)
- GitHub Check: macOS Clang (Target 10.10)
- GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
🔇 Additional comments (10)
arch/loongarch/adler32_lsx.c (4)
`10-30`: **LGTM! Helper functions correctly implement horizontal sums.**

The `partial_hsum` and `hsum` functions properly use LSX intrinsics to compute horizontal sums of vector elements, which are essential for the Adler-32 algorithm.
`32-57`: **Function initialization and constants are correct.**

The NULL handling, adler splitting, and dot-product vectors are properly set up for the Adler-32 algorithm. The weighted sum vectors (32→17 and 16→1) correctly implement the position-based weighting required by the algorithm.
`58-138`: **Main processing loops correctly implement vectorized Adler-32.**

The implementation efficiently processes data in 32-byte and 16-byte chunks using LSX SIMD intrinsics. The use of NMAX to limit iterations before modulo reduction prevents overflow, and the weighted sum calculations correctly implement the Adler-32 algorithm.
`140-154`: **Clean interface design with proper remainder handling.**

The use of `goto` for remainder handling is appropriate here, as it efficiently reuses the existing small-buffer code path. The public functions provide a clean interface by wrapping the implementation with the appropriate COPY flag.

arch/loongarch/lasxintrin_ext.h (6)
`1-10`: **Header structure looks good.**

The copyright notice, include guards, and required headers are properly implemented.
`12-20`: **Compiler-specific register prefixes are correctly handled.**

The differentiation between Clang and GCC register prefixes follows LoongArch conventions, and the `LA_ALL_REGS` macro provides a clean way to iterate through all registers in inline assembly.
`67-69`: **Clean broadcast implementation.**

Efficiently reuses `lasx_set_si128` to duplicate the 128-bit vector into both halves.
`71-76`: **Correct SAD implementation with proper width progression.**

The function correctly implements sum of absolute differences using a logical sequence of horizontal additions that progressively widen from bytes to doublewords.
`78-81`: **Efficient movemask implementation.**

Correctly extracts sign bits from all 32 bytes and combines them into a 32-bit integer mask.
`83-88`: **Correct shuffle implementation matching x86 semantics.**

The function properly implements byte shuffling with masking, zeroing bytes when the corresponding index has its MSB set, which matches the behavior of x86's `_mm_shuffle_epi8`.
PR added CI using qemu for LoongArch (first commit).
Testing with qemu userland emulation is a good start, but the emulation isn't perfect, so some issues might still linger that can only be detected when running on real hardware. There has been quite a long discussion about serious bugs in qemu that haven't been fixed since qemu 4.x. The latest qemu version is currently 10.0.2, so that's quite a long time.
Needs a rebase after #1930
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
…ntation Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Co-authored-by: junchao-zhao <zhaojunchao@loongson.cn> Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
…lignment Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
Ready for merging. |
This implementation is based on replacing x86 intrinsics with LoongArch intrinsics.
Test on real hardware:
Compression performance increase: 6.44%
Decompression performance increase: 31.99%
Summary by CodeRabbit
New Features
Documentation
Tests
Chores