Unroll some of the adler checksum for avx2#1949

Merged
Dead2 merged 1 commit into zlib-ng:develop from KungFuJesus:adler_avx2_unroll
Aug 20, 2025

@KungFuJesus
Collaborator

@KungFuJesus KungFuJesus commented Aug 16, 2025

Similar to what's done for vmx, avx512, and sse4, let's unroll some of this checksum since it's a commutative checksum. We take advantage of ILP and do more intermediate sums before rolling them back together for the finalization of the checksum.
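
To illustrate the idea in scalar form, here is a minimal sketch, not the code from this PR: it handles 64 bytes per iteration with two independent accumulator pairs (one per 32-byte half) and folds them back together once per chunk. The function names `adler32_ref` and `adler32_unrolled` are made up for this example, and the real AVX2 path uses vector intrinsics rather than scalar loops.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */

/* Byte-at-a-time reference: every step depends on the previous one. */
uint32_t adler32_ref(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t a = adler & 0xffff, b = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_BASE;
        b = (b + a) % ADLER_BASE;
    }
    return (b << 16) | a;
}

/* Hypothetical unrolled variant: two independent accumulator pairs per
 * 64-byte chunk, combined once per chunk. */
uint32_t adler32_unrolled(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t a = adler & 0xffff, b = (adler >> 16) & 0xffff;

    while (len >= 64) {
        uint32_t sum_lo = 0, wsum_lo = 0; /* bytes 0..31  */
        uint32_t sum_hi = 0, wsum_hi = 0; /* bytes 32..63 */

        /* The lo and hi chains do not depend on each other, so the CPU
         * can overlap them (instruction-level parallelism). */
        for (unsigned j = 0; j < 32; j++) {
            sum_lo  += buf[j];
            wsum_lo += (32 - j) * buf[j];
            sum_hi  += buf[32 + j];
            wsum_hi += (32 - j) * buf[32 + j];
        }

        /* Byte i of the chunk is counted (64 - i) times in b. For the low
         * half that is 32 + (32 - i), hence the extra 32 * sum_lo term.
         * b must be updated with the old value of a, so it goes first. */
        b = (b + 64 * a + wsum_lo + 32 * sum_lo + wsum_hi) % ADLER_BASE;
        a = (a + sum_lo + sum_hi) % ADLER_BASE;

        buf += 64;
        len -= 64;
    }

    /* Tail: fewer than 64 bytes left, fall back to the simple loop. */
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_BASE;
        b = (b + a) % ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Both functions should return the same value for any input; the per-chunk sums are small enough (well under 2^32) that the modulo can be deferred to once per 64 bytes.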

Summary by CodeRabbit

  • Refactor
    • Optimized Adler-32 on x86 with AVX2, adding a 64-byte processing path and improving 32-byte handling.
    • Delivers faster checksum throughput on large buffers, with additional gains when checksum and copy run together.
    • No changes to user-facing behavior or APIs.

@coderabbitai
Contributor

coderabbitai bot commented Aug 16, 2025

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6bb724 and 352fcec.

📒 Files selected for processing (1)
  • arch/x86/adler32_avx2.c (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • arch/x86/adler32_avx2.c
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (100)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Windows GCC Native Instructions (AVX)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Symbol Prefix
  • GitHub Check: macOS GCC Symbol Prefix

@KungFuJesus
Collaborator Author

On a meager U class Haswell chip:

Before

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
adler32/native/1             5.20 ns         5.19 ns    135075246
adler32/native/8             8.37 ns         8.34 ns     84535112
adler32/native/12            10.0 ns         9.99 ns     69944322
adler32/native/16            11.7 ns         11.6 ns     60030516
adler32/native/32            10.9 ns         10.9 ns     64211571
adler32/native/64            12.5 ns         12.5 ns     56268038
adler32/native/512           29.5 ns         29.5 ns     23825567
adler32/native/4096           186 ns          186 ns      3681299
adler32/native/32768         1415 ns         1412 ns       496706
adler32/native/262144       14151 ns        14119 ns        49597
adler32/native/4194304     313660 ns       311133 ns         2210

After

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
adler32/native/1             5.13 ns         5.12 ns    136980626
adler32/native/8             8.29 ns         8.28 ns     84562446
adler32/native/12            9.97 ns         9.95 ns     70399291
adler32/native/16            12.4 ns         12.3 ns     59158817
adler32/native/32            10.7 ns         10.7 ns     65591363
adler32/native/64            12.0 ns         12.0 ns     58241380
adler32/native/512           27.9 ns         27.8 ns     25148383
adler32/native/4096           160 ns          160 ns      4374342
adler32/native/32768         1275 ns         1272 ns       552931
adler32/native/262144       11841 ns        11816 ns        59249
adler32/native/4194304     292282 ns       289905 ns         2333

On HEDT-class hardware (Cascade Lake-X):

Before

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
adler32/avx2/1             3.27 ns         3.27 ns    214157823
adler32/avx2/8             5.39 ns         5.39 ns    129875561
adler32/avx2/12            6.73 ns         6.73 ns    103986274
adler32/avx2/16            8.12 ns         8.13 ns     86145285
adler32/avx2/32            7.71 ns         7.71 ns     90711137
adler32/avx2/64            8.21 ns         8.21 ns     85186661
adler32/avx2/512           18.2 ns         18.2 ns     38419988
adler32/avx2/4096           119 ns          119 ns      5905768
adler32/avx2/32768          949 ns          949 ns       736810
adler32/avx2/262144        8403 ns         8406 ns        83277
adler32/avx2/4194304     223228 ns       223337 ns         3136

After

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
adler32/avx2/1             3.42 ns         3.42 ns    204842982
adler32/avx2/8             6.85 ns         6.85 ns    102142369
adler32/avx2/12            8.81 ns         8.81 ns     79447535
adler32/avx2/16            8.02 ns         8.01 ns     87344982
adler32/avx2/32            7.61 ns         7.62 ns     91836008
adler32/avx2/64            8.05 ns         8.05 ns     86905783
adler32/avx2/512           17.1 ns         17.1 ns     40899431
adler32/avx2/4096           103 ns          103 ns      6810843
adler32/avx2/32768          869 ns          869 ns       804832
adler32/avx2/262144        7638 ns         7640 ns        91662
adler32/avx2/4194304     207382 ns       207414 ns         3435

@nmoinvaz
Member

Nice work!

@Dead2 Dead2 added the optimization and Architecture (Architecture specific) labels Aug 16, 2025
@Dead2
Member

Dead2 commented Aug 18, 2025

Tested on i7-11700K, compiled without AVX512* to enforce AVX2 code usage.

Deflatebench benchmark differences were negligible (using minideflate), and within measurement errors.

Develop

   text    data     bss     dec     hex filename
 159302    1344       8  160654   2738e libz-ng.so.2
 
compress_bench/compress_bench/1                 3735 ns         3735 ns       750424
compress_bench/compress_bench/8                 3985 ns         3985 ns       702022
compress_bench/compress_bench/16                4155 ns         4155 ns       673850
compress_bench/compress_bench/32                4485 ns         4485 ns       624271
compress_bench/compress_bench/64                4804 ns         4804 ns       581054
compress_bench/compress_bench/512               4843 ns         4843 ns       578384
compress_bench/compress_bench/4096              5349 ns         5349 ns       523434
compress_bench/compress_bench/32768             9775 ns         9775 ns       286633
uncompress_bench/uncompress_bench/1             69.4 ns         69.4 ns     38911457
uncompress_bench/uncompress_bench/64             217 ns          217 ns     12864435
uncompress_bench/uncompress_bench/1024           318 ns          318 ns      8873912
uncompress_bench/uncompress_bench/16384         2579 ns         2579 ns      1068983
uncompress_bench/uncompress_bench/131072       10425 ns        10425 ns       274728
uncompress_bench/uncompress_bench/1048576      74966 ns        74967 ns        37216

adler32/avx2/1             3.69 ns         3.69 ns    757471509
adler32/avx2/8             6.07 ns         6.07 ns    460718476
adler32/avx2/12            7.46 ns         7.46 ns    375386838
adler32/avx2/16            9.25 ns         9.25 ns    302909956
adler32/avx2/32            8.65 ns         8.65 ns    322817105
adler32/avx2/64            9.48 ns         9.48 ns    295803258
adler32/avx2/512           19.3 ns         19.3 ns    145084535
adler32/avx2/4096           115 ns          115 ns     24001933
adler32/avx2/32768          906 ns          906 ns      3086919
adler32/avx2/262144        7533 ns         7533 ns       372240
adler32/avx2/4194304     133709 ns       133711 ns        20937

PR

   text    data     bss     dec     hex filename
 159174    1344       8  160526   2730e libz-ng.so.2

compress_bench/compress_bench/1                 3733 ns         3733 ns       750064
compress_bench/compress_bench/8                 3986 ns         3986 ns       702455
compress_bench/compress_bench/16                4162 ns         4162 ns       673020
compress_bench/compress_bench/32                4486 ns         4486 ns       624333
compress_bench/compress_bench/64                4805 ns         4805 ns       584482
compress_bench/compress_bench/512               4822 ns         4822 ns       578895
compress_bench/compress_bench/4096              5339 ns         5339 ns       525944
compress_bench/compress_bench/32768             9777 ns         9777 ns       285427
uncompress_bench/uncompress_bench/1             70.7 ns         70.7 ns     39340989
uncompress_bench/uncompress_bench/64             219 ns          219 ns     12990511
uncompress_bench/uncompress_bench/1024           314 ns          314 ns      8983582
uncompress_bench/uncompress_bench/16384         2464 ns         2464 ns      1131842
uncompress_bench/uncompress_bench/131072        9373 ns         9373 ns       296360
uncompress_bench/uncompress_bench/1048576      71434 ns        71435 ns        39576

adler32/avx2/1             3.94 ns         3.94 ns    710914770
adler32/avx2/8             7.07 ns         7.07 ns    402292287
adler32/avx2/12            7.52 ns         7.52 ns    372461807
adler32/avx2/16            9.44 ns         9.44 ns    296351134
adler32/avx2/32            8.49 ns         8.49 ns    329367610
adler32/avx2/64            9.55 ns         9.55 ns    293618746
adler32/avx2/512           16.9 ns         16.9 ns    165390171
adler32/avx2/4096          87.6 ns         87.6 ns     31348828
adler32/avx2/32768          650 ns          650 ns      4311683
adler32/avx2/262144        6594 ns         6594 ns       422591
adler32/avx2/4194304     120868 ns       120869 ns        23166

No idea how the PR build ends up with a smaller compiled code size.

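compress_bench seems not to have changed much at all.
uncompress_bench shows some nice improvements of roughly 4% - 10% at 16384 bytes and above.
adler32/avx2 is roughly 10% - 28% faster at 512 bytes and above, while the smallest sizes are flat to somewhat slower.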
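
For anyone re-deriving the percentages from the Develop and PR timings above, the arithmetic is just the relative reduction in time. A minimal sketch (the helper name `pct_faster` is invented for this example):

```c
/* Relative time reduction in percent: positive means the PR is faster,
 * negative means it is slower. Inputs are timings in the same unit. */
double pct_faster(double before_ns, double after_ns) {
    return (before_ns - after_ns) / before_ns * 100.0;
}
```

For example, `pct_faster(115.0, 87.6)` covers the adler32/avx2/4096 row and `pct_faster(2579.0, 2464.0)` covers uncompress_bench/16384.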

Member

@Dead2 Dead2 left a comment

LGTM

@Dead2 Dead2 merged commit 24821a5 into zlib-ng:develop Aug 20, 2025
143 of 148 checks passed
@Dead2 Dead2 mentioned this pull request Nov 5, 2025
Labels

Architecture Architecture specific optimization

3 participants