Skip to content

Optimize compress_block() and build_tree()#1954

Merged
Dead2 merged 4 commits intodevelopfrom
trees-opt
Aug 20, 2025
Merged

Optimize compress_block() and build_tree()#1954
Dead2 merged 4 commits intodevelopfrom
trees-opt

Conversation

@Dead2
Copy link
Copy Markdown
Member

@Dead2 Dead2 commented Aug 18, 2025

  • Use local pointers to avoid indirection penalty in compress_block
  • Use local pointers to avoid indirection penalty in pqdownheap and build_tree
  • Reorder functions related to build_tree
  • Inline pqdownheap

Tested on Intel i7-11700K, AMD E450 and Rpi-5, all showing good speed results in multiple builds of different configs.
Deflatebench shows compression is 0.25 - 0.75% faster
compress_bench shows compression is 0.95% - 2.65% faster

i7-11700K:

Develop

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0783/0.0786/0.0787/0.0001    0.0302/0.0310/0.0310/0.0002        8,526,745
 2     43.871%  0.1284/0.1291/0.1293/0.0002    0.0304/0.0305/0.0305/0.0000        6,903,702
 3     42.388%  0.1561/0.1562/0.1564/0.0001    0.0292/0.0293/0.0293/0.0000        6,670,239
 4     41.647%  0.1795/0.1799/0.1801/0.0002    0.0285/0.0286/0.0286/0.0000        6,553,746
 5     41.216%  0.1945/0.1949/0.1953/0.0002    0.0283/0.0284/0.0285/0.0000        6,485,936
 6     41.038%  0.2402/0.2407/0.2411/0.0002    0.0280/0.0281/0.0281/0.0000        6,457,827
 7     40.778%  0.3374/0.3378/0.3381/0.0002    0.0283/0.0283/0.0284/0.0000        6,416,941
 8     40.704%  0.4448/0.4453/0.4457/0.0002    0.0283/0.0284/0.0284/0.0000        6,405,249
 9     40.409%  0.5249/0.5256/0.5260/0.0002    0.0276/0.0276/0.0276/0.0000        6,358,951

 avg1  42.915%                       0.2542                         0.0289
 tot                                68.6448                         7.8051       60,779,336

   text    data     bss     dec     hex filename
 159302    1344       8  160654   2738e libz-ng.so.2

slide_hash/avx2/1024        2469 ns         2469 ns       850225
slide_hash/avx2/2048        2493 ns         2493 ns       841372
slide_hash/avx2/4096        2551 ns         2551 ns       823830
slide_hash/avx2/8192        2669 ns         2669 ns       786739
slide_hash/avx2/16384       2905 ns         2905 ns       723014
slide_hash/avx2/32768       3379 ns         3379 ns       621056

compress_bench/compress_bench/1                 3759 ns         3759 ns       745247
compress_bench/compress_bench/8                 4012 ns         4012 ns       698459
compress_bench/compress_bench/16                4204 ns         4204 ns       668137
compress_bench/compress_bench/32                4527 ns         4527 ns       619999
compress_bench/compress_bench/64                4863 ns         4863 ns       575942
compress_bench/compress_bench/512               4898 ns         4898 ns       572839
compress_bench/compress_bench/4096              5445 ns         5445 ns       512955
compress_bench/compress_bench/32768            10006 ns        10006 ns       280370

PR

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0783/0.0785/0.0786/0.0001    0.0301/0.0310/0.0310/0.0002        8,526,745
 2     43.871%  0.1268/0.1271/0.1273/0.0002    0.0303/0.0304/0.0305/0.0000        6,903,702
 3     42.388%  0.1527/0.1531/0.1533/0.0001    0.0292/0.0293/0.0293/0.0000        6,670,239
 4     41.647%  0.1767/0.1769/0.1771/0.0001    0.0285/0.0285/0.0286/0.0000        6,553,746
 5     41.216%  0.1928/0.1932/0.1936/0.0002    0.0283/0.0284/0.0284/0.0000        6,485,936
 6     41.038%  0.2389/0.2393/0.2395/0.0002    0.0280/0.0281/0.0281/0.0000        6,457,827
 7     40.778%  0.3354/0.3359/0.3363/0.0003    0.0283/0.0283/0.0284/0.0000        6,416,941
 8     40.704%  0.4428/0.4435/0.4439/0.0003    0.0283/0.0284/0.0284/0.0000        6,405,249
 9     40.409%  0.5234/0.5242/0.5246/0.0003    0.0276/0.0276/0.0276/0.0000        6,358,951

 avg1  42.915%                       0.2524                         0.0289
 tot                                68.1497                         7.7984       60,779,336

   text    data     bss     dec     hex filename
 159862    1344       8  161214   275be libz-ng.so.2

compress_bench/compress_bench/1                 3730 ns         3730 ns       750349
compress_bench/compress_bench/8                 4002 ns         4002 ns       700884
compress_bench/compress_bench/16                4147 ns         4147 ns       675454
compress_bench/compress_bench/32                4469 ns         4469 ns       627385
compress_bench/compress_bench/64                4734 ns         4734 ns       589830
compress_bench/compress_bench/512               4807 ns         4807 ns       582401
compress_bench/compress_bench/4096              5306 ns         5306 ns       528373
compress_bench/compress_bench/32768             9741 ns         9741 ns       287033

Summary by CodeRabbit

  • Refactor

    • Reworked internal Huffman tree construction and heap management to streamline processing and reduce overhead.
    • Simplified compression loop to reduce indirection and improve data access patterns.
  • Performance

    • Faster compression with improved CPU efficiency and higher throughput, especially on varied data.
    • No changes to output format or compatibility; results remain identical.
  • Documentation

    • Updated inline comments to reflect the new compression pipeline flow.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Aug 18, 2025

Walkthrough

Refactors Huffman tree construction to use local heap/depth arrays and inlined pqdownheap/pqremove, introduces a new static build_tree implementation, and reduces indirection in compress_block by using local pointers and a local sym_next variable.

Changes

Cohort / File(s) Summary
Huffman heap & tree builder refactor
trees.c
Reworked pqdownheap to an inline function taking explicit depth, heap, and heap_len; updated pqremove calls to pass depth/heap; added a static build_tree(deflate_state *s, tree_desc *desc) that constructs Huffman trees using local depth/heap, ensures ≥2 non-zero codes, combines smallest nodes, updates Freq/Dad/Depth, then calls gen_bitlen and gen_codes and sets desc->max_code.
Compression loop optimizations
trees.c
compress_block updated to use local pointers (e.g., depth, heap, symbol buffers) and a local sym_next to reduce repeated indirection; handling for LIT_MEM and sym_buf paths streamlined for fewer field accesses.
Miscellaneous small locals
trees.c
send_tree and other helpers use temporary locals (e.g., bi_valid, bi_buf) to reduce repeated struct field access and are stored back at end.

Sequence Diagram(s)

sequenceDiagram
  participant Deflater as deflate_state
  participant Build as build_tree()
  participant Heap as pqdownheap()/pqremove
  participant Bitlen as gen_bitlen()
  participant Codes as gen_codes()

  Deflater->>Build: build_tree(desc)
  Build->>Heap: init heap from non-zero freqs, ensure 2 nodes
  loop combine nodes until single root
    Build->>Heap: pqremove(depth,heap,tree,SMALLEST) twice
    Heap-->>Build: two smallest nodes
    Build->>Build: create parent node, set Freq/Dad/Depth
    Build->>Heap: pqdownheap(depth,heap,heap_len,tree,k)
  end
  Build->>Bitlen: gen_bitlen(desc)
  Bitlen-->>Build: populated bit lengths
  Build->>Codes: gen_codes(desc)
  Codes-->>Build: populated codes
  Build-->>Deflater: desc->max_code
Loading
sequenceDiagram
  participant Deflater as deflate_state
  participant Compress as compress_block()

  Deflater->>Compress: compress_block(...)
  Compress->>Compress: bind local pointers and sym_next
  loop for each symbol
    Compress->>Compress: process symbol using local buffers and sym_next
  end
  Compress-->>Deflater: emitted compressed block
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–75 minutes

Suggested labels

cleanup

Suggested reviewers

  • nmoinvaz
  • KungFuJesus

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch trees-opt

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 18, 2025

Codecov Report

❌ Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@3a52db3). Learn more about missing BASE report.
⚠️ Report is 7 commits behind head on develop.

Files with missing lines Patch % Lines
trees.c 98.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #1954   +/-   ##
==========================================
  Coverage           ?   81.90%           
==========================================
  Files              ?      162           
  Lines              ?    13928           
  Branches           ?     3122           
==========================================
  Hits               ?    11408           
  Misses             ?     1555           
  Partials           ?      965           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
trees.c (3)

69-69: Make pqdownheap parameters const to express intent and help the compiler

pqdownheap doesn't mutate depth or tree; marking them const documents intent, can enable better alias analysis, and prevents accidental writes later.

Apply:

- static inline void pqdownheap       (unsigned char *depth, int *heap, const int heap_len, ct_data *tree, int k);
+ static inline void pqdownheap       (const unsigned char *depth, int *heap, const int heap_len, const ct_data *tree, int k);

And the definition:

-static inline void pqdownheap(unsigned char *depth, int *heap, const int heap_len, ct_data *tree, int k) {
+static inline void pqdownheap(const unsigned char *depth, int *heap, const int heap_len, const ct_data *tree, int k) {

Also applies to: 150-150


726-735: Minor micro-optimization/readability in byte-to-uint16 distance load

Avoid two post-increments when reading the 16-bit distance; compute then bump once. Same codegen on many compilers, slightly clearer in reviews.

-            dist = sym_buf[sx++] & 0xff;
-            dist += (unsigned)(sym_buf[sx++] & 0xff) << 8;
-            lc = sym_buf[sx++];
+            dist = (unsigned)sym_buf[sx] | ((unsigned)sym_buf[sx + 1] << 8);
+            sx += 2;
+            lc = sym_buf[sx++];

138-142: Make depth an explicit parameter of the pqremove macro

To avoid an implicit dependency on a free identifier, update the macro and its sole call site as follows:

• In trees.c (lines 138–142), change the macro signature:

-#define pqremove(s, heap, tree, top) { \
+#define pqremove(s, depth, heap, tree, top) { \
     top = heap[SMALLEST]; \
     heap[SMALLEST] = heap[s->heap_len--]; \
     pqdownheap(depth, heap, s->heap_len, tree, SMALLEST); \
 }

• In trees.c (around line 237), update the call in build_tree():

-        pqremove(s, heap, tree, n);  /* n = node of least frequency */
+        pqremove(s, depth, heap, tree, n);  /* n = node of least frequency */

No other pqremove invocations were found.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3a52db3 and b8e410b.

📒 Files selected for processing (1)
  • trees.c (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (40)
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Windows MSVC 2022 v141 Win32
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC UBSAN
  • GitHub Check: Ubuntu GCC ASAN
🔇 Additional comments (3)
trees.c (3)

156-173: pqdownheap rework looks correct and tighter

Heapify-down logic is standard: picks smaller child, uses depth as tie-breaker, and performs a single placement write at the end. Using v as a cached root value and passing heap_len explicitly avoids repeated s->heap_len loads. Looks good.


183-268: build_tree refactor: semantics preserved with clearer data flow

Good use of local pointers (depth/heap/tree) to reduce indirection. Forcing at least two non-zero codes, heap construction, combining two smallest nodes, and bookkeeping of heap_max match the canonical algorithm. gen_bitlen/gen_codes integration unchanged. Nice cleanup.


717-725: Good: reduce indirection in compress_block by caching sym_next and buffers

Capturing sym_next and buffer pointers locally reduces repeated loads and enables the compiler to optimize the inner loop better. Aligned with the PR goal.

Copy link
Copy Markdown
Member

@nmoinvaz nmoinvaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
trees.c (3)

69-69: Inline pqdownheap signature refactor looks good; consider const-qualifying read-only params

pqdownheap now operates on explicit depth/heap arrays. Since it only reads from depth and tree, you can const-qualify them to help the compiler and document intent.

Apply this diff:

- static inline void pqdownheap       (unsigned char *depth, int *heap, const int heap_len, ct_data *tree, int k);
+ static inline void pqdownheap       (const unsigned char *depth, int *heap, const int heap_len, const ct_data *tree, int k);

134-142: Wrap pqremove macro in do { } while (0) to behave as a single statement

Current multi-statement macro can be used safely as a statement, but do-while(0) is a safer pattern in all contexts (if/else without braces, etc.).

-#define pqremove(s, depth, heap, tree, top) { \
-    top = heap[SMALLEST]; \
-    heap[SMALLEST] = heap[s->heap_len--]; \
-    pqdownheap(depth, heap, s->heap_len, tree, SMALLEST); \
-}
+#define pqremove(s, depth, heap, tree, top) do { \
+    top = heap[SMALLEST]; \
+    heap[SMALLEST] = heap[s->heap_len--]; \
+    pqdownheap(depth, heap, s->heap_len, tree, SMALLEST); \
+} while (0)

150-174: pqdownheap inline implementation: logic matches canonical heapify; no regressions spotted

  • Uses explicit heap_len and local v; comparisons and swaps are correct.
  • Integration with smaller() tie-breaker remains unchanged.

Optional: mirror the header nitpick by const-qualifying depth/tree here too:

-static inline void pqdownheap(unsigned char *depth, int *heap, const int heap_len, ct_data *tree, int k) {
+static inline void pqdownheap(const unsigned char *depth, int *heap, const int heap_len, const ct_data *tree, int k) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b8e410b and 6e67d21.

📒 Files selected for processing (1)
  • trees.c (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
trees.c (1)
deflate.h (5)
  • tree_desc_s (261-261)
  • tree_desc_s (260-260)
  • tree_desc_s (259-259)
  • tree_desc_s (108-112)
  • static_tree_desc_s (106-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2022 v140 Win64
🔇 Additional comments (5)
trees.c (5)

183-268: build_tree refactor: clear, localized, and faithful to original algorithm

  • Using local pointers (depth, heap, tree) reduces indirection and is cache-friendly.
  • Heap construction, forced minimal codes, combination loop, and bookkeeping match zlib’s classic approach.
  • Properly updates desc->max_code and calls gen_bitlen/gen_codes.

No functional issues found.


560-583: LGTM: send_all_trees adopts the same bi_buf/bi_valid caching

Consistent with send_tree; storing back to s at the end is correct and avoids redundant field accesses within the emission loop.


717-749: compress_block: local aliases and bounded loop guard are solid micro-opts

  • Capturing sym_next once avoids repeated loads and preserves the intended snapshot semantics.
  • Local d_buf/l_buf or sym_buf aliases reduce indirection in the hot loop.
  • The do-while with precheck keeps behavior identical for sym_next == 0.

Looks good.


69-69: Keep declaration and definition in sync if const-qualifying pqdownheap

If you adopt the const-qualification nitpick for the prototype (Line 69), ensure the definition (Line 150) is updated identically to avoid mismatched prototypes.

Also applies to: 150-150


476-521: send_code/send_bits updated with new parameters, no legacy calls remain

  • Macros in trees_emit.h now define send_code(s, c, tree, bi_buf, bi_valid) and send_bits(s, t_val, t_len, bi_buf, bi_valid).
  • No 3-argument send_code calls were found.
  • The only send_bits usage in arch/s390/dfltcc_deflate.c already passes five arguments (state, value, length, bi_buf, bi_valid).

All call sites have been migrated to the new signatures.

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Aug 19, 2025

The failures on macOS workflows seem to be caused by mismatch between gcc/clang version and Xcode version. Xcode 16 is currently oldest supported but some workflows try to use for example Xcode 15.2.

@Dead2 Dead2 merged commit ba3cc63 into develop Aug 20, 2025
285 of 292 checks passed
@Dead2 Dead2 deleted the trees-opt branch August 23, 2025 17:22
@Dead2 Dead2 mentioned this pull request Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants