Skip to content

Inline read_buf and flush_pending#1952

Merged
Dead2 merged 4 commits intodevelopfrom
inlining
Aug 26, 2025
Merged

Inline read_buf and flush_pending#1952
Dead2 merged 4 commits intodevelopfrom
inlining

Conversation

@Dead2
Copy link
Copy Markdown
Member

@Dead2 Dead2 commented Aug 16, 2025

  • Move "Architecture-specific hooks" from deflate.c to the more appropriate deflate.h
  • Inline read_buf and partially inline flush_pending

This results in a small performance improvement for compression.

Summary by CodeRabbit

  • Refactor

    • Streamlined input reading and output flushing into inlined helpers for lower overhead and simpler call paths.
  • New Features

    • Introduced an architecture hook layer to support platform-specific deflate optimizations.
  • Chores

    • Reorganized internal includes and updated call sites to use the new helpers; no public APIs or signatures changed.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Aug 16, 2025

Walkthrough

Adds a functable include and inlines I/O helpers (read_buf, flush_pending_inline); replaces PREFIX-prefixed read/flush call sites across deflate sources; removes S390-specific hooks from deflate.c; and introduces architecture hook macros and window-alignment constants in deflate.h (S390-gated).

Changes

Cohort / File(s) Summary of changes
Inline I/O helpers
deflate_p.h
Added #include "functable.h". Removed public PREFIX(read_buf) prototype. Added Z_FORCEINLINE read_buf(...) (checksum-aware copies using FUNCTABLE_CALL for crc/adler or memcpy) and flush_pending_inline(...) to flush pending output.
Deflate core
deflate.c
Added #include "functable.h" moved earlier. Removed local S390-specific hooks block and related DEFLATE_* macros/hooks. Replaced PREFIX(read_buf)(...) usages with unprefixed read_buf(...). Switched local flush_pending usage to flush_pending_inline(strm); wrapper delegates to inline.
Stored deflate path
deflate_stored.c
Replaced PREFIX(read_buf) macro-wrapped calls with direct read_buf(...) calls in stored-path copy logic; no other control-flow changes.
S390 arch hook implementation
arch/s390/dfltcc_deflate.c
Replaced PREFIX(flush_pending)(strm) with flush_pending_inline(strm) in send_eobs; no other logic changes.
Architecture hooks & alignment
deflate.h
Added S390-gated include arch/s390/dfltcc_deflate.h. Defined PAD_WINDOW, WINDOW_PAD_SIZE, HINT_ALIGNED_WINDOW (4096 for S390, 64 fallback). Introduced DEFLATE_* arch-hook macros with default no-op/value implementations for non-S390 builds.

Sequence Diagram(s)

sequenceDiagram
  participant App
  participant Deflate as deflate()
  participant Read as read_buf
  participant FT as functable
  participant Flush as flush_pending_inline

  App->>Deflate: deflate(strm, flush)
  Deflate->>Read: read_buf(strm, buf, size)
  alt wrap == gzip
    Read->>FT: FUNCTABLE_CALL(crc32_fold_copy,...)
    FT-->>Read: bytes copied
  else wrap == zlib
    Read->>FT: FUNCTABLE_CALL(adler32_fold_copy,...)
    FT-->>Read: bytes copied
  else
    Read-->>Deflate: memcpy (bytes copied)
  end
  Deflate->>Flush: flush_pending_inline(strm)
  Flush-->>Deflate: pending updated / bytes flushed
  Deflate-->>App: return status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • nmoinvaz
  • phprus

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 79fe5b3 and 7197e0d.

📒 Files selected for processing (5)
  • arch/s390/dfltcc_deflate.c (1 hunks)
  • deflate.c (6 hunks)
  • deflate.h (1 hunks)
  • deflate_p.h (3 hunks)
  • deflate_stored.c (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • deflate_stored.c
  • deflate.h
  • deflate.c
  • deflate_p.h
🔇 Additional comments (1)
arch/s390/dfltcc_deflate.c (1)

95-95: Confirmed: flush_pending_inline fully delegates via wrapper

Verification shows that:

  • flush_pending_inline is defined as an inline helper in deflate_p.h (around line 107) and contains the complete flushing logic.
  • The PREFIX(flush_pending) wrapper (e.g. in deflate.c at lines 746–748) simply calls flush_pending_inline(strm) with no additional logic or conditional paths.
  • All other invocation sites of PREFIX(flush_pending) likewise delegate directly to the inline helper under all configurations.

No behavior is lost by replacing the wrapper call with the inline helper. The use of flush_pending_inline(strm) on this hot path is correct and requires no further changes.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch inlining

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
deflate_stored.c (2)

97-101: Use the returned byte count from read_buf to update out pointers

read_buf returns how many bytes were actually copied. While len is expected to be available here, using the return value makes the code robust and self-documenting.

Apply this diff:

-            read_buf(s->strm, s->strm->next_out, len);
-            s->strm->next_out += len;
-            s->strm->avail_out -= len;
-            s->strm->total_out += len;
+            unsigned got = read_buf(s->strm, s->strm->next_out, len);
+            s->strm->next_out  += got;
+            s->strm->avail_out -= got;
+            s->strm->total_out += got;

160-163: Likewise, prefer read_buf’s return value when updating window/write indices

As above, using the actual number of bytes read avoids any future fragility if the invariants change.

Apply this diff:

-        read_buf(s->strm, s->window + s->strstart, have);
-        s->strstart += have;
-        s->insert += MIN(have, s->w_size - s->insert);
+        unsigned got = read_buf(s->strm, s->window + s->strstart, have);
+        s->strstart += got;
+        s->insert += MIN(got, s->w_size - s->insert);
deflate.c (2)

781-792: Direct use of flush_pending_inline in hot path is fine; consider consistency

Using the inline in the tight loop removes call overhead. Elsewhere in this function we still call PREFIX(flush_pending). It’s okay to mix, but consider standardizing call sites for consistency (either all inline in hot paths, wrapper elsewhere), to reduce reader surprise.


835-841: Minor: consider using flush_pending_inline here too

Not performance-critical, but for symmetry with other header/trailer flushes you could call flush_pending_inline here as well.

Apply this diff if you prefer consistency:

-        PREFIX(flush_pending)(strm);
+        flush_pending_inline(strm);
deflate_p.h (1)

133-156: read_buf inline mirrors previous semantics; checksum handling looks solid

  • Returns 0 early if no input; proper min(size, avail_in).
  • Handles checksum updates:
    • wrap == 2: crc32_fold_copy (guarded by GZIP)
    • wrap == 1: adler32_fold_copy
    • else/no checksum: memcpy
  • Updates avail_in, next_in, total_in.

One tiny nit: if you expect any future call sites to rely on overlapping next_in/target buffers (unlikely for deflate), memmove would be safer than memcpy in the no-checksum branches. Current API assumes non-overlapping buffers, so leaving memcpy is fine.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3a52db3 and b395866.

📒 Files selected for processing (4)
  • deflate.c (6 hunks)
  • deflate.h (1 hunks)
  • deflate_p.h (2 hunks)
  • deflate_stored.c (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-02-18T02:53:23.621Z
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-18T02:53:23.621Z
Learning: In zlib-ng, when macro names and parameters are self-explanatory and scalar versions of the code provide context, additional documentation for the macros may not be necessary.

Applied to files:

  • deflate.c
🧬 Code Graph Analysis (3)
deflate_stored.c (1)
deflate_p.h (1)
  • read_buf (133-155)
deflate.h (2)
arch/s390/dfltcc_common.h (2)
  • dfltcc_state (75-75)
  • dfltcc_state (83-83)
arch/s390/dfltcc_inflate.c (7)
  • inflate_state (144-144)
  • inflate_state (115-115)
  • inflate_state (24-24)
  • inflate_state (183-183)
  • inflate_state (38-38)
  • inflate_state (169-169)
  • inflate_state (53-53)
deflate.c (1)
deflate_p.h (2)
  • flush_pending_inline (107-125)
  • read_buf (133-155)
🔇 Additional comments (12)
deflate.c (8)

51-51: Including functable.h here is correct and necessary

This ensures FUNCTABLE_INIT and FUNCTABLE_CALL are available before use in deflateInit2 and other paths. Inclusion order with deflate.h/deflate_p.h is sane.


743-749: Wrapper now delegates to flush_pending_inline — good refactor

Keeping the exported/internal PREFIX(flush_pending) while inlining the core logic centralizes behavior and preserves ABI/visibility.


966-972: LGTM: inline flush after gzip header states

Using flush_pending_inline here keeps the header emission path lean.


1043-1044: LGTM: inline flush for trailer emission

Right place to ensure trailer bytes are pushed without extra call overhead.


1266-1268: Switch to inlined read_buf in fill_window is correct

Semantics preserved: read_buf updates avail_in/next_in/total_in and returns bytes read, which are accumulated into lookahead via n. No behavioral change except reduced call overhead.


430-436: Dictionary path: wrap suppression remains compatible with new read_buf

Setting s->wrap = 0 to suppress checksum updates during dictionary insertion still works with the new read_buf logic that branches on wrap and DEFLATE_NEED_CHECKSUM.


980-985: Hook invocation preserved; no regression risk spotted

The DEFLATE_HOOK macro is retained and evaluated before selecting the strategy function, matching previous behavior while decoupling arch hooks to deflate.h.


249-251: FUNCTABLE_INIT placement is correct

Function table is initialized early in deflateInit2, before checksum folding calls; good.

deflate_p.h (3)

12-13: Including functable.h in this header is appropriate now

Given read_buf and flush_pending_inline depend on FUNCTABLE_CALL, pulling functable.h here prevents accidental missing includes at call sites.


101-126: flush_pending_inline is correct and side-effect accounting is complete

  • Ensures bit buffer is flushed before copying.
  • Correctly updates next_out, total_out, avail_out, pending, and pending_out reset.
  • Tracev retained.

161-169: FLUSH_BLOCK macros still call the wrapper; that’s fine

The wrapper now delegates to the inline; no behavior change.

deflate.h (1)

456-494: Good relocation and centralization of arch-specific hooks

  • Moving hooks and window-alignment macros here simplifies deflate.c and improves modularity.
  • Sensible defaults for non-arch builds:
    • Alignment: 64-byte
    • Hooks: benign no-ops
    • DEFLATE_NEED_CHECKSUM defaults to 1 to keep legacy behavior

No issues spotted.

@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Aug 16, 2025

Develop Aug 14

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0783/0.0786/0.0787/0.0001    0.0302/0.0310/0.0310/0.0002        8,526,745
 2     43.871%  0.1284/0.1291/0.1293/0.0002    0.0304/0.0305/0.0305/0.0000        6,903,702
 3     42.388%  0.1561/0.1562/0.1564/0.0001    0.0292/0.0293/0.0293/0.0000        6,670,239
 4     41.647%  0.1795/0.1799/0.1801/0.0002    0.0285/0.0286/0.0286/0.0000        6,553,746
 5     41.216%  0.1945/0.1949/0.1953/0.0002    0.0283/0.0284/0.0285/0.0000        6,485,936
 6     41.038%  0.2402/0.2407/0.2411/0.0002    0.0280/0.0281/0.0281/0.0000        6,457,827
 7     40.778%  0.3374/0.3378/0.3381/0.0002    0.0283/0.0283/0.0284/0.0000        6,416,941
 8     40.704%  0.4448/0.4453/0.4457/0.0002    0.0283/0.0284/0.0284/0.0000        6,405,249
 9     40.409%  0.5249/0.5256/0.5260/0.0002    0.0276/0.0276/0.0276/0.0000        6,358,951

 avg1  42.915%                       0.2542                         0.0289
 tot                                68.6448                         7.8051       60,779,336

   text    data     bss     dec     hex filename
 159302    1344       8  160654   2738e libz-ng.so.2

compress_bench/compress_bench/1                 3759 ns         3759 ns       745247
compress_bench/compress_bench/8                 4012 ns         4012 ns       698459
compress_bench/compress_bench/16                4204 ns         4204 ns       668137
compress_bench/compress_bench/32                4527 ns         4527 ns       619999
compress_bench/compress_bench/64                4863 ns         4863 ns       575942
compress_bench/compress_bench/512               4898 ns         4898 ns       572839
compress_bench/compress_bench/4096              5445 ns         5445 ns       512955
compress_bench/compress_bench/32768            10006 ns        10006 ns       280370

PR inline read_buf

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0779/0.0785/0.0786/0.0001    0.0302/0.0310/0.0310/0.0002        8,526,745
 2     43.871%  0.1298/0.1302/0.1306/0.0002    0.0304/0.0305/0.0305/0.0000        6,903,702
 3     42.388%  0.1536/0.1539/0.1541/0.0001    0.0292/0.0293/0.0293/0.0000        6,670,239
 4     41.647%  0.1772/0.1777/0.1780/0.0002    0.0285/0.0286/0.0286/0.0000        6,553,746
 5     41.216%  0.1937/0.1941/0.1945/0.0002    0.0284/0.0284/0.0285/0.0000        6,485,936
 6     41.038%  0.2394/0.2400/0.2403/0.0002    0.0280/0.0281/0.0281/0.0000        6,457,827
 7     40.778%  0.3358/0.3366/0.3372/0.0003    0.0283/0.0283/0.0284/0.0000        6,416,941
 8     40.704%  0.4437/0.4442/0.4446/0.0002    0.0283/0.0284/0.0284/0.0000        6,405,249
 9     40.409%  0.5250/0.5256/0.5259/0.0003    0.0275/0.0276/0.0276/0.0000        6,358,951

 avg1  42.915%                       0.2534                         0.0289
 tot                                68.4254                         7.8047       60,779,336

   text    data     bss     dec     hex filename
 160358    1344       8  161710   277ae libz-ng.so.2

compress_bench/compress_bench/1                 3708 ns         3708 ns       754755
compress_bench/compress_bench/8                 3979 ns         3979 ns       705964
compress_bench/compress_bench/16                4149 ns         4149 ns       673768
compress_bench/compress_bench/32                4457 ns         4458 ns       630022
compress_bench/compress_bench/64                4787 ns         4787 ns       588033
compress_bench/compress_bench/512               4810 ns         4810 ns       580866
compress_bench/compress_bench/4096              5334 ns         5334 ns       520376
compress_bench/compress_bench/32768             9814 ns         9814 ns       284802

Deflatebench shows around 0.3-0.4% speedup for compression.
compress_bench shows 1.3% to 1.9% speedup.

@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Aug 20, 2025

Rebased and split up the inlining into two commits.
Inlining read_buf seems to be a clear win.
Partial inlining of flush_pending I am uncertain of. If anyone wants to do comparisons of with/without the last commit on their systems, please do.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
deflate_p.h (4)

101-126: flush_pending_inline: Semantics look correct; consider unifying with out-of-line wrapper to avoid divergence

The implementation mirrors the classic flush_pending: flush bits, copy min(pending, avail_out), bump pointers/counters, and reset pending_out when empty. Looks good.

Optional: to minimize the risk of the inline and the out-of-line PREFIX(flush_pending) drifting over time, make the latter a thin wrapper around this inline helper (or have FLUSH_BLOCK call the inline). This keeps one source of truth while still enabling inlining where used.

As a follow-up, consider changing the non-inlined wrapper in deflate.c to:

/* deflate.c */
Z_INTERNAL void PREFIX(flush_pending)(PREFIX3(stream) *strm) {
    flush_pending_inline(strm);
}

And/or update the macro here to use the inline helper:

/* deflate_p.h */
#define FLUSH_BLOCK_ONLY(s, last) { \
    zng_tr_flush_block(s, (s->block_start >= 0 ? \
                   (char *)&s->window[(unsigned)s->block_start] : \
                   NULL), \
                   (uint32_t)((int)s->strstart - s->block_start), \
                   (last)); \
    s->block_start = (int)s->strstart; \
    flush_pending_inline(s->strm); \
}

101-126: Minor type consistency nit: consider uInt for len

len derives from MIN(s->pending, strm->avail_out). Since avail_out is uInt, using uInt for len avoids implicit conversions when updating avail_out, while keeping the same range characteristics on supported platforms.

-static inline void flush_pending_inline(PREFIX3(stream) *strm) {
-    uint32_t len;
+static inline void flush_pending_inline(PREFIX3(stream) *strm) {
+    uInt len;

133-155: Micro-optimization: avoid repeated strm->state dereferences

Hoisting strm->state to a local helps codegen and readability, and shortens the FUNCTABLE_CALL argument for crc/adler paths.

 static inline unsigned read_buf(PREFIX3(stream) *strm, unsigned char *buf, unsigned size) {
-    uint32_t len = MIN(strm->avail_in, size);
+    uint32_t len = MIN(strm->avail_in, size);
+    deflate_state *s = strm->state;
@@
-    if (!DEFLATE_NEED_CHECKSUM(strm)) {
+    if (!DEFLATE_NEED_CHECKSUM(strm)) {
         memcpy(buf, strm->next_in, len);
 #ifdef GZIP
-    } else if (strm->state->wrap == 2) {
-        FUNCTABLE_CALL(crc32_fold_copy)(&strm->state->crc_fold, buf, strm->next_in, len);
+    } else if (s->wrap == 2) {
+        FUNCTABLE_CALL(crc32_fold_copy)(&s->crc_fold, buf, strm->next_in, len);
 #endif
-    } else if (strm->state->wrap == 1) {
+    } else if (s->wrap == 1) {
         strm->adler = FUNCTABLE_CALL(adler32_fold_copy)(strm->adler, buf, strm->next_in, len);
     } else {
         memcpy(buf, strm->next_in, len);
     }

133-155: Consider force-inlining for hot paths

If the intent is to realize consistent perf wins across compilers/optimization levels, using the project’s force-inline attribute macro (if available, e.g., Z_FORCEINLINE) can help ensure these helpers are inlined, especially in non-LTO builds.

-static inline void flush_pending_inline(PREFIX3(stream) *strm) {
+Z_FORCEINLINE static void flush_pending_inline(PREFIX3(stream) *strm) {
@@
-static inline unsigned read_buf(PREFIX3(stream) *strm, unsigned char *buf, unsigned size) {
+Z_FORCEINLINE static unsigned read_buf(PREFIX3(stream) *strm, unsigned char *buf, unsigned size) {

If Z_FORCEINLINE isn’t available or desired, ignore this.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b395866 and af529f5.

📒 Files selected for processing (4)
  • deflate.c (6 hunks)
  • deflate.h (1 hunks)
  • deflate_p.h (2 hunks)
  • deflate_stored.c (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • deflate.c
  • deflate.h
  • deflate_stored.c
🔇 Additional comments (2)
deflate_p.h (2)

12-13: Include functable.h here is correct and necessary for FUNCTABLE_CALL usage

This enables the crc/adler fold-copy dispatch below. No concerns.


127-156: No stale PREFIX(read_buf) references found; ready to approve

A quick ripgrep search confirms there are no remaining calls to PREFIX(read_buf)(…). The only usages of read_buf(…) in deflate_stored.c and deflate.c will correctly bind to the new static inline implementation. LGTM.

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 20, 2025

Codecov Report

❌ Patch coverage is 92.85714% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.77%. Comparing base (4378e43) to head (7197e0d).
⚠️ Report is 6 commits behind head on develop.

Files with missing lines Patch % Lines
deflate_p.h 91.42% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1952      +/-   ##
===========================================
- Coverage    81.91%   80.77%   -1.15%     
===========================================
  Files          162      162              
  Lines        14013    14405     +392     
  Branches      3137     3304     +167     
===========================================
+ Hits         11479    11635     +156     
- Misses        1568     1739     +171     
- Partials       966     1031      +65     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Dead2 Dead2 force-pushed the inlining branch 2 times, most recently from b0962c4 to 747f408 Compare August 21, 2025 20:40
@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Aug 21, 2025

Redid benches after rebase, force-inline and removal of indirections in read_buf.
I chose the least beneficial benchmark of the several different configs I tested to present here:

## Develop
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0774/0.0775/0.0776/0.0001    0.0299/0.0308/0.0308/0.0002        8,526,745
 2     43.871%  0.1256/0.1258/0.1260/0.0001    0.0302/0.0302/0.0302/0.0000        6,903,702
 3     42.388%  0.1503/0.1507/0.1509/0.0002    0.0290/0.0291/0.0291/0.0000        6,670,239
 4     41.647%  0.1746/0.1750/0.1753/0.0002    0.0283/0.0283/0.0284/0.0000        6,553,746
 5     41.216%  0.1907/0.1911/0.1914/0.0002    0.0282/0.0282/0.0282/0.0000        6,485,936
 6     41.038%  0.2362/0.2366/0.2370/0.0002    0.0278/0.0279/0.0279/0.0000        6,457,827
 7     40.778%  0.3328/0.3334/0.3338/0.0003    0.0281/0.0281/0.0281/0.0000        6,416,941
 8     40.704%  0.4405/0.4409/0.4413/0.0002    0.0281/0.0282/0.0282/0.0000        6,405,249
 9     40.409%  0.5233/0.5239/0.5244/0.0003    0.0273/0.0273/0.0274/0.0000        6,358,951

 avg1  42.915%                       0.2505                         0.0287
 tot                                67.6470                         7.7426       60,779,336

   text    data     bss     dec     hex filename
 176742    1344       8  178094   2b7ae libz-ng.so.2

compress_bench/compress_bench/1                 3711 ns         3711 ns       754332
compress_bench/compress_bench/8                 4004 ns         4004 ns       699798
compress_bench/compress_bench/16                4144 ns         4144 ns       675111
compress_bench/compress_bench/32                4473 ns         4473 ns       623637
compress_bench/compress_bench/64                4772 ns         4772 ns       588462
compress_bench/compress_bench/512               4851 ns         4851 ns       577228
compress_bench/compress_bench/4096              5317 ns         5317 ns       526819
compress_bench/compress_bench/32768             9433 ns         9434 ns       296222

PR

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%  0.0774/0.0776/0.0777/0.0001    0.0299/0.0308/0.0308/0.0002        8,526,745
 2     43.871%  0.1258/0.1264/0.1266/0.0002    0.0302/0.0302/0.0302/0.0000        6,903,702
 3     42.388%  0.1494/0.1496/0.1499/0.0002    0.0290/0.0291/0.0291/0.0000        6,670,239
 4     41.647%  0.1736/0.1739/0.1741/0.0002    0.0283/0.0283/0.0284/0.0000        6,553,746
 5     41.216%  0.1897/0.1900/0.1903/0.0002    0.0281/0.0282/0.0282/0.0000        6,485,936
 6     41.038%  0.2351/0.2355/0.2357/0.0002    0.0278/0.0279/0.0279/0.0000        6,457,827
 7     40.778%  0.3321/0.3327/0.3332/0.0003    0.0281/0.0281/0.0281/0.0000        6,416,941
 8     40.704%  0.4395/0.4400/0.4404/0.0003    0.0281/0.0282/0.0282/0.0000        6,405,249
 9     40.409%  0.5237/0.5242/0.5246/0.0003    0.0273/0.0273/0.0273/0.0000        6,358,951

 avg1  42.915%                       0.2500                         0.0287
 tot                                67.4958                         7.7423       60,779,336

   text    data     bss     dec     hex filename
 177222    1344       8  178574   2b98e libz-ng.so.2

compress_bench/compress_bench/1                 3724 ns         3724 ns       751829
compress_bench/compress_bench/8                 3958 ns         3958 ns       708012
compress_bench/compress_bench/16                4136 ns         4136 ns       678178
compress_bench/compress_bench/32                4451 ns         4451 ns       631063
compress_bench/compress_bench/64                4725 ns         4725 ns       594877
compress_bench/compress_bench/512               4806 ns         4806 ns       582182
compress_bench/compress_bench/4096              5311 ns         5311 ns       527637
compress_bench/compress_bench/32768             9389 ns         9390 ns       299647

Deflatebench overall 0.22% faster
Deflatebench level-6 0.46% faster
compress_bench/32k 0.46% faster

All in all I am pretty happy with this.

I also retested inlining of read_buf and partial inlining of flush_pending independently and they both improve performance.
Fully inlining flush_pending does decrease performance quite a bit, because in several places it is behind unlikely branches, thus I am only inlining the places where it is often used.

@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Aug 22, 2025

Inlined flush_pending in DFLTCC's send_eobs() as well.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
arch/s390/dfltcc_deflate.c (1)

94-104: Missing header for flush_pending_inline leads to implicit declaration / no inlining

flush_pending_inline is a static inline defined in deflate_p.h. This file doesn’t include that header, so the call will either fail to compile (no prototype) or won't inline as intended. Include deflate_p.h after deflate.h.

Apply:

 #include "deflate.h"
+#include "deflate_p.h"
 #include "trees_emit.h"
🧹 Nitpick comments (1)
deflate.c (1)

51-54: Dup include of functable.h (benign); consider relying on deflate_p.h

deflate_p.h already includes functable.h. Keeping both is harmless thanks to include guards, but you can remove this one to reduce redundancy.

Apply:

-#include "functable.h"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 747f408 and ee8e2f0.

📒 Files selected for processing (3)
  • arch/s390/dfltcc_deflate.c (1 hunks)
  • deflate.c (6 hunks)
  • deflate_p.h (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-02-18T02:53:23.621Z
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-18T02:53:23.621Z
Learning: In zlib-ng, when macro names and parameters are self-explanatory and scalar versions of the code provide context, additional documentation for the macros may not be necessary.

Applied to files:

  • deflate.c
🧬 Code graph analysis (2)
deflate.c (4)
deflate_p.h (1)
  • void (107-125)
arch/generic/crc32_fold_c.c (1)
  • void (15-18)
adler32.c (3)
  • PREFIX (11-13)
  • PREFIX (22-24)
  • PREFIX (58-60)
crc32.c (2)
  • PREFIX (21-25)
  • PREFIX (35-37)
deflate_p.h (1)
deflate.c (1)
  • void (746-748)
🔇 Additional comments (8)
arch/s390/dfltcc_deflate.c (1)

93-105: Behavior preserved: send_eobs() still flushes bits then pending bytes

Switching to flush_pending_inline(strm) correctly flushes bit buffer first (via zng_tr_flush_bits inside the inline) and copies out pending bytes. The subsequent memmove-on-remaining-pending logic remains valid, since pending_out is only reset to pending_buf when pending == 0. LGTM.

deflate_p.h (2)

12-13: functable.h inclusion is appropriate here

Required for FUNCTABLE_CALL used below and keeps inline helpers self-contained. LGTM.


101-126: Inline flush is correct and side-effect complete

flush_pending_inline() flushes pending bits first, respects avail_out, updates all counters, and restores pending_out when drained. Matches legacy behavior while enabling inlining on hot paths.

deflate.c (5)

743-748: Wrapper retained for ABI/internal call sites; implementation delegates to inline

The wrapper now just calls flush_pending_inline, preserving symbol presence while enabling selective inlining elsewhere. LGTM.


781-791: Hot path switch to inline flush is appropriate

Inlining the flush on the main deflate() path is consistent with the PR goal and should help a bit with register allocation and call overhead in tight loops.


966-972: Header/HCRC path uses inline flush — fine

Using flush_pending_inline here ensures the buffer is empty before proceeding, matching prior behavior while allowing the compiler to inline when profitable.


1043-1054: Trailer path uses inline flush — fine

Keeps semantics identical (trailer written once, then flush). LGTM.


1226-1227: Switched fill_window() to new read_buf helper

The inline read_buf updates checksum via fold-copy when needed and advances stream counters atomically. This centralizes I/O handling and is a good change.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deflate_p.h (1)

101-126: flush_pending_inline: correct semantics; tiny branch-hint nit

Logic matches the previous non-inlined helper: flush bit buffer, copy min(pending, avail_out), update pointers/counters, and reset pending_out when empty. Consider adding an unlikely-branch hint for the early return to help the predictor in hot paths.

-    if (len == 0)
+    if (UNLIKELY(len == 0))
         return;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ee8e2f0 and 86d0fa3.

📒 Files selected for processing (4)
  • arch/s390/dfltcc_deflate.c (1 hunks)
  • deflate.c (6 hunks)
  • deflate_p.h (2 hunks)
  • deflate_stored.c (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • arch/s390/dfltcc_deflate.c
  • deflate_stored.c
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-02-18T02:53:23.621Z
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-18T02:53:23.621Z
Learning: In zlib-ng, when macro names and parameters are self-explanatory and scalar versions of the code provide context, additional documentation for the macros may not be necessary.

Applied to files:

  • deflate.c
🧬 Code graph analysis (2)
deflate_p.h (2)
deflate.c (1)
  • void (746-748)
trees.c (1)
  • void (375-407)
deflate.c (5)
deflate_p.h (1)
  • void (107-125)
arch/generic/crc32_fold_c.c (1)
  • void (15-18)
crc32.c (2)
  • PREFIX (21-25)
  • PREFIX (35-37)
adler32.c (3)
  • PREFIX (11-13)
  • PREFIX (22-24)
  • PREFIX (58-60)
deflate.h (2)
  • deflate_allocs_s (135-143)
  • internal_state (120-120)
🔇 Additional comments (8)
deflate_p.h (2)

12-13: LGTM: functable.h include is appropriate for FUNCTABLE_CALL usage

Including functable.h here is correct since read_buf and (gzip/adler) fold copies rely on FUNCTABLE_CALL. Include guards prevent double inclusion with deflate.c.


127-156: No stale PREFIX(read_buf) calls—read_buf is correctly inlined

A ripgrep search confirmed that there are no call sites using the removed PREFIX(read_buf) macro, and existing calls in deflate_stored.c and deflate.c invoke the bare read_buf, which is now provided as a static inline in deflate_p.h:

  • deflate_stored.c: calls to read_buf(s->strm, …) at lines 97 and 160
  • deflate.c: call to n = read_buf(…) at line 1226
  • No hits for PREFIX(read_buf)(…) outside deflate_p.h

Since each translation unit including deflate_p.h will see the inline definition of read_buf, there are no unresolved symbols to worry about. This comment can be resolved as no further changes are needed.

deflate.c (6)

51-51: Include order is fine

functable.h is needed here because this TU directly uses FUNCTABLE_CALL (e.g., adler32/crc32/slide_hash). Keeping it above deflate.h is consistent with prior style.


743-749: Wrapper preserves ABI while enabling hot-path inlining

Keeping Z_INTERNAL PREFIX(flush_pending) as a thin wrapper over flush_pending_inline retains the exported symbol and avoids code bloat where the wrapper is used in colder paths.


779-792: Good: hot-path switch to flush_pending_inline

Inlining here avoids a function call in the common pending != 0 case and pairs well with the earlier bit-buffer flush in the inline. No functional issues spotted.


966-972: Header/HCRC transition: inline flush is appropriate

Using flush_pending_inline here keeps the pipeline warm as we transition states. Matches the partial inlining strategy noted in the PR description.


1043-1044: Trailer flush: inline usage is consistent

Inlining at the trailer write is sensible, given it’s on the finishing hot path. State updates remain identical to the wrapper.


1226-1227: read_buf callsites fully migrated and verified

All calls to read_buf now resolve to the new TU-local inline defined in deflate_p.h. No legacy wrapper calls remain.

• deflate.c (lines 1226–1227): calls inline read_buf as intended.
• deflate_stored.c (lines 97 and 160): also invoke the inline version from deflate_p.h.

No further changes required.

@Dead2
Copy link
Copy Markdown
Member Author

Dead2 commented Aug 23, 2025

Rebased

@Dead2 Dead2 merged commit 83a84fa into develop Aug 26, 2025
367 of 376 checks passed
@Dead2 Dead2 mentioned this pull request Nov 5, 2025
@Dead2 Dead2 deleted the inlining branch November 29, 2025 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants