Skip to content

Make deflate output deterministic if PREFIX3(stream) is reused after deflateReset#2102

Merged
Dead2 merged 1 commit intozlib-ng:developfrom
phprus:deterministic-deflate-2100
Jan 20, 2026
Merged

Make deflate output deterministic if PREFIX3(stream) is reused after deflateReset#2102
Dead2 merged 1 commit intozlib-ng:developfrom
phprus:deterministic-deflate-2100

Conversation

@phprus
Copy link
Copy Markdown
Contributor

@phprus phprus commented Jan 13, 2026

Issue: #2100

@QrczakMK

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 13, 2026

Walkthrough

This PR tightens a boundary check in LONGEST_MATCH (changing len > lookahead to len >= lookahead in match_tpl.h) and adds a new unit test verifying deterministic deflate output across a deflateReset (test/test_deflate_deterministic.cc, plus CMake update).

Changes

Cohort / File(s) Summary
Core logic update
match_tpl.h
Changed boundary check in LONGEST_MATCH from len > lookahead to len >= lookahead; updated comment to reflect checking for current match reaching or exceeding input end.
Test suite expansion
test/CMakeLists.txt, test/test_deflate_deterministic.cc
Added test_deflate_deterministic.cc and included it in TEST_SRCS under the appropriate CMake condition; new test asserts deflate produces the same output for the same input before and after deflateReset.

Possibly related PRs

Suggested labels

cleanup

Suggested reviewers

  • mtl1979
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: updating the boundary check in LONGEST_MATCH to make deflate output deterministic when reusing streams after deflateReset, which is the primary objective of the PR.
Description check ✅ Passed The description references issue #2100 and mentions the relevant contributor. While brief, it is directly related to the changeset's purpose of addressing deterministic deflate output.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e249c92 and d029f76.

📒 Files selected for processing (3)
  • match_tpl.h
  • test/CMakeLists.txt
  • test/test_deflate_reset_deterministic.cc
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:52.569Z
Learning: In zlib-ng PR #2077, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.
📚 Learning: 2025-11-15T12:43:50.353Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2001
File: test/CMakeLists.txt:10-12
Timestamp: 2025-11-15T12:43:50.353Z
Learning: In zlib-ng, the TEST_STOCK_ZLIB CMake variable is intentionally not defined as a CMake option. It's meant to be set manually via command-line (e.g., -DTEST_STOCK_ZLIB=ON) when testing against stock zlib. The testing workflow for stock zlib requires undocumented manual steps and is considered experimental/incomplete.

Applied to files:

  • test/CMakeLists.txt
📚 Learning: 2026-01-12T23:00:52.569Z
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:52.569Z
Learning: In zlib-ng PR #2077, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.

Applied to files:

  • test/CMakeLists.txt
  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.

Applied to files:

  • match_tpl.h
🪛 Cppcheck (2.19.0)
test/test_deflate_reset_deterministic.cc

[information] 3-3: Include file

(missingInclude)


[information] 7-7: Include file

(missingInclude)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[information] 13-13: Include file

(missingIncludeSystem)


[information] 14-14: Include file

(missingIncludeSystem)


[information] 16-16: Include file

(missingInclude)


[information] 20-20: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingInclude)

🔇 Additional comments (4)
match_tpl.h (1)

161-165: LGTM! Correct fix for the determinism issue.

The change from len > lookahead to len >= lookahead properly prevents the algorithm from searching for better matches when the current match already reaches the end of available input. After deflateReset, stale hash table entries could cause different match decisions; by returning early at the boundary condition, the output becomes deterministic regardless of prior stream state.

test/test_deflate_reset_deterministic.cc (2)

1-21: LGTM!

The include structure follows the existing test patterns in zlib-ng, with proper ZLIB_COMPAT handling for the header selection.


346-379: LGTM! Well-designed test case.

The test correctly validates deterministic output after deflateReset by:

  1. Establishing a baseline compression of data_a with a fresh stream
  2. Using a second stream to compress different data (data_b), populating hash tables
  3. Resetting the second stream and re-compressing data_a
  4. Asserting both compressions match

This directly exercises the scenario from issue #2100. The sizeof(data_a)-1 correctly accounts for the implicit null terminator in the string literal.

test/CMakeLists.txt (1)

187-187: LGTM!

Correct placement within the NOT TEST_STOCK_ZLIB block, as this test validates zlib-ng specific deterministic behavior that stock zlib doesn't provide. Based on learnings, TEST_STOCK_ZLIB is for manual testing against stock zlib.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from d029f76 to 9ff8c74 Compare January 13, 2026 20:14
Copy link
Copy Markdown
Collaborator

@mtl1979 mtl1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use implicit cast to remove constness of pointer.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from 9ff8c74 to 9f1e911 Compare January 13, 2026 20:36
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ff8c74 and 9f1e911.

📒 Files selected for processing (3)
  • match_tpl.h
  • test/CMakeLists.txt
  • test/test_deflate_reset_deterministic.cc
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:52.569Z
Learning: In zlib-ng PR #2077, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.
📚 Learning: 2025-11-15T12:43:50.353Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2001
File: test/CMakeLists.txt:10-12
Timestamp: 2025-11-15T12:43:50.353Z
Learning: In zlib-ng, the TEST_STOCK_ZLIB CMake variable is intentionally not defined as a CMake option. It's meant to be set manually via command-line (e.g., -DTEST_STOCK_ZLIB=ON) when testing against stock zlib. The testing workflow for stock zlib requires undocumented manual steps and is considered experimental/incomplete.

Applied to files:

  • test/CMakeLists.txt
📚 Learning: 2026-01-12T23:00:52.569Z
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:52.569Z
Learning: In zlib-ng PR #2077, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.

Applied to files:

  • test/CMakeLists.txt
  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.

Applied to files:

  • match_tpl.h
📚 Learning: 2025-02-21T01:41:50.358Z
Learnt from: KungFuJesus
Repo: zlib-ng/zlib-ng PR: 1872
File: arch/x86/chorba_sse2.c:14-24
Timestamp: 2025-02-21T01:41:50.358Z
Learning: In zlib-ng's SSE2 vectorized Chorba CRC implementation, the code that calls READ_NEXT macro ensures 16-byte alignment, making explicit alignment checks unnecessary within the macro.

Applied to files:

  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2026-01-12T22:53:18.025Z
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: crc32.c:34-45
Timestamp: 2026-01-12T22:53:18.025Z
Learning: In zlib-ng, defensive NULL checks at public API entry points (like crc32() checking buf before calling crc32_z()) are intentional and should not be flagged as redundant, even when the called function also checks. This is an accepted defensive coding practice.

Applied to files:

  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2025-07-04T20:17:58.531Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1928
File: gzlib.c:90-90
Timestamp: 2025-07-04T20:17:58.531Z
Learning: In zlib-ng's gz_buffer_alloc() function, the condition `(out_mult >= 1 && state->out == NULL)` is necessary because state->out is intentionally left as NULL when out_mult == 0 (write + direct mode), but should be allocated when out_mult >= 1. The condition properly distinguishes between intentionally NULL buffers and allocation failures.

Applied to files:

  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2026-01-12T02:33:10.554Z
Learnt from: mtl1979
Repo: zlib-ng/zlib-ng PR: 2097
File: zlib.map.in:20-24
Timestamp: 2026-01-12T02:33:10.554Z
Learning: The zlib.map.in file should remain as close as possible to Mark Adler's version of zlib. Changes that would improve readability or consistency (such as adding explicit `global:` keywords) should not be made unless they are first adopted in the upstream zlib project.

Applied to files:

  • test/test_deflate_reset_deterministic.cc
📚 Learning: 2024-12-25T19:45:06.009Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:596-597
Timestamp: 2024-12-25T19:45:06.009Z
Learning: We should verify out-of-bounds pointer arithmetic in chorba_118960_nondestructive() and related loops where "input + i" is used, especially when len < expected block sizes.

Applied to files:

  • test/test_deflate_reset_deterministic.cc
🧬 Code graph analysis (1)
test/test_deflate_reset_deterministic.cc (2)
test/test_crc32.cc (1)
  • buffer (44-47)
deflate.c (4)
  • deflate (758-1050)
  • deflateInit2 (244-371)
  • deflateEnd (1053-1063)
  • deflateReset (531-536)
🪛 Cppcheck (2.19.0)
test/test_deflate_reset_deterministic.cc

[information] 3-3: Include file

(missingInclude)


[information] 7-7: Include file

(missingInclude)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 11-11: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingIncludeSystem)


[information] 13-13: Include file

(missingIncludeSystem)


[information] 14-14: Include file

(missingIncludeSystem)


[information] 16-16: Include file

(missingInclude)


[information] 20-20: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingInclude)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (77)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: EL10 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
  • GitHub Check: macOS Clang ASAN (Intel)
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC ASAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Analyze (cpp)
  • GitHub Check: macOS GCC Symbol Prefix & Compat (Intel)
🔇 Additional comments (3)
test/CMakeLists.txt (1)

187-187: LGTM!

The new test file is correctly placed within the NOT TEST_STOCK_ZLIB conditional block, consistent with other zlib-ng specific tests that rely on internal behavior or specific compressed output layouts.

match_tpl.h (1)

161-165: LGTM! Correct fix for determinism after deflateReset.

When len == lookahead, the match already spans all remaining input, so continuing to search the hash chain is pointless and can produce non-deterministic results due to stale hash table entries after deflateReset. The tightened condition ensures early termination in this boundary case, aligning with the OUT assertion that match length should not exceed lookahead.

test/test_deflate_reset_deterministic.cc (1)

348-381: Well-structured regression test for issue #2100.

The test correctly validates the determinism fix by:

  1. Establishing baseline output from a fresh stream
  2. Using a second stream to compress different data (populating hash tables)
  3. Resetting the second stream and compressing the same input
  4. Verifying outputs match

The chosen parameters (level 6, window_bits 15, memLevel 8) are good defaults that exercise the standard code paths. The test data appears crafted to trigger the specific hash table residue scenario.

z_stream->next_out = (uint8_t *)buffer;
z_stream->avail_out = buffer_size;
err = PREFIX(deflate)(z_stream, op);
EXPECT_TRUE(err == Z_OK || err == Z_STREAM_END);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider using ASSERT_TRUE for earlier test termination on compression errors.

Using EXPECT_TRUE allows the loop to continue even if deflate() returns an unexpected error, potentially causing confusing follow-up failures. ASSERT_TRUE would terminate the test immediately on failure.

That said, this is a minor point since the loop will terminate naturally on Z_STREAM_END and the test outcome will still be reported correctly.

♻️ Suggested change
-        EXPECT_TRUE(err == Z_OK || err == Z_STREAM_END);
+        ASSERT_TRUE(err == Z_OK || err == Z_STREAM_END);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
EXPECT_TRUE(err == Z_OK || err == Z_STREAM_END);
ASSERT_TRUE(err == Z_OK || err == Z_STREAM_END);

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 63.88889% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.42%. Comparing base (503b995) to head (8355ad9).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
test/test_deflate_deterministic.cc 62.85% 0 Missing and 13 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #2102       +/-   ##
============================================
+ Coverage    33.76%   81.42%   +47.66%     
============================================
  Files           75      158       +83     
  Lines         7581    13085     +5504     
  Branches      1470     3222     +1752     
============================================
+ Hits          2560    10655     +8095     
+ Misses        4751     1363     -3388     
- Partials       270     1067      +797     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from 9f1e911 to 0a58ddc Compare January 13, 2026 20:51
@nmoinvaz nmoinvaz changed the title The output of deflate is deterministic now if PREFIX3(stream) was reused after deflateReset Make deflate output deterministic if PREFIX3(stream) was reused after deflateReset Jan 13, 2026
@nmoinvaz nmoinvaz changed the title Make deflate output deterministic if PREFIX3(stream) was reused after deflateReset Make deflate output deterministic if PREFIX3(stream) is reused after deflateReset Jan 13, 2026
@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jan 14, 2026

The bug fix might be correct, but is this actually the root cause though?
It seems to me like something doesn't get reset correctly, otherwise why would anything be different?

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Jan 14, 2026

The bug fix might be correct, but is this actually the root cause though? It seems to me like something doesn't get reset correctly, otherwise why would anything be different?

I don't think this is the root cause... There is likely off-by-one error elsewhere (somewhere in the compare code) and this PR masks it.

@dzetkulic
Copy link
Copy Markdown

The bug fix might be correct, but is this actually the root cause though? It seems to me like something doesn't get reset correctly, otherwise why would anything be different?

I don't think this is the root cause... There is likely off-by-one error elsewhere (somewhere in the compare code) and this PR masks it.

trifectatechfoundation/zlib-rs#459 has some investigation on the root cause.

IIUC 8 byte value is read from the internal buffer: scan_start = zng_memread_8(scan); It has the tail of the data from the new deflate stream plus the garbage bytes which are left over from the previous compression session. These garbage bytes should not affect the result of the compression, but they do.

To make this deterministic, we could memset the internal buffer to any pattern (zeros maybe), but that it not a good solution in my opinion for two reasons:

  1. It will make reset more expensive
  2. It will replace garbage bytes with a different garbage bytes. Instead of trying to find a match with garbage, we would try to make a match for zeros. This makes no sense because those bytes should never matter and we should not match them no matter what they are.

@mtl1979
Copy link
Copy Markdown
Collaborator

mtl1979 commented Jan 14, 2026

IIUC 8 byte value is read from the internal buffer: scan_start = zng_memread_8(scan); It has the tail of the data from the new deflate stream plus the garbage bytes which are left over from the previous compression session. These garbage bytes should not affect the result of the compression, but they do.

So to avoid the garbage bytes, during reset scan needs to be advanced past the bytes from the old session?

@QrczakMK
Copy link
Copy Markdown
Contributor

QrczakMK commented Jan 14, 2026

The high level logic of the relevant part of LONGEST_MATCH() is to iterate through matches in a chain. Whenever the current match is longer than the best match so far, look at it closer, and then update the best match so far. At the end, the best match so far is the answer.

Looking closer involves checking whether the preliminary length exceeds the remaining length. If so, then garbage bytes must have artificially increased the preliminary length, so the actual matching length is only the remaining length. Furthermore, there is no need to iterate further through the chain, because the match is already as long as possible. The function exits early. (This differs from the original zlib, where the length is only trimmed at this stage, without an early exit.)

The problem is that, while comparing whether the current match is longer than the best match so far, the length is still a preliminary length, which may involve comparison of garbage bytes after the data. If a match logically reaches the end of the data, then it is unpredictable whether the preliminary length is equal to the remaining length, or greater.

This causes divergence of the behavior in case the best match so far reached the end of the data, while garbage bytes did not match then, so there was no early exit, and we now have another match which logically reaches the end of the data. Depending on garbage bytes, the preliminary length of the second match can be equal to or greater than the remaining length, and thus the new match can be considered either as good as or better than the best match so far. Only a strictly better match is looked at closer, and ultimately this affects the distance used in the answer.

After the fix, the early exit applies as soon as the match logically reaches the end of the data, not only when the preliminary length exceeds it, i.e. independently from garbage bytes. The first match which reaches the end of the data is already as long as possible, so we exit early.

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jan 14, 2026

@QrczakMK That is some real good detective work, I am impressed 💯
Given your explanation, I do agree that this might very well be the fix we need.

We'll have to do some testing of course, but it makes sense and I agree with that @dzetkulic that resetting those garbage bytes might not be needed and might even help us catch match regressions in the future.

@QrczakMK
Copy link
Copy Markdown
Contributor

As for testing, we use zlib-rs in many places in Google. It has been originally transcribed from zlib-ng and has the same issue. I tested the analogous fix to zlib-rs, and all our tests passed.

Copy link
Copy Markdown
Member

@Dead2 Dead2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my tests pass.
Benchmarking-wise, this has negligible difference. It does look like it is very slightly faster on average (I'd expect so from sometimes not having to look for another match at end of window) over dozens of gbench runs, but benchmarks are too close to really tell for sure.

# include "zlib-ng.h"
#endif

#include <stdio.h>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stdio.h, stdint.h, stdlib.h and string.h are already included from zbuild.h so doesn't need to be included here. string is not though.

Copy link
Copy Markdown
Collaborator

@mtl1979 mtl1979 Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string is C++ header, so it makes no sense to have it in zbuild.h which should only contain C headers.
C++ equivalent of string.h is cstring.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtl1979 I didn't mean that we need to add it to zbuild.h, I meant the ones that are already in zbuild.h should be removed from the test source, as they don't need to be parsed twice.
I don't know whether string itself is needed by this test or not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased, headers removed.

string is used for the output buffer.
Manual memory management is not needed, and the EXPECT_EQ macro can be used to compare values. If the values ​​don't match, gtest will print them.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtl1979 I didn't mean that we need to add it to zbuild.h, I meant the ones that are already in zbuild.h should be removed from the test source, as they don't need to be parsed twice.

I don't know whether string itself is needed by this test or not.

Most of the C++ headers are designed to be included before the equivalent C header as they include the C header after defining some preprocessor defines. For a lot of standard C headers the C++ equivalent name is prefixed with c and the extension is omitted.

@QrczakMK
Copy link
Copy Markdown
Contributor

QrczakMK commented Jan 14, 2026

compress_data() is likely more complicated than it deserves.

It evolved from a particular wrapper for streaming compression. For the curious: https://github.com/google/riegeli/blob/master/riegeli/zlib/zlib_writer.cc

Splitting input and output into buffers does not impact the outcome. For simplicity, it might use some appropriate string-to-string one-shot compressor, as long as the z_stream is being reused.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from 0a58ddc to 65ea811 Compare January 14, 2026 17:28
@nmoinvaz
Copy link
Copy Markdown
Member

Can we shorten the filename of the test.. maybe something like dtrmin or whatever?

@phprus
Copy link
Copy Markdown
Contributor Author

phprus commented Jan 14, 2026

Can we shorten the filename of the test.. maybe something like dtrmin or whatever?

Maybe test_deflate_deterministic.cc?

@QrczakMK
Copy link
Copy Markdown
Contributor

In trifectatechfoundation/zlib-rs#461 we might find simpler test data for triggering this.

@folkertdev
Copy link
Copy Markdown

I was able to manually reduce it to

    const DATA_A: &[u8] = b"\0AAAA\0AAAAAAAA";
    const DATA_B: &[u8] = &[1u8; DATA_A.len() + 1];

@phprus phprus force-pushed the deterministic-deflate-2100 branch from 65ea811 to b3e834a Compare January 15, 2026 16:52
@phprus
Copy link
Copy Markdown
Contributor Author

phprus commented Jan 15, 2026

Rebased.
Simpler test data was used.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from b3e834a to b0c5e22 Compare January 15, 2026 18:02
@phprus phprus force-pushed the deterministic-deflate-2100 branch from b0c5e22 to dad8c19 Compare January 15, 2026 20:46
Copy link
Copy Markdown
Member

@nmoinvaz nmoinvaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test_deflate_deterministic.cc?

That would be fine to me. Don't forget to update the filename in the .cc header comment.

@phprus phprus force-pushed the deterministic-deflate-2100 branch from dad8c19 to f4aa5bc Compare January 15, 2026 21:07
@phprus
Copy link
Copy Markdown
Contributor Author

phprus commented Jan 15, 2026

That would be fine to me. Don't forget to update the filename in the .cc header comment.

Renamed to test_deflate_deterministic.cc.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dad8c19 and f4aa5bc.

📒 Files selected for processing (3)
  • match_tpl.h
  • test/CMakeLists.txt
  • test/test_deflate_deterministic.cc
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:59.417Z
Learning: In zlib-ng PR `#2077`, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2046
File: deflate_quick.c:116-116
Timestamp: 2025-12-08T13:43:07.260Z
Learning: In zlib-ng, the compare256() function (called through the function table) returns at most 256, so match_len calculations of the form `compare256(...) + 2` are inherently bounded to a maximum of 258 (which equals STD_MAX_MATCH), making explicit clamping to STD_MAX_MATCH unnecessary after such calculations.
📚 Learning: 2025-11-15T12:43:50.353Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 2001
File: test/CMakeLists.txt:10-12
Timestamp: 2025-11-15T12:43:50.353Z
Learning: In zlib-ng, the TEST_STOCK_ZLIB CMake variable is intentionally not defined as a CMake option. It's meant to be set manually via command-line (e.g., -DTEST_STOCK_ZLIB=ON) when testing against stock zlib. The testing workflow for stock zlib requires undocumented manual steps and is considered experimental/incomplete.

Applied to files:

  • test/CMakeLists.txt
  • test/test_deflate_deterministic.cc
📚 Learning: 2026-01-12T23:00:59.417Z
Learnt from: nmoinvaz
Repo: zlib-ng/zlib-ng PR: 2077
File: test/test_crc32_copy.cc:15-29
Timestamp: 2026-01-12T23:00:59.417Z
Learning: In zlib-ng PR `#2077`, the test data in hash_test_strings_p.h (hash_tests array) was structured to exclude NULL buffer cases. NULL handling is performed at the API level in adler32.c and crc32.c, so test implementations in test_crc32_copy.cc and test_adler32_copy.cc do not need NULL guards for memcmp calls.

Applied to files:

  • test/CMakeLists.txt
  • test/test_deflate_deterministic.cc
📚 Learning: 2024-12-20T23:35:59.830Z
Learnt from: Dead2
Repo: zlib-ng/zlib-ng PR: 1830
File: arch/generic/generic_functions.h:39-53
Timestamp: 2024-12-20T23:35:59.830Z
Learning: The `longest_match_unaligned_*` functions are templated using the LONGEST_MATCH macro in match_tpl.h, so their implementations are generated rather than explicitly defined.

Applied to files:

  • match_tpl.h
📚 Learning: 2025-01-06T07:20:02.028Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:0-0
Timestamp: 2025-01-06T07:20:02.028Z
Learning: Business requirement: Additional memory allocations should be avoided, stack allocations might be used instead.

Applied to files:

  • test/test_deflate_deterministic.cc
📚 Learning: 2024-12-27T10:06:16.184Z
Learnt from: samrussell
Repo: zlib-ng/zlib-ng PR: 1837
File: arch/generic/crc32_braid_c.c:215-217
Timestamp: 2024-12-27T10:06:16.184Z
Learning: There's a business requirement to use stack allocation instead of heap for the Chorba buffer.

Applied to files:

  • test/test_deflate_deterministic.cc
🪛 Cppcheck (2.19.0)
test/test_deflate_deterministic.cc

[information] 3-3: Include file

(missingInclude)


[information] 7-7: Include file

(missingInclude)


[information] 10-10: Include file

(missingIncludeSystem)


[information] 12-12: Include file

(missingInclude)


[information] 16-16: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingInclude)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
  • GitHub Check: EL10 Clang S390X DFLTCC MSAN
🔇 Additional comments (3)
match_tpl.h (1)

161-165: Correct fix for nondeterministic behavior.

The boundary condition change from > to >= is the right approach. When len == lookahead, the match exactly reaches the end of available input data. Previously, continuing the search could read garbage bytes past the logical boundary (via 8-byte reads like scan_start = zng_memread_8(scan)), causing nondeterministic match selection when the stream is reused after deflateReset. By returning immediately when len >= lookahead, subsequent comparisons that would read uninitialized memory are avoided.

The comment update accurately reflects the new semantics.

test/test_deflate_deterministic.cc (1)

39-72: Test logic is correct and well-structured.

The test correctly validates the deterministic behavior fix by:

  1. Compressing data_a with a fresh stream
  2. Compressing larger data_b to populate the internal window with different content
  3. Resetting the stream and compressing data_a again
  4. Asserting the outputs match

This reproduces the exact scenario from issue #2100 where garbage bytes from a previous session could influence match selection.

test/CMakeLists.txt (1)

181-191: Correct placement and consistent style.

The new test is appropriately placed within the NOT TEST_STOCK_ZLIB block since it relies on zlib-ng internal behavior (the specific LONGEST_MATCH implementation). Comment alignment and style are consistent with surrounding entries.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

…deflateReset

Co-authored-by: Marcin Kowalczyk <QrczakMK@gmail.com>
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
@phprus phprus force-pushed the deterministic-deflate-2100 branch from f4aa5bc to 8355ad9 Compare January 19, 2026 14:37
Copy link
Copy Markdown
Collaborator

@mtl1979 mtl1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Dead2 Dead2 merged commit fb9a6ce into zlib-ng:develop Jan 20, 2026
173 checks passed
@folkertdev
Copy link
Copy Markdown

Do you have any indication on when this fix will make it into a release?

@Dead2
Copy link
Copy Markdown
Member

Dead2 commented Jan 26, 2026

@folkertdev I can't promise anything, but I am hoping to backport this and a few other fixes and make a release later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants