Skip to content

Conversation

@meshcow
Copy link

@meshcow meshcow commented Dec 1, 2025

Motivation

In workarounds section of CVE-2025-12183 it is suggested in long run to use safe decompressor instead of vulnerable fast one. Also, authors of the original Java Lz4 code outlined in javadoc that native fast decompressor is SLOWER than safe one.
The LZ4BlockInputStream accepts fast decompressor only at the moment.

Change

Added builder for LZ4BlockInputStream which accepts both fast and safe decompressor (prefers fast one).

Copy link

@Marcono1234 Marcono1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully these comments are useful. Note that I am not a project member, so feel free to consider my comments only as suggestions.

@meshcow meshcow force-pushed the LZ4BlockInputStream-with-fast-or-safe-decompressor branch 4 times, most recently from 1d41aae to 51de1cd Compare December 1, 2025 16:19
@yawkat
Copy link
Owner

yawkat commented Dec 1, 2025

How urgent is this change for you? I was going to wait with the next minor release (1.10) until the unsafe implementations are tested enough to be re-enabled, but that may take a few more weeks since I'm very busy atm and also have vacations coming up.

If this change is important, I can get 1.10 out earlier for this change and wait with the unsafe stuff for 1.11.

@rigglemania
Copy link

We'd really appreciate this being merged soon if possible.

Currently we were using LZ4Factory.nativeInstance() and fastCompressor and fastDecompressor methods. And the results made me very nervous at first.

org.lz4 (1.8.0)
Average Write Time (ms): 210
Average Read Time (ms): 36

at.yawk.lz4 (1.8.1)
Average Write Time (ms): 211
Average Read Time (ms): 214

at.yawk.lz4 (1.9.0)
Average Write Time (ms): 224
Average Read Time (ms): 253

However, it turns out safeDecompressor (still using LZ4Factory.nativeInstance) is indeed faster than fastDecompressor and did seem to restore performance to an acceptable level

Average Write Time (ms): 210
Average Read Time (ms): 32

So this covers the direct factory methods, but still leaves open the block input stream as this issue mentions, and which we do use in our code.

@meshcow meshcow force-pushed the LZ4BlockInputStream-with-fast-or-safe-decompressor branch from 51de1cd to 1dd9c2d Compare December 2, 2025 09:03
@yawkat
Copy link
Owner

yawkat commented Dec 2, 2025

Okay, then I will get out a release with this before I leave for vacation.

Also: Please avoid force pushing in the future. It makes changes harder to review incrementally. You can also apply changes from the GitHub UI in individual commits rather than manually. When I merge this, I will squash all changes into a single commit anyway.

- added builder for LZ4BlockInputStream which accepts both fast and safe decompressor (prefers fast one)
@meshcow meshcow force-pushed the LZ4BlockInputStream-with-fast-or-safe-decompressor branch from 1dd9c2d to 5d8a820 Compare December 2, 2025 09:11
@meshcow
Copy link
Author

meshcow commented Dec 2, 2025

Thank you so much for your time and effort! We will be waiting for the changes to be released.
And I am sorry, will not force-push anymore.

@yawkat yawkat merged commit 999c401 into yawkat:main Dec 2, 2025
1 check passed
@yawkat
Copy link
Owner

yawkat commented Dec 2, 2025

Thanks for the contribution! I will make some more changes and then release 1.10 shortly.

@yawkat
Copy link
Owner

yawkat commented Dec 2, 2025

@meshcow v1.10.0 is out now, I can already see it on central.

@meshcow
Copy link
Author

meshcow commented Dec 2, 2025

@meshcow v1.10.0 is out now, I can already see it on central.
Thank you so much!

@Marcono1234
Copy link

Marcono1234 commented Dec 2, 2025

Should Builder#withChecksum maybe rather take a Supplier<Checksum> or should its Javadoc maybe have a warning? The problem is that Checksum is stateful, so if the same builder is used to create multiple LZ4BlockInputStream instances, then they might affect each other during checksum calculation.

(Sharing the same decompressor for multiple LZ4BlockInputStream instances might not be a problem because the implementations are stateless (?), but I am not completely sure.)

Sorry, I just noticed this now.

@yawkat
Copy link
Owner

yawkat commented Dec 2, 2025

Yea I guess now that the InputStream is passed to the build method, there's nothing preventing you from reusing the builder… I'll add a comment

dongjoon-hyun added a commit to apache/spark that referenced this pull request Dec 5, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `lz4-java` to 1.10.0 and exclude the legacy groupID version.

### Why are the changes needed?

Since `lz4-java` changed its repository, we had better depend on the live repository for future maintenance.
- https://github.com/lz4/lz4-java (Archived with the last release on 2021-06-19)
   - https://github.com/lz4/lz4-java/releases/tag/1.8.0 (2021-06-19)
- https://github.com/yawkat/lz4-java (New Fork)
   - https://github.com/yawkat/lz4-java/releases/tag/v1.9.0
       - Update lz4 to v1.10.0
   - https://github.com/yawkat/lz4-java/releases/tag/v1.10.0
       - yawkat/lz4-java#3

### Does this PR introduce _any_ user-facing change?

No Spark behavior change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53327 from dongjoon-hyun/SPARK-54597.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants