
Conversation

dbtsai (Member) commented Dec 2, 2025

What changes were proposed in this pull request?

In recent LZ4 versions, safeDecompressor has become highly optimized and can be as fast as, or sometimes even faster than, fastDecompressor, so it makes sense to switch to safeDecompressor.
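
For context, a minimal sketch (not Spark's actual code path) contrasting the two lz4-java decompressors: safeDecompressor validates that the output fits into the destination buffer, while fastDecompressor trusts the caller-supplied decompressed length.

import net.jpountz.lz4.LZ4Factory

object Lz4DecompressorSketch {
  def main(args: Array[String]): Unit = {
    val factory = LZ4Factory.fastestInstance()
    val original = "payload to round-trip through lz4".getBytes("UTF-8")
    val compressed = factory.fastCompressor().compress(original)

    // Bounds-checked decompression: returns the actual decompressed length.
    val dest = new Array[Byte](original.length)
    val n = factory.safeDecompressor().decompress(compressed, dest)

    // Unchecked decompression: the caller must already know the exact
    // original length and pass it in.
    val restored = factory.fastDecompressor().decompress(compressed, original.length)

    assert(n == original.length && restored.sameElements(original))
  }
}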

Why are the changes needed?

It is recommended to switch to .safeDecompressor(), which is not vulnerable and provides better performance, per https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Dec 2, 2025
@dbtsai dbtsai changed the title [SPARK-54571] [Core] Use LZ4 safeDecompressor [SPARK-54571][CORE] Use LZ4 safeDecompressor Dec 2, 2025
dbtsai (Member, Author) commented Dec 3, 2025

It's more involved than I thought, as LZ4BlockInputStream doesn't take safeDecompressor. I will take a deeper look tomorrow.
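
For reference, a rough sketch of the constraint (not Spark's exact code): in lz4-java 1.8.x, every LZ4BlockInputStream constructor takes an LZ4FastDecompressor, so there is no direct way to plug in a safeDecompressor on the streaming path.

import java.io.InputStream
import net.jpountz.lz4.{LZ4BlockInputStream, LZ4Factory}

object BlockStreamSketch {
  // The 1.8.x constructors only accept LZ4FastDecompressor; there is no
  // overload for LZ4SafeDecompressor.
  def openLz4BlockStream(in: InputStream): LZ4BlockInputStream = {
    val decompressor = LZ4Factory.fastestInstance().fastDecompressor()
    new LZ4BlockInputStream(in, decompressor)
  }
}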

dongjoon-hyun (Member) left a comment

Ya, please make the CI happy. Thank you, @dbtsai .

@dbtsai dbtsai changed the title [SPARK-54571][CORE] Use LZ4 safeDecompressor [WIP][SPARK-54571][CORE] Use LZ4 safeDecompressor Dec 3, 2025
LuciferYang previously approved these changes Dec 3, 2025

LuciferYang (Contributor) commented

It seems that the fix for CVE-2025-12183 wasn't implemented until version 1.8.1, but Spark is still using version 1.8.0.

https://github.com/yawkat/lz4-java/releases

@LuciferYang LuciferYang dismissed their stale review December 3, 2025 08:05

Spark still uses 1.8.0

yawkat commented Dec 3, 2025

Note that LZ4BlockInputStream does not support safeDecompressor in lz4-java 1.8.1. If you upgrade to that version, it will still work and be secure, but performance will be much worse than in 1.8.0.

lz4-java 1.10.0 introduces a new builder for LZ4BlockInputStream that accepts a safeDecompressor.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft December 3, 2025 17:38
dongjoon-hyun (Member) left a comment

To reviewers, just for the record, LZ4 1.10.0 is not published to Maven Central yet.


yawkat commented Dec 3, 2025

It is published, but only under the new group id

SteNicholas (Member) commented

@yawkat, under which group id is 1.10.0 published?

LuciferYang (Contributor) commented

We need to change the dependency to use:

<dependency>
    <groupId>at.yawk.lz4</groupId>
    <artifactId>lz4-java</artifactId>
    <version>1.10.0</version>
</dependency>

https://repo.maven.apache.org/maven2/at/yawk/lz4/lz4-java/
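
For anyone testing locally with sbt rather than Maven, the equivalent coordinates would be (Spark itself manages this through its Maven pom):

// sbt equivalent of the Maven coordinates above
libraryDependencies += "at.yawk.lz4" % "lz4-java" % "1.10.0"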


dongjoon-hyun (Member) commented

Thank you for the updated info, @yawkat , @LuciferYang .

Could you update this PR, @dbtsai ?

dongjoon-hyun (Member) commented

To all, in order to help this PR, I made an independent PR for the dependency upgrade.

dongjoon-hyun (Member) commented

We upgraded to 1.10.0 a few minutes ago. Could you rebase this PR onto the master branch once more, @dbtsai ?

yawkat commented Dec 5, 2025

I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.

yawkat commented Dec 5, 2025

CVE-2025-66566 has been published and fixed in 1.10.1. I suggest you move to that version, though Cloudflare seems to be having some trouble that breaks Maven Central at the moment.

dongjoon-hyun (Member) commented

I recommend you wait a few hours before releasing this. Another (smaller, unrelated) CVE has been found in lz4-java.

Just FYI, we have no intention to hurry this, @yawkat . To be safe, this will be tested in the master branch for the next few months, in the Apache Spark 4.2.0 timeframe. Currently, we are voting on Apache Spark 4.1.0, so it's too late for this kind of risky dependency change.

That's the main reason why the LZ4 1.10.0 PR is only in the master branch. The dependency change and code change only lay the groundwork for further testing across various workloads.

dongjoon-hyun (Member) commented

Gentle ping once more, @dbtsai .

dongjoon-hyun (Member) commented

Gentle ping, @dbtsai .

mridulm (Contributor) commented Dec 10, 2025

The PR description mentions that safeDecompressor is fast enough.
While the CVE is a good enough reason to switch, can we link benchmark results in the PR description to support this?

yawkat commented Dec 10, 2025

I am not aware of Spark benchmarks, but as of 1.8.1, safeDecompressor is substantially faster than fastDecompressor. In earlier versions, the difference was minor.

SteNicholas (Member) commented Dec 11, 2025

@yawkat, how is the performance in 1.10.0? @dongjoon-hyun has already bumped to 1.10.0.

yawkat commented Dec 11, 2025

Performance between 1.8.1 and 1.10.1 has not changed substantially.

mridulm (Contributor) commented Dec 11, 2025

@yawkat I am not questioning whether it is faster or not; I am sure it is, or else @dbtsai wouldn't be raising this PR :)
Given the PR description, adding a reference that makes this clear would help.

pan3793 (Member) commented Dec 12, 2025

@mridulm I found a perf report at yawkat/lz4-java#3 (comment), but it does not include the underlying data.

pan3793 (Member) commented Dec 12, 2025

@mridulm @yawkat @dbtsai @dongjoon-hyun @SteNicholas I created #53453 to add an lz4 benchmark based on TPCDS catalog_sales.dat (SF1), which is about 283 MB. My test results show that lz4-java 1.10.1 is about 10~15% slower than 1.8.0 on lz4 compression, and about 5% slower on lz4 decompression, even after migrating to the suggested safeDecompressor (#53454)
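
For anyone who wants to reproduce a similar comparison outside Spark, a minimal standalone timing sketch (this is not the benchmark in #53453; the input path and iteration count here are arbitrary placeholders):

import java.nio.file.{Files, Paths}
import net.jpountz.lz4.LZ4Factory

object Lz4DecompressTiming {
  def main(args: Array[String]): Unit = {
    // args(0): path to any reasonably large local file (placeholder input).
    val data = Files.readAllBytes(Paths.get(args(0)))
    val factory = LZ4Factory.fastestInstance()
    val compressed = factory.fastCompressor().compress(data)
    val dest = new Array[Byte](data.length)

    // Crude wall-clock timing averaged over a fixed number of iterations.
    def time(label: String)(body: => Unit): Unit = {
      val iters = 20
      val start = System.nanoTime()
      var i = 0
      while (i < iters) { body; i += 1 }
      println(s"$label: ${(System.nanoTime() - start) / 1e6 / iters} ms/iteration")
    }

    time("safeDecompressor") { factory.safeDecompressor().decompress(compressed, dest) }
    time("fastDecompressor") { factory.fastDecompressor().decompress(compressed, dest) }
  }
}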

yawkat commented Dec 12, 2025

The underlying lz4 library was updated in 1.9.0, so a performance difference is possible.

Dzeri96 commented Dec 19, 2025

While this is being merged, would you consider "setting spark.io.compression.codec to something other than lz4" as good advice for users potentially affected by the vulnerabilities?

As far as I can see, only this config key has LZ4 as its default. Also, it seems spark.eventLog.compression.codec is its own setting, even though the documentation for the former config key mentions it being used for writing event logs. I guess this is historical and an error in the docs?
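
For concreteness, a minimal sketch of the workaround being asked about, assuming zstd is acceptable in the deployment (whether this is actually good advice is the open question above). In practice these settings would usually go in spark-defaults.conf or be passed via --conf; the SparkSession builder is used here only for brevity.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avoid-lz4")
  // Internal data (shuffle outputs, spills, broadcasts): use zstd instead of the lz4 default.
  .config("spark.io.compression.codec", "zstd")
  // Event logs have their own codec setting, as noted in the question above.
  .config("spark.eventLog.compression.codec", "zstd")
  .getOrCreate()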
