PARQUET-2336: Add caching key to CodecFactory #1134
Conversation
```diff
  }
- CompressionCodec codec = CODEC_BY_NAME.get(codecClassName);
+ String codecCacheKey = this.cacheKey(codecName);
+ CompressionCodec codec = CODEC_BY_NAME.get(codecCacheKey);
```
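A minimal sketch of what a level-aware cache key could look like. The `cacheKey` helper name comes from the diff above; the key format, the `String`-valued map, and the class are illustrative assumptions, not the actual PR implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: build a cache key that includes the compression level,
// so codecs configured with different levels get separate cache entries.
class CodecCacheSketch {
  // simplified stand-in for the static CODEC_BY_NAME map (values are Strings here)
  static final Map<String, String> CODEC_BY_NAME = new ConcurrentHashMap<>();

  // assumed shape of the cacheKey helper: codec name plus an optional level suffix
  static String cacheKey(String codecName, String level) {
    return level == null ? codecName : codecName + ":" + level;
  }

  public static void main(String[] args) {
    // two writers using different zstd levels now hit distinct cache entries
    CODEC_BY_NAME.put(cacheKey("zstd", "3"), "codec-level-3");
    CODEC_BY_NAME.put(cacheKey("zstd", "19"), "codec-level-19");
    System.out.println(CODEC_BY_NAME.get(cacheKey("zstd", "3")));  // codec-level-3
    System.out.println(CODEC_BY_NAME.get(cacheKey("zstd", "19"))); // codec-level-19
  }
}
```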
Since CODEC_BY_NAME is protected, I think this could break something that is relying on the cache, although I'm not sure why someone would access it directly. Maybe that visibility is an accident?
If that is a concern, we can cache the old key (w/o level) as well.
Personally, I'm not too worried about this, I don't see anyone doing this. At least nobody in the Apache org: https://github.com/search?q=org%3Aapache%20CODEC_BY_NAME&type=code :)
```diff
  if (codec != null) {
    return codec;
  }
+
```
Nit: unnecessary whitespace change.
**rdblue** left a comment:
Overall this looks good. It changes the cache, but I can't think of why anyone would use it directly.
@Fokko, do we have a long-term plan for getting off of Hadoop codecs? It seems like that is a good idea. I think the main blocker is that we are still using these codecs. Otherwise we would be able to remove Hadoop dependencies fairly easily.
I have added the aircompressor library when I added support for the LZ4_RAW codec. Not sure if this makes it easier. @rdblue
Thanks @wgtmac for jumping in here
I have a PR that I need to revisit: apache/iceberg#7369. It requires some awkward changes to make it work. One issue with the aircompressor codec is that it doesn't provide Brotli due to licensing issues.
Just curious, is Brotli widely adopted? It seems that it does not have an official Java encoder implementation.
@wgtmac I think it is quite arcane, maybe we can make the
```java
level = configuration.get("parquet.compression.codec.zstd.level");
if (level == null) {
  // keep "io.compression.codec.zstd.level" for backwards compatibility
  level = configuration.get("io.compression.codec.zstd.level");
}
```
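The fallback pattern being discussed can be sketched as follows. The `Map`-based `conf` is a simplified stand-in for Hadoop's `Configuration`, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the backwards-compatible lookup: try the new property first,
// then fall back to the deprecated key if the new one is unset.
class ZstdLevelLookup {
  static String getLevel(Map<String, String> conf) {
    String level = conf.get("parquet.compression.codec.zstd.level");
    if (level == null) {
      // deprecated key, kept only for backwards compatibility
      level = conf.get("io.compression.codec.zstd.level");
    }
    return level;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("io.compression.codec.zstd.level", "7");
    System.out.println(getLevel(conf)); // only the deprecated key is set: 7
    conf.put("parquet.compression.codec.zstd.level", "3");
    System.out.println(getLevel(conf)); // the new key takes precedence: 3
  }
}
```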
Do we need to cache the old config levels? The old key has already been deprecated and currently only the new config is used.
Thanks, let's remove it then 👍🏻
Thanks everyone for the reviews!
`CODEC_BY_NAME` is static and may be used across different configurations. Once a codec is initialized, it will be re-used no matter what the configuration is. This is a problem when different compression levels are used.

https://github.com/apache/parquet-mr/blob/515734c373f69b5250e8b63eb3d1c973da893b63/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L45-L46

Therefore we need to cache per compression level as well.
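The problem can be illustrated with a toy version of the static cache. The class, the `Integer`-valued map, and the method are illustrative, not the actual `CodecFactory` code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of the bug: a static cache keyed by codec name only
// ignores the compression level, so the first caller's level "wins" for everyone.
class StaticCodecCacheBug {
  static final Map<String, Integer> CODEC_BY_NAME = new ConcurrentHashMap<>();

  // caches by name alone, as before this PR
  static int getCodecLevel(String name, int requestedLevel) {
    return CODEC_BY_NAME.computeIfAbsent(name, n -> requestedLevel);
  }

  public static void main(String[] args) {
    int first = getCodecLevel("zstd", 3);   // first caller caches level 3
    int second = getCodecLevel("zstd", 19); // second caller silently gets level 3 back
    System.out.println(first);  // 3
    System.out.println(second); // 3, even though level 19 was requested
  }
}
```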