PARQUET-1142: Add alternatives to Hadoop classes in the API#429
PARQUET-1142: Add alternatives to Hadoop classes in the API#429rdblue wants to merge 6 commits intoapache:masterfrom
Conversation
|
I recommend reviewing this by commit. It touches a lot of files, but the changes aren't that large. |
|
@zivanfi, can you take a look? |
4293a1e to
f8697f6
Compare
zivanfi
left a comment
There was a problem hiding this comment.
@rdblue, I don't think I'm the best person to review this as I'm not too familiar with the affected code. I read through the PR nevertheless and in general it seemed good to me, but I'm unsure about which classes/methods are considered to be part of the public API, so I raised two questions regarding that.
| * @param maxPaddingSize the maximum padding | ||
| * @throws IOException if the file can not be created | ||
| */ | ||
| public ParquetFileWriter(Configuration configuration, MessageType schema, |
There was a problem hiding this comment.
Is removing this public method not a breaking change?
There was a problem hiding this comment.
Yes, good catch. I'll add the method back.
| <exclude>org/apache/parquet/io/ColumnIOFactory/**</exclude> <!-- removed non-API class and methods--> | ||
| <exclude>org/apache/parquet/hadoop/codec/SnappyCompressor</exclude> <!-- added synchronized modifier --> | ||
| <exclude>org/apache/parquet/hadoop/codec/SnappyCompressor</exclude> <!-- added synchronized modifier --> | ||
| <exclude>org/apache/parquet/bytes/BytesInput</exclude> <!-- moved to parquet-common --> |
There was a problem hiding this comment.
Are these not breaking changes?
There was a problem hiding this comment.
We haven't considered moving classes into Jars further up the dependency chain to be breaking so far because you must have parquet-common to use parquet-encoding, and there is no binary incompatibility introduced.
There was a problem hiding this comment.
My bad. I was worried about moving the CompressionCodecName and CompressionCodecNotSupportedException classes, because I hadn't realized that they were only moved to a different artifact while remaining in the same Java package.
|
@zivanfi, thanks for taking a look. No problem with not being too familiar here, I just wanted you to have a first pass to catch things like binary compatibility that you guys care about the most. Plus, it will make you more familiar with these areas. |
zivanfi
left a comment
There was a problem hiding this comment.
LGTM, but the automated tests fail (haven't looked into the reason).
| <exclude>org/apache/parquet/io/ColumnIOFactory/**</exclude> <!-- removed non-API class and methods--> | ||
| <exclude>org/apache/parquet/hadoop/codec/SnappyCompressor</exclude> <!-- added synchronized modifier --> | ||
| <exclude>org/apache/parquet/hadoop/codec/SnappyCompressor</exclude> <!-- added synchronized modifier --> | ||
| <exclude>org/apache/parquet/bytes/BytesInput</exclude> <!-- moved to parquet-common --> |
There was a problem hiding this comment.
My bad. I was worried about moving the CompressionCodecName and CompressionCodecNotSupportedException classes, because I hadn't realized that they were only moved to a different artifact while remaining in the same Java package.
7e3b061 to
81874ae
Compare
gszadovszky
left a comment
There was a problem hiding this comment.
Only had some small findings. Otherwise, it looks good to me.
| // for hadoop 1 to present. copying it avoids loading DFSConfigKeys. | ||
| private static final int DFS_BUFFER_SIZE_DEFAULT = 4096; | ||
|
|
||
| // visible for testing |
There was a problem hiding this comment.
I'll move the comment.
| } | ||
|
|
||
| public static Set<String> getBlockFileSystems() { | ||
| return BLOCK_FS_SCHEMES; |
There was a problem hiding this comment.
Do we really want to expose a mutable set here?
There was a problem hiding this comment.
This is what's visible for testing. Tests need to add "file". I'm fine with it being mutable because it is so unlikely that callers will change it; plus they're responsible for what happens if they do. The files will be correct either way, they will just get padding in some cases.
There was a problem hiding this comment.
I'm good with having it for testing only.
There was a problem hiding this comment.
Thanks for moving the comment.
| import java.io.IOException; | ||
|
|
||
| class MockInputStream extends ByteArrayInputStream | ||
| class MockHadoopInputStream extends ByteArrayInputStream |
There was a problem hiding this comment.
If we already have MockInputStream in parquet-common why do we need this one as well?
There was a problem hiding this comment.
This one implements Hadoop interfaces, Seekable and PositionedReadable, that are required for backing a FSDataInputStream. Those aren't available to MockInputStream. This could inherit from the other, but that would require moving classes around and exposing the test-jar, which doesn't seem worth it when we're only duplicating ~20 lines that may need to be changed independently in the future anyway.
There was a problem hiding this comment.
My bad, didn't realise the interfaces are in parquet-hadoop. I agree that exposing test-jar etc. would not worth it.
This adds CompressionCodecFactory to parquet-common. This addition required moving org.apache.parquet.bytes into parquet-common from parquet-encoding. This also adds HadoopCodecs to parquet-hadoop, which implements the API. Moving classes between artifacts required adding exclusions to semver checks.
These are the write path's equivalent for InputFile and SeekableInputStream. The new classes capture what is needed for Parquet to write data so that Hadoop classes can be removed from the API in 2.0. This commit also adds convenience classes for implementing Parquet compatible streams: DelegatingPositionOutputStream and DelegatingSeekableInputStream.
81874ae to
35eddd7
Compare
gszadovszky
left a comment
There was a problem hiding this comment.
LGTM, thanks for the changes and for making the modifications clear.
| } | ||
|
|
||
| public static Set<String> getBlockFileSystems() { | ||
| return BLOCK_FS_SCHEMES; |
There was a problem hiding this comment.
Thanks for moving the comment.
|
@julienledem, would you like to have a look? Otherwise I'll merge this tomorrow. |
|
This is great - Will a new version with this PR be released soon? |
|
@shanielh, yes. We are planning a 1.10 release right now. Hopefully it will be out soon. |
This updates the read and write paths to avoid using Hadoop classes where possible.
CompressionCodecFactoryOutputFileandPositionOutputStreamDelegatingSeekableInputStreamandDelegatingPositionOutputStreamParquetReadOptionsto avoid passing options withConfiguration