PARQUET-1142: Add alternatives to Hadoop classes in the API #429

Closed
rdblue wants to merge 6 commits into apache:master from rdblue:PARQUET-1142-add-hadoop-alternatives

Conversation

@rdblue
Contributor

@rdblue rdblue commented Oct 17, 2017

This updates the read and write paths to avoid using Hadoop classes where possible.

  • Adds a generic compression interface, CompressionCodecFactory
  • Adds OutputFile and PositionOutputStream
  • Adds classes to help implementations wrap input and output streams: DelegatingSeekableInputStream and DelegatingPositionOutputStream
  • Adds ParquetReadOptions to avoid passing options with Configuration
  • Updates the read and write APIs to use new abstractions instead of Hadoop
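
The write-side abstractions listed above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the actual Parquet classes: the class names follow the PR description, but the method signatures and the position-tracking behavior are assumptions for the sake of the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical minimal version of the PositionOutputStream abstraction:
// an OutputStream that can report the current byte position, which the
// writer needs in order to record offsets in file metadata.
abstract class PositionOutputStream extends OutputStream {
    public abstract long getPos() throws IOException;
}

// Convenience wrapper in the spirit of DelegatingPositionOutputStream:
// wraps any plain OutputStream and tracks the position itself.
class DelegatingPositionOutputStream extends PositionOutputStream {
    private final OutputStream delegate;
    private long pos = 0;

    DelegatingPositionOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public long getPos() {
        return pos;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        pos += 1;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        pos += len;
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DelegatingPositionOutputStream out = new DelegatingPositionOutputStream(buffer);
        out.write(new byte[] {1, 2, 3}, 0, 3);
        out.write(4);
        System.out.println(out.getPos()); // 4
    }
}
```

The point of the delegating classes is exactly this: file-system implementations only need to supply an ordinary stream, and the wrapper provides the position bookkeeping Parquet requires.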

@rdblue
Contributor Author

rdblue commented Oct 17, 2017

I recommend reviewing this by commit. It touches a lot of files, but the changes aren't that large.

@rdblue
Contributor Author

rdblue commented Oct 17, 2017

@zivanfi, can you take a look?

@rdblue force-pushed the PARQUET-1142-add-hadoop-alternatives branch from 4293a1e to f8697f6 on October 18, 2017
Contributor

@zivanfi left a comment

@rdblue, I don't think I'm the best person to review this as I'm not too familiar with the affected code. I read through the PR nevertheless and in general it seemed good to me, but I'm unsure about which classes/methods are considered to be part of the public API, so I raised two questions regarding that.

* @param maxPaddingSize the maximum padding
* @throws IOException if the file can not be created
*/
public ParquetFileWriter(Configuration configuration, MessageType schema,
Contributor

Is removing this public method not a breaking change?

Contributor Author

Yes, good catch. I'll add the method back.

Contributor Author

Fixed.

<exclude>org/apache/parquet/io/ColumnIOFactory/**</exclude> <!-- removed non-API class and methods -->
<exclude>org/apache/parquet/hadoop/codec/SnappyCompressor</exclude> <!-- added synchronized modifier -->
<exclude>org/apache/parquet/bytes/BytesInput</exclude> <!-- moved to parquet-common -->
Contributor

Are these not breaking changes?

Contributor Author

We haven't considered moving classes into Jars further up the dependency chain to be breaking so far because you must have parquet-common to use parquet-encoding, and there is no binary incompatibility introduced.

Contributor

My bad. I was worried about moving the CompressionCodecName and CompressionCodecNotSupportedException classes, because I hadn't realized that they were only moved to a different artifact while remaining in the same Java package.

@rdblue
Contributor Author

rdblue commented Oct 20, 2017

@zivanfi, thanks for taking a look. No problem with not being too familiar here; I just wanted you to take a first pass to catch things like binary incompatibilities, which you guys care about the most. Plus, it will make you more familiar with these areas.

Contributor

@zivanfi left a comment

LGTM, but the automated tests fail (I haven't looked into the reason).

@rdblue force-pushed the PARQUET-1142-add-hadoop-alternatives branch from 7e3b061 to 81874ae on November 15, 2017
Contributor

@gszadovszky left a comment

Only had some small findings. Otherwise, it looks good to me.

// for hadoop 1 to present. copying it avoids loading DFSConfigKeys.
private static final int DFS_BUFFER_SIZE_DEFAULT = 4096;

// visible for testing
Contributor

No, it is not :)

Contributor Author

I'll move the comment.

Contributor Author

Fixed.

}

public static Set<String> getBlockFileSystems() {
return BLOCK_FS_SCHEMES;
Contributor

Do we really want to expose a mutable set here?

Contributor Author

This is what's visible for testing. Tests need to add "file". I'm fine with it being mutable because it is so unlikely that callers will change it; plus they're responsible for what happens if they do. The files will be correct either way, they will just get padding in some cases.

Contributor

I'm good with having it for testing only.
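
For contrast, the trade-off discussed in this thread can be sketched in generic Java. This is an illustrative example, not the PR's actual code; the class name, the scheme values, and the second accessor are all hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the mutable-vs-unmodifiable accessor trade-off.
class FileSystems {
    private static final Set<String> BLOCK_FS_SCHEMES = new HashSet<>();
    static {
        // Hypothetical initial contents.
        BLOCK_FS_SCHEMES.add("hdfs");
        BLOCK_FS_SCHEMES.add("webhdfs");
    }

    // The approach taken in the PR: return the live set, so tests can
    // register extra schemes such as "file".
    static Set<String> getBlockFileSystems() {
        return BLOCK_FS_SCHEMES;
    }

    // Defensive alternative: callers get a read-only view and any attempt
    // to mutate it throws UnsupportedOperationException.
    static Set<String> getBlockFileSystemsView() {
        return Collections.unmodifiableSet(BLOCK_FS_SCHEMES);
    }
}
```

The unmodifiable view prevents accidental mutation, but it would also block the test usage described above, which is why returning the live set is defensible when misuse is unlikely and harmless.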

Contributor

Thanks for moving the comment.

import java.io.IOException;

class MockInputStream extends ByteArrayInputStream
class MockHadoopInputStream extends ByteArrayInputStream
Contributor

If we already have MockInputStream in parquet-common why do we need this one as well?

Contributor Author

This one implements Hadoop interfaces, Seekable and PositionedReadable, that are required for backing a FSDataInputStream. Those aren't available to MockInputStream. This could inherit from the other, but that would require moving classes around and exposing the test-jar, which doesn't seem worth it when we're only duplicating ~20 lines that may need to be changed independently in the future anyway.
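
The shape of such a mock is roughly the following. This is a stdlib-only sketch: the Seekable interface here is a stub standing in for Hadoop's (the real Hadoop interfaces have more methods and live in parquet-hadoop's dependencies), and the implementation details are assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Stub standing in for Hadoop's Seekable interface, so this sketch
// compiles without a Hadoop dependency.
interface Seekable {
    void seek(long newPos) throws IOException;
    long getPos() throws IOException;
}

// Rough shape of a mock stream like the one discussed above.
// ByteArrayInputStream already exposes a protected 'pos' field,
// so seeking is essentially a one-liner.
class MockHadoopInputStream extends ByteArrayInputStream implements Seekable {
    MockHadoopInputStream(byte[] data) {
        super(data);
    }

    @Override
    public void seek(long newPos) {
        this.pos = (int) newPos;
    }

    @Override
    public long getPos() {
        return this.pos;
    }
}
```

This illustrates why the duplication is cheap: the whole mock is a thin adapter over ByteArrayInputStream, so keeping a second copy beats exposing a test-jar across modules.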

Contributor

My bad, I didn't realise the interfaces are in parquet-hadoop. I agree that exposing the test-jar etc. would not be worth it.

This adds CompressionCodecFactory to parquet-common. This addition
required moving org.apache.parquet.bytes into parquet-common from
parquet-encoding.

This also adds HadoopCodecs to parquet-hadoop, which implements the API.

Moving classes between artifacts required adding exclusions to semver
checks.

These are the write path's equivalents of InputFile and
SeekableInputStream. The new classes capture what is needed for Parquet
to write data so that Hadoop classes can be removed from the API in 2.0.

This commit also adds convenience classes for implementing Parquet
compatible streams: DelegatingPositionOutputStream and
DelegatingSeekableInputStream.
@rdblue force-pushed the PARQUET-1142-add-hadoop-alternatives branch from 81874ae to 35eddd7 on December 1, 2017
Contributor

@gszadovszky left a comment

LGTM, thanks for the changes and for making the modifications clear.


@rdblue
Contributor Author

rdblue commented Dec 5, 2017

@julienledem, would you like to have a look? Otherwise I'll merge this tomorrow.

@asfgit closed this in 8bfd9b4 on Dec 13, 2017
@shanielh

This is great! Will a new version with this PR be released soon?

@rdblue
Contributor Author

rdblue commented Jan 22, 2018

@shanielh, yes. We are planning a 1.10 release right now. Hopefully it will be out soon.
