Skip to content

PARQUET-968 add flag to enable writing using specs-compliant schemas#2

Merged
costimuraru merged 4 commits intocostimuraru:PARQUET-968from
BenoitHanotte:PARQUET-968
Apr 13, 2018
Merged

PARQUET-968 add flag to enable writing using specs-compliant schemas#2
costimuraru merged 4 commits intocostimuraru:PARQUET-968from
BenoitHanotte:PARQUET-968

Conversation

@BenoitHanotte
Copy link
Copy Markdown

We introduce the "parquet.proto.writeSpecsCompliant" flag that allows writing using the specs compliant schemas (wrapping collection with LIST and MAP) so that files generated are compatible with other formats and frameworks such as Hive, Presto,...
By default this flag is set to false to keep backward compatibility (the schemas generated will be consistent with the ones prior to PARQUET-968).

@BenoitHanotte
Copy link
Copy Markdown
Author

A test suite is available at https://github.com/BenoitHanotte/parquet-968 . It shows the different interpretation of the written schemas with Spark:

  1. old behavior (default):
+-------------+----------------+--------+----------------+
|emptyRepeated|nonEmptyRepeated|emptyMap|     nonEmptyMap|
+-------------+----------------+--------+----------------+
|           []|          [1, 1]|      []|[[1, 1], [2, 2]]|
+-------------+----------------+--------+----------------+
  1. new behavior (specs-compliant)
+-------------+----------------+--------+----------------+
|emptyRepeated|nonEmptyRepeated|emptyMap|     nonEmptyMap|
+-------------+----------------+--------+----------------+
|         null|          [1, 1]|    null|[1 -> 1, 2 -> 2]|
+-------------+----------------+--------+----------------+

@costimuraru
Copy link
Copy Markdown
Owner

Awesome!
Will have a look today

final GroupBuilder<T> builder) {
return builder
.group(Type.Repetition.REQUIRED).as(OriginalType.LIST)
.group(Type.Repetition.OPTIONAL).as(OriginalType.LIST)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenoitHanotte, we had a discussion a while back around the LIST wrapper being REQUIRED:
apache#411 (comment)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly believe the wrappers should be optional for the following reasons:

  1. it is consistent with how Maps are handled
  2. it is consistent with the fact that parquet-protobuf does not write fields if they have the default values (protobuf considers them cleared). this is currently the case for primitives and complex messages, I would like collections to behave the same way.
  3. every field is optional with protobuf 3
    I haven't seen any arguments for making the wrapper required in the first place, was there a specific reason I am not aware of?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note than in my comment above I say "If you want to have systematically a 3 level ..."
I think in protobuf you don't want to have it systematically because the parquet schema maps to the proto schema.

GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
builder
.group(Type.Repetition.REQUIRED).as(OriginalType.LIST)
.group(Type.Repetition.OPTIONAL).as(OriginalType.LIST)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

BenoitHanotte added 3 commits March 21, 2018 09:52
For users that require backward compatibility with parquet 1.9.0 and
older, the flag "parquet.proto.writeSpecsCompliant" is introduced to
allow writing collection using the old style (using repeated and not
using the LIST and MAP wrappers that are recommended by the parquet
specs).
@BenoitHanotte BenoitHanotte force-pushed the PARQUET-968 branch 7 times, most recently from 033e84d to 897f9e3 Compare March 27, 2018 14:17
@qinghui-xu
Copy link
Copy Markdown

Looks good to me.

@costimuraru
Copy link
Copy Markdown
Owner

costimuraru commented Apr 12, 2018

@BenoitHanotte, I am trying to write a local parquet file with your patch, but using a ProtoParquetWriter. I can't seem to be able to pass the writeSpecsCompliant flag to this class though.

ProtoParquetWriter<MessageOrBuilder> writer = 
   new ProtoParquetWriter<MessageOrBuilder>(file, cls, CompressionCodecName.GZIP, 256 * 1024 * 1024, 1 * 1024 * 1024);

Should we add some extra constructors there in the ProtoParquetWriter.java to be able to pass this flag? (https://github.com/BenoitHanotte/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L47)

I see we also have a unit test that uses this class: https://github.com/BenoitHanotte/parquet-mr/blob/master/parquet-protobuf/src/test/java/org/apache/parquet/proto/TestUtils.java#L177

@BenoitHanotte
Copy link
Copy Markdown
Author

BenoitHanotte commented Apr 13, 2018

@costimuraru Yes we may need to add the flag to the ProtoParquetWriter constructor to allow using it directly.

What do you think about doing it in a new PR? This PR has been thoroughly reviewed already, and I would like to avoid needing a new pass on it as it could take some time.
In the current state, the project is fully functional, and you can manually create your ParquetWriter by providing the ProtoWriteSupport as following:

Configuration conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(true, conf); // set the flag in the configuration
new ParquetWriter(file, conf, new ProtoWriteSupport(protoClass));

(or any variation of this as ParquetWriter has multiple constructors that accept a configuration in which we can set the flag)

Also, it seems that the only non-deprecated ParquetWriter constructor is one that accepts a Configuration as a parameter, thus I propose we do the following:

  1. Add a new constructor to ProtoParquetWrtier that accepts the flag for pure convenience. The flag is then set in a configuration object (as in the example above) and we call a ParquetWriter constructor that takes the configuration as parameter.
  2. In a second step, it would be a good idea to refactor the ProtoParquetWriter class as it currently calls deprecated constructors from ParquetWriter (that would need to be in another PR)

What do you think?

@costimuraru
Copy link
Copy Markdown
Owner

costimuraru commented Apr 13, 2018

Sounds good @BenoitHanotte !
I was able to create the parquet file by creating a ParquetWriter (instead of a ProtoParquetWriter):

Configuration conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(conf, true);

ParquetWriter<MyMessage> writer 
   = new ParquetWriter<>(file, new ProtoWriteSupport<MyMessage>(cls), CompressionCodecName.GZIP, 256 * 1024 * 1024, 1 * 1024 * 1024,  1 * 1024 * 1024, true, false, ParquetWriter.DEFAULT_WRITER_VERSION, conf);

I was then able to take the parquet file, upload it to AWS S3, and query the file using AWS Athena (which uses Presto). Queries worked for our complex schema (with 90+ columns, lists, maps and inner fields). All seems to be good!

@costimuraru costimuraru merged commit 16eafcb into costimuraru:PARQUET-968 Apr 13, 2018
@BenoitHanotte
Copy link
Copy Markdown
Author

Perfect, thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants