PARQUET-968 add flag to enable writing using specs-compliant schemas #2
Conversation
A test suite is available at https://github.com/BenoitHanotte/parquet-968. It shows the different interpretations of the written schemas with Spark:
Awesome!
```java
    final GroupBuilder<T> builder) {
  return builder
      .group(Type.Repetition.REQUIRED).as(OriginalType.LIST)
      .group(Type.Repetition.OPTIONAL).as(OriginalType.LIST)
```
@BenoitHanotte, we had a discussion a while back around the LIST wrapper being REQUIRED:
apache#411 (comment)
I strongly believe the wrappers should be optional for the following reasons:
- it is consistent with how Maps are handled
- it is consistent with the fact that parquet-protobuf does not write fields if they have default values (protobuf considers them cleared). This is currently the case for primitives and complex messages, and I would like collections to behave the same way.
- every field is optional with protobuf 3
I haven't seen any arguments for making the wrapper required in the first place, was there a specific reason I am not aware of?
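For illustration, here is what a three-level list with an optional wrapper group could look like in parquet schema notation (the message and field names are made up; the inner "list"/"element" names follow the parquet LIST convention):

```
message Example {
  optional group my_list (LIST) {
    repeated group list {
      optional binary element (UTF8);
    }
  }
}
```

With an optional wrapper, an unset proto collection can be written as a null `my_list`, mirroring how a cleared primitive field is simply omitted.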
Please note that in my comment above I say "If you want to have systematically a 3 level ..."
I think in protobuf you don't want to have it systematically, because the parquet schema maps to the proto schema.
```java
GroupBuilder<GroupBuilder<GroupBuilder<GroupBuilder<T>>>> result =
    builder
        .group(Type.Repetition.REQUIRED).as(OriginalType.LIST)
        .group(Type.Repetition.OPTIONAL).as(OriginalType.LIST)
```
For users that require backward compatibility with parquet 1.9.0 and older, the flag "parquet.proto.writeSpecsCompliant" is introduced to allow writing collections in the old style (using repeated fields without the LIST and MAP wrappers recommended by the parquet specs).
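As a sketch (field names assumed, writer output may differ in detail), the two styles differ as follows for a `repeated string labels = 1;` proto field. Old style, with the flag left at its default:

```
repeated binary labels (UTF8);
```

Specs-compliant style, with "parquet.proto.writeSpecsCompliant" set to true:

```
optional group labels (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
```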
Looks good to me.
@BenoitHanotte, I am trying to write a local parquet file with your patch, but using a

```java
ProtoParquetWriter<MessageOrBuilder> writer =
    new ProtoParquetWriter<MessageOrBuilder>(file, cls, CompressionCodecName.GZIP,
        256 * 1024 * 1024, 1 * 1024 * 1024);
```

Should we add some extra constructors there, in the `ProtoParquetWriter` class? I see we also have a unit test that uses this class: https://github.com/BenoitHanotte/parquet-mr/blob/master/parquet-protobuf/src/test/java/org/apache/parquet/proto/TestUtils.java#L177
@costimuraru Yes, we may need to add the flag to the ProtoParquetWriter constructor to allow using it directly. What do you think about doing it in a new PR? This PR has been thoroughly reviewed already, and I would like to avoid needing a new pass on it, as that could take some time (or any variation of this). Also, it seems that the only non-deprecated …

What do you think?
Sounds good @BenoitHanotte!

```java
Configuration conf = new Configuration();
ProtoWriteSupport.setWriteSpecsCompliant(conf, true);
ParquetWriter<MyMessage> writer =
    new ParquetWriter<>(file, new ProtoWriteSupport<MyMessage>(cls),
        CompressionCodecName.GZIP, 256 * 1024 * 1024, 1 * 1024 * 1024,
        1 * 1024 * 1024, true, false, ParquetWriter.DEFAULT_WRITER_VERSION, conf);
```

I was then able to take the parquet file, upload it to AWS S3, and query it using AWS Athena (which uses Presto). Queries worked for our complex schema (with 90+ columns, lists, maps, and inner fields). All seems to be good!
Perfect, thanks :)
We introduce the "parquet.proto.writeSpecsCompliant" flag, which allows writing specs-compliant schemas (wrapping collections with LIST and MAP) so that the generated files are compatible with other formats and frameworks such as Hive, Presto, etc.
By default this flag is set to false to keep backward compatibility (the schemas generated will be consistent with the ones prior to PARQUET-968).
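Since the flag is a plain Hadoop Configuration key, it could presumably also be enabled through a site configuration file rather than programmatically (a sketch, assuming the writer picks up the job's Configuration):

```xml
<!-- e.g. in a Hadoop *-site.xml picked up by the job -->
<property>
  <name>parquet.proto.writeSpecsCompliant</name>
  <value>true</value>
</property>
```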