Conversation
…ct type checking for conflicting schemas, which is strict by default.
Conflicts:
parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
parquet-pig/src/main/java/parquet/pig/convert/TupleConverter.java
parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java
We should make it explicit in the key name when a configuration setting is private (not user facing); something like parquet.private.pig.required.fields.
We don't currently distinguish for other properties (like parquet.pig.schema), which could potentially cause problems if someone decided to set it explicitly. I'm not sure this is a real issue.
I think we should start flagging private keys as such, even if we have been sloppy in the past.
parquet.pig.schema is semi-public, as it can come from the user.
requireFieldList is definitely private. Just a naming convention (parquet.internal...., parquet.private.... ?) would be enough.
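The naming convention proposed above could be sketched as follows. This is a hypothetical illustration, not the actual parquet-pig code; the class and method names are made up, though the key strings match the ones discussed in this thread:

```java
// Hypothetical sketch of the proposed convention: internal (non user-facing)
// configuration keys carry a marker segment so they are visibly private.
public class ConfigKeys {
    // Semi-public: may be set by users.
    public static final String PARQUET_PIG_SCHEMA = "parquet.pig.schema";

    // Private/internal: the ".private." segment flags it as not user facing.
    public static final String PARQUET_PRIVATE_PIG_REQUIRED_FIELDS =
            "parquet.private.pig.required.fields";

    // A key is considered internal if it uses either proposed marker.
    public static boolean isPrivateKey(String key) {
        return key.contains(".private.") || key.contains(".internal.");
    }

    public static void main(String[] args) {
        System.out.println(isPrivateKey(PARQUET_PRIVATE_PIG_REQUIRED_FIELDS)); // true
        System.out.println(isPrivateKey(PARQUET_PIG_SCHEMA));                  // false
    }
}
```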
No problem. I'll update the property names.
The feature makes sense to me. I made some comments regarding the handling of whether or not to use index-based access.
Updated the pull request. Please review the changes.
Could you use a subclass of ParquetRuntimeException?
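The suggestion above could look roughly like this. A minimal sketch only: ParquetRuntimeException is modeled as a plain RuntimeException subclass for illustration, and SchemaConversionException is a hypothetical name, not an actual class in parquet-pig:

```java
// Base type modeled for the sketch; the real class lives in parquet-common.
class ParquetRuntimeException extends RuntimeException {
    ParquetRuntimeException(String message) { super(message); }
}

// Hypothetical dedicated subclass, thrown instead of a bare RuntimeException.
class SchemaConversionException extends ParquetRuntimeException {
    SchemaConversionException(String message) { super(message); }
}

public class ExceptionSketch {
    public static void main(String[] args) {
        try {
            throw new SchemaConversionException("requested schema does not match file schema");
        } catch (ParquetRuntimeException e) {
            // Callers can catch the Parquet-specific base type instead of RuntimeException.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The benefit is that callers can distinguish Parquet-specific failures from arbitrary runtime errors by catching the common base type.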
This looks good to me. I made a comment about the RuntimeExceptions that we should clean up eventually, but it's not a blocker.
If I understand your comment correctly, you mean why not throw an exception if "p.second <= schemaToFilter.getFieldCount()". We actually want to ignore the 'else' case, since that will trigger null padding for fields that don't exist, which is the intended behavior.
See the commit for null padding.
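The filtering decision discussed above can be sketched like this. This is an illustration only, not the actual parquet-pig code; the method name is hypothetical and positions are 0-based here, unlike the 1-based comparison quoted in the comment:

```java
import java.util.ArrayList;
import java.util.List;

public class FilterSketch {
    // Keep only requested column positions that exist in the file schema.
    // Positions past the end are intentionally skipped (the 'else' case is
    // ignored), which later triggers null padding for those fields.
    static List<Integer> keepExistingPositions(List<Integer> requestedPositions,
                                               int fileFieldCount) {
        List<Integer> kept = new ArrayList<>();
        for (int pos : requestedPositions) {
            if (pos < fileFieldCount) {
                kept.add(pos); // column exists in the file
            }
            // else: field missing from the file -> will be null-padded, not an error
        }
        return kept;
    }

    public static void main(String[] args) {
        // File has 3 columns; the 4th requested position is silently dropped.
        System.out.println(keepExistingPositions(List.of(0, 1, 2, 3), 3));
    }
}
```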
This patch adds the ability to use column-index-based access to parquet files in pig, which allows rename capability similar to other file formats. This is achieved by using the parameterized loader with an alternate schema.
Example:
File Schema: {c1:int, c2:float, c3:chararray}
p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');
In this example, the names from the requested schema will be translated to the column positions from the file and will produce tuples based on the index position.
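The translation described above can be sketched as follows. A minimal illustration under stated assumptions, not the actual ParquetLoader implementation: the class and method names are hypothetical, and a file row is modeled as a plain list of column values:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexAccessSketch {
    // Index-based access: the requested schema supplies the field names,
    // while values are fetched purely by column position in the file.
    static List<Object> projectByIndex(List<Object> fileRow, int requestedFieldCount) {
        List<Object> tuple = new ArrayList<>();
        for (int i = 0; i < requestedFieldCount; i++) {
            // Requested fields beyond the file's column count are padded
            // with null, matching the behavior described in this thread.
            tuple.add(i < fileRow.size() ? fileRow.get(i) : null);
        }
        return tuple;
    }

    public static void main(String[] args) {
        List<Object> fileRow = List.of(1, 2.0f, "a"); // columns c1, c2, c3
        // Requesting 4 fields: the 4th has no file column and becomes null.
        System.out.println(projectByIndex(fileRow, 4));
    }
}
```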
Two test cases are included that exercise index based access for both full file reads and column projected reads.
Note: This patch also disables the enforcer plugin on the pig project per discussion at the parquet meetup. The justification is that the enforcer is too strict for internal classes and results in dead code, because duplicating methods is required to add parameters even where there is only one usage of the constructor/method. The interface for the pig loader is imposed by LoadFunc and StoreFunc from the pig project, and the implementation's internals should not be used directly.
Author: Daniel Weeks <dweeks@netflix.com>
Closes apache#12 from dcw-netflix/column-index-access and squashes the following commits:
1b5c5cf [Daniel Weeks] Refactored based on review comments
12b53c1 [Daniel Weeks] Fixed some formatting and the missing filter method sig
e5553f1 [Daniel Weeks] Adding back default constructor to satisfy other project requirements
69d21e0 [Daniel Weeks] Merge branch 'master' into column-index-access
f725c6f [Daniel Weeks] Removed enforcer for pig support
d182dc6 [Daniel Weeks] Introduces column index access
1c3c0c7 [Daniel Weeks] Fixed test with strict checking off
f3cb495 [Daniel Weeks] Added type persuasion for primitive types with a flag to control strict type checking for conflicting schemas, which is strict by default.
Conflicts:
parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java
Resolution: removed parts of 9ad5485 (not backported) in the tests.
…rsion. Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add an interop test that reads files from the parquet-testing GitHub repository that were written by parquet-cpp.
This adds the parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder:
git submodule update --init --recursive