Column index access support by danielcweeks · Pull Request #12 · apache/parquet-java

danielcweeks · 2014-07-08T18:06:29Z

This patch adds column-index-based access to Parquet files in Pig, which allows columns to be renamed as with other file formats. It is enabled by passing an alternate schema to the parameterized loader.

Example:

File Schema: {c1:int, c2:float, c3:chararray}

p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');

In this example, the names in the requested schema are mapped to the file's columns by position (n1 to c1, n2 to c2, n3 to c3), and tuples are produced according to index position.
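The positional mapping described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `ColumnIndexRenameSketch` and the method `mapByPosition` are hypothetical names, not part of parquet-pig, and schemas are reduced to plain lists of column names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from parquet-pig.
class ColumnIndexRenameSketch {

    // Pairs each requested field name with the file column at the same position.
    static Map<String, String> mapByPosition(List<String> fileColumns, List<String> requestedNames) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (int i = 0; i < requestedNames.size(); i++) {
            // A position past the end of the file schema has no source column.
            String source = i < fileColumns.size() ? fileColumns.get(i) : null;
            mapping.put(requestedNames.get(i), source);
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("c1", "c2", "c3");
        List<String> requestedNames = Arrays.asList("n1", "n2", "n3");
        // prints {n1=c1, n2=c2, n3=c3}
        System.out.println(mapByPosition(fileColumns, requestedNames));
    }
}
```

The names in the file never need to match the requested names; only the order matters, which is what gives the rename capability.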

Two test cases are included that exercise index-based access for both full file reads and column-projected reads.

Note: This patch also disables the enforcer plugin on the pig project, per discussion at the parquet meetup. The justification is that the enforcer is too strict for internal classes and results in dead code, because methods must be duplicated to add parameters even where the constructor/method has only one usage. The interface for the pig loader is imposed by LoadFunc and StoreFunc from the pig project, and the implementation's internals should not be used directly.


julienledem · 2014-07-18T23:33:52Z

parquet-pig/src/main/java/parquet/pig/ParquetLoader.java

We should make it explicit in the key name when a configuration setting is private (not user-facing), with something like parquet.private.pig.required.fields.

We don't currently distinguish for other properties (like parquet.pig.schema) which could potentially cause problems if someone decided to set it explicitly. I'm not sure that this is a real issue.

I think we should start flagging private keys as such even if we have been sloppy in the past.
parquet.pig.schema is semi public as it can come from the user.
requireFieldList is definitely private. Just a naming convention (parquet.internal...., parquet.private.... ?) would be enough.

No problem. I'll update the property names.
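The naming convention discussed in this thread could look roughly like the sketch below. The key parquet.private.pig.required.fields and the existing key parquet.pig.schema come from the comments above; the class name, constant names, and the `isPrivate` helper are hypothetical, and the final property names chosen in the patch may differ.

```java
// Hypothetical illustration of a private-key naming convention; not the actual parquet-pig code.
class PigConfigKeysSketch {

    // Assumed prefix marking keys that the loader sets for its own use.
    static final String PRIVATE_PREFIX = "parquet.private.";

    // Key name suggested in the review for the required-field list.
    static final String REQUIRED_FIELDS = PRIVATE_PREFIX + "pig.required.fields";

    // Semi-public key: its value can come from the user, so it keeps its unprefixed name.
    static final String SCHEMA = "parquet.pig.schema";

    // True when a key follows the private naming convention.
    static boolean isPrivate(String key) {
        return key.startsWith(PRIVATE_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(REQUIRED_FIELDS + " private? " + isPrivate(REQUIRED_FIELDS));
        System.out.println(SCHEMA + " private? " + isPrivate(SCHEMA));
    }
}
```

A prefix check like this would also make it easy to warn when a user sets a private key explicitly, which is the problem raised above.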

julienledem · 2014-07-19T01:33:30Z

The feature makes sense to me. I made some comments regarding the handling of using index or not.

danielcweeks · 2014-07-21T22:02:16Z

Updated the pull request. Please review changes.

julienledem · 2014-07-29T00:59:38Z

parquet-pig/src/main/java/parquet/pig/PigSchemaConverter.java

Could you use a subclass of ParquetRuntimeException?
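A minimal sketch of what this request amounts to is shown below. To keep the file self-contained, `ParquetRuntimeException` here is a local stand-in for the real parquet base class, and `SchemaFilterException` and `parseColumnIndex` are hypothetical names; the exception actually thrown in PigSchemaConverter is not shown in this thread.

```java
// Hypothetical illustration; the nested base class only stands in for parquet's ParquetRuntimeException.
class SchemaExceptionSketch {

    // Stand-in base class so the example compiles without parquet on the classpath.
    static abstract class ParquetRuntimeException extends RuntimeException {
        ParquetRuntimeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumed subclass used instead of a bare RuntimeException.
    static class SchemaFilterException extends ParquetRuntimeException {
        SchemaFilterException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Wraps a low-level failure in the project-specific exception type.
    static int parseColumnIndex(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            throw new SchemaFilterException("Invalid column index: " + raw, e);
        }
    }

    public static void main(String[] args) {
        try {
            parseColumnIndex("abc");
        } catch (ParquetRuntimeException e) {
            // Callers can catch the shared base type rather than RuntimeException.
            System.out.println(e.getMessage());
        }
    }
}
```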

julienledem · 2014-07-29T01:01:41Z

This looks good to me. I made a comment about the RuntimeExceptions that we should cleanup eventually but it's not a blocker.

tsdeng · 2014-07-29T16:46:15Z

parquet-pig/src/main/java/parquet/pig/PigSchemaConverter.java

else throw exception ?

If I understand your comment correctly, you mean why not throw an exception if "p.second <= schemaToFilter.getFieldCount()". We actually want to ignore the 'else' case, since that triggers null padding for fields that don't exist, which is the intended behavior.

See the commit for null padding.
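The null-padding behavior described in this reply can be sketched as follows. This is an assumption-laden illustration, not the code from the commit: `NullPaddingSketch` and `project` are hypothetical names, rows are plain lists, and the bounds check is written as `index < fileRow.size()` for zero-based indexes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of null padding for requested columns missing from the file.
class NullPaddingSketch {

    // Builds an output row by reading each requested column index from the file row.
    static List<Object> project(List<Object> fileRow, int[] requestedIndexes) {
        List<Object> out = new ArrayList<>(requestedIndexes.length);
        for (int index : requestedIndexes) {
            if (index < fileRow.size()) {
                out.add(fileRow.get(index));
            } else {
                // The 'else' case is deliberately not an error: pad with null.
                out.add(null);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> fileRow = Arrays.<Object>asList(1, 2.5f, "x");
        // Index 4 does not exist in the file, so the last field is null.
        // prints [1, x, null]
        System.out.println(project(fileRow, new int[] {0, 2, 4}));
    }
}
```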

This patch adds the ability to use column index based access to parquet files in pig, which allows for rename capability similar to other file formats. This is achieved by using the parametrized loader with an alternate schema.

Example:

p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');

In this example, the names from the requested schema will be translated to the column positions from the file and will produce tuples based on the index position.

Two test cases are included that exercise index based access for both full file reads and column projected reads.

Note: This patch also disables the enforcer plugin on the pig project per discussion at the parquet meetup. The justification for this is that the enforcer is too strict for internal classes and results in dead code because duplicating methods is required to add parameters where there is only one usage of the constructor/method. The interface for the pig loader is imposed by LoadFunc and StoreFunc by the pig project and the implementations internals should not be used directly.

Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#12 from dcw-netflix/column-index-access and squashes the following commits:

1b5c5cf [Daniel Weeks] Refactored based on rewview comments
12b53c1 [Daniel Weeks] Fixed some formatting and the missing filter method sig
e5553f1 [Daniel Weeks] Adding back default constructor to satisfy other project requirements
69d21e0 [Daniel Weeks] Merge branch 'master' into column-index-access
f725c6f [Daniel Weeks] Removed enforcer for pig support
d182dc6 [Daniel Weeks] Introduces column index access
1c3c0c7 [Daniel Weeks] Fixed test with strict checking off
f3cb495 [Daniel Weeks] Added type persuasion for primitive types with a flag to control strict type checking for conflicting schemas, which is strict by default.
Conflicts: parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java Resolution: removed parts of 9ad5485 (not backported) in the tests.

…rsion. Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807]. JIRA: https://issues.apache.org/jira/browse/PARQUET-1807 Add a test for writing and reading parquet in a number of encryption and decryption configurations. Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp. This adds parquet-testing repo as a submodule. Run the following to populate the "submodules/parquet-testing/" folder: git submodule update --init --recursive


Daniel Weeks added 7 commits June 19, 2014 10:51

f3cb495 Added type persuasion for primitive types with a flag to control strict type checking for conflicting schemas, which is strict by default.
1c3c0c7 Fixed test with strict checking off
d182dc6 Introduces column index access
f725c6f Removed enforcer for pig support
69d21e0 Merge branch 'master' into column-index-access (Conflicts: parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java parquet-pig/src/main/java/parquet/pig/convert/TupleConverter.java parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java)
e5553f1 Adding back default constructor to satisfy other project requirements
12b53c1 Fixed some formatting and the missing filter method sig

julienledem reviewed Jul 18, 2014

Refactored based on rewview comments

1b5c5cf

julienledem reviewed Jul 29, 2014

asfgit closed this in 17864df Jul 29, 2014

tsdeng reviewed Jul 29, 2014

andersonm-ibm mentioned this pull request Apr 6, 2020

PARQUET-1807: Encryption: Interop and Function test suite for Java version #782

Merged

parthchandra pushed a commit to parthchandra/incubator-parquet-mr that referenced this pull request May 13, 2022

rdar://89567863 (Add parquet footer cache feature in SMC) (apache#12)

e3aa11d

sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Jun 16, 2022

rdar://89567863 (Add parquet footer cache feature in SMC) (apache#12)

ad7a1d1

danielcweeks wants to merge 8 commits into apache:master from danielcweeks:column-index-access

3 participants
