Skip to content

Column index access support#12

Closed
danielcweeks wants to merge 8 commits intoapache:masterfrom
danielcweeks:column-index-access
Closed

Column index access support#12
danielcweeks wants to merge 8 commits intoapache:masterfrom
danielcweeks:column-index-access

Conversation

@danielcweeks
Copy link
Copy Markdown

This patch adds the ability to use column index based access to parquet files in pig, which allows for rename capability similar to other file formats. This is achieved by using the parametrized loader with an alternate schema.

Example:

File Schema: {c1:int, c2:float, c3:chararray}

p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');

In this example, the names from the requested schema will be translated to the column positions from the file and will produce tuples based on the index position.

Two test cases are included that exercise index based access for both full file reads and column projected reads.

Note: This patch also disables the enforcer plugin on the pig project per discussion at the parquet meetup. The justification for this is that the enforcer is too strict for internal classes and results in dead code because duplicating methods is required to add parameters where there is only one usage of the constructor/method. The interface for the pig loader is imposed by LoadFunc and StoreFunc by the pig project and the implementations internals should not be used directly.

Daniel Weeks added 7 commits June 19, 2014 10:51
…ct type checking for conflicting schemas, which is strict by default.
Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-pig/src/main/java/parquet/pig/convert/TupleConverter.java
	parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make it explicit in the key name when a configuration setting is private (not user facing)
something like parquet.private.pig.required.fields

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently distinguish for other properties (like parquet.pig.schema) which could potentially cause problems if someone decided to set it explicitly. I'm not sure that this is a real issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start flagging private keys as such even if we have been sloppy in the past.
parquet.pig.schema is semi public as it can come from the user.
requireFieldList is definitely private. Just a naming convention (parquet.internal...., parquet.private.... ?) would be enough.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I'll update the property names.

@julienledem
Copy link
Copy Markdown
Member

The feature makes sense to me. I made some comments regarding the handling of using index or not.

@danielcweeks
Copy link
Copy Markdown
Author

Updated the pull request. Please review changes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use a subclass of ParquetRuntimeException?

@julienledem
Copy link
Copy Markdown
Member

This looks good to me. I made a comment about the RuntimeExceptions that we should cleanup eventually but it's not a blocker.

@asfgit asfgit closed this in 17864df Jul 29, 2014
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else throw exception ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand your comment correctly, you mean whey not throw an exception if "p.second <= schemaToFilter.getFieldCount()". We actually want to ignore the 'else' case since that will trigger null padding for fields that don't exist, which is the intended behavior.

See the commit for null padding.

rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
This patch adds the ability to use column index based access to parquet files in pig, which allows for rename capability similar to other file formats.  This is achieved by using the parametrized loader with an alternate schema.

Example:
p = LOAD '/data/parquet/' USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:chararray', 'true');

In this example, the names from the requested schema will be translated to the column positions from the file and will produce tuples based on the index position.

Two test cases are included that exercise index based access for both full file reads and column projected reads.

Note:  This patch also disables the enforcer plugin on the pig project per discussion at the parquet meetup.  The justification for this is that the enforcer is too strict for internal classes and results in dead code because duplicating methods is required to add parameters where there is only one usage of the constructor/method.  The interface for the pig loader is imposed by LoadFunc and StoreFunc by the pig project and the implementations internals should not be used directly.

Author: Daniel Weeks <dweeks@netflix.com>

Closes apache#12 from dcw-netflix/column-index-access and squashes the following commits:

1b5c5cf [Daniel Weeks] Refactored based on rewview comments
12b53c1 [Daniel Weeks] Fixed some formatting and the missing filter method sig
e5553f1 [Daniel Weeks] Adding back default constructor to satisfy other project requirements
69d21e0 [Daniel Weeks] Merge branch 'master' into column-index-access
f725c6f [Daniel Weeks] Removed enforcer for pig support
d182dc6 [Daniel Weeks] Introduces column index access
1c3c0c7 [Daniel Weeks] Fixed test with strict checking off
f3cb495 [Daniel Weeks] Added type persuasion for primitive types with a flag to control strict type checking for conflicting schemas, which is strict by default.

Conflicts:
	parquet-pig/src/test/java/parquet/pig/TestParquetLoader.java
    Resolution: removed parts of 9ad5485 (not backported) in the tests.
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request Apr 6, 2020
…rsion.

Depends on PR apache#776 for [PARQUET-1229] and on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption and decryption configurations.
Add interop test that reads files from parquet-testing GitHub repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request Apr 6, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request May 10, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
andersonm-ibm added a commit to andersonm-ibm/parquet-mr that referenced this pull request May 29, 2020
…rsion

Depends on PR apache#776 for [PARQUET-1229] and
on PR apache#12 in parquet-testing for [PARQUET-1807].
JIRA: https://issues.apache.org/jira/browse/PARQUET-1807
Add a test for writing and reading parquet in a number of encryption
and decryption configurations.
Add interop test that reads files from parquet-testing GitHub
repository, that were written by parquet-cpp.
This adds parquet-testing repo as a submodule.
Run the following to populate the "submodules/parquet-testing/" folder:
   git submodule update --init --recursive
parthchandra pushed a commit to parthchandra/incubator-parquet-mr that referenced this pull request May 13, 2022
sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants