Add a unified and optionally more constrained API for expressing filters on columns #4
Closed
isnotinvain wants to merge 96 commits into apache:master from isnotinvain:alexlevenson/filter-api
Conversation
Conflicts: parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java parquet-hadoop/src/test/java/parquet/hadoop/TestInputFormat.java
Member
The FilterPredicate and UnboundRecordFilter themselves could implement Filter directly.
Contributor
Author
UnboundRecordFilter is an interface :(
So this would be another breaking change.
Member
LGTM. Only a few minor comments left.
Member
+1 LGTM
Contributor
Author
Thanks!
Contributor
We don't need to call put every time; shouldn't we put only when alreadySeen is null?
asfgit
referenced
this pull request
in apache/spark
Oct 31, 2014
The parquet-mr project has introduced a new filter api (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire RowGroups depending on certain statistics like min/max. We can leverage that to further improve performance of queries with filters. Also, the filter2 api introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. Author: Yash Datta <Yash.Datta@guavus.com> Closes #2841 from saucam/master and squashes the following commits: 8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns 515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column 5f4530e [Yash Datta] SPARK-3968: Fix scala code style f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter 48163c3 [Yash Datta] SPARK-3968: Code cleanup cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api 49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns 9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
rdblue
pushed a commit
to rdblue/parquet-mr
that referenced
this pull request
Feb 6, 2015
…ers on columns This is a re-opened version of: https://github.com/Parquet/parquet-mr/pull/412 The idea behind this pull request is to add a way to express filters on columns using a DSL that gives parquet visibility into what is being filtered and how. This visibility will allow us to make optimizations at read time, the biggest one being filtering out entire row groups or pages of records without even reading them, based on the statistics / metadata that is stored along with each row group or page. Included in this api are interfaces for user defined predicates, which must operate at the value level but may opt in to operating at the row group / page level as well. This should make this new API a superset of the `parquet.filter` package. This new api will need to be reconciled with the column filters currently in the `parquet.filter` package, but I wanted to get feedback on this first. A limitation in both this api and the old one is that you can't do cross-column filters, eg: columnX > columnY.
Author: Alex Levenson <alexlevenson@twitter.com> Closes apache#4 from isnotinvain/alexlevenson/filter-api and squashes the following commits: c1ab7e3 [Alex Levenson] Address feedback c1bd610 [Alex Levenson] cleanup dotString in ColumnPath 418bfc1 [Alex Levenson] Update version, add temporary hacks for semantic enforcer 6643bd3 [Alex Levenson] Fix some more non backward incompatible changes 39f977f [Alex Levenson] Put a bunch of backwards compatible stuff back in, add @deprecated 13a02c6 [Alex Levenson] Fix compile errors, add back in overloaded getRecordReader f82edb7 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api 9bd014f [Alex Levenson] clean up TODOs and reference jiras 4cc7e87 [Alex Levenson] Add some comments 30e3d61 [Alex Levenson] Create a common interface for both kinds of filters ac153a6 [Alex Levenson] Create a Statistics class for use in UDPs fbbf601 [Alex Levenson] refactor IncrementallyUpdatedFilterPredicateGenerator to only generate the parts that require generation 5df47cd [Alex Levenson] Static imports of checkNotNull c1d1823 [Alex Levenson] address some of the minor feedback items 67a3ba0 [Alex Levenson] update binary's toString 3d7372b [Alex Levenson] minor fixes fed9531 [Alex Levenson] Add skipCurrentRecord method to clear events in thrift converter 2e632d5 [Alex Levenson] Make Binary Serializable 09c024f [Alex Levenson] update comments 3169849 [Alex Levenson] fix compilation error 0185030 [Alex Levenson] Add integration test for value level filters 4fde18c [Alex Levenson] move to right package ae36b37 [Alex Levenson] Handle merge issues af69486 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api 0665271 [Alex Levenson] Add tests for value inspector c5e3b07 [Alex Levenson] Add tests for resetter and evaluator 29f677a [Alex Levenson] Fix scala DSL 8897a28 [Alex Levenson] Fix some tests b448bee [Alex Levenson] Fix mistake in MessageColumnIO c8133f8 [Alex Levenson] Fix some tests 4cf686d [Alex Levenson] more null 
checks 69e683b [Alex Levenson] check all the nulls 220a682 [Alex Levenson] more cleanup aad5af3 [Alex Levenson] rm generated src file from git 5075243 [Alex Levenson] more minor cleanup 9966713 [Alex Levenson] Hook generation into maven build 8282725 [Alex Levenson] minor cleanup fea3ea9 [Alex Levenson] minor cleanup 9e35406 [Alex Levenson] move statistics filter c52750c [Alex Levenson] finish moving things around 97a6bfd [Alex Levenson] Move things around pt2 843b9fe [Alex Levenson] Move some files around pt 1 5eedcc0 [Alex Levenson] turn off dictionary support for AtomicConverter 541319e [Alex Levenson] various cleanup and fixes 08e9638 [Alex Levenson] rm ColumnPathUtil bfe6795 [Alex Levenson] Add type bounds to FilterApi 6c831ab [Alex Levenson] don't double log exception in SerializationUtil a7a58d1 [Alex Levenson] use ColumnPath instead of String 8f11a6b [Alex Levenson] Move ColumnPath and Canonicalizer to parquet-common 9164359 [Alex Levenson] stash abc2be2 [Alex Levenson] Add null handling to record filters -- this impl is still broken though 90ba8f7 [Alex Levenson] Update Serialization Util 0a261f1 [Alex Levenson] Add compression in SerializationUtil f1278be [Alex Levenson] Add comment, fix tests cbd1a85 [Alex Levenson] Replace some specialization with generic views e496cbf [Alex Levenson] Fix short circuiting in StatisticsFilter db6b32d [Alex Levenson] Address some comments, fix constructor in ParquetReader fd6f44d [Alex Levenson] Fix semver backward compat 2fdd304 [Alex Levenson] Some more cleanup d34fb89 [Alex Levenson] Cleanup some TODOs 544499c [Alex Levenson] stash 7b32016 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api 0e31251 [Alex Levenson] First pass at values filter, needs reworking 470e409 [Alex Levenson] fix java6/7 bug, minor cleanup ee7b221 [Alex Levenson] more InputFormat tests 5ef849e [Alex Levenson] Add guards for not specifying both kinds of filter 0186b1f [Alex Levenson] Add logging to ParquetInputFormat and tests for 
configuration a622648 [Alex Levenson] cleanup imports 9b1ea88 [Alex Levenson] Add tests for statistics filter d517373 [Alex Levenson] tests for filter validator b25fc44 [Alex Levenson] small cleanup of filter validator 32067a1 [Alex Levenson] add test for collapse logical nots 1efc198 [Alex Levenson] Add tests for invert filter predicate 046b106 [Alex Levenson] some more fixes d3c4d7a [Alex Levenson] fix some more types, add in test for SerializationUtil cc51274 [Alex Levenson] fix generics in FilterPredicateInverter ea08349 [Alex Levenson] First pass at rowgroup filter, needs testing 156d91b [Alex Levenson] Add runtime type checker 4dfb4f2 [Alex Levenson] Add serialization util 8f80b20 [Alex Levenson] update comment 7c25121 [Alex Levenson] Add class to Column struct 58f1190 [Alex Levenson] Remove filterByUniqueValues 7f20de6 [Alex Levenson] rename user predicates af14b42 [Alex Levenson] Update dsl 04409c5 [Alex Levenson] Add generic types into Visitor ba42884 [Alex Levenson] rm getClassName 65f8af9 [Alex Levenson] Add in support for user defined predicates on columns 6926337 [Alex Levenson] Add explicit tokens for notEq, ltEq, gtEq 667ec9f [Alex Levenson] remove test for collapsing double negation db2f71a [Alex Levenson] rename FilterPredicatesTest a0a0533 [Alex Levenson] Address first round of comments b2bca94 [Alex Levenson] Add scala DSL and tests bedda87 [Alex Levenson] Add tests for FilterPredicate building 238cbbe [Alex Levenson] Add scala dsl 39f7b24 [Alex Levenson] add scala mvn boilerplate 2ec71a7 [Alex Levenson] Add predicate API Conflicts: parquet-column/src/main/java/parquet/io/api/Binary.java parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java Resolution: InternalParquetRecordReader: conflicts from not backporting PARQUET-2, which were minor. Binary: changed several anonymous classes to private static. Conflict appears to be an artifact of major changes. 
The important thing to verify is that these don't break binary compatibility. Version conflicts: parquet-avro/pom.xml parquet-cascading/pom.xml parquet-column/pom.xml parquet-common/pom.xml parquet-encoding/pom.xml parquet-generator/pom.xml parquet-hadoop-bundle/pom.xml parquet-hadoop/pom.xml parquet-hive-bundle/pom.xml parquet-hive/parquet-hive-binding/parquet-hive-0.10-binding/pom.xml parquet-hive/parquet-hive-binding/parquet-hive-0.12-binding/pom.xml parquet-hive/parquet-hive-binding/parquet-hive-binding-bundle/pom.xml parquet-hive/parquet-hive-binding/parquet-hive-binding-factory/pom.xml parquet-hive/parquet-hive-binding/parquet-hive-binding-interface/pom.xml parquet-hive/parquet-hive-binding/pom.xml parquet-hive/parquet-hive-storage-handler/pom.xml parquet-hive/pom.xml parquet-jackson/pom.xml parquet-pig-bundle/pom.xml parquet-pig/pom.xml parquet-protobuf/pom.xml parquet-scrooge/pom.xml parquet-test-hadoop2/pom.xml parquet-thrift/pom.xml parquet-tools/pom.xml pom.xml
rdblue
pushed a commit
to rdblue/parquet-mr
that referenced
this pull request
Mar 9, 2015
asfgit
pushed a commit
that referenced
this pull request
Jul 1, 2015
Author: asingh <asingh@cloudera.com> Author: Alex Levenson <alexlevenson@twitter.com> Author: Ashish Singh <asingh@cloudera.com> Closes #197 from SinghAsDev/PARQUET-251 and squashes the following commits: 68e0eae [asingh] Remove deprecated constructors from private classes 67e4e5f [asingh] Add removed public methods in Binary and deprecate them 0e71728 [asingh] Add comment for BinaryStatistics.setMinMaxFromBytes fbe873f [Ashish Singh] Merge pull request #4 from isnotinvain/PR-197-3 9826ee6 [Alex Levenson] Some minor cleanup 7570035 [asingh] Remove test for stats getting ingnored for version 160 when type is int64 af43d28 [Alex Levenson] Address PR feedback 89ab4ee [Alex Levenson] put the headers in the right location 2838cc9 [Alex Levenson] Split out version checks to separate files, add some tests 5af9142 [Alex Levenson] Generalize tests, make Binary.fromString reused=false e00d9b7 [asingh] Rename isReused => isBackingBytesReused d2ad939 [asingh] Rebase over latest trunk 857141a [asingh] Remove redundant junit dependency 32b88ed [asingh] Remove semver from hadoop-common 7a0e99e [asingh] Revert to fromConstantByteArray for ByteString c820ec9 [asingh] Add unit tests for Binary and to check if stats are ignored for version 160 9bbd1e5 [asingh] Improve version parsing 84a1d8b [asingh] Remove ignoring stats on write side and ignore it on read side 903f8e3 [asingh] Address some review comments. 
* Ignore stats for writer's version < 1.8.0 * Refactor shoudlIgnoreStatistics method a bit * Assume implementations other than parquet-mr were writing binary statistics correctly * Add toParquetStatistics method's original method signature to maintain backwards compatibility and mark it as deprecated 64c2617 [asingh] Revert changes for ignoring stats at RowGroupFilter level e861b18 [asingh] Ignore max min stats while reading 3a8cb8d [asingh] Fix typo 8e12618 [asingh] Fix usage of fromConstant versions of Binary constructors 860adf7 [asingh] Rename unmodified to constant and isReused instead of isUnmodifiable 0d127a7 [asingh] Add unmodfied and Reused versions for creating a Binary. Add copy() to Binary. b4e2950 [asingh] Skip filtering based on stats when file was written with version older than 1.6.1 6fcee8c [asingh] Add getBytesUnsafe() to Binary that returns backing byte[] if possible, else returns result of getBytes() 30b07dd [asingh] PARQUET-251: Binary column statistics error when reuse byte[] among rows
costimuraru
pushed a commit
to costimuraru/parquet-mr
that referenced
this pull request
Apr 29, 2017
julienledem
pushed a commit
to julienledem/parquet-java
that referenced
this pull request
Jun 9, 2017
chenjunjiedada
pushed a commit
to chenjunjiedada/parquet-mr
that referenced
this pull request
Aug 3, 2019
Add thrift consumers for ENCRYPTION_ALGORITHM and FOOTER_SIGNING_KEY_METADATA
shangxinli
added a commit
to shangxinli/parquet-mr
that referenced
this pull request
Mar 25, 2020
This is a re-opened version of:
https://github.com/Parquet/parquet-mr/pull/412
The idea behind this pull request is to add a way to express filters on columns using a DSL that gives parquet visibility into what is being filtered and how. This visibility will allow us to make optimizations at read time, the biggest one being filtering out entire row groups or pages of records without even reading them, based on the statistics / metadata that is stored along with each row group or page.
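To make the read-time optimization concrete, here is a minimal, self-contained sketch of the idea. It is not the API added by this pull request: `RowGroupPruningSketch`, `IntPredicate`, `IntStats`, `gt`, `lt`, `and`, and `scan` are all hypothetical names, and real row groups hold encoded column chunks rather than `int[]` arrays. The point is only that a predicate built from inspectable parts can answer "can this whole row group be dropped?" from min/max statistics before any value is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of statistics-based row group skipping; not the parquet-mr API. */
public class RowGroupPruningSketch {

  /** Min/max statistics assumed to be stored alongside each row group. */
  static final class IntStats {
    final int min;
    final int max;
    IntStats(int min, int max) { this.min = min; this.max = max; }
  }

  /** A predicate the reader can inspect, rather than an opaque callback. */
  interface IntPredicate {
    boolean keep(int value);          // value-level check
    boolean canDrop(IntStats stats);  // true only if no value in the row group can match
  }

  static IntPredicate gt(final int bound) {
    return new IntPredicate() {
      public boolean keep(int value) { return value > bound; }
      public boolean canDrop(IntStats stats) { return stats.max <= bound; }
    };
  }

  static IntPredicate lt(final int bound) {
    return new IntPredicate() {
      public boolean keep(int value) { return value < bound; }
      public boolean canDrop(IntStats stats) { return stats.min >= bound; }
    };
  }

  static IntPredicate and(final IntPredicate a, final IntPredicate b) {
    return new IntPredicate() {
      public boolean keep(int value) { return a.keep(value) && b.keep(value); }
      // If either side rules out the row group, the conjunction does too.
      public boolean canDrop(IntStats stats) { return a.canDrop(stats) || b.canDrop(stats); }
    };
  }

  static final class RowGroup {
    final IntStats stats;
    final int[] values;
    RowGroup(IntStats stats, int[] values) { this.stats = stats; this.values = values; }
  }

  /** Builds a row group from at least one value and computes its statistics. */
  static RowGroup rowGroup(int... values) {
    int min = values[0];
    int max = values[0];
    for (int v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    return new RowGroup(new IntStats(min, max), values);
  }

  static final class ScanResult {
    final List<Integer> matches = new ArrayList<>();
    int rowGroupsSkipped;
  }

  static ScanResult scan(List<RowGroup> groups, IntPredicate predicate) {
    ScanResult result = new ScanResult();
    for (RowGroup group : groups) {
      if (predicate.canDrop(group.stats)) {
        result.rowGroupsSkipped++;  // values are never touched
        continue;
      }
      for (int v : group.values) {
        if (predicate.keep(v)) {
          result.matches.add(v);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<RowGroup> groups = Arrays.asList(rowGroup(1, 2, 3), rowGroup(10, 20, 30));
    ScanResult result = scan(groups, and(gt(5), lt(25)));
    System.out.println(result.matches + " skipped=" + result.rowGroupsSkipped);
  }
}
```

In this sketch the first row group is dropped from its statistics alone, and only the second is scanned value by value.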
Included in this api are interfaces for user defined predicates, which must operate at the value level but may opt in to operating at the row group / page level as well. This should make this new API a superset of the `parquet.filter` package. This new api will need to be reconciled with the column filters currently in the `parquet.filter` package, but I wanted to get feedback on this first.

A limitation in both this api and the old one is that you can't do cross-column filters, eg: columnX > columnY.
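The "must operate at the value level, may opt in to the row group / page level" contract for user defined predicates can be sketched as follows. This is an illustration under assumptions, not the interface from this pull request: `UserPredicate`, `LongStats`, `IsEven`, and `InSet` are hypothetical names. The base class forces a value-level `keep` and supplies a conservative `canDrop` that never drops anything, so a predicate only takes part in statistics-based skipping if it overrides that method. Each predicate here sees a single column's values and statistics, which is also why a cross-column comparison cannot be expressed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of an opt-in user defined predicate; not the parquet-mr API. */
public class UserDefinedPredicateSketch {

  /** Min/max statistics assumed to be available for one column of a row group or page. */
  static final class LongStats {
    final long min;
    final long max;
    LongStats(long min, long max) { this.min = min; this.max = max; }
  }

  abstract static class UserPredicate {
    /** Required: decide for a single value of a single column. */
    abstract boolean keep(long value);

    /** Optional: conservative default, never drops a row group or page. */
    boolean canDrop(LongStats stats) { return false; }
  }

  /** Value-level only: min/max says nothing useful about parity. */
  static final class IsEven extends UserPredicate {
    @Override
    boolean keep(long value) { return value % 2 == 0; }
  }

  /** Opts in to statistics: drop when no allowed value falls inside [min, max]. */
  static final class InSet extends UserPredicate {
    private final Set<Long> allowed;

    InSet(Long... values) { this.allowed = new HashSet<>(Arrays.asList(values)); }

    @Override
    boolean keep(long value) { return allowed.contains(value); }

    @Override
    boolean canDrop(LongStats stats) {
      for (long v : allowed) {
        if (v >= stats.min && v <= stats.max) {
          return false;  // at least one allowed value could be present
        }
      }
      return true;
    }
  }

  public static void main(String[] args) {
    LongStats stats = new LongStats(10L, 20L);
    System.out.println(new IsEven().canDrop(stats));
    System.out.println(new InSet(3L, 50L).canDrop(stats));
    System.out.println(new InSet(3L, 15L).canDrop(stats));
  }
}
```

Defaulting `canDrop` to `false` keeps a predicate that ignores statistics correct, at the cost of reading every row group it is applied to.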