Add a unified and optionally more constrained API for expressing filters on columns#4

Closed
isnotinvain wants to merge 96 commits into apache:master from isnotinvain:alexlevenson/filter-api

@isnotinvain
Contributor

This is a re-opened version of:
https://github.com/Parquet/parquet-mr/pull/412

The idea behind this pull request is to add a way to express filters on columns using a DSL that gives Parquet visibility into what is being filtered and how. This visibility will allow us to make optimizations at read time, the biggest one being skipping entire row groups or pages of records without even reading them, based on the statistics / metadata stored along with each row group or page.
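To make the row group skipping idea concrete, here is a minimal, self-contained sketch. It is not the code in this pull request: `ColumnStats`, `RowGroupPruning`, and their methods are hypothetical names made up for illustration, and the sketch assumes each row group stores a min and max for a single `long` column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for per-row-group column statistics (not a parquet-mr class).
final class ColumnStats {
    final long min;
    final long max;

    ColumnStats(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

// Hypothetical helper showing how min/max statistics can rule out a row group.
final class RowGroupPruning {
    private RowGroupPruning() {}

    // True when no value in [min, max] can satisfy "column > threshold".
    static boolean canDropGt(ColumnStats stats, long threshold) {
        return stats.max <= threshold;
    }

    // True when no value in [min, max] can satisfy "column == value".
    static boolean canDropEq(ColumnStats stats, long value) {
        return value < stats.min || value > stats.max;
    }

    // Indexes of the row groups that still have to be read for "column > threshold".
    static List<Integer> rowGroupsToRead(List<ColumnStats> rowGroups, long threshold) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < rowGroups.size(); i++) {
            if (!canDropGt(rowGroups.get(i), threshold)) {
                keep.add(i);
            }
        }
        return keep;
    }
}
```

A row group that cannot be dropped may still contain records that fail the filter, so a value-level check is still needed for the row groups that are read.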

Included in this API are interfaces for user-defined predicates, which must operate at the value level but may opt in to operating at the row group / page level as well. This should make the new API a superset of the parquet.filter package. The new API will need to be reconciled with the column filters currently in the parquet.filter package, but I wanted to get feedback on this first.
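As a rough illustration of "must operate at the value level, may opt in at the row group / page level", a user-defined predicate could be shaped like the sketch below. The names `ValuePredicate`, `IsEvenPredicate`, and `InSetPredicate` are assumptions for this example only and are not the interfaces added by this pull request.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical base class for a user-defined predicate (names are assumptions).
abstract class ValuePredicate<T extends Comparable<T>> {
    // Value level (required): should a record with this value be kept?
    abstract boolean keep(T value);

    // Row group / page level (opt-in): may the whole chunk be skipped given its min and max?
    // The default is conservative: it never claims a chunk can be dropped.
    boolean canDrop(T min, T max) {
        return false;
    }
}

// Implements only the value-level check, so no row group is ever skipped.
final class IsEvenPredicate extends ValuePredicate<Integer> {
    @Override
    boolean keep(Integer value) {
        return value != null && value % 2 == 0;
    }
}

// Opts in to the row-group-level check as well.
final class InSetPredicate extends ValuePredicate<Integer> {
    private final Set<Integer> allowed;

    InSetPredicate(Integer... values) {
        this.allowed = new HashSet<>(Arrays.asList(values));
    }

    @Override
    boolean keep(Integer value) {
        return value != null && allowed.contains(value);
    }

    @Override
    boolean canDrop(Integer min, Integer max) {
        // The chunk can be skipped only if none of the allowed values falls in [min, max].
        for (Integer v : allowed) {
            if (v.compareTo(min) >= 0 && v.compareTo(max) <= 0) {
                return false;
            }
        }
        return true;
    }
}
```

Keeping the default of `canDrop` conservative means a predicate that only implements `keep` stays correct; it just gets no benefit from statistics.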

A limitation in both this API and the old one is that you can't do cross-column filters, e.g. columnX > columnY.

Member

the FilterPredicate and UnboundRecordFilter themselves could implement Filter directly.

Contributor Author

UnboundRecordFilter is an interface :(
So this would be another breaking change.

@julienledem
Member

LGTM. only a few minor comments left.

@julienledem
Member

+1 LGTM
Thanks for the hard work. definitely this @isnotinvain
An awesome contribution.

@isnotinvain
Contributor Author

Thanks! #THISWILLWORK

@asfgit closed this in ad32bf0 on Jul 29, 2014
Contributor

No need to call put every time; should we put only when alreadySeen is null?

asfgit referenced this pull request in apache/spark Oct 31, 2014
The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on statistics such as min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #2841 from saucam/master and squashes the following commits:

8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
5f4530e [Yash Datta] SPARK-3968: Fix scala code style
f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
48163c3 [Yash Datta] SPARK-3968: Code cleanup
cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown
caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
…ers on columns

This is a re-opened version of:
https://github.com/Parquet/parquet-mr/pull/412

The idea behind this pull request is to add a way to express filters on columns using a DSL that gives Parquet visibility into what is being filtered and how. This visibility will allow us to make optimizations at read time, the biggest one being skipping entire row groups or pages of records without even reading them, based on the statistics / metadata stored along with each row group or page.

Included in this API are interfaces for user-defined predicates, which must operate at the value level but may opt in to operating at the row group / page level as well. This should make the new API a superset of the `parquet.filter` package. The new API will need to be reconciled with the column filters currently in the `parquet.filter` package, but I wanted to get feedback on this first.

A limitation in both this API and the old one is that you can't do cross-column filters, e.g. columnX > columnY.

Author: Alex Levenson <alexlevenson@twitter.com>

Closes apache#4 from isnotinvain/alexlevenson/filter-api and squashes the following commits:

c1ab7e3 [Alex Levenson] Address feedback
c1bd610 [Alex Levenson] cleanup dotString in ColumnPath
418bfc1 [Alex Levenson] Update version, add temporary hacks for semantic enforcer
6643bd3 [Alex Levenson] Fix some more non backward incompatible changes
39f977f [Alex Levenson] Put a bunch of backwards compatible stuff back in, add @deprecated
13a02c6 [Alex Levenson] Fix compile errors, add back in overloaded getRecordReader
f82edb7 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
9bd014f [Alex Levenson] clean up TODOs and reference jiras
4cc7e87 [Alex Levenson] Add some comments
30e3d61 [Alex Levenson] Create a common interface for both kinds of filters
ac153a6 [Alex Levenson] Create a Statistics class for use in UDPs
fbbf601 [Alex Levenson] refactor IncrementallyUpdatedFilterPredicateGenerator to only generate the parts that require generation
5df47cd [Alex Levenson] Static imports of checkNotNull
c1d1823 [Alex Levenson] address some of the minor feedback items
67a3ba0 [Alex Levenson] update binary's toString
3d7372b [Alex Levenson] minor fixes
fed9531 [Alex Levenson] Add skipCurrentRecord method to clear events in thrift converter
2e632d5 [Alex Levenson] Make Binary Serializable
09c024f [Alex Levenson] update comments
3169849 [Alex Levenson] fix compilation error
0185030 [Alex Levenson] Add integration test for value level filters
4fde18c [Alex Levenson] move to right package
ae36b37 [Alex Levenson] Handle merge issues
af69486 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
0665271 [Alex Levenson] Add tests for value inspector
c5e3b07 [Alex Levenson] Add tests for resetter and evaluator
29f677a [Alex Levenson] Fix scala DSL
8897a28 [Alex Levenson] Fix some tests
b448bee [Alex Levenson] Fix mistake in MessageColumnIO
c8133f8 [Alex Levenson] Fix some tests
4cf686d [Alex Levenson] more null checks
69e683b [Alex Levenson] check all the nulls
220a682 [Alex Levenson] more cleanup
aad5af3 [Alex Levenson] rm generated src file from git
5075243 [Alex Levenson] more minor cleanup
9966713 [Alex Levenson] Hook generation into maven build
8282725 [Alex Levenson] minor cleanup
fea3ea9 [Alex Levenson] minor cleanup
9e35406 [Alex Levenson] move statistics filter
c52750c [Alex Levenson] finish moving things around
97a6bfd [Alex Levenson] Move things around pt2
843b9fe [Alex Levenson] Move some files around pt 1
5eedcc0 [Alex Levenson] turn off dictionary support for AtomicConverter
541319e [Alex Levenson] various cleanup and fixes
08e9638 [Alex Levenson] rm ColumnPathUtil
bfe6795 [Alex Levenson] Add type bounds to FilterApi
6c831ab [Alex Levenson] don't double log exception in SerializationUtil
a7a58d1 [Alex Levenson] use ColumnPath instead of String
8f11a6b [Alex Levenson] Move ColumnPath and Canonicalizer to parquet-common
9164359 [Alex Levenson] stash
abc2be2 [Alex Levenson] Add null handling to record filters -- this impl is still broken though
90ba8f7 [Alex Levenson] Update Serialization Util
0a261f1 [Alex Levenson] Add compression in SerializationUtil
f1278be [Alex Levenson] Add comment, fix tests
cbd1a85 [Alex Levenson] Replace some specialization with generic views
e496cbf [Alex Levenson] Fix short circuiting in StatisticsFilter
db6b32d [Alex Levenson] Address some comments, fix constructor in ParquetReader
fd6f44d [Alex Levenson] Fix semver backward compat
2fdd304 [Alex Levenson] Some more cleanup
d34fb89 [Alex Levenson] Cleanup some TODOs
544499c [Alex Levenson] stash
7b32016 [Alex Levenson] Merge branch 'master' into alexlevenson/filter-api
0e31251 [Alex Levenson] First pass at values filter, needs reworking
470e409 [Alex Levenson] fix java6/7 bug, minor cleanup
ee7b221 [Alex Levenson] more InputFormat tests
5ef849e [Alex Levenson] Add guards for not specifying both kinds of filter
0186b1f [Alex Levenson] Add logging to ParquetInputFormat and tests for configuration
a622648 [Alex Levenson] cleanup imports
9b1ea88 [Alex Levenson] Add tests for statistics filter
d517373 [Alex Levenson] tests for filter validator
b25fc44 [Alex Levenson] small cleanup of filter validator
32067a1 [Alex Levenson] add test for collapse logical nots
1efc198 [Alex Levenson] Add tests for invert filter predicate
046b106 [Alex Levenson] some more fixes
d3c4d7a [Alex Levenson] fix some more types, add in test for SerializationUtil
cc51274 [Alex Levenson] fix generics in FilterPredicateInverter
ea08349 [Alex Levenson] First pass at rowgroup filter, needs testing
156d91b [Alex Levenson] Add runtime type checker
4dfb4f2 [Alex Levenson] Add serialization util
8f80b20 [Alex Levenson] update comment
7c25121 [Alex Levenson] Add class to Column struct
58f1190 [Alex Levenson] Remove filterByUniqueValues
7f20de6 [Alex Levenson] rename user predicates
af14b42 [Alex Levenson] Update dsl
04409c5 [Alex Levenson] Add generic types into Visitor
ba42884 [Alex Levenson] rm getClassName
65f8af9 [Alex Levenson] Add in support for user defined predicates on columns
6926337 [Alex Levenson] Add explicit tokens for notEq, ltEq, gtEq
667ec9f [Alex Levenson] remove test for collapsing double negation
db2f71a [Alex Levenson] rename FilterPredicatesTest
a0a0533 [Alex Levenson] Address first round of comments
b2bca94 [Alex Levenson] Add scala DSL and tests
bedda87 [Alex Levenson] Add tests for FilterPredicate building
238cbbe [Alex Levenson] Add scala dsl
39f7b24 [Alex Levenson] add scala mvn boilerplate
2ec71a7 [Alex Levenson] Add predicate API

Conflicts:
	parquet-column/src/main/java/parquet/io/api/Binary.java
	parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
Resolution:
    InternalParquetRecordReader: conflicts from not backporting
        PARQUET-2, which were minor.
    Binary: changed several anonymous classes to private static.
        Conflict appears to be an artifact of major changes. The
        important thing to verify is that these don't break binary
        compatibility.

Version conflicts:
	parquet-avro/pom.xml
	parquet-cascading/pom.xml
	parquet-column/pom.xml
	parquet-common/pom.xml
	parquet-encoding/pom.xml
	parquet-generator/pom.xml
	parquet-hadoop-bundle/pom.xml
	parquet-hadoop/pom.xml
	parquet-hive-bundle/pom.xml
	parquet-hive/parquet-hive-binding/parquet-hive-0.10-binding/pom.xml
	parquet-hive/parquet-hive-binding/parquet-hive-0.12-binding/pom.xml
	parquet-hive/parquet-hive-binding/parquet-hive-binding-bundle/pom.xml
	parquet-hive/parquet-hive-binding/parquet-hive-binding-factory/pom.xml
	parquet-hive/parquet-hive-binding/parquet-hive-binding-interface/pom.xml
	parquet-hive/parquet-hive-binding/pom.xml
	parquet-hive/parquet-hive-storage-handler/pom.xml
	parquet-hive/pom.xml
	parquet-jackson/pom.xml
	parquet-pig-bundle/pom.xml
	parquet-pig/pom.xml
	parquet-protobuf/pom.xml
	parquet-scrooge/pom.xml
	parquet-test-hadoop2/pom.xml
	parquet-thrift/pom.xml
	parquet-tools/pom.xml
	pom.xml
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
asfgit pushed a commit that referenced this pull request Jul 1, 2015
Author: asingh <asingh@cloudera.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Ashish Singh <asingh@cloudera.com>

Closes #197 from SinghAsDev/PARQUET-251 and squashes the following commits:

68e0eae [asingh] Remove deprecated constructors from private classes
67e4e5f [asingh] Add removed public methods in Binary and deprecate them
0e71728 [asingh] Add comment for BinaryStatistics.setMinMaxFromBytes
fbe873f [Ashish Singh] Merge pull request #4 from isnotinvain/PR-197-3
9826ee6 [Alex Levenson] Some minor cleanup
7570035 [asingh] Remove test for stats getting ingnored for version 160 when type is int64
af43d28 [Alex Levenson] Address PR feedback
89ab4ee [Alex Levenson] put the headers in the right location
2838cc9 [Alex Levenson] Split out version checks to separate files, add some tests
5af9142 [Alex Levenson] Generalize tests, make Binary.fromString reused=false
e00d9b7 [asingh] Rename isReused => isBackingBytesReused
d2ad939 [asingh] Rebase over latest trunk
857141a [asingh] Remove redundant junit dependency
32b88ed [asingh] Remove semver from hadoop-common
7a0e99e [asingh] Revert to fromConstantByteArray for ByteString
c820ec9 [asingh] Add unit tests for Binary and to check if stats are ignored for version 160
9bbd1e5 [asingh] Improve version parsing
84a1d8b [asingh] Remove ignoring stats on write side and ignore it on read side
903f8e3 [asingh] Address some review comments. * Ignore stats for writer's version < 1.8.0 * Refactor shoudlIgnoreStatistics method a bit * Assume implementations other than parquet-mr were writing binary   statistics correctly * Add toParquetStatistics method's original method signature to maintain   backwards compatibility and mark it as deprecated
64c2617 [asingh] Revert changes for ignoring stats at RowGroupFilter level
e861b18 [asingh] Ignore max min stats while reading
3a8cb8d [asingh] Fix typo
8e12618 [asingh] Fix usage of fromConstant versions of Binary constructors
860adf7 [asingh] Rename unmodified to constant and isReused instead of isUnmodifiable
0d127a7 [asingh] Add unmodfied and Reused versions for creating a Binary. Add copy() to Binary.
b4e2950 [asingh] Skip filtering based on stats when file was written with version older than 1.6.1
6fcee8c [asingh] Add getBytesUnsafe() to Binary that returns backing byte[] if possible, else returns result of getBytes()
30b07dd [asingh] PARQUET-251: Binary column statistics error when reuse byte[] among rows
costimuraru pushed a commit to costimuraru/parquet-mr that referenced this pull request Apr 29, 2017
costimuraru pushed a commit to costimuraru/parquet-mr that referenced this pull request Apr 29, 2017
julienledem pushed a commit to julienledem/parquet-java that referenced this pull request Jun 9, 2017
chenjunjiedada pushed a commit to chenjunjiedada/parquet-mr that referenced this pull request Aug 3, 2019
Add thrift consumers for ENCRYPTION_ALGORITHM and FOOTER_SIGNING_KEY_METADATA
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Mar 25, 2020
6 participants