PARQUET-34: implement not() for Contains predicate#2941
PARQUET-34: implement not() for Contains predicate#2941wgtmac merged 5 commits intoapache:masterfrom
Conversation
| throw new UnsupportedOperationException( | ||
| "Contains predicates cannot be composed with non-Contains predicates"); | ||
| } | ||
| // If two Contains predicates refer to the same column, optimize by combining into a single predicate |
There was a problem hiding this comment.
This fixes a bug with the way Contains predicates were composed -- previously it would throw an error if you tried to do something like and(contains(binaryColumn("foo", ...), contains(binaryColumn("bar"), ...)) because the columns didn't match. however, that is still a valid predicate -- it's just that they should not be attempted to be combined into a single Contains predicate.
| /** | ||
| * A ValueInspector implementation that keeps state for one or more delegate inspectors. | ||
| */ | ||
| abstract static class DelegatingValueInspector extends ValueInspector { |
There was a problem hiding this comment.
I created this because I thought the generated IncrementallyUpdatedFilterPredicateBuilder was getting complex in terms of LOC -- all the Contains ValueInspectors were overriding public void update(int|float|binary|double|... value) as well as reset(), and it was getting hard to reason about. However.... this is probably a bit less efficient, since the delegate ValueInspectors are accessed via iterator loops rather than directly invoking left/right. Let me know what you think, I'm happy to change it back to the way it was
There was a problem hiding this comment.
IncrementallyUpdatedFilterPredicateGenerator.java is pretty difficult to read and review. Since this is only applied to ContainsPredicate and does not affect others, I'm fine to make this change.
| @Override | ||
| public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { | ||
| return contains.filter(this, IndexIterator::intersection, IndexIterator::union); | ||
| return contains.filter( |
There was a problem hiding this comment.
My initial idea was to implement IndexIterator::subtract for not(contains()) -- for example, not(contains(eq("bar"))) would return something like IndexIterator.all(getPageCount()).subtract(visit(eq("bar")). But we can't rely on the individual predicate implementations to return exact row ranges -- for example, NotIn always returns IndexIterator.all. So I don't think it's possible to safely subtract any row ranges here.
There was a problem hiding this comment.
Yes, we cannot make such subtraction unless it is a single-value page which has only bar.
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for the change! I just took an initial pass and has some questions overall.
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * A ValueInspector implementation that keeps state for one or more delegate inspectors. | ||
| */ | ||
| abstract static class DelegatingValueInspector extends ValueInspector { |
There was a problem hiding this comment.
IncrementallyUpdatedFilterPredicateGenerator.java is pretty difficult to read and review. Since this is only applied to ContainsPredicate and does not affect others, I'm fine to make this change.
| @Override | ||
| public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { | ||
| return contains.filter(this, IndexIterator::intersection, IndexIterator::union); | ||
| return contains.filter( |
There was a problem hiding this comment.
Yes, we cannot make such subtraction unless it is a single-value page which has only bar.
| Visitor<R> visitor, | ||
| BiFunction<R, R, R> andBehavior, | ||
| BiFunction<R, R, R> orBehavior, | ||
| Function<R, R> notBehavior); |
There was a problem hiding this comment.
IIUC, all kinds of filters return all rows if notBehavior is found, right?
There was a problem hiding this comment.
Yes, for example NotIn 👍
Co-authored-by: Gang Wu <ustcwg@gmail.com>
3974c69 to
1ca2207
Compare
|
@gszadovszky Do you want to take a look? |
|
Merged. Thanks @clairemcginty and @gszadovszky! |
This should be the final subtask for the Contains() predicate work. It adds support for logical inversion --
not(contains(...)).Also, it fixes a small bug with the composition of mixed Contains predicates.
Make sure you have checked all steps below.
Issue
them in the PR title. For example, "PARQUET-2410: Use row count instead of value count to get row count from OffsetIndex #1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-pluginsDocumentation