-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix PruningPredicate interaction with DynamicFilterPhysicalExpr that references partition columns #19129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Todo: add test that is always true. Add test for nested / complex literal trees. |
| enable_page_index: false, | ||
| enable_bloom_filter: false, | ||
| enable_row_group_stats_pruning: true, | ||
| enable_row_group_stats_pruning: false, // note that this is false! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise the test failed because the predicate would successfully prune based on stats
…19130) This improves handling of constant expressions during pruning by trying to evaluate them in the simplifier and the pruning machinery. This is somewhat redundant with #19129 in the simple case of our Parquet implementation but since there may be edge cases where one is hit and not the other, or where users are using them independently I thought it best to implement both approaches.
eebccef to
b1c49b4
Compare
b1c49b4 to
7d3efc8
Compare
| /// Count of distinct column references in an expression. | ||
| /// This is the same as [`collect_columns`] but optimized to stop counting | ||
| /// once more than one distinct column is found. | ||
| /// | ||
| /// For example, in expression `col1 + col2`, the count is `Many`. | ||
| /// In expression `col1 + 5`, the count is `One`. | ||
| /// In expression `5 + 10`, the count is `Zero`. | ||
| #[derive(Debug, PartialEq, Eq)] | ||
| enum ColumnReferenceCount { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This replaces collect_columns because:
- We only ever want to know if there's one or more, this short circuits / avoids extra work if we're going to bail anyway.
- Makes the match statements clearer instead of matching on
.len()integers. - Avoids
columns.iter().first().unwrap()later on (even though this does still contain an unwrap internally)
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adriangb
This looks interesting -- the only thing I don't understand is why the previously added constant folding expression doesn't cover this
|
|
||
| // Test that always-true literal predicates don't prune any containers | ||
| #[test] | ||
| fn row_group_predicate_literal_true() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please also add a test for literal (boolean) null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added row_group_predicate_literal_null
| prune_with_expr(lit(true).or(lit(false)), &schema, &statistics, &[true]); | ||
|
|
||
| // Complex nested: (1 < 2) AND (3 > 1) = true AND true = true | ||
| prune_with_expr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also please add an error test for pruning with a non boolean (e.g. lit(1i32)) -- and just verify that it errors resonably (rather than gives the wrong answer)
|
|
||
| // Handle literal-to-literal comparisons (no columns on either side) | ||
| // e.g., lit(1) = lit(2) should evaluate to false and prune all containers | ||
| if left_columns.is_empty() && right_columns.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant with the constant folding introduced in #19130
Why do we need both? Maybe we just need to constant fold the expression after applying the physical expr adapter 🤔
It does overlap with that work. My reasoning for doing it in two places was that these are two disconnected APIs (i.e. we don't require you to run the simplifier before calling |
|
@alamb I've removed the handling of literals and instead added documentation and integration tests. So this PR is now tests + refactoring to short circuit |
a60b943 to
49ebece
Compare
|
Okay I did find the one case that this covers: This will generate a dynamic filter that references |
I've copied that change over to here, it seems more appropriate and fits in with the original goal of this PR |
| /// | ||
| /// Returns a `[`Transformed`] indicating whether a snapshot was taken, | ||
| /// along with the resulting `PhysicalExpr`. | ||
| pub fn snapshot_physical_expr_opt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is that instead of doing 1 traversal to determine if it's a dynamic expression and another to snapshot we can do a single traversal. This also handles the case where an arbitrary PhysicalExpr implements snapshotting that is not a dynamic filter.
| .with("c1", ContainerStats::new_i32(vec![Some(0)], vec![Some(10)])); | ||
| let expected_ret = &[true]; | ||
| prune_with_expr(lit(1), &schema, &statistics, expected_ret); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb this is the other test you asked for
f6b92f8 to
ecad024
Compare
| // which does not handle dynamic exprs in general | ||
| let expr = snapshot_physical_expr(expr)?; | ||
| /// | ||
| /// Note that `PruningPredicate` does not attempt to normalize or simplify |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems to me like PruningPredicate does now actually call simplify 🤔 (if it is a snapshot )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update
| // children after snapshotting and previously `replace_columns_with_literals` may have been called with partition values | ||
| // the expression we have now is `8 < 5 and col < 10`. | ||
| // Thus we need as simplifier pass to get `false and col < 10` => `false` here. | ||
| let simplifier = PhysicalExprSimplifier::new(&schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this code is specific to dynamic expressions, maybe the call to simplify would make more sense in the snapshot_physical_expr_opt method itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm interesting. I think maybe best to keep things as is. E.g. if you're going to evaluate the expression against data (as opposed to doing the kind of weird stuff PruningPredicate does) then maybe you don't want to pay the simplify cost?
| expr.apply(|expr| { | ||
| if let Some(column) = expr.as_any().downcast_ref::<phys_expr::Column>() { | ||
| seen.insert(column.clone()); | ||
| if seen.len() > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised clippy didn't complain about this not using is_empty 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because len() > 1 != len >= 1
PhysicalExprSimplifierbeforePruningPredicatewe are able to prune.