Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Dec 6, 2025

  • Fix handling of DynamicFilterPhysicalExpr that references partition columns
  • Adds some integration tests for handling of literal expression trees, making sure that if they are passed through PhysicalExprSimplifier before PruningPredicate we are able to prune.
  • Refactors internal tracking of column counts to short circuit early and make match logic easier to follow

@github-actions github-actions bot added the datasource Changes to the datasource crate label Dec 6, 2025
@adriangb
Copy link
Contributor Author

adriangb commented Dec 6, 2025

Todo: add test that is always true. Add test for nested / complex literal trees.

enable_page_index: false,
enable_bloom_filter: false,
enable_row_group_stats_pruning: true,
enable_row_group_stats_pruning: false, // note that this is false!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise the test failed because the predicate would successfully prune based on stats

github-merge-queue bot pushed a commit that referenced this pull request Dec 7, 2025
…19130)

This improves handling of constant expressions during pruning by trying
to evaluate them in the simplifier and the pruning machinery. This is
somewhat redundant with #19129 in the simple case of our Parquet
implementation but since there may be edge cases where one is hit and
not the other, or where users are using them independently I thought it
best to implement both approaches.
@github-actions github-actions bot removed the datasource Changes to the datasource crate label Dec 7, 2025
@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 8, 2025
Comment on lines 1595 to 1601
/// Count of distinct column references in an expression.
/// This is the same as [`collect_columns`] but optimized to stop counting
/// once more than one distinct column is found.
///
/// For example, in expression `col1 + col2`, the count is `Many`.
/// In expression `col1 + 5`, the count is `One`.
/// In expression `5 + 10`, the count is `Zero`.
#[derive(Debug, PartialEq, Eq)]
enum ColumnReferenceCount {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces collect_columns because:

  1. We only ever want to know if there's one or more, this short circuits / avoids extra work if we're going to bail anyway.
  2. Makes the match statements clearer instead of matching on .len() integers.
  3. Avoids columns.iter().first().unwrap() later on (even though this does still contain an unwrap internally)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb

This looks interesting -- the only thing I don't understand is why the previously added constant folding expression doesn't cover this


// Test that always-true literal predicates don't prune any containers
#[test]
fn row_group_predicate_literal_true() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please also add a test for literal (boolean) null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added row_group_predicate_literal_null

prune_with_expr(lit(true).or(lit(false)), &schema, &statistics, &[true]);

// Complex nested: (1 < 2) AND (3 > 1) = true AND true = true
prune_with_expr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please add an error test for pruning with a non boolean (e.g. lit(1i32)) -- and just verify that it errors resonably (rather than gives the wrong answer)


// Handle literal-to-literal comparisons (no columns on either side)
// e.g., lit(1) = lit(2) should evaluate to false and prune all containers
if left_columns.is_empty() && right_columns.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant with the constant folding introduced in #19130

Why do we need both? Maybe we just need to constant fold the expression after applying the physical expr adapter 🤔

@adriangb
Copy link
Contributor Author

adriangb commented Dec 8, 2025

This looks interesting -- the only thing I don't understand is why the previously added constant folding expression doesn't cover this

It does overlap with that work. My reasoning for doing it in two places was that these are two disconnected APIs (i.e. we don't require you to run the simplifier before calling PruningPredicate::try_new) and there may be teams using one without the other. The alternative to this would be to recommend calling the simplifier before calling PruningPredicate and not support this in PruningPredicate? If so I can refactor to recommend that and do so in the tests (i.e. verify that the integration works without implementing the feature here as well).

@adriangb
Copy link
Contributor Author

adriangb commented Dec 8, 2025

@alamb I've removed the handling of literals and instead added documentation and integration tests.

So this PR is now tests + refactoring to short circuit collect_columns.

@adriangb adriangb requested a review from alamb December 8, 2025 15:10
@github-actions github-actions bot removed the physical-expr Changes to the physical-expr crates label Dec 8, 2025
@adriangb
Copy link
Contributor Author

adriangb commented Dec 8, 2025

Okay I did find the one case that this covers: select * from t order by part_col, col limit 10.

This will generate a dynamic filter that references part_col, but since the it's buried in a dynamic filter the simplifier won't simplfiy it. I was able to work around that: 2f591b8

@adriangb
Copy link
Contributor Author

adriangb commented Dec 8, 2025

Okay I did find the one case that this covers: select * from t order by part_col, col limit 10.

This will generate a dynamic filter that references part_col, but since the it's buried in a dynamic filter the simplifier won't simplfiy it. I was able to work around that: 2f591b8

I've copied that change over to here, it seems more appropriate and fits in with the original goal of this PR

///
/// Returns a `[`Transformed`] indicating whether a snapshot was taken,
/// along with the resulting `PhysicalExpr`.
pub fn snapshot_physical_expr_opt(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that instead of doing 1 traversal to determine if it's a dynamic expression and another to snapshot we can do a single traversal. This also handles the case where an arbitrary PhysicalExpr implements snapshotting that is not a dynamic filter.

.with("c1", ContainerStats::new_i32(vec![Some(0)], vec![Some(10)]));
let expected_ret = &[true];
prune_with_expr(lit(1), &schema, &statistics, expected_ret);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb this is the other test you asked for

@adriangb adriangb changed the title Support literal-only predicates in PruningPredicate Fix PruningPredicate interaction with DynamicFilterPhysicalExpr that references partition columns Dec 8, 2025
@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 8, 2025
// which does not handle dynamic exprs in general
let expr = snapshot_physical_expr(expr)?;
///
/// Note that `PruningPredicate` does not attempt to normalize or simplify
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems to me like PruningPredicate does now actually call simplify 🤔 (if it is a snapshot )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update

// children after snapshotting and previously `replace_columns_with_literals` may have been called with partition values
// the expression we have now is `8 < 5 and col < 10`.
// Thus we need as simplifier pass to get `false and col < 10` => `false` here.
let simplifier = PhysicalExprSimplifier::new(&schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this code is specific to dynamic expressions, maybe the call to simplify would make more sense in the snapshot_physical_expr_opt method itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting. I think maybe best to keep things as is. E.g. if you're going to evaluate the expression against data (as opposed to doing the kind of weird stuff PruningPredicate does) then maybe you don't want to pay the simplify cost?

expr.apply(|expr| {
if let Some(column) = expr.as_any().downcast_ref::<phys_expr::Column>() {
seen.insert(column.clone());
if seen.len() > 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised clippy didn't complain about this not using is_empty 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because len() > 1 != len >= 1

@adriangb adriangb added this pull request to the merge queue Dec 9, 2025
Merged via the queue into apache:main with commit 83736ef Dec 9, 2025
31 checks passed
@adriangb adriangb deleted the prune-literals branch December 9, 2025 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants