Skip to content

Conversation

@ding-young
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

This pr adds a rule in ExprSimplifier to simplify x=x as follows.

  • if x is not nullable, x=x -> true
  • else, x=x -> x is NOT NULL

Are these changes tested?

Yes, via unit test and slt

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 5, 2025
@2010YOUY01
Copy link
Contributor

This optimizer transformation doesn't look right for projected columns like

create table t1(v1 int);
insert into t1 values (null);

select v1=v1 from t1;
-- Should return `NULL`, but get `False`

However it makes sense if it is only applied on filter expressions.

@ding-young
Copy link
Contributor Author

This optimizer transformation doesn't look right for projected columns like

create table t1(v1 int);
insert into t1 values (null);

select v1=v1 from t1;
-- Should return `NULL`, but get `False`

However it makes sense if it is only applied on filter expressions.

Thank you for review. I'll make the change

@ding-young ding-young force-pushed the simplify-trivial-eq branch from 18e4bf0 to 7a17897 Compare April 6, 2025 06:43
@ding-young
Copy link
Contributor Author

@2010YOUY01 Instead of applying transformation on filter expression, I adjusted the rule to transform x=x into x IS NOT NULL OR NULL. This preserves the behavior where NULL = NULL evaluates to NULL, which is treated as FALSE in filter expressions.

About the performance, I'm not 100% sure whether this rule worth the change, but I made this change because IS NOT NULL OR NULL was a bit better than actual comparision.
image

@berkaysynnada
Copy link
Contributor

berkaysynnada commented Apr 6, 2025

About the performance, I'm not 100% sure whether this rule worth the change, but I made this change because IS NOT NULL OR NULL was a bit better than actual comparision.

I recommend writing a simple benchmark by criterion or other tools to get more reliable results instead of measuring the CLI execution. But I agree that this should be more performant

@alamb
Copy link
Contributor

alamb commented Apr 6, 2025

We could also use a CASE

CASE x IS NOT NULL THEN true ELSE null END

@alamb
Copy link
Contributor

alamb commented Apr 6, 2025

I expect this to make a large performance difference when x is a string type (as string comparisons are fairly expensive)

Thank you for this PR @ding-young and the great reviews @2010YOUY01 and @berkaysynnada

@ding-young
Copy link
Contributor Author

I wrote a simple criterion bench (https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs) and the result is as follows.
It showed performance improvement, especially for comparing long strings.

select_f32_eq_f32       time:   [366.19 µs 366.42 µs 366.69 µs]
                        change: [-79.571% -79.180% -78.787%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

select_str_eq_str       time:   [372.53 µs 373.02 µs 373.61 µs]
                        change: [-85.333% -85.117% -84.884%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

select_str_eq_str_long  time:   [372.49 µs 372.85 µs 373.25 µs]
                        change: [-98.351% -98.316% -98.280%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Plus, since x IS NOT NULL OR NULL is slightly better than CASE, I'll leave the pr as is :)

select_CASE_str_eq_str_long
                        time:   [396.36 µs 396.73 µs 397.15 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

Thanks for the review! Trying out the bench was also a great experience for me.

@berkaysynnada
Copy link
Contributor

I wrote a simple criterion bench (https://github.com/ding-young/datafusion/blob/simplify-trivial-eq-bench/datafusion/core/benches/simplify_trivial_eq.rs) and the result is as follows. It showed performance improvement, especially for comparing long strings.

Thank you for sharing the results. It doesn't effect much here but next time you can exclude the create_context() part out of bench_function :)

@ding-young ding-young force-pushed the simplify-trivial-eq branch from 7a17897 to ce1089b Compare April 7, 2025 10:05
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍 @ding-young

Left a few suggestions.

@alamb alamb added the performance Make DataFusion faster label Apr 7, 2025
@alamb alamb mentioned this pull request Apr 7, 2025
12 tasks
@ding-young ding-young force-pushed the simplify-trivial-eq branch from ce1089b to 44906b2 Compare April 8, 2025 01:17
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ding-young @kosiew @2010YOUY01 and @berkaysynnada -- go team 🦾 !

06)----------Filter: __common_expr_3 = __common_expr_3
07)------------Projection: substr(CAST(md5(CAST(tmp_table.value AS Utf8)) AS Utf8), Int64(1), Int64(32)) AS __common_expr_3
08)--------------TableScan: tmp_table projection=[value]
05)--------Projection:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a comment a few lines above that says:

TODO: this should probably be possible to completely remove the filter as always true?

We can probably update that too -- but we could do it in a follow on PR as well

@alamb alamb changed the title Enhance: simplify x=x Enhance: simplify x=x --> x IS NOT NULL OR NULL Apr 9, 2025
- if x is not nullable, x=x -> true
- else, x=x -> x is NOT NULL OR NULL
@ding-young ding-young force-pushed the simplify-trivial-eq branch from 44906b2 to 115c8a2 Compare April 10, 2025 06:11
Comment on lines +6133 to +6134
# The optimizer does not currently eliminate the filter;
# Instead, it's rewritten as `IS NULL OR NOT NULL` due to SQL null semantics
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment in slt as well :)

@alamb alamb merged commit a11aa42 into apache:main Apr 12, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 12, 2025

Thanks again @ding-young 🙏

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
- if x is not nullable, x=x -> true
- else, x=x -> x is NOT NULL OR NULL
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trivial WHERE filter not eliminated when combined with CTE

5 participants