Skip to content

BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS#16910

Merged
Mytherin merged 8 commits intoduckdb:mainfrom
Tmonster:fix_anti_join
May 12, 2025
Merged

BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS#16910
Mytherin merged 8 commits intoduckdb:mainfrom
Tmonster:fix_anti_join

Conversation

@Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Mar 31, 2025

fixes #16803

Take a look at the plan produced by the BUG

explain select * FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t0 WHERE ((t0.c0)!=(t1.c0)));
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│      Expressions: c0      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         DELIM_JOIN        │
│    ────────────────────   │
│      Join Type: ANTI      │
│                           ├──────────────┐
│        Conditions:        │              │
│  (c0 IS NOT DISTINCT FROM │              │
│             c0)           │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          SEQ_SCAN         ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t1         ││        Expressions:       │
│   Type: Sequential Scan   ││             1             │
│                           ││             c0            │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │      COMPARISON_JOIN      │
                             │    ────────────────────   │
                             │      Join Type: INNER     │
                             │                           ├──────────────┐
                             │        Conditions:        │              │
                             │         (c0 != c0)        │              │
                             └─────────────┬─────────────┘              │
                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                             │          SEQ_SCAN         ││         DELIM_GET         │
                             │    ────────────────────   ││    ────────────────────   │
                             │         Table: t0         ││                           │
                             │   Type: Sequential Scan   ││                           │
                             └───────────────────────────┘└───────────────────────────┘

Normally, if you have a join condition that is not (COMPARE NOT DISTINCT FROM) with DELIM_GET, we can remove the DELIM_GET and push a IS NOT NULL filter on the non-DELIM_GET child (see deliminator.cpp::258).

For all join conditions that filter nulls this makes sense. The only one where it is confusing is COMPARE DISTINCT FROM, but that is a non-null-filtering INEQUALITY join condition. For all inequality joins with DELIM Gets, there is some special logic, see Deliminator.cpp::RemoveInequalityJoinWithDelimGet. The special logic flips the DELIM_JOIN condition to act as the inequality join while also switching the column binding of the delim join conditio to bind to the non-DELIM_GET column of the inequality join.

So with the flipped DELIM_JOIN condition, and removal of the DELIM_GET, we have the plan

┌─────────────┴─────────────┐
│      COMPARISON_JOIN      │
│    ────────────────────   │
│      Join Type: ANTI      │
│                           ├──────────────┐
│        Conditions:        │              │
│  (c1 IS DISTINCT FROM c0) │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          SEQ_SCAN         ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t1         ││        Expressions:       │
│   Type: Sequential Scan   ││             1             │
│                           ││             c1            │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │           FILTER          │
                             │    ────────────────────   │
                             │        Expressions:       │
                             │      (c0 IS NOT NULL)     │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │          SEQ_SCAN         │
                             │    ────────────────────   │
                             │         Table: t0         │
                             │   Type: Sequential Scan   │
                             └───────────────────────────┘

The DELIM_JOIN is now a comparison join, which checks if c1 is distinct from c0 (i.e the inequality join). Now it's a bit more clear what is missing. What if t1 has a NULL value? It could be distinct from c0 and satisfy the inequalityness that we are looking for. However, in our original query, the NULL value of c1 joined with c0 never propagates out of the RHS, so the c1 value should not propagate on the RHS either. Also, considering this is an ANTI JOIN, that means NULL is always distinct from values in t0 since there is a IS NOT NULL filter.

Another way to look at it is that the != comparison filters out all NULLS, we push a IS NOT NULL on the GET that is not the DELIM GET, but we should then also push a IS NOT NULL on the DELIM_GET, but we don't.

The fix I've come up with is to change the IS DISTINCT FROM to a != to respect NULL values in this case. Also the other way around where a NOT DISTINCT FROM generated by a EXISTS subquery with an = condition should be changed to a = condition.

We don't do this switch for DELIM_JOINS with mark joins because if the condition evaluates to NULL, the MARK JOIN column that is output is NULL, when an EXISTS is always true or false (I think this is a separate bug though, and we should be able to apply the fix in this PR regardless of delim join type)

Also have discovered that just disabling the optimization when the JOIN with the DELIM_GET has a != condition will result in a regression on q21 of tpch. See #4950

@Tmonster
Copy link
Contributor Author

CC @lnkuiper

Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (fingers crossed for CI)

@Mytherin
Copy link
Collaborator

Looks like just a missing rowsort in the test - could you have a look?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 25, 2025 16:41
@Tmonster Tmonster changed the base branch from v1.2-histrionicus to main April 25, 2025 16:41
@Tmonster Tmonster marked this pull request as ready for review April 25, 2025 16:41
@Tmonster Tmonster marked this pull request as draft April 28, 2025 00:44
@Tmonster Tmonster changed the title BUGFIX: Do not run deliminator if there is a semi/anti delim join with a child join that has an != join condition [WIP] BUGFIX: Do not run deliminator if there is a semi/anti delim join with a child join that has an != join condition Apr 28, 2025
@Tmonster Tmonster changed the title [WIP] BUGFIX: Do not run deliminator if there is a semi/anti delim join with a child join that has an != join condition BUGFIX: Do not run deliminator if there is a semi/anti delim join with a child join that has an != join condition Apr 28, 2025
@Tmonster Tmonster marked this pull request as ready for review April 28, 2025 17:37
@Tmonster Tmonster changed the title BUGFIX: Do not run deliminator if there is a semi/anti delim join with a child join that has an != join condition BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS Apr 28, 2025
@TheoristCoder
Copy link

Thanks for your effort in fixing!

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 12, 2025 07:16
@Tmonster Tmonster marked this pull request as ready for review May 12, 2025 07:20
@Tmonster Tmonster requested a review from lnkuiper May 12, 2025 10:24
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, it looks much better now! Now we won't bail out of the optimization as often.

@Mytherin Mytherin merged commit 2727f58 into duckdb:main May 12, 2025
51 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS (duckdb/duckdb#16910)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS (duckdb/duckdb#16910)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS (duckdb/duckdb#16910)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
BUGFIX: DELIM_JOINS should reflect functionality of NULL filtering conditions in joins with DELIM_GETS (duckdb/duckdb#16910)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

An Anti-Join produces wrong result

4 participants