Skip to content

opt: incorrect results due to cross-joining input of semi-join #78681

@mgartner

Description

@mgartner

GenerateLookupJoins can cross-join the input of the join with a set of constant values, from a CHECK constraint, in order to constrain index prefix columns. This enables lookup joins to be generated in more cases. It is valid for inner joins, but it is not valid for left, anti, and semi joins. We currently disable this only for left and anti-joins, but not semi-joins. This incorrect transformation can cause incorrect query results.

Example:

CREATE TABLE a (
  a INT NOT NULL
);

CREATE TABLE bc (
  b INT NOT NULL,
  c INT NOT NULL,
  CHECK (b IN (1, 2, 3)),
  INDEX (b, c)
);

# Insert stats so that a lookup semi-join is selected.
statement ok
ALTER TABLE a INJECT STATISTICS '[
  {
    "columns": ["a"],
    "created_at": "2018-05-01 1:00:00.00000+00:00",
    "row_count": 10,
    "distinct_count": 10
  }
]';
ALTER TABLE bc INJECT STATISTICS '[
  {
    "columns": ["b"],
    "created_at": "2018-05-01 1:00:00.00000+00:00",
    "row_count": 10000000,
    "distinct_count": 3
  }
]'

INSERT INTO a VALUES (1);

INSERT INTO bc VALUES (1, 1), (2, 1), (3, 1);

SELECT a FROM a WHERE EXISTS (SELECT * FROM bc WHERE a = c);
----
1
1
1

The query plan reveals the problem. Notice how the input to the semi-join is cross-joined with all the possible values of b. This incorrectly increases the number of rows output by the join.

EXPLAIN (OPT, VERBOSE) SELECT a FROM a WHERE EXISTS (SELECT * FROM bc WHERE a =
c);
----
semi-join (lookup bc@bc_b_c_idx)
 ├── columns: a:1
 ├── key columns: [11 1] = [5 6]
 ├── stats: [rows=10, distinct(1)=10, null(1)=0, avgsize(1)=4]
 ├── cost: 844.7775
 ├── distribution: test
 ├── inner-join (cross)
 │    ├── columns: a:1 "lookup_join_const_col_@5":11
 │    ├── multiplicity: left-rows(one-or-more), right-rows(zero-or-more)
 │    ├── stats: [rows=30, distinct(1)=10, null(1)=0, avgsize(1)=4, distinct(11)=3, null(11)=0, avgsize(11)=4]
 │    ├── cost: 35.7575
 │    ├── distribution: test
 │    ├── scan a
 │    │    ├── columns: a:1
 │    │    ├── stats: [rows=10, distinct(1)=10, null(1)=0, avgsize(1)=4]
 │    │    ├── cost: 35.22
 │    │    ├── distribution: test
 │    │    ├── prune: (1)
 │    │    └── unfiltered-cols: (1-4)
 │    ├── values
 │    │    ├── columns: "lookup_join_const_col_@5":11
 │    │    ├── cardinality: [3 - 3]
 │    │    ├── stats: [rows=3, distinct(11)=3, null(11)=0, avgsize(11)=4]
 │    │    ├── cost: 0.04
 │    │    ├── distribution: test
 │    │    ├── (1,)
 │    │    ├── (2,)
 │    │    └── (3,)
 │    └── filters (true)
 └── filters (true)

Version Affected

This bug has been present since version 21.1.0 and affects all later releases.

Workaround

The best known workaround is 1) determine the prefix columns in the lookup join for which the cross join is created (b in the example above) and 2) make that column nullable or remove the CHECK constraint. This will prevent the incorrect lookup join from being generated.

Jira issue: CRDB-14215

gz#11864

Metadata

Metadata

Assignees

Labels

C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.C-technical-advisoryCaused a technical advisoryS-0-visible-logical-errorDatabase stores inconsistent data in some cases, or queries return invalid results silently.T-sql-queriesSQL Queries Teambranch-release-21.1Used to mark GA and release blockers, technical advisories, and bugs for 21.1branch-release-21.2Used to mark GA and release blockers, technical advisories, and bugs for 21.2

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions