-
Notifications
You must be signed in to change notification settings - Fork 4.1k
opt: left lookup join returns incorrect results due to check constraints #59615
Description
Consider the following example. First create a table with a check constraint and insert some data:
CREATE TABLE t (
x INT NOT NULL CHECK (x IN (1, 3)),
y INT NOT NULL,
z INT,
PRIMARY KEY (x, y)
);
INSERT INTO t SELECT 1, generate_series(1, 1000);
INSERT INTO t SELECT 3, generate_series(1, 1000);
ANALYZE t;
Now run the following query:
> SELECT * FROM (VALUES (1), (2000)) AS u (y) LEFT JOIN t ON u.y = t.y;
y | x | y | z
-------+------+------+-------
1 | 1 | 1 | NULL
1 | 3 | 1 | NULL
2000 | NULL | NULL | NULL
2000 | NULL | NULL | NULL
(4 rows)
These results are incorrect. There should only be one row with u.y=2000, not two. The reason this happens is because in the GenerateLookupJoins exploration rule, the optimizer creates a cross join with the valid values of the check constraint in order to create a lookup join with the primary index:
> EXPLAIN (OPT) SELECT * FROM (VALUES (1), (2000)) AS u (y) LEFT JOIN t ON u.y = t.y;
info
-------------------------------
left-join (lookup t)
├── lookup columns are key
├── inner-join (cross)
│ ├── values
│ │ ├── (1,)
│ │ └── (2000,)
│ ├── values
│ │ ├── (1,)
│ │ └── (3,)
│ └── filters (true)
└── filters (true)
(11 rows)
If we remove the check constraint and try again, we now get correct results due to using a different plan:
> ALTER TABLE t DROP CONSTRAINT check_x;
> SELECT * FROM (VALUES (1), (2000)) AS u (y) LEFT JOIN t ON u.y = t.y;
y | x | y | z
-------+------+------+-------
1 | 1 | 1 | NULL
1 | 3 | 1 | NULL
2000 | NULL | NULL | NULL
(3 rows)
> EXPLAIN (OPT) SELECT * FROM (VALUES (1), (2000)) AS u (y) LEFT JOIN t ON u.y = t.y;
info
-------------------------
right-join (hash)
├── scan t
├── values
│ ├── (1,)
│ └── (2000,)
└── filters
└── column1 = y
(7 rows)
Although the cross join plan is correct in the case of an inner join, it is not correct for left joins. As a first step, we should disable the creation of the cross join as input to left lookup joins. Eventually, however, we should still be able to create a left lookup join in this case. I think that requires execution support to allow performing each lookup with multiple spans instead of a single equality condition.
This has implications for multi-region support, because the crdb_region column behaves the same way as column x in this example. Left joins often show up when building UPSERTs, which is how I ran into this issue in the first place. We need to fix this problem to avoid upserting incorrect data, but we should really find a way to support left lookup joins for this case for performance reasons -- to avoid requiring a full table scan on the primary index.