Is your feature request related to a problem? Please describe.
Queries with ORed equality join predicates always use cross join, which can be slow.
Describe the solution you'd like
Teach the optimizer to split the join into multiple joins, one per ORed equality join predicate so each join can use a more performant join method.
Describe alternatives you've considered
None
Additional context
Here is an example test case (students signing up for classes with each signup request having a first choice and a second choice) :
CREATE TABLE classes (classID int, description varchar(50)
, primary key(classID)
);
CREATE TABLE classRequest (studentID int, firstChoiceClassID int, secondChoiceClassID int, primary key(studentID));
CREATE INDEX firstIdx ON classRequest(firstChoiceClassID) storing(secondChoiceClassID);
CREATE INDEX secondIdx ON classRequest(secondChoiceClassID) storing(firstChoiceClassID);
insert into classes select g, 'dummy description' FROM generate_series(1,1000) g(g);
insert into classRequest select g,cast(random()*900 as int)+1, cast(random()*100 as int)+1 FROM generate_series(1,100000) g(g);
analyze classes;
analyze classRequest;
-- Find the number of classes having more than 1000 signup requests where the class is the first or second choice.
explain
SELECT count(*) from (
SELECT count(*) FROM classes, classRequest
WHERE classRequest.firstChoiceClassID = classes.classID or
classRequest.secondChoiceClassID = classes.classID
GROUP BY classID
HAVING count(*) > 1000);
info
------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• group (scalar)
│ estimated row count: 1
│
└── • filter
│ estimated row count: 333
│ filter: count_rows > 1000
│
└── • group (hash)
│ estimated row count: 1,000
│ group by: classid
│
└── • cross join
│ estimated row count: 33,333,333
│ pred: (firstchoiceclassid = classid) OR (secondchoiceclassid = classid)
│
├── • scan
│ estimated row count: 100,000 (100% of the table; stats collected 12 minutes ago)
│ table: classrequest@classrequest_pkey
│ spans: FULL SCAN
│
└── • scan
estimated row count: 1,000 (100% of the table; stats collected 12 minutes ago)
table: classes@classes_pkey
spans: FULL SCAN
SELECT count(*) from (
SELECT count(*) FROM classes, classRequest
WHERE classRequest.firstChoiceClassID = classes.classID or
classRequest.secondChoiceClassID = classes.classID
GROUP BY classID
HAVING count(*) > 1000);
count
---------
100
(1 row)
Time: 1.216s
A faster query plan:
info
------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• group (scalar)
│ estimated row count: 1
│
└── • filter
│ estimated row count: 333
│ filter: count_rows > 1000
│
└── • group (streaming)
│ estimated row count: 1,000
│ group by: classid
│ ordered: +classid
│
└── • distinct
│ estimated row count: 200,000
│ distinct on: classid, studentid
│ order key: classid
│
└── • union all
│ estimated row count: 200,000
│
├── • merge join
│ │ estimated row count: 100,000
│ │ equality: (firstchoiceclassid) = (classid)
│ │ right cols are key
│ │
│ ├── • scan
│ │ estimated row count: 100,000 (100% of the table; stats collected 1 minute ago)
│ │ table: classrequest@firstidx
│ │ spans: FULL SCAN
│ │
│ └── • scan
│ estimated row count: 1,000 (100% of the table; stats collected 1 minute ago)
│ table: classes@classes_pkey
│ spans: FULL SCAN
│
└── • merge join
│ estimated row count: 100,000
│ equality: (secondchoiceclassid) = (classid)
│ right cols are key
│
├── • scan
│ estimated row count: 100,000 (100% of the table; stats collected 1 minute ago)
│ table: classrequest@secondidx
│ spans: FULL SCAN
│
└── • scan
estimated row count: 1,000 (100% of the table; stats collected 1 minute ago)
table: classes@classes_pkey
spans: FULL SCAN
SELECT count(*) FROM classes, classRequest WHERE classRequest.firstChoiceClassID = classes.classID or
classRequest.secondChoiceClassID = classes.classID
group by classID
having count(*) > 1000);
count
---------
100
(1 row)
Time: 164ms
Jira issue: CRDB-12023
Is your feature request related to a problem? Please describe.
Queries with ORed equality join predicates always use cross join, which can be slow.
Describe the solution you'd like
Teach the optimizer to split the join into multiple joins, one per ORed equality join predicate so each join can use a more performant join method.
Describe alternatives you've considered
None
Additional context
Here is an example test case (students signing up for classes with each signup request having a first choice and a second choice) :
A faster query plan:
Jira issue: CRDB-12023