-
Notifications
You must be signed in to change notification settings - Fork 4.1k
opt: infer lookup join equality conditions from unique key and foreign key #69617
Description
Consider the following example, using ./cockroach demo --global --empty --nodes=9:
CREATE DATABASE db PRIMARY REGION "us-east1" REGIONS "us-west1", "europe-west1";
USE db;
CREATE TABLE accounts (
account_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name STRING NOT NULL,
crdb_region crdb_internal_region NOT NULL,
UNIQUE INDEX (account_id, crdb_region) -- this shouldn't be needed, but is. See https://github.com/cockroachdb/cockroach/issues/64619.
) LOCALITY GLOBAL;
CREATE TABLE tweets (
account_id UUID NOT NULL,
tweet_id UUID DEFAULT gen_random_uuid(),
message STRING NOT NULL,
crdb_region crdb_internal_region NOT NULL,
PRIMARY KEY (crdb_region, account_id, tweet_id),
FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;
INSERT INTO accounts VALUES (DEFAULT, 'starburst', 'us-east1') RETURNING account_id;
INSERT INTO tweets VALUES (
'6f781502-4936-43cc-b384-04e5cf292cc8',
DEFAULT,
'this is tweeet 1',
(SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;
INSERT INTO tweets VALUES (
'6f781502-4936-43cc-b384-04e5cf292cc8',
DEFAULT,
'this is tweeet 2',
(SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;Ideally, the following query would be able to use a lookup join and visit only a single region, but it currently uses a merge join:
EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
info
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
distribution: full
vectorized: true
• merge join
│ estimated row count: 2
│ equality: (account_id) = (account_id)
│ left cols are key
│
├── • scan
│ estimated row count: 1 (100% of the table; stats collected 1 minute ago)
│ table: accounts@accounts_account_id_crdb_region_key
│ spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
│
└── • scan
estimated row count: 2 (100% of the table; stats collected 1 minute ago)
table: tweets@primary
spans: [/'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8']
(17 rows)We can force a lookup join, but it still requires visiting all regions:
EXPLAIN SELECT tweet_id, message
FROM accounts
INNER LOOKUP JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
info
------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• lookup join
│ estimated row count: 2
│ table: tweets@primary
│ lookup condition: (account_id = account_id) AND (crdb_region IN ('europe-west1', 'us-east1', 'us-west1'))
│ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
│
└── • scan
estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
table: accounts@accounts_account_id_crdb_region_key
spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)In order for the optimizer to plan a lookup join that visits a single region, it would need to infer that the join condition USING (account_id) is actually equivalent to USING (account_id, crdb_region). This inference should be possible using the fact that account_id is the primary key of accounts as well as the fact that there is a foreign key in tweets that references accounts (account_id, crdb_region). Including crdb_region would allow the optimizer to change the lookup condition from (crdb_region IN ('europe-west1', 'us-east1', 'us-west1')) to another equality condition (crdb_region = crdb_region), and thus visit the single region that contains 'ab887bbc-1a83-4324-8998-d18ebe448fa7'. The resulting plan should look like the following:
EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (crdb_region, account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
info
------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• lookup join
│ estimated row count: 2
│ table: tweets@primary
│ equality: (crdb_region, account_id) = (crdb_region,account_id)
│ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
│
└── • scan
estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
table: accounts@accounts_account_id_crdb_region_key
spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)Epic CRDB-26292
Jira issue: CRDB-9686
Metadata
Metadata
Assignees
Labels
Type
Projects
Status