Skip to content

opt: infer lookup join equality conditions from unique key and foreign key #69617

@rytaft

Description

@rytaft

Consider the following example, using ./cockroach demo --global --empty --nodes=9:

CREATE DATABASE db PRIMARY REGION "us-east1" REGIONS "us-west1", "europe-west1";
USE db;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE INDEX (account_id, crdb_region) -- this shouldn't be needed, but is. See https://github.com/cockroachdb/cockroach/issues/64619.
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

INSERT INTO accounts VALUES (DEFAULT, 'starburst', 'us-east1') RETURNING account_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 1',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 2',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

Ideally, the following query would be able to use a lookup join and visit only a single region, but it currently uses a merge join:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                                                                                                                                         info
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • merge join
  │ estimated row count: 2
  │ equality: (account_id) = (account_id)
  │ left cols are key
  │
  ├── • scan
  │     estimated row count: 1 (100% of the table; stats collected 1 minute ago)
  │     table: accounts@accounts_account_id_crdb_region_key
  │     spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
  │
  └── • scan
        estimated row count: 2 (100% of the table; stats collected 1 minute ago)
        table: tweets@primary
        spans: [/'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8']
(17 rows)

We can force a lookup join, but it still requires visiting all regions:

EXPLAIN SELECT tweet_id, message
FROM accounts
INNER LOOKUP JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ lookup condition: (account_id = account_id) AND (crdb_region IN ('europe-west1', 'us-east1', 'us-west1'))
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

In order for the optimizer to plan a lookup join that visits a single region, it would need to infer that the join condition USING (account_id) is actually equivalent to USING (account_id, crdb_region). This inference should be possible using the fact that account_id is the primary key of accounts as well as the fact that there is a foreign key in tweets that references accounts (account_id, crdb_region). Including crdb_region would allow the optimizer to change the lookup condition from (crdb_region IN ('europe-west1', 'us-east1', 'us-west1')) to another equality condition (crdb_region = crdb_region), and thus visit the single region that contains 'ab887bbc-1a83-4324-8998-d18ebe448fa7'. The resulting plan should look like the following:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (crdb_region, account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ equality: (crdb_region, account_id) = (crdb_region,account_id)
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

cc @nvanbenschoten

Epic CRDB-26292

Jira issue: CRDB-9686

Metadata

Metadata

Assignees

Labels

A-sql-optimizerSQL logical planning and optimizations.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-sql-queriesSQL Queries Team

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions