Skip to content

opt: row estimates for bytes columns inaccurate #68346

@michae2

Description

@michae2

Optimizer row estimates for BYTES columns are dividing the histogram bucket by 2 instead of by the portion of the key space covered by the span. (Not totally sure this is specific to BYTES but that is where I observed it.)

Here's an example:

SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;
CREATE TABLE t (
  i INT PRIMARY KEY,
  s STRING,
  b BYTES,
  INDEX s (s),
  INDEX b (b)
);
ALTER TABLE t INJECT STATISTICS '[
  {
    "columns": [
      "i"
    ],
    "created_at": "2021-08-02 17:23:30.293908",
    "distinct_count": 65536,
    "histo_buckets": [
      {
        "distinct_range": 0,
        "num_eq": 1,
        "num_range": 0,
        "upper_bound": "0"
      },
      {
        "distinct_range": 32766,
        "num_eq": 1,
        "num_range": 32766,
        "upper_bound": "32767"
      },
      {
        "distinct_range": 32767,
        "num_eq": 1,
        "num_range": 32767,
        "upper_bound": "65535"
      }
    ],
    "histo_col_type": "INT8",
    "name": "repro",
    "null_count": 0,
    "row_count": 65536
  },
  {
    "columns": [
      "s"
    ],
    "created_at": "2021-08-02 17:23:30.293908",
    "distinct_count": 65536,
    "histo_buckets": [
      {
        "distinct_range": 0,
        "num_eq": 1,
        "num_range": 0,
        "upper_bound": "\\x0000"
      },
      {
        "distinct_range": 32766,
        "num_eq": 1,
        "num_range": 32766,
        "upper_bound": "\\x7FFF"
      },
      {
        "distinct_range": 32767,
        "num_eq": 1,
        "num_range": 32767,
        "upper_bound": "\\xFFFF"
      }
    ],
    "histo_col_type": "STRING",
    "name": "repro",
    "null_count": 0,
    "row_count": 65536
  },
  {
    "columns": [
      "b"
    ],
    "created_at": "2021-08-02 17:23:30.293908",
    "distinct_count": 65536,
    "histo_buckets": [
      {
        "distinct_range": 0,
        "num_eq": 1,
        "num_range": 0,
        "upper_bound": "\\x0000"
      },
      {
        "distinct_range": 32766,
        "num_eq": 1,
        "num_range": 32766,
        "upper_bound": "\\x7FFF"
      },
      {
        "distinct_range": 32767,
        "num_eq": 1,
        "num_range": 32767,
        "upper_bound": "\\xFFFF"
      }
    ],
    "histo_col_type": "BYTES",
    "name": "repro",
    "null_count": 0,
    "row_count": 65536
  }
]';
EXPLAIN SELECT i FROM t WHERE i >= 0x55B4 AND i < 0x55B5;
EXPLAIN (OPT, VERBOSE) SELECT i FROM t WHERE i >= 0x55B4 AND i < 0x55B5;
EXPLAIN SELECT s FROM t WHERE s >= '\x55B4' AND s < '\x55B5';
EXPLAIN (OPT, VERBOSE) SELECT s FROM t WHERE s >= '\x55B4' AND s < '\x55B5';
EXPLAIN SELECT b FROM t WHERE b >= '\x55B4' AND b < '\x55B5';
EXPLAIN (OPT, VERBOSE) SELECT b FROM t WHERE b >= '\x55B4' AND b < '\x55B5';

For the BYTES column the row estimate is 16383 instead of 1, like I would expect.

                           info
----------------------------------------------------------
  scan t@b
   ├── columns: b:3
   ├── constraint: /3/1: [/'\x55b4' - /'\x55b5')
   ├── stats: [rows=16383, distinct(3)=16383, null(3)=0]
   │   histogram(3)=  0     0      16383     0
   │                <--- '\x55b4' ------- '\x55b5'
   └── cost: 17380.59
(7 rows)

Environment:

  • CockroachDB version 21.2 (development) cd27f55
  • Server OS: MacOS 11.5.1

Metadata

Metadata

Assignees

Labels

A-sql-optimizerSQL logical planning and optimizations.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-sql-queriesSQL Queries Team

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions