Column statistics #55065

@hanfei1991

RFC and the first PR: #53240

This issue is for discussing future work on column statistics.

Use cases of column statistics:

Related proposal: #64210

Usability

  • cache of statistics
  • more syntactic sugar
    • ADD STATISTIC column_name TYPE ALL to create every kind of useful statistic that applies to the column
    • DROP/CLEAR/MATERIALIZE STATISTIC column_name to drop/clear/materialize all statistics on the column when TYPE ... is omitted
  • support more condition patterns for selectivity estimation
    • a BETWEEN 100 AND 200, a > 100 AND a < 200
    • a < 100 OR a > 200
  • system tables
    • expose statistics information in system.parts and system.parts_columns
  • support more data types
    • support decimal type for tdigest
  • compact statistics files into a single file per part
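
As a rough illustration of the range patterns listed above, here is a minimal sketch of how two predicates on the same column could be merged into one interval before estimating, instead of multiplying their selectivities as if they were independent. It is not ClickHouse code: the helper names and the use of a sorted in-memory sample in place of a real statistic such as tdigest are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical helpers: `sample` is a sorted list of values drawn from column a.
# A real estimator would query a stored statistic (e.g. tdigest) instead.

def frac_lt(sample, x):
    """Fraction of sampled values strictly less than x."""
    return bisect_left(sample, x) / len(sample)

def frac_le(sample, x):
    """Fraction of sampled values less than or equal to x."""
    return bisect_right(sample, x) / len(sample)

def sel_between(sample, lo, hi):
    """a BETWEEN lo AND hi (both bounds inclusive)."""
    return max(0.0, frac_le(sample, hi) - frac_lt(sample, lo))

def sel_open_range(sample, lo, hi):
    """a > lo AND a < hi, treated as a single interval."""
    return max(0.0, frac_lt(sample, hi) - frac_le(sample, lo))

def sel_outside(sample, lo, hi):
    """a < lo OR a > hi; assumes lo <= hi, so the two branches are disjoint and add."""
    return frac_lt(sample, lo) + (1.0 - frac_le(sample, hi))

# Column a holds the integers 1..300.
sample = list(range(1, 301))

# Treating a > 100 and a < 200 as independent predicates overestimates the range.
independent = (1.0 - frac_le(sample, 100)) * frac_lt(sample, 200)
merged = sel_open_range(sample, 100, 200)
```

On this sample the independent product gives about 0.44, while the merged interval gives 0.33, which matches the 99 qualifying values out of 300.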

Functionality

  • support hyperloglog
    • then automatically decide whether a string column can be stored in low cardinality format
  • support cmsketch (count-min sketch)
    • to estimate equality conditions such as a = 1
  • support equi-depth histograms
  • support heavy hitters (e.g. the 20 most frequent values)
    • to estimate a = 1 more accurately
  • support min_max
  • support sample: store a configurable number of values
  • support more counters
    • NULL values
    • Default values
    • Deleted values for LWD
  • estimate by combining the statistics above
    • e.g. for a = 1, first check whether 1 is among the top 20 values of column a.
  • statistics for other tables / materialized views / projections ...
  • automatically create & maintain statistics
    • for cheap statistics like hyperloglog & min_max & null_count
    • for frequently queried columns
  • support aliases for statistics names, e.g. min_max and MinMax should refer to the same statistics type
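
To make the "combine the statistics" idea for a = 1 concrete, here is a minimal sketch, assuming an exact top-k table for heavy hitters and a count-min sketch for everything else. The class names, the sketch dimensions, and the choice to keep heavy hitters out of the sketch are all assumptions made for the example, not the design from the RFC.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    """Small count-min sketch; estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One salted hash per row; the row number is used as the salt.
        digest = hashlib.blake2b(
            repr(value).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, value, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, value)] += count

    def estimate(self, value):
        return min(self.rows[r][self._index(r, value)] for r in range(self.depth))

class ColumnStats:
    """Hypothetical per-column statistics: exact top-k counts plus a sketch."""

    def __init__(self, values, top_k=20):
        self.row_count = len(values)
        counts = Counter(values)
        self.top = dict(counts.most_common(top_k))
        self.sketch = CountMinSketch()
        # Assumption: heavy hitters stay out of the sketch so they do not
        # inflate the estimates of rare values that collide with them.
        for value, count in counts.items():
            if value not in self.top:
                self.sketch.add(value, count)

    def estimate_equals(self, value):
        """Estimated selectivity of `a = value`."""
        if self.row_count == 0:
            return 0.0
        if value in self.top:  # exact answer for heavy hitters
            return self.top[value] / self.row_count
        return self.sketch.estimate(value) / self.row_count

# 1 and 2 dominate the column; 200 other values appear once each.
values = [1] * 500 + [2] * 300 + list(range(1000, 1200))
stats = ColumnStats(values, top_k=2)
```

Here `stats.estimate_equals(1)` comes from the exact top-k table, while a rare value such as 1000 falls through to the sketch, which can overestimate because of hash collisions but never underestimates.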
