[RFC] Use global LowCardinality dictionary for optimizations if it is small enough #72717

@CurtizJ

Use case

Optimization of aggregation and JOINs over LowCardinality columns that have a small number of unique values. In these cases a LowCardinality column can usually be replaced with Enum, but that is less convenient because it requires changing the schema every time the set of possible values changes.

Describe the solution you'd like

  • Build a global dictionary for LowCardinality columns that are suitable for the optimization (those in the GROUP BY key or in the ON section of a JOIN), up to a certain size (refuse the optimization if the dictionary becomes too large). This requires reading dictionaries at a new stage of query execution: after parts are filtered by primary key and before pipeline execution starts. Dictionaries can be pushed down and reused. The global dictionary can also be cached in MergeTreeData.
  • Push the global dictionary down to the LowCardinality serializations in data parts. Encode positions of LowCardinality columns with the new dictionary and set the shared dictionary on them.
  • Use positions in the dictionary as keys for the hash table in aggregation or in JOIN. This allows choosing a more optimal hash method:
    • a method with a single numeric key (often UInt8, which has its own aggregation optimization) instead of the specialized LowCardinality method, in the case of one column
    • a method with fixed numeric keys, in the case of aggregation by LowCardinality and numeric columns
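
To make the three steps above concrete, here is a minimal Python sketch of the idea, not ClickHouse code. All names (`build_global_dictionary`, `remap_positions`, `count_by_position`) and the default limit of 256 entries (chosen so that positions would fit a UInt8 key) are assumptions for illustration. Each data part is modelled as a pair of its local dictionary and a list of positions into it.

```python
def build_global_dictionary(part_dictionaries, max_size):
    """Merge per-part dictionaries into one value -> position map.

    Returns None when the merged dictionary would exceed max_size,
    which models "refuse optimization if the dictionary becomes large".
    """
    global_index = {}
    for dictionary in part_dictionaries:
        for value in dictionary:
            if value not in global_index:
                if len(global_index) >= max_size:
                    return None
                global_index[value] = len(global_index)
    return global_index


def remap_positions(part_dictionary, positions, global_index):
    """Re-encode a part's local positions as positions in the global dictionary."""
    local_to_global = [global_index[value] for value in part_dictionary]
    return [local_to_global[pos] for pos in positions]


def count_by_position(parts, max_size=256):
    """Hypothetical GROUP BY count() that uses global positions as keys.

    Because the keys are small dense integers, a plain array indexed by
    position replaces the hash table. Returns None if the optimization
    is refused.
    """
    global_index = build_global_dictionary([d for d, _ in parts], max_size)
    if global_index is None:
        return None
    counts = [0] * len(global_index)
    for dictionary, positions in parts:
        for pos in remap_positions(dictionary, positions, global_index):
            counts[pos] += 1
    # dict preserves insertion order, so the i-th key has position i.
    return dict(zip(global_index, counts))


# Two parts with different local dictionaries for the same column.
parts = [
    (["a", "b"], [0, 1, 1]),  # rows: a, b, b
    (["b", "c"], [0, 1, 0]),  # rows: b, c, b
]
result = count_by_position(parts)
```

In this sketch the per-part remapping table is built once per part, so re-encoding a row is a single array lookup, and the aggregation never touches the string values themselves.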
