[RFC] Different representation of columns. Sparse columns.

**Use case**
This issue describes a refactoring that moves serialization and deserialization (`ser/de`) of columns out of `IDataType` and makes it possible to dynamically choose a different column representation in every data part. It also describes the implementation of a sparse representation on disk.
It is a prerequisite for a bigger task: dynamically choosing the optimal representation (`LowCardinality`, `Sparse` or `Dense`) and codec for columns.

**Describe the solution you'd like**
The plan:

Introduce the interface `ISerialization`, which will be responsible for how columns are serialized and deserialized.
Move `enumerateStreams`, `getFileNameForStream` and the `serialize*/deserialize*` methods to it from `IDataType` as is.
Implement them for every data type.

Introduce methods:
- `SerializationPtr IDataType::getSerialization(const IColumn & column)`
  It will determine which representation (sparse or dense) to write the column in, based on its content, and will return the appropriate `Serialization`. Can be used for inserts, when we write a full column from memory.
- `SerializationPtr IDataType::getSerialization(const SerializationSettings & settings)`
  The same as above, but for the case when the column is not in memory and we only know statistics about its content (number of rows, number of non-default values, etc.). Can be used for merges.
- `SerializationPtr IDataType::getSerialization(const NameAndTypePair & name_and_type, ExistenceCallback callback)`
  Used for deserialization, when we need to determine which serialization to use from the files written on disk. It will ask the callback whether certain files exist and, based on the answers, will return the appropriate `Serialization`.
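
The three selection paths above could be sketched roughly as follows. This is only an illustration, not the proposed implementation: `Kind`, `ColumnStats`, `chooseByStats`, `chooseByContent`, `chooseByFiles`, the threshold value and the `.sparse.idx` file suffix are all hypothetical names and values, and a plain `std::vector<int64_t>` stands in for `IColumn`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

enum class Kind { Default, Sparse };

// Stand-in for the statistics that SerializationSettings is said to carry.
struct ColumnStats
{
    size_t num_rows;
    size_t num_non_default;
};

// Hypothetical threshold: use Sparse when at least 90% of rows are default.
constexpr double sparse_ratio_threshold = 0.9;

// Merge path: only statistics are known, the column is not in memory.
Kind chooseByStats(const ColumnStats & stats)
{
    assert(stats.num_non_default <= stats.num_rows);
    if (stats.num_rows == 0)
        return Kind::Default;
    double default_ratio = 1.0 - static_cast<double>(stats.num_non_default) / stats.num_rows;
    return default_ratio >= sparse_ratio_threshold ? Kind::Sparse : Kind::Default;
}

// Insert path: the full column is in memory, so count its non-default values.
Kind chooseByContent(const std::vector<int64_t> & column)
{
    ColumnStats stats{column.size(), 0};
    for (int64_t value : column)
        if (value != 0)
            ++stats.num_non_default;
    return chooseByStats(stats);
}

using ExistenceCallback = std::function<bool(const std::string &)>;

// Read path: decide from which files exist on disk (the suffix is made up).
Kind chooseByFiles(const std::string & column_name, const ExistenceCallback & exists)
{
    return exists(column_name + ".sparse.idx") ? Kind::Sparse : Kind::Default;
}
```

In this sketch the insert path reduces to the statistics path, so both would apply the same rule.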

For now there will be 3 types of serialization:
 - Default.
 - Sparse. It will write only the non-zero values of the column using the default serialization, and will write a separate stream with their offsets.
 - Subcolumns. The same logic as in `DataTypeOneElementTupleStreams` now. It wraps default serializations, but some substreams will have names like those in named tuples, so that subcolumns can be read properly. It will be used only for deserialization. Getting it will follow the same logic as `IDataType::tryGetSubcolumnType` now.
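
As an illustration of the Sparse item above, a minimal sketch of splitting a column into a values stream and an offsets stream might look like this. The names `SparseStreams`, `serializeSparse` and `deserializeSparse` are hypothetical, the column is simplified to `std::vector<int64_t>`, and storing absolute row positions (rather than, say, gaps between values) is an assumption the RFC does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout of one sparse column: two streams plus the row count.
struct SparseStreams
{
    std::vector<uint64_t> offsets; // row positions of the non-zero values
    std::vector<int64_t> values;   // the non-zero values, in row order
    size_t total_rows = 0;         // needed to restore trailing zeros
};

// Keep only non-zero values and remember where each of them was.
SparseStreams serializeSparse(const std::vector<int64_t> & column)
{
    SparseStreams out;
    out.total_rows = column.size();
    for (size_t row = 0; row < column.size(); ++row)
    {
        if (column[row] != 0)
        {
            out.offsets.push_back(row);
            out.values.push_back(column[row]);
        }
    }
    return out;
}

// Rebuild the dense column: start from all zeros, then place stored values.
std::vector<int64_t> deserializeSparse(const SparseStreams & in)
{
    assert(in.offsets.size() == in.values.size());
    std::vector<int64_t> column(in.total_rows, 0);
    for (size_t i = 0; i < in.offsets.size(); ++i)
    {
        assert(in.offsets[i] < in.total_rows);
        column[in.offsets[i]] = in.values[i];
    }
    return column;
}
```

Deserializing into a dense vector matches the plan of keeping the in-memory representation dense in the first iteration.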

There will be corresponding methods for getting each type of `Serialization`.

**Details of sparse serialization**

Every column in a part will be written in either Default or Sparse serialization. The serialization will be chosen before reading or writing the column.
Parts will store the number of non-empty values for every column. During merges, the serialization will be chosen according to the total number of non-empty values in the column across all merged parts.
For loading data parts, we can store this metadata in a separate file.
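
The merge-time choice described above could be sketched as summing the per-part counters and comparing the resulting ratio with a limit. The names `PartColumnInfo`, `sumOverParts` and `useSparseForMergedPart`, as well as the `max_non_default_ratio` parameter, are hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-part metadata for one column, as it might be loaded from the part.
struct PartColumnInfo
{
    size_t num_rows;
    size_t num_non_default;
};

// Add up the counters of all parts that take part in the merge.
PartColumnInfo sumOverParts(const std::vector<PartColumnInfo> & parts)
{
    PartColumnInfo total{0, 0};
    for (const auto & part : parts)
    {
        assert(part.num_non_default <= part.num_rows);
        total.num_rows += part.num_rows;
        total.num_non_default += part.num_non_default;
    }
    return total;
}

// Choose Sparse for the merged part if non-default values are rare enough.
bool useSparseForMergedPart(const std::vector<PartColumnInfo> & parts, double max_non_default_ratio)
{
    PartColumnInfo total = sumOverParts(parts);
    if (total.num_rows == 0)
        return false;
    return static_cast<double>(total.num_non_default) / total.num_rows <= max_non_default_ratio;
}
```

Because the decision needs only the stored counters, it can be made before any column data is read.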

In the first iteration, the in-memory column representation will always be dense. However, once some kind of `ColumnSparse` is implemented, few code changes are expected.



[RFC] Different representation of columns. Sparse columns. #19953
