[RFC] Different representation of columns. Sparse columns. #19953
Description
Use case
This issue describes a refactoring that moves serialization and deserialization of columns out of IDataType and makes it possible to dynamically choose a different column representation in every data part. It also describes the implementation of a sparse representation on disk.
It is a prerequisite for a bigger task: dynamically choosing the optimal representation (LowCardinality, Sparse or Dense) and codec for columns.
Describe the solution you'd like
The plan:
Introduce an interface ISerialization that is responsible for how columns are serialized and deserialized.
Move enumerateStreams, getFileNameForStream and the serialize*/deserialize* methods to it from IDataType as is.
Implement them for every data type.
Introduce methods:
- SerializationPtr IDataType::getSerialization(const IColumn & column)
  Determines which representation (sparse or dense) to write the column in according to its content and returns the appropriate Serialization. Can be used at inserts, when we write a full column from memory.
- SerializationPtr IDataType::getSerialization(const SerializationSettings & settings)
  The same as above, but can be used when we don't have the column in memory and know statistics about its content (number of rows, number of non-default values, etc.). Can be used for merges.
- SerializationPtr IDataType::getSerialization(const NameAndTypePair & name_and_type, ExistenceCallback callback)
  Used for deserialization, when we need to determine which serialization to use from the files written on disk. It checks via the callback whether certain files exist and returns the appropriate Serialization according to this information.
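The three ways of choosing a serialization could look roughly like the sketch below. It is only an illustration under stated assumptions: the free functions getSerializationByColumn, getSerializationByStats and getSerializationByFiles, the SerializationKind enum, the 0.05 threshold and the ".sparse.idx" file suffix are all hypothetical, and a plain vector of integers stands in for IColumn.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not the real ClickHouse classes.
enum class SerializationKind { Default, Sparse };

struct ISerialization
{
    virtual ~ISerialization() = default;
    virtual SerializationKind getKind() const = 0;
};

struct SerializationDefault : ISerialization
{
    SerializationKind getKind() const override { return SerializationKind::Default; }
};

struct SerializationSparse : ISerialization
{
    SerializationKind getKind() const override { return SerializationKind::Sparse; }
};

using SerializationPtr = std::shared_ptr<const ISerialization>;
using ExistenceCallback = std::function<bool(const std::string &)>;

// Statistics known without having the column in memory (e.g. at merges).
struct SerializationSettings
{
    std::size_t num_rows = 0;
    std::size_t num_non_default_rows = 0;
    double max_non_default_ratio = 0.05; // assumed threshold, not taken from the RFC
};

// Choose by statistics: sparse if the share of non-default values is small enough.
SerializationPtr getSerializationByStats(const SerializationSettings & settings)
{
    assert(settings.num_non_default_rows <= settings.num_rows);
    if (settings.num_rows == 0)
        return std::make_shared<SerializationDefault>();

    double ratio = static_cast<double>(settings.num_non_default_rows)
        / static_cast<double>(settings.num_rows);
    if (ratio <= settings.max_non_default_ratio)
        return std::make_shared<SerializationSparse>();
    return std::make_shared<SerializationDefault>();
}

// Choose by content: count non-default (here: non-zero) values of an in-memory column.
SerializationPtr getSerializationByColumn(const std::vector<std::int64_t> & column)
{
    SerializationSettings settings;
    settings.num_rows = column.size();
    for (std::int64_t value : column)
        if (value != 0)
            ++settings.num_non_default_rows;
    return getSerializationByStats(settings);
}

// Choose by files on disk. Assumption: a sparse column has an extra
// offsets file named "<column>.sparse.idx".
SerializationPtr getSerializationByFiles(const std::string & column_name, const ExistenceCallback & exists)
{
    if (exists(column_name + ".sparse.idx"))
        return std::make_shared<SerializationSparse>();
    return std::make_shared<SerializationDefault>();
}
```

In this sketch the content-based variant only gathers statistics and delegates to the statistics-based one, so inserts and merges share a single decision rule.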
For now there will be 3 types of Serialization:
- Default.
- Sparse. It writes only the non-zero values of the column using the default serialization and writes a separate stream with their offsets.
- Subcolumns. The same logic as in DataTypeOneElementTupleStreams now. It wraps the default deserialization, but some substreams are named as in named tuples so that subcolumns can be read properly. It will be used only for deserialization. Getting it will follow the same logic as IDataType::tryGetSubcolumnType now.
There will be corresponding methods for getting each type of Serialization.
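The Sparse representation from the list above can be illustrated with a minimal sketch. The struct SparseStreams and the functions serializeSparse and deserializeSparse are hypothetical names. Storing absolute row positions as offsets and keeping the total row count alongside are assumptions made for the example, since the RFC does not fix the exact encoding of the offsets stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory model of the two streams of a sparse column.
struct SparseStreams
{
    std::vector<std::int64_t> values;  // non-zero values, in row order
    std::vector<std::size_t> offsets;  // row position of each stored value (assumed encoding)
    std::size_t num_rows = 0;          // total rows, needed to restore the dense column
};

// Keep only non-zero values and remember where they were.
SparseStreams serializeSparse(const std::vector<std::int64_t> & column)
{
    SparseStreams streams;
    streams.num_rows = column.size();
    for (std::size_t row = 0; row < column.size(); ++row)
    {
        if (column[row] != 0)
        {
            streams.values.push_back(column[row]);
            streams.offsets.push_back(row);
        }
    }
    return streams;
}

// Rebuild the dense column: start from all zeros, then place stored values.
std::vector<std::int64_t> deserializeSparse(const SparseStreams & streams)
{
    assert(streams.values.size() == streams.offsets.size());
    std::vector<std::int64_t> column(streams.num_rows, 0);
    for (std::size_t i = 0; i < streams.values.size(); ++i)
        column[streams.offsets[i]] = streams.values[i];
    return column;
}
```

For a column such as 0, 0, 5, 0, 9, 0 this stores two values and two offsets instead of six values, and decoding restores the original dense column.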
Details of sparse serialization
Every column in a part will be written in either Default or Sparse serialization. The serialization will be chosen before reading or writing the column.
Parts will store the number of non-empty values for every column. During merges, serializations will be chosen according to the total number of non-empty values in the column across all merged parts.
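The merge-time choice can be sketched as summing the per-part counters and comparing the resulting share with a threshold. The names PartColumnStats and shouldUseSparseAfterMerge are hypothetical, and the threshold is passed as a parameter because the RFC does not specify its value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-part metadata for one column.
struct PartColumnStats
{
    std::size_t num_rows;
    std::size_t num_non_empty_rows;
};

// Decide the serialization of the merged column from the source parts' counters,
// without reading any column data.
bool shouldUseSparseAfterMerge(const std::vector<PartColumnStats> & parts, double max_non_empty_ratio)
{
    std::size_t total_rows = 0;
    std::size_t total_non_empty = 0;
    for (const auto & part : parts)
    {
        assert(part.num_non_empty_rows <= part.num_rows);
        total_rows += part.num_rows;
        total_non_empty += part.num_non_empty_rows;
    }

    if (total_rows == 0)
        return false; // nothing to merge: fall back to the default serialization

    return static_cast<double>(total_non_empty) / static_cast<double>(total_rows) <= max_non_empty_ratio;
}
```

Note that the decision is made on the totals, so a column that is sparse in one part can still be written densely if the other merged parts contain many non-empty values.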
For loading data parts, we can store this metadata in a separate file.
In the first iteration, the in-memory column representation will always be dense. However, when some kind of ColumnSparse is implemented, few code changes are expected.