Support of dynamic subcolumns in tables. #23516
Description
Introduce a new data type Object(<schema format>), parameterized by the name of a format for semi-structured data (JSON, XML, etc.).
Initially it will work only with MergeTree tables; other storages may be supported later.
Values will be inserted as strings into columns of this type. Then separate subcolumns will be parsed according to the schema format and written in columnar format.
Subcolumns of an Object column will be stored using the lowest common type in which all values of the subcolumn can be represented.
Columns of type Object will support reading subcolumns and slices at an arbitrary level of nesting.
For example, if we write the following JSON into a column with name user and type Object:
```json
{
    "name" : "John",
    "age" : 42,
    "location" : {
        "country" : "USA",
        "city" : "New York"
    }
}
```
there will be 4 columns available for reading: user.name, user.age, user.location.country, user.location.city, as well as the slice user.location.*.
The preferred type of a subcolumn in a query should be specified with the :: operator, e.g. user.age::UInt8. If it is omitted, the subcolumn will be converted to String.
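The decomposition of the JSON above into dotted subcolumn paths can be sketched like this (a toy Python illustration of the idea, not the actual implementation; the function name `flatten` is hypothetical):

```python
import json

def flatten(prefix, value):
    """Recursively extract all leaf paths from a parsed JSON value.

    Leaves are kept as strings, matching the "values are inserted
    as strings" step described above.
    """
    if isinstance(value, dict):
        result = {}
        for key, sub in value.items():
            result.update(flatten(f"{prefix}.{key}", sub))
        return result
    return {prefix: str(value)}

doc = json.loads('{"name": "John", "age": 42, '
                 '"location": {"country": "USA", "city": "New York"}}')
subcolumns = flatten("user", doc)
# subcolumns == {"user.name": "John", "user.age": "42",
#                "user.location.country": "USA",
#                "user.location.city": "New York"}
```

The slice user.location.* then corresponds to selecting every extracted path with that prefix.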
Implementation details
Introduce DataTypeObject, SerializationObject and ColumnObject.
Metadata of the storage will store DataTypeObject for the column; it will know nothing about its subcolumns. Concrete data types (named tuples), deduced from the data, will be stored in data parts. For queries there are 2 approaches:
- Don't get any information about all subcolumns from the table. In inserts, always allow inserting new subcolumns. In selects it's not a problem for query analysis, because the type will be specified in the query (or will be String). If a subcolumn is absent in some parts (possibly all), the column will be filled with defaults.
  Caveats: in inserts we won't be able to return an error when the new type of a subcolumn is totally incompatible with the old type (e.g. String and Array(UInt32)). However, we can always write data as strings, but any small mistake will lead to writing garbage into the table. In selects we won't be able to read slices (user.location.* from the example above).
- Before execution of a query, get the types of all subcolumns. We can deduce the common type from parts.
  Caveats: may be too slow due to a large number of parts and subcolumns. Also some races are possible and additional synchronisation may be required.
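The read path of the first approach can be sketched as follows (a toy Python model under assumed structures; `read_subcolumn` and the part layout are hypothetical, not ClickHouse code):

```python
def read_subcolumn(parts, path, default=""):
    """Read one subcolumn across parts; parts that don't contain the
    subcolumn contribute default values instead of raising an error."""
    result = []
    for part in parts:  # each part: {"n_rows": int, "columns": {path: [values]}}
        values = part["columns"].get(path)
        if values is None:
            values = [default] * part["n_rows"]
        result.extend(values)
    return result

parts = [
    {"n_rows": 2, "columns": {"user.age": ["42", "43"]}},
    {"n_rows": 1, "columns": {}},  # older part, subcolumn absent
]
read_subcolumn(parts, "user.age")  # -> ["42", "43", ""]
```

This is also where the caveat shows up: without per-table subcolumn information, a wildcard like user.location.* cannot be expanded, because no single part is guaranteed to know the full set of paths.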
For now I am not sure which one is better.
ColumnObject will store a map from keys to regular columns.
Insertion will be supported from row-based input formats. The structure (JSON and similar) will be parsed for every row, then all paths will be extracted and added to ColumnObject as strings. When a new key appears, a new subcolumn filled with defaults will be added to ColumnObject. After all rows are parsed, the type of every subcolumn will be optimized to the lowest common type for all its values. Finally, DataTypeObject will be converted to a named Tuple, which will be written in the part metadata. It's ok to use paths as keys, because the serialization of such a flat tuple is the same as for a tuple with a complex nested structure, e.g. for types Tuple(a String, b Tuple(c UInt32, d UInt32)) and Tuple(a String, `b.c` UInt32, `b.d` UInt32).
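The insertion flow above can be sketched in a few lines of Python (a hypothetical model: the function names and the toy two-type deduction are assumptions for illustration, not the real implementation):

```python
def insert_rows(rows):
    """Accumulate rows of {path: value}; when a new key appears,
    backfill earlier rows with the default value."""
    columns = {}  # path -> list of raw string values
    n_rows = 0
    for row in rows:
        for path, value in row.items():
            if path not in columns:
                columns[path] = [""] * n_rows  # backfill with defaults
            columns[path].append(str(value))
        for path, values in columns.items():
            if len(values) == n_rows:  # key absent in this row
                values.append("")
        n_rows += 1
    return columns

def least_common_type(values):
    """Toy deduction: UInt64 if every non-default value is an integer,
    otherwise fall back to String."""
    non_default = [v for v in values if v != ""]
    if non_default and all(v.isdigit() for v in non_default):
        return "UInt64"
    return "String"

rows = [{"user.age": 42}, {"user.age": 43, "user.name": "John"}]
cols = insert_rows(rows)
types = {path: least_common_type(vals) for path, vals in cols.items()}
# cols  == {"user.age": ["42", "43"], "user.name": ["", "John"]}
# types == {"user.age": "UInt64", "user.name": "String"}
```

The final `types` dictionary corresponds to the named Tuple written in the part metadata, with dotted paths as tuple element names.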
Caveats: it will be harder to read the intermediate structure as a tuple, but you can't do much with tuples in ClickHouse anyway. Some problems may exist with the Nested type with multiple levels.
Before a merge, for every subcolumn of an Object column the least common type among all parts will be deduced, and in the new part the subcolumn will be written in that type.
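The merge-time widening can be sketched as picking the widest type along a promotion order (the order below is an assumption for illustration only, not the real ClickHouse type-promotion rules):

```python
# Toy widening order; anything that doesn't fit a numeric type becomes String.
ORDER = ["UInt8", "UInt16", "UInt32", "UInt64", "Float64", "String"]

def least_common_supertype(types):
    """Pick the widest of the concrete types stored in the merged parts."""
    return max(types, key=ORDER.index)

part_types = ["UInt8", "UInt32", "UInt8"]  # per-part types of one subcolumn
least_common_supertype(part_types)  # -> "UInt32"
```

A subcolumn absent from a part would contribute the default-filled column in whatever type the merge deduces, consistent with the read path described earlier.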