Support of dynamic subcolumns in tables. #23516
Description
Introduce a new data type Object(<schema format>), parameterized by the name of a format for semi-structured data (JSON, XML, etc.).
Initially it will work only with MergeTree tables; other storages may be supported later.
Values will be inserted as strings into columns of this type. Then separate subcolumns will be parsed according to the schema format and written in columnar format.
Subcolumns of an Object column will be stored using the lowest common type in which all values of the subcolumn can be represented.
Columns of type Object will support reading subcolumns and slices at an arbitrary level of nesting.
For example, if we write the following JSON into a column with name user and type Object:
```json
{
    "name" : "John",
    "age" : 42,
    "location" : {
        "country" : "USA",
        "city" : "New York"
    }
}
```
there will be 4 columns available for reading: user.name, user.age, user.location.country, user.location.city, as well as the slice user.location.*.
The preferred type of a subcolumn in a query should be specified with the :: operator, e.g. user.age::UInt8. If it is omitted, the subcolumn will be converted to String.
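The decomposition of the JSON above into dotted subcolumn paths can be sketched like this (a toy Python illustration of the idea, not the actual implementation; the function name `flatten` is hypothetical):

```python
import json

def flatten(prefix, value):
    """Recursively extract all leaf paths from a parsed JSON value.

    Leaves are kept as strings, matching the "values are inserted
    as strings" step described above.
    """
    if isinstance(value, dict):
        result = {}
        for key, sub in value.items():
            result.update(flatten(f"{prefix}.{key}", sub))
        return result
    return {prefix: str(value)}

doc = json.loads('{"name": "John", "age": 42, '
                 '"location": {"country": "USA", "city": "New York"}}')
subcolumns = flatten("user", doc)
# subcolumns == {"user.name": "John", "user.age": "42",
#                "user.location.country": "USA",
#                "user.location.city": "New York"}
```

The slice user.location.* then corresponds to selecting every extracted path with that prefix.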
Implementation details
Introduce DataTypeObject, SerializationObject and ColumnObject.
Metadata of the storage will store DataTypeObject for the column; it will know nothing about its subcolumns. Concrete data types (named tuples), deduced from the data, will be stored in data parts. For queries there are 2 approaches:
- Don't get any information about all subcolumns from the table. In inserts, always allow inserting new subcolumns. In selects it's not a problem for query analysis, because the type will be specified in the query (or will be String). If a subcolumn is absent in some parts (possibly all), the column will be filled with defaults.
  Caveats: in inserts we won't be able to return an error when the new type of a subcolumn is totally incompatible with the old type (e.g. String and Array(UInt32)). However, we can always write data as strings, but any small mistake will lead to writing garbage into the table. In selects we won't be able to read slices (user.location.* from the example above).
- Before execution of a query, get the types of all subcolumns. We can deduce the common type from parts.
  Caveats: may be too slow due to a large number of parts and subcolumns. Also some races are possible and additional synchronisation may be required.
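The read path of the first approach can be sketched as follows (a toy Python model under assumed structures; `read_subcolumn` and the part layout are hypothetical, not ClickHouse code):

```python
def read_subcolumn(parts, path, default=""):
    """Read one subcolumn across parts; parts that don't contain the
    subcolumn contribute default values instead of raising an error."""
    result = []
    for part in parts:  # each part: {"n_rows": int, "columns": {path: [values]}}
        values = part["columns"].get(path)
        if values is None:
            values = [default] * part["n_rows"]
        result.extend(values)
    return result

parts = [
    {"n_rows": 2, "columns": {"user.age": ["42", "43"]}},
    {"n_rows": 1, "columns": {}},  # older part, subcolumn absent
]
read_subcolumn(parts, "user.age")  # -> ["42", "43", ""]
```

This is also where the caveat shows up: without per-table subcolumn information, a wildcard like user.location.* cannot be expanded, because no single part is guaranteed to know the full set of paths.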
For now I am not sure which one is better.
ColumnObject will store a map from keys to regular columns.
Insertion will be supported from row-based input formats. The structure (JSON and similar) will be parsed for every row, then all paths will be extracted and added to ColumnObject as strings. When a new key appears, a new subcolumn filled with defaults will be added to ColumnObject. After all rows are parsed, the type of every subcolumn will be optimized to the lowest common type for all its values. Finally, DataTypeObject will be converted to a named Tuple, which will be written in the part metadata. It's ok to use paths as keys, because the serialization of such a flat tuple is the same as for a tuple with a complex nested structure, e.g. for types Tuple(a String, b Tuple(c UInt32, d UInt32)) and Tuple(a String, `b.c` UInt32, `b.d` UInt32).
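The insertion flow above can be sketched in a few lines of Python (a hypothetical model: the function names and the toy two-type deduction are assumptions for illustration, not the real implementation):

```python
def insert_rows(rows):
    """Accumulate rows of {path: value}; when a new key appears,
    backfill earlier rows with the default value."""
    columns = {}  # path -> list of raw string values
    n_rows = 0
    for row in rows:
        for path, value in row.items():
            if path not in columns:
                columns[path] = [""] * n_rows  # backfill with defaults
            columns[path].append(str(value))
        for path, values in columns.items():
            if len(values) == n_rows:  # key absent in this row
                values.append("")
        n_rows += 1
    return columns

def least_common_type(values):
    """Toy deduction: UInt64 if every non-default value is an integer,
    otherwise fall back to String."""
    non_default = [v for v in values if v != ""]
    if non_default and all(v.isdigit() for v in non_default):
        return "UInt64"
    return "String"

rows = [{"user.age": 42}, {"user.age": 43, "user.name": "John"}]
cols = insert_rows(rows)
types = {path: least_common_type(vals) for path, vals in cols.items()}
# cols  == {"user.age": ["42", "43"], "user.name": ["", "John"]}
# types == {"user.age": "UInt64", "user.name": "String"}
```

The final `types` dictionary corresponds to the named Tuple written in the part metadata, with dotted paths as tuple element names.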
Caveats: it will be harder to read the intermediate structure as a tuple, but you can't do much with tuples in ClickHouse anyway. Some problems may exist with the Nested type with multiple levels.
Before a merge, for every subcolumn of an Object column the least common type among all parts will be deduced, and in the new part the subcolumn will be written in that type.
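The merge-time widening can be sketched as picking the widest type along a promotion order (the order below is an assumption for illustration only, not the real ClickHouse type-promotion rules):

```python
# Toy widening order; anything that doesn't fit a numeric type becomes String.
ORDER = ["UInt8", "UInt16", "UInt32", "UInt64", "Float64", "String"]

def least_common_supertype(types):
    """Pick the widest of the concrete types stored in the merged parts."""
    return max(types, key=ORDER.index)

part_types = ["UInt8", "UInt32", "UInt8"]  # per-part types of one subcolumn
least_common_supertype(part_types)  # -> "UInt32"
```

A subcolumn absent from a part would contribute the default-filled column in whatever type the merge deduces, consistent with the read path described earlier.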