Schema inference (RFC) #14450
Description
Level 1:
If the format already contains a schema (Parquet, Avro, Protobuf, Native, TSVWithNamesAndTypes), allow using it to derive:
- the schema argument of table functions file, hdfs, url, s3...
- the structure argument of the clickhouse-local tool;
- columns definition in CREATE TABLE statement;
Use case: avoid repeating the schema when the format already contains it.
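As an illustration of Level 1, here is a minimal sketch of how a structure string could be derived from a TSVWithNamesAndTypes stream, whose first line holds column names and second line holds column types. The function names `read_names_and_types` and `to_structure` are hypothetical, and the sketch assumes names and types contain no escaped tabs or newlines.

```python
import io


def read_names_and_types(stream):
    """Read the two header lines (names, then types) from the stream."""
    names = stream.readline().rstrip("\n").split("\t")
    types = stream.readline().rstrip("\n").split("\t")
    if len(names) != len(types):
        raise ValueError(
            f"header has {len(names)} names but {len(types)} types"
        )
    return list(zip(names, types))


def to_structure(columns):
    """Render (name, type) pairs as a 'name Type, name Type' string."""
    return ", ".join(f"{name} {type_}" for name, type_ in columns)


# Sample data: two header lines followed by one data row.
data = io.StringIO(
    "id\tname\tscore\n"
    "UInt64\tString\tNullable(Float64)\n"
    "1\talice\t0.5\n"
)
columns = read_names_and_types(data)
structure = to_structure(columns)
print(structure)
```

After the header is consumed, the stream is positioned at the first data row, so the same stream can be handed to the row parser without re-reading.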
Level 2 (troublesome):
If the format does not contain the schema, derive "best effort" schema from the data in buffer.
If the format contains names of the columns (e.g. CSVWithNames), use these names. Otherwise use _1, _2... as column names.
Various tweaks can be used to adjust the derived schema. For example: always treat values as Nullable, even if all of them are present in the first chunk of data in the buffer. Or: use Float64 for numbers instead of Int64 / UInt64. Or: treat everything as String. Or: only extract data from a subpath in JSON.
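A rough sketch of what "best effort" inference with these tweaks might look like for CSV data. Everything here is an assumption made for illustration: the function names, the tweak names (`numbers_as_float`, `always_nullable`, `all_strings`, `with_names`), the rule that an empty field counts as a missing value, and the choice of UInt64 unless a negative number is seen. Integer range overflow is not checked.

```python
import csv
import io
import re

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def infer_type(values, numbers_as_float=False, always_nullable=False):
    """Pick a type name for one column from its sampled string values."""
    present = [v for v in values if v != ""]  # assumption: "" means missing
    if not present:
        base = "String"
    elif all(INT_RE.fullmatch(v) for v in present):
        if numbers_as_float:
            base = "Float64"
        elif any(v.startswith("-") for v in present):
            base = "Int64"
        else:
            base = "UInt64"
    elif all(FLOAT_RE.fullmatch(v) for v in present):
        base = "Float64"
    else:
        base = "String"
    nullable = always_nullable or len(present) < len(values)
    return f"Nullable({base})" if nullable else base


def infer_schema(chunk, with_names=False, all_strings=False, **tweaks):
    """Derive [(name, type)] from the first chunk of CSV text."""
    rows = list(csv.reader(io.StringIO(chunk)))
    if not rows:
        return []
    if with_names:
        names, rows = rows[0], rows[1:]
    else:
        names = [f"_{i}" for i in range(1, len(rows[0]) + 1)]
    schema = []
    for i, name in enumerate(names):
        # Short rows are padded with missing values.
        column = [row[i] if i < len(row) else "" for row in rows]
        type_ = "String" if all_strings else infer_type(column, **tweaks)
        schema.append((name, type_))
    return schema


print(infer_schema("1,abc,2.5\n2,,-3\n"))
print(infer_schema("a,b\n1,x\n", with_names=True))
```

Because the schema is derived only from the sampled chunk, later rows may not fit it; tweaks such as `always_nullable` or `all_strings` trade precision for robustness in that case.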
Use case:
See https://github.com/dinedal/textql
Provide a similar tool for those who like ClickHouse SQL capabilities.
Also look at the discussion of the huge list of similar tools:
https://news.ycombinator.com/item?id=16781294
https://news.ycombinator.com/item?id=7175830
Look at the subthread: https://news.ycombinator.com/item?id=16782355
How to implement:
I'm not sure that this will work.
If inference is requested:
- make a constructor of the format without the header argument;
- start reading data in the constructor of the format (fetch the first chunk of data into the buffer) and derive the header from that chunk.
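The two steps above could be sketched as follows. This is a toy model in Python, not ClickHouse code: the class `TabSeparatedInput`, its `header` attribute, and the `chunk_size` parameter are hypothetical, and every derived column is typed as String to keep the focus on the buffering. The key point is that the chunk fetched for inference stays in the buffer, so no rows are lost when reading starts.

```python
import io


class TabSeparatedInput:
    """Toy input format whose header may be given or derived from data."""

    def __init__(self, stream, header=None, chunk_size=4096):
        self._stream = stream
        self._buffer = ""
        if header is None:
            # Inference requested: fetch data until one full line is buffered.
            while "\n" not in self._buffer:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                self._buffer += chunk
            header = self._derive_header()
        self.header = header

    def _derive_header(self):
        first_line = self._buffer.split("\n", 1)[0]
        width = len(first_line.split("\t")) if first_line else 0
        return [(f"_{i}", "String") for i in range(1, width + 1)]

    def rows(self):
        # Replay the buffered chunk first, then the rest of the stream.
        text = self._buffer + self._stream.read()
        self._buffer = ""
        for line in text.splitlines():
            yield line.split("\t")


inferred = TabSeparatedInput(io.StringIO("1\talice\n2\tbob\n"), chunk_size=2)
print(inferred.header)
print(list(inferred.rows()))
```

Passing an explicit `header` skips the pre-read entirely, which matches the existing behaviour where the caller supplies the schema.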
Allow executing CREATE TABLE t ENGINE = ... without specifying the list of columns. This means that the list of columns will be constructed automatically in the Storage constructor. Provide a Storage constructor without the list of columns and register these constructors in StorageFactory for storages that allow schema inference.
This will also be used to construct ReplicatedMergeTree tables when we add new replicas of an existing table.
Merge, Distributed, Buffer will also benefit from that, although the CREATE TABLE ... AS syntax is already available.
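To make the factory idea concrete, here is a small sketch of a registry that accepts a missing column list only for storages that declared support for schema inference. It is a Python model under stated assumptions, not the actual StorageFactory interface: the `can_infer_schema` flag, the `FileStorage` and `MemoryStorage` classes, and the `first_line` argument are all hypothetical.

```python
class StorageFactory:
    """Toy registry: engine name -> constructor, plus an inference flag."""

    def __init__(self):
        self._creators = {}
        self._can_infer = set()

    def register(self, engine, creator, can_infer_schema=False):
        self._creators[engine] = creator
        if can_infer_schema:
            self._can_infer.add(engine)

    def create(self, engine, columns=None, **engine_args):
        if engine not in self._creators:
            raise KeyError(f"unknown engine {engine}")
        if columns is None and engine not in self._can_infer:
            raise ValueError(
                f"engine {engine} requires an explicit list of columns"
            )
        return self._creators[engine](columns, **engine_args)


class FileStorage:
    """Hypothetical storage that builds its columns when none are given."""

    def __init__(self, columns, first_line=""):
        if columns is None:
            width = len(first_line.split("\t")) if first_line else 0
            columns = [(f"_{i}", "String") for i in range(1, width + 1)]
        self.columns = columns


class MemoryStorage:
    """Hypothetical storage with nothing to infer a schema from."""

    def __init__(self, columns):
        self.columns = columns


factory = StorageFactory()
factory.register("File", FileStorage, can_infer_schema=True)
factory.register("Memory", MemoryStorage)

table = factory.create("File", first_line="1\talice")
print(table.columns)
```

Keeping the flag in the factory means a CREATE TABLE without columns fails early and uniformly for engines that cannot infer a schema, instead of failing inside each storage constructor.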