Automatic schema inference for input formats#32455
Avogar merged 17 commits into ClickHouse:master
Conversation
Tests, comments and full description will be added soon
Force-pushed from 1a129af to 06e8c97
Force-pushed from f71a0f3 to ac6915f
@Mergifyio update

✅ Branch has been successfully updated
Test failures are not related to this PR
src/Formats/EscapingRuleUtils.cpp (outdated)
Just to reduce code duplication in readFieldByEscapingRule and readStringByEscapingRule. If you want to know the difference between these two functions: readStringByEscapingRule reads a String according to the escaping rule and returns the value of that String (e.g. for the Quoted rule we read 'String' and return String), while readFieldByEscapingRule reads an arbitrary field according to the escaping rule and returns it as-is (e.g. for the Quoted rule we read 'String' and return 'String'), so we preserve all quotes in this function (because we will use them to determine the type).
src/Formats/EscapingRuleUtils.h (outdated)
Wouldn't this be too slow?
If you mean expression evaluation, then yes, it may be slow, but I also think we shouldn't worry about it, because we process very few lines during schema inference. By the way, I plan to remove expression evaluation from this place and add a small parser for Quoted/CSV/TSV.
Will check out the code locally and read it in an IDE
Force-pushed from 399fc30 to 4611bf6
Stress test failure: #33254
Force-pushed from 07f5205 to 364b4f5
nikitamikhaylov
left a comment
Generally it is OK. I haven't fully read some of the format-specific code; I will try to do it later.
Let's merge it and I will implement necessary adjustments and improvements later.
It would be helpful to have the possibility to pass in a schema or template file as a parameter. When I tried out this feature, the first loaded dataset dominated the "schema": it looks like if the first JSON does not have all fields filled but the next one does, only the fields of the first one are available. The other way around delivers all fields (missing fields are shown as empty). Does schema inference work with the Kafka table engine too? This would be a little gamechanger for us.
* Generally the code is taken from ClickHouse/ClickHouse#32455
* It uses evaluation of constant expressions for inferring types, which is not present in our code and in ClickHouse was later replaced with more explicit inference by hand. The latter seemed to bring less code, and thus was used instead.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement data schema inference for input formats. Allow to skip structure (or write just `auto`) in table functions `file`, `url`, `s3`, `hdfs` and in parameters of `clickhouse-local`. Allow to skip structure in the create query for table engines `File`, `HDFS`, `S3`, `URL`, `Merge`, `Buffer`, `Distributed` and `ReplicatedMergeTree` (if we add new replicas). Closes #14450
Detailed description / Documentation draft:
Now you can read data from a file without specifying the structure for almost all input formats.
Read data from file:
Or:
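The original query examples are not preserved in this page. A representative sketch, assuming a `JSONEachRow` file named `data.json` in the server's `user_files` directory (both the file name and format are assumptions for illustration):

```sql
-- Structure is inferred from the data, so none is specified.
SELECT * FROM file('data.json', 'JSONEachRow');

-- Equivalently, write 'auto' where the structure would normally go.
SELECT * FROM file('data.json', 'JSONEachRow', 'auto');
```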
Create a table from existing file:
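The original example is not preserved here; a hedged sketch (table and file names are assumptions):

```sql
-- No column list: the structure is inferred from data.json.
CREATE TABLE test ENGINE = File(JSONEachRow, 'data.json');
```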
Attach table from file:
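A sketch under the same assumptions as above (table name, engine arguments, and file name are illustrative):

```sql
-- ATTACH with an engine clause but no column list;
-- the structure is inferred from the existing file.
ATTACH TABLE test2 ENGINE = File(JSONEachRow, 'data.json');
```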
See the structure of any file (if format supports schema inference):
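For example (file name and format are assumptions):

```sql
-- DESCRIBE prints the inferred column names and types.
DESC file('data.json', 'JSONEachRow');
```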
The same works for the `HDFS`, `S3` and `URL` table engines/table functions.

For table engines `Merge`, `Buffer` and `Distributed` you can skip the table structure in the create query; the structure will be determined by the target table. Example of a create query:
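The example itself is not preserved in this page; a sketch, assuming a target table `default.hits` on a cluster named `test_cluster`:

```sql
-- No column list: the structure is taken from the target table.
CREATE TABLE hits_distributed ENGINE = Distributed(test_cluster, default, hits);
```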
For the `ReplicatedMergeTree` table engine you can skip the table structure in the create query if you already have at least one replica; the structure will be determined by the existing replicas.

For `clickhouse-local` you can skip the parameter `-S`/`--structure` or set it to `auto` when it reads data from a file (it doesn't work when data is passed via stdin).

Formats that support schema inference:
`Protobuf` and `CapnProto`: in these formats ClickHouse will read an external schema and convert it to a ClickHouse schema. It won't read any data to determine the table structure (so the data file can be empty). If you want to see how a
`Protobuf` or `CapnProto` schema transforms into a ClickHouse schema, you can run the next query:

`Parquet`, `ORC`, `Arrow`, `ArrowStream`, `Avro`, `Native` and formats with the suffix `-WithNamesAndTypes`: these formats contain the schema directly in the data file; to determine the schema, ClickHouse reads some part of the data.
Formats `LineAsString`, `RawBLOB`, `JSONAsString`: for these formats the schema is always the same, a single column of type `String`.

Formats `Values`, `CSV(WithNames)`, `TSV(WithNames)`, `TSVRaw(WithNames)`, `CustomSeparated(WithNames)`, `TSKV`, `Template`, `JSONEachRow`, `JSONCompactEachRow(WithNames)`, `Regexp`, `MsgPack`: in these formats ClickHouse reads the first `input_format_max_rows_to_read_for_schema_inference` rows of data and tries to determine the column types using some tweaks and heuristics. If the format doesn't contain column names, ClickHouse will use the names `column_1`, `column_2`, ..., `column_N`.

Some limitations:
- All numbers will be inferred as `Float64` (except in `MsgPack`, where we can differentiate integers and floats).
- Date and time values (`Date`, `DateTime`, etc.) will be detected as strings.
- For `TSV`, `TSVRaw`, `TSKV` and for the escaping rules `Escaped` and `Raw` we treat all columns as `String` (no tweaks are used here yet).
- `Map` can be detected only for formats `MsgPack`, `JSONEachRow`, `JSONCompactEachRow(WithNames)` and for the `JSON` escaping rule (will be improved later).
- The `MsgPack` format is binary, so you should specify the number of columns in it for schema inference using the setting `input_format_msgpack_number_of_columns`.
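For instance, inferring the schema of a binary `MsgPack` file could be sketched as follows (the file name and column count are assumptions for illustration):

```sql
-- MsgPack carries no column count, so it must be supplied explicitly.
DESC file('data.msgpack', 'MsgPack')
SETTINGS input_format_msgpack_number_of_columns = 3;
```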