
Automatic schema inference for input formats#32455

Merged
Avogar merged 17 commits into ClickHouse:master from Avogar:schema-inference on Dec 29, 2021

Conversation

@Avogar
Member

@Avogar Avogar commented Dec 9, 2021

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement data schema inference for input formats. Allow skipping the structure (or writing just auto) in the table functions file, url, s3, hdfs and in the parameters of clickhouse-local. Allow skipping the structure in the CREATE query for the table engines File, HDFS, S3, URL, Merge, Buffer, Distributed and ReplicatedMergeTree (when adding new replicas).

Closes #14450

Detailed description / Documentation draft:
Now you can read data from a file without specifying the structure for almost all input formats.
Read data from a file:

SELECT * FROM file(<file_path>, <file_format>)

Or:

SELECT * FROM file(<file_path>, <file_format>, 'auto')

Create a table from an existing file:

CREATE TABLE <table_name> ENGINE=File(<file_format>, <file_path>)

Attach a table from a file:

ATTACH TABLE <table_name> FROM <file_path> ENGINE = File(<file_format>)

See the structure of any file (if the format supports schema inference):

DESC file(<file_path>, <file_format>)

The same for HDFS, S3 and URL table engines/table functions.
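For example (the URLs below are placeholders, not taken from this PR; the s3/url table function arguments are the usual path-and-format ones):

```sql
SELECT * FROM url('https://example.com/data.csv', CSV)

SELECT * FROM s3('https://my-bucket.s3.amazonaws.com/data.parquet', Parquet)
```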

For the table engines Merge, Buffer and Distributed you can skip the table structure in the CREATE query; the structure will be determined from the target table(s).
Example of a CREATE query:

CREATE TABLE <table_name> ENGINE=Merge(<db_name>, <tables_regexp>)
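Analogous sketches for Buffer and Distributed (the engine arguments shown are the standard ones for these engines, not taken from this PR; table and cluster names are hypothetical):

```sql
-- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
CREATE TABLE buffer_table ENGINE = Buffer(db, target_table, 16, 10, 100, 10000, 1000000, 10000000, 100000000)

-- Distributed(cluster, database, table)
CREATE TABLE dist_table ENGINE = Distributed(cluster_name, db, target_table)
```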

For the ReplicatedMergeTree table engine you can skip the table structure in the CREATE query if you already have at least one replica; the structure will be determined from the existing replicas.

CREATE TABLE <table_name> ENGINE=ReplicatedMergeTree(...)

For clickhouse-local you can skip the parameter -S/--structure or set it to 'auto' when it reads data from a file (it doesn't work when data is passed via stdin):

clickhouse-local --input-format='<table_format>' --file='<data_file>' --query='select * from table'

Formats that support schema inference:

  1. Formats Protobuf and CapnProto.
    In these formats ClickHouse reads an external schema and converts it to a ClickHouse schema. It won't read any data to determine the table structure (so the data file can be empty). If you want to see how a Protobuf or CapnProto schema transforms into a ClickHouse schema, you can run the following query:
DESC file('nonexist', 'Protobuf') SETTINGS format_schema='<path_to_schema>:<message_name>'
  2. Formats Parquet, ORC, Arrow, ArrowStream, Avro, Native and formats with the suffix -WithNamesAndTypes.
    These formats contain the schema directly in the data file; to determine the schema ClickHouse reads some part of the data.

  3. Formats LineAsString, RawBLOB, JSONAsString.
    For these formats the schema is always the same: one column of type String.

  4. Formats Values, CSV(WithNames), TSV(WithNames), TSVRaw(WithNames), CustomSeparated(WithNames), TSKV, Template, JSONEachRow, JSONCompactEachRow(WithNames), Regexp, MsgPack.
    In these formats ClickHouse reads the first input_format_max_rows_to_read_for_schema_inference rows of data and tries to determine the column types using some tweaks and heuristics. If the format doesn't contain column names, ClickHouse will use the names column_1, column_2, ..., column_N.
    Some limitations:

    • All numbers are treated as Float64 (except in MsgPack, where we can differentiate integers and floats).
    • All types that are output as strings (Date, DateTime, etc.) will be detected as strings.
    • For the formats TSV, TSVRaw, TSKV and for the escaping rules Escaped and Raw we treat all columns as String (no tweaks are used here yet).
    • The type Map can be detected only for the formats MsgPack, JSONEachRow, JSONCompactEachRow(WithNames) and for the JSON escaping rule (will be improved later).
    • The MsgPack format is binary, so for schema inference you should specify the number of columns with the setting input_format_msgpack_number_of_columns.
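To illustrate the heuristics above, a hypothetical session (the file name and contents are invented; the inferred types follow the limitations just listed, and the exact DESC output formatting may differ):

```sql
-- data.jsonl contains one row: {"id": 1, "name": "abc", "day": "2021-12-09"}
DESC file('data.jsonl', 'JSONEachRow')
-- id    Float64   (all numbers are treated as Float64)
-- name  String
-- day   String    (Date values are detected as strings)
```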

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Dec 9, 2021
@CLAassistant

CLAassistant commented Dec 9, 2021

CLA assistant check
All committers have signed the CLA.

@Avogar
Member Author

Avogar commented Dec 9, 2021

Tests, comments and full description will be added soon

@alexey-milovidov alexey-milovidov added the 🎅 🎁 gift🎄 To make people wonder label Dec 9, 2021
@nikitamikhaylov nikitamikhaylov self-assigned this Dec 9, 2021
@Avogar Avogar force-pushed the schema-inference branch 5 times, most recently from 1a129af to 06e8c97 Compare December 15, 2021 11:32
@Avogar
Member Author

Avogar commented Dec 23, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented Dec 23, 2021

update

✅ Branch has been successfully updated

@Avogar
Member Author

Avogar commented Dec 24, 2021

Test failures are not related to this PR

Member

Why is it needed?

Member Author

Just to reduce code duplication between readFieldByEscapingRule and readStringByEscapingRule. If you want to know the difference between these two functions: readStringByEscapingRule reads a String according to the escaping rule and returns the value of that String (e.g. for the Quoted rule we read 'String' and return String), while readFieldByEscapingRule reads an arbitrary field according to the escaping rule and returns it as-is (e.g. for the Quoted rule we read 'String' and return 'String'), so this function preserves all quotes (because we will use them to determine the type).

Comment on lines 39 to 51
Member

Wouldn't this be too slow?

Member Author

If you mean expression evaluation, then yes, it may be slow, but I think we shouldn't worry about it, because we process very few lines during schema inference. By the way, I plan to remove expression evaluation from this place and add a small parser for Quoted/CSV/TSV.

@nikitamikhaylov
Member

Will check out the code locally and read it in an IDE.

@Avogar
Member Author

Avogar commented Dec 28, 2021

Stress test failure: #33254

Member

@nikitamikhaylov nikitamikhaylov left a comment

Generally it is OK. I haven't fully read some of the format-specific code; I will try to do it later.

@Avogar
Member Author

Avogar commented Dec 29, 2021

Let's merge it and I will implement necessary adjustments and improvements later.

@mafiore

mafiore commented Apr 11, 2022

It would be helpful to have the possibility to pass in a schema or template file as a parameter. When I tried out this feature, the first loaded dataset dominated the "schema": it looks like if the first JSON row doesn't have all fields filled but the next one does, only the fields of the first were available. The other way around delivers all fields (missing fields are shown as empty).

Does schema-inference work with the kafka table engine too? This would be a little gamechanger for us.

EgorkaZ added a commit to EgorkaZ/ydb that referenced this pull request Feb 7, 2024
* Generally the code is taken from ClickHouse/ClickHouse#32455
* It uses evaluation of constant expressions for inferring types, which
    is not present in our code and in ClickHouse was later replaced with
    more explicit inference by hand. The latter seemed to bring less
    code, so it was used instead

Labels

🎅 🎁 gift🎄 To make people wonder pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema inference (RFC)

6 participants