
Automatic schema inference for input formats#32455

Merged
Avogar merged 17 commits into ClickHouse:master from Avogar:schema-inference on Dec 29, 2021

Conversation

@Avogar
Member

@Avogar Avogar commented Dec 9, 2021

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement data schema inference for input formats. Allow skipping the structure (or writing just auto) in the table functions file, url, s3, hdfs and in the parameters of clickhouse-local. Allow skipping the structure in the CREATE query for the table engines File, HDFS, S3, URL, Merge, Buffer, Distributed and ReplicatedMergeTree (when adding new replicas).

Closes #14450

Detailed description / Documentation draft:
Now you can read data from a file without specifying the structure for almost all input formats.
Read data from a file:

SELECT * FROM file(<file_path>, <file_format>)

Or:

SELECT * FROM file(<file_path>, <file_format>, 'auto')

Create a table from an existing file:

CREATE TABLE <table_name> ENGINE=File(<file_format>, <file_path>)

Attach a table from a file:

ATTACH TABLE <table_name> FROM <file_path> ENGINE = File(<file_format>)

See the structure of any file (if the format supports schema inference):

DESC file(<file_path>, <file_format>)

The same for HDFS, S3 and URL table engines/table functions.
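For example (the URLs below are placeholders, not taken from this PR; the s3/url table function arguments are the usual path-and-format ones):

```sql
SELECT * FROM url('https://example.com/data.csv', CSV)

SELECT * FROM s3('https://my-bucket.s3.amazonaws.com/data.parquet', Parquet)
```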

For the table engines Merge, Buffer and Distributed you can skip the table structure in the CREATE query; the structure will be determined from the target table(s).
Example of a CREATE query:

CREATE TABLE <table_name> ENGINE=Merge(<db_name>, <tables_regexp>)
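Analogous sketches for Buffer and Distributed (the engine arguments shown are the standard ones for these engines, not taken from this PR; table and cluster names are hypothetical):

```sql
-- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
CREATE TABLE buffer_table ENGINE = Buffer(db, target_table, 16, 10, 100, 10000, 1000000, 10000000, 100000000)

-- Distributed(cluster, database, table)
CREATE TABLE dist_table ENGINE = Distributed(cluster_name, db, target_table)
```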

For the ReplicatedMergeTree table engine you can skip the table structure in the CREATE query if you already have at least one replica; the structure will be determined from the existing replicas.

CREATE TABLE <table_name> ENGINE=ReplicatedMergeTree(...)

For clickhouse-local you can skip the parameter -S/--structure or set it to 'auto' when it reads data from a file (it doesn't work when data is passed via stdin):

clickhouse-local --input-format='<table_format>' --file='<data_file>' --query='select * from table'

Formats that support schema inference:

  1. Formats Protobuf and CapnProto.
    In these formats ClickHouse reads an external schema and converts it to a ClickHouse schema. It won't read any data to determine the table structure (so the data file can be empty). If you want to see how a Protobuf or CapnProto schema transforms into a ClickHouse schema, you can run the following query:
DESC file('nonexist', 'Protobuf') SETTINGS format_schema='<path_to_schema>:<message_name>'
  2. Formats Parquet, ORC, Arrow, ArrowStream, Avro, Native and formats with the suffix -WithNamesAndTypes.
    These formats contain the schema directly in the data file; to determine the schema ClickHouse reads some part of the data.

  3. Formats LineAsString, RawBLOB, JSONAsString.
    For these formats the schema is always the same: one column of type String.

  4. Formats Values, CSV(WithNames), TSV(WithNames), TSVRaw(WithNames), CustomSeparated(WithNames), TSKV, Template, JSONEachRow, JSONCompactEachRow(WithNames), Regexp, MsgPack.
    In these formats ClickHouse reads the first input_format_max_rows_to_read_for_schema_inference rows of data and tries to determine the column types using some tweaks and heuristics. If the format doesn't contain column names, ClickHouse will use the names column_1, column_2, ..., column_N.
    Some limitations:

    • All numbers are treated as Float64 (except in MsgPack, where we can differentiate integers and floats).
    • All types that are output as strings (Date, DateTime, etc.) will be detected as strings.
    • For the formats TSV, TSVRaw, TSKV and for the escaping rules Escaped and Raw we treat all columns as String (no tweaks are used here yet).
    • The type Map can be detected only for the formats MsgPack, JSONEachRow, JSONCompactEachRow(WithNames) and for the JSON escaping rule (will be improved later).
    • The MsgPack format is binary, so for schema inference you should specify the number of columns with the setting input_format_msgpack_number_of_columns.
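To illustrate the heuristics above, a hypothetical session (the file name and contents are invented; the inferred types follow the limitations just listed, and the exact DESC output formatting may differ):

```sql
-- data.jsonl contains one row: {"id": 1, "name": "abc", "day": "2021-12-09"}
DESC file('data.jsonl', 'JSONEachRow')
-- id    Float64   (all numbers are treated as Float64)
-- name  String
-- day   String    (Date values are detected as strings)
```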

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Dec 9, 2021
@CLAassistant

CLAassistant commented Dec 9, 2021

CLA assistant check
All committers have signed the CLA.

@Avogar
Member Author

Avogar commented Dec 9, 2021

Tests, comments and full description will be added soon

@alexey-milovidov alexey-milovidov added the 🎅 🎁 gift🎄 To make people wonder label Dec 9, 2021
@nikitamikhaylov nikitamikhaylov self-assigned this Dec 9, 2021
@Avogar Avogar force-pushed the schema-inference branch 5 times, most recently from 1a129af to 06e8c97 Compare December 15, 2021 11:32
@Avogar
Member Author

Avogar commented Dec 23, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented Dec 23, 2021

update

✅ Branch has been successfully updated

@Avogar
Member Author

Avogar commented Dec 24, 2021

Test failures are not related to this PR

Member

Why is it needed?

Member Author

Just to reduce code duplication between readFieldByEscapingRule and readStringByEscapingRule. If you want to know the difference between these two functions: readStringByEscapingRule reads a String according to the escaping rule and returns the value of that String (e.g. for the Quoted rule we read 'String' and return String), while readFieldByEscapingRule reads an arbitrary field according to the escaping rule and returns it as-is (e.g. for the Quoted rule we read 'String' and return 'String'), so this function preserves all quotes (because we will use them to determine the type).

Comment on lines 39 to 51
Member

Wouldn't this be too slow?

Member Author

If you mean expression evaluation, then yes, it may be slow, but I think we shouldn't worry about it, because we process very few lines during schema inference. By the way, I plan to remove expression evaluation from this place and add a small parser for Quoted/CSV/TSV.

@nikitamikhaylov
Member

Will check out the code locally and read it in an IDE.

@Avogar
Member Author

Avogar commented Dec 28, 2021

Stress test failure: #33254

Member

@nikitamikhaylov nikitamikhaylov left a comment

Generally it is OK. I haven't fully read some of the format-specific code; I will try to do it later.

@Avogar
Member Author

Avogar commented Dec 29, 2021

Let's merge it and I will implement necessary adjustments and improvements later.

@mafiore

mafiore commented Apr 11, 2022

It would be helpful to have the possibility to pass in a schema or template file as a parameter. When I tried out this feature, the first loaded dataset dominated the "schema": it looks like if the first JSON row doesn't have all fields filled but the next one does, only the fields of the first were available. The other way around delivers all fields (missing fields are shown as empty).

Does schema-inference work with the kafka table engine too? This would be a little gamechanger for us.

EgorkaZ added a commit to EgorkaZ/ydb that referenced this pull request Feb 7, 2024
* Generally the code is taken from ClickHouse/ClickHouse#32455
* It uses evaluation of constant expressions for inferring types, which
    is not present in our code and in ClickHouse was later replaced with
    more explicit inference by hand. The latter seemed to bring less
    code, so it was used instead

Labels

🎅 🎁 gift🎄 To make people wonder pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema inference (RFC)

6 participants