
Schema inference for freeform text formats#43449

Open
dmtri35 wants to merge 37 commits into ClickHouse:master from dmtri35:heuristics-inference

Conversation

@dmtri35

@dmtri35 dmtri35 commented Nov 21, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implement schema inference for freeform row-based text formats.

Closes #34006

Once this is done, you should be able to ingest arbitrary tabular text files (e.g. application logs) without specifying the structure (as the Template format requires).

Example usage:

Read data from file

SELECT * FROM file(<file_path>,'Freeform')

Describe data from file

Desc file(<file_path>,'Freeform')

How it is implemented

FreeformFieldMatcher houses an array of matchers that parse text data by escaping rule (JSON, CSV, Raw (but cut off at whitespace), Quoted, and Escaped).

First, FreeformFieldMatcher will iterate over this set of matchers and generate as many solutions as possible.

A Solution is defined as an order of matchers that parses one row successfully, associated with a score. The score is accumulated by adding the score of each field in the solution, determined by the field's escaping rule and type. More specific types (e.g. DateTime) yield higher scores.

For now we only generate solutions for the first row: since the data is supposed to be tabular, we should be able to get an acceptable solution just by looking at one row. The solutions are then sorted by score and checked against max_rows_to_check rows. The first solution that matches all of the max_rows_to_check rows is picked as the final solution.
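The scoring-and-validation loop described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's C++ code; the type scores, the matcher interface, and all names are assumptions:

```python
# Illustrative sketch of the solution search described above.
# TYPE_SCORES and the matcher interface are assumptions, not
# ClickHouse's actual implementation.

# More specific types yield higher scores.
TYPE_SCORES = {"DateTime": 4, "Float64": 3, "Int64": 2, "String": 1}

def score(solution):
    """A solution's score is the sum of its per-field type scores."""
    return sum(TYPE_SCORES[field_type] for field_type in solution)

def pick_solution(solutions, rows, matches, max_rows_to_check):
    """Sort candidate solutions by score (best first) and return the first
    one whose matcher order parses all of the first max_rows_to_check rows."""
    for solution in sorted(solutions, key=score, reverse=True):
        if all(matches(solution, row) for row in rows[:max_rows_to_check]):
            return solution
    return None
```

The key design point is that candidates are generated cheaply from the first row only, and the more expensive validation against further rows happens in score order, so the most specific surviving solution wins.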

Currently this works with CSV and syslog, but more testing and tuning are needed for it to reach an acceptable state.

From

1,2021-04-01 00:00:18,2021-04-01 00:21:54,1,8.40,1,N,79,116,1,25.5,3,0.5,5.85,0,0.3,35.15,2.5
1,2021-04-01 00:42:37,2021-04-01 00:46:23,1,.90,1,N,75,236,2,5,3,0.5,0,0,0.3,8.8,2.5

we have

1       2021    4       1       0       0       18      2021    4       1       0       21      54      1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15   2.5
1       2021    4       1       0       42      37      2021    4       1       0       46      23      1       0.9     1       N       75      236     2       5       3       0.5     0       0       0.3     8.8     2.5

From

Nov 13 10:29:56 Tri chronyd[227]: Selected source PHC0
Nov 13 13:24:02 Tri kernel: [91566.562562] hv_utils: TimeSync IC version 4.0
Nov 13 13:24:02 Tri chronyd[227]: Forward time jump detected!

we have

Nov     13      10      29      56      Tri     chronyd[227]:   Selected source PHC0
Nov     13      13      24      2       Tri     kernel: [91566.562562] hv_utils: TimeSync IC version 4.0
Nov     13      13      24      2       Tri     chronyd[227]:   Forward time jump detected!

Update 2022-11-28:

With the new changes increasing the search tree size, we now iterate over all matchers and don't exit early. However, a new field is added only if it has a better type score than the previous field, or if no field other than type String has been found. This helps us better parse the previous CSV example:

1       2021-04-01      0       0       18      2021-04-01      0       21      54      1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15   2.5
1       2021-04-01      0       42      37      2021-04-01      0       46      23      1       0.9     1       N       75      236     2       5       3       0.5     0       0       0.3     8.8     2.5

And also ClickHouse's logs:

2022-11-17      11      7       30.825405       [395843]        {345bf223-db80-4772-9a07-73542321e715::202211_15349_15604_53}   <Debug> MergeTask::PrepareStage:        Merging 6 parts: from 202211_15349_15599_52 to 202211_15604_15604_0 into Compact
2022-11-17      11      7       30.826372       [395843]        {345bf223-db80-4772-9a07-73542321e715::202211_15349_15604_53}   <Debug> MergeTask::PrepareStage:        Selected MergeAlgorithm: Horizontal
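The field-acceptance rule described in the update can be sketched like this (a hypothetical Python illustration; the names and scores are assumptions, not the PR's code):

```python
# Hypothetical sketch of the field-acceptance rule described above:
# after iterating over all matchers for one position, keep the candidate
# with the best type score. Since String scores lowest, any non-String
# match replaces a String-only result.

TYPE_SCORES = {"DateTime": 4, "Float64": 3, "Int64": 2, "String": 1}

def choose_field(candidate_types):
    """Return the candidate type with the highest score, or None."""
    best = None
    for field_type in candidate_types:
        if best is None or TYPE_SCORES[field_type] > TYPE_SCORES[best]:
            best = field_type
    return best
```

Compared with the earlier early-exit behaviour, this keeps searching all matchers and only commits to the most specific match found.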

Update 2022-12-30:

We can now parse and unmarshal JSON object fields into their own columns.
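As a rough illustration of unmarshalling a JSON object field into per-key columns (a sketch with assumed behaviour and names, not the PR's implementation):

```python
import json

# Sketch (assumed behaviour, not the PR's code): flatten a JSON object
# field into one column per key, joining nested keys with dots.

def json_field_to_columns(raw, prefix=""):
    obj = json.loads(raw) if isinstance(raw, str) else raw
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(json_field_to_columns(value, name))
        else:
            columns[name] = value
    return columns
```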
TODO:

  • Make use of the SchemaCache to store the solution
  • Fix the newly added test

@CLAassistant

CLAassistant commented Nov 21, 2022

CLA assistant check
All committers have signed the CLA.

@robot-clickhouse robot-clickhouse added the pr-feature Pull request with new product feature label Nov 21, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Nov 21, 2022
@Avogar Avogar self-assigned this Nov 21, 2022
@qoega
Member

qoega commented Nov 22, 2022

More specific fields (ie. DateTime) yield higher scores.

Why then is your first example not detecting DateTime? It is CSV with a single delimiter - IMHO it should have a higher score than custom-delimited with one of ,:\ as delimiter

@dmtri35
Author

dmtri35 commented Nov 22, 2022

More specific fields (ie. DateTime) yield higher scores.

Why then is your first example not detecting DateTime? It is CSV with a single delimiter - IMHO it should have a higher score than custom-delimited with one of ,:\ as delimiter

Yes, I think it should too, but currently there are a few reasons why this happens:

  • Matchers are ordered by priority, and right now the JSON matcher is put before CSV, so it generates its result first. So in 2021-04-01 00:00:18, 2021 would be matched as an Int64 first.
  • There is an early exit path in the algorithm such that any field with a type other than String is allowed to exit, and Int64 satisfies this. The purpose of this early exit is to cut down the search tree and runtime. So in this case, we stop at the JSON matcher and move on to the next field.
  • I have tried running this example with the CSV input format, and although it takes in the whole 2021-04-01 00:00:18 field, it is still inferred as a String. That's because of this line, which only allows us to infer it as DateTime if there's a quote.
1       2021-04-01 00:00:18     2021-04-01 00:21:54     1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15      2.5
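The early-exit behaviour in the second bullet can be sketched as follows (an illustrative Python sketch with an assumed matcher interface, not the actual code):

```python
# Illustrative sketch (assumed matcher interface) of the early-exit path:
# matchers are tried in priority order and the search stops at the first
# non-String match, which prunes the search tree but can lock in a less
# specific parse (e.g. Int64 instead of DateTime).

def match_field_early_exit(matchers, text):
    """Return the first (value, type) match whose type is not String."""
    for matcher in matchers:  # priority order, e.g. JSON before CSV
        field = matcher(text)
        if field is not None and field[1] != "String":
            return field  # early exit on any non-String type
    return None
```

With such a scheme, a matcher that greedily returns ("2021", "Int64") shadows a later matcher that would have consumed the full DateTime, which is exactly the behaviour discussed above.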

Additionally if I change the example to

1,"2021-04-01 00:00:18","2021-04-01 00:21:54",1,8.40,1,N,79,116,1,25.5,3,0.5,5.85,0,0.3,35.15,2
1,"2021-04-01 00:42:37","2021-04-01 00:46:23",1,.90,1,N,75,236,2,5,3,0.5,0,0,0.3,8.8,2.5

we get the desired result:

1       2021-04-01 00:00:18.000000000   2021-04-01 00:21:54.000000000   1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0 0.3      35.15   2
1       2021-04-01 00:42:37.000000000   2021-04-01 00:46:23.000000000   1       0.9     1       N       75      236     2       5       3       0.5     0       0 0.3      8.8     2.5

I will try to think of a way to get this closer to being parsed as CSV; if you have any suggestions, let me know! I think one possible way is to reuse the matcher of the previous field before we iterate over the list of matchers.

Edit: With the new commits (increasing the search tree size), this has been somewhat mitigated

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 9a7bc1b to cf0daa7 Compare November 26, 2022 22:59
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from d9d1f29 to 084689e Compare December 29, 2022 15:36
@dmtri35 dmtri35 marked this pull request as ready for review December 31, 2022 02:50
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 03fbd7d to 9b5966e Compare March 30, 2023 23:17
@Avogar
Member

Avogar commented Mar 31, 2023

Sorry for the long wait. I will review this PR next week. Feel free to ping me.

@dmtri35
Author

dmtri35 commented Apr 5, 2023

@Avogar Could you have a look at my PR? Thanks!

@Avogar
Member

Avogar commented Apr 6, 2023

I am in the process. I've already taken a look several times; I just need some time to understand all the logic.
Looks good in general, but some details need to be changed/discussed. I will write comments today/tomorrow.

Comment on lines 31 to 32
Member

@Avogar Avogar Apr 6, 2023


I am not sure that we want to skip multiple ',' and ':' with possible spaces between them as one delimiter. For example, in CSV format we can have an empty field, and 'abc,,def' will represent 3 columns (maybe we can support it in the Freeform format somehow?).
Also we can add more possible delimiters like '|', ';', maybe some other punctuation symbols or their combinations (not sure).
Maybe do something like this:

skipWhitespaceIfAny(in);
/// Skip possible delimiters like ',', '|', ':', etc or their combinations.
skipWhitespaceIfAny(in);

Author


Good point. I will experiment to see whether we can really support empty fields; I assume the type we infer from them would be Nullable?

@Avogar
Member

Avogar commented Apr 6, 2023

I will make a few more iterations of review later. Right now the most important thing is how we work with the read buffer (see comments); that needs to be fixed first. Also, please do more tests. I ran a few simple tests and most of them just crashed the client:

avogar-dev :) desc format(Freeform, 'Hello,,, World')

DESCRIBE TABLE format(Freeform, 'Hello,,, World')

Query id: 2481d436-44c2-415b-b2e0-ad7dd58e2226

[avogar-dev] 2023.04.06 16:30:38.873029 [ 1718730 ] <Fatal> BaseDaemon: ########################################
[avogar-dev] 2023.04.06 16:30:38.873167 [ 1718730 ] <Fatal> BaseDaemon: (version 23.3.1.2853, build id: 05A03133648206EEA601942CB5AD845307862AA6) (from thread 1718382) (query_id: 2481d436-44c2-415b-b2e0-ad7dd58e2226) (query: desc format(Freeform, 'Hello,,, World')) Received signal Segmentation fault (11)
[avogar-dev] 2023.04.06 16:30:38.873227 [ 1718730 ] <Fatal> BaseDaemon: Address: 0x7fb7b91feff8. Access: write. Attempted access has violated the permissions assigned to the memory area.
[avogar-dev] 2023.04.06 16:30:38.873291 [ 1718730 ] <Fatal> BaseDaemon: Stack trace: 0x177f31e1
[avogar-dev] 2023.04.06 16:30:38.875175 [ 1718730 ] <Fatal> BaseDaemon: 2. ./build_docker/./base/base/phdr_cache.cpp:65: dl_iterate_phdr @ 0x177f31e1 in /home/avogar/tmp/bin/clickhouse
[avogar-dev] 2023.04.06 16:30:38.875213 [ 1718730 ] <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
Exception on client:
Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000. (ATTEMPT_TO_READ_AFTER_EOF)

Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)

I guess there is a problem when data doesn't contain \n at the end.

Also you can experiment with the setting max_read_buffer_size to reproduce problems when the data in the buffer changes (you should also set storage_file_read_method='pread' for it)

@dmtri35
Author

dmtri35 commented May 9, 2023

Sorry for being a little late; I've been busy with school and life lately. I have addressed the majority of your comments. I will add more comments, do more testing, and think about how we should skip over delimiters.

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 04e379c to e4b2491 Compare May 9, 2023 04:07
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the submodule changed At least one submodule changed in this PR. label May 9, 2023
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented May 9, 2023

This is an automated comment for commit e31ea77 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🔴 failure

Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | 🟢 success
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | 🟢 success
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | 🟢 success
Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | 🟢 success
Docker image for servers | The check to build and optionally push the mentioned image to docker hub | 🟢 success
Docs Check | Builds and tests the documentation | 🟡 pending
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | 🟢 success
Flaky tests | Checks if newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | 🔴 failure
Install packages | Checks that the built packages are installable in a clear environment | 🟢 success
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | 🟢 success
Mergeable Check | Checks if all other necessary checks are successful | 🔴 failure
Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | 🟢 success
Push to Dockerhub | The check for building and pushing the CI related docker images to docker hub | 🟢 success
SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | 🟢 success
Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | 🟢 success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🟢 success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🔴 failure
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | 🟢 success
Style Check | Runs a set of checks to keep the code style clean. If some of the checks failed, see the related log from the report | 🟢 success
Unit tests | Runs the unit tests for different release types | 🟢 success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully start up without any errors, crashes or sanitizer asserts | 🟢 success

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 93232bc to 9439117 Compare May 9, 2023 20:33
@robot-ch-test-poll4 robot-ch-test-poll4 removed the submodule changed At least one submodule changed in this PR. label May 9, 2023
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from ad8b32e to 967da47 Compare May 9, 2023 20:50
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 25, 2024

Dear @Avogar, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@nikitamikhaylov nikitamikhaylov added the comp-formats Input/output formats (CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.). label Dec 16, 2025

Labels

can be tested Allows running workflows for external contributors comp-formats Input/output formats (CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.). pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema Inference For Freeform Text Formats

9 participants