
Schema inference for freeform text formats#43449

Open
dmtri35 wants to merge 37 commits into ClickHouse:master from dmtri35:heuristics-inference

Conversation

@dmtri35

@dmtri35 dmtri35 commented Nov 21, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implement schema inference for freeform row-based text formats.

Closes #34006

Once this is done, you should be able to ingest arbitrary tabular text files (e.g. application logs) without specifying the structure (as the Template format requires).

Example usage:

Read data from file

SELECT * FROM file(<file_path>,'Freeform')

Describe data from file

Desc file(<file_path>,'Freeform')

How it is implemented

FreeformFieldMatcher houses an array of matchers that parse text data by escaping rule (JSON, CSV, Raw (but cut off at whitespace), Quoted, and Escaped).

First, FreeformFieldMatcher will iterate over this set of matchers and generate as many solutions as possible.

A Solution is defined as an order of matchers that parses one row successfully, associated with a score. The score is accumulated by adding the score of each field in the solution, determined by the field's escaping rule and type. More specific types (e.g. DateTime) yield higher scores.

For now we only generate solutions for the first row: since the data is supposed to be tabular, we should be able to get an acceptable solution just by looking at one row. The solutions are then sorted by score and checked against max_rows_to_check rows. The first solution that matches all of the max_rows_to_check rows is picked as the final solution.
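The scoring-and-validation loop described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PR's C++ code; the type scores, the matcher interface, and all names are assumptions:

```python
# Illustrative sketch of the solution search described above.
# TYPE_SCORES and the matcher interface are assumptions, not
# ClickHouse's actual implementation.

# More specific types yield higher scores.
TYPE_SCORES = {"DateTime": 4, "Float64": 3, "Int64": 2, "String": 1}

def score(solution):
    """A solution's score is the sum of its per-field type scores."""
    return sum(TYPE_SCORES[field_type] for field_type in solution)

def pick_solution(solutions, rows, matches, max_rows_to_check):
    """Sort candidate solutions by score (best first) and return the first
    one whose matcher order parses all of the first max_rows_to_check rows."""
    for solution in sorted(solutions, key=score, reverse=True):
        if all(matches(solution, row) for row in rows[:max_rows_to_check]):
            return solution
    return None
```

The key design point is that candidates are generated cheaply from the first row only, and the more expensive validation against further rows happens in score order, so the most specific surviving solution wins.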

Currently this works with CSV and syslog, but more testing and tuning are needed for it to reach an acceptable state.

From

1,2021-04-01 00:00:18,2021-04-01 00:21:54,1,8.40,1,N,79,116,1,25.5,3,0.5,5.85,0,0.3,35.15,2.5
1,2021-04-01 00:42:37,2021-04-01 00:46:23,1,.90,1,N,75,236,2,5,3,0.5,0,0,0.3,8.8,2.5

we have

1       2021    4       1       0       0       18      2021    4       1       0       21      54      1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15   2.5
1       2021    4       1       0       42      37      2021    4       1       0       46      23      1       0.9     1       N       75      236     2       5       3       0.5     0       0       0.3     8.8     2.5

From

Nov 13 10:29:56 Tri chronyd[227]: Selected source PHC0
Nov 13 13:24:02 Tri kernel: [91566.562562] hv_utils: TimeSync IC version 4.0
Nov 13 13:24:02 Tri chronyd[227]: Forward time jump detected!

we have

Nov     13      10      29      56      Tri     chronyd[227]:   Selected source PHC0
Nov     13      13      24      2       Tri     kernel: [91566.562562] hv_utils: TimeSync IC version 4.0
Nov     13      13      24      2       Tri     chronyd[227]:   Forward time jump detected!

Update 2022-11-28:

With the new changes increasing the search tree size, we now iterate over all matchers and don't exit early. However, a new field is added only if it has a better type score than the previous field, or if no field other than type String has been found. This helps us better parse the previous CSV example:

1       2021-04-01      0       0       18      2021-04-01      0       21      54      1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15   2.5
1       2021-04-01      0       42      37      2021-04-01      0       46      23      1       0.9     1       N       75      236     2       5       3       0.5     0       0       0.3     8.8     2.5

And also ClickHouse's logs:

2022-11-17      11      7       30.825405       [395843]        {345bf223-db80-4772-9a07-73542321e715::202211_15349_15604_53}   <Debug> MergeTask::PrepareStage:        Merging 6 parts: from 202211_15349_15599_52 to 202211_15604_15604_0 into Compact
2022-11-17      11      7       30.826372       [395843]        {345bf223-db80-4772-9a07-73542321e715::202211_15349_15604_53}   <Debug> MergeTask::PrepareStage:        Selected MergeAlgorithm: Horizontal
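The field-acceptance rule described in the update can be sketched like this (a hypothetical Python illustration; the names and scores are assumptions, not the PR's code):

```python
# Hypothetical sketch of the field-acceptance rule described above:
# after iterating over all matchers for one position, keep the candidate
# with the best type score. Since String scores lowest, any non-String
# match replaces a String-only result.

TYPE_SCORES = {"DateTime": 4, "Float64": 3, "Int64": 2, "String": 1}

def choose_field(candidate_types):
    """Return the candidate type with the highest score, or None."""
    best = None
    for field_type in candidate_types:
        if best is None or TYPE_SCORES[field_type] > TYPE_SCORES[best]:
            best = field_type
    return best
```

Compared with the earlier early-exit behaviour, this keeps searching all matchers and only commits to the most specific match found.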

Update 2022-12-30:

We can now parse and unmarshal JSON object fields into their own columns.
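As a rough illustration of unmarshalling a JSON object field into per-key columns (a sketch with assumed behaviour and names, not the PR's implementation):

```python
import json

# Sketch (assumed behaviour, not the PR's code): flatten a JSON object
# field into one column per key, joining nested keys with dots.

def json_field_to_columns(raw, prefix=""):
    obj = json.loads(raw) if isinstance(raw, str) else raw
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(json_field_to_columns(value, name))
        else:
            columns[name] = value
    return columns
```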
TODO:

  • Make use of the SchemaCache to store the solution
  • Fix the newly added test

@CLAassistant

CLAassistant commented Nov 21, 2022

CLA assistant check
All committers have signed the CLA.

@robot-clickhouse robot-clickhouse added the pr-feature Pull request with new product feature label Nov 21, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Nov 21, 2022
@Avogar Avogar self-assigned this Nov 21, 2022
@qoega
Member

qoega commented Nov 22, 2022

More specific fields (ie. DateTime) yield higher scores.

Why then is your first example not detecting DateTime? It is CSV with a single delimiter - IMHO it should have a higher score than custom-delimited with one of ,:\ as delimiter

@dmtri35
Author

dmtri35 commented Nov 22, 2022

More specific fields (ie. DateTime) yield higher scores.

Why then is your first example not detecting DateTime? It is CSV with a single delimiter - IMHO it should have a higher score than custom-delimited with one of ,:\ as delimiter

Yes, I think it should too, but currently there are a few reasons why this happens:

  • Matchers are ordered by priority, and right now the JSON matcher is put before CSV, so it generates its result first. So in 2021-04-01 00:00:18, 2021 would be matched as an Int64 first.
  • There is an early exit path in the algorithm such that any field with a type other than String is allowed to exit, and Int64 satisfies this. The purpose of this early exit is to cut down the search tree and runtime. So in this case, we stop at the JSON matcher and move on to the next field.
  • I have tried running this example with the CSV input format, and although it takes in the whole 2021-04-01 00:00:18 field, it is still inferred as a String. That's because of this line, which only allows us to infer it as DateTime if there's a quote.
1       2021-04-01 00:00:18     2021-04-01 00:21:54     1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0       0.3     35.15      2.5
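The early-exit behaviour in the second bullet can be sketched as follows (an illustrative Python sketch with an assumed matcher interface, not the actual code):

```python
# Illustrative sketch (assumed matcher interface) of the early-exit path:
# matchers are tried in priority order and the search stops at the first
# non-String match, which prunes the search tree but can lock in a less
# specific parse (e.g. Int64 instead of DateTime).

def match_field_early_exit(matchers, text):
    """Return the first (value, type) match whose type is not String."""
    for matcher in matchers:  # priority order, e.g. JSON before CSV
        field = matcher(text)
        if field is not None and field[1] != "String":
            return field  # early exit on any non-String type
    return None
```

With such a scheme, a matcher that greedily returns ("2021", "Int64") shadows a later matcher that would have consumed the full DateTime, which is exactly the behaviour discussed above.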

Additionally if I change the example to

1,"2021-04-01 00:00:18","2021-04-01 00:21:54",1,8.40,1,N,79,116,1,25.5,3,0.5,5.85,0,0.3,35.15,2
1,"2021-04-01 00:42:37","2021-04-01 00:46:23",1,.90,1,N,75,236,2,5,3,0.5,0,0,0.3,8.8,2.5

we get the desired result:

1       2021-04-01 00:00:18.000000000   2021-04-01 00:21:54.000000000   1       8.4     1       N       79      116     1       25.5    3       0.5     5.85    0 0.3      35.15   2
1       2021-04-01 00:42:37.000000000   2021-04-01 00:46:23.000000000   1       0.9     1       N       75      236     2       5       3       0.5     0       0 0.3      8.8     2.5

I will try to think of a way to get this closer to being parsed as CSV; if you have any suggestions, let me know! I think one possible way is to reuse the matcher of the previous field before we iterate over the list of matchers.

Edit: With the new commits (increasing the search tree size), this has been somewhat mitigated

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 9a7bc1b to cf0daa7 Compare November 26, 2022 22:59
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from d9d1f29 to 084689e Compare December 29, 2022 15:36
@dmtri35 dmtri35 marked this pull request as ready for review December 31, 2022 02:50
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 03fbd7d to 9b5966e Compare March 30, 2023 23:17
@Avogar
Member

Avogar commented Mar 31, 2023

Sorry for the long wait. I will review this PR next week. Feel free to ping me.

@dmtri35
Author

dmtri35 commented Apr 5, 2023

@Avogar Could you have a look at my PR? Thanks!

@Avogar
Member

Avogar commented Apr 6, 2023

I am in the process. I've already taken a look several times; I just need some time to understand all the logic.
Looks good in general, but some details need to be changed/discussed. I will write comments today/tomorrow.

Comment on lines 31 to 32
Member

@Avogar Avogar Apr 6, 2023


I am not sure that we want to skip multiple ',' and ':' with possible spaces between them as one delimiter. For example, in CSV format we can have an empty field, and 'abc,,def' will represent 3 columns (maybe we can support it in the Freeform format somehow?).
Also we can add more possible delimiters like '|', ';', maybe some other punctuation symbols or their combinations (not sure).
Maybe do something like this:

skipWhitespaceIfAny(in);
/// Skip possible delimiters like ',', '|', ':', etc or their combinations.
skipWhitespaceIfAny(in);

Author


Good point. I will experiment to see whether we can really support empty fields; I assume the type we infer from them would be Nullable?

@Avogar
Member

Avogar commented Apr 6, 2023

I will make a few more iterations of review later. Right now the most important thing is how we work with the read buffer (see comments); that needs to be fixed first. Also, please do more tests. I ran a few simple tests and most of them just crashed the client:

avogar-dev :) desc format(Freeform, 'Hello,,, World')

DESCRIBE TABLE format(Freeform, 'Hello,,, World')

Query id: 2481d436-44c2-415b-b2e0-ad7dd58e2226

[avogar-dev] 2023.04.06 16:30:38.873029 [ 1718730 ] <Fatal> BaseDaemon: ########################################
[avogar-dev] 2023.04.06 16:30:38.873167 [ 1718730 ] <Fatal> BaseDaemon: (version 23.3.1.2853, build id: 05A03133648206EEA601942CB5AD845307862AA6) (from thread 1718382) (query_id: 2481d436-44c2-415b-b2e0-ad7dd58e2226) (query: desc format(Freeform, 'Hello,,, World')) Received signal Segmentation fault (11)
[avogar-dev] 2023.04.06 16:30:38.873227 [ 1718730 ] <Fatal> BaseDaemon: Address: 0x7fb7b91feff8. Access: write. Attempted access has violated the permissions assigned to the memory area.
[avogar-dev] 2023.04.06 16:30:38.873291 [ 1718730 ] <Fatal> BaseDaemon: Stack trace: 0x177f31e1
[avogar-dev] 2023.04.06 16:30:38.875175 [ 1718730 ] <Fatal> BaseDaemon: 2. ./build_docker/./base/base/phdr_cache.cpp:65: dl_iterate_phdr @ 0x177f31e1 in /home/avogar/tmp/bin/clickhouse
[avogar-dev] 2023.04.06 16:30:38.875213 [ 1718730 ] <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
Exception on client:
Code: 32. DB::Exception: Attempt to read after eof: while receiving packet from localhost:9000. (ATTEMPT_TO_READ_AFTER_EOF)

Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)

I guess there is a problem when data doesn't contain \n at the end.

Also you can experiment with the setting max_read_buffer_size to reproduce problems when the data in the buffer changes (you should also set storage_file_read_method='pread' for it)

@dmtri35
Author

dmtri35 commented May 9, 2023

Sorry for being a little late; I've been busy with school and life lately. I have addressed the majority of your comments. I will add more comments, do more testing, and think about how we should skip over delimiters.

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 04e379c to e4b2491 Compare May 9, 2023 04:07
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the submodule changed At least one submodule changed in this PR. label May 9, 2023
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented May 9, 2023

This is an automated comment for commit e31ea77 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🔴 failure

Check name | Description | Status
AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | 🟢 success
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | 🟢 success
ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | 🟢 success
Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | 🟢 success
Docker image for servers | The check to build and optionally push the mentioned image to docker hub | 🟢 success
Docs Check | Builds and tests the documentation | 🟡 pending
Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | 🟢 success
Flaky tests | Checks if newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | 🔴 failure
Install packages | Checks that the built packages are installable in a clear environment | 🟢 success
Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | 🟢 success
Mergeable Check | Checks if all other necessary checks are successful | 🔴 failure
Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | 🟢 success
Push to Dockerhub | The check for building and pushing the CI related docker images to docker hub | 🟢 success
SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | 🟢 success
Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | 🟢 success
Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🟢 success
Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | 🔴 failure
Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | 🟢 success
Style Check | Runs a set of checks to keep the code style clean. If some of the checks failed, see the related log from the report | 🟢 success
Unit tests | Runs the unit tests for different release types | 🟢 success
Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully start up without any errors, crashes or sanitizer asserts | 🟢 success

@dmtri35 dmtri35 force-pushed the heuristics-inference branch from 93232bc to 9439117 Compare May 9, 2023 20:33
@robot-ch-test-poll4 robot-ch-test-poll4 removed the submodule changed At least one submodule changed in this PR. label May 9, 2023
@dmtri35 dmtri35 force-pushed the heuristics-inference branch from ad8b32e to 967da47 Compare May 9, 2023 20:50
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 25, 2024

Dear @Avogar, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@nikitamikhaylov nikitamikhaylov added the comp-formats Input/output formats (CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.). label Dec 16, 2025

Labels

can be tested Allows running workflows for external contributors comp-formats Input/output formats (CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.). pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Schema Inference For Freeform Text Formats

9 participants