Schema inference for freeform text formats#43449
Schema inference for freeform text formats#43449dmtri35 wants to merge 37 commits intoClickHouse:masterfrom
Conversation
Why then your first example is not detecting DateTime? It is CSV with single delimiter - IMHO it should have higher score than custom delimited with one of |
Yes, I think it should too but currently there are a few reasons why this happen:
1 2021-04-01 00:00:18 2021-04-01 00:21:54 1 8.4 1 N 79 116 1 25.5 3 0.5 5.85 0 0.3 35.15 2.5Additionally if I change the example to 1,"2021-04-01 00:00:18","2021-04-01 00:21:54",1,8.40,1,N,79,116,1,25.5,3,0.5,5.85,0,0.3,35.15,2
1,"2021-04-01 00:42:37","2021-04-01 00:46:23",1,.90,1,N,75,236,2,5,3,0.5,0,0,0.3,8.8,2.5we get the desired result: I will try to think of a way to get this to be closer to being parsed as CSV, if you have any suggestions, let me know! I think one possible way is to reuse the matcher of the previous field before we iterate over the list of matchers. Edit: With the new commits (increasing the search tree size), this has been somewhat mitigated |
9a7bc1b to
cf0daa7
Compare
d9d1f29 to
084689e
Compare
03fbd7d to
9b5966e
Compare
|
Sorry for waiting for so long. I will review this PR next week. Feel free to ping me. |
|
@Avogar Could you have a look at my PR? Thanks! |
|
I am in process. Already took a look several times. Just need some time to understand all logic. |
There was a problem hiding this comment.
I am not sure that we want to skip multiple ',' and : with possible spaces between them as one delimiter. For example in CSV format we can have empty field and 'abc,,def'will represent 3 columns (maybe we can support it in Freedom format somehow?).
Also we can add more possible delimiters like '|', ';', maybe some other punctuation symbols or their combinations (not sure)
Maybe do smth like this:
skipWhitespaceIfAny(in);
/// Skip possible delimiters like ',', '|', ':', etc or their combinations.
skipWhitespaceIfAny(in);
There was a problem hiding this comment.
Good point, I will try to experiment if we could really support empty fields, as the type we inferred from them will be null I assumed?
|
I will make a few more iterations of review later. Now the most important think is how we work with read buffer (see comments). Need to fix it first. Also, please, do more tests. I ran a few simple tests and most of them just crashed client: I guess there is a problem when data doesn't contain Also you can experiment with setting |
|
Sorry for being a little late, I've been busy with school and life lately. I have addressed the majority of your comments. I will add more comments, do more testing and think about how should we skip over delimiters. |
04e379c to
e4b2491
Compare
|
This is an automated comment for commit e31ea77 with description of existing statuses. It's updated for the latest CI running
|
93232bc to
9439117
Compare
parseFields now returns std::vector<String> to allow a matcher to parse multiple fields at once
Refactor parseFields to return both the column name and value, this allows us to use JSON field names as column names
Use a map to map column names to column index, this allow JSON fields to be shuffled
Checkout style changes, formatting, and use std::tie
This is needed for JSONFieldMatcher as it could parse root fields into different columns
ad8b32e to
967da47
Compare
|
Dear @Avogar, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement schema inference for row freeform text format.
Closes #34006
Once this is done you should be able to ingest arbitrary tabular text files (ie. application logs) without specifying the structure (Template).
Example usage:
Read data from file
Describe data from file
How it is implemented
FreeformFieldMatcherhouses an array of matchers that parse text data by escaping rules (JSON, CSV, Raw(but cutoff by whitespace), Quoted and Escaped).First, FreeformFieldMatcher will iterate over this set of matchers and generate as many solutions as possible.
A
Solutionis defined as the order of the matchers to parse one row successfully associated with a score. The score here is generated by continuously adding the score of each field in the solution, specifically by the escaping rule and the type of the field. More specific fields (ie. DateTime) yield higher scores.For now we only generate solutions for the first row because the data is supposed to be tabular to we should be able to get an acceptable solution just by looking at one row. Then the solutions are sorted by score and they are check against
max_rows_to_check. The first solution matches all of themax_rows_to_checkwould be picked as the final solution.Currently this works with CSV and syslog but more testing and tunings needs to be made for it to be in an acceptable state.
From
we have
From
we have
Update 2022-28-11:
With the new changes on increasing the search tree size, we now iterate over all matchers and don't exit early. However, a new field is added only if it has a better type score than the previous field or if no field other than type string has been found. This helps us to better parse the previous CSV example:
And also ClickHouse's logs:
Update 2022-30-12:
Now we could parse and unmarshal JSON object fields into their own columns.
TODO: