Schema Inference For Freeform Text Formats

Continuation of #14450.

**Use case**

Parse application logs without specifying the structure.
Parse tabular dumps with unknown format and delimiters.
Extract data from HTML tables.

**Context**

This task is not well defined and there is no clear well-known solution.
Every year in every database conference there is a paper about how to solve this task.
Typical solutions are tricky and error-prone. We don't pretend to be any better.

Some decent implementations exists, like https://github.com/tstack/lnav
They are usually represented by a bunch of heuristics and fine tuning for special cases.

**Requirements**

It should be good enough to parse newline delimited JSON, array of JSON objects and arrays,
parse CSV with arbitrary delimiters including whitespaces, parse ClickHouse Pretty format automatically,
determine if CSV contains header or not, determine optional fields in JSON,
determine types including strings, numbers and DateTimes in various formats,
parse application logs similar to ClickHouse logs,
allow error recovery with sync to next record,
correctly distinguish various escaping methods, like `"CSV ""style"""` and `"C-\"style\""`.

**Describe the solution you'd like**

First focus on special case when data is represented by newline delimited records.
Note: it rules out CSV with header, array of JSON objects, and even TSV produced by mysqldump.
But still ok for application logs.

Data can be parsed with *Matchers*. A matcher can recognize and parse some data type according to some rules.
Examples: parse integer; parse string in double quotes with C-style escaping; parse all the remaining bytes until end of line into a String. Also it can skip some bytes without producing any output. Examples: whitespaces; comma; punctuation characters.

Matchers can be modified with *Combinators*, for example, for any matcher we can have *Optional* of it that can parse empty string.
Additional option is to structure matchers as a tree. For example, a matcher can recognize XML element, but it will require subtree of other matchers to recognize the content.

Matcher also produce some score about how specific was the match. For example, successful parsing of double quoted string is better than eating the rest of the line into a string.

For every line of input data we can have possibly infinite set of ways how it can be parsed by a sequence of matchers. We should have a method to lazily iterate over this infinite set in some order that mostly correspond with the score (more specific options first).

Matchers can also have partial order by generalization. For example, a matcher for floating point number can also match unsigned integers.

To recognize the structure, we can take first N (like 100) lines, and start simultaneous iteration over sets of possible sequences of matchers for every line while saving successful solutions. We can also iterate infinite number of lines with "diagonal method". Then we can incrementally pick solutions that are identical for every line or pick near identical solutions and generalize them, then sort possible solutions by some aggregate score.

**Additional context**

This is too heavy, not sure if it will work and if it is possible to implement.

This method can recognize Dates like 2000-01-01 as a sequence of three integers. This is acceptable.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Schema Inference For Freeform Text Formats #34006

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Schema Inference For Freeform Text Formats #34006

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions