Skip to content

Schema Inference For Freeform Text Formats #34006

@alexey-milovidov

Description

@alexey-milovidov

Continuation of #14450.

Use case

Parse application logs without specifying the structure.
Parse tabular dumps with unknown format and delimiters.
Extract data from HTML tables.

Context

This task is not well defined and there is no clear well-known solution.
Every year in every database conference there is a paper about how to solve this task.
Typical solutions are tricky and error-prone. We don't pretend to be any better.

Some decent implementations exists, like https://github.com/tstack/lnav
They are usually represented by a bunch of heuristics and fine tuning for special cases.

Requirements

It should be good enough to parse newline delimited JSON, array of JSON objects and arrays,
parse CSV with arbitrary delimiters including whitespaces, parse ClickHouse Pretty format automatically,
determine if CSV contains header or not, determine optional fields in JSON,
determine types including strings, numbers and DateTimes in various formats,
parse application logs similar to ClickHouse logs,
allow error recovery with sync to next record,
correctly distinguish various escaping methods, like "CSV ""style""" and "C-\"style\"".

Describe the solution you'd like

First focus on special case when data is represented by newline delimited records.
Note: it rules out CSV with header, array of JSON objects, and even TSV produced by mysqldump.
But still ok for application logs.

Data can be parsed with Matchers. A matcher can recognize and parse some data type according to some rules.
Examples: parse integer; parse string in double quotes with C-style escaping; parse all the remaining bytes until end of line into a String. Also it can skip some bytes without producing any output. Examples: whitespaces; comma; punctuation characters.

Matchers can be modified with Combinators, for example, for any matcher we can have Optional of it that can parse empty string.
Additional option is to structure matchers as a tree. For example, a matcher can recognize XML element, but it will require subtree of other matchers to recognize the content.

Matcher also produce some score about how specific was the match. For example, successful parsing of double quoted string is better than eating the rest of the line into a string.

For every line of input data we can have possibly infinite set of ways how it can be parsed by a sequence of matchers. We should have a method to lazily iterate over this infinite set in some order that mostly correspond with the score (more specific options first).

Matchers can also have partial order by generalization. For example, a matcher for floating point number can also match unsigned integers.

To recognize the structure, we can take first N (like 100) lines, and start simultaneous iteration over sets of possible sequences of matchers for every line while saving successful solutions. We can also iterate infinite number of lines with "diagonal method". Then we can incrementally pick solutions that are identical for every line or pick near identical solutions and generalize them, then sort possible solutions by some aggregate score.

Additional context

This is too heavy, not sure if it will work and if it is possible to implement.

This method can recognize Dates like 2000-01-01 as a sequence of three integers. This is acceptable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions