
Delayed reading of columns #45868

@alexey-milovidov

Description


Given this query:

SELECT id, view_count, like_count, dislike_count, title FROM youtube ORDER BY view_count DESC LIMIT 100

The title column contains large values, so the query takes a long time:

100 rows in set. Elapsed: 26.280 sec. Processed 4.56 billion rows, 500.52 GB (173.38 million rows/s., 19.05 GB/s.)

It would be much faster to read the view_count column first, keeping track of the data part and the row number within that part for each record, apply ORDER BY and LIMIT to select the required number of records, and then, using the same table snapshot, read the remaining columns only for those rows.

For example, this query:

SELECT view_count FROM youtube ORDER BY view_count DESC LIMIT 100

Only takes:

100 rows in set. Elapsed: 6.920 sec. Processed 4.56 billion rows, 41.01 GB (658.44 million rows/s., 5.93 GB/s.)

Implementation proposal

Introduce a new column type, ColumnLazy, that stores a reference to the table snapshot to use and the list of data parts; its main content is (part num, row num) pairs. Multiple ColumnLazy columns can share the same content while representing different delayed columns.

If reading is performed from a table of the MergeTree family, the result is heavily filtered by ORDER BY ... LIMIT or in some other way, and materialization is possible on the same server, produce ColumnLazy instead of reading the columns. An additional materialization step will be added to the pipeline.
