Delayed reading of columns #45868
Description
Given this query:
```sql
SELECT id, view_count, like_count, dislike_count, title FROM youtube ORDER BY view_count DESC LIMIT 100
```
The `title` column is very large, so the query takes a long time:
```
100 rows in set. Elapsed: 26.280 sec. Processed 4.56 billion rows, 500.52 GB (173.38 million rows/s., 19.05 GB/s.)
```
It would be much faster to read only the `view_count` column first, keeping track of the data part and the row number within that part for each record, apply `ORDER BY` and `LIMIT` to select the required rows, and then, using the same table snapshot, read the remaining columns only for those rows.
For example, this query:
```sql
SELECT view_count FROM youtube ORDER BY view_count DESC LIMIT 100
```
takes only:
```
100 rows in set. Elapsed: 6.920 sec. Processed 4.56 billion rows, 41.01 GB (658.44 million rows/s., 5.93 GB/s.)
```
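The two-phase read can be approximated manually today with the `_part` and `_part_offset` virtual columns of MergeTree tables — a sketch only, since two separate reads do not share a snapshot and may see different data if parts are merged in between:

```sql
-- Phase 1 (inner query): read only view_count plus row locations,
-- apply ORDER BY ... LIMIT.
-- Phase 2 (outer query): fetch the remaining columns for just those rows.
SELECT id, view_count, like_count, dislike_count, title
FROM youtube
WHERE (_part, _part_offset) IN
(
    SELECT _part, _part_offset
    FROM youtube
    ORDER BY view_count DESC
    LIMIT 100
)
ORDER BY view_count DESC
LIMIT 100
```

The proposal would make this transformation automatic and snapshot-consistent inside the query pipeline.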
Implementation proposal
Introduce a new column type, `ColumnLazy`, that stores a reference to the table snapshot to use and the list of data parts; its main content is (part num, row num) pairs. Multiple `ColumnLazy` instances can share the same content while representing different delayed columns.
If reading is performed from a table of the MergeTree family, the columns are heavily filtered by `ORDER BY ... LIMIT` (or in some other way), and materialization is possible on the same server, then produce `ColumnLazy` instead of reading the columns, and add an additional materialization step to the pipeline.
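A minimal C++ sketch of the data layout this implies — all names here (`RowLocation`, `LazyContent`, the snapshot placeholder) are assumptions for illustration, not actual ClickHouse classes:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical: one (part num, row num) pair identifying a row in the snapshot.
struct RowLocation
{
    uint64_t part_index;   // index into the snapshot's data-part list
    uint64_t row_number;   // row offset inside that part
};

// Hypothetical: the content shared by all lazy columns of one row set —
// the snapshot reference, the part list, and the row locations.
struct LazyContent
{
    // std::shared_ptr<const StorageSnapshot> snapshot;  // table snapshot to read from
    // DataPartsVector parts;                            // data parts of that snapshot
    std::vector<RowLocation> locations;
};

// Hypothetical ColumnLazy: shares LazyContent with other lazy columns,
// but names the specific delayed column it stands in for.
class ColumnLazy
{
public:
    ColumnLazy(std::shared_ptr<LazyContent> content_, std::string column_name_)
        : content(std::move(content_)), column_name(std::move(column_name_)) {}

    size_t size() const { return content->locations.size(); }
    const std::string & name() const { return column_name; }

    // A later materialization step in the pipeline would read `column_name`
    // at each stored location from the snapshot and replace this column
    // with a fully materialized one.

private:
    std::shared_ptr<LazyContent> content;  // shared across delayed columns
    std::string column_name;               // the delayed column this represents
};
```

Sharing one `LazyContent` keeps the per-column overhead at a pointer and a name, so delaying many wide columns costs almost nothing until materialization.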