
[RFC] Disk friendly format for inverted text index #87003

@CurtizJ

Description


Motivation

The goal is to build an inverted text index whose format is friendly for streaming processing and doesn't require loading too much data into memory.

The current implementation of the inverted index uses an FST to store the dictionary, which optimizes storage space but has a serialization format that cannot be read without loading the whole index into memory; ideally, the trade-off should be the opposite. Also, the current inverted index stores the dictionary and posting lists in segments that are decoupled from the skip index granule, which makes index analysis more complex.

The idea of the new text index is to make everything as simple as possible.

Index format

A text index is a skip index that can have arbitrary granularity. Granules are aggregated the same way as for other skip indexes and dumped to disk when the index reaches the desired granularity.

Text index has three files (and marks for them):

  • File with index granules (.idx)
  • File with dictionary blocks (.dct)
  • File with posting lists (.pst)

Each index granule accumulates tokens from all documents and collects the posting lists (indexes of documents that contain the token) for each token. Tokens are sorted and split into blocks before the granule is finalized (the block size is controlled by the index parameter dictionary_block_size). The first rows of each block form a sparse index (similar to the primary key of MergeTree). Also, all tokens are added to the bloom filter to skip the granule quickly. Posting lists are dumped first, and the offset in the file to the posting list for each token is saved. Posting lists are built and saved as Roaring Bitmaps. If the cardinality of the posting list is less than a threshold (index parameter max_cardinality_for_embedded_postings), it is embedded into the dictionary. Then, dictionary blocks are dumped, and the offset in the dictionary file to the block is saved into the sparse index.
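The finalization step described above can be sketched in Python. This is a minimal illustration, not the ClickHouse implementation: plain sets stand in for Roaring Bitmaps, a list stands in for the posting list file, and all names and parameter values are assumptions for the example.

```python
DICTIONARY_BLOCK_SIZE = 4    # stand-in for the dictionary_block_size parameter
MAX_CARD_FOR_EMBEDDED = 2    # stand-in for max_cardinality_for_embedded_postings

def finalize_granule(token_to_rows):
    """token_to_rows: dict of token -> set of row numbers containing the token."""
    tokens = sorted(token_to_rows)   # tokens are sorted before finalization
    postings_file = []               # posting lists are dumped first (.pst)
    blocks = []
    sparse_index = []                # (first token of block, block "offset")
    for i in range(0, len(tokens), DICTIONARY_BLOCK_SIZE):
        block_tokens = tokens[i:i + DICTIONARY_BLOCK_SIZE]
        block = []
        for tok in block_tokens:
            rows = token_to_rows[tok]
            if len(rows) <= MAX_CARD_FOR_EMBEDDED:
                # low-cardinality posting list is embedded into the dictionary
                block.append((tok, len(rows), ('embedded', sorted(rows))))
            else:
                offset = len(postings_file)   # record "offset" into .pst
                postings_file.append(sorted(rows))
                block.append((tok, len(rows), ('offset', offset)))
        # first row of each block forms the sparse index
        sparse_index.append((block_tokens[0], len(blocks)))
        blocks.append(block)
    return sparse_index, blocks, postings_file
```

Here the "offsets" are indexes into in-memory lists; on disk they would be byte offsets into the .pst and .dct files, and the tokens would additionally be inserted into the granule's bloom filter.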

The format of the index granule:

  • Bloom filter
  • Sparse index - a mapping (first token in block -> offset in file to the beginning of the block).

The dictionary file consists of blocks. Each block is a mapping (token -> (cardinality, header of the posting list, offset to the posting list or embedded posting list)). Tokens are saved as a plain string column; an FST can be used later to save space.
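Since tokens inside a block are sorted, a token can be located within a block by binary search. A hypothetical sketch, where a block is a sorted list of entries (token, cardinality, postings_ref) as produced above:

```python
from bisect import bisect_left

def lookup_in_block(block, token):
    """Find the dictionary entry for a token within one block, or None."""
    tokens = [entry[0] for entry in block]
    pos = bisect_left(tokens, token)
    if pos < len(tokens) and tokens[pos] == token:
        return block[pos]
    return None   # the token is absent from this granule
```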

Index analysis

During index analysis, we have a list of constant tokens. We need to:

  1. Understand whether the granule can be skipped completely
  2. Find all positions of documents that contain searched tokens
  3. Apply AND or OR to the bitmap with posting lists depending on the query type
  4. Filter the data
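Step 3 above reduces to intersecting or uniting the per-token bitmaps. A minimal sketch, with sets of row numbers standing in for Roaring Bitmaps:

```python
def combine_postings(postings, mode):
    """postings: list of sets of row numbers, one per searched token.
    mode: 'all' (AND semantics) or 'any' (OR semantics)."""
    if not postings:
        return set()
    if mode == 'all':
        result = set(postings[0])
        for p in postings[1:]:
            result &= p              # intersect: rows containing every token
        return result
    result = set()
    for p in postings:
        result |= p                  # unite: rows containing at least one token
    return result
```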

How will it be in terms of the ClickHouse codebase?

PR #81526 implemented reading and analyzing skip indexes as a step of MergeTreeReaderChain. This PR introduced a new type of MergeTree reader for reading skip indexes. Now the MergeTree reader has methods canSkipMark and readRows. canSkipMark is called before reading rows, and if it returns true, we skip the granule entirely; otherwise, we call readRows. Therefore, we have two stages of reading. An inverted index will have its own implementation of a reader that utilizes both stages.

The functions search{All/Any}(column, [token1, token2, ...]) and hasToken(column, token) are replaced with a special virtual column _<index_name>_<function_name> at the query plan optimization step. Constant tokens are pushed down to the skip index condition to build a mapping from the virtual column to the searched tokens. The virtual column is pushed down to the skip index reader and filled during reading at the PREWHERE step.

At the first stage of index analysis (canSkipMark), we do the following:

  • Read the bloom filter and sparse index from the granule.
  • Check tokens in the bloom filter.
  • Find blocks for the remaining tokens using binary search in the sparse index.
  • Read the blocks and collect offsets to the posting lists or embedded posting lists.
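The first-stage steps can be sketched as follows. This is an illustration under stated assumptions: the sparse index is a sorted list of (first token of block, offset to block) pairs, and bloom_may_contain is a stand-in for the real bloom filter check; all names are hypothetical.

```python
from bisect import bisect_right

def find_block(sparse_index, token):
    """Binary search: the last block whose first token is <= the searched token."""
    firsts = [first for first, _ in sparse_index]
    pos = bisect_right(firsts, token) - 1
    return sparse_index[pos][1] if pos >= 0 else None

def can_skip_mark(sparse_index, bloom_may_contain, tokens):
    block_offsets = {}
    for token in tokens:
        if not bloom_may_contain(token):   # bloom filter rules the token out
            continue
        offset = find_block(sparse_index, token)
        if offset is not None:
            block_offsets[token] = offset
    # If no searched token survives, the granule can be skipped (for the Any
    # mode; the All mode can skip as soon as any one token is ruled out).
    return not block_offsets, block_offsets
```

The collected block offsets are then used to read the blocks and extract the posting list offsets (or embedded posting lists) for the surviving tokens.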

At the second stage of index analysis (readRows), if a granule cannot be skipped after the first stage, we do the following:

  • Use the collected offsets to read the posting lists for the tokens.
  • Calculate the row numbers from the bitmap and fill the virtual column according to the search mode (All or Any).
  • Then, data can be filtered by the virtual column naturally as a stage of PREWHERE.
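Filling the virtual column for one granule might look like the following sketch, assuming the combined bitmap has already been computed from the posting lists (names are illustrative):

```python
def fill_virtual_column(matched_rows, granule_start, granule_rows):
    """matched_rows: set of global row numbers from the combined bitmap.
    Returns the UInt8 values of the virtual column for this granule."""
    return [1 if granule_start + i in matched_rows else 0
            for i in range(granule_rows)]
```

The resulting column is then consumed by the PREWHERE filtering step like any other filter column.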

Performance analysis

For analyzing an index in one granule, we have one seek to the beginning of the granule, and two seeks for each searched token in the worst case. The first seek is unnecessary for analyzing sequential granules; it is needed only for the first mark of the analyzed range. If a bloom filter filters the token, there are no additional seeks. If the dictionary skips the token, there is one additional seek. Also, note that we don't have pure random reads, but seeks forward in the file by relatively small ranges, which can be optimized (e.g. like it's done here).

It should be reasonably fast if we assume that reading is optimized in ReadBuffer, i.e., reads from object storage are done in big chunks, data is saved in the filesystem cache, and seeks actually happen on an SSD. To optimize random reads, we can prefetch into the filesystem cache the ranges of granules that we are going to analyze.

Consider that each index granule has 100000 unique tokens, each token is 8 bytes on average (e.g., English text has about 5-6 characters per word on average), and the block size is 256.

Then we have the size of the sparse index: (100000 / 256) * (8 + 4 + 4) = 6250 bytes.
Size of block: 256 * (8 + 4 + 4) = 4096 bytes.
The sparse index and blocks can be effectively cached in memory with an LRU policy, which allows keeping blocks with frequently searched tokens in the cache.
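The arithmetic above can be re-derived directly. The per-entry layout (8-byte token plus a 4-byte cardinality and a 4-byte offset) is the assumption stated in the text:

```python
tokens_per_granule = 100_000
entry_bytes = 8 + 4 + 4            # token + cardinality + offset
block_size = 256                   # dictionary_block_size

# one sparse index entry per block of 256 tokens
sparse_index_bytes = tokens_per_granule * entry_bytes // block_size
block_bytes = block_size * entry_bytes

print(sparse_index_bytes, block_bytes)   # 6250 4096
```

So both the sparse index of a granule and an individual block fit in a few kilobytes, which is what makes LRU caching of them cheap.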
