
[RFC] Disk friendly format for inverted text index #87003

@CurtizJ

Description


Motivation

The goal is to build an inverted text index whose format is friendly for streaming processing and doesn't require loading too much data into memory.

The current implementation of the inverted index uses an FST to store the dictionary, which optimizes storage space but has a serialization format that cannot be read without loading the whole index into memory; ideally, the trade-off should be the opposite. Also, the current inverted index stores the dictionary and posting lists in segments that are decoupled from the skip index granule, which makes index analysis more complex.

The idea of the new text index is to make everything as simple as possible.

Index format

A text index is a skip index that can have arbitrary granularity. Granules are aggregated the same way as for other skip indexes and dumped to disk when the index reaches the desired granularity.

Text index has three files (and marks for them):

  • File with index granules (.idx)
  • File with dictionary blocks (.dct)
  • File with posting lists (.pst)

Each index granule accumulates tokens from all documents and collects the posting lists (indexes of documents that contain the token) for each token. Tokens are sorted and split into blocks before the granule is finalized (the block size is controlled by the index parameter dictionary_block_size). The first rows of each block form a sparse index (similar to the primary key of MergeTree). Also, all tokens are added to the bloom filter to skip the granule quickly. Posting lists are dumped first, and the offset in the file to the posting list for each token is saved. Posting lists are built and saved as Roaring Bitmaps. If the cardinality of the posting list is less than a threshold (index parameter max_cardinality_for_embedded_postings), it is embedded into the dictionary. Then, dictionary blocks are dumped, and the offset in the dictionary file to the block is saved into the sparse index.
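The finalization step described above can be sketched in Python. This is a minimal illustration, not the ClickHouse implementation: plain sets stand in for Roaring Bitmaps, a list stands in for the posting list file, and all names and parameter values are assumptions for the example.

```python
DICTIONARY_BLOCK_SIZE = 4    # stand-in for the dictionary_block_size parameter
MAX_CARD_FOR_EMBEDDED = 2    # stand-in for max_cardinality_for_embedded_postings

def finalize_granule(token_to_rows):
    """token_to_rows: dict of token -> set of row numbers containing the token."""
    tokens = sorted(token_to_rows)   # tokens are sorted before finalization
    postings_file = []               # posting lists are dumped first (.pst)
    blocks = []
    sparse_index = []                # (first token of block, block "offset")
    for i in range(0, len(tokens), DICTIONARY_BLOCK_SIZE):
        block_tokens = tokens[i:i + DICTIONARY_BLOCK_SIZE]
        block = []
        for tok in block_tokens:
            rows = token_to_rows[tok]
            if len(rows) <= MAX_CARD_FOR_EMBEDDED:
                # low-cardinality posting list is embedded into the dictionary
                block.append((tok, len(rows), ('embedded', sorted(rows))))
            else:
                offset = len(postings_file)   # record "offset" into .pst
                postings_file.append(sorted(rows))
                block.append((tok, len(rows), ('offset', offset)))
        # first row of each block forms the sparse index
        sparse_index.append((block_tokens[0], len(blocks)))
        blocks.append(block)
    return sparse_index, blocks, postings_file
```

Here the "offsets" are indexes into in-memory lists; on disk they would be byte offsets into the .pst and .dct files, and the tokens would additionally be inserted into the granule's bloom filter.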

The format of the index granule:

  • Bloom filter
  • Sparse index - a mapping (first token in block -> offset in file to the beginning of the block).

The dictionary file consists of blocks. Each block is a mapping (token -> (cardinality, header of the posting list, offset to the posting list or embedded posting list)). Tokens are saved as a plain string column; an FST can be used later to save space.
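Since tokens inside a block are sorted, a token can be located within a block by binary search. A hypothetical sketch, where a block is a sorted list of entries (token, cardinality, postings_ref) as produced above:

```python
from bisect import bisect_left

def lookup_in_block(block, token):
    """Find the dictionary entry for a token within one block, or None."""
    tokens = [entry[0] for entry in block]
    pos = bisect_left(tokens, token)
    if pos < len(tokens) and tokens[pos] == token:
        return block[pos]
    return None   # the token is absent from this granule
```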

Index analysis

During index analysis, we have a list of constant tokens. We need to:

  1. Understand whether the granule can be skipped completely
  2. Find all positions of documents that contain searched tokens
  3. Apply AND or OR to the bitmap with posting lists depending on the query type
  4. Filter the data
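Step 3 above reduces to intersecting or uniting the per-token bitmaps. A minimal sketch, with sets of row numbers standing in for Roaring Bitmaps:

```python
def combine_postings(postings, mode):
    """postings: list of sets of row numbers, one per searched token.
    mode: 'all' (AND semantics) or 'any' (OR semantics)."""
    if not postings:
        return set()
    if mode == 'all':
        result = set(postings[0])
        for p in postings[1:]:
            result &= p              # intersect: rows containing every token
        return result
    result = set()
    for p in postings:
        result |= p                  # unite: rows containing at least one token
    return result
```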

How will it be in terms of the ClickHouse codebase?

PR #81526 implemented reading and analyzing skip indexes as a step of MergeTreeReaderChain. This PR introduced a new type of MergeTree reader for reading skip indexes. Now the MergeTree reader has methods canSkipMark and readRows. canSkipMark is called before reading rows, and if it returns true, we skip the granule entirely; otherwise, we call readRows. Therefore, we have two stages of reading. An inverted index will have its own implementation of a reader that utilizes both stages.

The functions search{All/Any}(column, [token1, token2, ...]) and hasToken(column, token) are replaced with a special virtual column _<index_name>_<function_name> at the query plan optimization step. Constant tokens are pushed down to the skip index condition to build a mapping from the virtual column to the searched tokens. The virtual column is pushed down to the skip index reader and filled during reading at the PREWHERE step.

At the first stage of index analysis (canSkipMark), we do the following:

  • Read the bloom filter and sparse index from the granule.
  • Check tokens in the bloom filter.
  • Find blocks for the remaining tokens using binary search in the sparse index.
  • Read the blocks and collect offsets to the posting lists or embedded posting lists.
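The first-stage steps can be sketched as follows. This is an illustration under stated assumptions: the sparse index is a sorted list of (first token of block, offset to block) pairs, and bloom_may_contain is a stand-in for the real bloom filter check; all names are hypothetical.

```python
from bisect import bisect_right

def find_block(sparse_index, token):
    """Binary search: the last block whose first token is <= the searched token."""
    firsts = [first for first, _ in sparse_index]
    pos = bisect_right(firsts, token) - 1
    return sparse_index[pos][1] if pos >= 0 else None

def can_skip_mark(sparse_index, bloom_may_contain, tokens):
    block_offsets = {}
    for token in tokens:
        if not bloom_may_contain(token):   # bloom filter rules the token out
            continue
        offset = find_block(sparse_index, token)
        if offset is not None:
            block_offsets[token] = offset
    # If no searched token survives, the granule can be skipped (for the Any
    # mode; the All mode can skip as soon as any one token is ruled out).
    return not block_offsets, block_offsets
```

The collected block offsets are then used to read the blocks and extract the posting list offsets (or embedded posting lists) for the surviving tokens.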

At the second stage of index analysis (readRows), if a granule cannot be skipped after the first stage, we do the following:

  • Use the collected offsets to read the posting lists for the tokens.
  • Calculate the row numbers from the bitmap and fill the virtual column according to the search mode (All or Any).
  • Then, data can be filtered by the virtual column naturally as a stage of PREWHERE.
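Filling the virtual column for one granule might look like the following sketch, assuming the combined bitmap has already been computed from the posting lists (names are illustrative):

```python
def fill_virtual_column(matched_rows, granule_start, granule_rows):
    """matched_rows: set of global row numbers from the combined bitmap.
    Returns the UInt8 values of the virtual column for this granule."""
    return [1 if granule_start + i in matched_rows else 0
            for i in range(granule_rows)]
```

The resulting column is then consumed by the PREWHERE filtering step like any other filter column.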

Performance analysis

For analyzing an index in one granule, we have one seek to the beginning of the granule, and two seeks for each searched token in the worst case. The first seek is unnecessary for analyzing sequential granules; it is needed only for the first mark of the analyzed range. If a bloom filter filters the token, there are no additional seeks. If the dictionary skips the token, there is one additional seek. Also, note that we don't have pure random reads, but seeks forward in the file by relatively small ranges, which can be optimized (e.g. like it's done here).

It should be reasonably fast if we assume that reading is optimized in ReadBuffer, i.e., reads from object storage are done in big chunks, data is saved in the filesystem cache, and seeks actually happen on an SSD. To optimize random reads, we can prefetch into the filesystem cache the ranges of granules that we are going to analyze.

Consider that each index granule has 100000 unique tokens, each token is 8 bytes on average (e.g., English text has about 5-6 characters per word on average), and the block size is 256.

Then we have the size of the sparse index: (100000 / 256) * (8 + 4 + 4) = 6250 bytes.
Size of block: 256 * (8 + 4 + 4) = 4096 bytes.
The sparse index and blocks can be effectively cached in memory with an LRU policy, which allows keeping blocks with frequently searched tokens in the cache.
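The arithmetic above can be re-derived directly. The per-entry layout (8-byte token plus a 4-byte cardinality and a 4-byte offset) is the assumption stated in the text:

```python
tokens_per_granule = 100_000
entry_bytes = 8 + 4 + 4            # token + cardinality + offset
block_size = 256                   # dictionary_block_size

# one sparse index entry per block of 256 tokens
sparse_index_bytes = tokens_per_granule * entry_bytes // block_size
block_bytes = block_size * entry_bytes

print(sparse_index_bytes, block_bytes)   # 6250 4096
```

So both the sparse index of a granule and an individual block fit in a few kilobytes, which is what makes LRU caching of them cheap.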
