Optimize segment merging in the tsdb doc value codec #126111

The doc values codec iterates several times over the doc values instance that needs to be written to disk. When merging with index sorting enabled, this is much more expensive, because each iteration of the doc values instance performs a merge sort (to return the doc ids from the different segments in index-sort order).
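To make the cost concrete, here is a minimal sketch of why every extra pass is expensive. It does not use the Lucene or Elasticsearch APIs: `MergedDocIdSketch`, `mergeOnce`, and the `heapOps` counter are hypothetical names, and each segment is assumed to expose its doc ids as an array that is already sorted in index-sort order. Each call to `mergeOnce` stands in for one full iteration over the merged doc values instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class MergedDocIdSketch {
    // Counts heap operations so the cost of each full iteration is visible.
    public static long heapOps = 0;

    // Merges per-segment doc id arrays (each already sorted) into one sorted list.
    // A merged iterator has to redo this work on every pass over the data.
    public static List<Integer> mergeOnce(int[][] segments) {
        // Queue entries are {docId, segmentIndex, positionInSegment}.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].length > 0) {
                queue.add(new int[] {segments[s][0], s, 0});
                heapOps++;
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            heapOps++;
            merged.add(top[0]);
            int next = top[2] + 1;
            if (next < segments[top[1]].length) {
                // Advance the segment the smallest doc id came from.
                queue.add(new int[] {segments[top[1]][next], top[1], next});
                heapOps++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        int[][] segments = { {0, 3, 6}, {1, 4, 7}, {2, 5, 8} };
        // Four passes, one per reason the codec iterates the instance.
        for (int pass = 1; pass <= 4; pass++) {
            List<Integer> merged = mergeOnce(segments);
            System.out.println("pass " + pass + " -> " + merged
                + ", heap ops so far: " + heapOps);
        }
    }
}
```

The heap work grows linearly with the number of passes, which is why reducing the number of iterations is the main lever here.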

There are several reasons why the doc values instance is iterated multiple times:

- To compute the stats (number of values, number of docs with a value) required for writing values to disk.
- To write the bitset that indicates which documents have a value (indexed disi, jump table).
- To write the actual values to disk.
- To write the addresses to disk (in case docs have multiple values).

This applies to numeric doc values, and also to the ordinals of sorted (set) doc values.

The following changes should be made to address this performance issue:

- [x] Change the tsdb doc values format to allow storing `numDocsWithField` as metadata and storing the jump table after the values (#125933).
- [x] During merging, reuse the statistics from the metadata instead of computing them on the fly by creating a merged `SortedNumericDocValues` (#125403).
- [x] Keep track of documents with a value while iterating over the values and use that to write the jump table later (#126499).
- [x] Keep track of `docValueCount` while iterating over the values and use it later to write the address offsets (#126732).
- [x] Optimize merging of binary doc values by accumulating offsets and the disi, so that we iterate only once (#127278).
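
The idea behind the items above can be sketched as a single pass that records everything the later write steps need. This is only an illustration under stated assumptions, not the codec's implementation: `SinglePassSketch`, `Accumulated`, and `accumulate` are hypothetical names, the merged doc values are modeled as a plain `long[][]` indexed by doc id (an empty array meaning the doc has no value), and values are buffered in memory purely for simplicity.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SinglePassSketch {
    // Hypothetical holder for everything a writer would need after one pass.
    public static final class Accumulated {
        public long numValues;            // total number of values seen
        public int numDocsWithField;      // number of docs with at least one value
        public final BitSet docsWithField = new BitSet();         // input for the jump table
        public final List<Long> values = new ArrayList<>();       // the values themselves
        public final List<Long> addresses = new ArrayList<>();    // cumulative value offsets
    }

    // Iterates the (already merged) doc values exactly once.
    public static Accumulated accumulate(long[][] docValues) {
        Accumulated acc = new Accumulated();
        acc.addresses.add(0L); // the first doc with values starts at offset 0
        for (int docId = 0; docId < docValues.length; docId++) {
            long[] vals = docValues[docId];
            if (vals.length == 0) {
                continue; // doc has no value for this field
            }
            acc.numDocsWithField++;
            acc.docsWithField.set(docId);
            for (long v : vals) {
                acc.values.add(v);
            }
            // vals.length plays the role of docValueCount for this doc.
            acc.numValues += vals.length;
            acc.addresses.add(acc.numValues);
        }
        return acc;
    }

    public static void main(String[] args) {
        long[][] docValues = { {10}, {}, {20, 30} };
        Accumulated acc = accumulate(docValues);
        System.out.println("numValues=" + acc.numValues
            + " numDocsWithField=" + acc.numDocsWithField
            + " docsWithField=" + acc.docsWithField
            + " addresses=" + acc.addresses);
    }
}
```

With the stats, the docs-with-value set, and the address offsets all collected during the same iteration, the expensive merged iteration only has to happen once.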
