Optimize memory consumption of the Inverted Index #1604

@generall

Is your feature request related to a problem? Please describe.
The full-text index uses an inverted index for fast filtering and retrieval.
However, the current implementation of the inverted index is naive and consumes a lot of memory,
especially when the number of documents is large and prefix tokenization is used.

Even a medium-sized dataset can quickly consume a lot of memory.
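
To illustrate why prefix tokenization inflates the index, here is a minimal sketch of a prefix tokenizer. The function name `prefix_tokens` and its `min_len` / `max_len` parameters are hypothetical and only approximate the behaviour described above; this is not the actual Qdrant tokenizer.

```rust
/// Hypothetical helper (not Qdrant's API): for every whitespace-separated
/// word, emit each prefix whose length in characters is in `min_len..=max_len`.
fn prefix_tokens(text: &str, min_len: usize, max_len: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in text.split_whitespace() {
        let word = word.to_lowercase();
        // Byte offsets just past each character, so slices stay on char boundaries.
        let ends: Vec<usize> = word
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .collect();
        for (n, &end) in ends.iter().enumerate() {
            let len = n + 1; // prefix length in characters
            if len >= min_len && len <= max_len {
                tokens.push(word[..end].to_string());
            }
        }
    }
    tokens
}

fn main() {
    // Two words turn into many tokens, each stored as its own String.
    let tokens = prefix_tokens("inverted index", 2, 8);
    println!("{} tokens: {:?}", tokens.len(), tokens);
}
```

Every word contributes up to `max_len - min_len + 1` tokens, so an index that stores each token as an owned `String` per document grows much faster than one built from whole-word tokens.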

The inverted index implementation is located here:
https://github.com/qdrant/qdrant/tree/master/lib/segment/src/index/field_index/full_text_index

Instructions on how to use it in Qdrant:
https://qdrant.tech/documentation/indexing/#full-text-index

Describe the solution you'd like

Optimize the in-memory representation of the inverted index.
Avoid using String as a key in the hash set.
Use a more compact representation of the inverted index.

Ideally, the new implementation should be able to use the existing persistent storage format, but build the inverted index in a more compact way.
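
One possible direction, sketched below, is to intern tokens: each distinct token string is stored once in a vocabulary and replaced everywhere else by a `u32` id, with postings kept as sorted vectors of document ids. The type `CompactInvertedIndex` and its methods are hypothetical names chosen for illustration. They are not part of the Qdrant codebase, and this is a sketch under those assumptions rather than a proposed final design.

```rust
use std::collections::HashMap;

/// Hypothetical compact index: token strings live only in `vocab`;
/// everything else refers to a token by its `u32` id.
#[derive(Default)]
struct CompactInvertedIndex {
    /// token string -> token id
    vocab: HashMap<String, u32>,
    /// token id -> sorted, deduplicated list of document ids
    postings: Vec<Vec<u32>>,
}

impl CompactInvertedIndex {
    /// Return the id of `token`, assigning a new one on first sight.
    fn intern(&mut self, token: &str) -> u32 {
        if let Some(&id) = self.vocab.get(token) {
            return id;
        }
        let id = self.postings.len() as u32;
        self.vocab.insert(token.to_string(), id);
        self.postings.push(Vec::new());
        id
    }

    /// Index one document given its tokens.
    fn add_document(&mut self, doc_id: u32, tokens: &[&str]) {
        for token in tokens {
            let id = self.intern(token) as usize;
            let list = &mut self.postings[id];
            // Keep the list sorted so membership checks can use binary search.
            if let Err(pos) = list.binary_search(&doc_id) {
                list.insert(pos, doc_id);
            }
        }
    }

    /// Documents containing every query token (empty if any token is unknown).
    fn search_all(&self, tokens: &[&str]) -> Vec<u32> {
        let mut lists = Vec::with_capacity(tokens.len());
        for token in tokens {
            match self.vocab.get(*token) {
                Some(&id) => lists.push(&self.postings[id as usize]),
                None => return Vec::new(),
            }
        }
        // Iterate over the shortest list and probe the others.
        lists.sort_by_key(|l| l.len());
        let Some((first, rest)) = lists.split_first() else {
            return Vec::new();
        };
        first
            .iter()
            .copied()
            .filter(|doc| rest.iter().all(|l| l.binary_search(doc).is_ok()))
            .collect()
    }
}

fn main() {
    let mut index = CompactInvertedIndex::default();
    index.add_document(0, &["vector", "search", "engine"]);
    index.add_document(1, &["full", "text", "search"]);
    index.add_document(2, &["vector", "database"]);
    println!("{:?}", index.search_all(&["vector", "search"]));
}
```

Because this structure is built purely in memory from a stream of (document id, tokens) pairs, it could in principle be populated from the existing persistent storage without changing the on-disk format. Whether it reaches the targeted memory reduction would need to be measured on real data.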

Describe alternatives you've considered
It might be necessary to change the persistent storage format as well, to make it more compact.
If you find that necessary, you will also need to implement migration logic for the existing storage.

Additional context

Acceptance criteria:

  • At least a 2x reduction in memory consumption for the inverted index; a reduction of 5x or more should actually be possible
  • Search performance should not be affected (at least not significantly)
  • Build time should not be affected (at least not significantly)
  • Functionality of the inverted index should be preserved

Experimental data:

gzip -d startups-8476456695034750907-2023-03-25-09-41-29.snapshot.gz

cargo run -r --bin qdrant -- --snapshot path/to/startups-8476456695034750907-2023-03-25-09-41-29.snapshot:startups
