Optimize memory consumption of the Inverted Index #1604
Is your feature request related to a problem? Please describe.
The full-text index uses an inverted index to make filtering and retrieval fast.
But the current implementation of the inverted index is naive and consumes a lot of memory,
especially when the number of documents is large and prefix tokenization is used.
Even a medium-sized dataset can quickly consume a lot of memory.
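To illustrate why prefix tokenization inflates the index, here is a minimal sketch of a prefix tokenizer. The function name `prefix_tokens` and its `min_len`/`max_len` parameters are hypothetical and are not taken from the Qdrant code base; the point is only that a single word produces one token per prefix length, and each of those tokens needs its own entry in the index.

```rust
/// Hypothetical helper (not Qdrant's tokenizer): emit every prefix of `word`
/// whose length in characters lies in `min_len..=max_len`.
fn prefix_tokens(word: &str, min_len: usize, max_len: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut char_count = 0;
    for (idx, ch) in word.char_indices() {
        char_count += 1;
        if char_count > max_len {
            break;
        }
        if char_count >= min_len {
            // Slice up to the end of the current character to stay on a UTF-8 boundary.
            let end = idx + ch.len_utf8();
            tokens.push(word[..end].to_string());
        }
    }
    tokens
}
```

With `min_len = 2` and `max_len = 10`, the word `qdrant` expands into five tokens (`qd`, `qdr`, `qdra`, `qdran`, `qdrant`), so the number of index entries per document grows roughly with the word length rather than with the number of words.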
The inverted index is located here:
https://github.com/qdrant/qdrant/tree/master/lib/segment/src/index/field_index/full_text_index
Instructions on how to use it in Qdrant:
https://qdrant.tech/documentation/indexing/#full-text-index
Describe the solution you'd like
Optimize the in-memory representation of the inverted index.
Avoid using String as a key in the hash set.
Use a more compact representation of the inverted index.
Ideally, the new implementation should be able to use the existing persistent storage format, but build the in-memory inverted index in a more compact way.
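One possible direction, shown here only as a rough sketch and not as the intended design, is to intern each token into a small integer id once and keep the posting lists as sorted vectors of integer ids. All names below (`CompactInvertedIndex`, `TokenId`, `PointId`, `index_document`, `search_all`) are hypothetical and do not reflect the existing Qdrant types. The vocabulary still stores each token string once; the savings in this sketch come from the postings holding plain `u32` values in contiguous memory.

```rust
use std::collections::HashMap;

// Assumption: 32-bit ids are wide enough for both tokens and points.
type TokenId = u32;
type PointId = u32;

#[derive(Default)]
struct CompactInvertedIndex {
    /// Each distinct token string is stored exactly once and mapped to an id.
    vocab: HashMap<String, TokenId>,
    /// postings[token_id] is a sorted, deduplicated list of point ids.
    postings: Vec<Vec<PointId>>,
}

impl CompactInvertedIndex {
    /// Return the id of `token`, assigning a fresh one on first sight.
    fn token_id(&mut self, token: &str) -> TokenId {
        if let Some(&id) = self.vocab.get(token) {
            return id;
        }
        let id = self.postings.len() as TokenId;
        self.vocab.insert(token.to_string(), id);
        self.postings.push(Vec::new());
        id
    }

    /// Register `point` under every token of the document.
    fn index_document(&mut self, point: PointId, tokens: &[&str]) {
        for token in tokens {
            let id = self.token_id(token) as usize;
            let list = &mut self.postings[id];
            // Keep the list sorted so lookups can use binary search.
            if let Err(pos) = list.binary_search(&point) {
                list.insert(pos, point);
            }
        }
    }

    /// Return the points that contain all of the given tokens.
    fn search_all(&self, tokens: &[&str]) -> Vec<PointId> {
        let mut lists: Vec<&Vec<PointId>> = Vec::new();
        for token in tokens {
            match self.vocab.get(*token) {
                Some(&id) => lists.push(&self.postings[id as usize]),
                None => return Vec::new(), // unknown token: no point can match
            }
        }
        // Scan the shortest list and probe the others.
        lists.sort_by_key(|l| l.len());
        let Some((first, rest)) = lists.split_first() else {
            return Vec::new();
        };
        first
            .iter()
            .copied()
            .filter(|p| rest.iter().all(|l| l.binary_search(p).is_ok()))
            .collect()
    }
}
```

Because the points are indexed in increasing id order in the common case, the sorted insert usually appends at the end. Sorted integer lists also leave room for further compression (for example delta encoding), which this sketch does not attempt.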
Describe alternatives you've considered
It might be necessary to change the persistent storage format as well, to make it more compact.
If you find that necessary, you will also need to implement migration logic for the existing storage.
Additional context
Acceptance criteria:
- At least a 2x reduction in memory consumption for the inverted index; a reduction of 5x or more should actually be possible
- Search performance should not be affected (at least not significantly)
- Build time should not be affected (at least not significantly)
- All existing functionality of the inverted index should be preserved
Experimental data:
- Feel free to use this snapshot for testing: https://storage.googleapis.com/qdrant-common-shared/startups-8476456695034750907-2023-03-25-09-41-29.snapshot.gz
gzip -d startups-8476456695034750907-2023-03-25-09-41-29.snapshot.gz
cargo run -r --bin qdrant -- --snapshot path/to/startups-8476456695034750907-2023-03-25-09-41-29.snapshot:startups