Improve vector search speed by using FixedBitSet #12789
benwtrent wants to merge 2 commits into apache:main
Conversation
I can believe that FixedBitSet is faster in some cases, but it's surprising to me that the memory usage of SparseFixedBitSet can go up to 2x that of FixedBitSet; this makes me wonder if … Separately, I wonder if you know the number of nodes that get visited (i.e. the number of bits that end up being set) in your benchmark? Is it a force-merged index, or do you have multiple segments?
@jpountz I re-ran my tests and double-checked my numbers, and I have some corrections: I accidentally double-counted sparse sizes, so the previous numbers are 2x too big.

GLOVE-100-100_000: …

Cohere-768-400_000: …

EDIT: @jpountz to confirm the sparse fixed bitset memory usage, I rewrote `ramBytesUsed` to be exact (instead of summing up the estimation). Obviously, this ran slightly slower, but from what I found, it didn't reduce the memory estimation. I still got …
Thanks, the numbers make more sense to me now. Intuitively, … Is it possible to estimate the order of the number of nodes that a nn search needs to visit, so that we could use it as a threshold?
@jpountz searching scales logarithmically, but we do have to explore more if there are any pre-filtered nodes. We can run some experiments to determine the appropriate threshold. I imagine it will be something along the lines of …
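A minimal sketch of such a visited-count threshold, in plain Java rather than Lucene code (the 16-bytes-per-visited-node sparse cost and the `preferSparse` name are assumptions for illustration, not Lucene's actual accounting):

```java
// Hypothetical heuristic: pick the bitset implementation by comparing the
// fixed, up-front cost of a dense bitmap against a pessimistic estimate of
// the sparse structure's growth as the search sets bits.
public class VisitedBitSetChoice {
    // A dense bitmap of maxDoc bits always needs about maxDoc/8 bytes.
    static long fixedBytes(int maxDoc) {
        long words = ((maxDoc - 1L) >> 6) + 1; // 64 bits per long
        return words * Long.BYTES;
    }

    // Pessimistic sparse estimate: every visited node lands in its own
    // 64-bit word, costing the word plus some per-block bookkeeping
    // (~16 bytes is an assumed, illustrative constant).
    static boolean preferSparse(int maxDoc, long expectedVisited) {
        return expectedVisited * 16 < fixedBytes(maxDoc);
    }

    public static void main(String[] args) {
        // With ~1,000 visited candidates, dense wins on a 100k-doc segment
        // but sparse wins on a 10M-doc segment.
        System.out.println(preferSparse(100_000, 1_000));    // false
        System.out.println(preferSparse(10_000_000, 1_000)); // true
    }
}
```

The crossover is what the threshold discussion above is after: since the dense bitmap's cost is fixed by segment size while HNSW's visited count grows only logarithmically (modulo filtering), larger segments favor the sparse set.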
++ This feels similar to …
While doing some performance testing and digging into flamegraphs, I noticed that for smaller vectors (96-dim float32) we were losing a fair bit of time within the `SparseFixedBitSet#getAndSet` method. I am assuming we are using `SparseFixedBitSet` to reduce memory usage?

I ran some tests with topK=100 and fanOut=1000. To check memory usage, in a separate run I printed out `bitSet.ramBytesUsed()` after every search.

I tested using `FixedBitSet` instead with GLOVE and saw almost a 10% improvement in search speed:
Vs. baseline
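To illustrate why the dense variant is cheaper in the `getAndSet` hotspot mentioned above, here is a toy model in plain Java (not Lucene's actual implementation): the dense path is one array index, one mask, and one read-modify-write, whereas `SparseFixedBitSet` must first locate or allocate the 4096-bit block for the doc and compute a popcount-based rank to find the word inside it.

```java
// Toy dense getAndSet: mark a doc as visited and report whether it
// had already been visited, using a flat long[] bitmap.
public class DenseGetAndSet {
    static boolean getAndSet(long[] bits, int index) {
        int wordNum = index >> 6;        // which 64-bit word holds this bit
        long mask = 1L << index;         // Java shifts use only the low 6 bits
        boolean previouslySet = (bits[wordNum] & mask) != 0;
        bits[wordNum] |= mask;
        return previouslySet;
    }

    public static void main(String[] args) {
        long[] bits = new long[(100_000 >> 6) + 1]; // ~12.5 KB for 100k docs
        System.out.println(getAndSet(bits, 42));    // false: first visit
        System.out.println(getAndSet(bits, 42));    // true: already visited
    }
}
```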
The total `ramBytesUsed` (allocated and then GC'd) for 1000 searches over GLOVE was `21288656` bytes. With `FixedBitSet`, every search allocates only `12544` bytes, which pans out to `12544000` bytes total (actually less than sparse).

To confirm this was still true for larger vectors and a larger graph, I tested against 400k Cohere vectors (same params). There is a bit more noise in the measurements, so I averaged the latency over 4 runs:
candidate: `6.115`, with a min of `5.96`
baseline: `6.23`, with a min of `6.15`

Total memory usage for sparse: `103982464` vs. total memory usage for fixed: `50040000`.

Do we know the goal of using a `SparseFixedBitSet`, and under what conditions it would actually perform better than a regular `FixedBitSet`? I will happily test some more.
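The fixed-bitset numbers above can be back-of-enveloped from `maxDoc` alone, since a dense bitmap allocates one long per 64 docs regardless of how many bits get set. A small sanity-check sketch (the few dozen bytes of per-object header overhead implied by the reported totals are an assumption, not measured here):

```java
// Per-search footprint of a dense bitmap as a pure function of segment size.
public class FixedBitSetFootprint {
    static long backingArrayBytes(int maxDoc) {
        long words = ((maxDoc - 1L) >> 6) + 1; // one long per 64 docs
        return words * Long.BYTES;
    }

    public static void main(String[] args) {
        // GLOVE segment, 100k docs: ~12.5 KB per search, matching the
        // reported 12544 bytes/search up to object-header overhead.
        System.out.println(backingArrayBytes(100_000)); // 12504
        // Cohere segment, 400k docs: ~50 KB per search; 1000 searches give
        // ~50 MB, matching the reported 50040000-byte total.
        System.out.println(backingArrayBytes(400_000)); // 50000
    }
}
```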