In many of our vector search benchmarks we time pure compute - timing the vector operation when both vectors are in CPU cache. In many scenarios, e.g. HNSW, we score vectors that may not be in cache. In these scenarios it may be better to reflow the bulk scorer to process small batches (say, 4 vectors) at a time, rather than aggressively unrolling per single vector. Doing so improves the memory-level parallelism of the complete bulk operation: cache misses on several vectors can be in flight concurrently instead of stalling the loop one vector at a time. We've already seen this improve float32 vector ops in Lucene.
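A minimal sketch of the idea in Java (class and method names are hypothetical, not Lucene APIs): instead of fully computing each dot product before moving on, score a batch of 4 targets per pass with 4 independent accumulators, so loads from the 4 possibly-uncached vectors can overlap.

```java
final class BatchDotProduct {
    // Naive bulk scorer: one vector at a time. A cache miss on
    // targets[t] stalls the entire inner loop before the next
    // vector's loads can even begin.
    static void scoreOneAtATime(float[] query, float[][] targets, float[] scores) {
        for (int t = 0; t < targets.length; t++) {
            float sum = 0f;
            for (int i = 0; i < query.length; i++) {
                sum += query[i] * targets[t][i];
            }
            scores[t] = sum;
        }
    }

    // Batched bulk scorer: 4 independent accumulators per pass.
    // The loads from v0..v3 carry no data dependence on each other,
    // so misses on the 4 target vectors can be serviced concurrently,
    // improving memory-level parallelism of the bulk operation.
    static void scoreBatched(float[] query, float[][] targets, float[] scores) {
        int t = 0;
        for (; t + 4 <= targets.length; t += 4) {
            float[] v0 = targets[t], v1 = targets[t + 1];
            float[] v2 = targets[t + 2], v3 = targets[t + 3];
            float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
            for (int i = 0; i < query.length; i++) {
                float q = query[i];
                s0 += q * v0[i];
                s1 += q * v1[i];
                s2 += q * v2[i];
                s3 += q * v3[i];
            }
            scores[t] = s0;
            scores[t + 1] = s1;
            scores[t + 2] = s2;
            scores[t + 3] = s3;
        }
        // Tail: fewer than 4 vectors remain.
        for (; t < targets.length; t++) {
            float sum = 0f;
            for (int i = 0; i < query.length; i++) {
                sum += query[i] * targets[t][i];
            }
            scores[t] = sum;
        }
    }
}
```

Note this is the opposite trade-off from aggressive per-vector unrolling: the unroll factor here spans *vectors*, not elements of one vector, which is what lets the independent memory streams overlap.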