Embeddings: quantize to int8 and improve perf#50953
Conversation
69de48b to
385dda0
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 8abb1fc...1607c9d. No notifications. |
5b38ef3 to
709f6fc
Compare
|
Here are some references to give background about this approach (often called "scalar quantization" or "SQ8" when you're using the popular FAISS library):
It's a common trick both in vector search, and in developing deep learning models, where you can use lower-precision values during training. The intuition is that most information is stored across the high number of dimensions, not the precision of individual elements. |
There was a problem hiding this comment.
Why do we Dequantize before storing the embeddings?
There was a problem hiding this comment.
Great question. It's just so all the embeddings stored in blob storage still are float32s. Gob can't decode a float32 slice into an int8 slice (or vice versa), so it would fail if we have a mix of encodings.
We will probably need to write some migration code at some point to convert the storage to int8, but I felt it was best to not modify the external data in this initial PR. I started doing that, but it was pretty complex to do without loading the full float32 embeddings into memory (which would fail at Dropbox's scale).
There was a problem hiding this comment.
I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).
Would this be a big change?
There was a problem hiding this comment.
Unfortunately, this would be a pretty difficult change because we would have to stream the embeddings to blob storage as we create them since the purpose of this change is to help large embeddings fit in memory. If we collected the float32 embeddings in memory as we create them, we have the same issues as we have now that the embeddings for large repos will cause failures.
I think the real solution here is to create embeddings in chunks. That will be a significantly larger lift though.
There was a problem hiding this comment.
I think the real solution here is to create embeddings in chunks.
I see, I hadn't read up on the embedding + uploading logic yet! Makes sense not to change it for this PR, but could be worthwhile to look into as a follow-up.
jtibshirani
left a comment
There was a problem hiding this comment.
This is great! In addition to the memory savings, the ~10x speed-up is exciting. It's actually a bit unexpected for me -- I wonder if the conversion to int8 also enabled the compiler to recognize an opportunity for SIMD?
If we were being really conservative, we would maintain two separate similarity search implementations (quantized version vs. original full precision), and feature flag it. But I feel good just moving fast with this change, it's a very common technique and the recall results look good. Before approving I will also pull it down locally and run through my manual prompt tests to check there's no obvious regressions.
Last note -- could you paste the recall results into the PR description? You sent me some helpful ones over chat (recall@1, recall@10, etc.)
There was a problem hiding this comment.
Quantization is most useful for the similarity search case, where we are concerned about memory and perform a large number of similarity computations. But here, we're just comparing a vector against two other vectors. So approximating here doesn't seem worth it, and could make the comparison more error prone / a little harder to debug.
To support this, maybe we can keep a version of CosineSimilarity that works on float vectors, it doesn't seem too complex?
There was a problem hiding this comment.
@camdencheek any thoughts on this? Other than this, to me it looks ready to merge!
There was a problem hiding this comment.
Whoops, I missed this!
Seems reasonable and simple to me, will update it before merge 👍
There was a problem hiding this comment.
I just pushed a minimal commit that addresses this. I plan to clean things up a bit and make better-typed embedding vectors, but since there are so many PRs stacked on this one, I'll do that as a followup
There was a problem hiding this comment.
I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).
Would this be a big change?
709f6fc to
61a3268
Compare
This converts our embeddings to an int8 in memory. It does not convert the storage format so that we can be backwards compatible with the embeddings stored in blob storage. Also, I could see it being useful to maintain the more precise embeddings in case we see regressions in ranking from this change. Some benchmarks showing the impact on search: name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 1.89s ± 1% 0.24s ± 1% -87.27% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 965ms ± 2% 122ms ± 1% -87.39% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 493ms ± 1% 63ms ± 1% -87.17% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 262ms ± 4% 34ms ± 3% -87.06% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 255ms ± 4% 36ms ± 3% -85.82% (p=0.008 n=5+5) name old alloc/op new alloc/op delta SimilaritySearch/numWorkers=1-10 84.2kB ± 0% 83.3kB ± 0% -1.00% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 148kB ± 0% 147kB ± 0% -0.38% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 272kB ± 0% 268kB ± 0% -1.48% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 493kB ± 0% 495kB ± 0% +0.35% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 898kB ± 0% 909kB ± 0% +1.19% (p=0.008 n=5+5) name old allocs/op new allocs/op delta SimilaritySearch/numWorkers=1-10 2.08k ± 0% 2.06k ± 0% -1.25% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 3.70k ± 0% 3.69k ± 0% ~ (p=0.079 n=4+5) SimilaritySearch/numWorkers=4-10 6.84k ± 0% 6.73k ± 0% -1.63% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 12.3k ± 0% 12.4k ± 0% +0.42% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 22.1k ± 0% 22.4k ± 0% +1.64% (p=0.008 n=5+5)
Add a method to the index that returns the subslice representing a single row. I just think this is easier to reason about.
The indexing on this test was very weird. I just flipped the indexing so that the expected is just the ordered set of results. Also, I added the row num to the result which is a minimal cost but is very useful for debugging, testing, and benchmarking.
These are the changes in ranking from the change in precision. There are only two swaps of adjacent results.
This allocates a closure on every call to score, and allocates a bunch more strings than necessary when debug is enabled. name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 240ms ± 1% 238ms ± 1% ~ (p=0.056 n=5+5) SimilaritySearch/numWorkers=2-10 122ms ± 1% 122ms ± 1% ~ (p=0.690 n=5+5) SimilaritySearch/numWorkers=4-10 63.2ms ± 1% 62.1ms ± 1% -1.84% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 33.9ms ± 3% 34.2ms ± 1% ~ (p=0.222 n=5+5) SimilaritySearch/numWorkers=16-10 36.2ms ± 3% 36.0ms ± 3% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta SimilaritySearch/numWorkers=1-10 83.3kB ± 0% 66.1kB ± 0% -20.66% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 147kB ± 0% 114kB ± 0% -22.70% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 268kB ± 0% 204kB ± 0% -23.89% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 495kB ± 0% 375kB ± 0% -24.24% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 909kB ± 0% 684kB ± 0% -24.77% (p=0.008 n=5+5) name old allocs/op new allocs/op delta SimilaritySearch/numWorkers=1-10 2.06k ± 0% 2.06k ± 0% ~ (all equal) SimilaritySearch/numWorkers=2-10 3.69k ± 0% 3.69k ± 0% ~ (all equal) SimilaritySearch/numWorkers=4-10 6.73k ± 0% 6.73k ± 0% ~ (all equal) SimilaritySearch/numWorkers=8-10 12.4k ± 0% 12.4k ± 0% ~ (p=1.000 n=5+5) SimilaritySearch/numWorkers=16-10 22.4k ± 0% 22.4k ± 0% ~ (all equal)
61a3268 to
ad6a75f
Compare
This checks the lengths of the input slices ahead of time so the compiler doesn't need to insert panics on slice accesses. Bounds check elimination: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 559ms ± 0% 499ms ± 0% -10.74% (p=0.016 n=4+5) SimilaritySearch/numWorkers=2-10 283ms ± 0% 253ms ± 0% -10.65% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 146ms ± 0% 130ms ± 0% -10.52% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 73.7ms ± 0% 66.0ms ± 0% -10.45% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 83.6ms ± 2% 75.0ms ± 3% -10.29% (p=0.008 n=5+5) ``` Loop unrolling: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 499ms ± 0% 439ms ± 0% -12.12% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 253ms ± 0% 222ms ± 0% -12.21% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 130ms ± 0% 115ms ± 0% -11.99% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 66.0ms ± 0% 57.9ms ± 0% -12.20% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 75.0ms ± 3% 65.7ms ± 2% -12.36% (p=0.008 n=5+5) ``` Both: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 559ms ± 0% 439ms ± 0% -21.55% (p=0.016 n=4+5) SimilaritySearch/numWorkers=2-10 283ms ± 0% 222ms ± 0% -21.56% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 146ms ± 0% 115ms ± 0% -21.24% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 73.7ms ± 0% 57.9ms ± 0% -21.37% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 83.6ms ± 2% 65.7ms ± 2% -21.38% (p=0.008 n=5+5) ```
|
I ran through my usual manual tests, and also ran my hacky test script for context quality. Everything is unchanged with quantization 🎉 |
There was a problem hiding this comment.
Looks good to me. I know you're planning on addressing https://github.com/sourcegraph/sourcegraph/pull/50953#discussion_r1174033914 before merging.
e9332fd to
1607c9d
Compare
This cuts memory usage of the index by ~4x by using int8 quantizations of the float32 embeddings. Additionally, seemingly since int8s are so much more cache efficient, this speeds up search by ~4x. Ignore the benchmarks in the commits -- they are a bit out of date. The reduction in precision comes at a cost in the form of less accurate ranking/recall. However, we still get ~94% recall with these quantizations, which seems well within the bounds of acceptable. Notably, I chose not to use a more complex quantization scheme (like windowing to min/max) because 1) it means scores cannot be comparable across repos, and 2) it means we need to download the full embedding before quantizing, which means it has to fit in memory. On the Sourcegraph repo, compared to the float32 embeddings, int8 embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94% at Top1000. The mean rank of each embedding changed by only 1.2%. There are a few somewhat independent parts to this PR, but each is isolated to its own standalone commit. Commit messages are descriptive, so I'd recommend reviewing commit-by-commit. The messages include benchmark comparisons where appropriate to justify the changes.
This cuts memory usage of the index by ~4x by using int8 quantizations of the float32 embeddings. Additionally, seemingly since int8s are so much more cache efficient, this speeds up search by ~4x. Ignore the benchmarks in the commits -- they are a bit out of date.
The reduction in precision comes at a cost in the form of less accurate ranking/recall. However, we still get ~94% recall with these quantizations, which seems well within the bounds of acceptable. Notably, I chose not to use a more complex quantization scheme (like windowing to min/max) because 1) it means scores cannot be comparable across repos, and 2) it means we need to download the full embedding before quantizing, which means it has to fit in memory.
On the Sourcegraph repo, compared to the float32 embeddings, int8 embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94% at Top1000. The mean rank of each embedding changed by only 1.2%.
There are a few somewhat independent parts to this PR, but each is isolated to its own standalone commit. Commit messages are descriptive, so I'd recommend reviewing commit-by-commit. The messages include benchmark comparisons where appropriate to justify the changes.
Things not implemented in this PR:
Test plan
Large numbers of benchmarks to validate performance. Existing tests are updated appropriately. I ran a bunch of comparisons of recall rate against the original float32 rankings. The code for that can be found (in hacky form) at this commit. It won't run directly without some setup (sorry, couldn't make things run with both int8 and float32 at the same time), but it might be useful for reference.
All the tests and benchmarks were run on pre-existing float32 embeddings to ensure that we can still operate with the old embeddings. This PR does not cover migrating storage to int8, but we should consider that.