Embeddings: quantize to int8 and improve perf by camdencheek · Pull Request #50953 · sourcegraph/sourcegraph-public-snapshot

camdencheek · 2023-04-20T23:31:02Z

This cuts memory usage of the index by ~4x by using int8 quantizations of the float32 embeddings. Additionally, seemingly since int8s are so much more cache efficient, this speeds up search by ~4x. Ignore the benchmarks in the commits -- they are a bit out of date.

The reduction in precision comes at a cost in the form of less accurate ranking/recall. However, we still get ~94% recall with these quantizations, which seems well within the bounds of acceptable. Notably, I chose not to use a more complex quantization scheme (like windowing to min/max) because 1) it means scores cannot be comparable across repos, and 2) it means we need to download the full embedding before quantizing, which means it has to fit in memory.

On the Sourcegraph repo, compared to the float32 embeddings, int8 embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94% at Top1000. The mean rank of each embedding changed by only 1.2%.

There are a few somewhat independent parts to this PR, but each is isolated to its own standalone commit. Commit messages are descriptive, so I'd recommend reviewing commit-by-commit. The messages include benchmark comparisons where appropriate to justify the changes.

Things not implemented in this PR:

Converting existing embeddings blobs to int8. For easy backwards compat, we just convert to int8 as we read them into memory.
Converting ranks to int8. Ranks only account for ~1/1536 of the memory use now, and only ~1/384 after this change. Didn't seem worth spending the time to reduce that.
MMAPing the embeddings files. Embedding services currently have no disks, so we'd have to update the shape of all the deployments, and I think this gets us far enough for Dropbox.

Test plan

Large numbers of benchmarks to validate performance. Existing tests are updated appropriately. I ran a bunch of comparisons of recall rate against the original float32 rankings. The code for that can be found (in hacky form) at this commit. It won't run directly without some setup (sorry, couldn't make things run with both int8 and float32 at the same time), but it might be useful for reference.

All the tests and benchmarks were run on pre-existing float32 embeddings to ensure that we can still operate with the old embeddings. This PR does not cover migrating storage to int8, but we should consider that.

sourcegraph-bot · 2023-04-21T02:49:15Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 8abb1fc...1607c9d.

No notifications.

jtibshirani · 2023-04-21T16:18:24Z

Here are some references to give background about this approach (often called "scalar quantization" or "SQ8" when you're using the popular FAISS library):

It's a common trick both in vector search, and in developing deep learning models, where you can use lower-precision values during training. The intuition is that most information is stored across the high number of dimensions, not the precision of individual elements.

stefanhengl · 2023-04-21T16:31:58Z

Why do we Dequantize before storing the embeddings?

Great question. It's just so all the embeddings stored in blob storage still are float32s. Gob can't decode a float32 slice into an int8 slice (or vice versa), so it would fail if we have a mix of encodings.

We will probably need to write some migration code at some point to convert the storage to int8, but I felt it was best to not modify the external data in this initial PR. I started doing that, but it was pretty complex to do without loading the full float32 embeddings into memory (which would fail at Dropbox's scale).

I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).

Would this be a big change?

Unfortunately, this would be a pretty difficult change because we would have to stream the embeddings to blob storage as we create them since the purpose of this change is to help large embeddings fit in memory. If we collected the float32 embeddings in memory as we create them, we have the same issues as we have now that the embeddings for large repos will cause failures.

I think the real solution here is to create embeddings in chunks. That will be a significantly larger lift though.

I think the real solution here is to create embeddings in chunks.

I see, I hadn't read up on the embedding + uploading logic yet! Makes sense not to change it for this PR, but could be worthwhile to look into as a follow-up.

jtibshirani

This is great! In addition to the memory savings, the ~10x speed-up is exciting. It's actually a bit unexpected for me -- I wonder if the conversion to int8 also enabled the compiler to recognize an opportunity for SIMD?

If we were being really conservative, we would maintain two separate similarity search implementations (quantized version vs. original full precision), and feature flag it. But I feel good just moving fast with this change, it's a very common technique and the recall results look good. Before approving I will also pull it down locally and run through my manual prompt tests to check there's no obvious regressions.

Last note -- could you paste the recall results into the PR description? You sent me some helpful ones over chat (recall@1, recall@10, etc.)

jtibshirani · 2023-04-21T17:54:27Z

Quantization is most useful for the similarity search case, where we are concerned about memory and perform a large number of similarity computations. But here, we're just comparing a vector against two other vectors. So approximating here doesn't seem worth it, and could make the comparison more error prone / a little harder to debug.

To support this, maybe we can keep a version of CosineSimilarity that works on float vectors, it doesn't seem too complex?

@camdencheek any thoughts on this? Other than this, to me it looks ready to merge!

Whoops, I missed this!

Seems reasonable and simple to me, will update it before merge 👍

I just pushed a minimal commit that addresses this. I plan to clean things up a bit and make better-typed embedding vectors, but since there are so many PRs stacked on this one, I'll do that as a followup

jtibshirani · 2023-04-21T18:22:19Z

I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).

Would this be a big change?

This converts our embeddings to an int8 in memory. It does not convert the storage format so that we can be backwards compatible with the embeddings stored in blob storage. Also, I could see it being useful to maintain the more precise embeddings in case we see regressions in ranking from this change. Some benchmarks showing the impact on search: name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 1.89s ± 1% 0.24s ± 1% -87.27% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 965ms ± 2% 122ms ± 1% -87.39% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 493ms ± 1% 63ms ± 1% -87.17% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 262ms ± 4% 34ms ± 3% -87.06% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 255ms ± 4% 36ms ± 3% -85.82% (p=0.008 n=5+5) name old alloc/op new alloc/op delta SimilaritySearch/numWorkers=1-10 84.2kB ± 0% 83.3kB ± 0% -1.00% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 148kB ± 0% 147kB ± 0% -0.38% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 272kB ± 0% 268kB ± 0% -1.48% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 493kB ± 0% 495kB ± 0% +0.35% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 898kB ± 0% 909kB ± 0% +1.19% (p=0.008 n=5+5) name old allocs/op new allocs/op delta SimilaritySearch/numWorkers=1-10 2.08k ± 0% 2.06k ± 0% -1.25% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 3.70k ± 0% 3.69k ± 0% ~ (p=0.079 n=4+5) SimilaritySearch/numWorkers=4-10 6.84k ± 0% 6.73k ± 0% -1.63% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 12.3k ± 0% 12.4k ± 0% +0.42% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 22.1k ± 0% 22.4k ± 0% +1.64% (p=0.008 n=5+5)

Add a method to the index that returns the subslice representing a single row. I just think this is easier to reason about.

The indexing on this test was very weird. I just flipped the indexing so that the expected is just the ordered set of results. Also, I added the row num to the result which is a minimal cost but is very useful for debugging, testing, and benchmarking.

These are the changes in ranking from the change in precision. There are only two swaps of adjacent results.

This allocates a closure on every call to score, and allocates a bunch more strings than necessary when debug is enabled. name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 240ms ± 1% 238ms ± 1% ~ (p=0.056 n=5+5) SimilaritySearch/numWorkers=2-10 122ms ± 1% 122ms ± 1% ~ (p=0.690 n=5+5) SimilaritySearch/numWorkers=4-10 63.2ms ± 1% 62.1ms ± 1% -1.84% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 33.9ms ± 3% 34.2ms ± 1% ~ (p=0.222 n=5+5) SimilaritySearch/numWorkers=16-10 36.2ms ± 3% 36.0ms ± 3% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta SimilaritySearch/numWorkers=1-10 83.3kB ± 0% 66.1kB ± 0% -20.66% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 147kB ± 0% 114kB ± 0% -22.70% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 268kB ± 0% 204kB ± 0% -23.89% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 495kB ± 0% 375kB ± 0% -24.24% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 909kB ± 0% 684kB ± 0% -24.77% (p=0.008 n=5+5) name old allocs/op new allocs/op delta SimilaritySearch/numWorkers=1-10 2.06k ± 0% 2.06k ± 0% ~ (all equal) SimilaritySearch/numWorkers=2-10 3.69k ± 0% 3.69k ± 0% ~ (all equal) SimilaritySearch/numWorkers=4-10 6.73k ± 0% 6.73k ± 0% ~ (all equal) SimilaritySearch/numWorkers=8-10 12.4k ± 0% 12.4k ± 0% ~ (p=1.000 n=5+5) SimilaritySearch/numWorkers=16-10 22.4k ± 0% 22.4k ± 0% ~ (all equal)

This checks the lengths of the input slices ahead of time so the compiler doesn't need to insert panics on slice accesses. Bounds check elimination: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 559ms ± 0% 499ms ± 0% -10.74% (p=0.016 n=4+5) SimilaritySearch/numWorkers=2-10 283ms ± 0% 253ms ± 0% -10.65% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 146ms ± 0% 130ms ± 0% -10.52% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 73.7ms ± 0% 66.0ms ± 0% -10.45% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 83.6ms ± 2% 75.0ms ± 3% -10.29% (p=0.008 n=5+5) ``` Loop unrolling: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 499ms ± 0% 439ms ± 0% -12.12% (p=0.008 n=5+5) SimilaritySearch/numWorkers=2-10 253ms ± 0% 222ms ± 0% -12.21% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 130ms ± 0% 115ms ± 0% -11.99% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 66.0ms ± 0% 57.9ms ± 0% -12.20% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 75.0ms ± 3% 65.7ms ± 2% -12.36% (p=0.008 n=5+5) ``` Both: ``` name old time/op new time/op delta SimilaritySearch/numWorkers=1-10 559ms ± 0% 439ms ± 0% -21.55% (p=0.016 n=4+5) SimilaritySearch/numWorkers=2-10 283ms ± 0% 222ms ± 0% -21.56% (p=0.008 n=5+5) SimilaritySearch/numWorkers=4-10 146ms ± 0% 115ms ± 0% -21.24% (p=0.008 n=5+5) SimilaritySearch/numWorkers=8-10 73.7ms ± 0% 57.9ms ± 0% -21.37% (p=0.008 n=5+5) SimilaritySearch/numWorkers=16-10 83.6ms ± 2% 65.7ms ± 2% -21.38% (p=0.008 n=5+5) ```

jtibshirani · 2023-04-22T00:18:39Z

I ran through my usual manual tests, and also ran my hacky test script for context quality. Everything is unchanged with quantization 🎉

jtibshirani

Looks good to me. I know you're planning on addressing https://github.com/sourcegraph/sourcegraph/pull/50953#discussion_r1174033914 before merging.

This cuts memory usage of the index by ~4x by using int8 quantizations of the float32 embeddings. Additionally, seemingly since int8s are so much more cache efficient, this speeds up search by ~4x. Ignore the benchmarks in the commits -- they are a bit out of date. The reduction in precision comes at a cost in the form of less accurate ranking/recall. However, we still get ~94% recall with these quantizations, which seems well within the bounds of acceptable. Notably, I chose not to use a more complex quantization scheme (like windowing to min/max) because 1) it means scores cannot be comparable across repos, and 2) it means we need to download the full embedding before quantizing, which means it has to fit in memory. On the Sourcegraph repo, compared to the float32 embeddings, int8 embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94% at Top1000. The mean rank of each embedding changed by only 1.2%. There are a few somewhat independent parts to this PR, but each is isolated to its own standalone commit. Commit messages are descriptive, so I'd recommend reviewing commit-by-commit. The messages include benchmark comparisons where appropriate to justify the changes.

cla-bot Bot added the cla-signed label Apr 20, 2023

camdencheek force-pushed the cc/int8-embeddings-2 branch 5 times, most recently from 69de48b to 385dda0 Compare April 21, 2023 02:22

camdencheek marked this pull request as ready for review April 21, 2023 02:47

camdencheek requested a review from jtibshirani April 21, 2023 02:47

camdencheek requested a review from a team April 21, 2023 02:49

camdencheek commented Apr 21, 2023

View reviewed changes

Comment thread enterprise/cmd/embeddings/shared/context_detection.go Outdated

Comment thread enterprise/internal/embeddings/similarity_search_test.go

Comment thread enterprise/internal/embeddings/similarity_search_test.go

Comment thread enterprise/internal/embeddings/types.go

camdencheek commented Apr 21, 2023

View reviewed changes

Comment thread enterprise/internal/embeddings/quantize.go

camdencheek force-pushed the cc/int8-embeddings-2 branch from 5b38ef3 to 709f6fc Compare April 21, 2023 16:05

stefanhengl reviewed Apr 21, 2023

View reviewed changes

jtibshirani reviewed Apr 21, 2023

View reviewed changes

camdencheek force-pushed the cc/int8-embeddings-2 branch from 709f6fc to 61a3268 Compare April 21, 2023 19:46

camdencheek added 6 commits April 21, 2023 15:57

add row method

ac5dce5

Add a method to the index that returns the subslice representing a single row. I just think this is easier to reason about.

fix flipped scoring

f0ca6ae

These are the changes in ranking from the change in precision. There are only two swaps of adjacent results.

update build files

ad6a75f

camdencheek force-pushed the cc/int8-embeddings-2 branch from 61a3268 to ad6a75f Compare April 21, 2023 21:57

jtibshirani mentioned this pull request Apr 22, 2023

Embeddings performance on monorepos #50861

Closed

camdencheek mentioned this pull request Apr 22, 2023

Embeddings: simplify batching #51014

Merged

jtibshirani approved these changes Apr 24, 2023

View reviewed changes

camdencheek enabled auto-merge (squash) April 24, 2023 16:20

maintain float32 embeddings where possible

1607c9d

camdencheek force-pushed the cc/int8-embeddings-2 branch from e9332fd to 1607c9d Compare April 24, 2023 16:48

camdencheek merged commit dbfb241 into main Apr 24, 2023

camdencheek deleted the cc/int8-embeddings-2 branch April 24, 2023 17:19

jtibshirani mentioned this pull request May 4, 2023

embeddings: calculate recall on CI #51218

Merged

Conversation

camdencheek commented Apr 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

sourcegraph-bot commented Apr 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jtibshirani commented Apr 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

camdencheek Apr 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtibshirani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtibshirani commented Apr 22, 2023

Uh oh!

jtibshirani left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

camdencheek commented Apr 20, 2023 •

edited

Loading

sourcegraph-bot commented Apr 21, 2023 •

edited

Loading

jtibshirani commented Apr 21, 2023 •

edited

Loading

camdencheek Apr 21, 2023 •

edited

Loading

jtibshirani left a comment •

edited

Loading