Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Embeddings: quantize to int8 and improve perf#50953

Merged
camdencheek merged 8 commits into
mainfrom
cc/int8-embeddings-2
Apr 24, 2023
Merged

Embeddings: quantize to int8 and improve perf#50953
camdencheek merged 8 commits into
mainfrom
cc/int8-embeddings-2

Conversation

@camdencheek

@camdencheek camdencheek commented Apr 20, 2023

Copy link
Copy Markdown
Member

This cuts memory usage of the index by ~4x by using int8 quantizations of the float32 embeddings. Additionally, seemingly since int8s are so much more cache efficient, this speeds up search by ~4x. Ignore the benchmarks in the commits -- they are a bit out of date.

The reduction in precision comes at a cost in the form of less accurate ranking/recall. However, we still get ~94% recall with these quantizations, which seems well within the bounds of acceptable. Notably, I chose not to use a more complex quantization scheme (like windowing to min/max) because 1) it means scores cannot be comparable across repos, and 2) it means we need to download the full embedding before quantizing, which means it has to fit in memory.

On the Sourcegraph repo, compared to the float32 embeddings, int8 embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94% at Top1000. The mean rank of each embedding changed by only 1.2%.

There are a few somewhat independent parts to this PR, but each is isolated to its own standalone commit. Commit messages are descriptive, so I'd recommend reviewing commit-by-commit. The messages include benchmark comparisons where appropriate to justify the changes.

Things not implemented in this PR:

  • Converting existing embeddings blobs to int8. For easy backwards compat, we just convert to int8 as we read them into memory.
  • Converting ranks to int8. Ranks only account for ~1/1536 of the memory use now, and only ~1/384 after this change. Didn't seem worth spending the time to reduce that.
  • MMAPing the embeddings files. Embedding services currently have no disks, so we'd have to update the shape of all the deployments, and I think this gets us far enough for Dropbox.

Test plan

Large numbers of benchmarks to validate performance. Existing tests are updated appropriately. I ran a bunch of comparisons of recall rate against the original float32 rankings. The code for that can be found (in hacky form) at this commit. It won't run directly without some setup (sorry, couldn't make things run with both int8 and float32 at the same time), but it might be useful for reference.

All the tests and benchmarks were run on pre-existing float32 embeddings to ensure that we can still operate with the old embeddings. This PR does not cover migrating storage to int8, but we should consider that.

@cla-bot cla-bot Bot added the cla-signed label Apr 20, 2023
@camdencheek camdencheek force-pushed the cc/int8-embeddings-2 branch 5 times, most recently from 69de48b to 385dda0 Compare April 21, 2023 02:22
@camdencheek camdencheek marked this pull request as ready for review April 21, 2023 02:47
@camdencheek camdencheek requested a review from jtibshirani April 21, 2023 02:47
@sourcegraph-bot

sourcegraph-bot commented Apr 21, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 8abb1fc...1607c9d.

No notifications.

@camdencheek camdencheek requested a review from a team April 21, 2023 02:49
Comment thread enterprise/cmd/embeddings/shared/context_detection.go Outdated
Comment thread enterprise/internal/embeddings/similarity_search_test.go
Comment thread enterprise/internal/embeddings/similarity_search_test.go
Comment thread enterprise/internal/embeddings/types.go
Comment thread enterprise/internal/embeddings/quantize.go
@camdencheek camdencheek force-pushed the cc/int8-embeddings-2 branch from 5b38ef3 to 709f6fc Compare April 21, 2023 16:05
@jtibshirani

jtibshirani commented Apr 21, 2023

Copy link
Copy Markdown
Contributor

Here are some references to give background about this approach (often called "scalar quantization" or "SQ8" when you're using the popular FAISS library):

It's a common trick both in vector search, and in developing deep learning models, where you can use lower-precision values during training. The intuition is that most information is stored across the high number of dimensions, not the precision of individual elements.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we Dequantize before storing the embeddings?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. It's just so all the embeddings stored in blob storage still are float32s. Gob can't decode a float32 slice into an int8 slice (or vice versa), so it would fail if we have a mix of encodings.

We will probably need to write some migration code at some point to convert the storage to int8, but I felt it was best to not modify the external data in this initial PR. I started doing that, but it was pretty complex to do without loading the full float32 embeddings into memory (which would fail at Dropbox's scale).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).

Would this be a big change?

@camdencheek camdencheek Apr 21, 2023

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this would be a pretty difficult change because we would have to stream the embeddings to blob storage as we create them since the purpose of this change is to help large embeddings fit in memory. If we collected the float32 embeddings in memory as we create them, we have the same issues as we have now that the embeddings for large repos will cause failures.

I think the real solution here is to create embeddings in chunks. That will be a significantly larger lift though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the real solution here is to create embeddings in chunks.

I see, I hadn't read up on the embedding + uploading logic yet! Makes sense not to change it for this PR, but could be worthwhile to look into as a follow-up.

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! In addition to the memory savings, the ~10x speed-up is exciting. It's actually a bit unexpected for me -- I wonder if the conversion to int8 also enabled the compiler to recognize an opportunity for SIMD?

If we were being really conservative, we would maintain two separate similarity search implementations (quantized version vs. original full precision), and feature flag it. But I feel good just moving fast with this change, it's a very common technique and the recall results look good. Before approving I will also pull it down locally and run through my manual prompt tests to check there's no obvious regressions.

Last note -- could you paste the recall results into the PR description? You sent me some helpful ones over chat (recall@1, recall@10, etc.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quantization is most useful for the similarity search case, where we are concerned about memory and perform a large number of similarity computations. But here, we're just comparing a vector against two other vectors. So approximating here doesn't seem worth it, and could make the comparison more error prone / a little harder to debug.

To support this, maybe we can keep a version of CosineSimilarity that works on float vectors, it doesn't seem too complex?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camdencheek any thoughts on this? Other than this, to me it looks ready to merge!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I missed this!

Seems reasonable and simple to me, will update it before merge 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a minimal commit that addresses this. I plan to clean things up a bit and make better-typed embedding vectors, but since there are so many PRs stacked on this one, I'll do that as a followup

Comment thread enterprise/internal/embeddings/quantize.go
Comment thread enterprise/internal/embeddings/similarity_search.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice if we always stored the embeddings at full original precision, instead of quantizing then unquantizing as we do now, which reduces their precision. It feels like this would make debugging easier -- for example when we're asking questions like "why is this query similar to this snippet?", we'll be sure what's due to the embeddings model itself vs. our approximation. It also gives more flexibility for future changes/ experiments, we could try out different search approximations without regenerating all the embeddings (can be expensive).

Would this be a big change?

@camdencheek camdencheek force-pushed the cc/int8-embeddings-2 branch from 709f6fc to 61a3268 Compare April 21, 2023 19:46
This converts our embeddings to an int8 in memory. It does not convert
the storage format so that we can be backwards compatible with the
embeddings stored in blob storage. Also, I could see it being useful to
maintain the more precise embeddings in case we see regressions in
ranking from this change.

Some benchmarks showing the impact on search:

name                               old time/op    new time/op    delta
SimilaritySearch/numWorkers=1-10      1.89s ± 1%     0.24s ± 1%  -87.27%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=2-10      965ms ± 2%     122ms ± 1%  -87.39%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      493ms ± 1%      63ms ± 1%  -87.17%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10      262ms ± 4%      34ms ± 3%  -87.06%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10     255ms ± 4%      36ms ± 3%  -85.82%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
SimilaritySearch/numWorkers=1-10     84.2kB ± 0%    83.3kB ± 0%   -1.00%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=2-10      148kB ± 0%     147kB ± 0%   -0.38%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      272kB ± 0%     268kB ± 0%   -1.48%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10      493kB ± 0%     495kB ± 0%   +0.35%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10     898kB ± 0%     909kB ± 0%   +1.19%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
SimilaritySearch/numWorkers=1-10      2.08k ± 0%     2.06k ± 0%   -1.25%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=2-10      3.70k ± 0%     3.69k ± 0%     ~     (p=0.079 n=4+5)
SimilaritySearch/numWorkers=4-10      6.84k ± 0%     6.73k ± 0%   -1.63%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10      12.3k ± 0%     12.4k ± 0%   +0.42%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10     22.1k ± 0%     22.4k ± 0%   +1.64%  (p=0.008 n=5+5)
Add a method to the index that returns the subslice representing a
single row. I just think this is easier to reason about.
The indexing on this test was very weird. I just flipped the indexing so
that the expected is just the ordered set of results. Also, I added the
row num to the result which is a minimal cost but is very useful for
debugging, testing, and benchmarking.
These are the changes in ranking from the change in precision. There are
only two swaps of adjacent results.
This allocates a closure on every call to score, and allocates a bunch
more strings than necessary when debug is enabled.

name                               old time/op    new time/op    delta
SimilaritySearch/numWorkers=1-10      240ms ± 1%     238ms ± 1%     ~     (p=0.056 n=5+5)
SimilaritySearch/numWorkers=2-10      122ms ± 1%     122ms ± 1%     ~     (p=0.690 n=5+5)
SimilaritySearch/numWorkers=4-10     63.2ms ± 1%    62.1ms ± 1%   -1.84%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10     33.9ms ± 3%    34.2ms ± 1%     ~     (p=0.222 n=5+5)
SimilaritySearch/numWorkers=16-10    36.2ms ± 3%    36.0ms ± 3%     ~     (p=0.690 n=5+5)

name                               old alloc/op   new alloc/op   delta
SimilaritySearch/numWorkers=1-10     83.3kB ± 0%    66.1kB ± 0%  -20.66%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=2-10      147kB ± 0%     114kB ± 0%  -22.70%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      268kB ± 0%     204kB ± 0%  -23.89%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10      495kB ± 0%     375kB ± 0%  -24.24%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10     909kB ± 0%     684kB ± 0%  -24.77%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
SimilaritySearch/numWorkers=1-10      2.06k ± 0%     2.06k ± 0%     ~     (all equal)
SimilaritySearch/numWorkers=2-10      3.69k ± 0%     3.69k ± 0%     ~     (all equal)
SimilaritySearch/numWorkers=4-10      6.73k ± 0%     6.73k ± 0%     ~     (all equal)
SimilaritySearch/numWorkers=8-10      12.4k ± 0%     12.4k ± 0%     ~     (p=1.000 n=5+5)
SimilaritySearch/numWorkers=16-10     22.4k ± 0%     22.4k ± 0%     ~     (all equal)
@camdencheek camdencheek force-pushed the cc/int8-embeddings-2 branch from 61a3268 to ad6a75f Compare April 21, 2023 21:57
This checks the lengths of the input slices ahead of time so the
compiler doesn't need to insert panics on slice accesses.

Bounds check elimination: 
```
name                               old time/op    new time/op    delta
SimilaritySearch/numWorkers=1-10      559ms ± 0%     499ms ± 0%  -10.74%  (p=0.016 n=4+5)
SimilaritySearch/numWorkers=2-10      283ms ± 0%     253ms ± 0%  -10.65%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      146ms ± 0%     130ms ± 0%  -10.52%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10     73.7ms ± 0%    66.0ms ± 0%  -10.45%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10    83.6ms ± 2%    75.0ms ± 3%  -10.29%  (p=0.008 n=5+5)
```

Loop unrolling:
```
name                               old time/op    new time/op    delta
SimilaritySearch/numWorkers=1-10      499ms ± 0%     439ms ± 0%  -12.12%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=2-10      253ms ± 0%     222ms ± 0%  -12.21%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      130ms ± 0%     115ms ± 0%  -11.99%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10     66.0ms ± 0%    57.9ms ± 0%  -12.20%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10    75.0ms ± 3%    65.7ms ± 2%  -12.36%  (p=0.008 n=5+5)
```

Both:
```
name                               old time/op    new time/op    delta
SimilaritySearch/numWorkers=1-10      559ms ± 0%     439ms ± 0%  -21.55%  (p=0.016 n=4+5)
SimilaritySearch/numWorkers=2-10      283ms ± 0%     222ms ± 0%  -21.56%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=4-10      146ms ± 0%     115ms ± 0%  -21.24%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=8-10     73.7ms ± 0%    57.9ms ± 0%  -21.37%  (p=0.008 n=5+5)
SimilaritySearch/numWorkers=16-10    83.6ms ± 2%    65.7ms ± 2%  -21.38%  (p=0.008 n=5+5)
```
@jtibshirani

Copy link
Copy Markdown
Contributor

I ran through my usual manual tests, and also ran my hacky test script for context quality. Everything is unchanged with quantization 🎉

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I know you're planning on addressing https://github.com/sourcegraph/sourcegraph/pull/50953#discussion_r1174033914 before merging.

@camdencheek camdencheek enabled auto-merge (squash) April 24, 2023 16:20
@camdencheek camdencheek force-pushed the cc/int8-embeddings-2 branch from e9332fd to 1607c9d Compare April 24, 2023 16:48
@camdencheek camdencheek merged commit dbfb241 into main Apr 24, 2023
@camdencheek camdencheek deleted the cc/int8-embeddings-2 branch April 24, 2023 17:19
camdencheek added a commit that referenced this pull request Apr 27, 2023
This cuts memory usage of the index by ~4x by using int8 quantizations
of the float32 embeddings. Additionally, seemingly since int8s are so
much more cache efficient, this speeds up search by ~4x. Ignore the
benchmarks in the commits -- they are a bit out of date.

The reduction in precision comes at a cost in the form of less accurate
ranking/recall. However, we still get ~94% recall with these
quantizations, which seems well within the bounds of acceptable.
Notably, I chose not to use a more complex quantization scheme (like
windowing to min/max) because 1) it means scores cannot be comparable
across repos, and 2) it means we need to download the full embedding
before quantizing, which means it has to fit in memory.

On the Sourcegraph repo, compared to the float32 embeddings, int8
embeddings had a retrieval rate of 90% at Top10, 93% at Top100, and 94%
at Top1000. The mean rank of each embedding changed by only 1.2%.

There are a few somewhat independent parts to this PR, but each is
isolated to its own standalone commit. Commit messages are descriptive,
so I'd recommend reviewing commit-by-commit. The messages include
benchmark comparisons where appropriate to justify the changes.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants