Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Embeddings: optimize memory usage during decode#54972

Merged
camdencheek merged 5 commits into
mainfrom
cc/optimize-decode-memory
Jul 17, 2023
Merged

Embeddings: optimize memory usage during decode#54972
camdencheek merged 5 commits into
mainfrom
cc/optimize-decode-memory

Conversation

@camdencheek

@camdencheek camdencheek commented Jul 14, 2023

Copy link
Copy Markdown
Member

This makes memory usage ~constant during the decode process. Previously, we were undersizing the large allocation (which caused it to reallocate ~10 extra times) for the index and making a bunch of additional smaller allocations per loop. This meant we were spiking memory usage during decoding, and even if the garbage collector could keep up (which I wouldn't expect it to), we'd still need 2x the index size at the peak for live memory. This was causing an OOM.

Before:

BenchmarkCustomRepoEmbeddingIndexDownload-10                   1        1212576041 ns/op        1299284528 B/op    96393 allocs/op

After:

BenchmarkCustomRepoEmbeddingIndexDownload-10                   1        1193088500 ns/op        589103720 B/op     65660 allocs/op

That's a 2.2x reduction in allocated bytes and a 1.4x reduction in allocation count. This is quite significant when our indexes are many GBs in memory.

Test plan

A new benchmark to demonstrate better memory behavior, a new quick test on the quantize changes, existing tests on encode/decode, and manual E2E tests for generating and searching with an embeddings index.

@cla-bot cla-bot Bot added the cla-signed label Jul 14, 2023
@camdencheek camdencheek changed the title Cc/optimize decode memory Embeddings: optimize memory usage during decode Jul 14, 2023
return nil, err
}

ei.Embeddings = make([]int8, 0, numChunks*ei.ColumnDimension)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First change: this we should be using embeddingsChunkSize instead of ColumnDimension because each chunk is not a vector, but rather a chunk of 10_000 floats.


ei.Embeddings = make([]int8, 0, numChunks*ei.ColumnDimension)
ei.Embeddings = make([]int8, 0, numChunks*embeddingsChunkSize)
embeddingsBuf := make([]float32, 0, embeddingsChunkSize)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second change: move the decode buffer outside the loop so we don't allocate on every iteration.

return nil, err
}
ei.Embeddings = append(ei.Embeddings, Quantize(embeddingSlice)...)
ei.Embeddings = append(ei.Embeddings, Quantize(embeddingsBuf, quantizeBuf)...)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third change: update Quantize to accept an optional buffer, which it will use if it's large enough to fit the output.

}
}

func BenchmarkCustomRepoEmbeddingIndexDownload(b *testing.B) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this benchmark to prove to myself that this cuts down the allocations. See PR description for results.

@camdencheek camdencheek requested a review from a team July 14, 2023 19:38
@camdencheek camdencheek marked this pull request as ready for review July 14, 2023 19:39

@keegancsmith keegancsmith left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess on the switch to a third party vector store we won't keep this store around? IE it isn't worth making improvements to how we store stuff?

@camdencheek

Copy link
Copy Markdown
Member Author

Yeah, there is a lot more room for improvement here, but hopefully (🤞) this won't be around too much longer

@camdencheek camdencheek merged commit 2880b4f into main Jul 17, 2023
@camdencheek camdencheek deleted the cc/optimize-decode-memory branch July 17, 2023 14:06
@github-actions

Copy link
Copy Markdown
Contributor

The backport to 5.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-54972-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 2880b4ffa0b8591d7a46ace259c8f522931dd617
# Push it to GitHub
git push --set-upstream origin backport-54972-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1

Then, create a pull request where the base branch is 5.1 and the compare/head branch is backport-54972-to-5.1.

@github-actions github-actions Bot added backports failed-backport-to-5.1 release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases labels Jul 17, 2023
camdencheek added a commit that referenced this pull request Jul 17, 2023
This makes memory usage ~constant during the decode process. Previously,
we were undersizing the large allocation (which caused it to reallocate
~10 extra times) for the index and making a bunch of additional smaller
allocations per loop. This meant we were spiking memory usage during
decoding, and even if the garbage collector could keep up (which I
wouldn't expect it to), we'd still need 2x the index size at the peak
for live memory. This was causing an OOM.

This PR accounts for a 2.2x reduction in allocated bytes and a 1.4x reduction in
allocation count. This is quite significant when our indexes are many
GBs in memory.
camdencheek added a commit that referenced this pull request Jul 17, 2023
This makes memory usage ~constant during the decode process. Previously,
we were undersizing the large allocation (which caused it to reallocate
~10 extra times) for the index and making a bunch of additional smaller
allocations per loop. This meant we were spiking memory usage during
decoding, and even if the garbage collector could keep up (which I
wouldn't expect it to), we'd still need 2x the index size at the peak
for live memory. This was causing an OOM.

This PR accounts for a 2.2x reduction in allocated bytes and a 1.4x reduction in
allocation count. This is quite significant when our indexes are many
GBs in memory.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

backports cla-signed release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants